# Session 3: Harvesting data from the web: APIs  

### A first API

[Chronicling America](http://chroniclingamerica.loc.gov/about/) is a joint project of the National Endowment for the Humanities and the Library of Congress.

Search for articles that mention "[slavery](http://chroniclingamerica.loc.gov/search/pages/results/?andtext=slavery)".

<div class="alert alert-info">

Look at the URL. What happens if you change the word slavery to abolition? 

What happens to the URL when you go to the second page? Can you get to page 251?

</div>

What if we append ``&format=json`` to the end of the search URL? 


http://chroniclingamerica.loc.gov/search/pages/results/?andtext=slavery&format=json
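
To see what the extra parameter does, it can help to take the URL apart. The following is a small sketch, not part of the original notebook, that uses only Python's standard library to split the search URL above into its server, its path, and its query parameters.

```python
# Split the search URL into its pieces using only the standard library.
try:
    from urllib.parse import urlparse, parse_qs   # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs       # Python 2

url = 'http://chroniclingamerica.loc.gov/search/pages/results/?andtext=slavery&format=json'

parts = urlparse(url)
query = parse_qs(parts.query)   # each value comes back as a list of strings

print(parts.netloc)   # the server
print(parts.path)     # the search endpoint
print(query)          # the parameters after the "?"
```

Everything after the `?` is a set of `name=value` pairs joined by `&`, which is why adding `&format=json` is enough to change the kind of response.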


[``requests``](http://docs.python-requests.org/en/master/) is a useful and commonly used HTTP library for Python. It is not part of the default installation, but it is included with the Anaconda Python distribution.

In [4]:
import requests

It would be possible to use the API URL and parameters directly in the `requests` command. The most likely scenario, however, involves making repeated calls to `requests` as part of a loop, since this search returned less than 1% of the matching results, so I store the strings first.

In [5]:
base_url   = 'http://chroniclingamerica.loc.gov/search/pages/results/'
parameters = '?andtext=slavery&format=json'

`requests.get()` is used for accessing both websites and APIs. The command can be modified by several arguments, but at a minimum, it requires the URL.

In [6]:
r = requests.get(base_url + parameters)

`r` is a `requests` response object. Any JSON returned by the server is available through `.json()`.

In [7]:
search_json = r.json()

JSON objects are dictionary-like, in that they have keys (think variable names) and values. `.keys()` returns a list of the keys.

In [8]:
print search_json.keys()

[u'totalItems', u'endIndex', u'startIndex', u'itemsPerPage', u'items']


You can return the value of any key by putting the key name in brackets.

In [9]:
search_json['totalItems']

434349

<div class="alert alert-info">
What else is in there? Where is the stuff we want?
</div>

As is often the case with results from an API, most of the keys and values are metadata about either the search or what is being returned. These are useful for knowing if the search is returning what you want, which is particularly important when you are making multiple calls to the API.

The data I'm interested in is all in `items`.

In [10]:
print type(search_json['items'])
print len(search_json['items'])

<type 'list'>
20


`items` is a list with 20 items.

In [11]:
print type(search_json['items'][0])
print type(search_json['items'][19])

<type 'dict'>
<type 'dict'>


Each of the 20 items in the list is a dictionary. 

In [12]:
first_item = search_json['items'][0]

print first_item.keys()

[u'sequence', u'county', u'edition', u'frequency', u'id', u'section_label', u'city', u'date', u'title', u'end_year', u'note', u'state', u'subject', u'type', u'place_of_publication', u'start_year', u'edition_label', u'publisher', u'language', u'alt_title', u'lccn', u'country', u'ocr_eng', u'batch', u'title_normal', u'url', u'place', u'page']


<div class="alert alert-info">
What is the title of the first item?
</div>

While a standard CSV file has a header row that describes the contents of each column, a JSON file has keys identifying the values found in each case. Importantly, these keys need not be the same for each item. Additionally, values don't have to be numbers or strings, but could be lists or dictionaries. For example, this JSON could have included a `newspaper` key whose value was a dictionary with all the metadata about the newspaper in which the article was published, an `article` key that included the article-specific information as another dictionary, and a `text` key whose value was a string with the article text.
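
As an illustration of that flexibility, here is a made-up miniature JSON document. The records and their values are hypothetical and are not returned by the Chronicling America API. The two items have different keys, and some values are themselves dictionaries.

```python
import json

# Hypothetical JSON text, invented for illustration only.
raw = '''
[
  {"text": "First article text",
   "newspaper": {"title": "Example Gazette", "city": "Salem"}},
  {"text": "Second article text",
   "article": {"page": 4}}
]
'''

records = json.loads(raw)          # a list of dictionaries

for record in records:
    print(sorted(record.keys()))   # the keys differ from item to item

# Nested values are reached by chaining brackets.
print(records[0]['newspaper']['title'])

# .get() returns None instead of raising an error when a key is missing.
print(records[1].get('newspaper'))
```

Because keys can be missing, `.get()` is often a safer choice than brackets when looping over many items.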

As before, we can examine the contents of a particular item, such as the publication's `title`.

In [13]:
print first_item['title']

Anti-slavery bugle. volume


The easiest way to view or analyze this data is to convert it to a dataset-like structure. While Python does not have a built-in dataframe type, the popular `pandas` library does. By convention, it is imported as `pd`.

In [14]:
import pandas as pd

# Make sure all columns are displayed
pd.set_option("display.max_columns",101)

pandas is pretty smart about importing different JSON-type objects and converting them to dataframes with its `.DataFrame()` function.

In [15]:
df = pd.DataFrame(search_json['items'])

df.head(6)

Unnamed: 0,alt_title,batch,city,country,county,date,edition,edition_label,end_year,frequency,id,language,lccn,note,ocr_eng,page,place,place_of_publication,publisher,section_label,sequence,start_year,state,subject,title,title_normal,type,url
0,[],batch_ohi_ariel_ver02,"[New Lisbon, Salem]",Ohio,"[Columbiana, Columbiana]",18490316,,,1861,Weekly,/lccn/sn83035487/1849-03-16/ed-1/seq-1/,[English],sn83035487,[Archived issues are available in digital form...,"LAVE\nam\nJlile\nVOL. 4. NO. 30.\nSALEM. OHIO,...",,"[Ohio--Columbiana--New Lisbon, Ohio--Columbian...","New-Lisbon, Ohio",Ohio American Antislavery Society,,1,1845,"[Ohio, Ohio]",[Antislavery movements--United States--Newspap...,Anti-slavery bugle. volume,anti-slavery bugle.,page,http://chroniclingamerica.loc.gov/lccn/sn83035...
1,[],batch_iune_golf_ver01,[Chicago],Illinois,[Cook County],19140516,,NOON EDITION,1917,Daily (except Sunday and holidays),/lccn/sn83045487/1914-05-16/ed-1/seq-10/,[English],sn83045487,"[""An adless daily newspaper."", Archived issues...",r\nmmmmmmmmmmmmmmmmmmmmmmmm\n'SLAVERY RIFE IN ...,,[Illinois--Cook County--Chicago],"Chicago, Ill.",N.D. Cochran,,10,1911,[Illinois],"[Chicago (Ill.)--Newspapers., Illinois--Chicag...",The day book.,day book.,page,http://chroniclingamerica.loc.gov/lccn/sn83045...
2,[],batch_iune_india_ver01,[Chicago],Illinois,[Cook County],19161109,,EXTRA,1917,Daily (except Sunday and holidays),/lccn/sn83045487/1916-11-09/ed-1/seq-26/,[English],sn83045487,"[""An adless daily newspaper."", Archived issues...",us remaining whites if we expect to\nstay on t...,,[Illinois--Cook County--Chicago],"Chicago, Ill.",N.D. Cochran,,26,1911,[Illinois],"[Chicago (Ill.)--Newspapers., Illinois--Chicag...",The day book.,day book.,page,http://chroniclingamerica.loc.gov/lccn/sn83045...
3,[],batch_iune_golf_ver01,[Chicago],Illinois,[Cook County],19150327,,NOON EDITION,1917,Daily (except Sunday and holidays),/lccn/sn83045487/1915-03-27/ed-1/seq-24/,[English],sn83045487,"[""An adless daily newspaper."", Archived issues...",THOUSANDS OF VEILED WOMEN OF TURKISH\nHAREM ON...,,[Illinois--Cook County--Chicago],"Chicago, Ill.",N.D. Cochran,,24,1911,[Illinois],"[Chicago (Ill.)--Newspapers., Illinois--Chicag...",The day book.,day book.,page,http://chroniclingamerica.loc.gov/lccn/sn83045...
4,[],batch_iune_foxtrot_ver01,[Chicago],Illinois,[Cook County],19130815,,,1917,Daily (except Sunday and holidays),/lccn/sn83045487/1913-08-15/ed-1/seq-5/,[English],sn83045487,"[""An adless daily newspaper."", Archived issues...",LOLA NORRiajQlVS SiENSAT-iPN AL t EVIDENCE IN ...,,[Illinois--Cook County--Chicago],"Chicago, Ill.",N.D. Cochran,,5,1911,[Illinois],"[Chicago (Ill.)--Newspapers., Illinois--Chicag...",The day book.,day book.,page,http://chroniclingamerica.loc.gov/lccn/sn83045...
5,[],batch_iune_foxtrot_ver01,[Chicago],Illinois,[Cook County],19130308,,NOON EDITION,1917,Daily (except Sunday and holidays),/lccn/sn83045487/1913-03-08/ed-1/seq-6/,[English],sn83045487,"[""An adless daily newspaper."", Archived issues...",that every possible weakness in. a\ngirl as &e...,,[Illinois--Cook County--Chicago],"Chicago, Ill.",N.D. Cochran,,6,1911,[Illinois],"[Chicago (Ill.)--Newspapers., Illinois--Chicag...",The day book.,day book.,page,http://chroniclingamerica.loc.gov/lccn/sn83045...


Note that I converted `search_json['items']` to a dataframe and not the entire JSON file. This is because I wanted each row to be an article.

In [16]:
pd.DataFrame(search_json)

Unnamed: 0,endIndex,items,itemsPerPage,startIndex,totalItems
0,20,"{u'sequence': 1, u'county': [u'Columbiana', u'...",20,1,434349
1,20,"{u'sequence': 10, u'county': [u'Cook County'],...",20,1,434349
2,20,"{u'sequence': 26, u'county': [u'Cook County'],...",20,1,434349
3,20,"{u'sequence': 24, u'county': [u'Cook County'],...",20,1,434349
4,20,"{u'sequence': 5, u'county': [u'Cook County'], ...",20,1,434349
5,20,"{u'sequence': 6, u'county': [u'Cook County'], ...",20,1,434349
6,20,"{u'sequence': 13, u'county': [u'Cook County'],...",20,1,434349
7,20,"{u'sequence': 1, u'county': [None], u'edition'...",20,1,434349
8,20,"{u'sequence': 30, u'county': [u'Cook County'],...",20,1,434349
9,20,"{u'sequence': 4, u'county': [None], u'edition'...",20,1,434349


If this dataframe contained all the items that you were looking for, it would be easy to save this to a csv file for storage and later analysis.

In [17]:
df.to_csv('slavery_articles.csv')

In [18]:
!head slavery_articles.csv

,alt_title,batch,city,country,county,date,edition,edition_label,end_year,frequency,id,language,lccn,note,ocr_eng,page,place,place_of_publication,publisher,section_label,sequence,start_year,state,subject,title,title_normal,type,url
0,[],batch_ohi_ariel_ver02,"[u'New Lisbon', u'Salem']",Ohio,"[u'Columbiana', u'Columbiana']",18490316,,,1861,Weekly,/lccn/sn83035487/1849-03-16/ed-1/seq-1/,[u'English'],sn83035487,"[u'Archived issues are available in digital format as part of the Library of Congress Chronicling America online collection.', u'Editors: Benjamin S. Jones, J. Elizabeth Hitchcock, 1845-1846; Benjamin S. Jones, J. Elizabeth Jones, 1846-1849; Oliver Johnson 1849-1851; Marius R. Robinson, 1851-1859; Benjamin S. Jones, 1859-1861.', u'Not published June 27-July 18, 1845.', u'Printers: John Frost, 1845; J.H. Painter, 1845-1846; G.N. Hapgood, 1846-1848.', u'Published in: New Lisbon, Ohio, June 20, 1845-Aug. 29, 1845, and: Salem, Ohio, Sept. 5, 1845-May 4, 1861.', u'Publisher: Executive 

This is only a small subset of the available articles, however. The API returns results in batches of 20 and this is only the first page of results. As is often the case, I'll need to make multiple calls to the API to retrieve all the data of interest. The easiest way to do that is to define a small function for getting the article information and put that in a loop. While it isn't a requirement that you create a function for making the API call, it will make your code easier to read and debug.


Looking at the API guidelines, there is an additional parameter, `page`, that tells the API which subset of results we want. The name varies by API, but there is usually some mechanism for retrieving results beyond the initial JSON.

Before creating the loop and making multiple calls to the API, I want to make sure that the API is working the way I think it is. 

<div class="alert alert-info">
Look at the API guidelines. How can we get the third page?
</div>

In [20]:
base_url   = 'http://chroniclingamerica.loc.gov/search/pages/results/'
parameters = '?andtext=slavery&format=json&page=3'

r = requests.get(base_url + parameters)
results =  r.json()

print results['startIndex']
print results['endIndex']

41
60


A call to the randomly selected page 3 returns results 41 through 60, which is what I expected since each page has 20 items.
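
The same arithmetic can be written down as a quick check. This sketch assumes, as observed above, that every page holds 20 items and that numbering starts at 1. `page_bounds` is a hypothetical helper and not part of the API.

```python
import math

ITEMS_PER_PAGE = 20      # 'itemsPerPage' in the search results above
TOTAL_ITEMS = 434349     # 'totalItems' reported for the slavery search

def page_bounds(page_number, items_per_page=ITEMS_PER_PAGE):
    '''Return the (startIndex, endIndex) expected for a full page.'''
    start = (page_number - 1) * items_per_page + 1
    end = page_number * items_per_page
    return start, end

print(page_bounds(3))

# Number of pages needed to retrieve every result.
total_pages = int(math.ceil(TOTAL_ITEMS / float(ITEMS_PER_PAGE)))
print(total_pages)
```

Comparing the expected bounds with the `startIndex` and `endIndex` that come back is a cheap way to confirm that a loop over pages is requesting what you think it is.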

The parameter string is getting pretty ugly. Fortunately, `requests` accepts a dictionary where the keys are the parameter names as defined by the API and the values are the search parameters you are looking for. The same kind of request, this time searching for "lynching", can be rewritten:

In [21]:
base_url = 'http://chroniclingamerica.loc.gov/search/pages/results/'
parameters = {'andtext': 'lynching',
              'page' : 3,
              'format'  : 'json'}
r = requests.get(base_url, params=parameters)

results =  r.json()

print results['startIndex']
print results['endIndex']

41
60
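
Behind the scenes, the dictionary is turned back into a query string. A rough way to see this without contacting the server is to encode the same dictionary with the standard library. This is only an approximation of what `requests` does internally, and the order of the parameters may differ.

```python
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

base_url = 'http://chroniclingamerica.loc.gov/search/pages/results/'
parameters = {'andtext': 'lynching',
              'page'   : 3,
              'format' : 'json'}

# urlencode also escapes characters, such as spaces, that are not URL-safe.
query_string = urlencode(parameters)
print(base_url + '?' + query_string)
```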


This can be rewritten as a function:

In [22]:
def get_articles():
    '''
    Make calls to the Chronicling America API.
    '''
    
    base_url   = 'http://chroniclingamerica.loc.gov/search/pages/results/'
    parameters = {'andtext': 'lynching',
                  'page'   : 3,
                  'format' : 'json'}
    
    r = requests.get(base_url, params = parameters)
    results =  r.json()
    
    return results

In [23]:
results = get_articles()

print results['startIndex']
print results['endIndex']

41
60


The advantage of writing a function, however, would be that you can pass along your own parameters, such as the search term and page number, which would make this much more useful. 

In [24]:
def get_articles(search_term, page_number):
    '''
    Make calls to the Chronicling America API.
    '''
    
    base_url = 'http://chroniclingamerica.loc.gov/search/pages/results/'
    parameters = {'andtext': search_term,
                  'page'   : page_number,
                  'format' : 'json'}
    
    r = requests.get(base_url, params = parameters)
    results =  r.json()

    return results

In [25]:
results = get_articles('lynching', 3)

print results['startIndex']
print results['endIndex']

41
60


Now, the first 60 results can be downloaded in just a few lines:

In [26]:
for page_number in range(1,4): # range stops before it gets to the last number
    results = get_articles('lynching', page_number)
    print results['startIndex'], results['endIndex']
    

1 20
21 40
41 60


Everything appears to be working, but unfortunately I still have only the last page of results. Each call to the API redefined the `results` variable. To fix this, I set up an empty dataframe to store the results and append the items from each page to it.

In [28]:
df = pd.DataFrame() # empty dataframe to store results

for page_number in range(1,4):
    results = get_articles('lynching', page_number)
    new_df = pd.DataFrame(results['items'])
    df = df.append(new_df , ignore_index=True) # otherwise, the index would run 0-19 three times
    
print len(df)
df.head(5)

60


Unnamed: 0,alt_title,batch,city,country,county,date,edition,edition_label,end_year,frequency,id,language,lccn,note,ocr_eng,page,place,place_of_publication,publisher,section_label,sequence,start_year,state,subject,title,title_normal,type,url
0,[],batch_mimtptc_jackson_ver01,[Dearborn],Michigan,[Wayne],19211022,,,1927,Weekly,/lccn/2013218776/1921-10-22/ed-1/seq-1/,[English],2013218776,"[""The Ford international weekly"" appears with ...","""Mis-Picturing Us Abroad"" Introduces the Serie...",,[Michigan--Wayne--Dearborn],"Dearborn, Mich.",Suburban Pub. Co.,,1,1901,[Michigan],"[Dearborn (Mich.)--Newspapers., Michigan--Dear...",Dearborn independent.,dearborn independent.,page,http://chroniclingamerica.loc.gov/lccn/2013218...
1,[],batch_iune_hotel_ver01,[Chicago],Illinois,[Cook County],19150818,,LAST EDITION,1917,Daily (except Sunday and holidays),/lccn/sn83045487/1915-08-18/ed-1/seq-4/,[English],sn83045487,"[""An adless daily newspaper."", Archived issues...",25 patriots who took into their own\nhands a l...,,[Illinois--Cook County--Chicago],"Chicago, Ill.",N.D. Cochran,,4,1911,[Illinois],"[Chicago (Ill.)--Newspapers., Illinois--Chicag...",The day book.,day book.,page,http://chroniclingamerica.loc.gov/lccn/sn83045...
2,[Saint Paul tidende],batch_mnhi_gemma_ver01,[Saint Paul],Minnesota,[Ramsey],19160310,,,1928,Weekly,/lccn/sn90059649/1916-03-10/ed-1/seq-4/,[Danish],sn90059649,[Available on microfilm from the Minnesota His...,"}'i» Room 201 Court Six#, St. Paul, fflinn.\nB...",,[Minnesota--Ramsey--Saint Paul],"St. Paul, Minn.",C. Rasmussen Pub. Co.,,4,1902,[Minnesota],"[Danes--Minnesota--Newspapers., Danes.--fast--...",St. Paul tidende.,st. paul tidende.,page,http://chroniclingamerica.loc.gov/lccn/sn90059...
3,[Saint Paul tidende],batch_mnhi_gemma_ver01,[Saint Paul],Minnesota,[Ramsey],19091119,,,1928,Weekly,/lccn/sn90059649/1909-11-19/ed-1/seq-5/,[Danish],sn90059649,[Available on microfilm from the Minnesota His...,November 1909.\nLynching i Illinois.\nMinnesot...,,[Minnesota--Ramsey--Saint Paul],"St. Paul, Minn.",C. Rasmussen Pub. Co.,,5,1902,[Minnesota],"[Danes--Minnesota--Newspapers., Danes.--fast--...",St. Paul tidende.,st. paul tidende.,page,http://chroniclingamerica.loc.gov/lccn/sn90059...
4,"[Star, Sunday star]",batch_dlc_dalek_ver01,[Washington],District of Columbia,[None],19221123,,,1972,Daily,/lccn/sn83045462/1922-11-23/ed-1/seq-34/,[English],sn83045462,"[""From April 25 through May 24, 1861 one sheet...",T\nTU^\nnit;\n- V E\nxl\nII\nb<\nin rour\n3436...,34.0,[District of Columbia--Washington],"Washington, D.C.",W.D. Wallach & Hope,,34,1854,[District of Columbia],"[Washington (D.C.)--fast--(OCoLC)fst01204505, ...",Evening star.,evening star.,page,http://chroniclingamerica.loc.gov/lccn/sn83045...
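
An alternative pattern, sketched here under a few assumptions, is to collect the items from every page in an ordinary list and build the dataframe once at the end with `pd.DataFrame(all_items)`. This also sidesteps `DataFrame.append`, which was removed in pandas 2.0. To keep the sketch runnable offline, `fake_get_articles` is a hypothetical stand-in that mimics the shape of the API's response. In practice you would call `get_articles` instead.

```python
def fake_get_articles(search_term, page_number, items_per_page=20):
    '''Hypothetical stand-in for get_articles(); makes no network call.'''
    start = (page_number - 1) * items_per_page + 1
    end = page_number * items_per_page
    items = [{'id': number, 'title': '%s result %d' % (search_term, number)}
             for number in range(start, end + 1)]
    return {'startIndex': start, 'endIndex': end, 'items': items}

all_items = []   # one dictionary per article, across all pages

for page_number in range(1, 4):
    results = fake_get_articles('lynching', page_number)
    all_items.extend(results['items'])   # extend adds each item, not the whole list

print(len(all_items))
```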


For a large download, you would still want to tweak this a bit by pausing between each API call and making it robust to internet or API errors, but this is a solid framework for collecting data from an API.
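
One way to add the error handling is sketched below. The function name `call_with_retries`, the number of attempts, and the pause length are all assumptions chosen for illustration. `fetch` is any function that makes the API call, for example `lambda: get_articles('lynching', 3)`.

```python
import time

def call_with_retries(fetch, attempts=3, pause=1.0):
    '''
    Call fetch() until it succeeds or the attempts run out.
    Sleeps for `pause` seconds after each failure.
    '''
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as error:   # in practice, catch narrower errors
            last_error = error
            time.sleep(pause)
    raise last_error

# Demonstration with a hypothetical function that fails twice, then works.
calls = {'count': 0}

def flaky_fetch():
    calls['count'] += 1
    if calls['count'] < 3:
        raise IOError('temporary failure')
    return {'startIndex': 41, 'endIndex': 60}

results = call_with_retries(flaky_fetch, attempts=5, pause=0)
print(results['startIndex'])
```

In the download loop itself, you would additionally call `time.sleep()` after each successful page so that the requests are spaced out.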

<div class="alert alert-info">
<h1> Geocoding</h1>

Work in groups of three!
<p>
You have been handed a list of addresses. You want to geocode them. 
<p>Read about the Google Maps Geocoding API.
<p>On paper, map out your work flow. What functions will you need?

</div>

File location:

https://raw.githubusercontent.com/nealcaren/CSSS-CABD/master/files/locations.csv

<div class="alert alert-info">

My workflow:   
   
   
1. Use `requests` to test out a single address.    
2. Turn that into a function that accepts a location.    
3. Read in the CSV file with all the locations.    
4. Store the results.    
5. Write them to a CSV.    
</div>
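
In code, the skeleton of that workflow might look like the following (Python 3). Everything here is a placeholder: `geocode` is a hypothetical stub that returns made-up coordinates instead of calling a geocoding service, and the two quoted addresses stand in for the contents of the CSV file, whose actual column layout is not assumed. Step 1, testing a single address with `requests`, is left to you.

```python
import csv
import io

def geocode(location):
    '''Hypothetical stub: replace the body with a real API call.'''
    return {'location': location, 'lat': 0.0, 'lng': 0.0}   # made-up values

# Stand-in for the downloaded file of locations (one quoted address per row).
locations_file = io.StringIO('"Chapel Hill, NC"\n"Oslo, Norway"\n')

# Step 3: read in the locations.
locations = [row[0] for row in csv.reader(locations_file) if row]

# Step 4: geocode each one and store the results.
results = [geocode(location) for location in locations]

# Step 5: write the results to CSV (an in-memory file here).
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=['location', 'lat', 'lng'])
writer.writeheader()
writer.writerows(results)

print(output.getvalue())
```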

### A third API

While the Chronicling America API allows anonymous usage, most APIs require you to register in advance. This usually involves going to their website, signing up for the service, and then going through a second signup for developers.


When you sign up to use an API, you usually agree to only use the API to facilitate other people using the service (e.g. customers finding their way to your store) and that you won't store the data. API providers usually enforce this through rate limiting, meaning you can only access the service so many times per minute or per day. For example, you can only search status updates 180 times every 15 minutes according to [Twitter guidelines](https://dev.twitter.com/docs/rate-limiting/1.1/limits). [Yelp](http://www.yelp.com/developers/documentation/faq) limits you to 10,000 calls per day. If you go over your limit, you won't be able to access the service for a bit. You will also get in trouble if you redistribute the data, so don't plan on doing that.
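
To stay under limits like these, you can work out the minimum pause between calls. The sketch below simply spreads the allowed calls evenly over the window, using the two limits quoted above. Real limits change over time, so check the provider's current documentation. `seconds_between_calls` is a hypothetical helper.

```python
def seconds_between_calls(max_calls, window_seconds):
    '''Smallest even spacing that keeps you within the limit.'''
    return window_seconds / float(max_calls)

twitter_pause = seconds_between_calls(180, 15 * 60)        # 180 calls per 15 minutes
yelp_pause = seconds_between_calls(10000, 24 * 60 * 60)    # 10,000 calls per day

print(twitter_pause)
print(yelp_pause)
```

Passing the result to `time.sleep()` inside a download loop would keep a script within the limit.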

Two of the major reasons that web services require API authentication are so that they know who you are and so they can make sure that you don't go over their rate limits. Since you shouldn't be giving your password to random people on the internet, API authentication works a little bit differently.
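
One common approach is to sign each request with a secret that is never itself sent over the wire. The sketch below illustrates the general idea with an HMAC-SHA1 signature, using only the standard library. It is deliberately simplified: the secret and the message are made up, and real protocols build the signed message in a precisely specified way that a dedicated library handles for you.

```python
import base64
import hashlib
import hmac

def sign(message, secret):
    '''Return a base64-encoded HMAC-SHA1 signature of message.'''
    digest = hmac.new(secret.encode('utf-8'),
                      message.encode('utf-8'),
                      hashlib.sha1).digest()
    return base64.b64encode(digest).decode('ascii')

secret = 'not-a-real-secret'                       # hypothetical value
message = 'GET&search&location=Oslo&nonce=12345'   # simplified, not a real protocol format

signature = sign(message, secret)
print(signature)

# The server, which also knows the secret, recomputes the signature.
# Any change to the request produces a different signature.
tampered = sign(message.replace('Oslo', 'Bergen'), secret)
print(hmac.compare_digest(signature, tampered))
```

Only the signature travels with the request, so someone who intercepts it cannot recover the secret or reuse the signature for a different request.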



Like many other places, in order to use the Yelp API you have to sign up as a [developer](http://www.yelp.com/developers). After telling them a little bit about what you plan to do--feel free to be honest; they aren't going to deny you access if you put "research on food cultures" as the purpose--you will get a *Consumer Key*, *Consumer Secret*, *Token*, and *Token Secret*. Copy and paste them somewhere special.

Using the Yelp API goes something like this. First, you tell Yelp who you are and what you want. Assuming you are authorized to have this information, they respond with a URL where you can retrieve the data. The coding for this in practice is a little bit complicated, so there are often single-purpose tools for accessing APIs, like [Tweepy](http://tweepy.github.io) for Twitter.

Yelp uses the OAuth protocol for authentication. There are several Python libraries for handling this, but you will likely need to install one (via `conda` or `pip`) yourself first.

In [29]:
import oauth2


There's no module to install for the Yelp API, but Yelp does provide some [sample Python code](https://github.com/Yelp/yelp-api/tree/master/v2/python). I've slightly modified the code below to show a sample search for restaurants near Oslo, Norway, sorted by distance. You can find more options in the search [documentation](http://www.yelp.com/developers/documentation/v2/search_api). The API's search options include things like location and type of business, and allow you to sort either by distance or popularity.



In [30]:
consumer_key    = 'qDBPo9c_szHVrZwxzo-zDw'
consumer_secret = '4we8Jz9rq5J3j15Z5yCUqmgDJjM'
token           = 'jeRrhRey_k-emvC_VFLGrlVHrkR4P3UF'
token_secret    = 'n-7xHNCxxedmAMYZPQtnh1hd7lI'

consumer = oauth2.Consumer(consumer_key, consumer_secret)

In [31]:
category_filter = 'restaurants'
location        = 'Oslo, Norway'

options         =  'category_filter=%s&location=%s&sort=1' % (category_filter, location)
url             = 'http://api.yelp.com/v2/search?' + options

oauth_request = oauth2.Request('GET', url, {})

oauth_request.update({'oauth_nonce'      : oauth2.generate_nonce(),
                      'oauth_timestamp'  : oauth2.generate_timestamp(),
                      'oauth_token'       : token,
                      'oauth_consumer_key': consumer_key})

token = oauth2.Token(token, token_secret)
oauth_request.sign_request(oauth2.SignatureMethod_HMAC_SHA1(), consumer, token)
signed_url = oauth_request.to_url()

print signed_url

http://api.yelp.com/v2/search?sort=1&oauth_body_hash=2jmj7l5rSw0yVb%2FvlWAYkK%2FYBwk%3D&oauth_nonce=13853216&oauth_timestamp=1501417028&oauth_consumer_key=qDBPo9c_szHVrZwxzo-zDw&oauth_signature_method=HMAC-SHA1&category_filter=restaurants&oauth_token=jeRrhRey_k-emvC_VFLGrlVHrkR4P3UF&location=Oslo%2C+Norway&oauth_signature=we0CxC12Un504SHisw3Pk0xJBp8%3D


The URL returned expires after a couple of seconds, so don't expect the above link to work. The results are provided in the JSON file format, so I'm going to use the already imported `requests` module to download them.

In [32]:
resp = requests.get(url=signed_url)
oslo_restaurants = resp.json()

print oslo_restaurants.keys()

[u'region', u'total', u'businesses']


As with the Chronicling America API, the top level of the JSON contains some metadata about the search, with all the specific items in one field, in this case `businesses`.

In [33]:
oslo_restaurants['businesses'][1]

{u'categories': [[u'Scandinavian', u'scandinavian']],
 u'display_phone': u'+47 22 69 60 00',
 u'id': u'smalhans-oslo',
 u'image_url': u'https://s3-media1.fl.yelpcdn.com/bphoto/5QT795Zc4TcUl8ue-iA-Og/ms.jpg',
 u'is_claimed': True,
 u'is_closed': False,
 u'location': {u'address': [u'Waldemar Thranes gate 10'],
  u'city': u'Oslo',
  u'coordinate': {u'latitude': 59.9235229, u'longitude': 10.7395983},
  u'country_code': u'NO',
  u'display_address': [u'Waldemar Thranes gate 10',
   u'St. Hanshaugen',
   u'0171 Oslo',
   u'Norway'],
  u'geo_accuracy': 8.0,
  u'neighborhoods': [u'St. Hanshaugen', u'Bislett'],
  u'postal_code': u'0171',
  u'state_code': u'03'},
 u'mobile_url': u'https://m.yelp.com/biz/smalhans-oslo?adjust_creative=qDBPo9c_szHVrZwxzo-zDw&utm_campaign=yelp_api&utm_medium=api_v2_search&utm_source=qDBPo9c_szHVrZwxzo-zDw',
 u'name': u'Smalhans',
 u'phone': u'+4722696000',
 u'rating': 4.0,
 u'rating_img_url': u'https://s3-media4.fl.yelpcdn.com/assets/2/www/img/c2f3dd9799a5/ico/stars/

Inspecting the returned results for one restaurant, it is clear that Yelp is keeping a lot of the review data for themselves. They return the overall restaurant `rating`, but provide only a small bit of text (`snippet_text`) instead of the full reviews and ratings.



In [34]:
print oslo_restaurants['total']
print len(oslo_restaurants['businesses'])

40
20


Additionally, they cap the total number of businesses the search will return at 40 and only provide 20 results for each API call.



Even with these restrictions, it still might be useful for social science research. As before, you would likely want to define a function in order to make repeated calls to the API. In this case, the easier solution might be to create two functions: one that gets a single page, and another that retrieves both pages for a single geographical area by calling the first function twice. While it would be possible to do this with zero or one new functions, creating two functions allows for better control over finding and debugging errors, since you can test each function independently. Creating lots of small functions generally makes the code more readable, especially in a case like this where you are looping over pages within restaurants within geographic areas. In general, I think the principle of a workflow consisting of small functions, as is commonly found in Python code, is something that social scientists should adopt even when they aren't writing Python.

In [35]:
def get_yelp_page(location, offset):
    '''
    Retrieve one page of results from the Yelp API
    Returns a JSON file
    '''
    # from https://github.com/Yelp/yelp-api/tree/master/v2/python
    consumer_key    = 'qDBPo9c_szHVrZwxzo-zDw'
    consumer_secret = '4we8Jz9rq5J3j15Z5yCUqmgDJjM'
    token           = 'jeRrhRey_k-emvC_VFLGrlVHrkR4P3UF'
    token_secret    = 'n-7xHNCxxedmAMYZPQtnh1hd7lI'
    
    consumer = oauth2.Consumer(consumer_key, consumer_secret)
    
    url = 'http://api.yelp.com/v2/search?category_filter=restaurants&location=%s&sort=1&offset=%s' % (location, offset)
    
    oauth_request = oauth2.Request('GET', url, {})
    oauth_request.update({'oauth_nonce': oauth2.generate_nonce(),
                          'oauth_timestamp': oauth2.generate_timestamp(),
                          'oauth_token': token,
                          'oauth_consumer_key': consumer_key})
    
    token = oauth2.Token(token, token_secret)
    
    oauth_request.sign_request(oauth2.SignatureMethod_HMAC_SHA1(), consumer, token)
    
    signed_url = oauth_request.to_url()
    resp = requests.get(url=signed_url)
    return resp.json()

def get_yelp_results(location):
    '''
    Retrieve both pages of results from the Yelp API
    Returns a dataframe
    '''
    df = pd.DataFrame()
    for offset in [1,21]:
        results = get_yelp_page(location, offset)
        new_df = pd.DataFrame(results['businesses'])
        df = df.append(new_df , ignore_index=True)
    return df

In [36]:
ch_df = get_yelp_results('Chapel Hill, NC')

print len(ch_df)

40


In [37]:
ch_df.keys()

Index([u'categories', u'display_phone', u'id', u'image_url', u'is_claimed',
       u'is_closed', u'location', u'menu_date_updated', u'menu_provider',
       u'mobile_url', u'name', u'phone', u'rating', u'rating_img_url',
       u'rating_img_url_large', u'rating_img_url_small', u'review_count',
       u'snippet_image_url', u'snippet_text', u'url'],
      dtype='object')

In [38]:
ch_df[['name','categories','review_count','rating']].sort_values(by='rating', ascending=False)

Unnamed: 0,name,categories,review_count,rating
11,George's Seafood Jumbo,"[[Seafood, seafood]]",1,5.0
33,The Purple Bowl,"[[Breakfast & Brunch, breakfast_brunch], [Acai...",8,5.0
37,Mediterranean Deli,"[[Greek, greek], [Mediterranean, mediterranean...",607,4.5
4,Imbibe,"[[American (New), newamerican], [Wine Bars, wi...",27,4.5
7,Sutton's Drug Store,"[[Burgers, burgers], [Sandwiches, sandwiches],...",103,4.5
29,1.5.0. Fresh,"[[American (New), newamerican]]",2,4.5
34,Cholanad,"[[Indian, indpak], [Vegan, vegan]]",261,4.0
15,Cosmic Cantina,"[[Tex-Mex, tex-mex], [Tacos, tacos], [Beer Bar...",103,4.0
14,R & R Grill,"[[American (Traditional), tradamerican], [Burg...",93,4.0
23,Crepe Traditions,"[[Creperies, creperies], [Coffee & Tea, coffee...",43,4.0


In [40]:
oslo_df = get_yelp_results('Oslo, Norway')
oslo_df[['name','categories','review_count','rating']].sort_values(by='rating', ascending=False)


Unnamed: 0,name,categories,review_count,rating
20,Pila,"[[Cafes, cafes], [Scandinavian, scandinavian],...",6,5.0
15,Happolati,"[[Asian Fusion, asianfusion]]",6,5.0
3,Stangeriet,"[[Sandwiches, sandwiches], [Meat Shops, meats]]",10,5.0
37,Arakataka,"[[Modern European, modern_european]]",23,4.5
36,Nøkken,"[[Scandinavian, scandinavian]]",13,4.5
34,Restaurant Fjord,"[[Seafood, seafood]]",16,4.5
33,Crêperie de Mari,"[[Creperies, creperies]]",30,4.5
32,Meatballs,"[[Fast Food, hotdogs]]",28,4.5
31,Kamai,"[[Sushi Bars, sushi], [Asian Fusion, asianfusi...",21,4.5
27,Way Down South,"[[American (Traditional), tradamerican], [Barb...",14,4.5


<div class="alert alert-info">
Your turn. Modify the function to take different kinds of business. 

You can also add a category of business to search for from the [list](https://www.yelp.com/developers/documentation/v2/category_list) of acceptable values. 
</div>

<div class="alert alert-info">
Your turn. Sign up to be a developer on Twitter. Figure out the next steps...
</div>

`get_yelp_page` expects that the first thing you input will be a location. Taking advantage of both `oauth2`'s ability to clean up the text so that it is functional when put in a URL (e.g., escape spaces) and Yelp's savvy ability to parse locations, the value for location can be fairly broad (e.g., "Chapel Hill" or "90210"). If you modify the function to accept a category of business from the [list](https://www.yelp.com/developers/documentation/v2/category_list) of acceptable values, writing the argument as `category_filter = 'restaurants'` in the function definition makes 'restaurants' the default when no value is provided. The function returns the JSON-formatted results. Note that it doesn't have any mechanism for handling errors, which will need to happen elsewhere.
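
As a hedged sketch of that default argument, the helper below builds only the unsigned search URL. `build_search_url` is a hypothetical name, and the request would still need to be signed with `oauth2` as in `get_yelp_page`.

```python
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

def build_search_url(location, offset, category_filter='restaurants'):
    '''Build the unsigned Yelp v2 search URL for one page of results.'''
    options = [('category_filter', category_filter),
               ('location', location),
               ('sort', 1),
               ('offset', offset)]
    # urlencode escapes spaces and commas in the location.
    return 'http://api.yelp.com/v2/search?' + urlencode(options)

# No category given, so the default of 'restaurants' is used.
default_url = build_search_url('Chapel Hill, NC', 1)
print(default_url)

# 'bars' is assumed to appear in Yelp's category list.
bars_url = build_search_url('90210', 1, category_filter='bars')
print(bars_url)
```

Keyword arguments with defaults let existing calls keep working unchanged while new calls can override the category.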