# Harvesting data from the web: APIs  

### A first API

[Chronicling America](http://chroniclingamerica.loc.gov/about/) is a joint project of the National Endowment for the Humanities and the Library of Congress.

Search for articles that mention "[slavery](http://chroniclingamerica.loc.gov/search/pages/results/?andtext=slavery)".

![](https://raw.githubusercontent.com/nealcaren/UiOBigData/master/notebooks/images/chron.png)

![](https://raw.githubusercontent.com/nealcaren/UiOBigData/master/notebooks/images/chron_slavery.png)

<div class="alert alert-info">

Look at the URL. What happens if you change the word slavery to abolition? 

What happens to the URL when you go to the second page? Can you get to page 251?

</div>

https://chroniclingamerica.loc.gov/search/pages/results/?state=&date1=1789&date2=1963&proxtext=abolition&x=0&y=0&dateFilterType=yearRange&rows=20&searchType=basic

What if we append ``&format=json`` to the end of the search URL? 


http://chroniclingamerica.loc.gov/search/pages/results/?andtext=slavery&format=json


![](https://raw.githubusercontent.com/nealcaren/UiOBigData/master/notebooks/images/chron_json.png)

[``requests``](http://docs.python-requests.org/en/master/) is a useful and commonly used HTTP library for Python. It is not part of the default installation, but it is included with the Anaconda Python distribution.

In [1]:
import requests

It would be possible to pass the API URL and parameters directly to `requests`, but the most likely scenario involves making repeated calls as part of a loop (this search returned less than 1% of the results), so I store the strings first.

In [2]:
base_url   = 'http://chroniclingamerica.loc.gov/search/pages/results/'
parameters = '?andtext=slavery&format=json'

url = base_url + parameters
print(url)

http://chroniclingamerica.loc.gov/search/pages/results/?andtext=slavery&format=json


`requests.get()` is used for accessing both websites and APIs. The command can be modified by several arguments, but at a minimum, it requires the URL.

In [3]:
r = requests.get(base_url + parameters)

`r` is a `requests` response object. Any JSON returned by the server is available by calling `.json()`.

In [4]:
r.json()

{'totalItems': 515327,
 'endIndex': 20,
 'startIndex': 1,
 'itemsPerPage': 20,
 'items': [{'sequence': 1,
   'county': ['Columbiana', 'Columbiana'],
   'edition': None,
   'frequency': 'Weekly',
   'id': '/lccn/sn83035487/1849-03-16/ed-1/seq-1/',
   'subject': ['Antislavery movements--United States--Newspapers.',
    'Antislavery movements.--fast--(OCoLC)fst00810800',
    'Lisbon (Ohio)--Newspapers.',
    'Ohio--Lisbon.--fast--(OCoLC)fst01249658',
    'Ohio--Salem.--fast--(OCoLC)fst01223494',
    'Salem (Ohio)--Newspapers.',
    'Slavery--United States--Newspapers.',
    'Slavery.--fast--(OCoLC)fst01120426',
    'United States.--fast--(OCoLC)fst01204155'],
   'city': ['New Lisbon', 'Salem'],
   'date': '18490316',
   'title': 'Anti-slavery bugle. [volume]',
   'end_year': 1861,
   'note': ['Archived issues are available in digital format as part of the Library of Congress Chronicling America online collection.',
    'Editors: Benjamin S. Jones, J. Elizabeth Hitchcock, 1845-1846; Benjam

In [5]:
search_json = r.json()

JSON objects are dictionary-like, in that they have keys (think variable names) and values. `.keys()` returns the keys.

In [6]:
search_json.keys()

dict_keys(['totalItems', 'endIndex', 'startIndex', 'itemsPerPage', 'items'])

You can return the value of any key by putting the key name in brackets.

In [7]:
search_json['totalItems']

515327

<div class="alert alert-info">
What else is in there? Where is the stuff we want?
</div>

In [8]:
search_json['items']

[{'sequence': 1,
  'county': ['Columbiana', 'Columbiana'],
  'edition': None,
  'frequency': 'Weekly',
  'id': '/lccn/sn83035487/1849-03-16/ed-1/seq-1/',
  'subject': ['Antislavery movements--United States--Newspapers.',
   'Antislavery movements.--fast--(OCoLC)fst00810800',
   'Lisbon (Ohio)--Newspapers.',
   'Ohio--Lisbon.--fast--(OCoLC)fst01249658',
   'Ohio--Salem.--fast--(OCoLC)fst01223494',
   'Salem (Ohio)--Newspapers.',
   'Slavery--United States--Newspapers.',
   'Slavery.--fast--(OCoLC)fst01120426',
   'United States.--fast--(OCoLC)fst01204155'],
  'city': ['New Lisbon', 'Salem'],
  'date': '18490316',
  'title': 'Anti-slavery bugle. [volume]',
  'end_year': 1861,
  'note': ['Archived issues are available in digital format as part of the Library of Congress Chronicling America online collection.',
   'Editors: Benjamin S. Jones, J. Elizabeth Hitchcock, 1845-1846; Benjamin S. Jones, J. Elizabeth Jones, 1846-1849; Oliver Johnson 1849-1851; Marius R. Robinson, 1851-1859; Benjami

As is often the case with results from an API, most of the keys and values are metadata about either the search or what is being returned. These are useful for knowing if the search is returning what you want, which is particularly important when you are making multiple calls to the API.

The data I'm interested in is all in `items`.

In [9]:
type(search_json['items'])

list

In [10]:
len(search_json['items'])

20

`items` is a list with 20 items.

In [11]:
type(search_json['items'][3])

dict

Each of the 20 items in the list is a dictionary. 

In [12]:
first_item = search_json['items'][0]

first_item.keys()

dict_keys(['sequence', 'county', 'edition', 'frequency', 'id', 'subject', 'city', 'date', 'title', 'end_year', 'note', 'state', 'section_label', 'type', 'place_of_publication', 'start_year', 'edition_label', 'publisher', 'language', 'alt_title', 'lccn', 'country', 'ocr_eng', 'batch', 'title_normal', 'url', 'place', 'page'])

<div class="alert alert-info">
What is the title of the first item?
</div>

In [13]:
print(first_item['title'])

Anti-slavery bugle. [volume]
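
Because every item is a dictionary with the same kind of fields, one field can be pulled from all of them at once. A minimal sketch, in which `items` is a stand-in for `search_json['items']` trimmed to two fields copied from this search's results:

```python
# Stand-in for search_json['items'], trimmed to two items and two fields
items = [
    {'title': 'Anti-slavery bugle. [volume]', 'date': '18490316'},
    {'title': 'The day book. [volume]', 'date': '19140516'},
]

# A list comprehension collects the same key from every dictionary
titles = [item['title'] for item in items]
print(titles)
```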


While a standard CSV file has a header row that describes the contents of each column, a JSON file has keys identifying the values found in each case. Importantly, these keys need not be the same for each item. Additionally, values don't have to be numbers or strings, but could be lists or dictionaries. For example, this JSON could have included a `newspaper` key holding a dictionary with all the metadata about the newspaper and issue in which the article was published, an `article` key holding the article-specific information as another dictionary, and a `text` key whose value was a string with the article text.
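
To make the nested layout concrete, here is a small sketch. The two records are invented for illustration and do not reflect how the Chronicling America API actually structures its output; the second one deliberately lacks an `article` key.

```python
import json

# Hypothetical records, invented for illustration; not the API's actual layout
raw = '''
[
  {"newspaper": {"title": "Anti-slavery bugle.", "city": ["New Lisbon", "Salem"]},
   "article": {"date": "18490316"},
   "text": "THE ANTI-SLAVERY BUGLE"},
  {"newspaper": {"title": "The day book.", "city": ["Chicago"]},
   "text": "(article text)"}
]
'''
records = json.loads(raw)

# Values can be nested: a dictionary that holds a list
print(records[0]['newspaper']['city'][1])

# Keys can differ between records, so .get() avoids a KeyError
for record in records:
    print(record.get('article', 'no article key'))
```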

As before, we can examine the contents of a particular item, such as the text stored in `ocr_eng`.

In [14]:
first_item['ocr_eng']

'LAVE\nam\nJlile\nVOL. 4. NO. 30.\nSALEM. OHIO, FRIDAY, MARCH 1G, 1849.\nWHOLE NO. 186.\nANTI\nTi v Tr-\nHI\nTHE ANTI-SLAVERY BUGLE\ngovernments and pro-slavery churchy organi-\nRations. It is, Edited by Uenjamin S. and J.\nElizabeth Jones; ami wane urging y,"\nleople the duty of holding " N union witn\nIs published every Friday, at Salem, Colum\nbiana Co., OAio.by the Executive Committee\nof the Western Anti-lavehy oocikiit,\nand is the only paper in the Great West\nwhich advocates secession trom pro-siavery\nE\n.,.. 1 l.o ittlttf\nSlaveholders, euner in yi.......\nthe only consistent position an Abolitionist\ncan occupy, and as the best means for I he do\ntraction of slavery ; it will, so far as Is l.m\nits permit, k\'ivb a history of the daily prog ess\nof Um antUslavery cause-exhibit the policy\nand practice of slaveholders, and by facts and\narguments endeavor to increase the zeal and\nctivily of every true lover of Freedom. In\n:. oni .olnverv matter, it will\naauiiiuu iv .\nchoi

In [15]:
print(first_item['ocr_eng'])

LAVE
am
Jlile
VOL. 4. NO. 30.
SALEM. OHIO, FRIDAY, MARCH 1G, 1849.
WHOLE NO. 186.
ANTI
Ti v Tr-
HI
THE ANTI-SLAVERY BUGLE
governments and pro-slavery churchy organi-
Rations. It is, Edited by Uenjamin S. and J.
Elizabeth Jones; ami wane urging y,"
leople the duty of holding " N union witn
Is published every Friday, at Salem, Colum
biana Co., OAio.by the Executive Committee
of the Western Anti-lavehy oocikiit,
and is the only paper in the Great West
which advocates secession trom pro-siavery
E
.,.. 1 l.o ittlttf
Slaveholders, euner in yi.......
the only consistent position an Abolitionist
can occupy, and as the best means for I he do
traction of slavery ; it will, so far as Is l.m
its permit, k'ivb a history of the daily prog ess
of Um antUslavery cause-exhibit the policy
and practice of slaveholders, and by facts and
arguments endeavor to increase the zeal and
ctivily of every true lover of Freedom. In
:. oni .olnverv matter, it will
aauiiiuu iv .
choice exitaeis, iliuidi
tale, to? It 

In [16]:
print(first_item['ocr_eng'][:200])

LAVE
am
Jlile
VOL. 4. NO. 30.
SALEM. OHIO, FRIDAY, MARCH 1G, 1849.
WHOLE NO. 186.
ANTI
Ti v Tr-
HI
THE ANTI-SLAVERY BUGLE
governments and pro-slavery churchy organi-
Rations. It is, Edited by Uenjamin


The easiest way to view or analyze this data is to convert it to a dataset-like structure. While Python does not have a built-in dataframe type, the popular `pandas` library does. By convention, it is imported as `pd`.

In [17]:
import pandas as pd

# Make sure all columns are displayed
pd.set_option("display.max_columns",101)

pandas is pretty smart about importing different JSON-type objects and converting them to dataframes with its `.DataFrame()` function.

In [18]:
df = pd.DataFrame(search_json['items'])

df.head(6)

Unnamed: 0,alt_title,batch,city,country,county,date,edition,edition_label,end_year,frequency,id,language,lccn,note,ocr_eng,page,place,place_of_publication,publisher,section_label,sequence,start_year,state,subject,title,title_normal,type,url
0,[],ohi_ariel_ver02,"[New Lisbon, Salem]",Ohio,"[Columbiana, Columbiana]",18490316,,,1861,Weekly,/lccn/sn83035487/1849-03-16/ed-1/seq-1/,[English],sn83035487,[Archived issues are available in digital form...,"LAVE\nam\nJlile\nVOL. 4. NO. 30.\nSALEM. OHIO,...",,"[Ohio--Columbiana--New Lisbon, Ohio--Columbian...","New-Lisbon, Ohio",Ohio American Antislavery Society,,1,1845,"[Ohio, Ohio]",[Antislavery movements--United States--Newspap...,Anti-slavery bugle. [volume],anti-slavery bugle.,page,https://chroniclingamerica.loc.gov/lccn/sn8303...
1,[],iune_golf_ver01,[Chicago],Illinois,[Cook County],19140516,,NOON EDITION,1917,Daily (except Sunday and holidays),/lccn/sn83045487/1914-05-16/ed-1/seq-10/,[English],sn83045487,"[""An adless daily newspaper."", Archived issues...",r\nmmmmmmmmmmmmmmmmmmmmmmmm\n'SLAVERY RIFE IN ...,,[Illinois--Cook County--Chicago],"Chicago, Ill.",N.D. Cochran,,10,1911,[Illinois],"[Chicago (Ill.)--Newspapers., Illinois--Chicag...",The day book. [volume],day book.,page,https://chroniclingamerica.loc.gov/lccn/sn8304...
2,[],iune_india_ver01,[Chicago],Illinois,[Cook County],19161109,,EXTRA,1917,Daily (except Sunday and holidays),/lccn/sn83045487/1916-11-09/ed-1/seq-26/,[English],sn83045487,"[""An adless daily newspaper."", Archived issues...",us remaining whites if we expect to\nstay on t...,,[Illinois--Cook County--Chicago],"Chicago, Ill.",N.D. Cochran,,26,1911,[Illinois],"[Chicago (Ill.)--Newspapers., Illinois--Chicag...",The day book. [volume],day book.,page,https://chroniclingamerica.loc.gov/lccn/sn8304...
3,[],iune_foxtrot_ver01,[Chicago],Illinois,[Cook County],19130308,,NOON EDITION,1917,Daily (except Sunday and holidays),/lccn/sn83045487/1913-03-08/ed-1/seq-6/,[English],sn83045487,"[""An adless daily newspaper."", Archived issues...",that every possible weakness in. a\ngirl as &e...,,[Illinois--Cook County--Chicago],"Chicago, Ill.",N.D. Cochran,,6,1911,[Illinois],"[Chicago (Ill.)--Newspapers., Illinois--Chicag...",The day book. [volume],day book.,page,https://chroniclingamerica.loc.gov/lccn/sn8304...
4,[],iune_foxtrot_ver01,[Chicago],Illinois,[Cook County],19130424,,,1917,Daily (except Sunday and holidays),/lccn/sn83045487/1913-04-24/ed-1/seq-13/,[English],sn83045487,"[""An adless daily newspaper."", Archived issues...",mpICFED FOR WHITE -SLAVERY.\nTop Lola Norris-a...,,[Illinois--Cook County--Chicago],"Chicago, Ill.",N.D. Cochran,,13,1911,[Illinois],"[Chicago (Ill.)--Newspapers., Illinois--Chicag...",The day book. [volume],day book.,page,https://chroniclingamerica.loc.gov/lccn/sn8304...
5,[],iune_foxtrot_ver01,[Chicago],Illinois,[Cook County],19130815,,,1917,Daily (except Sunday and holidays),/lccn/sn83045487/1913-08-15/ed-1/seq-5/,[English],sn83045487,"[""An adless daily newspaper."", Archived issues...",LOLA NORRiajQlVS SiENSAT-iPN AL t EVIDENCE IN ...,,[Illinois--Cook County--Chicago],"Chicago, Ill.",N.D. Cochran,,5,1911,[Illinois],"[Chicago (Ill.)--Newspapers., Illinois--Chicag...",The day book. [volume],day book.,page,https://chroniclingamerica.loc.gov/lccn/sn8304...


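Note that I converted `search_json['items']` to a dataframe and not the entire JSON file. This is because I wanted each row to be an article. Converting the whole response instead repeats the search metadata on every row and leaves each article packed into a single `items` column: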

In [19]:
pd.DataFrame(search_json)

Unnamed: 0,totalItems,endIndex,startIndex,itemsPerPage,items
0,515327,20,1,20,"{'sequence': 1, 'county': ['Columbiana', 'Colu..."
1,515327,20,1,20,"{'sequence': 10, 'county': ['Cook County'], 'e..."
2,515327,20,1,20,"{'sequence': 26, 'county': ['Cook County'], 'e..."
3,515327,20,1,20,"{'sequence': 6, 'county': ['Cook County'], 'ed..."
4,515327,20,1,20,"{'sequence': 13, 'county': ['Cook County'], 'e..."
5,515327,20,1,20,"{'sequence': 5, 'county': ['Cook County'], 'ed..."
6,515327,20,1,20,"{'sequence': 24, 'county': ['Cook County'], 'e..."
7,515327,20,1,20,"{'sequence': 1, 'county': [None], 'edition': N..."
8,515327,20,1,20,"{'sequence': 30, 'county': ['Cook County'], 'e..."
9,515327,20,1,20,"{'sequence': 4, 'county': [None], 'edition': N..."


If this dataframe contained all the items that you were looking for, it would be easy to save this to a csv file for storage and later analysis.

In [20]:
df.to_csv('slavery_articles.csv', index = False)

In [21]:
!head slavery_articles.csv

alt_title,batch,city,country,county,date,edition,edition_label,end_year,frequency,id,language,lccn,note,ocr_eng,page,place,place_of_publication,publisher,section_label,sequence,start_year,state,subject,title,title_normal,type,url
[],ohi_ariel_ver02,"['New Lisbon', 'Salem']",Ohio,"['Columbiana', 'Columbiana']",18490316,,,1861,Weekly,/lccn/sn83035487/1849-03-16/ed-1/seq-1/,['English'],sn83035487,"['Archived issues are available in digital format as part of the Library of Congress Chronicling America online collection.', 'Editors: Benjamin S. Jones, J. Elizabeth Hitchcock, 1845-1846; Benjamin S. Jones, J. Elizabeth Jones, 1846-1849; Oliver Johnson 1849-1851; Marius R. Robinson, 1851-1859; Benjamin S. Jones, 1859-1861.', 'Not published June 27-July 18, 1845.', 'Printers: John Frost, 1845; J.H. Painter, 1845-1846; G.N. Hapgood, 1846-1848.', 'Published in: New Lisbon, Ohio, June 20, 1845-Aug. 29, 1845, and: Salem, Ohio, Sept. 5, 1845-May 4, 1861.', 'Publisher: Executive Committee of the Wes
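
One thing visible in the CSV output is that list-valued fields such as `city` are written out as text. A minimal sketch of turning such a cell back into a Python list with the standard library's `ast.literal_eval`, assuming the cell really contains a list written this way:

```python
import ast

# The text of the `city` cell exactly as it appears in the CSV file
cell = "['New Lisbon', 'Salem']"

# literal_eval safely parses the text back into a list
cities = ast.literal_eval(cell)
print(type(cities))
print(cities[1])
```

The JSON file written in the next cells does not have this problem, since JSON can store lists directly.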

In [22]:
df.to_json('slavery_articles.json', orient='records')

In [23]:
!head slavery_articles.json

[{"alt_title":[],"batch":"ohi_ariel_ver02","city":["New Lisbon","Salem"],"country":"Ohio","county":["Columbiana","Columbiana"],"date":"18490316","edition":null,"edition_label":"","end_year":1861,"frequency":"Weekly","id":"\/lccn\/sn83035487\/1849-03-16\/ed-1\/seq-1\/","language":["English"],"lccn":"sn83035487","note":["Archived issues are available in digital format as part of the Library of Congress Chronicling America online collection.","Editors: Benjamin S. Jones, J. Elizabeth Hitchcock, 1845-1846; Benjamin S. Jones, J. Elizabeth Jones, 1846-1849; Oliver Johnson 1849-1851; Marius R. Robinson, 1851-1859; Benjamin S. Jones, 1859-1861.","Not published June 27-July 18, 1845.","Printers: John Frost, 1845; J.H. Painter, 1845-1846; G.N. Hapgood, 1846-1848.","Published in: New Lisbon, Ohio, June 20, 1845-Aug. 29, 1845, and: Salem, Ohio, Sept. 5, 1845-May 4, 1861.","Publisher: Executive Committee of the Western Anti-slavery Society, 1848-1861."],"ocr_eng":"LAVE\nam\nJlile\nVOL. 4. NO. 30.\n



<div class="alert alert-info">
<h3> Your Turn</h3>
<p> Conduct your own search of the API. Store the results in a file.

</div>



In [24]:
search_word = 'silver'


base_url   = 'http://chroniclingamerica.loc.gov/search/pages/results/'
parameters = '?andtext=' + search_word + '&format=json'

r = requests.get(base_url + parameters)
search_json = r.json()
df = pd.DataFrame(search_json['items'])
df.to_csv(search_word+'_articles.csv')
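
One caveat when building the URL by string concatenation: a search term containing spaces or punctuation should be escaped first. A minimal sketch using the standard library's `urllib.parse.quote_plus`; the two-word search term is only an illustration.

```python
from urllib.parse import quote_plus

search_word = 'gold standard'  # illustrative two-word search term

base_url   = 'http://chroniclingamerica.loc.gov/search/pages/results/'
parameters = '?andtext=' + quote_plus(search_word) + '&format=json'

# The space is encoded as a plus sign in the query string
url = base_url + parameters
print(url)
```

Passing a dictionary through `params`, as done later in this notebook, takes care of this escaping automatically.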



![](images/exchange.png)

In [25]:
r = requests.get('https://api.exchangeratesapi.io/latest?base=NOK')

In [26]:
pd.DataFrame(r.json())

Unnamed: 0,rates,base,date
AUD,0.166266,NOK,2019-07-30
BGN,0.200867,NOK,2019-07-30
BRL,0.433787,NOK,2019-07-30
CAD,0.150881,NOK,2019-07-30
CHF,0.113364,NOK,2019-07-30
CNY,0.78836,NOK,2019-07-30
CZK,2.634336,NOK,2019-07-30
DKK,0.766843,NOK,2019-07-30
EUR,0.102703,NOK,2019-07-30
GBP,0.094131,NOK,2019-07-30


<div class="alert alert-info">
<h3> Your turn</h3>
<p> Conduct your own search of the API. Change the base rate to Euros. Store the results in a csv file.

</div>




In [27]:
r = requests.get('https://api.exchangeratesapi.io/latest?base=EUR')

In [28]:
exchange_df = pd.DataFrame(r.json())

In [29]:
exchange_df.to_csv('exchange.csv')

In [30]:
base_url = 'https://api.exchangeratesapi.io/latest'

In [31]:
parameters = {'base' : 'USD'}

In [32]:
r = requests.get(base_url, 
                 params = parameters)

In [33]:
r.url

'https://api.exchangeratesapi.io/latest?base=USD'
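
The keys and values of the dictionary end up in the query string after the `?`. As a rough offline illustration (this is not a claim about how `requests` is implemented internally), the standard library's `urllib.parse.urlencode` produces the same string for this simple case:

```python
from urllib.parse import urlencode

base_url = 'https://api.exchangeratesapi.io/latest'
parameters = {'base': 'USD'}

# Join the base URL and the encoded parameters by hand
url = base_url + '?' + urlencode(parameters)
print(url)
```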

In [34]:
pd.DataFrame(r.json())

Unnamed: 0,rates,base,date
AUD,1.451408,USD,2019-07-30
BGN,1.753452,USD,2019-07-30
BRL,3.786713,USD,2019-07-30
CAD,1.317106,USD,2019-07-30
CHF,0.9896,USD,2019-07-30
CNY,6.881926,USD,2019-07-30
CZK,22.996235,USD,2019-07-30
DKK,6.694101,USD,2019-07-30
EUR,0.896539,USD,2019-07-30
GBP,0.821705,USD,2019-07-30


<div class="alert alert-info">
<h3> Your turn</h3>
<p>What is the current exchange rate using the Japanese yen (JPY) as the base rate? Use it as a parameter.
</div>

In [36]:
base_url = 'https://api.exchangeratesapi.io/latest'
parameters = {'base' : 'JPY'}
r = requests.get(base_url, params = parameters)
pd.DataFrame(r.json())


Unnamed: 0,rates,base,date
AUD,0.013379,JPY,2019-07-30
BGN,0.016164,JPY,2019-07-30
BRL,0.034907,JPY,2019-07-30
CAD,0.012141,JPY,2019-07-30
CHF,0.009122,JPY,2019-07-30
CNY,0.063439,JPY,2019-07-30
CZK,0.211983,JPY,2019-07-30
DKK,0.061707,JPY,2019-07-30
EUR,0.008264,JPY,2019-07-30
GBP,0.007575,JPY,2019-07-30


### Bringing functions back in.

In [37]:
base_url = 'https://api.exchangeratesapi.io/latest'
parameters = {'base': 'USD'}
r = requests.get(base_url, params = parameters)
df = pd.DataFrame(r.json())

Spot the difference...

In [38]:
currency = 'USD'
base_url = 'https://api.exchangeratesapi.io/latest'
parameters = {'base': currency}
r = requests.get(base_url, params = parameters)
df = pd.DataFrame(r.json())

![](https://raw.githubusercontent.com/nealcaren/UiOBigData/master/notebooks/images/function.png)

In [39]:
def get_exchange(currency):
    base_url = 'https://api.exchangeratesapi.io/latest'
    parameters = {'base': currency}

    r = requests.get(base_url, params = parameters)

    df = pd.DataFrame(r.json())
    return df

def get_exchange_csv(currency):
    base_url = 'https://api.exchangeratesapi.io/latest'
    parameters = {'base': currency}

    r = requests.get(base_url, params = parameters)

    df = pd.DataFrame(r.json())
    df.to_csv(currency + '.csv')
    
    return df


In [40]:
get_exchange('USD')

Unnamed: 0,rates,base,date
AUD,1.451408,USD,2019-07-30
BGN,1.753452,USD,2019-07-30
BRL,3.786713,USD,2019-07-30
CAD,1.317106,USD,2019-07-30
CHF,0.9896,USD,2019-07-30
CNY,6.881926,USD,2019-07-30
CZK,22.996235,USD,2019-07-30
DKK,6.694101,USD,2019-07-30
EUR,0.896539,USD,2019-07-30
GBP,0.821705,USD,2019-07-30


<div class="alert alert-info">
<h3> Your turn</h3>
<p>What is the current exchange rate using the Russian ruble (RUB) as the base rate? Use the function and save the results to a csv file.
</div>

In [41]:
get_exchange('RUB').to_csv('ruble.csv')

The newspaper searches above return only a small subset of the articles that are available, however. The API returns results in batches of 20, and each search so far has retrieved only the first page of results. As is often the case, I'll need to make multiple calls to the API to retrieve all the data of interest. The easiest way to do that is to define a small function for getting the article information and put that in a loop. While it isn't a requirement that you create a function for making the API call, it will make your code easier to read and debug.


Looking at the API guidelines, there is an additional parameter, `page`, that tells the API which subset of results we want. The name varies by API, but there is usually some mechanism for retrieving results beyond the initial JSON.

Before creating the loop and making multiple calls to the API, I want to make sure that the API is working the way I think it is. 

<div class="alert alert-info">
Back to the newspapers. Look at the API guidelines. How can we get the third page?
</div>


[Guidelines](https://chroniclingamerica.loc.gov/about/api/)

In [42]:
base_url   = 'http://chroniclingamerica.loc.gov/search/pages/results/'
parameters = '?andtext=slavery&format=json&page=3'

r = requests.get(base_url + parameters)
results =  r.json()


In [43]:

print(results['startIndex'])
print(results['endIndex'])

41
60


A call to the arbitrarily selected page 3 returns results 41 through 60, which is what I expected since each page has 20 items.
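
The arithmetic behind that expectation can be written down as a quick check. This is a sketch that assumes every page is full (the last page of a search may be shorter); `page_bounds` is a helper name made up for this example.

```python
import math

def page_bounds(page, items_per_page=20):
    """Return the (startIndex, endIndex) expected for a full page of results."""
    start = (page - 1) * items_per_page + 1
    end = page * items_per_page
    return start, end

print(page_bounds(3))

# Number of pages needed to cover the 515,327 hits reported for the slavery search
total_pages = math.ceil(515327 / 20)
print(total_pages)
```

A loop over every page would then run over `range(1, total_pages + 1)`, since `range` stops before its second argument.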

The parameter string is getting pretty ugly. Fortunately, `requests` accepts a dictionary where the keys are the parameter names as defined by the API and the values are the search parameters you are looking for. The same kind of request, this time searching for "lynching", can be rewritten:

In [44]:
base_url = 'http://chroniclingamerica.loc.gov/search/pages/results/'
parameters = {'andtext': 'lynching',
              'page'   : 3,
              'format'  : 'json'}

r = requests.get(base_url, params=parameters)

results =  r.json()

print(results['startIndex'], results['endIndex'])

41 60


This can be rewritten as a function:

In [45]:
def get_articles():
    '''
    Make calls to the Chronicling America API.
    '''
    
    base_url   = 'http://chroniclingamerica.loc.gov/search/pages/results/'
    parameters = {'andtext': 'lynching',
                  'page'   : 3,
                  'format' : 'json'}
    
    r = requests.get(base_url, params = parameters)
    results =  r.json()
    
    
    return pd.DataFrame(results)

In [46]:
get_articles()

results['startIndex'], results['endIndex']

(41, 60)

The advantage of writing a function, however, would be that you can pass along your own parameters, such as the search term and page number, which would make this much more useful. 

In [47]:
def get_articles(search_term, page_number):
    '''
    Make calls to the Chronicling America API.
    '''
    
    base_url = 'http://chroniclingamerica.loc.gov/search/pages/results/'
    parameters = {'andtext': search_term,
                  'page'   : page_number,
                  'format' : 'json'}
    
    r = requests.get(base_url, params = parameters)
    results =  r.json()
    new_df = pd.DataFrame(results['items'])

    return new_df

In [48]:
results = get_articles('lynching', 2)

results.head()

Unnamed: 0,alt_title,batch,city,country,county,date,edition,edition_label,end_year,frequency,id,language,lccn,note,ocr_eng,page,place,place_of_publication,publisher,section_label,sequence,start_year,state,subject,title,title_normal,type,url
0,[],kyu_albatross_ver01,[Berea],Kentucky,[Madison],19221129,,,1958,Weekly,/lccn/sn85052076/1922-11-29/ed-1/seq-5/,[English],sn85052076,[Archived issues are available in digital form...,"November so, im\nTBS CITIZEN\nTHE CITIZEN\nA s...",Page Five,[Kentucky--Madison--Berea],"Berea, Ky.",T.G. Pasco,,5,1899,[Kentucky],"[Berea (Ky.)--Newspapers., Kentucky--Berea.--f...",The citizen.,citizen.,page,https://chroniclingamerica.loc.gov/lccn/sn8505...
1,[],lu_iceman_ver01,[Lafayette],Louisiana,[Lafayette],19030822,,,1921,Daily (except Sun.),/lccn/sn88064111/1903-08-22/ed-1/seq-2/,[English],sn88064111,"[""Official journal of the parish"", Sept. 15, ...",W tGAZETTE.\nBY HOMER MOUTON.\n.' 5Sea..wsat o...,,[Louisiana--Lafayette--Lafayette],"Lafayette, La.",Chas. A. Thomas and Homer J. Mouton,,2,1893,[Louisiana],"[Lafayette (La.)--Newspapers., Lafayette Paris...",The Lafayette gazette.,lafayette gazette.,page,https://chroniclingamerica.loc.gov/lccn/sn8806...
2,"[Daily Bridgeport farmer, Daily Republican Far...",ct_goshen_ver01,[Bridgeport],Connecticut,[Fairfield],19160517,,,1917,Daily (except Sun.),/lccn/sn84022472/1916-05-17/ed-1/seq-6/,[English],sn84022472,[Also issued on microfilm from Connecticut Sta...,"f\nTHE FABMER: MAY 17, 1916\nBRJDGEPOR TE VENI...",6,[Connecticut--Fairfield--Bridgeport],"Bridgeport, Conn.","Pomeroy, Gould & Co.",,6,1866,[Connecticut],"[Bridgeport (Conn.)--Newspapers., Connecticut-...",The Bridgeport evening farmer.,bridgeport evening farmer.,page,https://chroniclingamerica.loc.gov/lccn/sn8402...
3,[],iune_golf_ver01,[Chicago],Illinois,[Cook County],19150227,,LAST EDITION,1917,Daily (except Sunday and holidays),/lccn/sn83045487/1915-02-27/ed-1/seq-3/,[English],sn83045487,"[""An adless daily newspaper."", Archived issues...",wmmmmzmmmmwmm\n;POLICE FEAR LYNCHING Of AURORA...,,[Illinois--Cook County--Chicago],"Chicago, Ill.",N.D. Cochran,,3,1911,[Illinois],"[Chicago (Ill.)--Newspapers., Illinois--Chicag...",The day book. [volume],day book.,page,https://chroniclingamerica.loc.gov/lccn/sn8304...
4,[],wa_elm_ver01,[Seattle],Washington,[King],19191206,,,1921,Weekly,/lccn/sn87093353/1919-12-06/ed-1/seq-3/,[English],sn87093353,"[""A publication of general information, but in...",ARKANSAS METHODIST A LIAR\nThere is not much c...,,[Washington--King--Seattle],"Seattle, Wash.",H.R. Cayton,,3,1916,[Washington],[African Americans--Washington (State)--Seattl...,Cayton's weekly.,cayton's weekly.,page,https://chroniclingamerica.loc.gov/lccn/sn8709...


In [49]:
results = get_articles('cows', 45)

results.head()

Unnamed: 0,alt_title,batch,city,country,county,date,edition,edition_label,end_year,frequency,id,language,lccn,note,ocr_eng,page,place,place_of_publication,publisher,section_label,sequence,start_year,state,subject,title,title_normal,type,url
0,[],msar_chert_ver01,[Starkville],Mississippi,[Oktibbeha],19060915,,,1909,Weekly,/lccn/sn87065613/1906-09-15/ed-1/seq-6/,[English],sn87065613,"[""A weekly journal for farmers and stock raise...",^t**********************e<t\n$ THE DAIRY S\n§ ...,6.0,[Mississippi--Oktibbeha--Starkville],"Starkville, Miss.",Dr. Tait Butler,,6,1895,[Mississippi],"[Agriculture--Mississippi--Newspapers., Agricu...",The Southern farm gazette.,southern farm gazette.,page,https://chroniclingamerica.loc.gov/lccn/sn8706...
1,[],iahi_ferguson_ver01,"[Denison, Dow City]",Iowa,"[Crawford, Crawford]",19110301,,,9999,Weekly,/lccn/sn84038095/1911-03-01/ed-1/seq-3/,[English],sn84038095,"[<Vol. 17, no. 13 (Mar. 30, 1883)-v. 18, no. 3...","•u.\nM&M\n1:\n-. '"".\n'$• '•'.\nIf\nPORK CHOPS...",,"[Iowa--Crawford--Denison, Iowa--Crawford--Dow ...","Denison, Iowa",James D. Ainsworth,,3,1867,"[Iowa, Iowa]","[Denison (Iowa)--Newspapers., Iowa--Denison.--...",The Denison review.,denison review.,page,https://chroniclingamerica.loc.gov/lccn/sn8403...
2,"[Ranch, Ranch and range]",wa_kinnikinnick_ver01,"[Seattle, Spokane, Yakima]",Washington,"[King, Spokane, Yakima]",19020522,,,1902,Weekly,/lccn/2007252185/1902-05-22/ed-1/seq-9/,[English],2007252185,"[""In the interest of the Farmers, Horticultura...",the long run by selecting the best at\ntainabl...,9.0,"[Washington--King--Seattle, Washington--Spokan...","North Yakima, Wash.",[s.n.,,9,1897,"[Washington, Washington, Washington]","[Agriculture--Northwest, Pacific--Newspapers.,...",Ranche and range.,ranche and range.,page,https://chroniclingamerica.loc.gov/lccn/200725...
3,[],msar_chert_ver01,[Starkville],Mississippi,[Oktibbeha],19060401,,,1909,Weekly,/lccn/sn87065613/1906-04-01/ed-1/seq-7/,[English],sn87065613,"[""A weekly journal for farmers and stock raise...",( Till: DAIRY :\n^ tfS\nJ Kt i - ' rcijuc'teU ...,,[Mississippi--Oktibbeha--Starkville],"Starkville, Miss.",Dr. Tait Butler,,7,1895,[Mississippi],"[Agriculture--Mississippi--Newspapers., Agricu...",The Southern farm gazette.,southern farm gazette.,page,https://chroniclingamerica.loc.gov/lccn/sn8706...
4,[Live-stock journal],msar_emerald_ver02,[Starkville],Mississippi,[Oktibbeha],18890711,,,1891,Weekly,/lccn/sn87065614/1889-07-11/ed-1/seq-3/,[English],sn87065614,"[Also issued on microfilm from UMI., Descripti...","material costs are reduced, the bot\ntle syste...",1477.0,[Mississippi--Oktibbeha--Starkville],"Starkville, Miss.",Live-stock Journal Co.,,3,1876,[Mississippi],"[Livestock--Mississippi--Newspapers., Livestoc...",Southern live-stock journal.,southern live-stock journal.,page,https://chroniclingamerica.loc.gov/lccn/sn8706...


Back to Chronicling America. Now the first 60 results can be downloaded in just a few lines:

In [50]:
for page_number in [1, 2, 3]: 
    print(page_number)
    

1
2
3


In [51]:
for page_number in range(1, 4): 
    print(page_number)
    

1
2
3


In [52]:
for page_number in range(1, 4):
    
    results = get_articles('lynching', page_number)
    

In [53]:
len(results)

20

Everything appears to be working, but unfortunately I still only have the last page of results: each call to the API redefined the `results` variable. Instead, I set up an empty list to store the dataframe from each page of results, then combine them with `pd.concat()`.

In [54]:
list_of_dataframes = [] # empty list to store dataframes

for page_number in range(1,4):
    new_df = get_articles('lynching', page_number)
    list_of_dataframes.append(new_df) 

df = pd.concat(list_of_dataframes, ignore_index = True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 28 columns):
alt_title               60 non-null object
batch                   60 non-null object
city                    60 non-null object
country                 60 non-null object
county                  60 non-null object
date                    60 non-null object
edition                 2 non-null object
edition_label           60 non-null object
end_year                60 non-null int64
frequency               60 non-null object
id                      60 non-null object
language                60 non-null object
lccn                    60 non-null object
note                    60 non-null object
ocr_eng                 60 non-null object
page                    60 non-null object
place                   60 non-null object
place_of_publication    60 non-null object
publisher               60 non-null object
section_label           60 non-null object
sequence                60 non-null int

In [55]:
len(list_of_dataframes)

3
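
The difference between overwriting and accumulating can be seen without touching the API. In this sketch, `fake_page` is a hypothetical stand-in for `get_articles` that returns plain lists of result numbers instead of dataframes.

```python
def fake_page(page_number, items_per_page=20):
    """Hypothetical stand-in for an API call: the result numbers on one page."""
    start = (page_number - 1) * items_per_page + 1
    return list(range(start, start + items_per_page))

# Overwriting: only the last page survives the loop
for page_number in range(1, 4):
    results = fake_page(page_number)
print(len(results))

# Accumulating: keep every page, then combine them at the end
pages = []
for page_number in range(1, 4):
    pages.append(fake_page(page_number))
all_results = [item for page in pages for item in page]
print(len(all_results))
```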

For a large download, you would still want to tweak this a bit by pausing between each API call and making it robust to internet or API errors, but this is a solid framework for collecting data from an API.

In [56]:
from time import sleep

In [58]:
dfs = [] # empty list to store dataframes
for page_number in range(1,4):
    new_df = get_articles('lynching', page_number)    
    dfs.append(new_df) 
    sleep(1)
    print(page_number)
    
df = pd.concat(dfs, ignore_index = True)


1
2
3
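
The pause is handled by `sleep`; robustness to errors is not. One possible approach, sketched here under the assumption that a failed call raises an exception, is a small retry wrapper. `call_with_retries` is a name made up for this example, and in real use it would be better to catch `requests.exceptions.RequestException` rather than every `Exception`.

```python
from time import sleep

def call_with_retries(fetch, attempts=3, pause=1.0):
    """Call fetch() until it succeeds or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as error:  # broad on purpose for this sketch
            print(f'Attempt {attempt} failed: {error}')
            if attempt == attempts:
                raise  # give up and pass the last error along
            sleep(pause)  # wait before trying again
```

In the loop above, the call would become `call_with_retries(lambda: get_articles('lynching', page_number))`.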


In [60]:
df.sample(5)

Unnamed: 0,alt_title,batch,city,country,county,date,edition,edition_label,end_year,frequency,id,language,lccn,note,ocr_eng,page,place,place_of_publication,publisher,section_label,sequence,start_year,state,subject,title,title_normal,type,url
47,[],txdn_audi_ver01,"[Dallas, Houston]",Texas,"[Dallas, Harris]",19191018,,,9999,Weekly,/lccn/sn83025779/1919-10-18/ed-1/seq-4/,[English],sn83025779,[Also issued on microfilm from the Library of ...,"TIIE DALLAS EXPRESS, DALLAS TEXAS, SATURDAY, O...",PAGE FOUR,"[Texas--Dallas--Dallas, Texas--Harris--Houston]","Dallas, Tex.",W.E. King,,4,1000,"[Texas, Texas]","[African American newspapers--Texas., African ...",The Dallas express.,dallas express.,page,https://chroniclingamerica.loc.gov/lccn/sn8302...
37,"[Star, Sunday star]",dlc_dorsey_ver02,[Washington],District of Columbia,[None],19250317,,,1972,Daily,/lccn/sn83045462/1925-03-17/ed-1/seq-22/,[English],sn83045462,"[""From April 25 through May 24, 1861 one sheet...","22\n“LYNCHLESS LAND""\nIS CHURCHES’ PLEA\nSixte...",22,[District of Columbia--Washington],"Washington, D.C.",W.D. Wallach & Hope,,22,1854,[District of Columbia],"[Washington (D.C.)--fast--(OCoLC)fst01204505, ...",Evening star. [volume],evening star.,page,https://chroniclingamerica.loc.gov/lccn/sn8304...
14,[National Afro-American newspaper],mnhi_funkley_ver02,"[Chicago, Minneapolis, Saint Paul]",Minnesota,"[Cook, Hennepin, Ramsey]",19221202,,,1999,Weekly,/lccn/sn83016810/1922-12-02/ed-1/seq-2/,[English],sn83016810,[Archived issues are available in digital form...,rJS*\n$r'\nTHE APPEAL\nAN AMERICAN NEWSPAPER\n...,,"[Illinois--Cook--Chicago, Minnesota--Hennepin-...","Saint Paul, Minn. ;",Northwestern Pub. Co.,,2,1889,"[Illinois, Minnesota, Minnesota]","[African American newspapers--Illinois., Afric...",The Appeal.,appeal.,page,https://chroniclingamerica.loc.gov/lccn/sn8301...
42,[],wa_elm_ver01,[Seattle],Washington,[King],19201009,,,1921,Weekly,/lccn/sn87093353/1920-10-09/ed-1/seq-2/,[English],sn87093353,"[""A publication of general information, but in...",or more white citizens hereabouts. In the\nSou...,,[Washington--King--Seattle],"Seattle, Wash.",H.R. Cayton,,2,1916,[Washington],[African Americans--Washington (State)--Seattl...,Cayton's weekly.,cayton's weekly.,page,https://chroniclingamerica.loc.gov/lccn/sn8709...
5,"[Star, Sunday star]",dlc_2nevelson_ver01,[Washington],District of Columbia,[None],19520210,,,1972,Daily,/lccn/sn83045462/1952-02-10/ed-1/seq-123/,[English],sn83045462,"[""From April 25 through May 24, 1861 one sheet...",'Z£Hit/&tAa£-iy*X&vn4tfuma£Jfy£AeHt6 «\nJAMES ...,19,[District of Columbia--Washington],"Washington, D.C.",W.D. Wallach & Hope,,123,1854,[District of Columbia],"[Washington (D.C.)--fast--(OCoLC)fst01204505, ...",Evening star. [volume],evening star.,page,https://chroniclingamerica.loc.gov/lccn/sn8304...


In [62]:
def get_search_term(search_term, pages):
    dfs = [] # empty list to store dataframes
    for page_number in range(1, pages + 1):
        new_df = get_articles(search_term, page_number)    
        new_df['page_number'] = page_number
        dfs.append(new_df) 
        sleep(1)
        print(page_number)

    df = pd.concat(dfs, ignore_index = True)
    df['search_term'] = search_term
    
    return df

abolition_df = get_search_term('abolition', 5)

1
2
3
4
5


In [63]:
abolition_df.sample(5)

Unnamed: 0,alt_title,batch,city,country,county,date,edition,edition_label,end_year,frequency,id,language,lccn,note,ocr_eng,page,place,place_of_publication,publisher,section_label,sequence,start_year,state,subject,title,title_normal,type,url,page_number,search_term
66,[],iune_india_ver01,[Chicago],Illinois,[Cook County],19161215,,LAST EDITION,1917,Daily (except Sunday and holidays),/lccn/sn83045487/1916-12-15/ed-1/seq-11/,[English],sn83045487,"[""An adless daily newspaper."", Archived issues...",tiojial commission and the abolition\nof the d...,,[Illinois--Cook County--Chicago],"Chicago, Ill.",N.D. Cochran,,11,1911,[Illinois],"[Chicago (Ill.)--Newspapers., Illinois--Chicag...",The day book. [volume],day book.,page,https://chroniclingamerica.loc.gov/lccn/sn8304...,4,abolition
85,"[Voice, Voice of freedom]",vtu_green_ver02,"[Brandon, Montpelier]",Vermont,"[Rutland, Washington]",18390615,,,1848,Weekly,/lccn/sn84022687/1839-06-15/ed-1/seq-1/,[English],sn84022687,"[""Published under the sanction of the Vermont ...","THE\nVOICE OF FffiEEBOM.\nALLEN POLAND, Publis...",,"[Vermont--Rutland--Brandon, Vermont--Washingto...","Montpelier, Vt.",,,1,1839,"[Vermont, Vermont]","[Antislavery movements--Vermont--Newspapers., ...",The voice of freedom. [volume],voice of freedom.,page,https://chroniclingamerica.loc.gov/lccn/sn8402...,5,abolition
71,"[Convention clarion, Free press]",idhi_doyle_ver01,[Grangeville],Idaho,[Idaho],19091202,,,9999,Weekly,/lccn/sn86091100/1909-12-02/ed-1/seq-2/,[English],sn86091100,[Archived issues are available in digital form...,r\nOur Next Great Serial Story\nTHE\nj\nk\n(CO...,,[Idaho--Idaho--Grangeville],"Grangeville, Idaho Territory",Free Press Pub. Co.,,2,1886,[Idaho],"[Grangeville (Idaho)--Newspapers., Idaho Count...",Idaho County free press. [volume],idaho county free press.,page,https://chroniclingamerica.loc.gov/lccn/sn8609...,4,abolition
18,[],iune_golf_ver01,[Chicago],Illinois,[Cook County],19140710,,NOON EDITION,1917,Daily (except Sunday and holidays),/lccn/sn83045487/1914-07-10/ed-1/seq-14/,[English],sn83045487,"[""An adless daily newspaper."", Archived issues...","jy iuwipf mfwiw\nJUST VOTES, NO ""WILD OATS,"" F...",,[Illinois--Cook County--Chicago],"Chicago, Ill.",N.D. Cochran,,14,1911,[Illinois],"[Chicago (Ill.)--Newspapers., Illinois--Chicag...",The day book. [volume],day book.,page,https://chroniclingamerica.loc.gov/lccn/sn8304...,1,abolition
99,[],curiv_iris_ver01,[Oroville],California,[Butte],18630801,,,1864,Weekly,/lccn/sn86058108/1863-08-01/ed-1/seq-2/,[English],sn86058108,"[Description based on: Vol. 5, no. 45 (Sept. 1...",MtcffilD^’iUtffßfCDrt\nGEO. H. CROSETTE. Edito...,,[California--Butte--Oroville],Oroville [Calif.],Geo. H. Crosette,,2,1858,[California],"[Butte County (Calif.)--Newspapers., Californi...",The weekly Butte record. [volume],weekly butte record.,page,https://chroniclingamerica.loc.gov/lccn/sn8605...,5,abolition


<div class="alert alert-info">
<p>The Guardian newspaper has a search API (link below). Sign up for a key. If it doesn't come in time, you can use mine.
<p>Write a function, similar to the LoC one, for searching for specific content. Have the function accept a query and a page number as parameters. 
    
<p>Use your function to create a dataframe of articles on Brexit.
    
</div>

[API Manual](https://open-platform.theguardian.com/documentation/)
[Search Manual](https://open-platform.theguardian.com/documentation/search)

API key: 283a6176-f92c-46ec-91ce-29efdb59ffab

In [64]:
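def search_guard_page(search_term, page_number):
    """Return one page of Guardian search results as a dataframe."""
    base_url   = 'https://content.guardianapis.com/search'

    # Query parameters; requests adds them to the URL for us
    parameters = {'api-key'   : '283a6176-f92c-46ec-91ce-29efdb59ffab',
                  'format'    : 'json',
                  'q'         : search_term,
                  'page-size' : 50,
                  'page'      : page_number
                 }

    r = requests.get(base_url, params = parameters)

    # The articles are stored under response -> results in the JSON
    return pd.DataFrame(r.json()['response']['results'])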

search_guard_page('Brexit', 3)


Unnamed: 0,apiUrl,id,isHosted,pillarId,pillarName,sectionId,sectionName,type,webPublicationDate,webTitle,webUrl
0,https://content.guardianapis.com/politics/2019...,politics/2019/may/22/andrea-leadsom-quits-over...,False,pillar/news,News,politics,Politics,article,2019-05-22T18:42:09Z,Andrea Leadsom quits over Theresa May's Brexit...,https://www.theguardian.com/politics/2019/may/...
1,https://content.guardianapis.com/technology/20...,technology/2019/jun/24/nick-clegg-facebook-bre...,False,pillar/news,News,technology,Technology,article,2019-06-24T09:50:24Z,Nick Clegg denies misuse of Facebook influence...,https://www.theguardian.com/technology/2019/ju...
2,https://content.guardianapis.com/politics/2019...,politics/2019/jun/18/brexit-weekly-briefing-ab...,False,pillar/news,News,politics,Politics,article,2019-06-18T06:00:47Z,Brexit weekly briefing: absent Boris Johnson h...,https://www.theguardian.com/politics/2019/jun/...
3,https://content.guardianapis.com/politics/2019...,politics/2019/may/21/leadsom-gives-may-ultimat...,False,pillar/news,News,politics,Politics,article,2019-05-21T08:19:23Z,Leadsom gives May ultimatum over Brexit bill s...,https://www.theguardian.com/politics/2019/may/...
4,https://content.guardianapis.com/politics/2019...,politics/2019/jul/12/greg-clark-no-deal-brexit...,False,pillar/news,News,politics,Politics,article,2019-07-12T08:54:08Z,Greg Clark: no-deal Brexit would destroy 'thou...,https://www.theguardian.com/politics/2019/jul/...
5,https://content.guardianapis.com/politics/2019...,politics/2019/mar/23/corbyns-cabinet-set-for-a...,False,pillar/news,News,politics,Politics,article,2019-03-23T20:58:00Z,Corbyn’s team split over soft Brexit,https://www.theguardian.com/politics/2019/mar/...
6,https://content.guardianapis.com/business/2019...,business/2019/jul/10/uk-economy-returns-to-gro...,False,pillar/news,News,business,Business,article,2019-07-10T18:13:29Z,UK economy returns to growth as carmakers end ...,https://www.theguardian.com/business/2019/jul/...
7,https://content.guardianapis.com/politics/2019...,politics/2019/jul/10/social-justice-not-brexit...,False,pillar/news,News,politics,Politics,article,2019-07-10T12:23:54Z,"Social justice, not Brexit – Theresa May races...",https://www.theguardian.com/politics/2019/jul/...
8,https://content.guardianapis.com/world/2019/ju...,world/2019/jul/09/irish-ministers-meet-to-disc...,False,pillar/news,News,world,World news,article,2019-07-09T17:30:01Z,No-deal Brexit a political and economic threat...,https://www.theguardian.com/world/2019/jul/09/...
9,https://content.guardianapis.com/business/2019...,business/2019/mar/21/shoppers-increase-spendin...,False,pillar/news,News,business,Business,article,2019-03-21T11:12:22Z,Shoppers increase spending despite Brexit unce...,https://www.theguardian.com/business/2019/mar/...
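
Unlike the Library of Congress example, where the query string was pasted onto the base URL by hand, this function passes the parameters as a dictionary and lets `requests` encode them. To see roughly what URL that produces, the standard library's `urlencode` can build it without making a request. This is an illustrative sketch: `'YOUR-API-KEY'` is a placeholder, not a real key.

```python
from urllib.parse import urlencode

base_url = 'https://content.guardianapis.com/search'

parameters = {'api-key'   : 'YOUR-API-KEY',  # placeholder, not a real key
              'format'    : 'json',
              'q'         : 'Brexit',
              'page-size' : 50,
              'page'      : 3
             }

# Turn the dictionary into "key=value" pairs joined by "&"
query_string = urlencode(parameters)
full_url = base_url + '?' + query_string
print(full_url)
```

After a real call, `r.url` shows the URL that `requests` actually used, which is handy for checking a query in the browser.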


In [73]:
def search_guard_page(search_term, page_number):
    base_url   = 'https://content.guardianapis.com/search'

    parameters = {'api-key' : '283a6176-f92c-46ec-91ce-29efdb59ffab',
                  'format' :  'json',
                  'q'      :    search_term,
                  'page-size' : 50,
                  'page'      : page_number
                 }


    r = requests.get(base_url, params = parameters)
    return pd.DataFrame(r.json()['response']['results'])



def get_page_count(search_term):
    base_url   = 'https://content.guardianapis.com/search'

    parameters = {'api-key'   : '283a6176-f92c-46ec-91ce-29efdb59ffab',
                  'format'    : 'json',
                  'page-size' : 50,
                  'q'         : search_term,
                 }

    r = requests.get(base_url, params = parameters)
    return r.json()['response']['pages']

   
    
def search_guardian(search_term, max_pages = 5):
        
    # Figure out number of pages. 
    number_of_pages = get_page_count(search_term)

    # Don't go all the way to the total if fewer are requested
    pages_to_get = min(max_pages, number_of_pages)
    print('Getting',pages_to_get,'of',number_of_pages,'pages.')
    
    # Empty list to store results
    dfs = []
    
    # grab each page of results, storing in dfs
    for page_number in range(1, pages_to_get + 1):
        new_df = search_guard_page(search_term, page_number)
        new_df['page_number'] = page_number
        dfs.append(new_df)
        sleep(1)
        print('Retrieved',page_number,'of',pages_to_get,'pages.')


    
    # Combine into one dataframe
    df = pd.concat(dfs, ignore_index = True)
    df['search_term'] = search_term
        
    return df


        
    

brexit_df = search_guardian('brexit', max_pages = 10)
brexit_df.to_csv('brexit.csv')

Getting 10 of 681 pages.
Retrieved 1 of 10 pages.
Retrieved 2 of 10 pages.
Retrieved 3 of 10 pages.
Retrieved 4 of 10 pages.
Retrieved 5 of 10 pages.
Retrieved 6 of 10 pages.
Retrieved 7 of 10 pages.
Retrieved 8 of 10 pages.
Retrieved 9 of 10 pages.
Retrieved 10 of 10 pages.
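
The page planning in `search_guardian` can be tried without calling the API by working on a dictionary shaped like the parsed JSON. The sketch below is only illustrative: `plan_pages` is a hypothetical helper, and the payloads are hand-made stand-ins that contain just the `response`, `pages` and `results` keys used above.

```python
def plan_pages(payload, max_pages=5):
    """Return the list of page numbers to request, capped at max_pages."""
    number_of_pages = payload['response']['pages']
    pages_to_get = min(max_pages, number_of_pages)
    return list(range(1, pages_to_get + 1))


# Hand-made stand-ins for r.json(); real responses carry more fields.
large_payload = {'response': {'pages': 681, 'results': []}}
small_payload = {'response': {'pages': 2, 'results': []}}

print(plan_pages(large_payload, max_pages=10))  # capped by max_pages
print(plan_pages(small_payload, max_pages=10))  # capped by available pages
```

The second call shows why the cap matters: asking for ten pages when only two exist should request two pages, not ten.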
