# Chronicling America API

[Chronicling America](https://chroniclingamerica.loc.gov/) is a collection of digitized American newspapers dating from 1777 to 1963, provided by the Library of Congress. The collection offers an application programming interface (API) that allows users to harvest large amounts of data.

In this notebook we will search Chronicling America's API, gather the search results into a Pandas dataframe, clean the data, and save it as a CSV file.

In [None]:
# imports
import requests
import json
import math
import pandas as pd
import spacy

## Chronicling America URLs

If I search for a term, "socialism" for example, on https://chroniclingamerica.loc.gov/, I will get a results URL that looks like this:

https://chroniclingamerica.loc.gov/search/pages/results/?state=&date1=1777&date2=1963&proxtext=socialism&x=16&y=8&dateFilterType=yearRange&rows=20&searchType=basic

These search results are human actionable, but not machine actionable. Chronicling America has an API that allows me to get machine-actionable results if I add `&format=json`:

https://chroniclingamerica.loc.gov/search/pages/results/?state=&date1=1777&date2=1963&proxtext=socialism&x=16&y=8&dateFilterType=yearRange&rows=20&searchType=basic&format=json

If we examine the URL, we see that there are a number of search parameters, including:
* `state=`
* `date1=1777`
* `date2=1963`
* `proxtext=socialism`

We can edit these values to modify our search. Here I will limit my date range to 1945-1963:

https://chroniclingamerica.loc.gov/search/pages/results/?state=&date1=1945&date2=1963&proxtext=socialism&x=16&y=8&dateFilterType=yearRange&rows=20&searchType=basic&format=json
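
Rather than editing the URL by hand, the query string can also be assembled from a dictionary of parameters. The following is a minimal sketch using `urlencode` from the standard library. The helper name `build_search_url` is hypothetical, and the parameter names and values are simply copied from the URL above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://chroniclingamerica.loc.gov/search/pages/results/'

def build_search_url(search_term, start_date, end_date):
    """Assemble a search URL from its parameters (hypothetical helper)."""
    # Parameter names and fixed values are copied from the example URL
    params = {
        'state': '',
        'date1': start_date,
        'date2': end_date,
        'proxtext': search_term,
        'x': 16,
        'y': 8,
        'dateFilterType': 'yearRange',
        'rows': 20,
        'searchType': 'basic',
        'format': 'json',
    }
    # urlencode joins the pairs with '&' and escapes special characters
    return BASE_URL + '?' + urlencode(params)

url = build_search_url('socialism', 1945, 1963)
print(url)
```

Building the URL this way also escapes search terms that contain spaces or punctuation, which a hand-edited string would not do.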



Now I can use the `requests` library to retrieve data from the URL.

In [None]:
# initial search
url = 'https://chroniclingamerica.loc.gov/search/pages/results/?state=&date1=1945&date2=1963&proxtext=socialism&x=16&y=8&dateFilterType=yearRange&rows=20&searchType=basic&format=json'
response = requests.get(url)  # returns a Response object representing the server's reply
raw = response.text  # the text attribute contains the body of the response as a string
results = json.loads(raw)  # the loads function from the json library parses the JSON string into a dict
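
To see what `json.loads` does without calling the API, here is a minimal sketch that parses a small hand-written JSON string. The values are invented for illustration and are not real API output; only the key names echo ones that appear in this notebook.

```python
import json

# Hand-written sample, NOT real API output; values are made up for illustration
raw_sample = '{"totalItems": 3, "itemsPerPage": 20, "items": [{"title_normal": "example gazette"}]}'

parsed = json.loads(raw_sample)  # JSON object -> dict, JSON array -> list
print(type(parsed))
print(parsed['totalItems'])
print(type(parsed['items']))
```

The `requests` library also offers `response.json()`, which performs the same parsing in one step.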

## Explore the results

In [None]:
results.keys()

In [None]:
print('totalItems:', results['totalItems'])
print('endIndex:', results['endIndex'])
print('startIndex:', results['startIndex'])
print('itemsPerPage:', results['itemsPerPage'])
print('Length and type of items:', len(results['items']), type(results['items']))

The Chronicling America API returned 209,609 results. However, it only returns 20 at a time by default. I can add a new parameter, `page=`, to cycle through all the results, but first I need to know how many pages there will be. I can find this out by dividing `totalItems` (209,609) by `itemsPerPage` (20) and then rounding up with `math.ceil`.

In [None]:
# find total number of pages
total_pages = math.ceil(results['totalItems'] / results['itemsPerPage'])
print(total_pages)
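
As a worked check of the arithmetic, the sketch below plugs in the two numbers quoted in the text. `divmod` shows why rounding up is needed: the results fill a whole number of pages and leave a partial page over.

```python
import math

total_items = 209609   # totalItems quoted in the text above
items_per_page = 20    # itemsPerPage quoted in the text above

# Whole pages and the number of items left over for a final, partial page
full_pages, leftover = divmod(total_items, items_per_page)
print(full_pages, leftover)

# Rounding up counts the partial page as well
total_pages = math.ceil(total_items / items_per_page)
print(total_pages)
```

With these numbers there are 10,480 full pages plus one final page holding 9 items, so 10,481 pages in total.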

Now that I know how many pages there will be, I can use a for loop to iterate through each result page and then each item on each result page. I then gather the data I want from each item: newspaper title, city, date, and text. 

Notice in the code below I placed the url string in parentheses () so that I could break it up over multiple lines making it easier to read.

Also, for the sake of this demonstration, I am only iterating over 10 pages. For the full results the for loop should begin: `for i in range(1, total_pages+1)` (the `+1` is necessary because the second number in the range function is exclusive).
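
The exclusive upper bound is easy to see with a small stand-in value. In this sketch `total_pages` is set to 3 purely for illustration.

```python
total_pages = 3  # small stand-in value for illustration

# Without +1 the last page is skipped, because the stop value is excluded
pages_without_plus_one = list(range(1, total_pages))
# With +1 every page from 1 through total_pages is visited
pages_with_plus_one = list(range(1, total_pages + 1))

print(pages_without_plus_one)
print(pages_with_plus_one)
```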

In [None]:
# query the api and save the results to a list of dicts
data = []
start_date = '1945'
end_date = '1963'
search_term = 'socialism'
for i in range(1, 11):  # only 10 pages for the sake of time; use total_pages+1 for the full results
    url = (f'https://chroniclingamerica.loc.gov/search/pages/results/?state=&date1={start_date}'
           f'&date2={end_date}&proxtext={search_term}&x=16&y=8&dateFilterType=yearRange&rows=20'
           f'&searchType=basic&format=json&page={i}')  # f-string
    response = requests.get(url)
    raw = response.text
    print(response.status_code)  # checking for errors
    results = json.loads(raw)
    items_ = results['items']
    for item_ in items_:
        temp_dict = {}
        temp_dict['title'] = item_['title_normal']
        temp_dict['city'] = item_['city']
        temp_dict['date'] = item_['date']
        temp_dict['raw_text'] = item_['ocr_eng']
        data.append(temp_dict)
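
Indexing with `item_['ocr_eng']` raises a `KeyError` if a result lacks that field. Whether that ever happens with this API is an assumption here, not something this notebook verifies. The sketch below shows one defensive option using `dict.get`; the helper `extract_fields` and the sample item are hypothetical.

```python
def extract_fields(item):
    """Pull the four fields of interest from one result item (hypothetical helper).

    Missing keys become None instead of raising a KeyError.
    """
    return {
        'title': item.get('title_normal'),
        'city': item.get('city'),
        'date': item.get('date'),
        'raw_text': item.get('ocr_eng'),
    }

# Hand-written sample item, NOT real API output; 'ocr_eng' is deliberately absent
sample_item = {'title_normal': 'example gazette', 'city': 'Example City', 'date': '19450101'}
record = extract_fields(sample_item)
print(record)
```

Rows with a `None` in `raw_text` can then be dropped or inspected later in the dataframe, rather than stopping a long harvest partway through.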

In [None]:
# creating a backup of the results in case we get timed out in class
#with open('../data/backup-data.json', 'w') as f:
#    json.dump(data, f)

#with open('../data/backup-data.json', 'r') as f:
#    data = json.load(f)

## Convert list of dictionaries to a Pandas dataframe
Now that we have the information we want in a list of dictionaries called `data`, we can put it into a dataframe. Each dictionary becomes one row, and its keys become the column names.

In [None]:
# convert list of dicts to dataframe
df = pd.DataFrame(data)

In [None]:
# sanity check
df.head()
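
To make the mapping from records to rows concrete, here is a minimal sketch with two hand-written records. The values are invented, and only the keys match the ones collected in `data`.

```python
import pandas as pd

# Hand-written sample records, NOT real API output
records = [
    {'title': 'paper a', 'city': 'City A', 'date': '19450102', 'raw_text': 'some text'},
    {'title': 'paper b', 'city': 'City B', 'date': '19501115', 'raw_text': 'more text'},
]

sample_df = pd.DataFrame(records)  # one row per dict, one column per key
print(sample_df.shape)
print(list(sample_df.columns))
```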

### Change date format
Pandas allows us to clean and edit our data relatively easily. We can first convert the string values in the date column to properly formatted dates and then sort the dataframe by date.

In [None]:
# convert date column from string to date-time object
df['date'] = pd.to_datetime(df['date'])

In [None]:
# sort by date
sorted_df = df.sort_values(by='date')
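
The two steps can be tried on a tiny hand-made dataframe. This sketch assumes the dates arrive as `YYYYMMDD` strings; check `df.head()` to confirm the format in your own results before passing an explicit `format` as shown here.

```python
import pandas as pd

# Hand-written sample rows; the YYYYMMDD date strings are an assumed format
sample = pd.DataFrame({
    'title': ['paper b', 'paper a', 'paper c'],
    'date': ['19501115', '19450102', '19630704'],
})

# Convert the strings to datetime values, then sort chronologically
sample['date'] = pd.to_datetime(sample['date'], format='%Y%m%d')
sample_sorted = sample.sort_values(by='date').reset_index(drop=True)
print(sample_sorted)
```

Sorting after the conversion matters in general: strings sort character by character, which only matches chronological order when every date uses the same fixed-width layout.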

### Process text
We can now process our text for analysis. The text provided by Chronicling America comes from optical character recognition (OCR), and the accuracy of OCR can be low. Here I will remove new line characters (`\n`) and stop words, and then lemmatize the text.

**Remember** the decisions you make in how to process your text should be based on the kind of analysis you want to do.

In [None]:
# write function to process text
# load nlp model; ner and parser are unnecessary for the task at hand
nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])

junk_words = ['hesse']  # you can add any words you want removed here (lowercase)

def process_text(text):
    """Remove new line characters, stop words, and junk words; lemmatize text.

    Returns a string of lowercase lemmas separated by spaces.
    """
    text = text.replace('\n', ' ')
    doc = nlp(text)
    # keep alphabetic tokens that are not stop words
    tokens = [token for token in doc if token.is_alpha and not token.is_stop]
    lemmas_lower = [token.lemma_.lower() for token in tokens]
    # drop any lemmas listed in junk_words
    kept = [lemma for lemma in lemmas_lower if lemma not in junk_words]
    return ' '.join(kept)

In [None]:
# apply process_text function
sorted_df['lemmas'] = sorted_df['raw_text'].apply(process_text)

In [None]:
sorted_df.head()
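
To make the individual filtering steps easier to see without loading a spaCy model, here is a simplified stand-in written in plain Python. It does not lemmatize, its tokenization is a naive split on whitespace, and the tiny stop-word set is invented for illustration, so its output will differ from what spaCy produces.

```python
# Tiny illustrative stop-word set, NOT spaCy's list
STOP_WORDS = {'the', 'of', 'and', 'a', 'is', 'in'}

def simple_process(text):
    """Simplified stand-in for process_text, without lemmatization (hypothetical helper)."""
    text = text.replace('\n', ' ')                        # remove new line characters
    tokens = text.lower().split()                         # naive whitespace tokenization
    tokens = [t.strip('.,;:!?"\'()') for t in tokens]     # trim surrounding punctuation
    tokens = [t for t in tokens if t.isalpha()]           # keep alphabetic tokens only
    tokens = [t for t in tokens if t not in STOP_WORDS]   # drop stop words
    return ' '.join(tokens)

cleaned = simple_process('The rise of\nSocialism in 1945, and the press.')
print(cleaned)
```

For the sample sentence this leaves `rise socialism press`: the year is dropped by the alphabetic filter and the remaining short words by the stop-word filter.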

## Save as a CSV file

In [None]:
# save to csv
sorted_df.to_csv(f'../data/{search_term}{start_date}-{end_date}.csv', index=False)
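
The save step expects the `../data` folder to exist already; as far as I know, `to_csv` does not create missing folders for you. The sketch below shows how the file name is built from the three search variables and how the folder could be created first with `pathlib`. A temporary directory stands in for `../data`, and a placeholder header line stands in for the dataframe, so the sketch runs anywhere.

```python
import tempfile
from pathlib import Path

search_term, start_date, end_date = 'socialism', '1945', '1963'

# A temporary directory stands in for '../data' so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / 'data'
    out_dir.mkdir(parents=True, exist_ok=True)   # create the folder if it is missing
    out_path = out_dir / f'{search_term}{start_date}-{end_date}.csv'
    out_path.write_text('title,city,date\n')     # placeholder content instead of to_csv
    file_name = out_path.name
    file_exists = out_path.exists()

print(file_name)
print(file_exists)
```

In the notebook itself, `out_path` could be passed directly to `sorted_df.to_csv(out_path, index=False)`.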