# Chronicling America API Assignment
In this assignment, you are tasked with:
* searching Chronicling America's API for a keyword of your choice
* parsing your results from a dictionary to a `DataFrame` with headings "title", "city", "date", and "raw_text"
* processing the raw text by removing "\n" characters and stopwords, then lemmatizing the text and adding it to a new column called "lemmas"
* saving your `DataFrame` to a CSV file

If you need any help with this assignment, please email micah.saxton@tufts.edu.


In [1]:
# imports
import pandas as pd
import requests
import json

In [2]:
# initial search
url = 'https://chroniclingamerica.loc.gov/search/pages/results/?state=&date1=1945&date2=1963&proxtext=socialism&x=16&y=8&dateFilterType=yearRange&rows=20&searchType=basic&format=json'
response = requests.get(url)
raw = response.text
results = json.loads(raw)
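
The long query string in the search URL can also be assembled from a dictionary, which makes each parameter easier to read and change. The following is a minimal sketch using only the standard library; `BASE_URL` and `build_search_url` are hypothetical names introduced here, and the parameters are copied from the URL above, plus the `page` parameter used later in this notebook.

```python
from urllib.parse import urlencode

# Hypothetical constant: the search endpoint taken from the URL above
BASE_URL = "https://chroniclingamerica.loc.gov/search/pages/results/"

def build_search_url(term, start_year, end_year, rows=20, page=1):
    """Assemble a search URL from the same parameters used in this notebook."""
    params = {
        "state": "",
        "date1": start_year,
        "date2": end_year,
        "proxtext": term,
        "x": 16,
        "y": 8,
        "dateFilterType": "yearRange",
        "rows": rows,
        "searchType": "basic",
        "format": "json",
        "page": page,
    }
    # urlencode joins the pairs with "&" and escapes any special characters
    return f"{BASE_URL}?{urlencode(params)}"

url = build_search_url("socialism", 1945, 1963)
print(url)
```

A search term containing spaces or punctuation is escaped automatically by `urlencode`, which a hand-written URL string would not do.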

In [3]:
results.keys()

dict_keys(['totalItems', 'endIndex', 'startIndex', 'itemsPerPage', 'items'])

In [4]:
print('totalItems:', results['totalItems'])


totalItems: 209609


In [5]:
print('endIndex:', results['endIndex'])


endIndex: 20


In [6]:
print('Length and type of items:', len(results['items']), type(results['items']))

Length and type of items: 20 <class 'list'>


In [11]:
# find the total number of pages
import math
total_pages = math.ceil(results['totalItems'] / results['itemsPerPage'])
print(total_pages)

10481
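
The page count above is a ceiling division: the last page holds whatever is left after the full pages. As a rough sketch, the same arithmetic can be wrapped in two small helpers. `page_count` and `page_bounds` are hypothetical names, and the item positions are assumed to be 1-based, which the `endIndex` of 20 on the first page suggests but does not confirm.

```python
import math

def page_count(total_items, items_per_page):
    """Number of pages needed to hold every item."""
    return math.ceil(total_items / items_per_page)

def page_bounds(page, items_per_page, total_items):
    """First and last item position on a page (assumes 1-based positions)."""
    start = (page - 1) * items_per_page + 1
    end = min(page * items_per_page, total_items)
    return start, end

pages = page_count(209609, 20)
print(pages)
print(page_bounds(1, 20, 209609))
print(page_bounds(pages, 20, 209609))
```

With these totals, the final page holds only nine items, so a loop over every page should not assume 20 items per response.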


In [14]:
# query the first 10 pages of the api and collect each item as a dict in a list
data = []
start_date = '1945'
end_date = '1963'
search_term = 'socialism'

for i in range(1,11):
    url = (f'https://chroniclingamerica.loc.gov/search/pages/results/?state=&date1={start_date}'
           f'&date2={end_date}&proxtext={search_term}&x=16&y=8&dateFilterType=yearRange&rows=20'
           f'&searchType=basic&format=json&page={i}')
    response = requests.get(url)
    raw = response.text
    print(response.status_code)
    results = json.loads(raw)
    items_ = results['items']
    for item_ in items_:
        temp_dict = {}
        temp_dict['title'] = item_['title_normal']
        temp_dict['city'] = item_['city']
        temp_dict['date'] = item_['date']
        temp_dict['raw_text'] = item_['ocr_eng']
        data.append(temp_dict)

200
200
200
200
200
200
200
200
200
200
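
The loop above indexes each item with square brackets, so a single item without one of the four keys would raise a `KeyError` and stop the run. Whether the API ever omits a key is not confirmed here; as a hedged sketch, a small helper can map the API keys to the column names and fill any gap with `None`. `FIELD_MAP`, `extract_record`, and `sample_item` are hypothetical names, and the sample item is made up for illustration.

```python
# Hypothetical mapping: DataFrame column name -> key in each API item
FIELD_MAP = {
    "title": "title_normal",
    "city": "city",
    "date": "date",
    "raw_text": "ocr_eng",
}

def extract_record(item):
    """Pull the four fields of interest from one item; missing keys become None."""
    return {column: item.get(key) for column, key in FIELD_MAP.items()}

# Made-up item that lacks the OCR text, to show the fallback
sample_item = {"title_normal": "evening star.", "city": ["Washington"], "date": "19580921"}
record = extract_record(sample_item)
print(record)
```

Inside the loop, `data.append(extract_record(item_))` would replace the five lines that build `temp_dict`.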


In [15]:
# convert list of dicts to dataframe
df = pd.DataFrame(data)

In [16]:
df.head()

| | title | city | date | raw_text |
|---|---|---|---|---|
| 0 | evening star. | [Washington] | 19580921 | >; sy '\n■ •■ / ;, • .\nWill you leave these f... |
| 1 | valley settler. | [Palmer] | 19460801 | Storage\nCACHE YOUR\nFOOD\nTHE\nMODERN WAY\n' ... |
| 2 | automotive news. | [Detroit] | 19451022 | Itai/ck section\nWash. Idaho\nAgreement\nOn Re... |
| 3 | valley settler. | [Palmer] | 19460228 | PALMER, ALASKA Dorothy Boll Editor St Publishe... |
| 4 | chicago star. | [Chicago] | 19480724 | * — 1\nThe Chicago\nVol. 3. No. 30\nPublished\... |


In [17]:
# convert date column from string (YYYYMMDD) to date-time object
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
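
The `date` values shown in the table above are eight-digit strings such as `19580921`, which read as year, month, day. As a small illustration using only the standard library (the variable names are made up for this sketch), the format code `%Y%m%d` parses them explicitly. The same format string can be passed to `pd.to_datetime` through its `format` argument.

```python
from datetime import datetime

# Values copied from the first rows of the table above
raw_dates = ["19580921", "19460801", "19451022"]

# %Y = four-digit year, %m = two-digit month, %d = two-digit day
parsed = [datetime.strptime(value, "%Y%m%d") for value in raw_dates]

for value, moment in zip(raw_dates, parsed):
    print(value, "->", moment.date().isoformat())
```

Stating the format avoids relying on automatic inference, which could misread an all-digit string.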

In [19]:
# sort by date
sorted_df = df.sort_values(by = 'date')

In [23]:
# write function to process text
import spacy

# load the small English model without the components we don't need
nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])

# words to exclude from the lemmas
junk_words = ['hesse']

def process_text(text):
    """Return lowercase lemmas as one string, without stopwords, non-alphabetic tokens, or junk words."""
    text = text.replace('\n', ' ')
    doc = nlp(text)
    # keep alphabetic tokens that are not stopwords
    tokens = [token for token in doc if not token.is_stop and token.is_alpha]
    lemmas_lower = [token.lemma_.lower() for token in tokens]
    # drop any lemma listed in junk_words
    lemmas_lower = [lemma for lemma in lemmas_lower if lemma not in junk_words]
    lemmas_string = ' '.join(lemmas_lower)
    return lemmas_string
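
To see what the filtering steps remove, here is a deliberately simplified stand-in that runs without spaCy. It is only a sketch: it splits on whitespace instead of using spaCy's tokenizer, uses a tiny made-up stopword set (`TOY_STOPWORDS`), and does not lemmatize, so its output will differ from `process_text`. The function name `rough_clean` is hypothetical.

```python
# Made-up, very small stopword set for illustration only
TOY_STOPWORDS = {"the", "a", "of", "to", "and", "in"}

def rough_clean(text, stopwords=TOY_STOPWORDS):
    """Replace newlines, then keep lowercase alphabetic words that are not stopwords."""
    text = text.replace("\n", " ")
    tokens = text.split()  # naive whitespace tokenization, unlike spaCy
    kept = [t.lower() for t in tokens if t.isalpha() and t.lower() not in stopwords]
    return " ".join(kept)

cleaned = rough_clean("The Chicago\nVol. 3. No. 30\nPublished in the city")
print(cleaned)
```

Tokens such as `Vol.` and `30` are dropped by the alphabetic check, which is the same role `token.is_alpha` plays in the spaCy version.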

In [24]:
# apply process_text function to raw_text column
sorted_df['lemmas'] = sorted_df['raw_text'].apply(process_text)
sorted_df.head()

| | title | city | date | raw_text | lemmas |
|---|---|---|---|---|---|
| 117 | coolidge examiner. | [Coolidge] | 1945-03-30 | Page Six\nCooli\nPUBLISHED EVERY FRIDAY MORNIN... | page cooli published friday morning enter seco... |
| 40 | proletarec. | [Chicago] | 1945-05-30 | A Yugoslav Weekly Devoted to the\nInterest of ... | yugoslav weekly devoted interest workers offic... |
| 195 | detroit evening times. | [Detroit] | 1945-06-12 | End the Pacific War VICTORIOUSLY. Make the Pea... | end pacific war victoriously peace permanently... |
| 49 | detroit evening times. | [Detroit] | 1945-07-04 | End the Pacific War VICTORIOUSLY. Make the Pea... | end pacific war victoriously peace permanently... |
| 76 | detroit evening times. | [Detroit] | 1945-07-06 | End the Pacific War VICTORIOUSLY. Make the Pea... | end pacific war victoriously peace permanently... |


In [27]:
# save to csv
sorted_df.to_csv(f'{search_term}{start_date}-{end_date}.csv', index = False)
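
One thing to watch when reading the file back: assuming the `city` column holds Python lists, as the `[Washington]` display suggests, a CSV file stores each list as its text representation. A minimal sketch of turning that text back into a list with the standard library follows; the variable names are made up for illustration.

```python
import ast

# How a one-element list looks once it has been written to CSV as text
stored = "['Washington']"

# literal_eval safely parses the text back into a Python list
restored = ast.literal_eval(stored)
print(restored[0])
```

After `pd.read_csv`, the same conversion could be applied to the whole column with `.apply(ast.literal_eval)`.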

Please see GitHub to view the CSV file output!