First I import the requests library and demonstrate how my program interacts with the "Chronicling America" API.

In [1]:
import requests #import the library that we use to make the http request
r = requests.get("http://chroniclingamerica.loc.gov/search/pages/results/?state=Arizona&format=json").json()
r['itemsPerPage'] #print out the number of items per page:

20

In [2]:
r['items'][0]['ocr_eng'] #here is an example of a newspaper page pulled down from "Chronicling America"

'THE ARIZONA REPUBLICAN, WEDNESDAY MORN1NCJ, A TRIE 22, 1008.\n3:\nWHAT m BEEN DONE\nIN THE STATE Of KANSAS\nThe Results of Two Score Years of\nProhibition.\nCI\nALL SIZES. BLACK AND GALVANIZED,\nJust received a Car Load.\nGIVE US YOUR ORDER.\n1\n121-130\nEZRA W. THAYER\nWash. St. 127-13:\'. E. Adams St.\nHIS M.IND iS CLEAR\nJOHN BSOAOBtGKS CASE\nHis Release From \'.he Asyijm Where\nHe Wrs Sent Two Months Ago to\nOn January is of tii\nMro.idbe. k. ..1.1 res;\nv. as i n;;:i -.1 to i!u-\n. tisane iillil ill ::s much\n-ntv-iiine ears of\nis y. ar John A.\n.1.-1,1 C Tempo,\nasylum for the\nas in- was wv-\ngo it was not\nthought that he wiulil recover his\nminu. The hallu. illations umlt-r which\n1:. ss said to be laboring .t\' of the\nK -,.r. -fc2ext. .--.\nfV 2 Ii EKH:S?2P3ER\nft MMlP W. l?-ilH\'r.rT.7-n.\ni,rr\'fir, - v - . r -t..-\'\nv.\'?, : \' \'Ln.-;rW.";V,il.\')..K.j\nI character which generally Indicates\n\' hopciess insanity?"\nj Knt on April y X.r. liioadbeck was\ndischarged with

Here I searched the "Chronicling America" database for pages containing my target phrase "America First" from 1900-1922.

In [3]:
af_search = requests.get('https://chroniclingamerica.loc.gov/search/pages/results/?state=&date1=1900&date2=1922&proxtext=%22America+first%22&x=0&y=0&dateFilterType=yearRange&searchType=basic&format=json').json()
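
The query string above can also be assembled from a dictionary, which makes the search parameters easier to read and change. This is only a minimal sketch using the standard library: `build_search_url` and its argument names are hypothetical helpers, the parameter names are copied from the URL above, and leaving out the `state`, `x`, and `y` parameters rests on my unconfirmed assumption that the API does not need them.

```python
from urllib.parse import urlencode

# Endpoint copied from the search URL used in this notebook.
BASE_URL = "https://chroniclingamerica.loc.gov/search/pages/results/"


def build_search_url(phrase="", year_from=1900, year_to=1922, page=1, rows=20):
    """Return a search URL; parameter names mirror the hand-written URL."""
    params = {
        "date1": year_from,
        "date2": year_to,
        "proxtext": phrase,  # an empty string means "no search phrase"
        "dateFilterType": "yearRange",
        "searchType": "basic",
        "format": "json",
        "rows": rows,
        "page": page,
    }
    # urlencode percent-encodes the quotes and turns spaces into "+".
    return BASE_URL + "?" + urlencode(params)


af_url = build_search_url('"America first"')
empty_url = build_search_url()
print(af_url)
```

Either URL can then be passed to `requests.get(...).json()` in the same way as the hand-written one.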

Here I've searched for all the pages that Chronicling America (CA) has from 1900-1922.

In [4]:
total = requests.get('https://chroniclingamerica.loc.gov/search/pages/results/?state=&date1=1900&date2=1922&proxtext=&x=16&y=16&dateFilterType=yearRange&rows=20&searchType=basic&format=json').json()

Here I've printed out how many pages have "America First" and how many there are total

In [5]:
print(af_search["totalItems"])
print(total["totalItems"])

79734
7826804
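
As a quick sanity check, the two totals can be turned into a single raw share. This is a small worked calculation using the two numbers printed above; the variable names are my own.

```python
af_total = 79734      # totalItems for the "America first" search (printed above)
all_total = 7826804   # totalItems for the empty search (printed above)

# Fraction of all pages in the date range that match the phrase.
share = af_total / all_total
print(f"{share:.2%} of pages in the date range match the phrase")
```

This works out to roughly one page in a hundred, but it says nothing yet about how the matches are spread across the years.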


In [6]:
total['items'][0]  # This is what one of the items returned looks like. It is a dictionary of info.

{'alt_title': ['Call', 'Call-chronicle-examiner', 'Sunday call'],
 'batch': 'batch_curiv_brea_ver01',
 'city': ['San Francisco'],
 'country': 'California',
 'county': ['San Francisco'],
 'date': '19010614',
 'edition': None,
 'edition_label': '',
 'end_year': 1913,
 'frequency': 'Daily',
 'id': '/lccn/sn85066387/1901-06-14/ed-1/seq-5/',
 'language': ['English'],
 'lccn': 'sn85066387',
 'note': ['"San Francisco" appears above, and later across, masthead ornament.',
  'Also issued online.',
  'Archived issues are available in digital format from the Library of Congress Chronicling America online collection.',
  'Issued with a joint ed. of the San Francisco chronicle and the San Francisco examiner on the day after the San Francisco earthquake, Apr. 19, 1906.',
  'Master negatives are available for duplication from:',
  'Publishers: Charles M. Shortridge, <1896>; John D. Spreckles, <1899>.'],
 'ocr_eng': 'Grand vice presidents. Miss Elizv D.\nKeith of Alta Parlor. San Francisco; Misa\nDora

I wanted to get a sense of how each calendar year was represented in the results from my empty search of the database. This count only looks at the first page of results, and each page holds twenty items.

In [7]:
from collections import Counter

c = Counter()
for page in total['items']:
    c[page['date'][:4]] += 1
    
print(c)

Counter({'1911': 6, '1920': 5, '1917': 3, '1901': 2, '1916': 2, '1908': 2})


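I wanted to know how many pages of 20 items were in my query range (1900-1922). The total I divide here (7777694) differs slightly from the totalItems printed above (7826804), presumably because the two cells were run at different times.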

In [2]:
how_many_pages = 7777694/20
print(how_many_pages)

388884.7
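
The fractional result means the last page of results is only partly full. A minimal sketch of the two ways to round it, using the same figures as the cell above (the variable names are my own):

```python
import math

total_items = 7777694   # figure used in the cell above
items_per_page = 20

# Floor division counts only pages that hold a full 20 items.
full_pages = total_items // items_per_page
# Rounding up also counts the final, partly filled page.
all_pages = math.ceil(total_items / items_per_page)
print(full_pages, all_pages)
```

Truncating with `int()`, as I do further down, corresponds to `full_pages`, so the final partial page is left out of the sampling range.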


Here I looked at the first 800 (40 * 20) entries from the API using no search parameters to see how each year was represented.
I was disappointed by the variance in the representation of each year in my set.

Warning: This loop takes a while.

In [10]:
page_count = Counter()  # This will look at pages that come back from an empty search

for page in range(0,40):
    total_pages = requests.get('https://chroniclingamerica.loc.gov/search/pages/results/?state=&date1=1900&date2=1922&proxtext=&x=16&y=16&dateFilterType=yearRange&rows=20&searchType=basic&format=json&rows=20&page=' + str(page)).json()
    for item in total_pages['items']:
        page_count[item['date'][0:4]] += 1
        

page_count.most_common()

[('1911', 252),
 ('1920', 141),
 ('1916', 100),
 ('1901', 88),
 ('1908', 86),
 ('1917', 76),
 ('1912', 29),
 ('1918', 27),
 ('1906', 1)]

I decided to import the random library and draw a random sample rather than rely on whatever order the API's search returned.

In [3]:
import random

how_many_pages=int(how_many_pages)

Here I used the random library to find 50 random pages in the range offered by CA. I was happier with the distribution.

In [4]:
how_many_pages=int(how_many_pages)
page_sample = random.sample(range(0,how_many_pages), 50)

# Using random, I get a random sample of fifty numbers
# between 0 and the number of pages
# returned by my empty search
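
For a sample that can be reproduced on a later run, the generator can be seeded. This is a minimal sketch rather than what I ran: the seed value 42 is arbitrary, and starting the range at 1 instead of 0 reflects an assumption, which I have not confirmed, that the API numbers its result pages from 1.

```python
import random

how_many_pages = 388884      # whole pages, from the division above
rng = random.Random(42)      # arbitrary seed, chosen only for repeatability

# Draw 50 distinct page numbers without replacement.
# Assumption: result pages are numbered 1..how_many_pages.
page_sample = rng.sample(range(1, how_many_pages + 1), 50)
print(sorted(page_sample)[:5])
```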

Below I looped through my random page sample and tacked each random integer onto the end of my empty search query to the API. This gave me fifty random pages from the overall set. For each item on each page, I extracted the year attached to that item in the database and kept track of the results in a counter.

Warning: This process takes a while.

In [13]:
page_count = Counter()

for page in page_sample:
    total_pages = requests.get('https://chroniclingamerica.loc.gov/search/pages/results/?state=&date1=1900&date2=1922&proxtext=&x=16&y=16&dateFilterType=yearRange&rows=20&searchType=basic&format=json&rows=20&page=' + str(page)).json()
    for item in total_pages['items']:
        page_count[item['date'][0:4]] += 1
        

page_count.most_common()

[('1910', 64),
 ('1916', 62),
 ('1909', 57),
 ('1913', 53),
 ('1917', 52),
 ('1911', 50),
 ('1915', 49),
 ('1908', 49),
 ('1919', 45),
 ('1922', 44),
 ('1921', 44),
 ('1905', 43),
 ('1904', 42),
 ('1906', 42),
 ('1918', 40),
 ('1920', 38),
 ('1903', 37),
 ('1902', 37),
 ('1912', 36),
 ('1900', 36),
 ('1907', 30),
 ('1914', 30),
 ('1901', 20)]
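
The same tallying loop is reused later for the "America First" search, so it can be factored into a function. The sketch below is an illustration under stated assumptions: `count_years`, `fetch_page`, and `fake_fetch` are hypothetical names, and the stand-in data (including the second date) is made up so that the example runs without calling the API.

```python
from collections import Counter


def count_years(page_numbers, fetch_page):
    """Tally the four-digit year of every item on the given result pages.

    fetch_page is any callable that takes a page number and returns the
    parsed JSON dictionary for that page (hypothetical signature).
    """
    counts = Counter()
    for number in page_numbers:
        for item in fetch_page(number).get("items", []):
            counts[item["date"][:4]] += 1  # dates look like '19010614'
    return counts


def fake_fetch(number):
    """Offline stand-in for the API; the items here are invented."""
    return {"items": [{"date": "19010614"}, {"date": "19110302"}]}


demo_counts = count_years([1, 2, 3], fake_fetch)
print(demo_counts.most_common())
```

With the real API, `fetch_page` could wrap the `requests.get(...).json()` call used in the loop above.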

In [14]:
page_count_list = list(page_count.items())

In [15]:
page_count_list  # I turn my Counter object into a list

[('1919', 45),
 ('1909', 57),
 ('1903', 37),
 ('1912', 36),
 ('1911', 50),
 ('1907', 30),
 ('1918', 40),
 ('1917', 52),
 ('1904', 42),
 ('1922', 44),
 ('1915', 49),
 ('1914', 30),
 ('1921', 44),
 ('1910', 64),
 ('1916', 62),
 ('1920', 38),
 ('1913', 53),
 ('1906', 42),
 ('1900', 36),
 ('1902', 37),
 ('1905', 43),
 ('1901', 20),
 ('1908', 49)]

Everything I did above for the empty search, I now do on the set returned from my "America First" query. I divided the number of items by 20 (which is the API's default number of items per page).

In [16]:
af_search["totalItems"]  # Here are the total items in the set

79734

In [17]:
how_many_pages_af=79521/20
print(how_many_pages_af)
how_many_pages_af = int(how_many_pages_af)
print(how_many_pages_af)

3976.05
3976


In [18]:
random_num_af = random.sample(range(0,how_many_pages_af), 50) # I find 50 random pages in my "America First" search
print(random_num_af)

[1448, 3762, 387, 494, 2707, 1749, 3720, 786, 2825, 1621, 3715, 2915, 1709, 2864, 2252, 2787, 561, 43, 1100, 1254, 1237, 1474, 2158, 3083, 1466, 866, 2974, 212, 1770, 211, 3602, 1689, 2924, 891, 3763, 3616, 1038, 3612, 1953, 3177, 3497, 3701, 829, 406, 338, 337, 1756, 1050, 2620, 2179]


Here I'm using the 50 random numbers to pull a sample of 1000 pages (50 * 20) from my "America First" search and printing out the years with the most instances of "America First."

Warning: This takes a while.

In [19]:
af_count = Counter()

for page in random_num_af:
    af_json = requests.get('https://chroniclingamerica.loc.gov/search/pages/results/?state=&date1=1900&date2=1922&proxtext=%22America+first%22&x=0&y=0&dateFilterType=yearRange&searchType=basic&format=json&rows=20&page=' + str(page)).json()
    for item in af_json['items']:
        af_count[item['date'][0:4]] += 1
        

af_count.most_common()

[('1917', 84),
 ('1906', 82),
 ('1916', 78),
 ('1918', 61),
 ('1912', 53),
 ('1920', 52),
 ('1922', 51),
 ('1915', 46),
 ('1911', 45),
 ('1921', 42),
 ('1919', 42),
 ('1907', 40),
 ('1905', 39),
 ('1901', 36),
 ('1913', 34),
 ('1903', 31),
 ('1914', 31),
 ('1909', 30),
 ('1908', 30),
 ('1910', 30),
 ('1900', 23),
 ('1904', 20),
 ('1902', 20)]

In [20]:
af_count # currently my list of "America First" mentions per year is actually in a Counter object

Counter({'1900': 23,
         '1901': 36,
         '1902': 20,
         '1903': 31,
         '1904': 20,
         '1905': 39,
         '1906': 82,
         '1907': 40,
         '1908': 30,
         '1909': 30,
         '1910': 30,
         '1911': 45,
         '1912': 53,
         '1913': 34,
         '1914': 31,
         '1915': 46,
         '1916': 78,
         '1917': 84,
         '1918': 61,
         '1919': 42,
         '1920': 52,
         '1921': 42,
         '1922': 51})

I create a list of tuples out of my Counter object. Each tuple contains a year paired with the mentions of "America First" for that year.

In [21]:
date_count = list(af_count.items())

In [22]:
date_count[10]

('1913', 34)

In [23]:
date_count

[('1912', 53),
 ('1904', 20),
 ('1909', 30),
 ('1911', 45),
 ('1901', 36),
 ('1903', 31),
 ('1906', 82),
 ('1908', 30),
 ('1917', 84),
 ('1915', 46),
 ('1913', 34),
 ('1905', 39),
 ('1910', 30),
 ('1916', 78),
 ('1907', 40),
 ('1920', 52),
 ('1921', 42),
 ('1902', 20),
 ('1914', 31),
 ('1922', 51),
 ('1919', 42),
 ('1918', 61),
 ('1900', 23)]

So I have two sets. One (page_count) contains the instances of each calendar year when I run an empty search through the API. The second (date_count) contains instances of each calendar year when I run a search through the API using "America First" as a query. 

Here I've combined the two sets into a list of tuples.

In [24]:
complete_list = []
for x in date_count:
    for y in page_count_list:
        if x[0] == y[0]:
            cast = list(x)
            cast.insert(2, y[1])
            x = tuple(cast)
            complete_list.append(x)
            print(x)
            

('1912', 53, 36)
('1904', 20, 42)
('1909', 30, 57)
('1911', 45, 50)
('1901', 36, 20)
('1903', 31, 37)
('1906', 82, 42)
('1908', 30, 49)
('1917', 84, 52)
('1915', 46, 49)
('1913', 34, 53)
('1905', 39, 43)
('1910', 30, 64)
('1916', 78, 62)
('1907', 40, 30)
('1920', 52, 38)
('1921', 42, 44)
('1902', 20, 37)
('1914', 31, 30)
('1922', 51, 44)
('1919', 42, 45)
('1918', 61, 40)
('1900', 23, 36)
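
The nested loop above pairs the two sets by comparing every tuple against every other. An alternative sketch looks each year up directly in the two Counter objects; it uses three years copied from the counts above to stay short, and `shared_years` is a name I made up for the illustration.

```python
from collections import Counter

# Three years copied from the counts above, to keep the example short.
af_count = Counter({"1900": 23, "1901": 36, "1906": 82})
page_count = Counter({"1900": 36, "1901": 20, "1906": 42})

# Keep only years present in both samples, in chronological order.
shared_years = sorted(af_count.keys() & page_count.keys())
complete_list = [(year, af_count[year], page_count[year]) for year in shared_years]
print(complete_list)
```

Because the years are sorted up front, this version also makes the later `sort()` step unnecessary.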


In [25]:
complete_list

[('1912', 53, 36),
 ('1904', 20, 42),
 ('1909', 30, 57),
 ('1911', 45, 50),
 ('1901', 36, 20),
 ('1903', 31, 37),
 ('1906', 82, 42),
 ('1908', 30, 49),
 ('1917', 84, 52),
 ('1915', 46, 49),
 ('1913', 34, 53),
 ('1905', 39, 43),
 ('1910', 30, 64),
 ('1916', 78, 62),
 ('1907', 40, 30),
 ('1920', 52, 38),
 ('1921', 42, 44),
 ('1902', 20, 37),
 ('1914', 31, 30),
 ('1922', 51, 44),
 ('1919', 42, 45),
 ('1918', 61, 40),
 ('1900', 23, 36)]

In [26]:
complete_list.sort()  # The list is easier to comprehend if sorted by year

In [27]:
complete_list

[('1900', 23, 36),
 ('1901', 36, 20),
 ('1902', 20, 37),
 ('1903', 31, 37),
 ('1904', 20, 42),
 ('1905', 39, 43),
 ('1906', 82, 42),
 ('1907', 40, 30),
 ('1908', 30, 49),
 ('1909', 30, 57),
 ('1910', 30, 64),
 ('1911', 45, 50),
 ('1912', 53, 36),
 ('1913', 34, 53),
 ('1914', 31, 30),
 ('1915', 46, 49),
 ('1916', 78, 62),
 ('1917', 84, 52),
 ('1918', 61, 40),
 ('1919', 42, 45),
 ('1920', 52, 38),
 ('1921', 42, 44),
 ('1922', 51, 44)]

What I'm actually interested in is the ratio of yearly instances in my "America First" set to the yearly instances in my empty-search set. I could have simply used the instances from the "America First" search, but I have no idea whether the "Chronicling America" database overrepresents some years relative to others. The ratio, which I call the weighted appearance frequency, is my correction for that possible error.

In [28]:
list_of_ratios = []  # This list will hold all the weighted appearance frequencies
for x in complete_list:
    cast = list(x)
    ratio = cast[1]/cast[2]
    list_of_ratios.append(ratio)
    print(ratio)

0.6388888888888888
1.8
0.5405405405405406
0.8378378378378378
0.47619047619047616
0.9069767441860465
1.9523809523809523
1.3333333333333333
0.6122448979591837
0.5263157894736842
0.46875
0.9
1.4722222222222223
0.6415094339622641
1.0333333333333334
0.9387755102040817
1.2580645161290323
1.6153846153846154
1.525
0.9333333333333333
1.368421052631579
0.9545454545454546
1.1590909090909092
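
Both samples contain 1000 items (50 pages of 20), so a ratio of 1.0 means a year takes up the same share of the "America First" sample as it does of the baseline sample. The sketch below recomputes a few of the ratios with a guard for a year that is missing from the baseline; `weighted_frequency` is a hypothetical helper name, and the rows are copied from complete_list above.

```python
def weighted_frequency(af_instances, empty_instances):
    """Ratio of "America First" hits to baseline pages for one year."""
    if empty_instances == 0:
        return None  # year absent from the baseline sample, ratio undefined
    return af_instances / empty_instances


# Rows copied from complete_list above.
rows = [("1906", 82, 42), ("1910", 30, 64), ("1917", 84, 52)]
for year, af, empty in rows:
    print(year, round(weighted_frequency(af, empty), 3))
```

The guard never fires for this data, since every year from 1900 to 1922 appears in both samples, but a smaller baseline sample could easily miss a year.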


Of course, I need these frequencies in my tuples as well.

In [29]:
completer_list = []
list_of_ratios = []
for x in complete_list:
    cast = list(x)
    ratio = cast[1]/cast[2]
    cast.insert(3, ratio)
    x = tuple(cast)
    completer_list.append(x)
    print(x)
    

('1900', 23, 36, 0.6388888888888888)
('1901', 36, 20, 1.8)
('1902', 20, 37, 0.5405405405405406)
('1903', 31, 37, 0.8378378378378378)
('1904', 20, 42, 0.47619047619047616)
('1905', 39, 43, 0.9069767441860465)
('1906', 82, 42, 1.9523809523809523)
('1907', 40, 30, 1.3333333333333333)
('1908', 30, 49, 0.6122448979591837)
('1909', 30, 57, 0.5263157894736842)
('1910', 30, 64, 0.46875)
('1911', 45, 50, 0.9)
('1912', 53, 36, 1.4722222222222223)
('1913', 34, 53, 0.6415094339622641)
('1914', 31, 30, 1.0333333333333334)
('1915', 46, 49, 0.9387755102040817)
('1916', 78, 62, 1.2580645161290323)
('1917', 84, 52, 1.6153846153846154)
('1918', 61, 40, 1.525)
('1919', 42, 45, 0.9333333333333333)
('1920', 52, 38, 1.368421052631579)
('1921', 42, 44, 0.9545454545454546)
('1922', 51, 44, 1.1590909090909092)


In [30]:
completer_list

[('1900', 23, 36, 0.6388888888888888),
 ('1901', 36, 20, 1.8),
 ('1902', 20, 37, 0.5405405405405406),
 ('1903', 31, 37, 0.8378378378378378),
 ('1904', 20, 42, 0.47619047619047616),
 ('1905', 39, 43, 0.9069767441860465),
 ('1906', 82, 42, 1.9523809523809523),
 ('1907', 40, 30, 1.3333333333333333),
 ('1908', 30, 49, 0.6122448979591837),
 ('1909', 30, 57, 0.5263157894736842),
 ('1910', 30, 64, 0.46875),
 ('1911', 45, 50, 0.9),
 ('1912', 53, 36, 1.4722222222222223),
 ('1913', 34, 53, 0.6415094339622641),
 ('1914', 31, 30, 1.0333333333333334),
 ('1915', 46, 49, 0.9387755102040817),
 ('1916', 78, 62, 1.2580645161290323),
 ('1917', 84, 52, 1.6153846153846154),
 ('1918', 61, 40, 1.525),
 ('1919', 42, 45, 0.9333333333333333),
 ('1920', 52, 38, 1.368421052631579),
 ('1921', 42, 44, 0.9545454545454546),
 ('1922', 51, 44, 1.1590909090909092)]

Now I start putting my info into a SQLite table.

In [31]:
import sqlite3

In [33]:
conn = sqlite3.connect('af_dates.db')
cur = conn.cursor()  # cursor used by the cells below to run SQL statements

In [34]:
cur.execute('''CREATE TABLE IF NOT EXISTS af_dates (dates text, yr_instances_af integer, yr_instances_empty integer, weighted_frequency real)''')

<sqlite3.Cursor at 0x154e8aee9d0>

Here I insert my data into the created table.

In [35]:
for date in completer_list:
    cur.execute('INSERT INTO af_dates VALUES (?,?,?,?)', (date))


In [36]:
conn.commit()
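
Because the table is created with IF NOT EXISTS and the database lives in a file, re-running the insert cell appends a second copy of every row. For experimenting, an in-memory database avoids that. This is a minimal sketch with three rows copied from completer_list above; the `":memory:"` connection and the single `executemany` call are substitutions of mine, not what the cells above do.

```python
import sqlite3

rows = [  # three rows copied from completer_list above
    ("1900", 23, 36, 0.6388888888888888),
    ("1901", 36, 20, 1.8),
    ("1902", 20, 37, 0.5405405405405406),
]

conn = sqlite3.connect(":memory:")  # in-memory database, so no file is written
cur = conn.cursor()
cur.execute(
    "CREATE TABLE IF NOT EXISTS af_dates "
    "(dates text, yr_instances_af integer, yr_instances_empty integer, "
    "weighted_frequency real)"
)
cur.executemany("INSERT INTO af_dates VALUES (?,?,?,?)", rows)  # one call for all rows
conn.commit()

# The years are stored as text, so ordering by the column sorts them chronologically.
stored = cur.execute("SELECT * FROM af_dates ORDER BY dates").fetchall()
print(stored)
conn.close()
```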

At this point I should have a db with dates, yearly instances in my "America First" search, yearly instances in my empty search, and the weighted appearance frequency included side by side.

In [37]:
print("Below each entry has four pieces of data:\n\t1. Year\n\t2. Mentions of \"America First\"\n\t3. Number of pages from that year in a random sample\n\t4. Ratio of instances to number of pages from that year")
for row in cur.execute('SELECT * FROM af_dates ORDER BY dates'):  # four-digit year strings sort chronologically as text
    print(row)

Below each entry has four pieces of data:
	1. Year
	2. Mentions of "America First"
	3. Number of pages from that year in a random sample
	4. Ratio of instances to number of pages from that year
('1900', 23, 36, 0.6388888888888888)
('1901', 36, 20, 1.8)
('1902', 20, 37, 0.5405405405405406)
('1903', 31, 37, 0.8378378378378378)
('1904', 20, 42, 0.47619047619047616)
('1905', 39, 43, 0.9069767441860465)
('1906', 82, 42, 1.9523809523809523)
('1907', 40, 30, 1.3333333333333333)
('1908', 30, 49, 0.6122448979591837)
('1909', 30, 57, 0.5263157894736842)
('1910', 30, 64, 0.46875)
('1911', 45, 50, 0.9)
('1912', 53, 36, 1.4722222222222223)
('1913', 34, 53, 0.6415094339622641)
('1914', 31, 30, 1.0333333333333334)
('1915', 46, 49, 0.9387755102040817)
('1916', 78, 62, 1.2580645161290323)
('1917', 84, 52, 1.6153846153846154)
('1918', 61, 40, 1.525)
('1919', 42, 45, 0.9333333333333333)
('1920', 52, 38, 1.368421052631579)
('1921', 42, 44, 0.9545454545454546)
('1922', 51, 44, 1.1590909090909092)


Here I'm just showing that I can take my data out of the database and put it into a list.

In [38]:
post_db_list = []
for row in cur.execute('SELECT * FROM af_dates ORDER BY dates'):  # four-digit year strings sort chronologically as text
    post_db_list.append(row)

In [39]:
post_db_list

[('1900', 23, 36, 0.6388888888888888),
 ('1901', 36, 20, 1.8),
 ('1902', 20, 37, 0.5405405405405406),
 ('1903', 31, 37, 0.8378378378378378),
 ('1904', 20, 42, 0.47619047619047616),
 ('1905', 39, 43, 0.9069767441860465),
 ('1906', 82, 42, 1.9523809523809523),
 ('1907', 40, 30, 1.3333333333333333),
 ('1908', 30, 49, 0.6122448979591837),
 ('1909', 30, 57, 0.5263157894736842),
 ('1910', 30, 64, 0.46875),
 ('1911', 45, 50, 0.9),
 ('1912', 53, 36, 1.4722222222222223),
 ('1913', 34, 53, 0.6415094339622641),
 ('1914', 31, 30, 1.0333333333333334),
 ('1915', 46, 49, 0.9387755102040817),
 ('1916', 78, 62, 1.2580645161290323),
 ('1917', 84, 52, 1.6153846153846154),
 ('1918', 61, 40, 1.525),
 ('1919', 42, 45, 0.9333333333333333),
 ('1920', 52, 38, 1.368421052631579),
 ('1921', 42, 44, 0.9545454545454546),
 ('1922', 51, 44, 1.1590909090909092)]

In [40]:
complete_list_sorted = sorted(post_db_list, key=lambda date: date[0])

In [41]:
complete_list_sorted

[('1900', 23, 36, 0.6388888888888888),
 ('1901', 36, 20, 1.8),
 ('1902', 20, 37, 0.5405405405405406),
 ('1903', 31, 37, 0.8378378378378378),
 ('1904', 20, 42, 0.47619047619047616),
 ('1905', 39, 43, 0.9069767441860465),
 ('1906', 82, 42, 1.9523809523809523),
 ('1907', 40, 30, 1.3333333333333333),
 ('1908', 30, 49, 0.6122448979591837),
 ('1909', 30, 57, 0.5263157894736842),
 ('1910', 30, 64, 0.46875),
 ('1911', 45, 50, 0.9),
 ('1912', 53, 36, 1.4722222222222223),
 ('1913', 34, 53, 0.6415094339622641),
 ('1914', 31, 30, 1.0333333333333334),
 ('1915', 46, 49, 0.9387755102040817),
 ('1916', 78, 62, 1.2580645161290323),
 ('1917', 84, 52, 1.6153846153846154),
 ('1918', 61, 40, 1.525),
 ('1919', 42, 45, 0.9333333333333333),
 ('1920', 52, 38, 1.368421052631579),
 ('1921', 42, 44, 0.9545454545454546),
 ('1922', 51, 44, 1.1590909090909092)]

In [42]:
dates = [row[0] for row in complete_list_sorted]

In [43]:
af_instances = [row[1] for row in complete_list_sorted]

In [44]:
empty_search_instances = [row[2] for row in complete_list_sorted]

In [45]:
frequencies = [row[3] for row in complete_list_sorted]
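
The four list comprehensions above each pull one column out of the rows. The same split can be sketched in one step with `zip`, shown here on three rows copied from complete_list_sorted; this is an alternative for illustration, not a change to the cells above.

```python
# Three rows copied from complete_list_sorted above.
complete_list_sorted = [
    ("1900", 23, 36, 0.6388888888888888),
    ("1901", 36, 20, 1.8),
    ("1902", 20, 37, 0.5405405405405406),
]

# zip(*rows) transposes rows into columns; each column is converted to a list.
dates, af_instances, empty_search_instances, frequencies = (
    list(column) for column in zip(*complete_list_sorted)
)
print(dates, frequencies)
```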

In [46]:
from bokeh.io import output_notebook, show
from bokeh.plotting import figure

output_notebook()

In [47]:
p = figure(x_range=dates[:23], plot_height=600, plot_width=1000, title="Weighted Frequency of \'America First\'", x_axis_label = "Year",
       y_axis_label = "Weighted Appearance Frequency of \'America First\'", toolbar_location=None, tools="")

In [48]:
p.line(x=dates[:23], y=frequencies[:23])

In [49]:
show(p)