Analyze State of the Union addresses.
Data source: https://en.wikisource.org/wiki/Portal:State_of_the_Union_Speeches_by_United_States_Presidents
Plan: scrape the text of all speeches, then look for patterns in how each president speaks.

https://nlp.stanford.edu/IR-book/html/htmledition/sublinear-tf-scaling-1.html 
https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/ch04.html 

## Setup

In [20]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import urllib.request
import re
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import spacy
import time
from sklearn.neighbors import NearestNeighbors

# Data source we are going to scrape for results
data_url = 'https://en.wikisource.org/wiki/Portal:State_of_the_Union_Speeches_by_United_States_Presidents'

link_list = []

# Extract the text of a speech from a URL.
# The text is returned as a list of paragraphs (strings), skipping the license notice.
def get_speech(url):
    license_note = 'This work is in the public domain in the United States because it is a work of the United States federal government'
    # Name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(urllib.request.urlopen(url), 'html.parser')
    return [p.text.strip() for p in soup.find_all("p") if license_note not in p.text.strip()]

# Make a frequency count by distinct values of 
# column(s) listed in 'groupbyvars'
# Returns pandas dataframe
def tidy_count(df,groupbyvars):
    return(df.groupby(groupbyvars).size().reset_index().\
        rename(columns={0: "n"}).sort_values('n',ascending=False).reset_index(drop=True))

## Web Scraping

In [21]:
resp = urllib.request.urlopen(data_url)
soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))

# Get all links to State of the Union addresses from the portal page
for link in soup.find_all('a', href=True):
    href = link['href'].lower()
    # Parenthesize the "or": "and" binds tighter, so without the parentheses
    # the portal and "#" exclusions would apply only to "union_speech" links
    if ("union_address" in href or "union_speech" in href) \
        and "portal" not in href and "#" not in href:
        link_list.append(link['href'])

# Note that I am storing these speeches as lists of paragraphs (strings) for readability
speeches = [get_speech('https://en.wikisource.org' + link) for link in link_list]
# Extract presidents names from link text
presidents = [ link.replace('%','/').split('/')[2].replace('_',' ') for link in link_list ]

# Extract state of the union text entries so we can extract the date
sou_entries = []
for item in soup.find_all('li'):
    if 'union' in item.text.strip().lower() and '(' in  item.text.strip().lower():
        sou_entries.append(item.text.strip())

speeches_pd = pd.DataFrame({
                'president' : presidents,
                'speech' : speeches,
                'year' : [int(re.findall(r'\d+',item)[1]) for item in sou_entries ]} )
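
One detail of the link filter is easy to get wrong: Python's `and` binds tighter than `or`, so `A or B and C` is evaluated as `A or (B and C)`. The sketch below uses made-up hrefs and two hypothetical helper names, `is_speech_link` and `loose_filter`, to contrast the two groupings. It illustrates the precedence rule only and says nothing about what the real portal page contains.

```python
# Hypothetical helpers; the hrefs below are invented for illustration.
def is_speech_link(href):
    href = href.lower()
    # Explicit parentheses: the exclusions apply to both link patterns
    return (("union_address" in href or "union_speech" in href)
            and "portal" not in href and "#" not in href)

def loose_filter(href):
    href = href.lower()
    # No parentheses: parsed as A or (B and C and D)
    return "union_address" in href or "union_speech" in href \
        and "portal" not in href and "#" not in href

hrefs = [
    "/wiki/Some_President%27s_First_State_of_the_Union_Address",
    "/wiki/Portal:State_of_the_Union_Speeches",
    "/wiki/Some_Page#State_of_the_Union_Address",
]
kept = [h for h in hrefs if is_speech_link(h)]
kept_loose = [h for h in hrefs if loose_filter(h)]
print(kept)
print(kept_loose)
```

With the parenthesized version only the first href is kept; the unparenthesized version also lets the anchor link through, because the `#` check never applies to `union_address` links.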

In [42]:
#speeches_pd['speech_num'] = speeches_pd.index # for joining

In [22]:
speeches_pd.sample(n=5)

| | president | speech | year |
| --- | --- | --- | --- |
| 39 | John Quincy Adams | [Fellow Citizens of the Senate and of the Hous... | 1828 |
| 135 | Calvin Coolidge | [To the Congress of the United States:, The pr... | 1924 |
| 48 | Martin Van Buren | [Fellow-Citizens of the Senate and House of Re... | 1837 |
| 193 | Jimmy Carter | [To the Congress of the United States:, The St... | 1981 |
| 217 | George W. Bush | [Mr. Speaker, Vice President Cheney, members o... | 2005 |
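
The `year` column is taken from the second run of digits in each list entry, which assumes every entry has a day number before the year. A slightly more defensive sketch searches for a four-digit year directly. The entry strings and the helper name `extract_year` are hypothetical; the real Wikisource list-item text may look different.

```python
import re

# Hypothetical entry strings; the real list-item text may differ.
entries = [
    "First State of the Union Address (8 January 1790)",
    "Second State of the Union Address (December 1790)",
]

# Four-digit years from 1700 to 2099
YEAR_RE = re.compile(r"\b(1[7-9]\d{2}|20\d{2})\b")

def extract_year(entry):
    """Return the first four-digit year found in the entry, or None."""
    match = YEAR_RE.search(entry)
    return int(match.group(1)) if match else None

years = [extract_year(e) for e in entries]
print(years)
```

For the second hypothetical entry, indexing the second digit run would raise an `IndexError`, since the string contains only one number.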


## Preprocessing

Clean the text: convert to lower case, remove stop words, and drop non-alphabetic and single-character tokens. The tokens are not lemmatized; the pipeline keeps `token.text` as-is.

In [8]:
nlp = spacy.load('en_core_web_lg', disable=['parser', 'ner']) # disabling the parser and NER makes it run faster

# Workaround for stopwords bug in en_core_web_lg model
for word in nlp.Defaults.stop_words:
    lex = nlp.vocab[word]
    lex.is_stop = True

# Preprocess pipeline with spaCy.
def spacy_preprocess(text):
    text_out = []
    for token in nlp(text.lower()):
        # Drop stop words, non-alphabetic tokens, and single-character tokens
        if not token.is_stop and token.is_alpha and len(token) > 1:
            text_out.append(token.text)
    return nlp(" ".join(text_out))

In [9]:
# Print stop words
#print(nlp.Defaults.stop_words)

In [10]:
for token in nlp('The the weather'):
    print(token.is_stop)

False
True
False


In [11]:
# test spacy preprocessing
spacy_preprocess('The dog ran into Bob beCause he saw 234 squirrels under VAU15')

dog ran bob saw squirrels
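
The filter rules can be made explicit without loading a model. The sketch below is a rough pure-Python stand-in, not spaCy's behavior: it assumes whitespace tokenization and a tiny hypothetical stop list (`STOP_WORDS`), and the helper name `simple_preprocess` is made up. Lowercasing comes first for the same reason as in the pipeline above: the stop-word check printed earlier returned `False` for the capitalized `The`.

```python
# Tiny hypothetical stop list; spaCy's real list is much larger.
STOP_WORDS = {"the", "into", "because", "he", "under"}

def simple_preprocess(text):
    """Lowercase, split on whitespace, and keep alphabetic
    non-stop tokens longer than one character."""
    kept = []
    for token in text.lower().split():
        if token not in STOP_WORDS and token.isalpha() and len(token) > 1:
            kept.append(token)
    return " ".join(kept)

result = simple_preprocess(
    'The dog ran into Bob beCause he saw 234 squirrels under VAU15')
print(result)
```

On this sentence the sketch keeps the same five words as the spaCy pipeline; `234` and `vau15` are dropped because they are not purely alphabetic.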

## Vectorize Speeches

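Use spaCy's built-in word vectors to embed our speeches. A `Doc`'s `.vector` defaults to the average of its token vectors, so each speech becomes a single fixed-length vector.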

In [14]:
# Each speech is stored as a list of paragraph strings. 
# Here we join the paragraphs into a single speech string
speech_list = [" ".join(speech) for speech in speeches_pd['speech'].tolist() ]

# Preprocess each speech and take its document vector
t0 = time.time()

speeches_embed = np.array([ spacy_preprocess(speech).vector for speech in speech_list])

print('Preprocessing time elapsed: ' + str(time.time()-t0))

Preprocessing time elapsed: 144.02652311325073
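
The averaging behind a document vector can be sketched in a few lines. This is an illustration with hypothetical 3-dimensional vectors and a hypothetical helper `doc_vector`, not spaCy's implementation; it assumes the default behavior in which `Doc.vector` is the mean of the token vectors.

```python
# Hypothetical 3-dimensional word vectors; real spaCy vectors are much longer.
word_vectors = {
    "dog": [1.0, 0.0, 2.0],
    "ran": [3.0, 2.0, 0.0],
}

def doc_vector(tokens, vectors):
    """Average the token vectors component by component."""
    rows = [vectors[t] for t in tokens]
    return [sum(col) / len(rows) for col in zip(*rows)]

vec = doc_vector(["dog", "ran"], word_vectors)
vec_reversed = doc_vector(["ran", "dog"], word_vectors)
print(vec, vec_reversed)
```

Because an average ignores word order, the two token orders give the same vector.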


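Run a k-nearest neighbors search to find the speeches most similar to each speech. scikit-learn's `NearestNeighbors` uses Euclidean distance by default, so a smaller distance means more similar embeddings.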

In [15]:
k_search_dist = 5

In [16]:
t0 = time.time()

kn_model = NearestNeighbors()
kn_model.fit(speeches_embed)

# Find the k most similar speeches for each speech.
# We add 1 to k since each speech will be most similar to itself (and we remove that result)
dist_speeches, sim_speeches = kn_model.kneighbors(speeches_embed, n_neighbors=k_search_dist+1)

print('k-nearest search time elapsed: ' + str(time.time()-t0))

k-nearest search time elapsed: 0.023030757904052734
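
To see why the search asks for `k_search_dist+1` neighbors, here is a brute-force sketch of a Euclidean nearest-neighbor search over hypothetical 2-D points. The helper `k_nearest` is made up for illustration and is not how scikit-learn implements the search.

```python
import math

# Hypothetical 2-D points standing in for speech vectors.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0), (5.0, 5.0)]

def k_nearest(points, k):
    """For each point, return the distances to and indices of its
    k nearest points under Euclidean distance."""
    dists, idxs = [], []
    for p in points:
        ranked = sorted((math.dist(p, q), j) for j, q in enumerate(points))[:k]
        dists.append([d for d, _ in ranked])
        idxs.append([j for _, j in ranked])
    return dists, idxs

k = 2
dists, idxs = k_nearest(points, k + 1)  # +1 because each point matches itself
print(idxs[0], dists[0])
```

The first neighbor of every point is the point itself at distance 0, so the remaining `k` columns hold the actual matches.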


In [23]:
# Store NumPy arrays in pandas
dist_speeches_pd = pd.DataFrame(dist_speeches)
dist_speeches_pd.insert(0,'speech_num',speeches_pd.index)

sim_speeches_pd = pd.DataFrame(sim_speeches)
sim_speeches_pd.insert(0,'speech_num',speeches_pd.index)

In [34]:
dist_matrix = pd.melt(dist_speeches_pd,
    id_vars=['speech_num'],value_vars=list(range(0,k_search_dist+1))).\
    rename({'variable':'rank','value': 'distance'},axis='columns')

sim_matrix = pd.melt(sim_speeches_pd,
    id_vars=['speech_num'],value_vars=list(range(0,k_search_dist+1))).\
    rename({'variable': 'rank','value':'speech_num_match'},axis='columns')
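
The reshaping in this cell turns a wide table (one column per neighbor rank) into a long one (one row per speech and rank). The pure-Python sketch below mimics that reshaping on hypothetical data; the helper `melt` here is a made-up stand-in for illustration, not the pandas function.

```python
# Hypothetical wide table: one row per speech, one column per neighbor rank.
wide = [
    {"speech_num": 0, 0: 0, 1: 2, 2: 1},
    {"speech_num": 1, 0: 1, 1: 2, 2: 0},
]

def melt(rows, id_var, value_vars, var_name, value_name):
    """Turn each (row, column) cell into its own record, column by column."""
    long_rows = []
    for var in value_vars:
        for row in rows:
            long_rows.append(
                {id_var: row[id_var], var_name: var, value_name: row[var]})
    return long_rows

long_form = melt(wide, "speech_num", [0, 1, 2], "rank", "speech_num_match")
for record in long_form:
    print(record)
```

Each record now carries its own `rank`, which is what lets the distance and index tables be joined on `speech_num` and `rank` later.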

Show the most 'similar' State of the Union speeches according to spaCy document embeddings

In [48]:
# Only keep one unique pair of matches and don't keep rows that match the same speech to itself
simdist_matrix = sim_matrix[(sim_matrix['speech_num'] != sim_matrix['speech_num_match']) & \
                        (sim_matrix['speech_num'] < sim_matrix['speech_num_match'])].\
    merge(dist_matrix,on=['speech_num','rank']).\
    merge(speeches_pd[['president','year']],left_on='speech_num',right_index=True).\
    merge(speeches_pd[['president','year']],left_on='speech_num_match',right_index=True,suffixes=['','_match']).\
    sort_values('distance')
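
Nearest-neighbor lists are not symmetric: speech B can be among A's nearest neighbors without A being among B's. Keeping only rows with `speech_num < speech_num_match` therefore drops a pair when it appears only in the higher-numbered speech's list. One possible alternative, sketched here on hypothetical rows with a made-up helper `unique_pairs`, is to put each pair in a canonical order and then deduplicate.

```python
# Hypothetical (speech_num, speech_num_match, distance) rows.
rows = [
    (0, 1, 0.2),
    (1, 0, 0.2),   # same pair as above, seen from the other side
    (2, 0, 0.5),   # appears only in speech 2's neighbor list
    (1, 1, 0.0),   # self-match
]

def unique_pairs(rows):
    """Keep one record per unordered pair and drop self-matches."""
    seen = {}
    for a, b, dist in rows:
        if a == b:
            continue
        key = (min(a, b), max(a, b))
        seen.setdefault(key, dist)
    return sorted((dist, pair) for pair, dist in seen.items())

pairs = unique_pairs(rows)
strict = [r for r in rows if r[0] < r[1]]  # the "<" filter, for comparison
print(pairs)
print(strict)
```

In this toy data the `<` filter keeps only the pair (0, 1), while the canonical-order version also keeps (0, 2).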

The most similar speeches

In [50]:
simdist_matrix.head(5)

| | speech_num | rank | speech_num_match | distance | president | year | president_match | year_match |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 58 | 97 | 1 | 99 | 0.14449 | Grover Cleveland | 1886 | Grover Cleveland | 1888 |
| 40 | 66 | 1 | 67 | 0.161351 | Franklin Pierce | 1855 | Franklin Pierce | 1856 |
| 57 | 96 | 1 | 97 | 0.162432 | Grover Cleveland | 1885 | Grover Cleveland | 1886 |
| 91 | 166 | 1 | 167 | 0.165062 | Dwight D. Eisenhower | 1955 | Dwight D. Eisenhower | 1956 |
| 55 | 92 | 1 | 96 | 0.17533 | Chester A. Arthur | 1881 | Grover Cleveland | 1885 |


Now let's exclude pairs where both speeches are by the same president

In [56]:
simdist_matrix[simdist_matrix['president'] != simdist_matrix['president_match']].head(10)

| | speech_num | rank | speech_num_match | distance | president | year | president_match | year_match |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 55 | 92 | 1 | 96 | 0.17533 | Chester A. Arthur | 1881 | Grover Cleveland | 1885 |
| 187 | 92 | 2 | 99 | 0.184178 | Chester A. Arthur | 1881 | Grover Cleveland | 1888 |
| 54 | 91 | 1 | 100 | 0.184329 | Rutherford B. Hayes | 1880 | Benjamin Harrison | 1889 |
| 305 | 92 | 3 | 97 | 0.18439 | Chester A. Arthur | 1881 | Grover Cleveland | 1886 |
| 439 | 96 | 4 | 100 | 0.187086 | Grover Cleveland | 1885 | Benjamin Harrison | 1889 |
| 572 | 92 | 5 | 100 | 0.192889 | Chester A. Arthur | 1881 | Benjamin Harrison | 1889 |
| 32 | 50 | 1 | 53 | 0.196045 | Martin Van Buren | 1839 | John Tyler | 1842 |
| 31 | 47 | 1 | 50 | 0.207403 | Andrew Jackson | 1836 | Martin Van Buren | 1839 |
| 53 | 89 | 1 | 92 | 0.20749 | Rutherford B. Hayes | 1878 | Chester A. Arthur | 1881 |
| 192 | 102 | 2 | 104 | 0.20759 | Benjamin Harrison | 1891 | Grover Cleveland | 1893 |


Most similar speeches since 1950 (between different presidents)

In [58]:
simdist_matrix[(simdist_matrix['president'] != simdist_matrix['president_match']) & \
              ((simdist_matrix['year'] >= 1950 ) & (simdist_matrix['year_match'] >= 1950 ))].head(15)

| | speech_num | rank | speech_num_match | distance | president | year | president_match | year_match |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 104 | 190 | 1 | 195 | 0.232852 | Jimmy Carter | 1978 | Ronald Reagan | 1983 |
| 103 | 188 | 1 | 194 | 0.266203 | Gerald Ford | 1976 | Ronald Reagan | 1982 |
| 498 | 210 | 4 | 226 | 0.277419 | Bill Clinton | 1998 | Barack Obama | 2014 |
| 625 | 210 | 5 | 225 | 0.279423 | Bill Clinton | 1998 | Barack Obama | 2013 |
| 109 | 203 | 1 | 218 | 0.284601 | George Herbert Walker Bush | 1991 | George W. Bush | 2006 |
| 500 | 212 | 4 | 225 | 0.291351 | Bill Clinton | 2000 | Barack Obama | 2013 |
| 231 | 188 | 2 | 195 | 0.294004 | Gerald Ford | 1976 | Ronald Reagan | 1983 |
| 339 | 170 | 3 | 173 | 0.297012 | Dwight D. Eisenhower | 1959 | John F. Kennedy | 1961 |
| 350 | 188 | 3 | 190 | 0.300386 | Gerald Ford | 1976 | Jimmy Carter | 1978 |
| 499 | 211 | 4 | 225 | 0.304767 | Bill Clinton | 1999 | Barack Obama | 2013 |


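Most distant nearest-neighbor pairs since 1900. Because `simdist_matrix` only holds each speech's 5 nearest neighbors, these are the least similar of the close matches, not the most dissimilar speeches overall.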

In [62]:
simdist_matrix[((simdist_matrix['year'] >= 1900 ) & (simdist_matrix['year_match'] >= 1900 ))].\
                sort_values('distance',ascending=False).head(15)

| | speech_num | rank | speech_num_match | distance | president | year | president_match | year_match |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 592 | 152 | 5 | 161 | 0.613455 | Franklin Delano Roosevelt | 1942 | Harry S. Truman | 1951 |
| 462 | 152 | 4 | 163 | 0.60022 | Franklin Delano Roosevelt | 1942 | Harry S. Truman | 1953 |
| 581 | 128 | 5 | 150 | 0.583449 | Woodrow Wilson | 1917 | Franklin Delano Roosevelt | 1940 |
| 593 | 153 | 5 | 154 | 0.553815 | Franklin Delano Roosevelt | 1943 | Franklin Delano Roosevelt | 1944 |
| 484 | 187 | 4 | 190 | 0.543752 | Gerald Ford | 1975 | Jimmy Carter | 1978 |
| 318 | 128 | 3 | 151 | 0.543523 | Woodrow Wilson | 1917 | Franklin Delano Roosevelt | 1941 |
| 463 | 153 | 4 | 163 | 0.535268 | Franklin Delano Roosevelt | 1943 | Harry S. Truman | 1953 |
| 317 | 127 | 3 | 134 | 0.534764 | Woodrow Wilson | 1916 | Calvin Coolidge | 1923 |
| 202 | 128 | 2 | 154 | 0.531889 | Woodrow Wilson | 1917 | Franklin Delano Roosevelt | 1944 |
| 615 | 192 | 5 | 203 | 0.523964 | Jimmy Carter | 1980 | George Herbert Walker Bush | 1991 |
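
A ranking over every pair of speeches, rather than only nearest-neighbor pairs, needs the full set of pairwise distances. The sketch below shows one way to compute it, using hypothetical 2-D vectors and a made-up helper `farthest_pairs`; in the notebook the inputs would be the rows of `speeches_embed`.

```python
import math
from itertools import combinations

# Hypothetical 2-D vectors keyed by speech number.
vectors = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (0.0, 3.0), 3: (6.0, 8.0)}

def farthest_pairs(vectors, top_n):
    """Rank every unordered pair by Euclidean distance, largest first."""
    pairs = [(math.dist(vectors[a], vectors[b]), a, b)
             for a, b in combinations(sorted(vectors), 2)]
    return sorted(pairs, reverse=True)[:top_n]

top = farthest_pairs(vectors, 2)
print(top)
```

For n speeches this computes n(n-1)/2 distances, which stays manageable for a corpus of a few hundred speeches.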
