# Lede Algorithms -- Assignment 2

In this assignment you will use all your text analysis skills to analyze the U.S. State of the Union speeches in the 20th century. 

First, load `state-of-the-union.csv`. This is a standard CSV file with one speech per row. There are two columns: the year of the speech and the text of the speech.

In [5]:
# Some stuff you'll need
import pandas as pd
from textblob import TextBlob
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer


In [6]:
# load 'state-of-the-union.csv'

df = pd.read_csv("state-of-the-union.csv")
df.head()


Unnamed: 0,year,text
0,1790,"George Washington\nJanuary 8, 1790\n\nFellow-C..."
1,1790,\nState of the Union Address\nGeorge Washingto...
2,1791,\nState of the Union Address\nGeorge Washingto...
3,1792,\nState of the Union Address\nGeorge Washingto...
4,1793,\nState of the Union Address\nGeorge Washingto...


We will work with only the speeches from the 20th century, so start by keeping only the rows with a year between 1900 and 1999.

In [7]:
df_20_century = df[(df.year >= 1900) & (df.year <= 1999)]
df_20_century.head()

Unnamed: 0,year,text
111,1900,\nState of the Union Address\nWilliam McKinley...
112,1901,\nState of the Union Address\nTheodore Rooseve...
113,1902,\nState of the Union Address\nTheodore Rooseve...
114,1903,\nState of the Union Address\nTheodore Rooseve...
115,1905,\nState of the Union Address\nTheodore Rooseve...


In [8]:
df_20_century.reset_index()

Unnamed: 0,index,year,text
0,111,1900,\nState of the Union Address\nWilliam McKinley...
1,112,1901,\nState of the Union Address\nTheodore Rooseve...
2,113,1902,\nState of the Union Address\nTheodore Rooseve...
3,114,1903,\nState of the Union Address\nTheodore Rooseve...
4,115,1905,\nState of the Union Address\nTheodore Rooseve...
5,116,1905,\nState of the Union Address\nTheodore Rooseve...
6,117,1906,\nState of the Union Address\nTheodore Rooseve...
7,118,1907,\nState of the Union Address\nTheodore Rooseve...
8,119,1908,\nState of the Union Address\nTheodore Rooseve...
9,120,1909,\nState of the Union Address\nWilliam H. Taft\...


The first step in your analysis task will be to tokenize each document in this set and create a dataframe of tf-idf vectors. We're going to need to tokenize first, so write (or cut and paste!) a tokenizer function that takes a string and returns a list of standardized tokens.

In [9]:
# STEPS
# 1. split into words (tokens)
# 2. weight the words (tf-idf)
# 3. compute the cosine distance (1 - cosine similarity): the closer to zero, the more similar
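
To make step 3 concrete, here is a minimal sketch of cosine distance on toy vectors. The function name `cosine_distance`, the three-word vocabulary, and the counts are all made up for illustration; they are not part of the assignment data.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length lists of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

# Toy word-count vectors over a hypothetical vocabulary ["war", "peace", "tax"]
doc_a = [2, 0, 1]
doc_b = [4, 0, 2]   # same direction as doc_a, just a longer speech
doc_c = [0, 3, 0]   # shares no words with doc_a

same_direction = cosine_distance(doc_a, doc_b)
no_overlap = cosine_distance(doc_a, doc_c)
print(same_direction, no_overlap)
```

Documents that use the same words in the same proportions end up at a distance of (almost exactly) zero regardless of length, while documents with no words in common end up at a distance of one.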

In [10]:
df_20_century.dtypes

year     int64
text    object
dtype: object

In [11]:
def tokenize(s):
    blob = TextBlob(s.lower()) # makes the string lowercase
    words = [token for token in blob.words if len(token)>2] # gives me only strings longer than 2 letters 
    return words

Good stuff. Now use this to create a matrix of tf-idf vectors for the document set.

In [12]:
vectorizer = TfidfVectorizer(stop_words='english', tokenizer=tokenize)

matrix = vectorizer.fit_transform(df_20_century.text)

# The easiest way to see what happened is to make a dataframe
tfidf = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names())
tfidf.head()

Unnamed: 0,'70,'76,'82,'86,'89,'90,'follow,'forties,'ll,'re,...,zest,zigzag,zimbabwe,zimbabwean,zinc,zion,zone,zones,zoological,zooming
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005191,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010902,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.018962,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.014207,0.0,0.0,0.0
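
To get a feel for what these weights mean, the sketch below computes a simplified tf-idf by hand: raw term count times log(N / document frequency). This is an illustration only. scikit-learn's `TfidfVectorizer` by default smooths the idf term and normalizes each row, so its numbers will not match these. The three tiny documents and the helper `simple_tfidf` are hypothetical.

```python
import math
from collections import Counter

# Three tiny hypothetical "speeches", already tokenized
docs = [
    ["war", "peace", "war"],
    ["peace", "tax"],
    ["peace", "tax", "tax"],
]

n_docs = len(docs)
# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))

def simple_tfidf(doc):
    """Raw term count times log(N / document frequency)."""
    counts = Counter(doc)
    return {term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()}

weights = [simple_tfidf(doc) for doc in docs]
print(weights[0])
```

A term that appears in every document ("peace" here) gets weight zero in this simplified version, while a term that is frequent in one document and rare elsewhere ("war") gets a high weight.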


You're probably going to want a way to print out the most highly weighted terms as well, so we'll use print_sorted_vector from the lesson notebook:

In [13]:
def print_sorted_vector(v):
    # this "lambda" thing is an anonymous function, google it to unlock bonus coding knowledge
    sorted_list = sorted(v.items(), key=lambda x: (x[1],x[0]), reverse=True) 
    sorted_list = sorted_list[:20]
    print('\n'.join([str(x) for x in sorted_list]))
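
As a small illustration of the `key=lambda ...` sort used above, here it is applied to a plain dictionary. The terms and weights are made up for the example.

```python
# Hypothetical term weights for one speech
weights = {"tariff": 0.13, "government": 0.21, "courts": 0.10, "states": 0.14}

# key tells sorted() what to compare: weight first, then the term to break ties
ranked = sorted(weights.items(), key=lambda x: (x[1], x[0]), reverse=True)
top_two = ranked[:2]
print(top_two)
```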

Print out a few of the State of The Union vectors for individual speeches to get a sense of what's happening here.

In [14]:
print_sorted_vector(tfidf.iloc[0])

('1899', 0.22654730615726476)
('islands', 0.2246593569693863)
('1900', 0.21938859199604244)
('government', 0.20032638731025443)
('states', 0.1437741785528732)
('legations', 0.12557098475535067)
('united', 0.11878180052704417)
('shall', 0.11424814550261289)
('chinese', 0.11331900768469443)
('commission', 0.11057028102969554)
('convention', 0.10687709369836014)
('peking', 0.1004343441254551)
('congress', 0.10009594709377248)
('imperial', 0.09984173648944777)
('hon', 0.08975559229252457)
('year', 0.08895845437608177)
('powers', 0.08078328815850949)
('legation', 0.0803877537977391)
('philippine', 0.07656038457066006)
('people', 0.07643690505342625)


In [15]:
print_sorted_vector(tfidf.iloc[9])

('government', 0.20864894814615967)
('department', 0.18568403338535452)
('states', 0.14462933264276034)
('1909', 0.1373538808061215)
('united', 0.13695813245509583)
('tariff', 0.13014746585538778)
('shall', 0.11673368819956711)
('american', 0.11012547101953016)
('courts', 0.10444520838483083)
('officers', 0.09379370845399901)
('congress', 0.09297599955606022)
('court', 0.09214027163145355)
('conference', 0.09058815906620958)
('state', 0.09039333290172522)
('observatory', 0.08916083124642232)
('convention', 0.08847406488324566)
('great', 0.0878106662473902)
('commerce', 0.08504694894850054)
('board', 0.08212411583063019)
('law', 0.08163005816382375)


Now sum the vectors for each decade, and print out the results. Do you see any themes? Can you connect the terms to major historical events? (wars, the Great Depression, assassinations, the civil rights movement, Watergate…)

In [16]:
def doc2vec_count(s):
    tokens = tokenize(s)
    vec = {}
    for t in tokens:
        vec[t] = vec.get(t, 0) + 1
    return vec

In [17]:
df_20_century.text.head()

111    \nState of the Union Address\nWilliam McKinley...
112    \nState of the Union Address\nTheodore Rooseve...
113    \nState of the Union Address\nTheodore Rooseve...
114    \nState of the Union Address\nTheodore Rooseve...
115    \nState of the Union Address\nTheodore Rooseve...
Name: text, dtype: object
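
One possible way to sum vectors by decade, sketched with dictionary-style vectors like the ones `doc2vec_count` returns. The years, terms, and weights below are invented for illustration, and the variable names `speeches` and `decade_totals` are assumptions, not part of the notebook.

```python
from collections import Counter, defaultdict

# Hypothetical (year, vector) pairs; each vector maps term -> weight
speeches = [
    (1901, {"tariff": 0.5, "navy": 0.5}),
    (1909, {"tariff": 1.0}),
    (1917, {"war": 2.0, "navy": 0.5}),
]

decade_totals = defaultdict(Counter)
for year, vec in speeches:
    decade = year // 10 * 10           # 1901 -> 1900, 1917 -> 1910
    decade_totals[decade].update(vec)  # Counter.update adds values term by term

for decade in sorted(decade_totals):
    print(decade, decade_totals[decade].most_common(2))
```

With the `tfidf` dataframe, a roughly equivalent pandas approach might be to group its rows by an array of decades such as `(df_20_century.year // 10 * 10).values` and call `.sum()` on the groups. That is a suggestion under the assumption that the rows of `tfidf` are in the same order as the rows of `df_20_century`.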

Which two decades are most similar, according to the cosine similarity of their average vectors? You will need to use a double loop that compares every pair of decades and finds the pair with the smallest distance.

In [18]:
import math

def doc_distance(a_vec,b_vec):
    # First we have to compute similarity. The idea is the same as doc_similarity, but
    # because we are using arrays and not dictionaries, we can just multiply the elements
    # pairwise and sum the products. This is what numpy's dot function does
    a_vec = a_vec/math.sqrt(sum([x*x for x in a_vec]))
    b_vec = b_vec/math.sqrt(sum([x*x for x in b_vec]))
    similarity = a_vec.dot(b_vec)
    # Because the vectors are already normalized, similarity will be 1 if equal, 0 if disjoint
    # We want things the other way around
    return 1-similarity

# helpful little function for distance between rows i and j
# NOTE: df2 is never defined in this notebook; it needs to be a dataframe
# holding one summed tf-idf vector per decade
def dij(i,j):
    return doc_distance(df2.iloc[i], df2.iloc[j])

In [19]:
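smallest_value = float('inf')
smallest_index = None

# compare every pair of decades once (assumes df2 has one row per decade, 10 in total)
for i in range(10):
    for j in range(i+1,10):
        d = dij(i,j)
        if d < smallest_value:
            smallest_value = d
            smallest_index = (i,j)
print(smallest_index, smallest_value)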

NameError: name 'df2' is not defined
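
For reference, here is a self-contained sketch of the pairwise comparison on toy data. The decade vectors are invented, and `cosine_distance` and `decade_vectors` are hypothetical names; the point is only to show the shape of the search once real per-decade vectors exist. `itertools.combinations` visits each unordered pair once, which is what the `i < j` double loop does.

```python
import math
from itertools import combinations

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length lists of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

# Hypothetical decade vectors over a three-term vocabulary
decade_vectors = {
    1900: [3.0, 0.0, 1.0],
    1910: [0.0, 2.0, 0.0],
    1920: [3.0, 0.5, 1.0],
}

# Pick the pair of decades with the smallest cosine distance
closest_pair = min(
    combinations(sorted(decade_vectors), 2),
    key=lambda pair: cosine_distance(decade_vectors[pair[0]],
                                     decade_vectors[pair[1]]),
)
print(closest_pair)
```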

Write a 500-word (max) article on what U.S. presidents discussed in their SOTU speeches in the 20th century. You should obviously use your tf-idf analysis as a primary source *but* you will not be able to complete this without actually reading some of the speeches and comparing them to other historical references.

Turn in this notebook, with your article below.
    

(your SOTU article here)