# Lede Algorithms -- Assignment 2

In this assignment you will use all your text analysis skills to analyze the U.S. State of the Union speeches in the 20th century. 

First, load `state-of-the-union.csv`. This is a standard CSV file with one speech per row. There are two columns: the year of the speech, and the text of the speech.

In [1]:
# Some stuff you'll need
import pandas as pd
from textblob import TextBlob
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
import math

In [2]:
# load 'state-of-the-union.csv'
df = pd.read_csv('state-of-the-union.csv')
df.head()

Unnamed: 0,year,text
0,1790,"George Washington\nJanuary 8, 1790\n\nFellow-C..."
1,1790,\nState of the Union Address\nGeorge Washingto...
2,1791,\nState of the Union Address\nGeorge Washingto...
3,1792,\nState of the Union Address\nGeorge Washingto...
4,1793,\nState of the Union Address\nGeorge Washingto...


We will work with only those speeches in the 20th century, so start by keeping only the rows with a year between 1900 and 1999.

In [3]:
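# keep the years 1900-1999 and renumber the rows from 0
new_df = df[(df.year >= 1900) & (df.year < 2000)].reset_index(drop=True)
new_df['year'] = new_df['year'].astype(str)
# the third digit of the year identifies the decade (raw string avoids an invalid-escape warning)
new_df['decade'] = new_df.year.str.extract(r'\d\d(\d)\d')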
new_df.head()

Unnamed: 0,year,text,decade
0,1900,\nState of the Union Address\nWilliam McKinley...,0
1,1901,\nState of the Union Address\nTheodore Rooseve...,0
2,1902,\nState of the Union Address\nTheodore Rooseve...,0
3,1903,\nState of the Union Address\nTheodore Rooseve...,0
4,1905,\nState of the Union Address\nTheodore Rooseve...,0
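
The decade digit can also be derived arithmetically from an integer year, which avoids the string cast and the regular expression. This is only a minimal sketch: `decade_digit` is a hypothetical helper that is not part of the notebook, and it returns an integer, whereas the `decade` column above holds strings.

```python
def decade_digit(year):
    """Return the tens digit of a four-digit year, e.g. 1955 -> 5."""
    # integer-divide by 10 to drop the last digit, then keep the new last digit
    return (year // 10) % 10


# hypothetical sample years, chosen to cover the edges of a decade
years = [1900, 1909, 1910, 1955, 1999]
print([decade_digit(y) for y in years])
```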


The first step in your analysis task will be to tokenize each document in this set and create a dataframe of tf-idf vectors. We're going to need to tokenize first, so write (or cut and paste!) a tokenizer function that takes a string and returns a list of standardized tokens.

In [4]:
def tokenize(s):
    blob = TextBlob(s.lower())
    words = [token for token in blob.words if len(token)>2]
    return words
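
To check what a tokenizer like this is expected to do (lowercase everything, drop tokens of two characters or fewer), it can help to try the same idea on one sentence. The sketch below is a stand-in that uses a regular expression instead of TextBlob so it runs without extra downloads; `tokenize_sketch` is a hypothetical name, and its output will not match TextBlob's tokenization in every case.

```python
import re


def tokenize_sketch(s):
    """Lowercase the text, pull out runs of letters, digits, apostrophes
    and hyphens, and drop tokens of length 2 or less."""
    tokens = re.findall(r"[a-z0-9'-]+", s.lower())
    return [t for t in tokens if len(t) > 2]


print(tokenize_sketch("The State of the Union is strong."))
```

Short words such as "of" and "is" disappear here because of the length filter, before any stop-word list is applied.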

Good stuff. Now use this to create a matrix of tf-idf vectors for the document set.

In [5]:
# tfidf = something
vectorizer = TfidfVectorizer(stop_words='english', tokenizer=tokenize)

matrix = vectorizer.fit_transform(new_df.text)

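# get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0, call it instead
tfidf = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())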
tfidf.head()

Unnamed: 0,'70,'76,'82,'86,'89,'90,'follow,'forties,'ll,'re,...,zest,zigzag,zimbabwe,zimbabwean,zinc,zion,zone,zones,zoological,zooming
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005191,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010902,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.018962,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.014207,0.0,0.0,0.0
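
To see where these weights come from, here is a small sketch of the calculation in plain Python. It assumes the scikit-learn defaults (`smooth_idf=True`, `norm='l2'`): each weight is the raw term count times a smoothed inverse document frequency, and each document vector is then scaled to unit length. `tfidf_sketch` and the toy documents are hypothetical, and the sketch skips tokenization and stop-word removal.

```python
import math
from collections import Counter


def tfidf_sketch(docs):
    """docs is a list of token lists; returns one {term: weight} dict per document."""
    n = len(docs)
    # document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    # smoothed idf, assuming scikit-learn's default formula
    idf = {term: math.log((1 + n) / (1 + d)) + 1 for term, d in df.items()}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        raw = {term: count * idf[term] for term, count in counts.items()}
        # L2-normalize so every document vector has length 1 (guard against empty docs)
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({term: w / norm for term, w in raw.items()})
    return vectors


# toy corpus, already tokenized
docs = [["war", "peace", "war"], ["peace", "economy"], ["economy", "jobs", "jobs"]]
for vector in tfidf_sketch(docs):
    print(vector)
```

In the first toy document, "war" outweighs "peace" both because it occurs twice and because it appears in fewer documents.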


You're probably going to want a way to print out the most highly weighted terms as well, so we'll use `print_sorted_vector` from the lesson notebook:

In [6]:
def print_sorted_vector(v):
    # this "lambda" thing is an anonymous function; google it to unlock bonus coding knowledge
    sorted_list = sorted(v.items(), key=lambda x: (x[1],x[0]), reverse=True) 
    sorted_list = sorted_list[:20]
    print('\n'.join([str(x) for x in sorted_list]))

Print out a few of the State of The Union vectors for individual speeches to get a sense of what's happening here.

In [7]:
print_sorted_vector(tfidf.iloc[50])

('soviet', 0.2304437044446423)
('world', 0.22722433857476837)
('free', 0.22567081516259663)
('nations', 0.19855480295662237)
('defense', 0.13293358963473945)
('aggression', 0.1287224355539119)
('fighting', 0.125839715280979)
('military', 0.12521387085709124)
('strength', 0.12462524028256822)
('korea', 0.11834619341829287)
('need', 0.10532988440677407)
('planes', 0.10330457167111418)
('b-36', 0.09989729985417252)
('b-17', 0.09989729985417252)
('europe', 0.09612406045962713)
('freedom', 0.09588247112568013)
('rulers', 0.0871553745602872)
('atlantic', 0.08622686498856352)
('kremlin', 0.08585715794809681)
('countries', 0.08364431525952326)


Now sum the vectors for each decade, and print out the results. Do you see any themes? Can you connect the terms to major historical events? (wars, the great depression, assassinations, the civil rights movement, Watergate…)

In [8]:
tfidf['decade'] = new_df['decade']
df_decade = tfidf.groupby(by='decade').sum()
df_decade

decade,'70,'76,'82,'86,'89,'90,'follow,'forties,'ll,'re,...,zest,zigzag,zimbabwe,zimbabwean,zinc,zion,zone,zones,zoological,zooming
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.058793,0.005191,0.010902,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.033309,0.0,0.0,0.0,0.0,0.0,0.065645,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.02241,0.009414,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.02569,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.042793,0.0,0.0,...,0.0,0.0,0.0,0.0,0.030765,0.0,0.005952,0.00291,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.02065,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016261,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.03786,0.03786,0.0,0.0,0.0,0.0,0.0,0.0,0.106116,0.0,...,0.029912,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.004846,0.028066,0.0,0.0265,0.0,0.0,0.488972,1.153217,...,0.0,0.0,0.009692,0.004846,0.0,0.0,0.035879,0.099337,0.0,0.02775
9,0.0,0.0,0.0,0.0,0.066209,0.0,0.019749,0.0,0.488245,1.110055,...,0.0,0.0,0.0,0.0,0.0,0.016243,0.020409,0.10876,0.0,0.0


In [9]:
df_decade = df_decade.reset_index().drop(columns='decade')
df_decade

Unnamed: 0,'70,'76,'82,'86,'89,'90,'follow,'forties,'ll,'re,...,zest,zigzag,zimbabwe,zimbabwean,zinc,zion,zone,zones,zoological,zooming
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.058793,0.005191,0.010902,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.033309,0.0,0.0,0.0,0.0,0.0,0.065645,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.02241,0.009414,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.02569,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.042793,0.0,0.0,...,0.0,0.0,0.0,0.0,0.030765,0.0,0.005952,0.00291,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.02065,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016261,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.03786,0.03786,0.0,0.0,0.0,0.0,0.0,0.0,0.106116,0.0,...,0.029912,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.004846,0.028066,0.0,0.0265,0.0,0.0,0.488972,1.153217,...,0.0,0.0,0.009692,0.004846,0.0,0.0,0.035879,0.099337,0.0,0.02775
9,0.0,0.0,0.0,0.0,0.066209,0.0,0.019749,0.0,0.488245,1.110055,...,0.0,0.0,0.0,0.0,0.0,0.016243,0.020409,0.10876,0.0,0.0


In [10]:
for i in range(0,10):
    print('-----19'+str(i)+'0\'s-----')
    print_sorted_vector(df_decade.iloc[i])

-----1900's-----
('government', 1.4445796432453253)
('states', 1.0881463675885807)
('law', 1.0732517617704944)
('great', 0.9456746810258504)
('congress', 0.8780189191137404)
('united', 0.8463550391878844)
('people', 0.8265085497821171)
('islands', 0.8032093063814292)
('public', 0.7769135333929027)
('country', 0.7702163511955874)
('work', 0.762656216718705)
('men', 0.7439955776626556)
('navy', 0.6973315900628397)
('officers', 0.6941297703771574)
('man', 0.6940588406128467)
('business', 0.6764629849981343)
('corporations', 0.6700926675281301)
('commerce', 0.6346252800536611)
('service', 0.6305853459735535)
('department', 0.6231707624597032)
-----1910's-----
('government', 1.1157823462813592)
('shall', 0.8862784515269287)
('great', 0.8181069885482202)
('congress', 0.8150409836345999)
('country', 0.7905992904838366)
('states', 0.7730897844553355)
('war', 0.7208245824536633)
('men', 0.6681689735962684)
('united', 0.6368153298067745)
('people', 0.6231947653699088)
('necessary', 0.61278960710

Which two decades are most similar, according to the cosine similarity of their average vectors? You will need to use a double loop that compares every pair of decades and finds the pair with the smallest distance.

In [11]:
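def doc_distance(a_vec,b_vec):
    # NOTE: the dot product equals cosine similarity only for unit-length vectors;
    # the summed decade vectors are not unit length, so this value can be negative
    similarity = a_vec.dot(b_vec)
    return 1-similarity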

def dij(i,j):
    return doc_distance(df_decade.iloc[i], df_decade.iloc[j])

In [12]:
# compare every pair of decades (this includes i == j and both orderings of each pair)
closest = []
for i in range(0,10):
    for j in range(0,10):
        distance = (i, j, dij(i,j))
        closest.append(distance)

In [13]:
closest.sort(key=lambda x: x[2])
closest[:10]

[(5, 5, -57.799009388107635),
 (0, 0, -55.796699906588586),
 (9, 9, -51.25167050480036),
 (6, 6, -49.64437165665301),
 (2, 2, -46.665101045586695),
 (8, 8, -44.27074720291066),
 (7, 7, -43.494513368115456),
 (5, 6, -41.21943912519605),
 (6, 5, -41.21943912519605),
 (0, 2, -38.21269548728503)]
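
The negative distances above, and the fact that every decade comes out closest to itself, point to two things worth adjusting. First, the decade vectors are sums of several unit-length tf-idf vectors, so they are no longer unit length and a plain dot product is not a cosine similarity. Second, the double loop compares each decade with itself and visits each pair twice. Below is a minimal sketch of one way to handle both, in plain Python on toy data; `cosine_distance`, `closest_pair`, and the toy vectors are hypothetical and not part of the notebook.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length, non-zero numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # dividing by both lengths removes the effect of vector magnitude
    return 1 - dot / (norm_a * norm_b)


def closest_pair(vectors):
    """Return (i, j, distance) for the two distinct vectors with the smallest distance."""
    # combinations() yields each unordered pair once and never pairs i with itself
    pairs = [(i, j, cosine_distance(vectors[i], vectors[j]))
             for i, j in combinations(range(len(vectors)), 2)]
    return min(pairs, key=lambda p: p[2])


# toy "decade" vectors: the first two point the same way, the third is orthogonal
toy = [[3.0, 0.0, 1.0], [6.0, 0.0, 2.0], [0.0, 5.0, 0.0]]
print(closest_pair(toy))
```

Because the division normalizes each vector, it does not matter whether the decade vectors are sums or averages: both point in the same direction. With the notebook's data, something like `closest_pair(df_decade.values.tolist())` should apply the same idea, assuming `df_decade` holds only the numeric term columns.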

Write a 500 word (max) article on what U.S. presidents discussed in their SOTU speeches in the 20th century. You should obviously use your tf-idf analysis as a primary source *but* you will not be able to complete this without actually reading some of the speeches, and comparing them to other historical references.

Turn in this notebook, with your article below.
    

In [15]:
new_df.text[55].split('\n')

['',
 'State of the Union Address',
 'Dwight D. Eisenhower',
 'January 6, 1955',
 '',
 'Mr. President, Mr. Speaker, Members of the Congress:',
 '',
 'First, I extend cordial greetings to the 84th Congress. We shall have much',
 'to do together; I am sure that we shall get it done--and, that we shall do',
 'it in harmony and good will.',
 '',
 'At the outset, I believe it would be well to remind ourselves of this great',
 'fundamental in our national life: our common belief that every human being',
 'is divinely endowed with dignity and worth and inalienable rights. This',
 'faith, with its corollary--that to grow and flourish people must be',
 'free--shapes the interests and aspirations of every American.',
 '',
 'From this deep faith have evolved three main purposes of our Federal',
 'Government:',
 '',
 'First, to maintain justice and freedom among ourselves and to champion them',
 'for others so that we may work effectively for enduring peace;',
 '',
 'Second, to help keep our econo

(your SOTU article here)

According to the analysis above, the speeches of the 1950s and 1960s are similar to each other. After World War II, the United States was still involved in wars around the world, and "world" order and "peace/free(dom)" were the major issues in the presidents' speeches.