# Lede Algorithms -- Assignment 2

In this assignment you will use all your text analysis skills to analyze the U.S. State of the Union speeches in the 20th century. 

First, load `state-of-the-union.csv`. This is a standard CSV file with one speech per row. There are two columns: the year of the speech, and the text of the speech.

In [1]:
# Some stuff you'll need
import pandas as pd
from textblob import TextBlob
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
df = pd.read_csv("state-of-the-union.csv")
df.head()

Unnamed: 0,year,text
0,1790,"George Washington\nJanuary 8, 1790\n\nFellow-C..."
1,1790,\nState of the Union Address\nGeorge Washingto...
2,1791,\nState of the Union Address\nGeorge Washingto...
3,1792,\nState of the Union Address\nGeorge Washingto...
4,1793,\nState of the Union Address\nGeorge Washingto...


We will work with only those speeches in the 20th century, so start by keeping only the rows with a year between 1900 and 1999.

In [3]:
century = df[(df['year'] > 1899) & (df['year'] < 2000)]
century.head()

Unnamed: 0,year,text
111,1900,\nState of the Union Address\nWilliam McKinley...
112,1901,\nState of the Union Address\nTheodore Rooseve...
113,1902,\nState of the Union Address\nTheodore Rooseve...
114,1903,\nState of the Union Address\nTheodore Rooseve...
115,1905,\nState of the Union Address\nTheodore Rooseve...


The first step in your analysis task will be to tokenize each document in this set and create a dataframe of tf-idf vectors. We're going to need to tokenize first, so write (or cut and paste!) a tokenizer function that takes a string and returns a list of standardized tokens.

In [4]:
# Use a tokenizer to break the text into words or, more accurately, "tokens".
# Tokens are basically words but can also be punctuation and numbers.
# This version lowercases the text and drops tokens of one or two characters.

def tokenize(s):
    blob = TextBlob(s.lower())
    words = [token for token in blob.words if len(token)>2]
    return words
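
If TextBlob or its NLTK corpora are not installed, a rough stand-in can be sketched with the standard library. This is a substitute under stated assumptions, not the tokenizer used above: `simple_tokenize` is a hypothetical name, and a regular expression will split numbers, contractions, and hyphenated words differently from `blob.words`.

```python
import re

def simple_tokenize(s):
    # Lowercase, then pull out runs of letters, digits, and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", s.lower())
    # Mirror the length filter in tokenize(): keep tokens longer than 2 characters.
    return [t for t in tokens if len(t) > 2]

print(simple_tokenize("God bless America"))
```

Because the two tokenizers disagree on edge cases, the tf-idf columns they produce will not match exactly.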

Good stuff. Now use this to create a matrix of tf-idf vectors for the document set.

In [5]:
# tfidf = something

vectorizer = TfidfVectorizer(stop_words='english', tokenizer=tokenize)
matrix = vectorizer.fit_transform(century.text)

# The easiest way to see what happened is to make a dataframe
# (get_feature_names() was removed in scikit-learn 1.2; older versions may still need it)
tfidf = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())
tfidf.head()

Unnamed: 0,'70,'76,'82,'86,'89,'90,'follow,'forties,'ll,'re,...,zest,zigzag,zimbabwe,zimbabwean,zinc,zion,zone,zones,zoological,zooming
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005191,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010902,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.018962,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.014207,0.0,0.0,0.0
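
To see where numbers like these come from, here is a minimal sketch that computes tf-idf by hand on a toy corpus. It assumes the smoothed idf formula and unit-length (L2) scaling that `TfidfVectorizer` uses by default; `tfidf_vectors` and the three toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency (assumed to match scikit-learn's default).
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        # Scale to unit length so long documents do not dominate.
        length = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / length for term, w in weights.items()} if length else {})
    return vectors

# Toy corpus, already tokenized.
docs = [["war", "peace", "war"], ["peace", "economy"], ["economy", "jobs", "jobs"]]
vectors = tfidf_vectors(docs)
for vec in vectors:
    print({term: round(w, 3) for term, w in vec.items()})
```

In the first toy document, "war" outweighs "peace" for two reasons: it occurs more often in that document, and it occurs in fewer documents overall.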


You're probably going to want a way to print out the most highly weighted terms as well, so we'll use `print_sorted_vector` from the lesson notebook:

In [6]:
def print_sorted_vector(v):
    # this "lambda" thing is an anonymous function; google it to unlock bonus coding knowledge
    sorted_list = sorted(v.items(), key=lambda x: (x[1],x[0]), reverse=True) 
    sorted_list = sorted_list[:20]
    print('\n'.join([str(x) for x in sorted_list]))
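
`print_sorted_vector` works on anything with an `.items()` method, which is why it accepts both plain dicts and rows of the tf-idf dataframe. If you want the sorted terms back as a list instead of printed, a small variant might look like this (`top_terms` is a hypothetical helper, not part of the lesson notebook):

```python
def top_terms(v, n=20):
    # Sort by weight, breaking ties by term, highest first (same key as print_sorted_vector).
    return sorted(v.items(), key=lambda x: (x[1], x[0]), reverse=True)[:n]

# Made-up weights, just to show the ordering.
example = {"peace": 0.4, "war": 0.4, "zinc": 0.1, "economy": 0.7}
print(top_terms(example, n=3))
```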

Print out a few of the State of The Union vectors for individual speeches to get a sense of what's happening here.

In [7]:
def doc2vec_count(s):
    tokens = tokenize(s)
    vec = {}
    for t in tokens:
        vec[t] = vec.get(t, 0) + 1
    return vec

In [8]:
doc2vec_count("God bless America")

{'god': 1, 'bless': 1, 'america': 1}

In [9]:
# Row 212 of the full dataframe; the output below shows which speech this is
century_text = df.text[212]
doc = TextBlob(century_text)
doc.sentences

[Sentence("
 State of the Union Address
 William J. Clinton
 January 27, 2000
 
 Mr. Speaker, Mr. Vice President, Members of Congress, honored guests, my
 fellow Americans:
 
 We are fortunate to be alive at this moment in history."),
 Sentence("Never before has
 our nation enjoyed, at once, so much prosperity and social progress with so
 little internal crisis or so few external threats."),
 Sentence("Never before have we had
 such a blessed opportunity--and, therefore, such a profound obligation--
 to build the more perfect union of our founders' dreams."),
 Sentence("We begin the new century with over 20 million new jobs."),
 Sentence("The fastest
 economic growth in more than 30 years; the lowest unemployment rates in 30
 years; the lowest poverty rates in 20 years; the lowest African-American
 and Hispanic unemployment rates on record; the first back-to-back budget
 surpluses in 42 years."),
 Sentence("Next month, America will achieve the longest period of economic growth in
 our 

In [10]:
print_sorted_vector(doc2vec_count(century_text))

('the', 329)
('and', 281)
('our', 132)
('for', 107)
('that', 79)
('more', 53)
('new', 51)
('have', 51)
('must', 47)
('will', 45)
('with', 43)
('this', 43)
('all', 43)
('america', 38)
('make', 34)
('children', 33)
('are', 32)
('years', 30)
('can', 29)
('you', 28)


In [11]:
s = doc.sentences[20]
s

Sentence("We cut
crime: with 100,000 community police and the Brady Law, which has kept guns
out of the hands of half a million criminals.")

In [12]:
s.split(' ')

WordList(['We', 'cut\ncrime:', 'with', '100,000', 'community', 'police', 'and', 'the', 'Brady', 'Law,', 'which', 'has', 'kept', 'guns\nout', 'of', 'the', 'hands', 'of', 'half', 'a', 'million', 'criminals.'])

In [13]:
s.tokens

WordList(['We', 'cut', 'crime', ':', 'with', '100,000', 'community', 'police', 'and', 'the', 'Brady', 'Law', ',', 'which', 'has', 'kept', 'guns', 'out', 'of', 'the', 'hands', 'of', 'half', 'a', 'million', 'criminals', '.'])

In [14]:
for word in s.words:
    print("ORIGINAL:", word, "| LEMMA:", word.lemmatize(), "| STEM:", word.stem())

ORIGINAL: We | LEMMA: We | STEM: We
ORIGINAL: cut | LEMMA: cut | STEM: cut
ORIGINAL: crime | LEMMA: crime | STEM: crime
ORIGINAL: with | LEMMA: with | STEM: with
ORIGINAL: 100,000 | LEMMA: 100,000 | STEM: 100,000
ORIGINAL: community | LEMMA: community | STEM: commun
ORIGINAL: police | LEMMA: police | STEM: polic
ORIGINAL: and | LEMMA: and | STEM: and
ORIGINAL: the | LEMMA: the | STEM: the
ORIGINAL: Brady | LEMMA: Brady | STEM: bradi
ORIGINAL: Law | LEMMA: Law | STEM: law
ORIGINAL: which | LEMMA: which | STEM: which
ORIGINAL: has | LEMMA: ha | STEM: ha
ORIGINAL: kept | LEMMA: kept | STEM: kept
ORIGINAL: guns | LEMMA: gun | STEM: gun
ORIGINAL: out | LEMMA: out | STEM: out
ORIGINAL: of | LEMMA: of | STEM: of
ORIGINAL: the | LEMMA: the | STEM: the
ORIGINAL: hands | LEMMA: hand | STEM: hand
ORIGINAL: of | LEMMA: of | STEM: of
ORIGINAL: half | LEMMA: half | STEM: half
ORIGINAL: a | LEMMA: a | STEM: a
ORIGINAL: million | LEMMA: million | STEM: million
ORIGINAL: criminals | LEMMA: criminal | S

In [16]:
def doc_similarity(a_vec,b_vec):
    total = 0
    for word in a_vec:
        if word in b_vec:
            total += a_vec[word]*b_vec[word]
    return total

In [17]:
a = doc2vec_count(str(doc.sentences[6]))  # need str to convert Sentence object to string
b = doc2vec_count(str(doc.sentences[9]))

In [18]:
print(a)

{'have': 1, 'built': 1, 'new': 1, 'economy': 1}


In [19]:
print(b)

{'welfare': 1, 'rolls': 1, 'cut': 1, 'half': 1, 'their': 1, 'lowest': 1, 'levels': 1, 'years': 1}


In [20]:
doc_similarity(a,b)

0

Now sum the vectors for each decade, and print out the results. Do you see any themes? Can you connect the terms to major historical events? (wars, the great depression, assassinations, the civil rights movement, Watergate…)
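
One way to approach the decade sums, sketched here with plain dictionaries in the style of `doc2vec_count`. The helper name `sum_by_decade` and the toy data are assumptions for illustration, not part of the assignment.

```python
from collections import defaultdict

def sum_by_decade(years, vectors):
    """years: list of ints; vectors: list of {term: weight} dicts, in the same order."""
    decades = defaultdict(lambda: defaultdict(float))
    for year, vec in zip(years, vectors):
        decade = year // 10 * 10  # e.g. 1943 -> 1940
        for term, weight in vec.items():
            decades[decade][term] += weight
    return {decade: dict(terms) for decade, terms in decades.items()}

# Made-up toy data, only to show the shape of the result.
years = [1941, 1943, 1968]
vectors = [{"war": 0.9, "peace": 0.1}, {"war": 0.5, "allies": 0.5}, {"vietnam": 0.8, "war": 0.2}]
decade_totals = sum_by_decade(years, vectors)
for decade in sorted(decade_totals):
    print(decade, decade_totals[decade])
```

With the `tfidf` dataframe, something like `tfidf.groupby(century['year'].values // 10 * 10).sum()` should give the same grouping. The `.values` matters because `century` keeps its original row labels (starting at 111) while `tfidf` is numbered from 0.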

In [21]:
import math  # needed for math.sqrt below

def doc2vec_normalized(s):
    tokens = tokenize(s)
    vec = {}
    for t in tokens:
        vec[t] = vec.get(t, 0) + 1 # get from dict with a default of 0 if missing
        
    length = math.sqrt(sum([x*x for x in vec.values()]))  # length of a vector, according to Pythagoras
    for word,value in vec.items():
        vec[word] /= length
        
    return vec

In [22]:
import math

a = doc2vec_normalized(str(doc.sentences[6]))  # need str to convert Sentence object to string
b = doc2vec_normalized(str(doc.sentences[9]))
c = doc2vec_normalized(str(doc.sentences[5]))

In [23]:
print(a)

{'have': 0.5, 'built': 0.5, 'new': 0.5, 'economy': 0.5}


In [24]:
print(b)

{'welfare': 0.35355339059327373, 'rolls': 0.35355339059327373, 'cut': 0.35355339059327373, 'half': 0.35355339059327373, 'their': 0.35355339059327373, 'lowest': 0.35355339059327373, 'levels': 0.35355339059327373, 'years': 0.35355339059327373}


In [25]:
print(c)

{'next': 0.2773500981126146, 'month': 0.2773500981126146, 'america': 0.2773500981126146, 'will': 0.2773500981126146, 'achieve': 0.2773500981126146, 'the': 0.2773500981126146, 'longest': 0.2773500981126146, 'period': 0.2773500981126146, 'economic': 0.2773500981126146, 'growth': 0.2773500981126146, 'our': 0.2773500981126146, 'entire': 0.2773500981126146, 'history': 0.2773500981126146}


In [32]:
print(doc_similarity(a,b))
print(doc_similarity(b,c))
print(doc_similarity(a,c))

0
0
0
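
All three scores are 0 because these sentences share no tokens once short words are dropped: the dot product only adds up terms that appear in both vectors. Once the vectors are normalized to length 1, that same dot product is the cosine similarity. A small sketch with overlapping vocabularies shows a non-zero score; the word counts here are invented, not taken from the speeches.

```python
import math

def normalize(vec):
    # Divide every count by the vector's Euclidean length.
    length = math.sqrt(sum(x * x for x in vec.values()))
    return {word: x / length for word, x in vec.items()} if length else {}

def cosine_similarity(a_vec, b_vec):
    # Dot product of unit vectors: 1.0 means same direction, 0 means no shared words.
    a, b = normalize(a_vec), normalize(b_vec)
    return sum(a[word] * b[word] for word in a if word in b)

# Hypothetical word counts.
x = {"new": 1, "economy": 1, "jobs": 2}
y = {"new": 1, "jobs": 1, "schools": 1}
z = {"welfare": 1, "rolls": 1}

print(cosine_similarity(x, y))  # shares "new" and "jobs"
print(cosine_similarity(x, z))  # shares nothing
print(cosine_similarity(x, x))  # identical vectors
```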


Which two decades are most similar, according to the cosine similarity of their average vectors? You will need to use a double loop that compares every pair of decades and finds the pair with the smallest distance.
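
The double loop can be sketched as follows. Note that the largest cosine similarity corresponds to the smallest cosine distance (1 minus the similarity), so this version tracks the highest score. The function names and the three toy decade vectors are assumptions for illustration.

```python
import math

def cosine(a_vec, b_vec):
    dot = sum(a_vec[t] * b_vec[t] for t in a_vec if t in b_vec)
    norm_a = math.sqrt(sum(x * x for x in a_vec.values()))
    norm_b = math.sqrt(sum(x * x for x in b_vec.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar_pair(decade_vectors):
    """decade_vectors: {decade: {term: weight}}. Returns (similarity, decade_a, decade_b)."""
    decades = sorted(decade_vectors)
    best = None
    for i, first in enumerate(decades):
        for second in decades[i + 1:]:  # skip self-pairs and mirrored pairs
            score = cosine(decade_vectors[first], decade_vectors[second])
            if best is None or score > best[0]:
                best = (score, first, second)
    return best

# Invented averages for three decades, for illustration only.
toy = {
    1910: {"war": 0.8, "navy": 0.6},
    1940: {"war": 0.9, "allies": 0.4},
    1990: {"children": 0.7, "medicare": 0.7},
}
print(most_similar_pair(toy))
```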

In [35]:
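# Raw term counts on the first ten speeches, for comparison with
# TF-IDF (term frequency weighted by inverse document frequency)
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vectorizer = CountVectorizer(stop_words='english', tokenizer=tokenize)
matrix = vectorizer.fit_transform(century.text[0:10])
# get_feature_names() was removed in scikit-learn 1.2; older versions may still need it
results = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())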
results

Unnamed: 0,'the,"1,000","1,000,000","1,005","1,026,000","1,049","1,071,881,662","1,127,680","1,128","1,180,800",...,youthful,yukon,zeal,zealand,zealous,zealously,zelaya,zone,zones,zoological
0,0,0,0,0,0,0,1,1,0,1,...,0,0,0,0,0,0,0,0,1,0
1,0,0,0,0,0,0,0,0,0,0,...,1,0,2,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,3,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,1,0,0,2,0,0
5,0,1,0,0,1,0,0,0,0,0,...,0,1,0,0,0,0,0,1,0,0
6,1,1,1,1,0,0,0,0,0,0,...,0,1,1,0,0,0,0,1,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
9,0,0,1,0,0,1,0,0,1,0,...,0,0,0,0,1,1,4,2,0,0


In [36]:
list(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2

["'the",
 '1,000',
 '1,000,000',
 '1,005',
 '1,026,000',
 '1,049',
 '1,071,881,662',
 '1,127,680',
 '1,128',
 '1,180,800',
 '1,247,827.58',
 '1,274,404',
 '1,289',
 '1,309,620.76',
 '1,358,228.35',
 '1,394,483,082',
 '1,453',
 '1,481,307',
 '1,500',
 '1,517,412',
 '1,636',
 '1,660',
 '1,801,524',
 '1,868,729',
 '1,923,840',
 '1.6',
 '1/2',
 '10,000',
 '10,000,000',
 '10,216,853.87',
 '10,240',
 '10,618',
 '10,748',
 '100',
 '100,000',
 '100,000,000',
 '103',
 '107,077.70',
 '108,000',
 '108,324',
 '109,020',
 '11,024,743.65',
 '11,411,779',
 '11,650',
 '11.65',
 '11.89',
 '110,400.77',
 '111,631,193',
 '117,509',
 '11907',
 '11th',
 '12,000,000',
 '12,672',
 '120',
 '121,848,047.26',
 '128,000',
 '12th',
 '13,000,000',
 '13,418,065.74',
 '13,453,887.96',
 '134,700,597.24',
 '134,774,767.78',
 '135',
 '135,000,000',
 '138,462,130.65',
 '139,303,794.50',
 '139,705,200',
 '13th',
 '14,093',
 '143,000,000',
 '143,382,624.34',
 '143,750.73',
 '145,000',
 '147,040.16',
 '14th',
 '15,000',
 '

In [37]:
vectorizer = TfidfVectorizer(stop_words='english', tokenizer=tokenize)

matrix = vectorizer.fit_transform(century.text)

# The easiest way to see what happened is to make a dataframe
# (get_feature_names() was removed in scikit-learn 1.2; older versions may still need it)
tfidf = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())
tfidf.head()

Unnamed: 0,'70,'76,'82,'86,'89,'90,'follow,'forties,'ll,'re,...,zest,zigzag,zimbabwe,zimbabwean,zinc,zion,zone,zones,zoological,zooming
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005191,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010902,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.018962,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.014207,0.0,0.0,0.0


In [39]:
print_sorted_vector(tfidf.iloc[100])

('21st', 0.27551897028712946)
('century', 0.2136500266597532)
('tonight', 0.17821263263949527)
('america', 0.14465930556488224)
('medicare', 0.1422603159683729)
('americans', 0.1393860733545617)
('security', 0.12539093061511267)
('social', 0.12123867487078827)
('support', 0.12055291146961268)
('year', 0.11705024611030593)
("n't", 0.11679774903172196)
('ask', 0.11600007196185878)
('work', 0.11095059456419319)
('new', 0.11095059456419319)
('years', 0.10818044503872504)
('children', 0.10682006343489041)
('congress', 0.09378978021763641)
('help', 0.09222587636595672)
('say', 0.08949411170093288)
('parents', 0.08555643775593061)


In [45]:
docs = tfidf.iloc[:226,:]
total = docs.sum(axis=0)
print_sorted_vector(total)

('government', 10.355424646314273)
('world', 9.493239166492428)
('people', 9.470719697601915)
('congress', 8.929728655791182)
('new', 7.40458203140697)
('year', 7.110187769852229)
('american', 6.319288058795445)
('nation', 6.090758985904416)
('years', 5.843573434753461)
('war', 5.840410447393049)
('america', 5.800567325962085)
('states', 5.739088321373245)
('great', 5.635071946163714)
('time', 5.406662587056824)
('national', 5.371044284443324)
('peace', 5.350935278546842)
('country', 5.051711549027097)
('united', 4.981694061255582)
('federal', 4.847567856871241)
('nations', 4.8130965944728965)


Write a 500-word (max) article on what U.S. presidents discussed in their SOTU speeches in the 20th century. You should obviously use your tf-idf analysis as a primary source *but* you will not be able to complete this without actually reading some of the speeches and comparing them to other historical references.

Turn in this notebook, with your article below.
    

According to the summed tf-idf weights, the terms American presidents leaned on most across the century were "government," "world," and "people," with "Congress" close behind. Each president calls on Congress directly during the address before the legislative body, and the high weights for "new" and "year" suggest an emphasis on what will be accomplished in the coming year. Not surprisingly, many presidents before Donald Trump relied on the words "great" and "America," as well.