#Part I: Data Munging

<b>Data Sources:</b>
<ul>
<li>Inaugural Addresses and States of the Union: Project Gutenberg</li>
<li>[Presidential Data](http://www.infoplease.com/ipa/A0194030.html): Infoplease</li>
<li>[Presidential Rankings](https://en.wikipedia.org/wiki/Historical_rankings_of_Presidents_of_the_United_States#Five_Thirty_Eight_analysis): Wikipedia/538</li>
</ul>

Structured data can be found [here](https://docs.google.com/spreadsheets/d/1cujFV5JLRivY-k6LMEDCP8_zapHUtwNCdb9Qr8h2gOQ/edit#gid=0).

###<i>Step 1: Parsing Speech Text</i>

First, let's import all the packages we'll need to clean the data:
<ul>
<li><code>re</code> for regular expression functions</li>
<li><code>pprint</code> to make printing more readable</li>
<li><code>string</code> to clean string values</li>
<li><code>pandas</code> because <i>duh</i></li>
<li><code>numpy</code> because math</li>
<li><code>matplotlib.pyplot</code> for charts</li>
<li><code>CountVectorizer</code> for parsing tokens and removing stop words</li>
</ul>

In [1]:
%matplotlib inline

import re
import pprint as pp
import string
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer

Next, we'll open the text files and read them into Python objects that can be parsed.

In [2]:
# Inaugural Address text
with open('../data/inaugural.txt', 'r') as inaugural:
    inaugural_text = inaugural.read()

# State of the Union text
with open('../data/sotu.txt', 'r') as sotu:
    sotu_text = sotu.read()

First, we'll parse the inaugural speech data using functions from the <code>re</code> module. We'll begin by creating a list of speech titles, which will act as speech IDs.

In [3]:
raw_speech_id_list = re.findall(r'\*\s\*\s\*\s\*\s\*([\w\s\,\.]+)ADDRESS',
                                inaugural_text)
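To see what that pattern captures, here's a minimal sketch run on a small hand-built sample (the `sample` text is an assumption for illustration, not an excerpt from the Gutenberg file). Each match is everything between the delimiter and the word ADDRESS, so the raw titles still carry leading line breaks and a trailing space.

```python
import re

# Hand-built sample that mimics the delimiter-plus-title layout (an assumption)
sample = (
    "* * * * *\r\n\r\n"
    "GEORGE WASHINGTON, FIRST INAUGURAL ADDRESS\r\n\r\n"
    "Fellow-Citizens of the Senate\r\n\r\n"
    "* * * * *\r\n\r\n"
    "JOHN ADAMS, INAUGURAL ADDRESS\r\n\r\n"
    "When it was first perceived\r\n"
)

# Same pattern as the cell above: delimiter, then the title, then ADDRESS
sample_titles = re.findall(r'\*\s\*\s\*\s\*\s\*([\w\s\,\.]+)ADDRESS', sample)
print(sample_titles)
```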

We'll use a <code>string</code> method (<code>strip</code>) to remove extraneous characters from the title list first. Later, we'll create a <code>dict</code> object that will have each title as a key and each full speech text as a value.

In [4]:
stripped_id_list = [string.strip(title, "\r\n ") for title in raw_speech_id_list]
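The `string.strip` function in the cell above only exists in Python 2. As a hedged aside, the equivalent `str.strip` method does the same job and also runs on Python 3; the `raw_titles` list below is a made-up stand-in for `raw_speech_id_list`.

```python
# Made-up stand-in for raw_speech_id_list (an assumption for illustration)
raw_titles = [
    "\r\n\r\nGEORGE WASHINGTON, FIRST INAUGURAL ",
    "\r\n\r\nJOHN ADAMS, INAUGURAL ",
]

# str.strip removes any leading/trailing characters found in the given set
stripped_titles = [title.strip("\r\n ") for title in raw_titles]
print(stripped_titles)
```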

Now that the titles are clean, let's move on to the speech text.

All the speeches in the text file are separated by \* \* \* \* \* delimiters, so we'll use <code>re.split</code> to extract all the text between the delimiters.

In [5]:
raw_speech = re.split(r'\*\s\*\s\*\s\*\s\*', inaugural_text)
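Here's a rough sketch of what that split produces on a toy string (the `sample` text is invented for illustration). The pieces before the first delimiter and after the last one are not speeches, which is why the first and last elements get dropped.

```python
import re

# Toy text: front matter, two "speeches", and end matter (all invented)
sample = ("Front matter\r\n* * * * *\r\nFirst speech\r\n"
          "* * * * *\r\nSecond speech\r\n* * * * *\r\nEnd matter")

pieces = re.split(r'\*\s\*\s\*\s\*\s\*', sample)

# Keep only the text that sits between delimiters
speech_pieces = pieces[1:-1]
print(len(pieces), speech_pieces)
```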

Next, we'll use <code>re.sub</code> to strip out the "Transcriber's Notes" because we only want the speech text for each inaugural address. We'll also ignore the first and last elements in the <code>raw_speech</code> list because they aren't actually speech text.

In [6]:
speeches = [re.sub(r'^([\w\W\s]+)\]', "", speech) for speech in raw_speech[1:-1]]

print(len(speeches))

55
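One thing worth knowing about that pattern: it is greedy, so it removes everything from the start of the string through the *last* closing bracket. A small sketch on invented strings (none of this text comes from the real file) shows both the intended case and the side effect when a bracket appears later in the body.

```python
import re

notes_pattern = r'^([\w\W\s]+)\]'

# Intended case: a bracketed note sits ahead of the speech (invented text)
with_notes = "\r\n\r\n[Transcriber's note: placeholder]\r\n\r\nFellow citizens"
cleaned_notes = re.sub(notes_pattern, "", with_notes)

# Side effect: a later bracket in the body extends the removed span
bracket_in_body = "[Note]\r\nWe hold [sic] these truths"
cleaned_body = re.sub(notes_pattern, "", bracket_in_body)

# No bracket at all: the text is left untouched
cleaned_plain = re.sub(notes_pattern, "", "Fellow citizens")

print(repr(cleaned_notes), repr(cleaned_body), repr(cleaned_plain))
```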


Finally, we'll use a combination of <code>re.sub</code> and <code>string.strip</code> to clean up all the extra spaces and newline characters in each speech.

In [7]:
clean_speeches = [re.sub(r'\r\n', " ", string.strip(speech, "\r\n"))
                  for speech in speeches]

print(len(clean_speeches))

55


It looks like most of the work is done, but the last three speeches still contain extraneous text (mostly speech IDs) that should be removed, so we'll make one more <code>re.sub</code> pass to strip that last bit of cruft before moving on.

In [8]:
clean_speeches_inaugural = [re.sub(r'([A-Z0-9\,\.\s]+)\s{3}', "", speech) 
                            for speech in clean_speeches]
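Because this substitution runs over every speech, it's worth checking what else the pattern can match. The sketch below uses invented strings: the first shows the intended removal of a leading all-caps ID, and the second shows that any run of capitals, digits, or punctuation followed by three whitespace characters is removed too, even mid-sentence.

```python
import re

cruft_pattern = r'([A-Z0-9\,\.\s]+)\s{3}'

# Intended case: an all-caps ID followed by three spaces (invented text)
leading_id = "GEORGE WASHINGTON, FIRST INAUGURAL   Fellow citizens"
cleaned_id = re.sub(cruft_pattern, "", leading_id)

# Side effect: a year and period ahead of three spaces are swallowed as well
mid_text = "held in 1789.   The next"
cleaned_mid = re.sub(cruft_pattern, "", mid_text)

print(repr(cleaned_id), repr(cleaned_mid))
```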

Now that the inaugural data is clean, let's follow similar steps to clean the State of the Union (SOTU) speeches. Again, we'll use functions from the <code>re</code> module to extract the text.

First, we'll create a list of titles that will serve as speech IDs. Rather than extracting them using Python, however, it'll be easier to just copy and paste the SOTU titles and load them into a Python list :)

In [9]:
raw_speech_id_list_sotu = [
'George Washington, State of the Union Address',
'George Washington, State of the Union Address',
'George Washington, State of the Union Address',
'George Washington, State of the Union Address',
'George Washington, State of the Union Address',
'George Washington, State of the Union Address',
'George Washington, State of the Union Address',
'George Washington, State of the Union Address',
'John Adams, State of the Union Address',
'John Adams, State of the Union Address',
'John Adams, State of the Union Address',
'John Adams, State of the Union Address',
'Thomas Jefferson, State of the Union Address',
'Thomas Jefferson, State of the Union Address',
'Thomas Jefferson, State of the Union Address',
'Thomas Jefferson, State of the Union Address',
'Thomas Jefferson, State of the Union Address',
'Thomas Jefferson, State of the Union Address',
'Thomas Jefferson, State of the Union Address',
'Thomas Jefferson, State of the Union Address',
'James Madison, State of the Union Address',
'James Madison, State of the Union Address',
'James Madison, State of the Union Address',
'James Madison, State of the Union Address',
'James Madison, State of the Union Address',
'James Madison, State of the Union Address',
'James Madison, State of the Union Address',
'James Madison, State of the Union Address',
'James Monroe, State of the Union Address',
'James Monroe, State of the Union Address',
'James Monroe, State of the Union Address',
'James Monroe, State of the Union Address',
'James Monroe, State of the Union Address',
'James Monroe, State of the Union Address',
'James Monroe, State of the Union Address',
'James Monroe, State of the Union Address',
'John Quincy Adams, State of the Union Address',
'John Quincy Adams, State of the Union Address',
'John Quincy Adams, State of the Union Address',
'John Quincy Adams, State of the Union Address',
'Andrew Jackson, State of the Union Address',
'Andrew Jackson, State of the Union Address',
'Andrew Jackson, State of the Union Address',
'Andrew Jackson, State of the Union Address',
'Andrew Jackson, State of the Union Address',
'Andrew Jackson, State of the Union Address',
'Andrew Jackson, State of the Union Address',
'Andrew Jackson, State of the Union Address',
'Martin van Buren, State of the Union Address',
'Martin van Buren, State of the Union Address',
'Martin van Buren, State of the Union Address',
'Martin van Buren, State of the Union Address',
'John Tyler, State of the Union Address',
'John Tyler, State of the Union Address',
'John Tyler, State of the Union Address',
'John Tyler, State of the Union Address',
'James Polk, State of the Union Address',
'James Polk, State of the Union Address',
'James Polk, State of the Union Address',
'James Polk, State of the Union Address',
'Zachary Taylor, State of the Union Address',
'Millard Fillmore, State of the Union Address',
'Millard Fillmore, State of the Union Address',
'Millard Fillmore, State of the Union Address',
'Franklin Pierce, State of the Union Address',
'Franklin Pierce, State of the Union Address',
'Franklin Pierce, State of the Union Address',
'Franklin Pierce, State of the Union Address',
'James Buchanan, State of the Union Address',
'James Buchanan, State of the Union Address',
'James Buchanan, State of the Union Address',
'James Buchanan, State of the Union Address',
'Abraham Lincoln, State of the Union Address',
'Abraham Lincoln, State of the Union Address',
'Abraham Lincoln, State of the Union Address',
'Abraham Lincoln, State of the Union Address',
'Andrew Johnson, State of the Union Address',
'Andrew Johnson, State of the Union Address',
'Andrew Johnson, State of the Union Address',
'Andrew Johnson, State of the Union Address',
'Ulysses S. Grant, State of the Union Address',
'Ulysses S. Grant, State of the Union Address',
'Ulysses S. Grant, State of the Union Address',
'Ulysses S. Grant, State of the Union Address',
'Ulysses S. Grant, State of the Union Address',
'Ulysses S. Grant, State of the Union Address',
'Ulysses S. Grant, State of the Union Address',
'Ulysses S. Grant, State of the Union Address',
'Rutherford B. Hayes, State of the Union Address',
'Rutherford B. Hayes, State of the Union Address',
'Rutherford B. Hayes, State of the Union Address',
'Rutherford B. Hayes, State of the Union Address',
'Chester A. Arthur, State of the Union Address',
'Chester A. Arthur, State of the Union Address',
'Chester A. Arthur, State of the Union Address',
'Chester A. Arthur, State of the Union Address',
'Grover Cleveland, State of the Union Address',
'Grover Cleveland, State of the Union Address',
'Grover Cleveland, State of the Union Address',
'Grover Cleveland, State of the Union Address',
'Benjamin Harrison, State of the Union Address',
'Benjamin Harrison, State of the Union Address',
'Benjamin Harrison, State of the Union Address',
'Benjamin Harrison, State of the Union Address',
'William McKinley, State of the Union Address',
'William McKinley, State of the Union Address',
'William McKinley, State of the Union Address',
'William McKinley, State of the Union Address',
'Theodore Roosevelt, State of the Union Address',
'Theodore Roosevelt, State of the Union Address',
'Theodore Roosevelt, State of the Union Address',
'Theodore Roosevelt, State of the Union Address',
'Theodore Roosevelt, State of the Union Address',
'Theodore Roosevelt, State of the Union Address',
'Theodore Roosevelt, State of the Union Address',
'Theodore Roosevelt, State of the Union Address',
'William H. Taft, State of the Union Address',
'William H. Taft, State of the Union Address',
'William H. Taft, State of the Union Address',
'William H. Taft, State of the Union Address',
'Woodrow Wilson, State of the Union Address',
'Woodrow Wilson, State of the Union Address',
'Woodrow Wilson, State of the Union Address',
'Woodrow Wilson, State of the Union Address',
'Woodrow Wilson, State of the Union Address',
'Woodrow Wilson, State of the Union Address',
'Woodrow Wilson, State of the Union Address',
'Woodrow Wilson, State of the Union Address',
'Warren Harding, State of the Union Address',
'Warren Harding, State of the Union Address',
'Calvin Coolidge, State of the Union Address',
'Calvin Coolidge, State of the Union Address',
'Calvin Coolidge, State of the Union Address',
'Calvin Coolidge, State of the Union Address',
'Calvin Coolidge, State of the Union Address',
'Calvin Coolidge, State of the Union Address',
'Herbert Hoover, State of the Union Address',
'Herbert Hoover, State of the Union Address',
'Herbert Hoover, State of the Union Address',
'Herbert Hoover, State of the Union Address',
'Franklin D. Roosevelt, State of the Union Address',
'Franklin D. Roosevelt, State of the Union Address',
'Franklin D. Roosevelt, State of the Union Address',
'Franklin D. Roosevelt, State of the Union Address',
'Franklin D. Roosevelt, State of the Union Address',
'Franklin D. Roosevelt, State of the Union Address',
'Franklin D. Roosevelt, State of the Union Address',
'Franklin D. Roosevelt, State of the Union Address',
'Franklin D. Roosevelt, State of the Union Address',
'Franklin D. Roosevelt, State of the Union Address',
'Franklin D. Roosevelt, State of the Union Address',
'Franklin D. Roosevelt, State of the Union Address',
'Harry S. Truman, State of the Union Address',
'Harry S. Truman, State of the Union Address',
'Harry S. Truman, State of the Union Address',
'Harry S. Truman, State of the Union Address',
'Harry S. Truman, State of the Union Address',
'Harry S. Truman, State of the Union Address',
'Harry S. Truman, State of the Union Address',
'Harry S. Truman, State of the Union Address',
'Dwight D. Eisenhower, State of the Union Address',
'Dwight D. Eisenhower, State of the Union Address',
'Dwight D. Eisenhower, State of the Union Address',
'Dwight D. Eisenhower, State of the Union Address',
'Dwight D. Eisenhower, State of the Union Address',
'Dwight D. Eisenhower, State of the Union Address',
'Dwight D. Eisenhower, State of the Union Address',
'Dwight D. Eisenhower, State of the Union Address',
'Dwight D. Eisenhower, State of the Union Address',
'John F. Kennedy, State of the Union Address',
'John F. Kennedy, State of the Union Address',
'John F. Kennedy, State of the Union Address',
'Lyndon B. Johnson, State of the Union Address',
'Lyndon B. Johnson, State of the Union Address',
'Lyndon B. Johnson, State of the Union Address',
'Lyndon B. Johnson, State of the Union Address',
'Lyndon B. Johnson, State of the Union Address',
'Lyndon B. Johnson, State of the Union Address',
'Richard Nixon, State of the Union Address',
'Richard Nixon, State of the Union Address',
'Richard Nixon, State of the Union Address',
'Richard Nixon, State of the Union Address',
'Richard Nixon, State of the Union Address',
'Gerald R. Ford, State of the Union Address',
'Gerald R. Ford, State of the Union Address',
'Gerald R. Ford, State of the Union Address',
'Jimmy Carter, State of the Union Address',
'Jimmy Carter, State of the Union Address',
'Jimmy Carter, State of the Union Address',
'Jimmy Carter, State of the Union Address',
'Ronald Reagan, State of the Union Address',
'Ronald Reagan, State of the Union Address',
'Ronald Reagan, State of the Union Address',
'Ronald Reagan, State of the Union Address',
'Ronald Reagan, State of the Union Address',
'Ronald Reagan, State of the Union Address',
'Ronald Reagan, State of the Union Address',
'George H.W. Bush, State of the Union Address',
'George H.W. Bush, State of the Union Address',
'George H.W. Bush, State of the Union Address',
'William J. Clinton, State of the Union Address',
'William J. Clinton, State of the Union Address',
'William J. Clinton, State of the Union Address',
'William J. Clinton, State of the Union Address',
'William J. Clinton, State of the Union Address',
'William J. Clinton, State of the Union Address',
'William J. Clinton, State of the Union Address',
'George W. Bush, State of the Union Address',
'George W. Bush, State of the Union Address',
'George W. Bush, State of the Union Address',
'George W. Bush, State of the Union Address',
'George W. Bush, State of the Union Address',
'George W. Bush, State of the Union Address',
'George W. Bush, State of the Union Address'
]

# Capitalize speech IDs to conform to Inaugural Address data
raw_speech_id_list_sotu_caps = [item.upper() for item in raw_speech_id_list_sotu]

pp.pprint(raw_speech_id_list_sotu_caps[:2])

['GEORGE WASHINGTON, STATE OF THE UNION ADDRESS',
 'GEORGE WASHINGTON, STATE OF THE UNION ADDRESS']
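If the pasted list ever gets unwieldy, one alternative is to build it from (name, count) pairs. This is only a sketch: the `counts` list covers just the first three presidents (8, 4, and 8 entries, as in the list above), and the variable names are made up for illustration.

```python
# (name, number of SOTU addresses) pairs; only the first three presidents
# are shown here, with counts taken from the pasted list
counts = [
    ("George Washington", 8),
    ("John Adams", 4),
    ("Thomas Jefferson", 8),
]

# Repeat each title once per address, preserving order
titles = ["{}, State of the Union Address".format(name)
          for name, n in counts
          for _ in range(n)]

titles_caps = [title.upper() for title in titles]
print(len(titles), titles_caps[0])
```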


In [10]:
# Parse out speech IDs into a list
speech_id_list_sotu = [re.findall(r'^(.*?)\sADDRESS', speech)[0]
                       for speech in raw_speech_id_list_sotu_caps]

pp.pprint(speech_id_list_sotu[:2])

['GEORGE WASHINGTON, STATE OF THE UNION',
 'GEORGE WASHINGTON, STATE OF THE UNION']
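Indexing `re.findall(...)[0]` raises an `IndexError` if a title ever lacks the word ADDRESS. A slightly more defensive variant, sketched here with a hypothetical `speech_id` helper, returns `None` instead.

```python
import re

def speech_id(title):
    """Return the text before ' ADDRESS', or None if the title lacks it."""
    match = re.search(r'^(.*?)\sADDRESS', title)
    return match.group(1) if match else None

print(speech_id('GEORGE WASHINGTON, STATE OF THE UNION ADDRESS'))
print(speech_id('NO SUFFIX HERE'))
```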


In [11]:
# Combine the speech IDs into a single list
title_list = stripped_id_list + speech_id_list_sotu

Now we're going to add each president's number to each of the titles in <code>title_list</code>, which will make for easier joining when we add personal details and rankings. To do that, we'll first write the titles out to a CSV file.

In [12]:
# Write the combined speech titles to a file, one per line
with open('../data/ids.csv', 'w') as file_df:
    for row in title_list:
        file_df.write(row)
        file_df.write('\n')

Normally, we'd find a way to add the actual order numbers programmatically, but since there are relatively few records in the dataset, we can just do it by hand, then load the hand-tagged CSV back in.
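For the curious, here's one way the numbering might be done in code. It's a sketch under strong assumptions: names are spelled identically everywhere, and they first appear in chronological order. A president with non-consecutive terms would get a single number, and anyone who first appears later in the list would be numbered out of order, so hand-checking would still be needed. The `number_presidents` helper and the sample titles are hypothetical.

```python
def number_presidents(titles):
    """Assign each name a number by order of first appearance."""
    order = {}
    rows = []
    for title in titles:
        # Split "NAME, SPEECH" on the first comma
        name, _, speech = title.partition(",")
        name = name.strip()
        if name not in order:
            order[name] = len(order) + 1
        rows.append((order[name], name, speech.strip()))
    return rows

sample_titles = [
    "GEORGE WASHINGTON, FIRST INAUGURAL",
    "GEORGE WASHINGTON, SECOND INAUGURAL",
    "JOHN ADAMS, INAUGURAL",
    "THOMAS JEFFERSON, FIRST INAUGURAL",
    "GEORGE WASHINGTON, STATE OF THE UNION",
]
numbered = number_presidents(sample_titles)
for row in numbered:
    print(row)
```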

In [13]:
# Load newly-tagged data
df_with_id = pd.read_csv('../data/ids_final.csv')
df_with_id.head()

Unnamed: 0,id,name,speech
0,1,GEORGE WASHINGTON,FIRST INAUGURAL
1,1,GEORGE WASHINGTON,SECOND INAUGURAL
2,2,JOHN ADAMS,INAUGURAL
3,3,THOMAS JEFFERSON,FIRST INAUGURAL
4,3,THOMAS JEFFERSON,SECOND INAUGURAL


The above <code>DataFrame</code> can be concatenated to the later <code>DataFrame</code>s containing the tokens generated by parsing the speech text.

Now for the hard part: let's grab the actual speech text for each State of the Union speech. First, we'll split the full text file; each speech is separated by \*\*\*, so we'll split using that.

In [14]:
raw_speech_sotu = re.split(r'\*\*\*\r\n\r\n', sotu_text)

# Actual speeches start at index 4 and end at index -3
raw_speech_sotu = raw_speech_sotu[4:-3]

To clean things up just a bit more, we'll remove the title information in each speech text.

In [15]:
clean_speeches_2 = [re.findall(r'[0-9]{4}([\w\W\s\S]+)$', speech)[0]
                    for speech in raw_speech_sotu]

print(len(clean_speeches_2))

214
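That pattern keeps everything after the first four-digit number, which works when each speech opens with a header that ends in a year. The sketch below assumes such a layout (the `sample` header is hand-built, not copied from the file) and also shows that a chunk without any four-digit number yields an empty list, so the `[0]` lookup would fail on it.

```python
import re

body_pattern = r'[0-9]{4}([\w\W\s\S]+)$'

# Hand-built header layout: title, name, date, then the body (an assumption)
sample = ("State of the Union Address\r\nGeorge Washington\r\n"
          "January 8, 1790\r\n\r\nFellow-Citizens of the Senate")
body = re.findall(body_pattern, sample)[0]

# No four-digit number anywhere: findall returns an empty list
no_year = re.findall(body_pattern, "no year here")

print(repr(body), no_year)
```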


In [16]:
# Replace each '\r\n' line break in the SOTU speeches with a space
clean_speeches_sotu = [re.sub(r'\r\n', ' ', speech) for speech in clean_speeches_2]

Now that both sets of speeches have been properly cleaned, we'll concatenate them to create an aggregate list of cleaned speeches.

In [17]:
clean_speeches_all = clean_speeches_inaugural + clean_speeches_sotu

In [18]:
df_speeches = pd.DataFrame(clean_speeches_all, columns = ["speech_text"])

In [19]:
len(df_speeches)

269

In [20]:
df_all = pd.concat([df_with_id.iloc[:,:1], df_speeches], axis = 1)

In [21]:
df_grouped = df_all.groupby('id', as_index = False).sum()
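Summing a column of strings concatenates them with no separator, so the last word of one speech runs into the first word of the next. Here's a plain-Python sketch of the same per-president grouping that joins with a space instead; the `join_by_id` helper is hypothetical, and unlike `groupby` it keeps ids in order of first appearance rather than sorting them.

```python
from collections import OrderedDict

def join_by_id(ids, texts, sep=" "):
    """Group texts by id and join each group's texts with sep."""
    grouped = OrderedDict()
    for pid, text in zip(ids, texts):
        grouped.setdefault(pid, []).append(text)
    return [sep.join(parts) for parts in grouped.values()]

ids = [1, 1, 2]
texts = ["a b", "c d", "e"]

spaced = join_by_id(ids, texts)          # words stay separate
fused = join_by_id(ids, texts, sep="")   # mimics plain concatenation
print(spaced, fused)
```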

In [22]:
# Create list of aggregated speech texts by president
speeches_all = list(df_grouped.speech_text)

In [23]:
# TODO: PorterStemmer > tf-idf (Unigram) > Create fake, similar data > Ensemble technique classifier

# Import TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Unigram tf-idf
unigram_tfidf_vect = TfidfVectorizer(decode_error = 'ignore',
                                     stop_words = 'english',
                                     lowercase = True,
                                     max_features = 10000,
                                     min_df = 0.25, # Original: 0.1
                                     max_df = 0.75) # Original: 0.9
unigram_tfidf_output = unigram_tfidf_vect.fit_transform(speeches_all)

# Turn matrix into a DataFrame
unigram_tfidf_df = pd.DataFrame(unigram_tfidf_output.toarray(),
                                columns = unigram_tfidf_vect.get_feature_names())
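A note on `min_df` and `max_df`: as far as we understand scikit-learn's behavior, float values are treated as proportions of documents, so a term survives only if it shows up in at least 25% and at most 75% of the per-president texts. The plain-Python sketch below illustrates that filter on four toy documents; the `filter_by_document_frequency` function and the whitespace tokenizer are assumptions for illustration, not scikit-learn's implementation.

```python
def filter_by_document_frequency(docs, min_df=0.25, max_df=0.75):
    """Keep tokens whose document frequency falls within the given proportions."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        # Count each token once per document
        for token in set(doc.lower().split()):
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return sorted(token for token, df in doc_freq.items()
                  if min_df * n_docs <= df <= max_df * n_docs)

toy_docs = ["union congress treaty", "union congress", "union peace", "union"]

# "union" appears in all four documents, so it exceeds the 75% ceiling
kept = filter_by_document_frequency(toy_docs)
print(kept)
```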

In [24]:
unigram_tfidf_df.columns.values[39:] # Word-only columns

array([u'23d', u'24', u'25', ..., u'zealous', u'zealously', u'zone'], dtype=object)

In [25]:
unigram_tfidf_df_all = unigram_tfidf_df.iloc[:,39:]
unigram_tfidf_df_all.head()

Unnamed: 0,23d,24,25,250,25th,26,27,27th,28,29,...,yielded,yielding,yields,york,young,youth,zeal,zealous,zealously,zone
0,0,0.0,0.0,0.0,0.025055,0.0,0,0.0,0.0,0,...,0.0,0.0,0.014791,0.008624,0,0.020574,0.030065,0.044372,0.0,0.0
1,0,0.0,0.0,0.0,0.035749,0.0,0,0.0,0.0,0,...,0.0,0.0,0.0,0.0,0,0.0,0.028598,0.042207,0.0,0.0
2,0,0.007076,0.007076,0.0,0.0,0.008014,0,0.0,0.0,0,...,0.0,0.0,0.012136,0.021229,0,0.0,0.041116,0.0,0.011315,0.0
3,0,0.005861,0.005861,0.0,0.0,0.0,0,0.0,0.0,0,...,0.024787,0.023391,0.0,0.023446,0,0.006992,0.034058,0.0,0.009372,0.0
4,0,0.006626,0.0,0.004285,0.0,0.011257,0,0.005297,0.00385,0,...,0.00467,0.013221,0.005682,0.013252,0,0.007904,0.023101,0.011365,0.005297,0.005297


In [31]:
# Multigram tf-idf
multigram_tfidf_vect = TfidfVectorizer(decode_error = 'ignore',
                                       stop_words = 'english',
                                       lowercase = True,
                                       ngram_range = (2, 3),
                                       max_features = 10000,
                                       min_df = 0.25, # Original: 0.1
                                       max_df = 0.75) # Original: 0.9
multigram_tfidf_output = multigram_tfidf_vect.fit_transform(speeches_all)

# Turn matrix into a DataFrame
multigram_tfidf_df = pd.DataFrame(multigram_tfidf_output.toarray(),
                                  columns = multigram_tfidf_vect.get_feature_names())
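With `ngram_range = (2, 3)`, the vectorizer builds two- and three-word phrases. As far as we can tell, scikit-learn drops stop words before assembling the n-grams, which would explain features such as 'advice consent senate'. Here's a rough plain-Python sketch of that idea; the `word_ngrams` helper and the tiny stop-word set are hypothetical stand-ins, not the library's code.

```python
def word_ngrams(tokens, min_n=2, max_n=3):
    """Return all contiguous n-grams of length min_n..max_n as strings."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# Tiny stand-in stop-word set (an assumption, far smaller than the real list)
stop_words = {"and", "of", "the"}

text = "advice and consent of the senate"
tokens = [t for t in text.lower().split() if t not in stop_words]

grams = word_ngrams(tokens)
print(grams)
```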

In [36]:
multigram_tfidf_df.columns.values[39:] # Word-only columns

array([u'800 000', u'able announce', u'able say', u'absolutely necessary',
       u'act congress', u'act june', u'act march', u'act passed',
       u'act session', u'action congress', u'action government',
       u'action taken', u'active service', u'acts congress',
       u'additional legislation', u'adequate provision',
       u'adjournment congress', u'adjustment claims',
       u'administration congress', u'administration government',
       u'administration justice', u'adopt measures',
       u'adoption constitution', u'advice consent',
       u'advice consent senate', u'afford opportunity',
       u'agreement reached', u'agricultural products',
       u'agriculture commerce', u'almighty god', u'amendment constitution',
       u'america great', u'american citizen', u'american citizens',
       u'american citizenship', u'american flag', u'american industry',
       u'american life', u'american products', u'american republics',
       u'american states', u'american vessels', u'amica

In [38]:
multigram_tfidf_df_all = multigram_tfidf_df.iloc[:,39:]
multigram_tfidf_df_all

Unnamed: 0,able announce,able say,absolutely necessary,act congress,act june,act march,act passed,act session,action congress,action government,...,year estimated,year increase,year period,year present,year time,year year,years come,years past,years time,young men
0,0.032039,0.0,0.0,0.018681,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.05099,0.0,0.0,0.0,0.040765,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.028874,0.0,0.0,0.0,0.0
2,0.062595,0.0,0.0,0.036497,0.0,0.0,0.0,0.145891,0.0,0.0,...,0.0,0.0,0.0,0.031298,0.0,0.020667,0.0,0.023603,0.0,0.0
3,0.0,0.0,0.0,0.030797,0.0,0.0,0.0,0.098484,0.0,0.0,...,0.021705,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.013414,0.072955,0.0,0.038884,0.011096,0.10369,0.0,0.0,...,0.034279,0.0,0.0,0.013903,0.013903,0.0,0.0,0.02097,0.023551,0.0
5,0.0,0.012655,0.0,0.152956,0.0,0.085599,0.010469,0.012228,0.009123,0.023662,...,0.01078,0.012228,0.0,0.013117,0.0,0.0,0.021561,0.0,0.01111,0.0
6,0.0,0.0,0.008789,0.069047,0.008789,0.0,0.029082,0.016985,0.025342,0.016433,...,0.02246,0.008493,0.0,0.009109,0.009109,0.01203,0.0,0.00687,0.0,0.0
7,0.0,0.013905,0.0,0.02521,0.0,0.0,0.023006,0.026873,0.0,0.012999,...,0.0,0.0,0.014412,0.014412,0.0,0.009517,0.02369,0.021738,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055831,0.0,0.0
9,0.035075,0.033842,0.016921,0.040902,0.0,0.0,0.0,0.0327,0.0,0.047455,...,0.0,0.0,0.0,0.0,0.0,0.0,0.028828,0.013226,0.0,0.0
