#PM3: First Draft and Progress Report ([presentation link](https://docs.google.com/presentation/d/1x0f_TOlJKiu6sIVaI-RIBynmladd02lt-9ZwkZsvPSE/edit#slide=id.p))

##<i>Part I: Current Project Status</i>

<i><b>tl;dr</b>: I've added more data to include more presidents, and labelling political affiliation on a gradient is too hard, so I'm going to switch to modeling other things.</i>

###Roadblocks

Since the last project milestone, I've obtained more data (presented below) and, after exploring the original dataset, am strongly considering adjusting my project objectives. Analyzing how conservative or liberal a president is based on political rhetoric would likely require a custom sentiment mapping, and while I have some familiarity with NLTK, I believe building one is beyond my skills at this time.

###New Data

As such, I've obtained a few other pieces of data that may allow me to do something similar to my original goal. These data (from various online sources) contain information like a president's political party, religious affiliation, age at inauguration and at death, and each president's aggregate rank according to various scholars.

I've also expanded the text corpus to include State of the Union addresses. Some presidents did not deliver inaugural addresses (mainly vice presidents who assumed office after their predecessors died or resigned), so I added these speeches to ensure more presidents are included in the sample.

###Changes in Project Scope and Objectives

With these new data, I now hope to use political rhetoric to model the newly introduced categorical and continuous labels (e.g., party, religion, ranking) and potentially gain insight into the following questions, among others:
<ul>
<li>What language, themes, and phrases do our greatest presidents have in common in their rhetoric?</li>
<li>Is there a quantifiable relationship between these rhetorical traits and how great a president is perceived to be?</li>
<li>Does a president's specific religious background affect the language used in speeches?</li>
</ul>

##<i>Part II: Previous Milestone (PM2): Data Ready</i>

<i>This section contains mostly data pre-processing and initial exploratory data analysis. If you're interested in what I've done since the previous milestone deadline, feel free to skip ahead to <b>Part III</b>.</i>

<b>Data Sources:</b>
<ul>
<li>Inaugural Addresses and States of the Union: Project Gutenberg</li>
<li>[Presidential Data](http://www.infoplease.com/ipa/A0194030.html): Infoplease</li>
<li>[Presidential Rankings](https://en.wikipedia.org/wiki/Historical_rankings_of_Presidents_of_the_United_States#Five_Thirty_Eight_analysis): Wikipedia/538</li>
</ul>

Structured data can be found [here](https://docs.google.com/spreadsheets/d/1cujFV5JLRivY-k6LMEDCP8_zapHUtwNCdb9Qr8h2gOQ/edit#gid=0).

####<i>Step 1: Pre-processing - Parsing Speeches</i>

First, let's import all the packages we'll need to clean the data:
<ul>
<li><code>re</code> for regular expression functions</li>
<li><code>pprint</code> to make printing more readable</li>
<li><code>string</code> to clean string values</li>
<li><code>pandas</code> because <i>duh</i></li>
<li><code>numpy</code> because math</li>
<li><code>matplotlib.pyplot</code> for charts</li>
<li><code>CountVectorizer</code> for parsing tokens and removing stop words</li>
</ul>

In [1]:
%matplotlib inline

import re
import pprint as pp
import string
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer

Next, we'll open the text files and read them into Python objects that can be parsed.

In [2]:
# Inaugural speech text (the with block closes the file after reading)
with open('../data/inaugural.txt', 'r') as inaugural:
    inaugural_text = inaugural.read()

# State of the Union text
with open('../data/sotu.txt', 'r') as sotu:
    sotu_text = sotu.read()

First, we'll parse the inaugural speech data using functions from the <code>re</code> module.

In [3]:
# Create list of speech titles which will act as speech IDs
raw_speech_id_list = re.findall(r'\*\s\*\s\*\s\*\s\*([\w\s\,\.]+)ADDRESS',
                                inaugural_text)

print len(raw_speech_id_list)

55


We'll use a <code>string</code> method (<code>strip</code>) to remove extraneous characters from the title list first. Later, we'll create a <code>dict</code> object that will have each title as a key and each full speech text as a value.

In [4]:
stripped_id_list = [string.strip(title, "\r\n ") for title in raw_speech_id_list]

Let's move on to cleaning the speech text since we've cleaned the titles.

All the speeches in the text file are separated by \* \* \* \* \* delimiters, so we'll use <code>re.split</code> to extract all the text between the delimiters.

In [5]:
raw_speech = re.split(r'\*\s\*\s\*\s\*\s\*', inaugural_text)

Next, we'll use <code>re.sub</code> to remove the "Transcriber's Notes" because we only want the speech text for each inaugural address. We'll also ignore the first and last elements in the <code>raw_speech</code> list because they aren't actually speech text.

In [6]:
speeches = [re.sub(r'^([\w\W\s]+)\]', "", speech) for speech in raw_speech[1:-1]]

print len(speeches)

55
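
To make the split-and-strip step easier to follow, here is a minimal sketch on a made-up miniature of the file. The `sample_text` contents and the variable names are assumptions for illustration only; the two regular expressions are the ones used above.

```python
import re

# Hypothetical miniature of the Gutenberg file: a header, two speeches that
# each start with a bracketed transcriber's note, and a footer.
sample_text = (
    "File header\r\n"
    "* * * * *\r\n"
    "[Transcriber's note: first]\r\n"
    "First speech text.\r\n"
    "* * * * *\r\n"
    "[Transcriber's note: second]\r\n"
    "Second speech text.\r\n"
    "* * * * *\r\n"
    "File footer\r\n"
)

# Split on the delimiter; the first and last chunks are header and footer.
chunks = re.split(r'\*\s\*\s\*\s\*\s\*', sample_text)
body_chunks = chunks[1:-1]

# Remove everything up to and including the last "]" in each chunk.
# The pattern is greedy, so it assumes the speech itself contains no "]".
sample_speeches = [re.sub(r'^([\w\W\s]+)\]', "", chunk) for chunk in body_chunks]
```

Because the pattern is greedy, a speech that contained its own bracketed text would lose everything before that bracket, which is worth checking on the real file.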


Finally, we'll use a combination of <code>re.sub</code> and <code>string.strip</code> to clean up all the extra spaces and newline characters in each speech.

In [7]:
# Strip leading/trailing line breaks, then replace the remaining ones with spaces
clean_speeches = [re.sub(r'\r\n', " ", string.strip(speech, "\r\n"))
                  for speech in speeches]

pp.pprint(clean_speeches[-1])

'GEORGE W. BUSH, SECOND INAUGURAL ADDRESS  THURSDAY, JANUARY 20, 2005    Vice President Cheney, Mr. Chief Justice, President Carter, President Bush, President Clinton, reverend clergy, distinguished guests, fellow citizens:  On this day, prescribed by law and marked by ceremony, we celebrate the durable wisdom of our Constitution, and recall the deep commitments that unite our country. I am grateful for the honor of this hour, mindful of the consequential times in which we live, and determined to fulfill the oath that I have sworn and you have witnessed.  At this second gathering, our duties are defined not by the words I use, but by the history we have seen together. For a half century, America defended our own freedom by standing watch on distant borders. After the shipwreck of communism came years of relative quiet, years of repose, years of sabbatical--and then there came a day of fire.  We have seen our vulnerability--and we have seen its deepest source. For as long as whole regio

Now that the data are all clean, we can create the <code>dict</code> that we mentioned earlier. But first, we'll create a <code>list</code> of zipped <code>tuple</code>s, in case we need to access the data by index, since the key values in the <code>dict</code> will be a little unwieldy to invoke.

In [8]:
# Zip the titles and speeches together
speeches_zip_inaugural = zip(stripped_id_list, clean_speeches)

# Create a dictionary from the zipped data
speeches_dict_inaugural = dict(speeches_zip_inaugural)

pp.pprint(speeches_zip_inaugural[:1])

[('GEORGE WASHINGTON, FIRST INAUGURAL',
  'Fellow-Citizens of the Senate and of the House of Representatives:  Among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month. On the one hand, I was summoned by my Country, whose voice I can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in my flattering hopes, with an immutable decision, as the asylum of my declining years--a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination, and of frequent interruptions in my health to the gradual waste committed on it by time. On the other hand, the magnitude and difficulty of the trust to which the voice of my country called me, being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrut
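
As a small illustration of why both structures are kept, the sketch below builds the zipped list and the dict from two placeholder entries. The speech texts are stand-ins, not real data, and `list(zip(...))` is used so the result stays indexable on Python 3 as well.

```python
# Placeholder titles and texts; the real lists hold 55 entries each.
titles = ['GEORGE WASHINGTON, FIRST INAUGURAL', 'JOHN ADAMS INAUGURAL']
texts = ['placeholder text one', 'placeholder text two']

# A list of (title, text) tuples supports access by position.
pairs = list(zip(titles, texts))

# A dict supports access by title.
by_title = dict(pairs)

first_title, first_text = pairs[0]
adams_text = by_title['JOHN ADAMS INAUGURAL']
```

One caveat: dict keys must be unique, so two speeches that shared a title would collapse into a single entry in the dict, while the zipped list would keep both.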

Now that the data are mostly clean, we can begin parsing the actual speech words using <code>CountVectorizer</code>.

Since the <code>get_feature_names</code> method for <code>CountVectorizer</code> instances returns Unicode strings, we'll use the <code>encode</code> method to convert the resulting tokens to ASCII strings.

In [9]:
inaugural_vect = CountVectorizer(decode_error = 'ignore', stop_words='english')
inaugural_vect.fit(clean_speeches)
raw_feature_names_inaugural = [token.encode('ascii','ignore') for token in inaugural_vect.get_feature_names()]

We'll create a document-term matrix that will allow us to then create a <code>DataFrame</code> that counts the number of times each token appears in each speech:

In [10]:
dtm = inaugural_vect.transform(clean_speeches)
raw_df_inaugural = pd.DataFrame(dtm.toarray(), columns=inaugural_vect.get_feature_names())
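
For readers unfamiliar with the term, a document-term matrix has one row per document and one column per token, with counts in the cells. The sketch below builds one by hand with the standard library only. It is not how `CountVectorizer` is implemented: the `tokenize` helper is a naive assumption, and no stop words are removed.

```python
import re
from collections import Counter

docs = ["we the people", "the people and the union"]

def tokenize(text):
    # Naive assumption: lowercase runs of two or more word characters.
    return re.findall(r'\b\w\w+\b', text.lower())

# Count tokens per document, then fix a sorted vocabulary as the column order.
counts = [Counter(tokenize(doc)) for doc in docs]
vocabulary = sorted(set(token for c in counts for token in c))

# One row per document, one column per vocabulary token.
matrix = [[c[token] for token in vocabulary] for c in counts]
```

Here `vocabulary` plays the role of the feature names and `matrix` the role of `dtm.toarray()`, so passing both to `pd.DataFrame` would give the same shape of table as above.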

Let's clean up the <code>DataFrame</code> by ignoring all the "number" tokens so we're left with only full words. We'll also add the titles to the final <code>DataFrame</code> to make labelling easier.

In [11]:
# .copy() gives an independent DataFrame, so adding a column below
# does not trigger pandas' SettingWithCopyWarning
inaugural_df = raw_df_inaugural.iloc[:,raw_feature_names_inaugural.index('6th')+1:].copy()

inaugural_df['title_id'] = stripped_id_list

# Move title_id to the first column
cols = inaugural_df.columns.tolist()
cols = cols[-1:] + cols[:-1]
inaugural_df = inaugural_df[cols]

print inaugural_df[:5]

                              title_id  abandon  abandoned  abandonment  \
0   GEORGE WASHINGTON, FIRST INAUGURAL        0          0            0   
1  GEORGE WASHINGTON, SECOND INAUGURAL        0          0            0   
2                 JOHN ADAMS INAUGURAL        0          1            0   
3     THOMAS JEFFERSON FIRST INAUGURAL        1          0            0   
4    THOMAS JEFFERSON SECOND INAUGURAL        0          0            0   

   abate  abdicated  abeyance  abhorring  abide  abiding  ...   yorktown  \
0      0          0         0          0      0        0  ...          0   
1      0          0         0          0      0        0  ...          0   
2      0          0         0          0      0        0  ...          0   
3      0          0         0          0      0        0  ...          0   
4      0          0         0          0      0        0  ...          0   

   young  younger  youngest  youth  youthful  zeal  zealous  zealously  zone  
0      0     
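
Slicing at the position of `'6th'` works because number-like tokens sort ahead of alphabetic ones in the vocabulary, but it depends on `'6th'` being the last such token. A possible alternative, sketched here on a made-up `feature_names` list, is to filter tokens by content instead of by position.

```python
import re

# Hypothetical slice of a sorted vocabulary, not the real feature list.
feature_names = ['10', '1789', '4th', '6th', 'abandon', 'zeal']

# Positional approach used above: keep everything after '6th'.
by_position = feature_names[feature_names.index('6th') + 1:]

# Content-based alternative: keep tokens that contain no digit at all.
word_tokens = [name for name in feature_names if not re.search(r'\d', name)]
```

On this sample both approaches agree. The content-based list could then be used to select columns, for example `raw_df_inaugural[word_tokens]`, assuming the column names match the tokens exactly.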


Now let's parse the State of the Union (SOTU) data. Again, we'll use the <code>re</code> module to extract the text.

First, we'll create a list of titles that will serve as speech IDs. Rather than extracting them using Python, however, it'll be easier to just copy and paste the SOTU titles and load them into a Python list :)

In [12]:
raw_speech_id_list_sotu = [
'George Washington, State of the Union Address, January 8, 1790',
'George Washington, State of the Union Address, December 8, 1790',
'George Washington, State of the Union Address, October 25, 1791',
'George Washington, State of the Union Address, November 6, 1792',
'George Washington, State of the Union Address, December 3, 1793',
'George Washington, State of the Union Address, November 19, 1794',
'George Washington, State of the Union Address, December 8, 1795',
'George Washington, State of the Union Address, December 7, 1796',
'John Adams, State of the Union Address, November 22, 1797',
'John Adams, State of the Union Address, December 8, 1798',
'John Adams, State of the Union Address, December 3, 1799',
'John Adams, State of the Union Address, November 11, 1800',
'Thomas Jefferson, State of the Union Address, December 8, 1801',
'Thomas Jefferson, State of the Union Address, December 15, 1802',
'Thomas Jefferson, State of the Union Address, October 17, 1803',
'Thomas Jefferson, State of the Union Address, November 8, 1804',
'Thomas Jefferson, State of the Union Address, December 3, 1805',
'Thomas Jefferson, State of the Union Address, December 2, 1806',
'Thomas Jefferson, State of the Union Address, October 27, 1807',
'Thomas Jefferson, State of the Union Address, November 8, 1808',
'James Madison, State of the Union Address, November 29, 1809',
'James Madison, State of the Union Address, December 5, 1810',
'James Madison, State of the Union Address, November 5, 1811',
'James Madison, State of the Union Address, November 4, 1812',
'James Madison, State of the Union Address, December 7, 1813',
'James Madison, State of the Union Address, September 20, 1814',
'James Madison, State of the Union Address, December 5, 1815',
'James Madison, State of the Union Address, December 3, 1816',
'James Monroe, State of the Union Address, December 12, 1817',
'James Monroe, State of the Union Address, November 16, 1818',
'James Monroe, State of the Union Address, December 7, 1819',
'James Monroe, State of the Union Address, November 14, 1820',
'James Monroe, State of the Union Address, December 3, 1821',
'James Monroe, State of the Union Address, December 3, 1822',
'James Monroe, State of the Union Address, December 2, 1823',
'James Monroe, State of the Union Address, December 7, 1824',
'John Quincy Adams, State of the Union Address, December 6, 1825',
'John Quincy Adams, State of the Union Address, December 5, 1826',
'John Quincy Adams, State of the Union Address, December 4, 1827',
'John Quincy Adams, State of the Union Address, December 2, 1828',
'Andrew Jackson, State of the Union Address, December 8, 1829',
'Andrew Jackson, State of the Union Address, December 6, 1830',
'Andrew Jackson, State of the Union Address, December 6, 1831',
'Andrew Jackson, State of the Union Address, December 4, 1832',
'Andrew Jackson, State of the Union Address, December 3, 1833',
'Andrew Jackson, State of the Union Address, December 1, 1834',
'Andrew Jackson, State of the Union Address, December 7, 1835',
'Andrew Jackson, State of the Union Address, December 5, 1836',
'Martin van Buren, State of the Union Address, December 5, 1837',
'Martin van Buren, State of the Union Address, December 3, 1838',
'Martin van Buren, State of the Union Address, December 2, 1839',
'Martin van Buren, State of the Union Address, December 5, 1840',
'John Tyler, State of the Union Address, December 7, 1841',
'John Tyler, State of the Union Address, December 6, 1842',
'John Tyler, State of the Union Address, December 6, 1843',
'John Tyler, State of the Union Address, December 3, 1844',
'James Polk, State of the Union Address, December 2, 1845',
'James Polk, State of the Union Address, December 8, 1846',
'James Polk, State of the Union Address, December 7, 1847',
'James Polk, State of the Union Address, December 5, 1848',
'Zachary Taylor, State of the Union Address, December 4, 1849',
'Millard Fillmore, State of the Union Address, December 2, 1850',
'Millard Fillmore, State of the Union Address, December 2, 1851',
'Millard Fillmore, State of the Union Address, December 6, 1852',
'Franklin Pierce, State of the Union Address, December 5, 1853',
'Franklin Pierce, State of the Union Address, December 4, 1854',
'Franklin Pierce, State of the Union Address, December 31, 1855',
'Franklin Pierce, State of the Union Address, December 2, 1856',
'James Buchanan, State of the Union Address, December 8, 1857',
'James Buchanan, State of the Union Address, December 6, 1858',
'James Buchanan, State of the Union Address, December 19, 1859',
'James Buchanan, State of the Union Address, December 3, 1860',
'Abraham Lincoln, State of the Union Address, December 3, 1861',
'Abraham Lincoln, State of the Union Address, December 1, 1862',
'Abraham Lincoln, State of the Union Address, December 8, 1863',
'Abraham Lincoln, State of the Union Address, December 6, 1864',
'Andrew Johnson, State of the Union Address, December 4, 1865',
'Andrew Johnson, State of the Union Address, December 3, 1866',
'Andrew Johnson, State of the Union Address, December 3, 1867',
'Andrew Johnson, State of the Union Address, December 9, 1868',
'Ulysses S. Grant, State of the Union Address, December 6, 1869',
'Ulysses S. Grant, State of the Union Address, December 5, 1870',
'Ulysses S. Grant, State of the Union Address, December 4, 1871',
'Ulysses S. Grant, State of the Union Address, December 2, 1872',
'Ulysses S. Grant, State of the Union Address, December 1, 1873',
'Ulysses S. Grant, State of the Union Address, December 7, 1874',
'Ulysses S. Grant, State of the Union Address, December 7, 1875',
'Ulysses S. Grant, State of the Union Address, December 5, 1876',
'Rutherford B. Hayes, State of the Union Address, December 3, 1877',
'Rutherford B. Hayes, State of the Union Address, December 2, 1878',
'Rutherford B. Hayes, State of the Union Address, December 1, 1879',
'Rutherford B. Hayes, State of the Union Address, December 6, 1880',
'Chester A. Arthur, State of the Union Address, December 6, 1881',
'Chester A. Arthur, State of the Union Address, December 4, 1882',
'Chester A. Arthur, State of the Union Address, December 4, 1883',
'Chester A. Arthur, State of the Union Address, December 1, 1884',
'Grover Cleveland, State of the Union Address, December 8, 1885',
'Grover Cleveland, State of the Union Address, December 6, 1886',
'Grover Cleveland, State of the Union Address, December 6, 1887',
'Grover Cleveland, State of the Union Address, December 3, 1888',
'Benjamin Harrison, State of the Union Address, December 3, 1889',
'Benjamin Harrison, State of the Union Address, December 1, 1890',
'Benjamin Harrison, State of the Union Address, December 9, 1891',
'Benjamin Harrison, State of the Union Address, December 6, 1892',
'William McKinley, State of the Union Address, December 6, 1897',
'William McKinley, State of the Union Address, December 5, 1898',
'William McKinley, State of the Union Address, December 5, 1899',
'William McKinley, State of the Union Address, December 3, 1900',
'Theodore Roosevelt, State of the Union Address, December 3, 1901',
'Theodore Roosevelt, State of the Union Address, December 2, 1902',
'Theodore Roosevelt, State of the Union Address, December 7, 1903',
'Theodore Roosevelt, State of the Union Address, December 6, 1904',
'Theodore Roosevelt, State of the Union Address, December 5, 1905',
'Theodore Roosevelt, State of the Union Address, December 3, 1906',
'Theodore Roosevelt, State of the Union Address, December 3, 1907',
'Theodore Roosevelt, State of the Union Address, December 8, 1908',
'William H. Taft, State of the Union Address, December 7, 1909',
'William H. Taft, State of the Union Address, December 6, 1910',
'William H. Taft, State of the Union Address, December 5, 1911',
'William H. Taft, State of the Union Address, December 3, 1912',
'Woodrow Wilson, State of the Union Address, December 2, 1913',
'Woodrow Wilson, State of the Union Address, December 8, 1914',
'Woodrow Wilson, State of the Union Address, December 7, 1915',
'Woodrow Wilson, State of the Union Address, December 5, 1916',
'Woodrow Wilson, State of the Union Address, December 4, 1917',
'Woodrow Wilson, State of the Union Address, December 2, 1918',
'Woodrow Wilson, State of the Union Address, December 2, 1919',
'Woodrow Wilson, State of the Union Address, December 7, 1920',
'Warren Harding, State of the Union Address, December 6, 1921',
'Warren Harding, State of the Union Address, December 8, 1922',
'Calvin Coolidge, State of the Union Address, December 6, 1923',
'Calvin Coolidge, State of the Union Address, December 3, 1924',
'Calvin Coolidge, State of the Union Address, December 8, 1925',
'Calvin Coolidge, State of the Union Address, December 7, 1926',
'Calvin Coolidge, State of the Union Address, December 6, 1927',
'Calvin Coolidge, State of the Union Address, December 4, 1928',
'Herbert Hoover, State of the Union Address, December 3, 1929',
'Herbert Hoover, State of the Union Address, December 2, 1930',
'Herbert Hoover, State of the Union Address, December 8, 1931',
'Herbert Hoover, State of the Union Address, December 6, 1932',
'Franklin D. Roosevelt, State of the Union Address, January 3, 1934',
'Franklin D. Roosevelt, State of the Union Address, January 4, 1935',
'Franklin D. Roosevelt, State of the Union Address, January 3, 1936',
'Franklin D. Roosevelt, State of the Union Address, January 6, 1937',
'Franklin D. Roosevelt, State of the Union Address, January 3, 1938',
'Franklin D. Roosevelt, State of the Union Address, January 4, 1939',
'Franklin D. Roosevelt, State of the Union Address, January 3, 1940',
'Franklin D. Roosevelt, State of the Union Address, January 6, 1941',
'Franklin D. Roosevelt, State of the Union Address, January 6, 1942',
'Franklin D. Roosevelt, State of the Union Address, January 7, 1943',
'Franklin D. Roosevelt, State of the Union Address, January 11, 1944',
'Franklin D. Roosevelt, State of the Union Address, January 6, 1945',
'Harry S. Truman, State of the Union Address, January 21, 1946',
'Harry S. Truman, State of the Union Address, January 6, 1947',
'Harry S. Truman, State of the Union Address, January 7, 1948',
'Harry S. Truman, State of the Union Address, January 5, 1949',
'Harry S. Truman, State of the Union Address, January 4, 1950',
'Harry S. Truman, State of the Union Address, January 8, 1951',
'Harry S. Truman, State of the Union Address, January 9, 1952',
'Harry S. Truman, State of the Union Address, January 7, 1953',
'Dwight D. Eisenhower, State of the Union Address, February 2, 1953',
'Dwight D. Eisenhower, State of the Union Address, January 7, 1954',
'Dwight D. Eisenhower, State of the Union Address, January 6, 1955',
'Dwight D. Eisenhower, State of the Union Address, January 5, 1956',
'Dwight D. Eisenhower, State of the Union Address, January 10, 1957',
'Dwight D. Eisenhower, State of the Union Address, January 9, 1958',
'Dwight D. Eisenhower, State of the Union Address, January 9, 1959',
'Dwight D. Eisenhower, State of the Union Address, January 7, 1960',
'Dwight D. Eisenhower, State of the Union Address, January 12, 1961',
'John F. Kennedy, State of the Union Address, January 30, 1961',
'John F. Kennedy, State of the Union Address, January 11, 1962',
'John F. Kennedy, State of the Union Address, January 14, 1963',
'Lyndon B. Johnson, State of the Union Address, January 8, 1964',
'Lyndon B. Johnson, State of the Union Address, January 4, 1965',
'Lyndon B. Johnson, State of the Union Address, January 12, 1966',
'Lyndon B. Johnson, State of the Union Address, January 10, 1967',
'Lyndon B. Johnson, State of the Union Address, January 17, 1968',
'Lyndon B. Johnson, State of the Union Address, January 14, 1969',
'Richard Nixon, State of the Union Address, January 22, 1970',
'Richard Nixon, State of the Union Address, January 22, 1971',
'Richard Nixon, State of the Union Address, January 20, 1972',
'Richard Nixon, State of the Union Address, February 2, 1973',
'Richard Nixon, State of the Union Address, January 30, 1974',
'Gerald R. Ford, State of the Union Address, January 15, 1975',
'Gerald R. Ford, State of the Union Address, January 19, 1976',
'Gerald R. Ford, State of the Union Address, January 12, 1977',
'Jimmy Carter, State of the Union Address, January 19, 1978',
'Jimmy Carter, State of the Union Address, January 25, 1979',
'Jimmy Carter, State of the Union Address, January 21, 1980',
'Jimmy Carter, State of the Union Address, January 16, 1981',
'Ronald Reagan, State of the Union Address, January 26, 1982',
'Ronald Reagan, State of the Union Address, January 25, 1983',
'Ronald Reagan, State of the Union Address, January 25, 1984',
'Ronald Reagan, State of the Union Address, February 6, 1985',
'Ronald Reagan, State of the Union Address, February 4, 1986',
'Ronald Reagan, State of the Union Address, January 27, 1987',
'Ronald Reagan, State of the Union Address, January 25, 1988',
'George H.W. Bush, State of the Union Address, January 31, 1990',
'George H.W. Bush, State of the Union Address, January 29, 1991',
'George H.W. Bush, State of the Union Address, January 28, 1992',
'William J. Clinton, State of the Union Address, January 25, 1994',
'William J. Clinton, State of the Union Address, January 24, 1995',
'William J. Clinton, State of the Union Address, January 23, 1996',
'William J. Clinton, State of the Union Address, February 4, 1997',
'William J. Clinton, State of the Union Address, January 27, 1998',
'William J. Clinton, State of the Union Address, January 19, 1999',
'William J. Clinton, State of the Union Address, January 27, 2000',
'George W. Bush, State of the Union Address, February 27, 2001',
'George W. Bush, State of the Union Address, September 20, 2001',
'George W. Bush, State of the Union Address, January 29, 2002',
'George W. Bush, State of the Union Address, January 28, 2003',
'George W. Bush, State of the Union Address, January 20, 2004',
'George W. Bush, State of the Union Address, February 2, 2005',
'George W. Bush, State of the Union Address, January 31, 2006'
]

# Capitalize speech IDs to conform to Inaugural Address data
raw_speech_id_list_sotu_caps = [item.upper() for item in raw_speech_id_list_sotu]

pp.pprint(raw_speech_id_list_sotu_caps[:5])

['GEORGE WASHINGTON, STATE OF THE UNION ADDRESS, JANUARY 8, 1790',
 'GEORGE WASHINGTON, STATE OF THE UNION ADDRESS, DECEMBER 8, 1790',
 'GEORGE WASHINGTON, STATE OF THE UNION ADDRESS, OCTOBER 25, 1791',
 'GEORGE WASHINGTON, STATE OF THE UNION ADDRESS, NOVEMBER 6, 1792',
 'GEORGE WASHINGTON, STATE OF THE UNION ADDRESS, DECEMBER 3, 1793']


To make things a little cleaner, we'll separate the speech IDs from the dates and use each as separate labels in the to-be-created <code>DataFrame</code>.

In [13]:
# Parse out speech dates and append them to a list
speech_date_list_sotu = [re.findall(r'ADDRESS\,\s([A-Z0-9\s\,]+)$', speech)[0]
                         for speech in raw_speech_id_list_sotu_caps]

print speech_date_list_sotu[:5]

['JANUARY 8, 1790', 'DECEMBER 8, 1790', 'OCTOBER 25, 1791', 'NOVEMBER 6, 1792', 'DECEMBER 3, 1793']


In [14]:
# Parse out speech IDs and append them to a list
speech_id_list_sotu = [re.findall(r'^(.*?)\sADDRESS', speech)[0]
                       for speech in raw_speech_id_list_sotu_caps]

print speech_id_list_sotu[:5]

['GEORGE WASHINGTON, STATE OF THE UNION', 'GEORGE WASHINGTON, STATE OF THE UNION', 'GEORGE WASHINGTON, STATE OF THE UNION', 'GEORGE WASHINGTON, STATE OF THE UNION', 'GEORGE WASHINGTON, STATE OF THE UNION']
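
The ID and the date can also be pulled out in a single pass. The sketch below combines the two patterns above into one regular expression with named groups; `TITLE_PATTERN` and `split_title` are hypothetical names introduced here, and the function raises an error on a title that does not fit instead of failing on an empty `findall` result.

```python
import re

# One pattern for both parts: lazy ID up to " ADDRESS, ", then the date.
TITLE_PATTERN = re.compile(
    r'^(?P<speech_id>.*?)\sADDRESS,\s(?P<date>[A-Z0-9\s,]+)$'
)

def split_title(title):
    """Return (speech_id, date) for an upper-cased SOTU title."""
    match = TITLE_PATTERN.match(title)
    if match is None:
        raise ValueError('Unexpected title format: %r' % title)
    return match.group('speech_id'), match.group('date')

example = split_title(
    'GEORGE WASHINGTON, STATE OF THE UNION ADDRESS, JANUARY 8, 1790'
)
```

Applied to every title, this yields the same pairs as zipping the two lists, for example `[split_title(t) for t in raw_speech_id_list_sotu_caps]`.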


In [15]:
# Zip together both lists
speeches_zip_sotu = zip(speech_id_list_sotu, speech_date_list_sotu)

Now for the hard part: let's grab the actual speech text for each State of the Union speech. First, we'll split the full text file; each speech is separated by \*\*\*, so we'll split using that.

In [16]:
raw_speech_sotu = re.split(r'\*\*\*\r\n\r\n', sotu_text)

# Actual speeches start at index 4 and end at index -3
raw_speech_sotu = raw_speech_sotu[4:-3]

print raw_speech_sotu[2]

State of the Union Address
George Washington
October 25, 1791

Fellow-Citizens of the Senate and House of Representatives:

"In vain may we expect peace with the Indians on our frontiers so long
as a lawless set of unprincipled wretches can violate the rights of
hospitality, or infringe the most solemn treaties, without receiving the
punishment they so justly merit."

I meet you upon the present occasion with the feelings which are
naturally inspired by a strong impression of the prosperous situations
of our common country, and by a persuasion equally strong that the
labors of the session which has just commenced will, under the guidance
of a spirit no less prudent than patriotic, issue in measures conducive
to the stability and increase of national prosperity.

Numerous as are the providential blessings which demand our grateful
acknowledgments, the abundance with which another year has again
rewarded the industry of the husbandman is too important to escape
recol

To clean things up just a bit more, we'll remove the title information in each speech text.

In [17]:
# Keep everything after the first four-digit number (the year in the header)
clean_speeches_sotu = [re.findall(r'[0-9]{4}([\w\W\s\S]+)$', speech)[0]
                       for speech in raw_speech_sotu]

print clean_speeches_sotu[0]



Fellow-Citizens of the Senate and House of Representatives:

I embrace with great satisfaction the opportunity which now presents
itself of congratulating you on the present favorable prospects of our
public affairs. The recent accession of the important state of North
Carolina to the Constitution of the United States (of which official
information has been received), the rising credit and respectability of
our country, the general and increasing good will toward the government
of the Union, and the concord, peace, and plenty with which we are
blessed are circumstances auspicious in an eminent degree to our
national prosperity.

In resuming your consultations for the general good you can not but
derive encouragement from the reflection that the measures of the last
session have been as satisfactory to your constituents as the novelty
and difficulty of the work allowed you to hope. Still further to realize
their expectations and to secure the blessings which a gracio
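
Since each chunk begins with a three-line header (title, president, date), the header could in principle be parsed rather than discarded. The sketch below is one way that might look, assuming every chunk follows the layout printed earlier; `sample_sotu` and `parse_sotu` are made up for illustration and this is not the approach used in this notebook.

```python
import re

# Hypothetical miniature of the SOTU file with two speeches.
sample_sotu = (
    "State of the Union Address\r\n"
    "George Washington\r\n"
    "January 8, 1790\r\n"
    "\r\n"
    "First body.\r\n"
    "\r\n"
    "***\r\n\r\n"
    "State of the Union Address\r\n"
    "George Washington\r\n"
    "December 8, 1790\r\n"
    "\r\n"
    "Second body.\r\n"
)

def parse_sotu(speech):
    """Split one chunk into (president, date, body)."""
    # The header ends at the first blank line; it is assumed to be
    # exactly three lines: title, president, date.
    header, body = speech.lstrip().split('\r\n\r\n', 1)
    title, president, date = header.split('\r\n')
    return president, date, body.strip()

parsed = [parse_sotu(chunk) for chunk in re.split(r'\*\*\*\r\n\r\n', sample_sotu)]
```

If any chunk in the real file had a different header layout, the three-way unpacking would raise a `ValueError`, which would at least flag the exception rather than silently mislabel a speech.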

At this point, we can instantiate a new <code>CountVectorizer</code> and tokenize the words in all of the SOTU speeches.

In [18]:
vect_sotu = CountVectorizer(decode_error = 'ignore', stop_words='english')
vect_sotu.fit(clean_speeches_sotu)
raw_feature_names_sotu = [token.encode('ascii','ignore') for token in vect_sotu.get_feature_names()]

As with the Inaugural Addresses, we'll create a document-term matrix that will allow us to then create a <code>DataFrame</code> that counts the number of times each token appears in each speech. We'll also remove all non-word tokens:

In [19]:
# pp.pprint(raw_feature_names_sotu[raw_feature_names_sotu.index('aaron'):raw_feature_names_sotu.index('aaron') + 1])

dtm_sotu = vect_sotu.transform(clean_speeches_sotu)
raw_df_sotu = pd.DataFrame(dtm_sotu.toarray(), columns=vect_sotu.get_feature_names())

In [20]:
# .copy() gives an independent DataFrame, so adding columns below
# does not trigger pandas' SettingWithCopyWarning
df_sotu = raw_df_sotu.iloc[:,raw_feature_names_sotu.index('aaron'):].copy()

df_sotu['title_id'] = speech_id_list_sotu
df_sotu['speech_date'] = speech_date_list_sotu

# Re-arrange columns so title_id and speech_date are first two columns from the left
cols_sotu = df_sotu.columns.tolist()
cols_sotu = cols_sotu[-2:] + cols_sotu[:-2]
df_sotu = df_sotu[cols_sotu]

print df_sotu[:5]

                                title_id       speech_date  aaron  abandon  \
0  GEORGE WASHINGTON, STATE OF THE UNION   JANUARY 8, 1790      0        0   
1  GEORGE WASHINGTON, STATE OF THE UNION  DECEMBER 8, 1790      0        0   
2  GEORGE WASHINGTON, STATE OF THE UNION  OCTOBER 25, 1791      0        0   
3  GEORGE WASHINGTON, STATE OF THE UNION  NOVEMBER 6, 1792      0        0   
4  GEORGE WASHINGTON, STATE OF THE UNION  DECEMBER 3, 1793      0        0   

   abandoned  abandoning  abandonment  abandons  abate  abated   ...     \
0          0           0            0         0      0       0   ...      
1          0           0            0         0      0       0   ...      
2          0           0            0         0      0       0   ...      
3          0           0            0         0      0       0   ...      
4          0           0            0         0      0       0   ...      

   zimbabwe  zimbabwean  zinc  zion  zollverein  zone  zones  zoological  \
0   
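
Slicing the columns from <code>'aaron'</code> onward works because the vocabulary is sorted and tokens that start with a digit sort before alphabetic ones. A more explicit alternative would be to filter the feature names directly. The sketch below is only an illustration: <code>word_tokens</code> and <code>sample_features</code> are made-up names, and the filter is stricter than the slice because it also drops tokens that merely contain a digit.

```python
def word_tokens(feature_names):
    """Return only the purely alphabetic tokens, preserving their order."""
    return [name for name in feature_names if name.isalpha()]

# Hypothetical vocabulary, sorted the way a vectorizer would return it
sample_features = ['000', '1790', '8th', 'aaron', 'abandon', 'zone']
kept = word_tokens(sample_features)
print(kept)
```

The resulting list could then be used to select columns by name instead of by position.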

Phew! We're done with that part. There's still more to do though!

####<i>Step 2: Pre-processing - Joining Data and Selecting Features</i>

Now we'll load the other datasets into <code>DataFrame</code>s:

In [21]:
# Load president details into DataFrame
df_prez = pd.read_csv('../data/presidents.csv')

# Load aggregated president rankings into DataFrame
df_rankings = pd.read_csv('../data/prez_rankings_538.csv')

In [22]:
print df_prez[:5]

   id   name_and_party         name party_letter             party_name  \
0   1  Washington (F)3  Washington             F             Federalist   
1   2     J. Adams (F)    J. Adams             F             Federalist   
2   3   Jefferson (DR)   Jefferson            DR  Democratic Republican   
3   4     Madison (DR)     Madison            DR  Democratic Republican   
4   5      Monroe (DR)      Monroe            DR  Democratic Republican   

        term state_of_birth  birth_date  death_date      religion  \
0  1789–1797            Va.   2/22/1732  12/14/1799  Episcopalian   
1  1797–1801          Mass.  10/30/1735    7/4/1826     Unitarian   
2  1801–1809            Va.   4/13/1743    7/4/1826         Deist   
3  1809–1817            Va.   3/16/1751   6/28/1836  Episcopalian   
4  1817–1825            Va.   4/28/1758    7/4/1831  Episcopalian   

   age_inauguration age_death  
0                57        67  
1                61        90  
2                57        83  
3     

Since the full dataset pulls together data from multiple sources, we'll need to be very strategic about which data we join to create the dataset that the models can interpret most effectively. Ideally, we'd join all the disparate data on a common key, but no such key is available, so one will need to be manufactured before the data can be combined. Luckily, the <code>presidents.csv</code> file has an <code>id</code> feature that can be used to create matching <code>id</code> columns in the other datasets.

It would likely be more difficult to do this programmatically, so, in the interest of keeping things simple, let's write the current <code>DataFrame</code>s to files and manually tag each of the speeches with each president's <code>id</code>. A much larger dataset would probably require writing code to assign <code>id</code>s.
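
For reference, here is a rough idea of what the programmatic route might look like. This is a sketch under assumptions, not what was done here: it assumes every <code>title_id</code> starts with the president's full name followed by a comma (as the State of the Union titles do), and <code>NAME_TO_ID</code> and <code>president_id_from_title</code> are hypothetical names. The lookup table would still have to be typed in by hand, since surnames alone are ambiguous for presidents who share one.

```python
# Hypothetical lookup from full name (as it appears in title_id) to president id
NAME_TO_ID = {
    'GEORGE WASHINGTON': 1,
    'JOHN ADAMS': 2,
    'THOMAS JEFFERSON': 3,
}

def president_id_from_title(title_id, name_to_id=NAME_TO_ID):
    """Return the id for a title like 'NAME, SPEECH TYPE', or None if unknown."""
    name = title_id.split(',')[0].strip().upper()
    return name_to_id.get(name)

print(president_id_from_title('GEORGE WASHINGTON, STATE OF THE UNION'))
```

Titles that return <code>None</code> would flag names missing from the lookup table.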

In [23]:
# Write inaugural speech titles to a file, one per line
with open('../data/df_inaugural.csv', 'w') as file_df_inaugural:
    for row in inaugural_df.title_id:
        file_df_inaugural.write(row + '\n')

# Write SOTU speech titles to a file, one per line
with open('../data/df_sotu.csv', 'w') as file_df_sotu:
    for row in df_sotu.title_id:
        file_df_sotu.write(row + '\n')

So after a little manual labor thanks to magic data entry elves, we can re-load the speech data into the <code>DataFrame</code>s, now with each president's respective <code>id</code>. Then we'll replace the existing columns in each <code>DataFrame</code> with the new data.

In [24]:
# Load newly-tagged data
df_inaugural_id = pd.read_csv('../data/df_inaugural_with_id.csv')
df_sotu_id = pd.read_csv('../data/df_sotu_with_id.csv')

# Concat loaded DataFrames to old DataFrames
df_inaugural = pd.concat([df_inaugural_id, inaugural_df.iloc[:,1:]], axis=1)
df_sotu = pd.concat([df_sotu_id, df_sotu.iloc[:,1:]], axis=1)
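
One caveat: <code>pd.concat(..., axis=1)</code> pairs rows by index, so the step above is only correct if the hand-tagged files kept the original row order. A quick sanity check could look like the sketch below. The frames are tiny stand-ins, <code>titles_match</code> is a hypothetical helper, and it assumes the tagged file still carries a <code>title_id</code> column.

```python
import pandas as pd

# Hypothetical stand-ins for a hand-tagged file and the original DataFrame
tagged = pd.DataFrame({'id': [1, 1, 2],
                       'title_id': ['A', 'B', 'C']})
original = pd.DataFrame({'title_id': ['A', 'B', 'C'],
                         'abandon': [0, 1, 0]})

def titles_match(tagged_df, original_df, column='title_id'):
    """True when both frames list the same titles in the same row order."""
    return list(tagged_df[column]) == list(original_df[column])

if titles_match(tagged, original):
    # Safe to line the rows up side by side
    combined = pd.concat([tagged, original.drop('title_id', axis=1)], axis=1)
    print(combined)
```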

We're finally done with our preliminary <code>DataFrame</code>s, so we can now decide how we can best combine all the data.

In [25]:
# Next step is to combine the inaugural and SOTU DataFrames
# After they've been combined, we can groupby 'id' to get aggregate data per president
# Then we can join party, religion, and ranking data as outcome labels and begin modeling

columns_inaugural = list(df_inaugural.columns.values)
columns_sotu = list(df_sotu.columns.values)

columns_all = list(columns_inaugural)

print 'columns_sotu length:'
print len(columns_sotu)

print 'Original length:'
print len(columns_all)

for col in columns_sotu:
    if col not in columns_all:
        columns_all.append(col)
        
print 'New length:'
print len(columns_all)

columns_sotu length:
22044
Original length:
8506
New length:
22744
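
The loop above checks each column against a list, which rescans the list on every iteration. With over 20,000 columns, tracking seen names in a set may be noticeably faster. A minimal sketch (<code>ordered_union</code> and the sample column lists are illustrative names, not part of the notebook):

```python
def ordered_union(first, second):
    """Return the items of `first`, then any unseen items of `second`, in order."""
    seen = set(first)
    merged = list(first)
    for item in second:
        if item not in seen:
            seen.add(item)
            merged.append(item)
    return merged

# Hypothetical column lists standing in for columns_inaugural and columns_sotu
cols_a = ['id', 'title_id', 'abandon', 'zeal']
cols_b = ['id', 'title_id', 'speech_date', 'aaron', 'abandon']
print(ordered_union(cols_a, cols_b))
```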


The new <code>columns_all</code> list now contains every token in both the Inaugural Address and State of the Union speeches. Next, we'll <code>groupby</code> the <code>id</code> column for each <code>DataFrame</code> to count the number of occurrences of each token for each president per set of speeches, then combine the sums from both to get a total count:

In [26]:
# Create DataFrame containing only id and tokens for inaugurals
df_inaugural_wc = pd.concat([df_inaugural.iloc[:,:1], df_inaugural.iloc[:,3:]], axis=1)

# Use groupby on df_inaugural_wc to get token count per president
df_inaugural_wc_groupby = df_inaugural_wc.groupby('id').sum()
print df_inaugural_wc_groupby[:1]

    abandon  abandoned  abandonment  abate  abdicated  abeyance  abhorring  \
id                                                                           
1         0          0            0      0          0         0          0   

    abide  abiding  abilities  ...   yorktown  young  younger  youngest  \
id                             ...                                        
1       0        0          0  ...          0      0        0         0   

    youth  youthful  zeal  zealous  zealously  zone  
id                                                   
1       0         0     0        0          0     0  

[1 rows x 8503 columns]


In [27]:
# Create DataFrame containing only id and tokens, this time for SOTU
df_sotu_wc = pd.concat([df_sotu.iloc[:,:1], df_sotu.iloc[:,4:]], axis=1)

# Use groupby on df_sotu_wc to get token count per president
df_sotu_wc_groupby = df_sotu_wc.groupby('id').sum()

In [28]:
# Next, we'll need to create a new DataFrame that has a row for each president
# and a column for each token. We will then loop through each row in the newly
# grouped-by DataFrames and add the counts per token per president to the new DataFrame

dict_totals = {}

# Create dictionary with one dictionary for each president
for num in range(1, 45):
    dict_totals[num] = {}

# Add all tokens to each president's dictionary
for item in dict_totals:
    for col in columns_all:
        dict_totals[item][col] = 0
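
The nested dictionaries above are one way to accumulate totals. As a possible alternative, two grouped frames can be added directly: <code>DataFrame.add</code> with <code>fill_value=0</code> aligns on both the index and the columns, so a token or president missing from one frame is treated as zero. The frames below are toy data with made-up names, not the real word counts.

```python
import pandas as pd

# Toy word counts per speech; 'id' is the president id
inaugural_counts = pd.DataFrame({'id': [1, 1, 2],
                                 'abandon': [0, 1, 0],
                                 'zeal': [2, 0, 1]})
sotu_counts = pd.DataFrame({'id': [1, 3],
                            'abandon': [1, 0],
                            'aaron': [0, 4]})

# One row per president, one column per token
inaugural_by_prez = inaugural_counts.groupby('id').sum()
sotu_by_prez = sotu_counts.groupby('id').sum()

# Cells missing from both frames come back as NaN, so fill those with 0 too
totals = inaugural_by_prez.add(sotu_by_prez, fill_value=0).fillna(0).astype(int)
print(totals)
```

The result has one row per <code>id</code> found in either frame and one column per token found in either frame.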

In [None]:
# Objective: Loop through the inaugural and SOTU word count DataFrames and
# add each token's count to the matching president's word count in dict_totals

for df_wc in (df_inaugural_wc_groupby, df_sotu_wc_groupby):
    # iterrows() yields (index, row) pairs; the index is the president's id
    for president, row in df_wc.iterrows():
        totals = dict_totals[int(president)]
        # Pair each token (column name) with its count for this president
        for word, count in zip(row.index, row):
            totals[word] = totals.get(word, 0) + count

# print dict_totals[1] # Test output of first dictionary

# Turn dictionaries into a DataFrame ("MegaDataFrame"): one row per president id
df_mega = pd.DataFrame.from_dict(dict_totals, orient='index')


In [31]:
# Objective: Join df_prez and df_rankings to MegaDataFrame
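
A join along these lines might work once the per-president totals exist. This is a sketch with toy data: <code>token_totals</code> and <code>prez</code> are stand-ins, and it assumes the totals are indexed by the same <code>id</code> used in <code>presidents.csv</code>. The rankings file could presumably be joined the same way once it has an <code>id</code> column, which is an assumption, since its layout isn't shown here.

```python
import pandas as pd

# Toy per-president token totals, indexed by president id
token_totals = pd.DataFrame({'abandon': [2, 0], 'zeal': [2, 1]},
                            index=pd.Index([1, 2], name='id'))

# Toy slice of the president details
prez = pd.DataFrame({'id': [1, 2],
                     'name': ['Washington', 'J. Adams'],
                     'party_name': ['Federalist', 'Federalist']})

# Match the 'id' column on the left to the index on the right
labeled = prez.merge(token_totals, left_on='id', right_index=True, how='inner')
print(labeled)
```

Using <code>how='inner'</code> keeps only presidents that appear in both frames, which avoids rows with missing labels at modeling time.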

##<i>Part III: NLTK Applications, LDA (Optional) and Modeling</i>