#Part I: Data Munging

<b>Data Sources:</b>
<ul>
<li>Inaugural Addresses and States of the Union: Project Gutenberg</li>
<li><a href="http://www.infoplease.com/ipa/A0194030.html">Presidential Data</a>: Infoplease</li>
<li><a href="https://en.wikipedia.org/wiki/Historical_rankings_of_Presidents_of_the_United_States#Five_Thirty_Eight_analysis">Presidential Rankings</a>: Wikipedia/538</li>
</ul>

Structured data can be found [here](https://docs.google.com/spreadsheets/d/1cujFV5JLRivY-k6LMEDCP8_zapHUtwNCdb9Qr8h2gOQ/edit#gid=0).

###<i>Step 1: Parsing Speech Text</i>

First, let's import all the packages we'll need to clean the data:
<ul>
<li><code>re</code> for regular expression functions</li>
<li><code>pprint</code> to make printing more readable</li>
<li><code>string</code> to clean string values</li>
<li><code>pandas</code> because <i>duh</i></li>
<li><code>numpy</code> because math</li>
<li><code>matplotlib.pyplot</code> for charts</li>
<li><code>CountVectorizer</code> for parsing tokens and removing stop words</li>
</ul>

In [2]:
%matplotlib inline

import re
import pprint as pp
import string
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer

Next, we'll open the text files and read them into Python objects that can be parsed.

In [3]:
# Inaugural Address text
with open('../data/inaugural.txt', 'r') as inaugural:
    inaugural_text = inaugural.read()

# State of the Union text
with open('../data/sotu.txt', 'r') as sotu:
    sotu_text = sotu.read()

First, we'll parse the inaugural speech data using functions from the <code>re</code> module. We'll begin by creating a list of speech titles which will act as speech IDs.

In [4]:
raw_speech_id_list = re.findall(r'\*\s\*\s\*\s\*\s\*([\w\s\,\.]+)ADDRESS',
                                inaugural_text)

We'll use a <code>string</code> method (<code>strip</code>) to remove extraneous characters from the title list first. Later, we'll create a <code>dict</code> object that will have each title as a key and each full speech text as a value.

In [5]:
stripped_id_list = [title.strip("\r\n ") for title in raw_speech_id_list]
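
To make the two title steps concrete, here is a minimal sketch that runs the same pattern and the same <code>strip</code> call on a made-up sample. The sample text and the names in it (<code>sample_text</code>, JOHN DOE, JANE ROE) are assumptions for illustration, not content from the Gutenberg file.

```python
import re

# Made-up sample that imitates the "* * * * *" delimiter layout described above
sample_text = (
    "* * * * *\r\n\r\nJOHN DOE, FIRST INAUGURAL ADDRESS\r\n\r\nFellow citizens...\r\n\r\n"
    "* * * * *\r\n\r\nJANE ROE, INAUGURAL ADDRESS\r\n\r\nMy friends...\r\n\r\n"
)

# Capture everything between the delimiter and the word ADDRESS
raw_titles = re.findall(r'\*\s\*\s\*\s\*\s\*([\w\s\,\.]+)ADDRESS', sample_text)

# Trim the surrounding line breaks and spaces from each captured title
titles = [title.strip("\r\n ") for title in raw_titles]
print(titles)
```

Because the character class is greedy, the capture backtracks to the last "ADDRESS" it can reach before the next delimiter, so a speech that contains the word in capitals would probably need a tighter pattern.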

Let's move on to cleaning the speech text since we've cleaned the titles.

All the speeches in the text file are separated by \* \* \* \* \* delimiters, so we'll use <code>re.split</code> to extract all the text between the delimiters.

In [6]:
raw_speech = re.split(r'\*\s\*\s\*\s\*\s\*', inaugural_text)
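
As a rough illustration of what the split returns, the sketch below uses a made-up one-line string (<code>sample_text</code> is an assumption, not the real file). Three delimiters produce four pieces, and only the inner pieces sit between two delimiters.

```python
import re

# Made-up text with three delimiters; real speeches span many lines
sample_text = "front matter* * * * *speech one* * * * *speech two* * * * *end matter"

parts = re.split(r'\*\s\*\s\*\s\*\s\*', sample_text)
print(len(parts))

# Pieces that are enclosed by delimiters on both sides
inner_parts = parts[1:-1]
print(inner_parts)
```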

Next, we'll use <code>re.sub</code> to remove the "Transcriber's Notes" because we only want the speech text for each inaugural address. We'll also ignore the first and last elements in the <code>raw_speech</code> list because they aren't actually speech text.

In [7]:
speeches = [re.sub(r'^([\w\W\s]+)\]', "", speech) for speech in raw_speech[1:-1]]

print(len(speeches))

55
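
The substitution pattern is greedy, which is worth seeing on a small example. This is a sketch on made-up chunks (<code>chunk</code> and <code>chunk_with_brackets</code> are hypothetical): the match runs from the start of the string to the last closing bracket, so it removes the title and the note together, and it would also remove speech text if a bracket appeared later in the body.

```python
import re

pattern = r'^([\w\W\s]+)\]'

# Hypothetical chunk: title, a bracketed note, then the speech body
chunk = ("JOHN DOE, INAUGURAL ADDRESS\r\n\r\n"
         "[Transcriber's note: delivered outdoors.]\r\n\r\n"
         "Fellow citizens, thank you.")
body = re.sub(pattern, "", chunk)
print(repr(body))

# Caveat: a bracket inside the body moves the cut point to that later bracket
chunk_with_brackets = "TITLE [note] Speech text [applause] continues."
body_cut_late = re.sub(pattern, "", chunk_with_brackets)
print(repr(body_cut_late))
```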


Finally, we'll use a combination of <code>re.sub</code> and <code>strip</code> to clean up the extra newline characters in each speech.

In [8]:
clean_speeches = [re.sub(r'\r\n', " ", speech.strip("\r\n"))
                  for speech in speeches]

print(len(clean_speeches))

55


It looks like most of the work is done, but the last three speeches still contain extraneous text (mostly speech IDs) that should be removed, so we'll run <code>re.sub</code> over the speeches one more time to strip that last bit of cruft before moving on.

In [9]:
clean_speeches_inaugural = [re.sub(r'([A-Z0-9\,\.\s]+)\s{3}', "", speech) 
                            for speech in clean_speeches]
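
To see what this last pattern matches, here is a small sketch on a made-up string (<code>speech_tail</code> is hypothetical). It removes a run of capitals, digits, punctuation, and whitespace that is followed by three whitespace characters. In this example the match starts at the final period of the sentence, so the period is removed along with the trailing title.

```python
import re

# Hypothetical end of a cleaned speech with a leftover title after it
speech_tail = "We the people.   JOHN DOE, SECOND INAUGURAL   "

cleaned_tail = re.sub(r'([A-Z0-9\,\.\s]+)\s{3}', "", speech_tail)
print(repr(cleaned_tail))
```

If the closing punctuation matters for later analysis, a pattern anchored to the end of the string may be a safer choice.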

Now that the inaugural data is clean, let's follow similar steps to clean the State of the Union (SOTU) speeches. Again, we'll use functions from the <code>re</code> module to extract the text.

First, we'll create a list of titles that will serve as speech IDs. Rather than extracting them with Python, however, it'll be easier to just copy and paste the SOTU titles and load them into a Python list :)

In [10]:
raw_speech_id_list_sotu = [
'George Washington, State of the Union Address',
'George Washington, State of the Union Address',
'George Washington, State of the Union Address',
'George Washington, State of the Union Address',
'George Washington, State of the Union Address',
'George Washington, State of the Union Address',
'George Washington, State of the Union Address',
'George Washington, State of the Union Address',
'John Adams, State of the Union Address',
'John Adams, State of the Union Address',
'John Adams, State of the Union Address',
'John Adams, State of the Union Address',
'Thomas Jefferson, State of the Union Address',
'Thomas Jefferson, State of the Union Address',
'Thomas Jefferson, State of the Union Address',
'Thomas Jefferson, State of the Union Address',
'Thomas Jefferson, State of the Union Address',
'Thomas Jefferson, State of the Union Address',
'Thomas Jefferson, State of the Union Address',
'Thomas Jefferson, State of the Union Address',
'James Madison, State of the Union Address',
'James Madison, State of the Union Address',
'James Madison, State of the Union Address',
'James Madison, State of the Union Address',
'James Madison, State of the Union Address',
'James Madison, State of the Union Address',
'James Madison, State of the Union Address',
'James Madison, State of the Union Address',
'James Monroe, State of the Union Address',
'James Monroe, State of the Union Address',
'James Monroe, State of the Union Address',
'James Monroe, State of the Union Address',
'James Monroe, State of the Union Address',
'James Monroe, State of the Union Address',
'James Monroe, State of the Union Address',
'James Monroe, State of the Union Address',
'John Quincy Adams, State of the Union Address',
'John Quincy Adams, State of the Union Address',
'John Quincy Adams, State of the Union Address',
'John Quincy Adams, State of the Union Address',
'Andrew Jackson, State of the Union Address',
'Andrew Jackson, State of the Union Address',
'Andrew Jackson, State of the Union Address',
'Andrew Jackson, State of the Union Address',
'Andrew Jackson, State of the Union Address',
'Andrew Jackson, State of the Union Address',
'Andrew Jackson, State of the Union Address',
'Andrew Jackson, State of the Union Address',
'Martin van Buren, State of the Union Address',
'Martin van Buren, State of the Union Address',
'Martin van Buren, State of the Union Address',
'Martin van Buren, State of the Union Address',
'John Tyler, State of the Union Address',
'John Tyler, State of the Union Address',
'John Tyler, State of the Union Address',
'John Tyler, State of the Union Address',
'James Polk, State of the Union Address',
'James Polk, State of the Union Address',
'James Polk, State of the Union Address',
'James Polk, State of the Union Address',
'Zachary Taylor, State of the Union Address',
'Millard Fillmore, State of the Union Address',
'Millard Fillmore, State of the Union Address',
'Millard Fillmore, State of the Union Address',
'Franklin Pierce, State of the Union Address',
'Franklin Pierce, State of the Union Address',
'Franklin Pierce, State of the Union Address',
'Franklin Pierce, State of the Union Address',
'James Buchanan, State of the Union Address',
'James Buchanan, State of the Union Address',
'James Buchanan, State of the Union Address',
'James Buchanan, State of the Union Address',
'Abraham Lincoln, State of the Union Address',
'Abraham Lincoln, State of the Union Address',
'Abraham Lincoln, State of the Union Address',
'Abraham Lincoln, State of the Union Address',
'Andrew Johnson, State of the Union Address',
'Andrew Johnson, State of the Union Address',
'Andrew Johnson, State of the Union Address',
'Andrew Johnson, State of the Union Address',
'Ulysses S. Grant, State of the Union Address',
'Ulysses S. Grant, State of the Union Address',
'Ulysses S. Grant, State of the Union Address',
'Ulysses S. Grant, State of the Union Address',
'Ulysses S. Grant, State of the Union Address',
'Ulysses S. Grant, State of the Union Address',
'Ulysses S. Grant, State of the Union Address',
'Ulysses S. Grant, State of the Union Address',
'Rutherford B. Hayes, State of the Union Address',
'Rutherford B. Hayes, State of the Union Address',
'Rutherford B. Hayes, State of the Union Address',
'Rutherford B. Hayes, State of the Union Address',
'Chester A. Arthur, State of the Union Address',
'Chester A. Arthur, State of the Union Address',
'Chester A. Arthur, State of the Union Address',
'Chester A. Arthur, State of the Union Address',
'Grover Cleveland, State of the Union Address',
'Grover Cleveland, State of the Union Address',
'Grover Cleveland, State of the Union Address',
'Grover Cleveland, State of the Union Address',
'Benjamin Harrison, State of the Union Address',
'Benjamin Harrison, State of the Union Address',
'Benjamin Harrison, State of the Union Address',
'Benjamin Harrison, State of the Union Address',
'William McKinley, State of the Union Address',
'William McKinley, State of the Union Address',
'William McKinley, State of the Union Address',
'William McKinley, State of the Union Address',
'Theodore Roosevelt, State of the Union Address',
'Theodore Roosevelt, State of the Union Address',
'Theodore Roosevelt, State of the Union Address',
'Theodore Roosevelt, State of the Union Address',
'Theodore Roosevelt, State of the Union Address',
'Theodore Roosevelt, State of the Union Address',
'Theodore Roosevelt, State of the Union Address',
'Theodore Roosevelt, State of the Union Address',
'William H. Taft, State of the Union Address',
'William H. Taft, State of the Union Address',
'William H. Taft, State of the Union Address',
'William H. Taft, State of the Union Address',
'Woodrow Wilson, State of the Union Address',
'Woodrow Wilson, State of the Union Address',
'Woodrow Wilson, State of the Union Address',
'Woodrow Wilson, State of the Union Address',
'Woodrow Wilson, State of the Union Address',
'Woodrow Wilson, State of the Union Address',
'Woodrow Wilson, State of the Union Address',
'Woodrow Wilson, State of the Union Address',
'Warren Harding, State of the Union Address',
'Warren Harding, State of the Union Address',
'Calvin Coolidge, State of the Union Address',
'Calvin Coolidge, State of the Union Address',
'Calvin Coolidge, State of the Union Address',
'Calvin Coolidge, State of the Union Address',
'Calvin Coolidge, State of the Union Address',
'Calvin Coolidge, State of the Union Address',
'Herbert Hoover, State of the Union Address',
'Herbert Hoover, State of the Union Address',
'Herbert Hoover, State of the Union Address',
'Herbert Hoover, State of the Union Address',
'Franklin D. Roosevelt, State of the Union Address',
'Franklin D. Roosevelt, State of the Union Address',
'Franklin D. Roosevelt, State of the Union Address',
'Franklin D. Roosevelt, State of the Union Address',
'Franklin D. Roosevelt, State of the Union Address',
'Franklin D. Roosevelt, State of the Union Address',
'Franklin D. Roosevelt, State of the Union Address',
'Franklin D. Roosevelt, State of the Union Address',
'Franklin D. Roosevelt, State of the Union Address',
'Franklin D. Roosevelt, State of the Union Address',
'Franklin D. Roosevelt, State of the Union Address',
'Franklin D. Roosevelt, State of the Union Address',
'Harry S. Truman, State of the Union Address',
'Harry S. Truman, State of the Union Address',
'Harry S. Truman, State of the Union Address',
'Harry S. Truman, State of the Union Address',
'Harry S. Truman, State of the Union Address',
'Harry S. Truman, State of the Union Address',
'Harry S. Truman, State of the Union Address',
'Harry S. Truman, State of the Union Address',
'Dwight D. Eisenhower, State of the Union Address',
'Dwight D. Eisenhower, State of the Union Address',
'Dwight D. Eisenhower, State of the Union Address',
'Dwight D. Eisenhower, State of the Union Address',
'Dwight D. Eisenhower, State of the Union Address',
'Dwight D. Eisenhower, State of the Union Address',
'Dwight D. Eisenhower, State of the Union Address',
'Dwight D. Eisenhower, State of the Union Address',
'Dwight D. Eisenhower, State of the Union Address',
'John F. Kennedy, State of the Union Address',
'John F. Kennedy, State of the Union Address',
'John F. Kennedy, State of the Union Address',
'Lyndon B. Johnson, State of the Union Address',
'Lyndon B. Johnson, State of the Union Address',
'Lyndon B. Johnson, State of the Union Address',
'Lyndon B. Johnson, State of the Union Address',
'Lyndon B. Johnson, State of the Union Address',
'Lyndon B. Johnson, State of the Union Address',
'Richard Nixon, State of the Union Address',
'Richard Nixon, State of the Union Address',
'Richard Nixon, State of the Union Address',
'Richard Nixon, State of the Union Address',
'Richard Nixon, State of the Union Address',
'Gerald R. Ford, State of the Union Address',
'Gerald R. Ford, State of the Union Address',
'Gerald R. Ford, State of the Union Address',
'Jimmy Carter, State of the Union Address',
'Jimmy Carter, State of the Union Address',
'Jimmy Carter, State of the Union Address',
'Jimmy Carter, State of the Union Address',
'Ronald Reagan, State of the Union Address',
'Ronald Reagan, State of the Union Address',
'Ronald Reagan, State of the Union Address',
'Ronald Reagan, State of the Union Address',
'Ronald Reagan, State of the Union Address',
'Ronald Reagan, State of the Union Address',
'Ronald Reagan, State of the Union Address',
'George H.W. Bush, State of the Union Address',
'George H.W. Bush, State of the Union Address',
'George H.W. Bush, State of the Union Address',
'William J. Clinton, State of the Union Address',
'William J. Clinton, State of the Union Address',
'William J. Clinton, State of the Union Address',
'William J. Clinton, State of the Union Address',
'William J. Clinton, State of the Union Address',
'William J. Clinton, State of the Union Address',
'William J. Clinton, State of the Union Address',
'George W. Bush, State of the Union Address',
'George W. Bush, State of the Union Address',
'George W. Bush, State of the Union Address',
'George W. Bush, State of the Union Address',
'George W. Bush, State of the Union Address',
'George W. Bush, State of the Union Address',
'George W. Bush, State of the Union Address'
]

# Capitalize speech IDs to conform to Inaugural Address data
raw_speech_id_list_sotu_caps = [item.upper() for item in raw_speech_id_list_sotu]

pp.pprint(raw_speech_id_list_sotu_caps[:2])

['GEORGE WASHINGTON, STATE OF THE UNION ADDRESS',
 'GEORGE WASHINGTON, STATE OF THE UNION ADDRESS']


In [11]:
# Parse out speech IDs and collect them in a list
speech_id_list_sotu = [re.findall(r'^(.*?)\sADDRESS', speech)[0]
                       for speech in raw_speech_id_list_sotu_caps]

pp.pprint(speech_id_list_sotu[:2])

['GEORGE WASHINGTON, STATE OF THE UNION',
 'GEORGE WASHINGTON, STATE OF THE UNION']
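
A quick sketch of why the titles are upper-cased before parsing: the pattern looks for the literal word ADDRESS, so a mixed-case title produces no match and indexing the empty result with <code>[0]</code> would raise an <code>IndexError</code>. The variable names below are illustrative only.

```python
import re

title = 'John Adams, State of the Union Address'
pattern = r'^(.*?)\sADDRESS'

# Mixed case: "Address" does not match "ADDRESS", so the result is empty
no_match = re.findall(pattern, title)
print(no_match)

# Upper case: the lazy group stops right before " ADDRESS"
speech_id = re.findall(pattern, title.upper())[0]
print(speech_id)
```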


In [12]:
# Combine the speech IDs into a single list
title_list = stripped_id_list + speech_id_list_sotu

Now for the hard part: let's grab the actual speech text for each State of the Union speech. First, we'll split the full text file; each speech is separated by \*\*\*, so we'll split using that.

In [13]:
raw_speech_sotu = re.split(r'\*\*\*\r\n\r\n', sotu_text)

# Actual speeches start at index 4 and end at index -3
raw_speech_sotu = raw_speech_sotu[4:-3]

To clean things up just a bit more, we'll remove the title information in each speech text.

In [14]:
clean_speeches_2 = [re.findall(r'[0-9]{4}([\w\W\s\S]+)$', speech)[0]
                    for speech in raw_speech_sotu]

print(len(clean_speeches_2))

214


In [15]:
# Replace the remaining '\r\n' line breaks in the SOTU speeches with spaces
clean_speeches_sotu = [re.sub(r'\r\n', ' ', speech) for speech in clean_speeches_2]

Now that both sets of speeches have been properly cleaned, we'll concatenate them into a single list of cleaned speeches.

In [16]:
clean_speeches_all = clean_speeches_inaugural + clean_speeches_sotu
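
One thing to watch when pairing IDs with speech text: the SOTU IDs repeat (one per president rather than one per address), so a plain <code>dict</code> keyed on the title alone would keep only the last speech for each repeated ID. The sketch below uses short hypothetical stand-ins for <code>title_list</code> and <code>clean_speeches_all</code> to show the collision and one possible workaround.

```python
# Hypothetical stand-ins for title_list and clean_speeches_all
titles = ['GEORGE WASHINGTON, FIRST INAUGURAL',
          'GEORGE WASHINGTON, STATE OF THE UNION',
          'GEORGE WASHINGTON, STATE OF THE UNION']
speeches = ['inaugural text', 'first sotu text', 'second sotu text']

# A dict keyed on title alone silently overwrites repeated titles
by_title = dict(zip(titles, speeches))
print(len(by_title))

# Keying on (position, title) keeps every speech
by_position = {(i, title): text
               for i, (title, text) in enumerate(zip(titles, speeches))}
print(len(by_position))
```

Using the position as part of the key is only one option; appending the year to each SOTU ID would work as well if that information is available.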

####<i>Tokenization with CountVectorizer</i>

We'll create both a unigram and multigram (bigram and trigram) <code>DataFrame</code> for non-stemmed tokens that can be found in the speeches.

We'll also lowercase all the tokens and keep only those whose document frequency is between 10 and 90 percent. Words that appear in fewer than 10 percent of speeches probably aren't relevant, and words that appear in more than 90 percent behave like stop words, so they don't add any meaningful context to the speeches.

In [17]:
# Create a unigram vector
unigram_vect = CountVectorizer(decode_error = 'ignore',
                               stop_words = 'english',
                               lowercase = True,
                               max_features = 10000,
                               min_df = 0.1,
                               max_df = 0.9)
unigram_vect.fit(clean_speeches_all)
unigram_raw_feature_names = [token.encode('ascii','ignore') for token in unigram_vect.get_feature_names()]
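
The effect of the <code>min_df</code> and <code>max_df</code> limits can be sketched in plain Python. This is not scikit-learn's implementation, just an illustration of the idea under the assumption that both limits are inclusive proportions of documents; <code>document_frequency_filter</code> and the toy documents are hypothetical, and stop-word removal and <code>max_features</code> are left out.

```python
def document_frequency_filter(tokenized_docs, min_df=0.1, max_df=0.9):
    """Keep tokens whose share of documents lies between min_df and max_df."""
    n_docs = len(tokenized_docs)
    doc_counts = {}
    for doc in tokenized_docs:
        for token in set(doc):  # count each token once per document
            doc_counts[token] = doc_counts.get(token, 0) + 1
    return sorted(token for token, count in doc_counts.items()
                  if min_df <= count / float(n_docs) <= max_df)

# Four toy "speeches", already lowercased and tokenized
docs = [
    "the union is strong".split(),
    "the union must endure".split(),
    "the economy is strong".split(),
    "the treaty is signed".split(),
]

kept = document_frequency_filter(docs, min_df=0.5, max_df=0.9)
print(kept)
```

Here "the" is dropped because it appears in every document, and the words that appear only once are dropped because they fall below the lower limit.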

In [18]:
# Create a multigram (bigram, trigram) vector to cut total number of features
multigram_vect = CountVectorizer(decode_error = 'ignore',
                                 stop_words = 'english',
                                 ngram_range = (2,3),
                                 lowercase = True,
                                 max_features = 10000,
                                 min_df = 0.1,
                                 max_df = 0.9)
multigram_vect.fit(clean_speeches_all)
multigram_raw_feature_names = [token.encode('ascii','ignore') for token in multigram_vect.get_feature_names()]

In [19]:
print(len(unigram_raw_feature_names))
print(len(multigram_raw_feature_names))

3596
413


Using bigrams and trigrams in conjunction with the document frequency limits results in a feature space only about 11 percent as large as the unigram space (413 features versus 3,596)! We'll keep both, though, since some of the techniques we'll use below seem to be more effective with unigrams.
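
For readers unfamiliar with n-grams, the following plain-Python sketch shows what an n-gram range of 2 to 3 produces for a short token list. The helper <code>ngrams</code> is hypothetical and only approximates what the vectorizer does internally.

```python
def ngrams(tokens, n_min=2, n_max=3):
    """Return all space-joined n-grams with n_min <= n <= n_max."""
    return [' '.join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

tokens = "year ending june 30".split()
grams = ngrams(tokens)
print(grams)
```

Every n-gram still has to pass the document frequency limits, which is a large part of why so few of them survive compared with single words.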

In [31]:
# Create unigram document-term matrix, then unigram DataFrame
unigram_dtm = unigram_vect.transform(clean_speeches_all)
unigram_df = pd.DataFrame(unigram_dtm.toarray(),
                          columns=unigram_vect.get_feature_names())

# Next, make sure to only include actual words in final DataFrame before adding speech IDs

# Find the index of '90', the last non-word feature
np.where(unigram_df.columns.values == '90') # index: 61

# Create DataFrame that contains only word features (every column after index 61)
unigram_words_df = unigram_df.iloc[:,62:]
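
The hard-coded column index works because numeric tokens sort ahead of alphabetic ones, but it has to be looked up again whenever the vocabulary changes. A possible alternative, sketched here on a made-up list of feature names, is to filter out any feature that contains a digit.

```python
import re

# Hypothetical slice of a sorted vocabulary
feature_names = ['000', '10', '1812', '90', 'ability', 'able', 'zeal']

# Keep only features without any digit
word_features = [name for name in feature_names
                 if not re.search(r'\d', name)]
print(word_features)
```

The resulting list could then be used to select columns, for example <code>unigram_df[word_features]</code>, and the same test would also drop n-grams such as "500 000".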

In [36]:
# Create multigram document-term matrix, then multigram DataFrame
multigram_dtm = multigram_vect.transform(clean_speeches_all)
multigram_df = pd.DataFrame(multigram_dtm.toarray(),
                            columns=multigram_vect.get_feature_names())

# Next, make sure to only include actual word combinations in final DataFrame before adding speech IDs

# Find the index of '500 000', the last non-word feature
np.where(multigram_df.columns.values == '500 000') # index: 13

# Create DataFrame that contains only word features (every column after index 13)
multigram_words_df = multigram_df.iloc[:,14:]

multigram_words_df.head()

Unnamed: 0,act congress,act march,act passed,action congress,action taken,acts congress,administration government,agricultural products,amendment constitution,american citizen,...,world war,worthy consideration,year ago,year ending,year ending 30th,year ending june,year year,years ago,years come,young men
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


####<i>Topic Clustering with Latent Dirichlet Allocation (LDA)</i>

In [22]:
# TODO: Might want to wait for "word-only" matrix before using LDA to eliminate noise
# TODO: In addition, might want to use a DataFrame that groups by president to get topics by president

# Turn the DataFrame into a NumPy array, which will serve as X in LDA
unigram_df_matrix = unigram_df.values

# Import LDA module
import lda

# Create new instance of LDA that will group into 20 topics
# and cycle through 1000 iterations
unigram_model = lda.LDA(n_topics=20, n_iter=1000, random_state=1)
unigram_model.fit(unigram_df_matrix)
unigram_topic_word = unigram_model.topic_word_
unigram_n_top_words = 8

# Note: unigram_raw_feature_names was built from unigram_vect above.
# The [:-n:-1] slice keeps the n - 1 highest-weighted words (7 per topic here).
for i, topic_dist in enumerate(unigram_topic_word):
    unigram_topic_words = np.array(unigram_raw_feature_names)[np.argsort(topic_dist)][:-unigram_n_top_words:-1]
    print('Topic {}: {}'.format(i, ' '.join(unigram_topic_words)))

Topic 0: constitution union state powers rights executive duty
Topic 1: congress treaty act year duties minister territory
Topic 2: years congress economy tax federal support programs
Topic 3: 000 department work years increase year commission
Topic 4: year 000 value fiscal currency report silver
Topic 5: british powers force spain millions coast vessels
Topic 6: general treasury means department subject present measures
Topic 7: mexico texas shall congress army mexican territory
Topic 8: year program federal administration million dollars billion
Topic 9: america american americans work year tonight children
Topic 10: men law work business navy man far
Topic 11: american make order international possible congress needs
Topic 12: world life freedom know men america let
Topic 13: shall best objects commerce fellow laws duties
Topic 14: world free military strength forces today freedom
Topic 15: law shall laws congress service legislation right
Topic 16: present necessary long principle 
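
The topic lists above show seven words even though the top-word setting is 8. That follows from how the reversed slice works, which this plain-Python sketch illustrates with made-up words and weights. Lists are used here instead of NumPy arrays, on the understanding that the slicing rule is the same for both.

```python
# Made-up vocabulary and topic weights
words = ['union', 'treaty', 'economy', 'navy', 'freedom']
weights = [0.10, 0.40, 0.05, 0.30, 0.15]

# Indices sorted by ascending weight, the ordering an argsort would give
order = sorted(range(len(weights)), key=lambda i: weights[i])
ranked = [words[i] for i in order]

n_top_words = 3
top_minus_one = ranked[:-n_top_words:-1]        # stops one short
top_exact = ranked[:-(n_top_words + 1):-1]      # returns n_top_words items
print(top_minus_one)
print(top_exact)
```

Slicing with <code>-(n + 1)</code> would return exactly <code>n</code> words if that is what is wanted.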

In [23]:
# Might want to wait for "word-only" matrix before using LDA to eliminate noise

# Turn the DataFrame into a NumPy array, which will serve as X in LDA
multigram_df_matrix = multigram_df.values

# Import LDA module
import lda

# Create new instance of LDA that will group into 20 topics
# and cycle through 1000 iterations
multigram_model = lda.LDA(n_topics=20, n_iter=1000, random_state=1)
multigram_model.fit(multigram_df_matrix)
multigram_topic_word = multigram_model.topic_word_
multigram_n_top_words = 5

# Note: multigram_raw_feature_names was built from multigram_vect above.
# The [:-n:-1] slice keeps the n - 1 highest-weighted n-grams (4 per topic here).
for i, topic_dist in enumerate(multigram_topic_word):
    multigram_topic_words = np.array(multigram_raw_feature_names)[np.argsort(topic_dist)][:-multigram_n_top_words:-1]
    print('Topic {}: {}'.format(i, ' '.join(multigram_topic_words)))

Topic 0: great britain government united states government united secretary state
Topic 1: federal government past years soviet union state local
Topic 2: federal government recommend congress private enterprise national defense
Topic 3: ending 30th 30th june year ending 30th year ending
Topic 4: june 30 fiscal year year ending ending june 30
Topic 5: american people state union years ago federal government
Topic 6: constitution united states constitution united branch government congress united states
Topic 7: great britain citizens united citizens united states british government
Topic 8: civil service foreign trade past year fiscal year
Topic 9: house representatives fellow citizens public debt present year
Topic 10: attention congress district columbia house representatives session congress
Topic 11: health care social security american people years ago
Topic 12: 000 000 500 000 10 000 great britain
Topic 13: general government fellow citizens federal government public money
Topic 

####<i>Stemming with PorterStemmer</i>

In [24]:
# Next is exploring PorterStemmer. Note that PorterStemmer can only work on a list of unigrams.

# Import PorterStemmer
from nltk.stem.porter import PorterStemmer

# Instantiate a new PorterStemmer object
ps = PorterStemmer()

# Create Python list of tokens in DataFrame
word_list = list(unigram_vect.get_feature_names())

# Use PorterStemmer to stem the tokens
stems = [ps.stem(token) for token in word_list]
stems_set = set(stems) # This reduces the number of elements from the original 3596 unigrams
stems_list = list(stems_set)

print(len(stems_list))

2226


In [25]:
# How do I count the occurrence of each stem in each speech?
test_speech = clean_speeches_all[0].split()
counter = 0
stems_holder = []

for stem in stems_list:
    for word in test_speech:
        if stem == ps.stem(word):
            # print(stem + ': ' + ps.stem(word))
            counter += 1
            stems_holder.append(stem)

print(counter)

# Note: Checking for stems in each speech doesn't work effectively
# It might be better to create a dictionary of all the available stems for all speeches,
#   then use ps.stem() on each word in each speech, then compare each generated stem to the stem dictionary

398
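
The idea in the note above (stem every word of a speech once, then look the stems up in a known vocabulary) could be sketched as follows. <code>toy_stem</code> is a hypothetical stand-in for <code>ps.stem</code> so that the example runs without NLTK, and <code>count_stems</code> and the sample sentence are also made up for illustration.

```python
from collections import Counter

def toy_stem(word):
    """Hypothetical stand-in for ps.stem(): strips a few common suffixes."""
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def count_stems(speech, vocabulary, stem=toy_stem):
    """Count how often each stem from `vocabulary` occurs in `speech`."""
    # One pass over the speech: stem each word and tally the results
    counts = Counter(stem(word) for word in speech.lower().split())
    return {s: counts[s] for s in vocabulary if counts[s]}

vocabulary = {'govern', 'law', 'tax'}
speech = "Laws governed the taxed and the governing tax law"
stem_counts = count_stems(speech, vocabulary)
print(stem_counts)
```

Passing <code>stem=ps.stem</code> would swap in the real stemmer. Each word is stemmed only once per speech, instead of once per stem in the vocabulary as in the nested loop.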


####<i>Word Relevance Using TF-IDF Analysis</i>

In [32]:
# Import TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Unigram tf-idf
tfidf_vect = TfidfVectorizer(decode_error = 'ignore',
                             stop_words = 'english',
                             lowercase = True,
                             max_features = 10000,
                             min_df = 0.1,
                             max_df = 0.9)
tfidf_output = tfidf_vect.fit_transform(clean_speeches_all)

# Turn matrix into a DataFrame
tfidf_df = pd.DataFrame(tfidf_output.toarray(),
                        columns=tfidf_vect.get_feature_names())

array([u'000', u'10', u'100', ..., u'young', u'youth', u'zeal'], dtype=object)
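
To give a feel for the numbers in the tf-idf matrix, here is a plain-Python sketch of the weighting as I understand the vectorizer's documented defaults: raw term counts, a smoothed inverse document frequency, and L2 normalization of each row. The function <code>tfidf_row</code> and the toy documents are hypothetical, and the sketch should be read as an approximation rather than the library's code.

```python
import math

def tfidf_row(doc_tokens, all_docs):
    """Return L2-normalized tf-idf weights for one tokenized document."""
    n_docs = len(all_docs)
    weights = {}
    for term in set(doc_tokens):
        tf = doc_tokens.count(term)                        # raw term count
        df = sum(1 for doc in all_docs if term in doc)     # document frequency
        idf = math.log((1.0 + n_docs) / (1.0 + df)) + 1.0  # smoothed idf
        weights[term] = tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

docs = [["union", "strong", "union"],
        ["union", "treaty"],
        ["economy", "strong"]]

first_row = tfidf_row(docs[0], docs)
second_row = tfidf_row(docs[1], docs)
print(first_row)
print(second_row)
```

In the first document "union" outweighs "strong" because it occurs twice, while in the second document the rarer "treaty" outweighs "union".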

In [None]:
# Panelist suggestions:
# - Use bigrams and trigrams to reduce features and add context
# - Try tf-idf or PCA
# - Use min_df and max_df parameters to eliminate tokens that aren't used
#     that often or are used too often to be meaningful
#     CountVectorizer documentation: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
# - Grid Search can help find min_df and max_df parameters for
#     CountVectorizer: http://scikit-learn.org/stable/modules/grid_search.html

#Part II: Exploratory Data Analysis (EDA)

#Part III: Training and Testing Models