### Classifying Inaugural Speeches

#### Exercise
You are asked to identify the words that are most indicative of a Presidential inaugural speech for a given year.
For this task, you will have to do the following:
* Select the target speeches
* Treat each sentence of every speech as a document; if the sentence comes from one of the target speeches, mark it as positive, otherwise mark it as negative
* Create a dataset that contains the words that appear in each "positive" and in each "negative" sentence; filter the words so that we only see words that appear in a sufficiently large number of sentences.
* Train a classifier
* See the most informative words

NLTK includes the inaugural speeches of all presidents from 1789 to 2009.

In [1]:
from nltk.corpus import inaugural

inaugural.fileids()

['1789-Washington.txt',
 '1793-Washington.txt',
 '1797-Adams.txt',
 '1801-Jefferson.txt',
 '1805-Jefferson.txt',
 '1809-Madison.txt',
 '1813-Madison.txt',
 '1817-Monroe.txt',
 '1821-Monroe.txt',
 '1825-Adams.txt',
 '1829-Jackson.txt',
 '1833-Jackson.txt',
 '1837-VanBuren.txt',
 '1841-Harrison.txt',
 '1845-Polk.txt',
 '1849-Taylor.txt',
 '1853-Pierce.txt',
 '1857-Buchanan.txt',
 '1861-Lincoln.txt',
 '1865-Lincoln.txt',
 '1869-Grant.txt',
 '1873-Grant.txt',
 '1877-Hayes.txt',
 '1881-Garfield.txt',
 '1885-Cleveland.txt',
 '1889-Harrison.txt',
 '1893-Cleveland.txt',
 '1897-McKinley.txt',
 '1901-McKinley.txt',
 '1905-Roosevelt.txt',
 '1909-Taft.txt',
 '1913-Wilson.txt',
 '1917-Wilson.txt',
 '1921-Harding.txt',
 '1925-Coolidge.txt',
 '1929-Hoover.txt',
 '1933-Roosevelt.txt',
 '1937-Roosevelt.txt',
 '1941-Roosevelt.txt',
 '1945-Roosevelt.txt',
 '1949-Truman.txt',
 '1953-Eisenhower.txt',
 '1957-Eisenhower.txt',
 '1961-Kennedy.txt',
 '1965-Johnson.txt',
 '1969-Nixon.txt',
 '1973-Nixon.txt',
 '1

If we want to see the words and/or sentences of these speeches we use the following commands:

In [3]:
speech = '2001-Bush.txt'

# Here is the list of sentences. Each sentence is a list of tokens
inaugural.sents(speech)

[['President', 'Clinton', ',', 'distinguished', 'guests', 'and', 'my', 'fellow', 'citizens', ',', 'the', 'peaceful', 'transfer', 'of', 'authority', 'is', 'rare', 'in', 'history', ',', 'yet', 'common', 'in', 'our', 'country', '.'], ['With', 'a', 'simple', 'oath', ',', 'we', 'affirm', 'old', 'traditions', 'and', 'make', 'new', 'beginnings', '.'], ...]

In [4]:
speech = '2009-Obama.txt'

# Here is the list of sentences. Each sentence is a list of tokens
inaugural.sents(speech)

[['My', 'fellow', 'citizens', ':'], ['I', 'stand', 'here', 'today', 'humbled', 'by', 'the', 'task', 'before', 'us', ',', 'grateful', 'for', 'the', 'trust', 'you', 'have', 'bestowed', ',', 'mindful', 'of', 'the', 'sacrifices', 'borne', 'by', 'our', 'ancestors', '.'], ...]

In [5]:
# Here is the first sentence
inaugural.sents(speech)[0]

['My', 'fellow', 'citizens', ':']

In [6]:
# Here is the second sentence
inaugural.sents(speech)[1]

['I',
 'stand',
 'here',
 'today',
 'humbled',
 'by',
 'the',
 'task',
 'before',
 'us',
 ',',
 'grateful',
 'for',
 'the',
 'trust',
 'you',
 'have',
 'bestowed',
 ',',
 'mindful',
 'of',
 'the',
 'sacrifices',
 'borne',
 'by',
 'our',
 'ancestors',
 '.']

In [7]:
# And here is the list of tokens
list(inaugural.words(speech))

['My',
 'fellow',
 'citizens',
 ':',
 'I',
 'stand',
 'here',
 'today',
 'humbled',
 'by',
 'the',
 'task',
 'before',
 'us',
 ',',
 'grateful',
 'for',
 'the',
 'trust',
 'you',
 'have',
 'bestowed',
 ',',
 'mindful',
 'of',
 'the',
 'sacrifices',
 'borne',
 'by',
 'our',
 'ancestors',
 '.',
 'I',
 'thank',
 'President',
 'Bush',
 'for',
 'his',
 'service',
 'to',
 'our',
 'nation',
 ',',
 'as',
 'well',
 'as',
 'the',
 'generosity',
 'and',
 'cooperation',
 'he',
 'has',
 'shown',
 'throughout',
 'this',
 'transition',
 '.',
 'Forty',
 '-',
 'four',
 'Americans',
 'have',
 'now',
 'taken',
 'the',
 'presidential',
 'oath',
 '.',
 'The',
 'words',
 'have',
 'been',
 'spoken',
 'during',
 'rising',
 'tides',
 'of',
 'prosperity',
 'and',
 'the',
 'still',
 'waters',
 'of',
 'peace',
 '.',
 'Yet',
 ',',
 'every',
 'so',
 'often',
 'the',
 'oath',
 'is',
 'taken',
 'amidst',
 'gathering',
 'clouds',
 'and',
 'raging',
 'storms',
 '.',
 'At',
 'these',
 'moments',
 ',',
 'America',
 'has'

In [8]:
import nltk

# And here is the raw text
raw_text = inaugural.raw(speech)

# And as a reminder, here are the NLTK commands for
# splitting the text into sentences, or tokenizing it
# (See part A for more details)
sentences = nltk.sent_tokenize(raw_text)
tokens = nltk.word_tokenize(raw_text)
nltk_text = nltk.Text(tokens)

In [8]:
# Here is the list of (non-tokenized) sentences
sentences

[u'My fellow citizens:\n\nI stand here today humbled by the task before us, grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors.',
 u'I thank President Bush for his service to our nation, as well as the generosity and cooperation he has shown throughout this transition.',
 u'Forty-four Americans have now taken the presidential oath.',
 u'The words have been spoken during rising tides of prosperity and the still waters of peace.',
 u'Yet, every so often the oath is taken amidst gathering clouds and raging storms.',
 u'At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents.',
 u'So it has been.',
 u'So it must be with this generation of Americans.',
 u'That we are in the midst of crisis is now well understood.',
 u'Our nation is at war, against a far-reaching network of violence and

In [9]:
# And here is an example of doing POS tagging on the second sentence
sent_tokens = nltk.word_tokenize(sentences[1])
nltk.pos_tag(sent_tokens)

[('I', 'PRP'),
 ('thank', 'VBP'),
 ('President', 'NNP'),
 ('Bush', 'NNP'),
 ('for', 'IN'),
 ('his', 'PRP$'),
 ('service', 'NN'),
 ('to', 'TO'),
 ('our', 'PRP$'),
 ('nation', 'NN'),
 (',', ','),
 ('as', 'RB'),
 ('well', 'RB'),
 ('as', 'IN'),
 ('the', 'DT'),
 ('generosity', 'NN'),
 ('and', 'CC'),
 ('cooperation', 'NN'),
 ('he', 'PRP'),
 ('has', 'VBZ'),
 ('shown', 'VBN'),
 ('throughout', 'IN'),
 ('this', 'DT'),
 ('transition', 'NN'),
 ('.', '.')]

### Exercise

In [50]:
# Here we define our "target" class: all the speeches
# delivered from 1949 onwards
target_speeches = ['1949-Truman.txt',
 '1953-Eisenhower.txt',
 '1957-Eisenhower.txt',
 '1961-Kennedy.txt',
 '1965-Johnson.txt',
 '1969-Nixon.txt',
 '1973-Nixon.txt',
 '1977-Carter.txt',
 '1981-Reagan.txt',
 '1985-Reagan.txt',
 '1989-Bush.txt',
 '1993-Clinton.txt',
 '1997-Clinton.txt',
 '2001-Bush.txt',
 '2005-Bush.txt',
 '2009-Obama.txt']

# Alternatively, to target the 18th- and 19th-century speeches instead
# (plus the 1901 speech):
#target_speeches = [s for s in inaugural.fileids() 
#                   if s.startswith('17') or s.startswith('18') or s.startswith('1901')]

non_target_speeches = [s for s in inaugural.fileids() if s not in target_speeches]
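
Because every file ID begins with a four-digit year, the target set can also be chosen by year range instead of typing out the list. The sketch below assumes that naming pattern holds for all file IDs; `split_by_year` is a hypothetical helper, not an NLTK function, and the sample IDs are copied from the listing above.

```python
# Hypothetical helper (not part of NLTK): split file IDs by the year encoded
# in the first four characters of names such as '1949-Truman.txt'.
def split_by_year(fileids, first_year, last_year):
    target, non_target = [], []
    for fileid in fileids:
        year = int(fileid[:4])
        if first_year <= year <= last_year:
            target.append(fileid)
        else:
            non_target.append(fileid)
    return target, non_target

# A few IDs copied from the inaugural.fileids() listing; in the notebook one
# would pass inaugural.fileids() instead.
sample_ids = ['1789-Washington.txt', '1945-Roosevelt.txt',
              '1949-Truman.txt', '2009-Obama.txt']
target, non_target = split_by_year(sample_ids, 1949, 2009)
print(target)
print(non_target)
```

Parsing the year keeps the selection in one place, so changing the period of interest only means changing the two bounds.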

In [51]:
# We go over all speeches and extract their sentences (each sentence is a raw string).
# Sentences from a target speech are labeled "pos"; all other sentences are labeled "neg".

# The data will contain tuples of the form ("pos", sentence) or ("neg", sentence)
data = []
speeches = inaugural.fileids()

for speech in speeches:
    
    if speech in target_speeches:
        label = "pos"
    else:
        label = "neg"
    # If we want to operate with the raw text
    raw_text = inaugural.raw(speech)
    sentences = nltk.sent_tokenize(raw_text)
    # Or, alternatively, to use the already tokenized sentences
    # (features() below would then have to skip nltk.word_tokenize)
    # sentences = list(inaugural.sents(speech))
    
    # We now add the sentences in our dataset, with the appropriate tag
    # We create a list comprehension for each sentence in the speech
    # and then we add all these elements into "data"
    data.extend( [(label, sent) for sent in sentences] )
    

In [52]:
len(data)

4839

In [53]:
# This is the number of positive sentences
len([tag for (tag, s) in data if tag=='pos'])

1584

In [54]:
# This is the number of negative sentences
len([tag for (tag, s) in data if tag=='neg'])

3255
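
The two classes are unbalanced, so any accuracy figure should be read against the score of a classifier that always predicts the larger class. The sketch below works that baseline out from the counts reported above; the label list is a stand-in rebuilt from those counts so the snippet runs on its own.

```python
from collections import Counter

# Counts reported by the cells above: 1584 'pos' and 3255 'neg' sentences.
# A stand-in list of labels is rebuilt here so the sketch runs on its own;
# in the notebook one could use Counter(tag for (tag, s) in data) instead.
labels = ['pos'] * 1584 + ['neg'] * 3255
label_counts = Counter(labels)

# Accuracy of always predicting the most frequent label
majority_label, majority_count = label_counts.most_common(1)[0]
baseline_accuracy = majority_count / len(labels)

print(label_counts)
print(majority_label, round(baseline_accuracy, 3))
```

With these counts, always answering "neg" is already right about two times out of three, which is the floor a trained classifier has to beat.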

In [55]:
data

[('neg',
  'Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month.'),
 ('neg',
  'On the one hand, I was summoned by my Country, whose voice I can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in my flattering hopes, with an immutable decision, as the asylum of my declining years -- a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination, and of frequent interruptions in my health to the gradual waste committed on it by time.'),
 ('neg',
  'On the other hand, the magnitude and difficulty of the trust to which the voice of my country called me, being sufficient to awaken in the wisest and most experienced of her citizens a distrustful sc

In [59]:
# This is our function that takes a sentence as input, extracts the
# features, and creates the feature dictionary that we will use for
# training. We use a binary representation of our features (either the feature
# appears in the sentence or not). Notice that we only set to True the
# features that appear; the remaining ones are implicitly treated as None/False
def features(sentence):
    feature_dict = dict()
    tokens = nltk.word_tokenize(sentence)
    pos_tagged_tokens = nltk.pos_tag(tokens)
    for token, pos_tag in pos_tagged_tokens:
        # To keep only a specific part of speech (e.g. adjectives) as features,
        # guard the assignment with: if pos_tag.startswith("J"):
        feature_dict[token+"/"+pos_tag] = True
    return feature_dict

In [60]:
# Example: Here is our first data point/sentence
data[0]

('neg',
 'Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month.')

In [61]:
# Let's see the featurized version of the sentence
features(data[0][1])

{',/,': True,
 './.': True,
 '14th/JJ': True,
 ':/:': True,
 'Among/IN': True,
 'Fellow-Citizens/NNS': True,
 'House/NNP': True,
 'Representatives/NNPS': True,
 'Senate/NNP': True,
 'and/CC': True,
 'anxieties/NNS': True,
 'by/IN': True,
 'could/MD': True,
 'day/NN': True,
 'event/NN': True,
 'filled/VBN': True,
 'greater/JJR': True,
 'have/VB': True,
 'incident/NN': True,
 'life/NN': True,
 'me/PRP': True,
 'month/NN': True,
 'no/DT': True,
 'notification/NN': True,
 'of/IN': True,
 'on/IN': True,
 'order/NN': True,
 'present/JJ': True,
 'received/VBD': True,
 'than/IN': True,
 'that/DT': True,
 'the/DT': True,
 'to/TO': True,
 'transmitted/VBN': True,
 'vicissitudes/NNS': True,
 'was/VBD': True,
 'which/WDT': True,
 'with/IN': True,
 'your/PRP$': True}
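
The feature function keeps every token, with a commented-out hint that the features could be restricted to one part of speech. The sketch below shows one way to make that restriction a parameter. It assumes the input is already a list of `(token, pos_tag)` pairs of the kind `nltk.pos_tag` returns; `features_from_tagged` and `pos_prefixes` are hypothetical names, and the sample pairs are taken from the featurized sentence above.

```python
# Hypothetical variant of features() that works on already tagged tokens,
# i.e. a list of (token, pos_tag) pairs such as nltk.pos_tag returns.
def features_from_tagged(tagged_tokens, pos_prefixes=None):
    feats = {}
    for token, pos_tag in tagged_tokens:
        # Keep the token if no filter is given or if its tag starts with
        # one of the requested prefixes (e.g. "J" matches JJ, JJR, JJS).
        if pos_prefixes is None or pos_tag.startswith(tuple(pos_prefixes)):
            feats[token + "/" + pos_tag] = True
    return feats

# Pairs taken from the featurized sentence shown above
tagged = [('greater', 'JJR'), ('anxieties', 'NNS'),
          ('present', 'JJ'), ('month', 'NN')]
adjectives_only = features_from_tagged(tagged, pos_prefixes=["J"])
print(adjectives_only)
```

Passing `pos_prefixes=["N", "J"]` would keep nouns and adjectives, while leaving it at `None` reproduces the unfiltered behavior.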

In [62]:
# So, now we go through all elements in the "data" list (tag, sentence)
# and we apply the "features" function to each sentence, to get back its
# featurized form
featurized_data = [(features(sentence), class_label) 
                   for (class_label, sentence) in data]
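
The exercise asks for words that appear in a sufficiently large number of sentences, but the featurization above keeps every token. One possible way to add that filter is sketched below. `filter_rare_features` and `min_sentences` are hypothetical names, and the toy input is made up for illustration; it only mimics the shape of `featurized_data`.

```python
from collections import Counter

# Hypothetical helper: drop features that occur in fewer than min_sentences
# of the featurized sentences. Input has the same shape as featurized_data:
# a list of (feature_dict, class_label) pairs.
def filter_rare_features(featurized, min_sentences):
    sentence_counts = Counter()
    for feats, _label in featurized:
        # Each dict holds a feature at most once, so this counts sentences
        sentence_counts.update(feats.keys())
    return [({name: value for name, value in feats.items()
              if sentence_counts[name] >= min_sentences}, label)
            for feats, label in featurized]

# Toy stand-in for featurized_data (made-up sentences, not corpus output)
toy = [({'the/DT': True, 'policy/NN': True}, 'neg'),
       ({'the/DT': True, 'dream/NN': True}, 'pos'),
       ({'the/DT': True, 'policy/NN': True}, 'neg')]
filtered = filter_rare_features(toy, min_sentences=2)
print(filtered)
```

In the notebook this could be applied as `featurized_data = filter_rare_features(featurized_data, min_sentences=10)`, where the threshold of 10 is an arbitrary starting point. Counting on the training portion only would avoid letting test sentences influence which features survive.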

In [63]:
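import random

# Shuffle so that the test set is a random sample of sentences
# (no seed is set, so the numbers below vary between runs)
random.shuffle(featurized_data)
# Hold out the first 500 shuffled sentences for testing; train on the rest
test_set_size = 500
train_set, test_set = featurized_data[test_set_size:], featurized_data[:test_set_size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(50)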

0.806
Most Informative Features
               policy/NN = True              neg : pos    =     26.1 : 1.0
                 help/VB = True              pos : neg    =     23.5 : 1.0
        Constitution/NNP = True              neg : pos    =     20.0 : 1.0
               State/NNP = True              neg : pos    =     16.7 : 1.0
           protection/NN = True              neg : pos    =     15.4 : 1.0
              opinion/NN = True              neg : pos    =     14.7 : 1.0
                story/NN = True              pos : neg    =     14.4 : 1.0
              journey/NN = True              pos : neg    =     14.4 : 1.0
                dream/NN = True              pos : neg    =     14.4 : 1.0
            condition/NN = True              neg : pos    =     14.4 : 1.0
              general/JJ = True              neg : pos    =     13.4 : 1.0
               proper/JJ = True              neg : pos    =     13.4 : 1.0
           commitment/NN = True              pos : neg    =     13.0