### Classifying Inaugural Speeches

#### Exercise
You are asked to identify the words that are most indicative of presidential inaugural speeches from a given period.
For this task, you will have to do the following:
* Select the target speeches
* Treat each sentence in every speech as a document; if the sentence comes from one of the target speeches, mark it as positive, otherwise mark it as negative
* Create a dataset that contains the words that appear in each "positive" and in each "negative" sentence; filter the words so that we keep only words that appear in a sufficiently large number of sentences.
* Train a classifier
* See the most informative words

NLTK ships a corpus of the presidential inaugural speeches from 1789 to 2009 (newer versions of the corpus may include later speeches).

In [None]:
from nltk.corpus import inaugural

inaugural.fileids()

If we want to see the words and/or sentences of these speeches we use the following commands:

In [None]:
speech = '2009-Obama.txt'

# Here is the list of sentences. Each sentence is a list of tokens
inaugural.sents(speech)

In [None]:
speech = '2001-Bush.txt'

# Here is the list of sentences. Each sentence is a list of tokens
inaugural.sents(speech)

In [None]:
# Here is the first sentence
inaugural.sents(speech)[0]

In [None]:
# Here is the second sentence
inaugural.sents(speech)[1]

In [None]:
# And here is the list of tokens
list(inaugural.words(speech))

In [None]:
import nltk

# And here is the raw text
raw_text = inaugural.raw(speech)

# And as a reminder, here are the NLTK commands for
# splitting the text into sentences, or tokenizing it
# (See part A for more details)
sentences = nltk.sent_tokenize(raw_text)
tokens = nltk.word_tokenize(raw_text)
nltk_text = nltk.Text(tokens)

In [None]:
# Here is the list of (non-tokenized) sentences
sentences

In [None]:
# And here is an example of doing POS tagging on the second sentence
sent_tokens = nltk.word_tokenize(sentences[1])
nltk.pos_tag(sent_tokens)

### Exercise

In [None]:
# Here we define our "target" class: all the speeches
# from 1949 (Truman) through 2009 (Obama)
target_speeches = ['1949-Truman.txt',
 '1953-Eisenhower.txt',
 '1957-Eisenhower.txt',
 '1961-Kennedy.txt',
 '1965-Johnson.txt',
 '1969-Nixon.txt',
 '1973-Nixon.txt',
 '1977-Carter.txt',
 '1981-Reagan.txt',
 '1985-Reagan.txt',
 '1989-Bush.txt',
 '1993-Clinton.txt',
 '1997-Clinton.txt',
 '2001-Bush.txt',
 '2005-Bush.txt',
 '2009-Obama.txt']

# A different target class, written more compactly: the speeches
# from 1789 through 1901
#target_speeches = [s for s in inaugural.fileids() 
#                   if s.startswith('17') or s.startswith('18') or s.startswith('1901')]

non_target_speeches = [s for s in inaugural.fileids() if s not in target_speeches]
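
Because every file id starts with a four-digit year, a target class can also be selected by year range instead of listing the files by hand. The following is a minimal sketch under that assumption; `speech_year`, `select_by_year`, and the short `fileids` list are illustrative names and sample data, not part of NLTK.

```python
# Hypothetical sample of file ids, in the "<year>-<name>.txt" format
# returned by inaugural.fileids()
fileids = ['1789-Washington.txt', '1861-Lincoln.txt', '1901-McKinley.txt',
           '1949-Truman.txt', '1961-Kennedy.txt', '2009-Obama.txt']


def speech_year(fileid):
    """Return the four-digit year at the start of a file id."""
    return int(fileid[:4])


def select_by_year(ids, first_year, last_year):
    """Keep the file ids whose year falls in [first_year, last_year]."""
    return [f for f in ids if first_year <= speech_year(f) <= last_year]


target = select_by_year(fileids, 1949, 2009)
non_target = [f for f in fileids if f not in target]

print(target)
print(non_target)
```

With the real corpus, passing `inaugural.fileids()` in place of the sample list would apply the same selection to every speech.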

In [None]:
# We go over all speeches and extract their sentences.
# Sentences from a target speech are labeled "pos"; all others are labeled "neg"

# The data will contain tuples of the form ("pos", sentence) and ("neg", sentence)
data = []
speeches = inaugural.fileids()

for speech in speeches:
    
    if speech in target_speeches:
        label = "pos"
    else:
        label = "neg"
    # Work from the raw text, splitting it into sentences (plain strings)
    raw_text = inaugural.raw(speech)
    sentences = nltk.sent_tokenize(raw_text)
    # Alternatively, use the already tokenized sentences. Note that the
    # feature extractor must then skip its own word_tokenize step
    # sentences = list(inaugural.sents(speech))

    # Add the sentences to our dataset with the appropriate label:
    # the list comprehension builds one (label, sentence) tuple per
    # sentence, and extend() appends them all to "data"
    data.extend([(label, sent) for sent in sentences])
    

In [None]:
len(data)

In [None]:
# This is the number of positive sentences
len([tag for (tag, s) in data if tag=='pos'])

In [None]:
# This is the number of negative sentences
len([tag for (tag, s) in data if tag=='neg'])
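
Both counts can also be obtained in a single pass with `collections.Counter`, which makes it easy to see how unbalanced the two classes are. This is a small sketch on made-up data; `sample_data` is a hypothetical stand-in for the `data` list built above.

```python
from collections import Counter

# Hypothetical miniature version of the (label, sentence) list;
# the sentences are invented placeholders
sample_data = [
    ("pos", "We face new challenges."),
    ("neg", "The Union must be preserved."),
    ("neg", "Fellow citizens, I thank you."),
    ("pos", "Let us begin anew."),
    ("neg", "The Constitution guides us."),
]

# Count how many sentences carry each label
label_counts = Counter(label for label, _ in sample_data)

# Share of positive sentences, a quick check of class balance
positive_share = label_counts["pos"] / len(sample_data)

print(label_counts)
print(positive_share)
```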

In [None]:
data

In [None]:
# This is our function that takes a sentence as input, extracts the
# features, and creates the feature dictionary that we will use for
# training. We use a binary representation of our features (either the feature
# appears in the sentence or it does not). Notice that we only set to True the
# features that appear; the remaining ones will be implicitly set to "None"/False
def features(sentence):
    feature_dict = dict()
    tokens = nltk.word_tokenize(sentence)
    pos_tagged_tokens = nltk.pos_tag(tokens)
    for token, pos_tag in pos_tagged_tokens:
        # To keep only a specific part of speech (e.g. adjectives) as features,
        # uncomment the "if" below and indent the assignment under it
        # if pos_tag.startswith("J"):
        feature_dict[token + "/" + pos_tag] = True
    return feature_dict

In [None]:
# Example: Here is our first data point/sentence
data[0]

In [None]:
# Let's see the featurized version of the sentence
features(data[0][1])

In [None]:
# So, now we go through all elements in the "data" list (tag, sentence)
# and we apply the "features" function to each sentence, to get back its
# featurized form. Note the order (features, label) expected by the
# NLTK classifiers
featurized_data = [(features(sentence), class_label) 
                   for (class_label, sentence) in data]
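
The exercise also asks to keep only the words that appear in a sufficiently large number of sentences. One possible way to do this, sketched below, is to count in how many sentences each feature occurs and drop the rare ones. The helper `filter_rare_features`, its `min_sentences` threshold, and the `toy` data are hypothetical names chosen for illustration; a suitable threshold for the real data would have to be found by experiment.

```python
from collections import Counter


def filter_rare_features(featurized, min_sentences=2):
    """Drop features that occur in fewer than min_sentences feature dicts."""
    # Count each feature once per sentence (its "document frequency")
    doc_freq = Counter()
    for feats, _label in featurized:
        doc_freq.update(feats.keys())

    # Rebuild every feature dict, keeping only the frequent features
    return [
        ({f: v for f, v in feats.items() if doc_freq[f] >= min_sentences}, label)
        for feats, label in featurized
    ]


# Toy (features, label) pairs in the same shape as featurized_data
toy = [
    ({"freedom/NN": True, "the/DT": True}, "pos"),
    ({"union/NN": True, "the/DT": True}, "neg"),
    ({"freedom/NN": True, "world/NN": True}, "pos"),
]

filtered = filter_rare_features(toy, min_sentences=2)
print(filtered)
```

In this toy example, `union/NN` and `world/NN` each occur in a single sentence and are removed, while `freedom/NN` and `the/DT` survive.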

In [None]:
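import random

# Shuffle so that the test set is a random sample of sentences
random.shuffle(featurized_data)
# Hold out the first 500 sentences for testing and train on the rest
test_set_size = 500
train_set, test_set = featurized_data[test_set_size:], featurized_data[:test_set_size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
# Accuracy on the held-out sentences
print(nltk.classify.accuracy(classifier, test_set))
# The 50 features that best separate the two classes
classifier.show_most_informative_features(50)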
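
To build intuition for what "most informative" means, the sketch below ranks features by how much more likely they are in one class than in the other. It is a simplified illustration, not NLTK's exact computation: the function `likelihood_ratios`, the add-0.5 smoothing, and the `toy` data are all assumptions made for this example.

```python
from collections import Counter


def likelihood_ratios(featurized, smoothing=0.5):
    """Return {feature: P(present | pos) / P(present | neg)} with smoothing."""
    label_totals = Counter(label for _, label in featurized)

    # Number of sentences in each class that contain each feature
    present = {"pos": Counter(), "neg": Counter()}
    for feats, label in featurized:
        present[label].update(feats.keys())

    ratios = {}
    for feature in set(present["pos"]) | set(present["neg"]):
        # Smoothed share of sentences containing the feature; there are two
        # outcomes (present or absent), hence 2 * smoothing in the denominator
        p_pos = (present["pos"][feature] + smoothing) / (label_totals["pos"] + 2 * smoothing)
        p_neg = (present["neg"][feature] + smoothing) / (label_totals["neg"] + 2 * smoothing)
        ratios[feature] = p_pos / p_neg
    return ratios


# Toy (features, label) pairs in the same shape as featurized_data
toy = [
    ({"world/NN": True, "the/DT": True}, "pos"),
    ({"world/NN": True}, "pos"),
    ({"the/DT": True, "union/NN": True}, "neg"),
    ({"the/DT": True}, "neg"),
]

ratios = likelihood_ratios(toy)

# A feature is informative when its ratio is far from 1 in either direction
ranked = sorted(ratios, key=lambda f: max(ratios[f], 1 / ratios[f]), reverse=True)
print(ranked)
```

A ratio well above 1 points to the "pos" class and a ratio well below 1 points to the "neg" class, while a feature that is about equally common in both classes says little about either.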