# About The Data
For this project, I have used the corpus of State of the Union addresses with the goal of classifing the texts using a combination of supervised and unsupervised learning techniques. The corpus comes from nltk and contains 65 State of the Union addresses from ten different US presidents. For my project, I have considered only 4 presidents.

# Sections:

•	Text Processing

•	supervised Feature Generation

•	Unsupervised Feature Generation

•	Supervised Learning Models

•	Comparing Supervised and Unsupervised Learning for NLP Applications


In [10]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import re
import spacy
from nltk.corpus import state_union, stopwords
from collections import Counter

from sklearn.model_selection import cross_val_score,cross_val_predict, GridSearchCV, train_test_split
from sklearn import ensemble
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn import neural_network
from sklearn.cluster import MeanShift, estimate_bandwidth,  KMeans, MiniBatchKMeans
from sklearn.metrics import confusion_matrix, auc, precision_recall_curve
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

In [309]:
from sklearn.svm import SVC

# Text Processing
The first step is to get each speech for the corpus. 
I will read the files president wise and and store them in a list and then create a loop to break each file into sentence level documents.

# Reading files

In [11]:
Truman = []

Johnson = []
Clinton = []
GWBush = []
for i in state_union.fileids():
    if 'Truman' in i:
        Truman.append(state_union.raw(i))

    if 'Johnson' in i:
        Johnson.append(state_union.raw(i))
    if 'Clinton' in i:
        Clinton.append(state_union.raw(i))
    if 'GWBush' in i:
        GWBush.append(state_union.raw(i))

In [12]:
Truman_speech = ' '.join(Truman)

Johnson_speech = ' '.join(Johnson)
Clinton_speech = ' '.join(Clinton)
GWBush_speech = ' '.join(GWBush)

# Cleaning

In [13]:
# Utility function for standard text cleaning.
def text_cleaner(text):
    
    text = ' '.join(text.split())
     # Get rid of words in square brackets.
    text = re.sub("[\[].*?[\]]", "", text)
    return text



In [15]:
def remove_first_sentence(text):
    text2 = ''
    text2 = text.replace(text[0:text.find('\n')],'')
    return(text2)

In [16]:
Clinton_speech = remove_first_sentence(Clinton_speech)
Clinton_speech_cleaned = ''.join(text_cleaner(Clinton_speech))

Truman = remove_first_sentence(Truman_speech)
Truman_speech_cleaned = ''.join(text_cleaner(Truman_speech))

Johnson = remove_first_sentence(Johnson_speech)
Johnson_speech_cleaned = ''.join(text_cleaner(Johnson_speech))

GWBush = remove_first_sentence(GWBush_speech)
GWBush_speech_cleaned = ''.join(text_cleaner(GWBush_speech))

Compare the few lines of one file before cleaning and after cleaning.

In [230]:
Clinton_speech[0:500]

'\n \nFebruary 17, 1993 \n\nMr. President, Mr. Speaker, Members of the House and the Senate, distinguished Americans here as visitors in this Chamber, as am I. It is nice to have a fresh excuse for giving a long speech. [Laughter]\nWhen Presidents speak to Congress and the Nation from this podium, typically they comment on the full range and challenges and opportunities that face the United States. But this is not an ordinary time, and for all the many tasks that require our attention, I believe tonig'

In [233]:
Clinton_speech_cleaned[0:500]

'February 17, 1993 Mr. President, Mr. Speaker, Members of the House and the Senate, distinguished Americans here as visitors in this Chamber, as am I. It is nice to have a fresh excuse for giving a long speech.  When Presidents speak to Congress and the Nation from this podium, typically they comment on the full range and challenges and opportunities that face the United States. But this is not an ordinary time, and for all the many tasks that require our attention, I believe tonight one calls on'

The cleaning looks good. Now that we have our documents cleaned up , let's tokenize the sentences into words and get the lemmas.

# Parsing

In [17]:
# Parse using SpaCy
nlp = spacy.load('en')

In [19]:
Clinton_doc = nlp(Clinton_speech_cleaned)

In [20]:
Truman_doc = nlp(Truman_speech_cleaned)

In [21]:
Johnson_doc = nlp(Johnson_speech_cleaned)

In [22]:
GWBush_doc = nlp(GWBush_speech_cleaned)

In [91]:
#Group into sentences
Clinton_sents = [ [sent,'Clinton']  for sent in Clinton_doc.sents]
Truman_sents =  [ [sent, 'Truman']  for sent in Truman_doc.sents]
Johnson_sents = [ [sent, 'Johnson'] for sent in Johnson_doc.sents]
GWBush_sents =  [ [sent,'GWBush']   for sent in GWBush_doc.sents]


sentences_df = pd.DataFrame(Clinton_sents + Truman_sents + 
                            Johnson_sents + GWBush_sents)
sentences_df.head()

Unnamed: 0,0,1
0,"(February, 17, ,, 1993, Mr., President, ,, Mr....",Clinton
1,"(It, is, nice, to, have, a, fresh, excuse, for...",Clinton
2,"(When, Presidents, speak, to, Congress, and, t...",Clinton
3,"(But, this, is, not, an, ordinary, time, ,, an...",Clinton
4,"(And, that, is, our, economy, .)",Clinton


In [92]:
sentences_df.shape

(9795, 2)

In [93]:
sentences_df.columns= ['sent','president']

In [163]:
sentences_df.president.value_counts(normalize=True) * 100

Clinton    32.363451
Truman     26.778969
GWBush     22.399183
Johnson    18.458397
Name: president, dtype: float64

# Supervised Feature Generation 
Bag of Words Features

In [164]:
# Create bag of words function for each text
def bag_of_words(text, most_common_count, person):
    
    # filter out punctuation and stop words
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop
               ]
    
    
    # Return most common words
    return [item[0] for item in Counter(allwords).most_common(most_common_count)]



As a first step, let's try taking only the 500 most common words from each president

In [165]:
# Get bags 
Clinton_words = bag_of_words(Clinton_doc, 500, 'Clinton')
Truman_words = bag_of_words(Truman_doc, 500, 'Truman')

Johnson_words = bag_of_words(Johnson_doc, 500, 'Johnson')
GWBush_words = bag_of_words(GWBush_doc, 500, 'GWBush')


# Combine bags to create common set of unique words
common_words = set(Clinton_words + Truman_words +
                   Johnson_words + GWBush_words )

In [166]:
print(len(common_words))

872


In [167]:
# Create bag of words data frame using combined common words and sentences
def bow_features(sentences, common_words):
    
    # Build data frame
    df = pd.DataFrame(columns=common_words)
    df['text_sentence'] = sentences['sent']
    df['text_source'] = sentences['president']
    df.loc[:, common_words] = 0
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentences in enumerate(df['text_sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentences
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
    
    return df

In [168]:
# Create bow features 
speech_df = bow_features(sentences_df, common_words)
speech_df.head()

Unnamed: 0,best,courage,across,welcome,secure,10,strong,care,marriage,2,...,task,person,current,water,deserve,gather,destruction,condition,text_sentence,text_source
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(February, 17, ,, 1993, Mr., President, ,, Mr....",Clinton
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(It, is, nice, to, have, a, fresh, excuse, for...",Clinton
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(When, Presidents, speak, to, Congress, and, t...",Clinton
3,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,"(But, this, is, not, an, ordinary, time, ,, an...",Clinton
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(And, that, is, our, economy, .)",Clinton


In [169]:
speech_df.shape

(9795, 874)

# Modelling on BOW 

In [101]:
def bow_split():
    X_bow = speech_df.drop(['text_sentence', 'text_source'], 1)
    Y_bow = speech_df['text_source']
    X_train_bow, X_test_bow,y_train_bow, y_test_bow= train_test_split(X_bow,Y_bow, test_size=0.2, random_state=0, stratify = Y_bow)
    print('Training Data size\n',X_train_bow.shape)
    print('Test data size\n',X_test_bow.shape)
    return(X_train_bow, X_test_bow,y_train_bow, y_test_bow)

In [252]:
def models_default_param(X_train, y_train,X_test,y_test):
    
    print('Logistic Regression')
    lr = LogisticRegression()
    lr_bow = lr.fit(X_train, y_train)
    lr_train_score =  cross_val_score(lr_bow, X_train, y_train, cv=5)
    lr_test_score = cross_val_score(lr_bow, X_test,y_test, cv=5)
    print('Score : \n ',lr_train_score)
    print('Training Data Avg Score:', np.mean(lr_train_score))
    print('Test Data Avg Score:', np.mean(lr_test_score))
    print()
    
    print('Random Forest Classifier')
    rfc = ensemble.RandomForestClassifier()
    rfc_bow = rfc.fit(X_train, y_train)
    rfc_train_score =  cross_val_score(rfc_bow, X_train, y_train, cv=5)
    rfc_test_score = cross_val_score(rfc_bow, X_test,y_test, cv=5)
    print('Score : \n ',rfc_train_score)
    print('Training Data Avg Score:', np.mean(rfc_train_score))
    print('Test Data Avg Score:', np.mean(rfc_test_score))
    print()
       

# Models with 500 most common words

In [103]:
X_train, X_test,y_train, y_test = bow_split()
print()
models_default_param(X_train, y_train,X_test,y_test)

Training Data size
 (7836, 872)
Test data size
 (1959, 872)

Logistic Regression
Score : 
  [ 0.63989802  0.66560306  0.65475431  0.63369496  0.69667944]
Training Data Avg Score: 0.658125958314
Test Data Avg Score: 0.590076259119

Random Forest Classifier
Score : 
  [ 0.53983429  0.54562859  0.55009572  0.53605616  0.53001277]
Training Data Avg Score: 0.540325506598
Test Data Avg Score: 0.479812115195



The scores are not bad. Let's try to improve the scores by taking more common words and then considering other aspects too.

# Improve BOW

In [170]:
# Get bags 

Clinton_words = bag_of_words(Clinton_doc, 1500, 'Clinton')
Truman_words = bag_of_words(Truman_doc, 1500, 'Truman')
Johnson_words = bag_of_words(Johnson_doc, 1500, 'Johnson')
GWBush_words = bag_of_words(GWBush_doc, 1500, 'GWBush')


# Combine bags to create common set of unique words
common_words = set(Clinton_words + Truman_words + 
                   Johnson_words + GWBush_words )

In [171]:

def bow_features(sentences, common_words):   
   
    
    df = pd.DataFrame(columns= set(list(common_words) ))

    df['text_sentence'] = sentences['sent']
    df['text_source'] = sentences['president']
    df.loc[:, common_words] = 0
    
    df['sent_length'] = 0
    
    df['prev_sent_length'] = 0
    df['next_sent_length'] = 0
    df['num_words_repeated_from_prior_sent'] = 0
    
    for i, sentence in enumerate(df['text_sentence']):
        
          #Check to see if each phrase turns up in the sentence (store as binary var for the time being)       
                
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        # Populate the row with word counts.
        #Also add # of repeated words from one sentence to the next
        repeats = 0
        for word in words:
            df.loc[i, word] += 1
            if i > 0: 
                if ((df.loc[i-1, word] > 0) & (df.loc[i, word] > 0)):
                    repeats += 1
            else: 
                repeats = 0
        df['num_words_repeated_from_prior_sent'][i] = repeats        

        sent_len = 0    
        num_punct = 0 
        
        for token in sentence:
        
            if not token.is_punct:
                sent_len += 1
            else:
                num_punct += 1
        df.loc[i, 'sent_length'] = sent_len
        df.loc[i, 'sent_punct_count'] = num_punct
        
        if i > 0:
            df.loc[i, 'prev_sent_length'] = df.loc[i-1, 'sent_length']
        else:
            df.loc[i, 'prev_sent_length'] = 0
                              
        # This counter is just to make sure the kernel didn't hang.
                      
        if i % 500 == 0:
            print("Processing row {}".format(i))
    #Back out of the loop through sentences and just shift the df by one to get the "next sent len" feature
    df['next_sent_length'] = df['sent_length'].shift(-1)         
                
    return df


In [172]:
# Create bow features 
speech_df = bow_features(sentences_df, common_words)
speech_df.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Processing row 0
Processing row 500
Processing row 1000
Processing row 1500
Processing row 2000
Processing row 2500
Processing row 3000
Processing row 3500
Processing row 4000
Processing row 4500
Processing row 5000
Processing row 5500
Processing row 6000
Processing row 6500
Processing row 7000
Processing row 7500
Processing row 8000
Processing row 8500
Processing row 9000
Processing row 9500


Unnamed: 0,best,welcome,define,pleased,settle,pentagon,iii,2,expenditure,flourishing,...,gather,catch,scarcity,text_sentence,text_source,sent_length,prev_sent_length,next_sent_length,num_words_repeated_from_prior_sent,sent_punct_count
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,"(February, 17, ,, 1993, Mr., President, ,, Mr....",Clinton,25,0,14.0,0,5.0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,"(It, is, nice, to, have, a, fresh, excuse, for...",Clinton,14,25,27.0,1,1.0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,"(When, Presidents, speak, to, Congress, and, t...",Clinton,27,14,31.0,2,2.0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,"(But, this, is, not, an, ordinary, time, ,, an...",Clinton,31,27,5.0,12,5.0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,"(And, that, is, our, economy, .)",Clinton,5,31,17.0,4,1.0


In [173]:
def entityty_types(df):
    
    person_ent_type = []
    qty_ent_type = []
    ordinal_ent_type = []
    time_ent_type = []
    org_ent_type = []
    lang_ent_type = []
    date_ent_type = []
    card_ent_type = []
    gpe_ent_type = []
    fac_ent_type = []
    for i, sentence in enumerate(df['text_sentence']):
        person_count = 0
        qty_count= 0
        ordinal_count = 0
        time_count = 0
        org_count = 0
        lang_count = 0
        date_count= 0
        cardinal_count =0 
        gpe_count= 0
        fac_count = 0
    
        for token in sentence:
            if token.ent_type_ == 'PERSON':
                person_count += 1
        
            if token.ent_type_ == 'QUANTITY':
                qty_count += 1
            
            if token.ent_type_ == 'ORDINAL':
                ordinal_count += 1
            
            if token.ent_type_ == 'TIME':
                time_count += 1
            
            if token.ent_type_ == 'ORG':
                org_count += 1
            
            if token.ent_type_ == 'LANGUAGE':
                lang_count += 1
            if token.ent_type_ == 'DATE':
                date_count += 1            
        
            if token.ent_type_ == 'CARDINAL':
                cardinal_count += 1            
            if token.ent_type_ == 'GPE':
                gpe_count += 1            
            if token.ent_type_ == 'FAC':
                fac_count += 1            
            
        person_ent_type.append(person_count)
        qty_ent_type.append(qty_count)
        ordinal_ent_type.append(ordinal_count)
        time_ent_type.append(time_count)
        org_ent_type.append(org_count)
        lang_ent_type.append(lang_count)
        date_ent_type.append(date_count)
        card_ent_type.append(cardinal_count)
        gpe_ent_type.append(gpe_count)
        fac_ent_type.append(fac_count)

          
    df['person_ent'] = person_ent_type
    df['qty_ent'] = qty_ent_type
    df['ordinal_ent'] = ordinal_ent_type
    df['time_ent'] = time_ent_type
    df['org_ent'] = org_ent_type
    df['lang_ent'] = lang_ent_type
    df['date_ent'] = date_ent_type
    df['card_ent'] = card_ent_type
    df['gpe_ent'] = gpe_ent_type
    df['fac_ent'] = fac_ent_type
    return(df)


def grammars(df):
    adv_count_list = []
    verb_count_list = []
    noun_count_list = []
    propnoun_count_list = []
    punc_count_list = []
    #-----------------
    part_cnt_list= []
    adj_cnt_list= []
    adp_cnt_list= []
    det_cnt_list= []

    for sentence in df['text_sentence']:
        
        advs_cnt = 0
        verb_cnt = 0
        noun_cnt = 0
        propnoun_cnt = 0
        punc_cnt = 0
    #-----------------
        part_cnt= 0
        adj_cnt= 0
        adp_cnt= 0
        det_cnt= 0
        
        for token in sentence:
            if token.pos_ == 'ADV':
                advs_cnt +=1
            if token.pos_ == 'VERB':
                verb_cnt +=1
            if token.pos_ == 'NOUN':
                noun_cnt +=1
            if token.pos_ == 'PROPN':
                propnoun_cnt +=1
            if token.pos_ == 'PUNCT':
                punc_cnt +=1
    #---------------------------------------
            if token.pos_ == 'PART':
                part_cnt +=1
            if token.pos_ == 'ADJ':
                adj_cnt +=1
            if token.pos_ == 'ADP':
                adp_cnt +=1
            if token.pos_ == 'DET':
                det_cnt +=1
        
        adv_count_list.append(advs_cnt)
        verb_count_list.append(verb_cnt)
        noun_count_list.append(noun_cnt)
        propnoun_count_list.append(propnoun_cnt)
        punc_count_list.append(punc_cnt)
        #----------------------------------
        part_cnt_list.append(part_cnt)
        adj_cnt_list.append(adj_cnt)
        adp_cnt_list.append(adp_cnt)
        det_cnt_list.append(det_cnt)
    #---------------------------------------    
        
    df['adv_count'] = adv_count_list
    df['verb_count'] = verb_count_list
    df['noun_count'] = noun_count_list
    df['pronoun_count'] = propnoun_count_list
    df['punc_count'] = punc_count_list

    #-----------------
    df['part_cnt'] = part_cnt_list
    df['adj_cnt'] = adj_cnt_list
    df['adp_cnt'] = adp_cnt_list
    df['det_cnt'] = det_cnt_list
    return(df)



In [174]:
speech_df = entityty_types(speech_df)
speech_df = grammars(speech_df)

In [175]:
speech_df.head()

Unnamed: 0,best,welcome,define,pleased,settle,pentagon,iii,2,expenditure,flourishing,...,fac_ent,adv_count,verb_count,noun_count,pronoun_count,punc_count,part_cnt,adj_cnt,adp_cnt,det_cnt
0,0,0,0,0,0,0,0,0,0,0,...,0,1,1,1,11,5,0,1,4,3
1,0,0,0,0,0,0,0,0,0,0,...,0,0,3,2,0,1,1,3,1,2
2,0,0,0,0,0,0,0,0,0,0,...,0,2,3,5,4,2,0,2,3,4
3,0,0,0,0,0,0,0,0,0,0,...,0,1,7,4,0,5,3,5,2,3
4,0,0,0,0,0,0,0,0,0,0,...,0,0,1,1,0,1,0,1,0,1


In [176]:
speech_df.fillna(0,inplace=True)

In [112]:
X_train, X_test,y_train, y_test = bow_split()
print()
models_default_param(X_train, y_train, X_test, y_test)


Training Data size
 (7836, 2810)
Test data size
 (1959, 2810)

Logistic Regression
Score : 
  [ 0.69279796  0.69432036  0.7019783   0.68793874  0.73563218]
Training Data Avg Score: 0.702533508138
Test Data Avg Score: 0.603906815577

Random Forest Classifier
Score : 
  [ 0.56214149  0.56987875  0.5807275   0.57753669  0.59131545]
Training Data Avg Score: 0.576319978618
Test Data Avg Score: 0.523217751552



After adding few more features, the logistic regression scores for train data got 5% improvement, and test data got 1% improvement. For Random forest classifier, the train scores and test score got 4% improvement.

Let's try hyperparameter tuning.

In [115]:
# Do we need multi_class='multinomial' ??scores have reduced when tried this !!!
X_train, X_test,y_train, y_test = bow_split()
lr = LogisticRegression(solver='newton-cg', multi_class='multinomial')
lr_bow = lr.fit(X_train, y_train)
lr_train_score =  cross_val_score(lr_bow, X_train, y_train, cv=5)
lr_test_score = cross_val_score(lr_bow, X_test,y_test, cv=5)
print('Score : \n ',lr_train_score)
print('Training Data Avg Score:', np.mean(lr_train_score))
print('Test Data Avg Score:', np.mean(lr_test_score))


Training Data size
 (7836, 2810)
Test data size
 (1959, 2810)
Score : 
  [ 0.67941364  0.69176771  0.69432036  0.68921506  0.72030651]
Training Data Avg Score: 0.695004655933
Test Data Avg Score: 0.598301021776


In [119]:
from sklearn.linear_model import LogisticRegressionCV
# Do we need multi_class='multinomial' ??scores have reduced when tried this !!!
X_train, X_test,y_train, y_test = bow_split()
lrcv = LogisticRegressionCV()
# solver='newton-cg', multi_class='multinomial'
lrcv_bow = lrcv.fit(X_train, y_train)
lrcv_train_score =  cross_val_score(lrcv_bow, X_train, y_train, cv=5)
lrcv_test_score = cross_val_score(lrcv_bow, X_test,y_test, cv=5)
print('Score : \n ',lrcv_train_score)
print('Training Data Avg Score:', np.mean(lrcv_train_score))
print('Test Data Avg Score:', np.mean(lrcv_test_score))

Training Data size
 (7836, 2810)
Test data size
 (1959, 2810)
Score : 
  [ 0.6736775   0.68155712  0.69049138  0.67262285  0.72349936]
Training Data Avg Score: 0.688369641909
Test Data Avg Score: 0.605442660787


In [120]:
lr_params =[ {'C': [0.01, 0.1, 1, 10],'solver':['liblinear'],'penalty':['l1', 'l2'],'fit_intercept':[True]} ]
lr = LogisticRegression()
gr_logr = GridSearchCV(lr,param_grid = lr_params )
gr_logr.fit(X_train, y_train)
print('Best Parameter ', gr_logr.best_params_)

Best Parameter  {'C': 1, 'fit_intercept': True, 'penalty': 'l2', 'solver': 'liblinear'}


In [121]:
rfc_params  = {
    'n_estimators':[100,500],
    'max_features':['auto', 'sqrt', 'log2'],
    'max_depth':[4, 6,7, None],
    'min_samples_split':[2, 8]
}
rfc = ensemble.RandomForestClassifier(random_state=10)
rfc_grid = GridSearchCV(rfc, param_grid=rfc_params)
rfc_grid.fit(X_train, y_train)


print('Best Parameter ', rfc_grid.best_params_)

Best Parameter  {'max_depth': None, 'max_features': 'log2', 'min_samples_split': 2, 'n_estimators': 500}


In [270]:
def models_best_param(X_train, y_train,X_test,y_test):
    
    print('Logistic Regression')
    lr = LogisticRegression(**gr_logr.best_params_)
    lr_bow = lr.fit(X_train, y_train)
    lr_train_score_gs =  cross_val_score(lr_bow, X_train, y_train, cv=5)
    lr_test_score_gs = cross_val_score(lr_bow, X_test,y_test, cv=5)
    print('Score : \n ',lr_train_score_gs)
    print('Training Data Avg Score:', np.mean(lr_train_score_gs))
    print('Test Data Avg Score:', np.mean(lr_test_score_gs))
    print()
    
    print('Random Forest Classifier')
    rfc = ensemble.RandomForestClassifier(**rfc_grid.best_params_)
    rfc_bow = rfc.fit(X_train, y_train)
    rfc_train_score_gs =  cross_val_score(rfc_bow, X_train, y_train, cv=5)
    rfc_test_score_gs = cross_val_score(rfc_bow, X_test,y_test, cv=5)
    print('Score : \n ',rfc_train_score_gs)
    print('Training Data Avg Score:', np.mean(rfc_train_score_gs))
    print('Test Data Avg Score:', np.mean(rfc_test_score_gs))
    print()
    

In [123]:
models_best_param(X_train, y_train,X_test,y_test)

Logistic Regression
Score : 
  [ 0.69279796  0.69432036  0.7019783   0.68793874  0.73563218]
Training Data Avg Score: 0.702533508138
Test Data Avg Score: 0.603906815577

Random Forest Classifier
Score : 
  [ 0.66794136  0.64964901  0.66560306  0.65092534  0.65581098]
Training Data Avg Score: 0.657985951277
Test Data Avg Score: 0.604874854041



After trying hyperparameter tuning we did not see any improvement in Logistic regression scores, but the random forest scores are better now with 8% improvement for both train and test data.

# Unsupervised Feature Generation

 In this section, we'll explore two techniques for generating unsupervised features:

•	Tf-idf

•	Latent Semantic Analysis



# TF IDF

In [257]:

Truman = []
Johnson = []
Clinton = []
GWBush = []
for i in state_union.fileids():
    if 'Truman' in i:
        truman_sents = state_union.sents(i)
        truman_sent = [ " ".join(sent) for sent in truman_sents]
        del truman_sent[0]
        Truman.append( truman_sent   )

    if 'Johnson' in i:
        Johnson_sents = state_union.sents(i)
        Johnson_sent = [ " ".join(sent) for sent in Johnson_sents]
        del Johnson_sent[0]
        Johnson.append( Johnson_sent   )
        
    if 'Clinton' in i:
        Clinton_sents = state_union.sents(i)
        Clinton_sent =   [ " ".join(sent) for sent in Clinton_sents]
        del Clinton_sent[0]
        Clinton.append(Clinton_sent    )
        
    if 'GWBush' in i:
        GWBush_sents = state_union.sents(i)
        GWBush_sent = [ " ".join(sent) for sent in GWBush_sents] 
        del GWBush_sent [0]
        GWBush.append(  GWBush_sent  )



In [258]:
Truman_sents_list = []
Johnson_sents_list = []
Clinton_sents_list = []
GWBush_sents_list = []
for sublist in Truman:
    for item in sublist:
        Truman_sents_list.append(item)  

for sublist in Johnson:
    for item in sublist:
        Johnson_sents_list.append(item)  


for sublist in Clinton:
    for item in sublist:
        Clinton_sents_list.append(item)  

for sublist in GWBush:
    for item in sublist:
        GWBush_sents_list.append(item)  
        

In [259]:
tfidf_sentences_df = pd.DataFrame(columns= ['sent', 'president'])
tfidf_sentences_df['sent'] = Truman_sents_list
tfidf_sentences_df['president'] = 'Truman'

Johnson_df = pd.DataFrame(columns= ['sent', 'president'])
Johnson_df['sent'] = Johnson_sents_list
Johnson_df['president'] = 'Johnson'

Clinton_df = pd.DataFrame(columns= ['sent', 'president'])
Clinton_df['sent'] = Clinton_sents_list
Clinton_df['president'] = 'Clinton'

GWBush_df = pd.DataFrame(columns= ['sent', 'president'])
GWBush_df['sent'] = GWBush_sents_list
GWBush_df['president'] = 'GWBush'

In [260]:
tfidf_sentences_df = pd.concat([tfidf_sentences_df, Johnson_df,
                                Clinton_df, GWBush_df ])

In [261]:
tfidf_sentences_df.head()

Unnamed: 0,sent,president
0,"April 16 , 1945",Truman
1,"Mr . Speaker , Mr . President , Members of the...",Truman
2,"Only yesterday , we laid to rest the mortal re...",Truman
3,"At a time like this , words are inadequate .",Truman
4,The most eloquent tribute would be a reverent ...,Truman


In [262]:
tfidf_sentences_df.shape

(9613, 2)

In [151]:
tfidf_sentences_df.president.value_counts()

Clinton    3114
Truman     2575
GWBush     2153
Johnson    1771
Name: president, dtype: int64

In [264]:
X_train, X_test, y_train, y_test = train_test_split(tfidf_sentences_df['sent'],tfidf_sentences_df['president'],
                                    stratify=tfidf_sentences_df['president'],
                                    test_size=0.2,
                                    random_state=42)

In [265]:
len(X_train)

7690

In [281]:
X_train

2961                              A lot is riding on it .
2745             I hope you will support that , as well .
23      Another picture would be full of blessings : a...
751                                 There is more to do .
258                                         ( Applause .)
1614    But we have to move ahead with courage and hon...
1460                               The need is pressing .
181     And now I understand why , having dealt with t...
383     We have seen it in the courage of passengers ,...
348     We froze domestic spending and used honest bud...
937     That ' s why we worked so hard to increase edu...
2365    The imperialism of the czars has been replaced...
2929    I think you ought to do it for two reasons : F...
635     And they are going to have those privileges of...
129     Later this year , we will offer a plan to end ...
1204    Now it is time for us to look also to the chal...
1181    Twenty - eight months have passed since Septem...
598     The Co

In [266]:
vectorizer = TfidfVectorizer(stop_words='english', #filter out stopwords
                             lowercase=True,       #convert all to lowercase
                             min_df=2,             #use words appearing at least twice per document
                             max_df=0.5,           #drop words that occur in more than half of the documents
                             use_idf=True,
                             smooth_idf=True,
                             norm='l2'
                             )

After applying TF IDF vectorizer, let's split our data into train test set and apply the supervised models.

In [267]:
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

In [268]:

models_default_param(X_train_tfidf, y_train, X_test_tfidf, y_test )

Logistic Regression
Score : 
  [ 0.67012987  0.66211826  0.6655823   0.66167859  0.64996747]
Training Data Avg Score: 0.661895299138
Test Data Avg Score: 0.586611732897

Random Forest Classifier
Score : 
  [ 0.57077922  0.56985055  0.57774886  0.57059206  0.5718933 ]
Training Data Avg Score: 0.572172799119
Test Data Avg Score: 0.521612607314



In [271]:
models_best_param(X_train_tfidf, y_train, X_test_tfidf, y_test )

Logistic Regression
Score : 
  [ 0.67012987  0.66211826  0.6655823   0.66167859  0.64996747]
Training Data Avg Score: 0.661895299138
Test Data Avg Score: 0.586611732897

Random Forest Classifier
Score : 
  [ 0.65454545  0.63807667  0.62654522  0.64150943  0.63435264]
Training Data Avg Score: 0.639005882926
Test Data Avg Score: 0.581930919643



The scores have not improved after the TF IDF approach. Let's try get new features from unsupervised clustering models and try again.

In [244]:
km = MiniBatchKMeans(n_clusters=4, init='k-means++', batch_size=5000)

km.fit(X_train_tfidf)
km_train_label = km.labels_
km_test_label = km.predict(X_test_tfidf)

X_train_new_km = pd.concat([pd.DataFrame(X_train_tfidf.toarray()), pd.DataFrame(km_train_label, columns = ['Cluster'])], axis = 1)
X_test_new_km = pd.concat([pd.DataFrame(X_test_tfidf.toarray()), pd.DataFrame(km_test_label, columns = ['Cluster'])], axis = 1)

In [245]:
models_best_param(X_train_new_km, y_train, X_test_new_km, y_test )

Logistic Regression
Score : 
  [ 0.67142857  0.65951917  0.66363045  0.66232921  0.64996747]
Training Data Avg Score: 0.661374974099
Test Data Avg Score: 0.5845337968

Random Forest Classifier
Score : 
  [ 0.65584416  0.63807667  0.63695511  0.62979831  0.63044893]
Training Data Avg Score: 0.638224634247
Test Data Avg Score: 0.585579521963



the scores remained almost the same. We can take our tf-idf vector matrix one step further and perform latent semantic analysis (LSA) on the words in an attempt to gain semantic information. LSA is performed through a dimensionality reduction technique called singluar value decompositin (SVD). SVD is applied to a tf-idf vector matrix and the resulting components represent clusters of words that presumably reflect topics within the corpus.

# Latent Semantic Analysis using Singular Value Decomposition

In [157]:
X_train_tfidf.shape

(7690, 4899)

In [158]:
svd = TruncatedSVD(2000)
lsa_pipe = make_pipeline(svd, Normalizer(copy=False))

# Fit with training data, transform test data
X_train_lsa = lsa_pipe.fit_transform(X_train_tfidf)
X_test_lsa = lsa_pipe.transform(X_test_tfidf)

# Examine variance captured in reduced feature space
variance_explained = svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print('Percent variance captured by components:', total_variance*100)

sent_by_component = pd.DataFrame(X_train_lsa, index = X_train)

# Look at values from first 5 components
for i in range(6):
    print('Component {}:'.format(i))
    print(sent_by_component.loc[:, i].sort_values(ascending=False)[:5])

Percent variance captured by components: 88.2883775118
Component 0:
sent
( Applause .)    0.999859
( Applause .)    0.999859
( Applause .)    0.999859
( Applause .)    0.999859
( Applause .)    0.999859
Name: 0, dtype: float64
Component 1:
sent
The people of this great country have a right to expect that the Congress and the President will work in closest cooperation with one objective - the welfare of the people of this Nation as a whole .                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      0.302470
So I ask you 

Looking at the first 5 sentences of the first 6 components, we get the following semantic information:


Component 0 - The word 'applause'. This is a corpus specific stop word that could be removed from the documents.

Component 1 - Longer statements containing positive sentiments about prosperity in America, world peace.

Component 2 - Statements referencing the word 'year'.

Component 3  and Component 4 - Statements around 'thanks'.

Component 5 - The phrase 'The American people '

In [273]:
 print(len(X_train_lsa), len(y_train), len(X_test_lsa),len(y_test) )

7690 7690 1923 1923


In [160]:
models_default_param(X_train_lsa , y_train, X_test_lsa, y_test )

Logistic Regression
Score : 
  [ 0.67337662  0.66601689  0.66428107  0.66297983  0.65517241]
Training Data Avg Score: 0.664365365822
Test Data Avg Score: 0.599602811335

Random Forest Classifier
Score : 
  [ 0.39480519  0.39246264  0.3955758   0.38646714  0.41314249]
Training Data Avg Score: 0.396490651807
Test Data Avg Score: 0.395213171154



In [278]:

lr_lsa_params =[ {'C': [0.01, 0.1, 1, 10],'solver':['liblinear'],'penalty':['l1', 'l2'],'fit_intercept':[True]} ]
lr_lsa = LogisticRegression()
gr_logr_lsa = GridSearchCV(lr_lsa,param_grid = lr_lsa_params )
gr_logr_lsa.fit(X_train_lsa, y_train)
print('Best Parameter ', gr_logr_lsa.best_params_)



Best Parameter  {'C': 1, 'fit_intercept': True, 'penalty': 'l2', 'solver': 'liblinear'}


In [279]:

rfc_lsa_params  = {
    'n_estimators':[100,500],
    'max_features':['auto', 'sqrt', 'log2'],
    'max_depth':[4, 6,7, None],
    'min_samples_split':[2, 8]
}
rfc_lsa = ensemble.RandomForestClassifier(random_state=10)
rfc_grid_lsa = GridSearchCV(rfc_lsa, param_grid=rfc_lsa_params)
rfc_grid_lsa.fit(X_train_lsa, y_train)


print('Best Parameter ', rfc_grid_lsa.best_params_)

Best Parameter  {'max_depth': None, 'max_features': 'auto', 'min_samples_split': 2, 'n_estimators': 500}


In [280]:
models_best_param(X_train_lsa , y_train, X_test_lsa, y_test )

Logistic Regression
Score : 
  [ 0.67207792  0.66796621  0.66753416  0.66232921  0.65322056]
Training Data Avg Score: 0.664625612727
Test Data Avg Score: 0.601166657194

Random Forest Classifier
Score : 
  [ 0.49090909  0.47823262  0.49577098  0.48275862  0.47755368]
Training Data Avg Score: 0.485044997722
Test Data Avg Score: 0.469577756561



The Random forest scores are decreased a lot. But the logistic regression got 2% improvement on test data.

# Comparing Supervised and Unsupervised Learning for NLP Applications

Based on the models above, it's clear, more work could be done to make a more accurate classifier. We started by removing corpus specific stop words and punctuations after a little bit of text cleaning. We explored how supervised and unsupervised methods could be applied to natural language processing applications.

Here's a summary of what we learned:

Unsupervised:

Tf-Idf can be used to detect unique identifier words
SVD can identify corpus specific stop words (i.e. 'applause'), as well as common phrases (i.e. 'The American people ')
SVD clusters the tf-idf matrix into general themes and sentiments present in the documents. Sometimes these clusters are meaningful and can shed light on common sentiments.
We didn't use similarity comparison here because it didn't make much sense for our document size. However, if the document size was larger, say the full text body for each address, running a similarity analysis could be quite informative. For example, we could compare the similarity of different presidents, or look at how similarity changes within a president's term.

Supervised:

The supervised methods is used for text classification purposes. Depending on the model used, additional insights could be gained by looking at feature importance.