# NLP Challenge -- Kristofer Schobert

For this challenge assignment, we will create a supervised NLP model that predicts which act a given paragraph of William Shakespeare's *Hamlet* is from. The play has five acts, each roughly 200 paragraphs long. We will use two methods of feature generation, "Bag of Words" and "tf-idf," in combination with three classifiers: a random forest classifier, logistic regression, and a gradient boosting classifier.

In [83]:
# importing packages

%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import re
import spacy
from collections import Counter
from sklearn.linear_model import LogisticRegression


In [314]:
# Import the data
from nltk.corpus import gutenberg, stopwords

# We will analyze Hamlet by William Shakespeare
hamlet = gutenberg.raw('shakespeare-hamlet.txt')

## Data cleaning / processing / language parsing

After sifting through this text, I have found that Act I and Act II are labeled, but the other three acts are not. We do know that each of them begins with a certain string of text, so we will loop through each paragraph and find which index corresponds to the first paragraph of each act.

In [6]:
# hamlet_paras is a nested list: a list of paragraphs, each of which is a list
# of sentences, each of which is a list of words.

hamlet_paras = gutenberg.paras('shakespeare-hamlet.txt')
for index, para in enumerate(hamlet_paras):
    joined_first_sent = ' '.join(para[0])
    #print(joined_first_sent)
    if ('Actus' in joined_first_sent) or \
    ('Enter King , Queene , Polonius , Ophelia , Rosincrance , Guildenstern , and Lords .' in joined_first_sent) or \
    ('Enter King .' in joined_first_sent) or \
    ('Enter two Clownes .' in joined_first_sent):
        print(index)
        print(joined_first_sent)


1
Actus Primus .
230
Actus Secundus .
392
Enter King , Queene , Polonius , Ophelia , Rosincrance , Guildenstern , and Lords .
608
Enter King .
629
Enter King .
751
Enter two Clownes .


We have two contenders for the start of the fourth act, since Act IV starts with the sentence "Enter King ." Let's look at the paragraph that follows each contender to determine which one is the start of Act IV.

In [7]:
print(hamlet_paras[608], hamlet_paras[609])
print(hamlet_paras[629], hamlet_paras[630])

[['Enter', 'King', '.']] [['King', '.'], ['There', "'", 's', 'matters', 'in', 'these', 'sighes', '.'], ['These', 'profound', 'heaues', 'You', 'must', 'translate', ';', 'Tis', 'fit', 'we', 'vnderstand', 'them', '.'], ['Where', 'is', 'your', 'Sonne', '?'], ['Qu', '.'], ['Ah', 'my', 'good', 'Lord', ',', 'what', 'haue', 'I', 'seene', 'to', 'night', '?'], ['King', '.'], ['What', 'Gertrude', '?'], ['How', 'do', "'", 's', 'Hamlet', '?'], ['Qu', '.'], ['Mad', 'as', 'the', 'Seas', ',', 'and', 'winde', ',', 'when', 'both', 'contend', 'Which', 'is', 'the', 'Mightier', ',', 'in', 'his', 'lawlesse', 'fit', 'Behinde', 'the', 'Arras', ',', 'hearing', 'something', 'stirre', ',', 'He', 'whips', 'his', 'Rapier', 'out', ',', 'and', 'cries', 'a', 'Rat', ',', 'a', 'Rat', ',', 'And', 'in', 'his', 'brainish', 'apprehension', 'killes', 'The', 'vnseene', 'good', 'old', 'man']]
[['Enter', 'King', '.']] [['King', '.'], ['I', 'haue', 'sent', 'to', 'seeke', 'him', ',', 'and', 'to', 'find', 'the', 'bodie', ':', 'Ho

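hamlet_paras[608] is the start of our fourth act: the paragraph that follows it is the King's exchange with the Queen beginning "There's matters in these sighes", which opens Act IV.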

In [8]:
# sectioning hamlet_paras into the five acts
# beginning with index 1 to drop the title of the play
hamlet_paras_1 = hamlet_paras[1:230]
hamlet_paras_2 = hamlet_paras[230:392]
hamlet_paras_3 = hamlet_paras[392:608]
hamlet_paras_4 = hamlet_paras[608:751]
hamlet_paras_5 = hamlet_paras[751:]
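
The slicing above can also be driven by a list of start indices, which makes the act boundaries easier to audit in one place. This is a minimal sketch and not part of the original notebook: `split_by_starts` is a hypothetical helper, and it is demonstrated on a toy list rather than on `hamlet_paras`.

```python
def split_by_starts(items, starts):
    """Split `items` into consecutive chunks beginning at each index in `starts`.

    The last chunk runs to the end of `items`.
    """
    bounds = list(starts) + [len(items)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(len(starts))]


# Toy stand-in for the paragraph list: 10 "paragraphs", three "acts".
toy_paras = list(range(10))
toy_acts = split_by_starts(toy_paras, [1, 4, 7])
print([len(act) for act in toy_acts])  # [3, 3, 3]
```

With the indices found above, the equivalent call would presumably be `split_by_starts(hamlet_paras, [1, 230, 392, 608, 751])`.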

In [322]:
# this function takes in hamlet_paras_n, a nested list for act n with the same structure as hamlet_paras,
# and returns a list of paragraphs where each paragraph is one continuous string.

def concat_sentences(text):
    
    #initializing a list of lists whose elements are the words of a given paragraph
    para_list = []
    
    #looping through each paragraph with nested sentences and words
    for para in text:
        para_combined = []
        
        #looping through each sentence with words as elements
        for sent in para:
            
            # creating a list with elements that are all the words of a paragraph.
            para_combined = para_combined + sent
         
        # creating a list of those lists with elements that are all the words of a paragraph.
        para_list.append(para_combined)  
    para_str = []   
    
    # for each list of words of a paragraph
    for para in para_list:
        # combine them into one string.
        para_str.append(' '.join(para))
    return para_str  
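
For comparison, the same flattening can be written as a single nested comprehension. This is only a sketch of an equivalent formulation; `concat_sentences_compact` is a hypothetical name, and the toy input mimics the paragraph/sentence/word nesting described above.

```python
def concat_sentences_compact(paras):
    """Flatten each paragraph (a list of sentences, each a list of words) into one string."""
    return [" ".join(word for sent in para for word in sent) for para in paras]


toy = [[["Enter", "King", "."]], [["King", "."], ["Where", "is", "your", "Sonne", "?"]]]
print(concat_sentences_compact(toy))
# ['Enter King .', 'King . Where is your Sonne ?']
```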

In [323]:
# Turning hamlet_paras_n into hamlet_n which is a list of paragraphs. 
# Each element in this list is a continuous string
hamlet_1 = concat_sentences(hamlet_paras_1)
hamlet_2 = concat_sentences(hamlet_paras_2)
hamlet_3 = concat_sentences(hamlet_paras_3)
hamlet_4 = concat_sentences(hamlet_paras_4)
hamlet_5 = concat_sentences(hamlet_paras_5)

In [12]:
# Parse the cleaned paragraphs. This can take a bit.
nlp = spacy.load('en')

# input: the list of paragraphs for act n
# output: the spaCy-analyzed paragraphs in a similar list

def nlping(hamlet_n):
    hamlet_n_doc = []
    for para in hamlet_n:
        hamlet_n_doc.append(nlp(para))
    return hamlet_n_doc
    
    
hamlet_1_doc = nlping(hamlet_1)
hamlet_2_doc = nlping(hamlet_2)
hamlet_3_doc = nlping(hamlet_3)
hamlet_4_doc = nlping(hamlet_4)
hamlet_5_doc = nlping(hamlet_5)


In [55]:
# Creating 2d lists to form the desired data frame.
hamlet_1_ = [[para, '1'] for para in hamlet_1_doc]
hamlet_2_ = [[para, '2'] for para in hamlet_2_doc]
hamlet_3_ = [[para, '3'] for para in hamlet_3_doc]
hamlet_4_ = [[para, '4'] for para in hamlet_4_doc]
hamlet_5_ = [[para, '5'] for para in hamlet_5_doc]


# Combine the paragraphs from the five acts into one data frame.
para_df = pd.DataFrame(hamlet_1_ + hamlet_2_ + hamlet_3_ + hamlet_4_ + hamlet_5_)
para_df.head()

Unnamed: 0,0,1
0,"(Actus, Primus, ., Scoena, Prima, .)",1
1,"(Enter, Barnardo, and, Francisco, two, Centine...",1
2,"(Barnardo, ., Who, ', s, there, ?, Fran, ., Na...",1
3,"(Bar, ., Long, liue, the, King)",1
4,"(Fran, ., Barnardo, ?, Bar, ., He)",1


## Bag of Words

In [37]:
# Utility function to create a list of the 500 most common words of each act
def bag_of_words(text):
    
    # Loop through each doc (which is each paragraph) and filter out punctuation and stop words.
    # we will collect the 500 most common words from each act.
    
    allwords = []
    for doc in text:
        for token in doc:
            if not token.is_punct and not token.is_stop:
                allwords.append(token.lemma_)
        
    
    # Return the most common words.
    return [item[0] for item in Counter(allwords).most_common(500)]



# Set up the bags.
act_1_words = bag_of_words(hamlet_1_doc)
act_2_words = bag_of_words(hamlet_2_doc)
act_3_words = bag_of_words(hamlet_3_doc)
act_4_words = bag_of_words(hamlet_4_doc)
act_5_words = bag_of_words(hamlet_5_doc)

# Combine bags to create a set of unique words.
common_words = set(act_1_words + act_2_words + act_3_words + act_4_words + act_5_words)

In [45]:
# Determining the length of our most common words list.
len(common_words)

1500

In [65]:
# creating a function that takes in our data frame para_df and the set of most common words;
# it outputs a dataframe containing word counts for the common words.
def bow_features(para_df, common_words):
    
    # Scaffold the data frame and initialize counts to zero.
    df = pd.DataFrame(columns=common_words)
    df['text_para'] = para_df[0]
    df['act_number'] = para_df[1]
    df.loc[:, common_words] = 0
    
    # Process each row, counting the occurrence of words in each paragraph.
    for i, para in enumerate(df['text_para']):
        
        # Convert the paragraphs to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in para
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
        
        # This counter is just to make sure the kernel didn't hang.
        if i % 50 == 0:
            print("Processing row {}".format(i))
            
    return df
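
Incrementing single cells with `df.loc[i, word] += 1` is slow for a frame of this width. One possible alternative, sketched here under the assumption that each paragraph has already been reduced to a list of lemma strings, is to count with `collections.Counter` and build the rows as plain dictionaries. `count_common` is a hypothetical helper, not the notebook's method.

```python
from collections import Counter


def count_common(paragraph_lemmas, vocabulary):
    """Return one {word: count} dict per paragraph, restricted to `vocabulary`."""
    vocab = set(vocabulary)
    rows = []
    for lemmas in paragraph_lemmas:
        counts = Counter(lemma for lemma in lemmas if lemma in vocab)
        # Fill in explicit zeros so every row has the same keys.
        rows.append({word: counts.get(word, 0) for word in sorted(vocab)})
    return rows


toy_lemmas = [["king", "ghost", "king"], ["queen", "sword"]]
toy_rows = count_common(toy_lemmas, {"king", "queen", "ghost"})
print(toy_rows)
# [{'ghost': 1, 'king': 2, 'queen': 0}, {'ghost': 0, 'king': 0, 'queen': 1}]
```

A list of dictionaries like `toy_rows` can be passed directly to `pd.DataFrame` to obtain the count matrix in one step.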

In [66]:
word_counts = bow_features(para_df, common_words)
word_counts.head()

Processing row 0
Processing row 50
Processing row 100
Processing row 150
Processing row 200
Processing row 250
Processing row 300
Processing row 350
Processing row 400
Processing row 450
Processing row 500
Processing row 550
Processing row 600
Processing row 650
Processing row 700
Processing row 750
Processing row 800
Processing row 850
Processing row 900


Unnamed: 0,sore,assume,begge,doe,greeue,art,offer,Euen,sulleye,Prince,...,Actors,deliuer,idle,action,Apparition,follow,Mes,looke,text_para,act_number
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Actus, Primus, ., Scoena, Prima, .)",1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Enter, Barnardo, and, Francisco, two, Centine...",1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Barnardo, ., Who, ', s, there, ?, Fran, ., Na...",1
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Bar, ., Long, liue, the, King)",1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Fran, ., Barnardo, ?, Bar, ., He)",1


We now have the dataframe that we need. Let's use the three supervised learning classifiers to see which performs best with cross validation. 

## Supervised Models with BoW as features

In [82]:
from sklearn import ensemble
from sklearn.model_selection import cross_val_score

rfc = ensemble.RandomForestClassifier(n_estimators=100)
Y = word_counts['act_number']
X = np.array(word_counts.drop(['text_para','act_number'], 1))

cvs = cross_val_score(rfc, X, Y, cv=5)
print('The Cross Validation Mean is {} +/_ {}'.format(round(np.mean(cvs),5), round(np.std(cvs)/np.sqrt(len(cvs)),5)))


The Cross Validation Mean is 0.40784 +/_ 0.01319
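
The figure after the mean is a standard error: the standard deviation of the fold scores divided by the square root of the number of folds. Since the same expression is repeated for every classifier, it could be wrapped in a small helper. The sketch below uses only the standard library; `summarize_cv` is a hypothetical name and the fold scores are made up for illustration.

```python
import math
import statistics


def summarize_cv(scores):
    """Return (mean, standard error) of a sequence of fold scores.

    Uses the population standard deviation, matching NumPy's default
    np.std(scores) with ddof=0, divided by sqrt(number of folds).
    """
    mean = statistics.fmean(scores)
    sem = statistics.pstdev(scores) / math.sqrt(len(scores))
    return mean, sem


# Hypothetical fold scores, for illustration only (not the notebook's folds).
toy_scores = [0.40, 0.42, 0.38, 0.44, 0.36]
mean, sem = summarize_cv(toy_scores)
print("The Cross Validation Mean is {:.5f} +/- {:.5f}".format(mean, sem))
```

Note that a sample-based standard error would use `ddof=1` instead, which gives a slightly larger value with only five folds.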


In [86]:
from sklearn import ensemble
from sklearn.model_selection import cross_val_score

# using updated default parameters (this gets rid of the warnings announcing the changing default parameters)
# I'm also fine with using the updated defaults for a first attempt
lr = LogisticRegression(penalty='l2', solver='lbfgs', multi_class='auto')
Y = word_counts['act_number']
X = np.array(word_counts.drop(['text_para','act_number'], 1))

cvs = cross_val_score(lr, X, Y, cv=5)
print('The Cross Validation Mean is {} +/_ {}'.format(round(np.mean(cvs),5), round(np.std(cvs)/np.sqrt(len(cvs)),5)))


The Cross Validation Mean is 0.45112 +/_ 0.01022


In [64]:
word_counts['Actors'].unique()

array([0, 1], dtype=object)

In [88]:
from sklearn import ensemble
from sklearn.model_selection import cross_val_score

gbc = ensemble.GradientBoostingClassifier()
Y = word_counts['act_number']
X = np.array(word_counts.drop(['text_para','act_number'], 1))

cvs = cross_val_score(gbc, X, Y, cv=5)
print('The Cross Validation Mean is {} +/_ {}'.format(round(np.mean(cvs),5), round(np.std(cvs)/np.sqrt(len(cvs)),5)))

The Cross Validation Mean is 0.43353 +/_ 0.02297


We have found that logistic regression works best for our bag of words features. Next, let's try tf-idf. I suspect some words are common to every act, so the idf weighting should help reduce the influence of those common words on our model.
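
To make the idf intuition concrete, here is a small sketch of the smoothed inverse document frequency, idf = ln((1 + n) / (1 + df)) + 1, which is the form scikit-learn documents for `smooth_idf=True`. The helper name `smoothed_idf` is hypothetical; the five-document setting assumes each act is treated as one document.

```python
import math


def smoothed_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


# Five documents, as when each act is treated as one document.
for doc_freq in range(1, 6):
    print(doc_freq, round(smoothed_idf(5, doc_freq), 3))
```

The weight shrinks as a term appears in more acts, bottoming out at 1.0 for a term that appears in all five.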

## tf-idf as Features of Supervised Model

In [142]:
# Here we add a column to our simpler dataframe para_df, which will then contain three columns.
# Each element of the new column is one long string holding every word of a paragraph,
# obtained by calling str() on the spaCy doc.
para_str = []
for index, row in enumerate(para_df[0]):
    para_str.append(str(row))
    
para_df['para_str'] = para_str    

In [162]:
# renaming the other two columns of the dataframe to be more telling of the elements of the column
para_df = para_df.rename(index=str, columns={0: "text_list", 1: "act_number"})

In [186]:
para_df.tail()

Unnamed: 0,text_list,act_number,para_str
944,"(Amb, ., The, sight, is, dismall, ,, And, our,...",5,"Amb . The sight is dismall , And our affaires ..."
945,"(For, ., Let, vs, hast, to, heare, it, ,, And,...",5,"For . Let vs hast to heare it , And call the N..."
946,"(For, ., Let, foure, Captaines, Beare, Hamlet,...",5,For . Let foure Captaines Beare Hamlet like a ...
947,"(Exeunt, ., Marching, :, after, the, which, ,,...",5,"Exeunt . Marching : after the which , a Peale ..."
948,"(FINIS, ., The, tragedie, of, HAMLET, ,, Princ...",5,"FINIS . The tragedie of HAMLET , Prince of Den..."


In [171]:
# Joining the paragraphs of each act with a space. Thus, hamlet_n_str is the nth act of hamlet expressed
# as one long continuous string
hamlet_1_str = ' '.join(hamlet_1)
hamlet_2_str = ' '.join(hamlet_2)
hamlet_3_str = ' '.join(hamlet_3)
hamlet_4_str = ' '.join(hamlet_4)
hamlet_5_str = ' '.join(hamlet_5)

# creating a list of those 5 strings that are one act long.
hamlet_list_acts_as_str = [hamlet_1_str, hamlet_2_str, hamlet_3_str, hamlet_4_str, hamlet_5_str]

In [326]:
hamlet_1_str



In [328]:
from sklearn.feature_extraction.text import TfidfVectorizer

# setting the parameters of our tf-idf vectorizer
vectorizer = TfidfVectorizer(max_df=0.5, # drop words that occur in more than half the acts
                             min_df=2, # only use words that appear in at least two of the acts
                             stop_words='english', 
                             lowercase=True, #convert everything to lower case 
                             use_idf=True,#we definitely want to use inverse document frequencies in our weighting
                             norm=u'l2', #Applies a correction factor so that longer paragraphs and shorter paragraphs get treated equally
                             smooth_idf=True #Adds 1 to all document frequencies, as if an extra document existed that used every word once.  Prevents divide-by-zero errors
                            )

# fitting the vectorizer to the 5 acts of hamlet:
# we are choosing our features (terms) based on what would be best for determining which act a word is from.
# what would be "best" is determined by our parameter choices of TfidfVectorizer.
vectorizer.fit(hamlet_list_acts_as_str)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.5, max_features=None, min_df=2,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [329]:
# what are the names of our features and how many do we have?
print(vectorizer.get_feature_names())
print(len(vectorizer.get_feature_names()))

['aboord', 'absurd', 'accent', 'accident', 'acte', 'actor', 'acts', 'actus', 'admiration', 'admit', 'adue', 'aduice', 'affaire', 'affection', 'affliction', 'ah', 'ake', 'alacke', 'allowance', 'aloofe', 'alwayes', 'ambassadors', 'ambition', 'angell', 'anticke', 'ape', 'appear', 'appeare', 'apprehension', 'apt', 'argument', 'arm', 'armour', 'arrant', 'arrowes', 'asking', 'assay', 'asse', 'assistant', 'assur', 'assurance', 'auoyd', 'author', 'awake', 'awe', 'axe', 'baby', 'bak', 'baser', 'beasts', 'begger', 'begins', 'beguile', 'bell', 'bend', 'bene', 'birth', 'blast', 'blesse', 'bloody', 'bloud', 'blow', 'blowes', 'bodie', 'bore', 'boy', 'braue', 'breathing', 'breed', 'breeding', 'briefe', 'bruite', 'bulke', 'buriall', 'buried', 'burne', 'burning', 'burst', 'byrlady', 'caesar', 'cal', 'cals', 'canker', 'cannon', 'canst', 'capitall', 'catch', 'caught', 'cease', 'celestiall', 'censure', 'chance', 'character', 'charitable', 'chast', 'cheefe', 'cheere', 'cheerefully', 'childe', 'choyce', 'ch

In [330]:
# use those features to determine the tf-idf elements of our array where each row represents a paragraph
X = vectorizer.transform(para_df['para_str'])

In [331]:
# determining the shape of the vectorizer's output
print(X.shape)

(949, 669)


In [339]:
# creating a pandas data frame of our vectorizer's output
# we will add a column containing the act the paragraph is from and a column giving the paragraph
df_tf_idf = pd.DataFrame(X.toarray())
df_tf_idf['act_number'] = list(para_df['act_number'])
df_tf_idf.columns = vectorizer.get_feature_names() + ['act_number']
df_tf_idf['text_para'] = list(para_df['para_str'])

In [340]:
df_tf_idf.head()

Unnamed: 0,aboord,absurd,accent,accident,acte,actor,acts,actus,admiration,admit,...,wonderfull,wondrous,wonted,wormes,worse,wouldest,wretched,yeare,act_number,text_para
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,Actus Primus . Scoena Prima .
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,Enter Barnardo and Francisco two Centinels .
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,Barnardo . Who ' s there ? Fran . Nay answer m...
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,Bar . Long liue the King
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,Fran . Barnardo ? Bar . He


## Applying the three classifiers to determine which is best

In [341]:
from sklearn import ensemble
from sklearn.model_selection import cross_val_score

rfc = ensemble.RandomForestClassifier(n_estimators=100)
Y = df_tf_idf['act_number']
X = np.array(df_tf_idf.drop(['text_para','act_number'], 1))

cvs = cross_val_score(rfc, X, Y, cv=5)
print('The Cross Validation Mean is {} +/_ {}'.format(round(np.mean(cvs),5), round(np.std(cvs)/np.sqrt(len(cvs)),5)))


The Cross Validation Mean is 0.26979 +/_ 0.0062


In [342]:
from sklearn import ensemble
from sklearn.model_selection import cross_val_score

# using updated default parameters (this gets rid of the warnings announcing the changing default parameters)
# I'm also fine with using the updated defaults for a first attempt
lr = LogisticRegression(penalty='l2', solver='lbfgs', multi_class='auto')
Y = df_tf_idf['act_number']
X = np.array(df_tf_idf.drop(['text_para','act_number'], 1))

cvs = cross_val_score(lr, X, Y, cv=5)
print('The Cross Validation Mean is {} +/_ {}'.format(round(np.mean(cvs),5), round(np.std(cvs)/np.sqrt(len(cvs)),5)))


The Cross Validation Mean is 0.27192 +/_ 0.00624


In [343]:
from sklearn import ensemble
from sklearn.model_selection import cross_val_score

gbc = ensemble.GradientBoostingClassifier()
Y = df_tf_idf['act_number']
X = np.array(df_tf_idf.drop(['text_para','act_number'], 1))

cvs = cross_val_score(gbc, X, Y, cv=5)
print('The Cross Validation Mean is {} +/_ {}'.format(round(np.mean(cvs),5), round(np.std(cvs)/np.sqrt(len(cvs)),5)))

The Cross Validation Mean is 0.25596 +/_ 0.0154


The classifier with the highest cross validation score is again logistic regression. However, our score is much lower than it was for our bag of words features. Let's try to improve our model by changing the TfidfVectorizer's parameters.

## Improving tf-idf model

This round, our features will be terms that are found in two, three, or four of the acts. We will disregard terms that appear in just one or in all five of the acts. Note that last round we ultimately only considered terms that appeared in exactly two of the acts: with five documents, max_df=0.5 caps the document frequency at 2.5, while min_df=2 sets the floor at two. That was likely not a good choice.
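
The document-frequency window implied by a pair of `min_df` / `max_df` settings can be checked with a few lines of plain Python. This sketch assumes the convention that integer thresholds are absolute document counts and float thresholds are proportions of the number of documents; `allowed_doc_freqs` is a hypothetical helper, not a scikit-learn function.

```python
import numbers


def allowed_doc_freqs(n_docs, min_df, max_df):
    """List the document frequencies a term may have to survive the cutoffs.

    Integer thresholds are absolute counts; float thresholds are
    proportions of `n_docs`.
    """
    low = min_df if isinstance(min_df, numbers.Integral) else min_df * n_docs
    high = max_df if isinstance(max_df, numbers.Integral) else max_df * n_docs
    return [d for d in range(1, n_docs + 1) if low <= d <= high]


print(allowed_doc_freqs(5, min_df=2, max_df=0.5))  # [2]
print(allowed_doc_freqs(5, min_df=2, max_df=4))    # [2, 3, 4]
```

The first call corresponds to the previous round's settings and the second to the settings used in this round.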

In [346]:
from sklearn.feature_extraction.text import TfidfVectorizer

# setting the parameters of our tf-idf vectorizer
vectorizer = TfidfVectorizer(max_df=4, # drop words that occur in all five acts
                             min_df=2, # only use words that appear in at least two of the acts
                             stop_words='english', 
                             lowercase=True, #convert everything to lower case 
                             use_idf=True,#we definitely want to use inverse document frequencies in our weighting
                             norm=u'l2', #Applies a correction factor so that longer paragraphs and shorter paragraphs get treated equally
                             smooth_idf=True #Adds 1 to all document frequencies, as if an extra document existed that used every word once.  Prevents divide-by-zero errors
                            )

# fitting the vectorizer to the 5 acts of hamlet:
# we are choosing our features (terms) based on what would be best for determining which act a word is from.
# what would be "best" is determined by our parameter choices of TfidfVectorizer.
vectorizer.fit(hamlet_list_acts_as_str)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=4, max_features=None, min_df=2,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [347]:
# use those features to determine the tf-idf elements of our array where each row represents a paragraph
X = vectorizer.transform(para_df['para_str'])

In [348]:
# creating a pandas data frame of our vectorizer's output
# we will add a column containing the act the paragraph is from and a column giving the paragraph
df_tf_idf = pd.DataFrame(X.toarray())
df_tf_idf['act_number'] = list(para_df['act_number'])
df_tf_idf.columns = vectorizer.get_feature_names() + ['act_number']
df_tf_idf['text_para'] = list(para_df['para_str'])

In [349]:
df_tf_idf.head()

Unnamed: 0,aboord,aboue,absurd,accent,accident,act,acte,action,actor,acts,...,wretch,wretched,wrong,yea,yeare,yong,young,youth,act_number,text_para
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,Actus Primus . Scoena Prima .
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,Enter Barnardo and Francisco two Centinels .
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,Barnardo . Who ' s there ? Fran . Nay answer m...
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,Bar . Long liue the King
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,Fran . Barnardo ? Bar . He


## Applying the three classifiers to determine which is best

In [350]:
from sklearn import ensemble
from sklearn.model_selection import cross_val_score

rfc = ensemble.RandomForestClassifier(n_estimators=100)
Y = df_tf_idf['act_number']
X = np.array(df_tf_idf.drop(['text_para','act_number'], 1))

cvs = cross_val_score(rfc, X, Y, cv=5)
print('The Cross Validation Mean is {} +/_ {}'.format(round(np.mean(cvs),5), round(np.std(cvs)/np.sqrt(len(cvs)),5)))


The Cross Validation Mean is 0.39563 +/_ 0.02425


In [351]:
from sklearn import ensemble
from sklearn.model_selection import cross_val_score

# using updated default parameters (this gets rid of the warnings announcing the changing default parameters)
# I'm also fine with using the updated defaults for a first attempt
lr = LogisticRegression(penalty='l2', solver='lbfgs', multi_class='auto')
Y = df_tf_idf['act_number']
X = np.array(df_tf_idf.drop(['text_para','act_number'], 1))

cvs = cross_val_score(lr, X, Y, cv=5)
print('The Cross Validation Mean is {} +/_ {}'.format(round(np.mean(cvs),5), round(np.std(cvs)/np.sqrt(len(cvs)),5)))


The Cross Validation Mean is 0.37239 +/_ 0.02255


In [352]:
from sklearn import ensemble
from sklearn.model_selection import cross_val_score

gbc = ensemble.GradientBoostingClassifier()
Y = df_tf_idf['act_number']
X = np.array(df_tf_idf.drop(['text_para','act_number'], 1))

cvs = cross_val_score(gbc, X, Y, cv=5)
print('The Cross Validation Mean is {} +/_ {}'.format(round(np.mean(cvs),5), round(np.std(cvs)/np.sqrt(len(cvs)),5)))

The Cross Validation Mean is 0.33356 +/_ 0.02858


### Improvement Results

Great. This improved our cross validation scores significantly, although they are still poor. Our best tf-idf classifier is now the random forest classifier. We have improved our best tf-idf score by about 12 percentage points (from 0.272 to 0.396).
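
The arithmetic behind that comparison, along with a simple reference point, can be written out as follows. The two scores are the best tf-idf means reported above; the majority-class baseline is an added assumption for context, taken from the slice boundaries (Act I spans paragraph indices 1 through 229) and the 949 rows of the feature matrix.

```python
# Best tf-idf cross validation means reported in the two rounds above.
first_round_best = 0.27192   # logistic regression, max_df=0.5
second_round_best = 0.39563  # random forest, max_df=4

# Absolute gain in percentage points versus relative gain in percent.
gain_points = (second_round_best - first_round_best) * 100
relative_gain = (second_round_best / first_round_best - 1) * 100

# Majority-class baseline: always guessing Act I, which holds
# 229 of the 949 paragraphs.
majority_baseline = 229 / 949

print(round(gain_points, 2))        # 12.37
print(round(relative_gain, 1))      # 45.5
print(round(majority_baseline, 3))  # 0.241
```

Against a baseline of roughly 24%, both the tf-idf and the bag of words models do better than always guessing the largest act, but not by a wide margin.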

# Conclusion and Future Thoughts

Our best model is a logistic regression classifier that acts on our bag of words features. Our cross validation score for that winning model is 45.11% +/- 1.02%. It is proving difficult to determine which act a paragraph of *Hamlet* is from. I was hoping that the story arc would make this task achievable. An act with a dark, depressing theme would likely have words that reflect that mood, while a cheerful act may have words that reflect that tone.

While the five acts only have roughly 200 paragraphs each to work with, there is likely a lot of room for improvement here. One could use another form of feature selection. Perhaps employing Latent Semantic Analysis would be helpful. We have such a large list of features that we could likely benefit from feature reduction. Perhaps including punctuation would have been helpful; maybe some acts have more '!' than others. We could have used parts of speech, paragraph length, or maybe dialog length per paragraph (this is a play after all). We could see how many individuals speak per paragraph. There are tons of ideas to try here. One of them could lead us to a cross validation score significantly greater than what we have found. I am hopeful that that is the case.
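
As a starting point for the punctuation and paragraph-length ideas, a few surface features could be computed directly from the paragraph strings. This is only a sketch: `surface_features` is a hypothetical helper, and it assumes tokens are separated by spaces, as they are in the `para_str` column.

```python
def surface_features(paragraph):
    """Count a few punctuation marks and the token length of a paragraph string.

    Assumes tokens are separated by whitespace, as in the `para_str` column.
    """
    tokens = paragraph.split()
    return {
        "n_tokens": len(tokens),
        "n_exclaim": tokens.count("!"),
        "n_question": tokens.count("?"),
    }


print(surface_features("Barnardo . Who ' s there ?"))
# {'n_tokens': 7, 'n_exclaim': 0, 'n_question': 1}
```

Columns like these could be appended to either feature matrix before cross validation to see whether they carry any signal about the act.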

In [311]:
vect = TfidfVectorizer()

xxx = vect.fit_transform(['hi how are you ?', 'bye how is you ?'])

In [312]:
dfxxx = pd.DataFrame(xxx.toarray())

In [313]:
dfxxx.columns = vect.get_feature_names()

In [270]:
dfxxx

Unnamed: 0,are,bye,hi,how,is,you
0,0.576152,0.0,0.576152,0.409937,0.0,0.409937
1,0.0,0.576152,0.0,0.409937,0.576152,0.409937


In [271]:
yyy = vect.transform(['good day you', 'are you hi ?', 'are you not still hi ?'])

In [272]:
dfyyy = pd.DataFrame(yyy.toarray())
dfyyy.columns = vect.get_feature_names()

In [273]:
dfyyy

Unnamed: 0,are,bye,hi,how,is,you
0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.631667,0.0,0.631667,0.0,0.0,0.449436
2,0.631667,0.0,0.631667,0.0,0.0,0.449436


In [274]:
np.sqrt(0.631667**2 + 0.631667**2 + 0.44943**2)

0.9999968613340744