# To be or not to be
## General Information
    Author: Patrick McNamee
    Date: September 28 2019
## Description
This notebook takes in all the works of Shakespeare from a CSV file and then builds a couple of classifiers to estimate the player from a given line.

## Data Exploration
First, let's import the data and see what it looks like.

In [1]:
import numpy as np
import pandas as pd

shakespeare = pd.read_csv('./data/Shakespeare_data.csv')
shakespeare.head()

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
0,1,Henry IV,,,,ACT I
1,2,Henry IV,,,,SCENE I. London. The palace.
2,3,Henry IV,,,,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ..."
3,4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
4,5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"


There appear to be quite a few NaN values in the Player column on expository rows, such as act headings and stage directions, which we do not necessarily care about.

In [2]:
shakespeare = shakespeare[shakespeare['Player'].notnull()]
print(shakespeare.head())
print(shakespeare.tail())

   Dataline      Play  PlayerLinenumber ActSceneLine         Player  \
3         4  Henry IV               1.0        1.1.1  KING HENRY IV   
4         5  Henry IV               1.0        1.1.2  KING HENRY IV   
5         6  Henry IV               1.0        1.1.3  KING HENRY IV   
6         7  Henry IV               1.0        1.1.4  KING HENRY IV   
7         8  Henry IV               1.0        1.1.5  KING HENRY IV   

                                       PlayerLine  
3          So shaken as we are, so wan with care,  
4      Find we a time for frighted peace to pant,  
5  And breathe short-winded accents of new broils  
6         To be commenced in strands afar remote.  
7       No more the thirsty entrance of this soil  
        Dataline            Play  PlayerLinenumber ActSceneLine   Player  \
111391    111392  A Winters Tale              38.0      5.3.180  LEONTES   
111392    111393  A Winters Tale              38.0      5.3.181  LEONTES   
111393    111394  A Winters Tale 

There are still some NaN values, caused by players exiting and entering. They appear in ActSceneLine, so we will filter out these rows as well.

In [3]:
shakespeare = shakespeare[shakespeare['ActSceneLine'].notnull()]

#reset index
shakespeare.reset_index(drop=True, inplace=True)

## Feature Engineering
With the data now cleaned, we will start adding features. The features that we will add are:

1. Number of words in a line
2. Play Act
3. Play Scene
4. Play Line
5. Continuation Line (Bool)

In [5]:
#Generate the columns with vectorized string operations; a row-by-row
#.loc loop is very slow on a table this size
shakespeare['NumWords'] = shakespeare['PlayerLine'].str.split(' ').str.len()

#ActSceneLine has the form "act.scene.line"
asl = shakespeare['ActSceneLine'].str.split('.', expand=True).astype(int)
shakespeare['Act'] = asl[0]
shakespeare['Scene'] = asl[1]
shakespeare['Line'] = asl[2]

#A line is a continuation if the previous row has the same player
shakespeare['Continuation'] = shakespeare['Player'] == shakespeare['Player'].shift()

shakespeare.to_csv(r'./data/shakespeare_modded.csv')
shakespeare.head()

## Model building
Let's start building a couple of models for analysis. The first is just a random guesser, to give a baseline performance.

In [12]:
import pickle                                        # Saving models
from sklearn.model_selection import train_test_split # Train/Test split
from sklearn import metrics                          # Metrics

shakespeare = pd.read_csv('./data/shakespeare_modded.csv') #Updated csv

X = shakespeare[['Play', 'Act', 'Scene', 'Line', 'PlayerLinenumber', 'NumWords']]
y = shakespeare['Player']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

### Random Guesser
The idea is simple: we build a list of the unique players and a corresponding list of their relative frequencies. The guesser then draws a random number in [0, 1) and uses the running total of the relative frequencies to pick the player it outputs.
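The sampling step can be sketched on its own with NumPy. This is a minimal illustration, not the notebook's implementation: the function name `sample_by_frequency` and the toy players and frequencies are made up for the example.

```python
import numpy as np

def sample_by_frequency(labels, rel_freq, rng):
    """Draw one label with probability proportional to rel_freq."""
    cumulative = np.cumsum(rel_freq)   # running totals, ending near 1.0
    r = rng.random()                   # uniform draw in [0, 1)
    # First position whose running total exceeds r
    idx = np.searchsorted(cumulative, r, side='right')
    # Clamp in case round-off leaves the last total slightly below 1.0
    return labels[min(idx, len(labels) - 1)]

# Hypothetical toy data, not taken from the Shakespeare file
toy_players = np.array(['A', 'B', 'C'])
toy_freq = np.array([0.5, 0.3, 0.2])

rng = np.random.default_rng(0)
draws = [sample_by_frequency(toy_players, toy_freq, rng) for _ in range(10_000)]
values, counts = np.unique(draws, return_counts=True)
shares = dict(zip(values, counts / len(draws)))
print(shares)
```

With enough draws, the observed shares should land close to the input frequencies, which is a quick sanity check for any guesser of this kind.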

In [7]:
player_list = shakespeare.Player.unique()
player_count = np.zeros((player_list.shape[0],))

#getting the player counts
for i, player in enumerate(player_list):
    player_count[i] = (shakespeare['Player'] == player).sum()
print('Player counts:')
for player, count in zip(player_list[:10], player_count[:10]):
    print(player, count)

#normalizing
total = sum(player_count)
player_count = player_count/total
print('\nNormalized Player Counts')
for player, count in zip(player_list[:10], player_count[:10]):
    print(player, count)


Player counts:
KING HENRY IV 341.0
WESTMORELAND 79.0
FALSTAFF 1053.0
PRINCE HENRY 584.0
POINS 80.0
EARL OF WORCESTER 188.0
NORTHUMBERLAND 198.0
HOTSPUR 562.0
SIR WALTER BLUNT 41.0
First Carrier 20.0

Normalized Player Counts
KING HENRY IV 0.003242924528301887
WESTMORELAND 0.0007512933657942788
FALSTAFF 0.010014074863055387
PRINCE HENRY 0.005553864881314668
POINS 0.0007608034083992696
EARL OF WORCESTER 0.0017878880097382836
NORTHUMBERLAND 0.0018829884357881924
HOTSPUR 0.0053446439440048695
SIR WALTER BLUNT 0.0003899117468046257
First Carrier 0.0001902008520998174


In [27]:
import random

class randomGuesser:
    def __init__(self, player, rel_count):
        # Store the pairs as a list: a bare zip object is exhausted after one pass
        self.player_and_count = list(zip(player, rel_count))

    def guess(self):
        r = random.random()
        # random.shuffle works in place and returns None, so shuffle a copy
        player_shuffled = list(self.player_and_count)
        random.shuffle(player_shuffled)
        total = 0
        for player, count in player_shuffled:
            total = total + count
            if r <= total:
                return player
        # Guard against round-off leaving the running total just below r
        return player_shuffled[-1][0]

#Testing
print('Testing random guesser:')
random_player = randomGuesser(player_list, player_count)
y_pred = []
for i, actual in enumerate(y_test):
    y_pred.append(random_player.guess())
    if i < 10:
        # Print the true player from y_test next to the guess
        print("Player:\t", actual, "Guessed:\t", y_pred[-1])
print("Random Guesser Accuracy:", metrics.accuracy_score(y_test, y_pred))
with open("./models/random_guesser.p", 'wb') as f:
    pickle.dump(random_player, f)
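For comparison, scikit-learn's `DummyClassifier` with `strategy='stratified'` also guesses labels at random according to the class frequencies seen in training, so it can serve as a cross-check on a hand-written guesser. The sketch below uses made-up toy labels rather than the Shakespeare data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn import metrics

# Hypothetical toy labels standing in for the Player column
y_toy = np.array(['FALSTAFF'] * 60 + ['HOTSPUR'] * 30 + ['POINS'] * 10)
X_toy = np.zeros((len(y_toy), 1))   # features are ignored by the dummy model

baseline = DummyClassifier(strategy='stratified', random_state=1)
baseline.fit(X_toy, y_toy)
y_guess = baseline.predict(X_toy)

accuracy = metrics.accuracy_score(y_toy, y_guess)
print("Stratified baseline accuracy:", accuracy)
```

When the guessed and true labels follow the same distribution, the expected accuracy of such a guesser is the sum of the squared class frequencies (0.6² + 0.3² + 0.1² = 0.46 for these toy labels). Any single run will fluctuate around that value.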

### Decision Tree
One of the simplest models for this type of classification is a decision tree.

In [None]:
#import necessary libraries
from sklearn.tree import DecisionTreeClassifier      # Decision Tree Classifier
from sklearn.model_selection import train_test_split # Train/Test split
from sklearn import metrics                          # Metrics

#Input Variables
X = shakespeare[['Play', 'Act', 'Scene', 'Line', 'PlayerLinenumber', 'NumWords']].copy()
X['Play'] = X['Play'].astype('category').cat.codes   # trees need numeric features
y = shakespeare['Player']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

clf = DecisionTreeClassifier()

# Train Decision Tree Classifier
clf = clf.fit(X_train,y_train)
#Predict the response for test dataset
y_pred = clf.predict(X_test)
#Evaluate model
print("Decision Tree Classifier Accuracy:",metrics.accuracy_score(y_test, y_pred))
with open("./models/decision_tree.p", 'wb') as f:
    pickle.dump(clf, f)
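scikit-learn's tree models expect numeric features, so a string column such as Play has to be encoded before fitting. A minimal sketch of one option, pandas category codes, on made-up toy rows (the column values here are illustrative, not taken from the data file):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy rows, not taken from the Shakespeare file
toy = pd.DataFrame({
    'Play':   ['Henry IV', 'Henry IV', 'Hamlet', 'Hamlet'],
    'Act':    [1, 1, 3, 3],
    'Line':   [1, 2, 1, 2],
    'Player': ['KING HENRY IV', 'FALSTAFF', 'HAMLET', 'OPHELIA'],
})

X_toy = toy[['Play', 'Act', 'Line']].copy()
# Map each play title to an integer code (codes follow the sorted titles)
X_toy['Play'] = X_toy['Play'].astype('category').cat.codes

tree = DecisionTreeClassifier(random_state=1).fit(X_toy, toy['Player'])
print(X_toy['Play'].tolist())
print(tree.predict(X_toy.iloc[[2]]))
```

Integer codes impose an arbitrary ordering on the plays. A tree can still split on them, but one-hot encoding is a common alternative when that ordering is a concern.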

### Random Forest
Can you see the forest for the trees? If one tree was good then surely more is better.

In [None]:
#Player is a class label, so use the classifier rather than the regressor
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators = 100, random_state = 1) #Pretty arbitrary number chosen
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)
#Evaluate model
print("Random Forest Classifier Accuracy:",metrics.accuracy_score(y_test, y_pred))
#Save the forest itself, not the earlier decision tree
with open("./models/decision_forest.p", 'wb') as f:
    pickle.dump(forest, f)
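A saved model can be loaded back with `pickle.load` and used for prediction straight away. The sketch below shows the round trip on a small made-up forest written to a temporary directory; the file name and toy data are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data standing in for X_train / y_train
X_toy = [[1, 1], [1, 2], [3, 1], [3, 2]]
y_toy = ['FALSTAFF', 'FALSTAFF', 'HAMLET', 'HAMLET']
forest_toy = RandomForestClassifier(n_estimators=10, random_state=1).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'forest_toy.p'   # hypothetical file name
    with open(path, 'wb') as f:
        pickle.dump(forest_toy, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions
same = bool((restored.predict(X_toy) == forest_toy.predict(X_toy)).all())
print(same)
```

Pickled scikit-learn models are generally only safe to load with the same library version that wrote them, and only from sources you trust.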