In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import featuretools as ft


Read in the data from the CSV file

In [13]:
shake = pd.read_csv("./data/raw/Shakespeare_data.csv")
shake.head()

   Dataline      Play  PlayerLinenumber ActSceneLine  \
0         1  Henry IV               NaN          NaN   
1         2  Henry IV               NaN          NaN   
2         3  Henry IV               NaN          NaN   
3         4  Henry IV               1.0        1.1.1   
4         5  Henry IV               1.0        1.1.2   

  Player PlayerLine  
0    NaN      ACT I  
1              

Drop the Dataline column, which only duplicates the row index, and replace missing Player values with 'No Character'

In [3]:
df_shake = shake.drop(columns="Dataline")
# Assign the result back; replace(..., inplace=True) on a column selection may not update df_shake.
df_shake['Player'] = df_shake['Player'].fillna('No Character')

Count the distinct values in each column for every player

In [4]:
df_PlayerNumber = df_shake.groupby('Player').nunique()

Count how many lines each player speaks. This uses the raw shake frame, so lines with no player (NaN) are excluded.

In [5]:
val = shake['Player'].value_counts()
val

GLOUCESTER          1920
HAMLET              1582
IAGO                1161
FALSTAFF            1117
KING HENRY V        1086
                    ... 
Second Neighbour       1
Second Patrician       1
ARMADO                 1
WORCESTER              1
First Messenger        1
Name: Player, Length: 934, dtype: int64

In [6]:
df = val.rename_axis('Player').reset_index(name='Number of Appearances')
df

,Player,Number of Appearances
0,GLOUCESTER,1920
1,HAMLET,1582
2,IAGO,1161
3,FALSTAFF,1117
4,KING HENRY V,1086
5,BRUTUS,1051
6,OTHELLO,928
7,MARK ANTONY,927
8,KING HENRY VI,917
9,DUKE VINCENTIO,909


In [7]:
play_grouping = df_shake.groupby(['Play','Player' ]).count()
play_grouping

Play,Player,PlayerLinenumber,ActSceneLine,PlayerLine
A Comedy of Errors,ADRIANA,284,276,284
A Comedy of Errors,AEGEON,150,147,150
A Comedy of Errors,AEMELIA,75,73,75
A Comedy of Errors,ANGELO,99,96,99
A Comedy of Errors,ANTIPHOLUS,6,6,6
A Comedy of Errors,BALTHAZAR,31,31,31
A Comedy of Errors,Courtezan,43,40,43
A Comedy of Errors,DROMIO OF EPHESUS,191,187,191
A Comedy of Errors,DROMIO OF SYRACUSE,323,314,323
A Comedy of Errors,DUKE SOLINUS,97,93,97


Imports for both the decision tree and the random forest implementations

In [8]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing
from sklearn.model_selection import train_test_split 
from sklearn import metrics 

Before running the classifiers, the input has to be in a form they accept. scikit-learn's tree-based estimators only take numeric values, so every string column is converted to integer codes with the label encoder.

In [9]:
# LabelEncoder maps each distinct string to an integer code.
# Refitting the same encoder overwrites the mapping learned for the previous column.
le = preprocessing.LabelEncoder()
df_shake['Player'] = le.fit_transform(df_shake['Player'])
df_shake['Play'] = le.fit_transform(df_shake['Play'])

# ActSceneLine contains NaN; casting to str turns it into the category 'nan'.
df_shake['ActSceneLine'] = df_shake['ActSceneLine'].astype(str)
df_shake['ActSceneLine'] = le.fit_transform(df_shake['ActSceneLine'])
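
As a rough illustration of what the label encoder does, here is a minimal sketch on a toy list. The list `players` and the name `player_encoder` are made up for this example and are not part of the notebook. It also shows why a dedicated encoder per column can be handy: the codes can be mapped back to names.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels standing in for the 'Player' column.
players = ["HAMLET", "IAGO", "HAMLET", "No Character", "FALSTAFF"]

# A dedicated encoder for this column keeps its mapping available later.
player_encoder = LabelEncoder()
codes = player_encoder.fit_transform(players)

# classes_ holds the distinct labels in sorted order; each code indexes into it.
print(list(player_encoder.classes_))
print(codes.tolist())

# inverse_transform turns integer codes (e.g. model predictions) back into names.
decoded = player_encoder.inverse_transform(codes)
print(list(decoded))
```

The codes are arbitrary identifiers; their numeric order carries no meaning about the players themselves.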

Select the features X and the target y, then split them into training and test sets. Play and ActSceneLine are the features used to predict Player, and 20% of the rows are held out for testing.



In [10]:
X = df_shake[['Play', 'ActSceneLine']]
y = df_shake['Player']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
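
To make the split concrete, here is a small sketch on made-up arrays (`X_demo` and `y_demo` are assumptions for illustration only). It checks the sizes produced by `test_size=0.2` and that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 features, binary target.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)
print(X_tr.shape, X_te.shape)

# The same random_state gives the same rows in the test set on every call.
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)
print(np.array_equal(X_te, X_te_again))
```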

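The decision tree is fit on the training set and then predicts the player for each row of the test set. The accuracy comes out at around 60%, and it can vary slightly between runs because the classifier is created without a fixed random_state. A decision tree is quick to train, but an unconstrained tree tends to overfit the training data.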

In [11]:
decision_tree = DecisionTreeClassifier()
clf = decision_tree.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.6038599640933573
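
One way to check for the overfitting mentioned above is to compare accuracy on the training set with accuracy on the test set. The sketch below does this on synthetic data from `make_classification`, not on the Shakespeare data, and the depth limit of 3 is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption); flip_y adds some label noise.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# Compare an unconstrained tree with a depth-limited one.
scores = {}
for name, depth in (("unlimited", None), ("max_depth=3", 3)):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    scores[name] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"{name}: train={scores[name][0]:.3f} test={scores[name][1]:.3f}")
```

A large gap between the training and test scores is the usual sign that the tree has memorized the training rows rather than learned a general rule.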


Like the decision tree, the random forest classifier is fit on the training set and then predicts the player for each row of the test set. Its accuracy is also around 60%, essentially the same as the single tree. A random forest is usually more robust than a single tree, but it does not scale as well: the more trees it contains, the slower training and prediction become, and with over 900 distinct players to predict our dataset already shows growing pains.

In [12]:
my_model = RandomForestClassifier()
my_model.fit(X_train, y_train)
y_pred = my_model.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))



Accuracy: 0.6043087971274685
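
The cost of adding trees can be measured directly. The sketch below times the fit for a few forest sizes on synthetic data; the data, the sizes 5, 20 and 80, and the name `fit_seconds` are all assumptions for illustration, and the timings on the Shakespeare data will be different.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption), kept small so the loop runs quickly.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=5, random_state=0
)

# Time the fit for forests of increasing size.
fit_seconds = {}
for n_trees in (5, 20, 80):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    start = time.perf_counter()
    forest.fit(X_demo, y_demo)
    fit_seconds[n_trees] = time.perf_counter() - start
    print(f"{n_trees:3d} trees: {fit_seconds[n_trees]:.3f} s")
```

If training time becomes a problem, `RandomForestClassifier` accepts `n_jobs=-1` to build the trees in parallel on all available cores.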


All in all, we took the Shakespeare plays dataset, engineered a few features from it, and built two classifiers that give a baseline for explaining the data. If I had to add one more model, it would analyze the text of PlayerLine itself. I expect that adding the text could raise the accuracy considerably, possibly above 90%, although this has not been tested.
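
As a starting point for that idea, here is a minimal sketch of a text classifier. It is not something the notebook ran: the toy lines, the speaker names PLAYER_A and PLAYER_B, and the choice of TF-IDF features with a naive Bayes model are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy lines standing in for the 'PlayerLine' column.
lines = [
    "the king wears the crown",
    "the crown weighs upon the king",
    "a king must guard his throne",
    "bring me a cup of ale",
    "the tavern has good ale",
    "more ale for the tavern",
]
# Hypothetical speakers standing in for the 'Player' column.
speakers = ["PLAYER_A", "PLAYER_A", "PLAYER_A", "PLAYER_B", "PLAYER_B", "PLAYER_B"]

# TF-IDF turns each line into word weights; naive Bayes classifies on them.
text_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
text_model.fit(lines, speakers)

prediction = text_model.predict(["the king and his crown"])
print(prediction[0])
```

On the real data the same pipeline would be fit on the PlayerLine column of the training rows and scored on the held-out rows, just like the two tree models.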