In [None]:
import numpy as np
import pandas as pd

sps = pd.read_csv("../data/Shakespeare_data.csv")
print(sps.head(5))

The Shakespeare data is now loaded into a data frame, but it still needs to be cleaned. We want to remove the extraneous, non-dialogue lines, which we can do by dropping every row that contains a NaN. These rows occur primarily during transitions and are not relevant to our model. NaN values appear rarely enough elsewhere that dropping those rows should not skew the data.

In [None]:
cleaned = sps.dropna()
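
To see what this step does, here is a minimal sketch on a toy data frame. The rows and text values are made up for illustration and only mimic the column names used in this notebook; they are not taken from the real CSV. Counting the missing values per column before dropping is a cheap way to confirm that only the rows we expect are removed.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Shakespeare CSV (invented rows, same column names).
toy = pd.DataFrame({
    "Play": ["Henry IV", "Henry IV", "Henry IV", "Henry IV"],
    "PlayerLinenumber": [np.nan, 1.0, 1.0, np.nan],
    "ActSceneLine": [np.nan, "1.1.1", "1.1.2", np.nan],
    "Player": [np.nan, "KING HENRY IV", "KING HENRY IV", "KING HENRY IV"],
    "PlayerLine": ["(act heading)", "(spoken line)", "(spoken line)", "(stage direction)"],
})

# How many values are missing in each column?
missing_per_column = toy.isna().sum()

# dropna() with default arguments removes every row that has at least one NaN.
toy_cleaned = toy.dropna()
rows_removed = len(toy) - len(toy_cleaned)

print(missing_per_column)
print(f"Removed {rows_removed} of {len(toy)} rows")
```

Running the same two checks on `sps` and `cleaned` would show how much of the real data set is discarded.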

We can derive additional value from the data set by looking at who speaks the most lines in each play, since line count could serve as a proxy for pay scale in an actual production of a Shakespeare play. For instance, we compare King Henry IV to Prince Henry below.

In [None]:
import matplotlib.pyplot as plt

plt.rcdefaults()
fig, ax = plt.subplots()

# Restrict the data to a single play, then count the rows (lines) for each player.
isHenryIV = cleaned['Play'] == 'Henry IV'
HenryIVPlay = cleaned[isHenryIV]
isKingHenry = HenryIVPlay['Player'] == "KING HENRY IV"
kingHenryLines = len(HenryIVPlay[isKingHenry])
isPrinceHenry = HenryIVPlay['Player'] == "PRINCE HENRY"
princeHenryLines = len(HenryIVPlay[isPrinceHenry])

# Bar lengths and labels for the horizontal bar chart.
nums = (kingHenryLines, princeHenryLines)
people = ('King Henry', 'Prince Henry')
y_pos = np.arange(len(people))

ax.barh(y_pos, nums, align='center')
ax.set_yticks(y_pos)
ax.set_yticklabels(people)
ax.invert_yaxis()
ax.set_xlabel('Lines')
ax.set_title('Lines Per Actor')

plt.show()

The chart above shows just two of the players and the number of lines each has in Henry IV. This analysis could be extended to any number of players. The specific feature engineering here is breaking the dataset down by play.
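
As a rough sketch of that extension, the per-player counts for a whole play can be computed in one step with `value_counts` instead of one boolean mask per player. The toy frame and the helper name `lines_per_player` are assumptions made for illustration; on the real data the same function would be called with `cleaned`.

```python
import pandas as pd

# Hypothetical toy data with the same 'Play' and 'Player' columns as the notebook.
toy = pd.DataFrame({
    "Play": ["Henry IV"] * 5 + ["Hamlet"] * 2,
    "Player": ["KING HENRY IV", "PRINCE HENRY", "PRINCE HENRY",
               "FALSTAFF", "PRINCE HENRY", "HAMLET", "HAMLET"],
})

def lines_per_player(df, play):
    """Count rows (lines) per player within one play, most lines first."""
    return df.loc[df["Play"] == play, "Player"].value_counts()

counts = lines_per_player(toy, "Henry IV")
print(counts)
```

The resulting Series can be plotted directly, for example with `counts.plot.barh()`, to get the same kind of chart for every player in the play.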

Next we will further prepare the dataset to be input to the classification model. We will use one-hot encoding to incorporate the string columns into the model. One-hot encoding gives each category its own 0/1 column, which ensures that the different values are not "mixed up" with one another or treated as if they had a numeric order. We also drop the PlayerLine column here: the text of the lines takes such a wide variety of values that one-hot encoding it would not be useful.
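
For readers unfamiliar with one-hot encoding, here is a small sketch of what `pd.get_dummies` produces. The three-row frame is invented for illustration; the `prefix_sep="__"` argument matches the one used in this notebook.

```python
import pandas as pd

# Hypothetical three-row frame with two categorical columns and one numeric column.
toy = pd.DataFrame({
    "Play": ["Henry IV", "Henry IV", "Hamlet"],
    "Player": ["KING HENRY IV", "PRINCE HENRY", "HAMLET"],
    "PlayerLinenumber": [1, 2, 1],
})

# Each distinct value of 'Play' and 'Player' becomes its own indicator column,
# named <original column>__<value>. The numeric column is left unchanged.
encoded = pd.get_dummies(toy, prefix_sep="__", columns=["Play", "Player"])

print(list(encoded.columns))
print(encoded)
```

Every row has exactly one active `Play__` column and exactly one active `Player__` column, so no ordering between plays or players is implied.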

In [None]:
# One-hot encode the categorical columns; "__" separates the source column from the category.
cat_columns = ["Play", "Player"]
df_processed = pd.get_dummies(cleaned, prefix_sep="__", columns=cat_columns)

# Names of the dummy columns created above.
cat_dummies = [col for col in df_processed
               if "__" in col
               and col.split("__")[0] in cat_columns]

processed_columns = list(df_processed.columns[:])

# PlayerLine is free text and is not used as a feature.
df_processed.drop(columns=['PlayerLine'], inplace=True)
data = df_processed
df_processed.head(5)

In [None]:
# Here we separate the X and Y data.
# Features: PlayerLinenumber plus every column whose name does not start with
# 'Player' (this includes the one-hot Play__ columns).
Xtemp = data['PlayerLinenumber']
A = data.loc[:, ~data.columns.str.startswith('Player')]
X = pd.merge(Xtemp, A, left_index=True, right_index=True)
# ActSceneLine is not used as a feature.
X = X.drop('ActSceneLine', axis=1)
# Target: the one-hot Player__ columns. PlayerLinenumber also starts with
# 'Player', so it is removed from Y explicitly.
Y = data.loc[:, data.columns.str.startswith('Player')]
Y = Y.drop(['PlayerLinenumber'], axis=1)


Next, we will use a classification model to predict the player, using the other columns as features. The first step is to split the data into training and testing sets, with 20% held out for testing.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn import tree

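# Hold out 20% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=1, test_size=.2)

# Fit a decision tree on the training rows only.
model = tree.DecisionTreeClassifier()
model.fit(X_train, Y_train)

# Predict the one-hot player columns for the held-out rows.
Y_predict = model.predict(X_test)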

In [None]:
from sklearn.metrics import accuracy_score

print(accuracy_score(Y_test, Y_predict))

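As the output directly above shows, the accuracy of our model was roughly 72%. Because Y is one-hot encoded, a prediction only counts as correct when the entire row of player columns matches. This is strong performance, which makes me think the decision tree may be overfitting, although the score is measured on held-out test data, so confirming that would require comparing it with the accuracy on the training set.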
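
One common way to check for overfitting is to compare accuracy on the training set with accuracy on the test set, and to see how a deliberately constrained tree behaves. The sketch below does this on synthetic data from `make_classification`, not on the Shakespeare data; the helper `train_test_gap` and the choice of `max_depth=3` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Shakespeare features).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

def train_test_gap(model):
    """Fit the model and return (training accuracy, test accuracy)."""
    model.fit(X_tr, y_tr)
    return model.score(X_tr, y_tr), model.score(X_te, y_te)

# An unconstrained tree can memorize the training rows.
full_train, full_test = train_test_gap(DecisionTreeClassifier(random_state=1))

# Limiting the depth restricts how closely the tree can fit the training rows.
shallow_train, shallow_test = train_test_gap(
    DecisionTreeClassifier(max_depth=3, random_state=1)
)

print(f"full tree:    train={full_train:.3f}  test={full_test:.3f}")
print(f"max_depth=3:  train={shallow_train:.3f}  test={shallow_test:.3f}")
```

A large gap between training and test accuracy for the unconstrained tree would point to overfitting. In this notebook the equivalent check would be `model.score(X_train, Y_train)` next to the test accuracy printed above.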