# **Project 2:** Classifying players from William Shakespeare's Drama Lines

**Author:** Ishrak Hayet<br>**Date:** 09/21/2020

In this project, we work with a dataset containing the lines from William Shakespeare's plays. Each line is labeled with the player who speaks it. Our goal is to train a Natural Language Processing model with engineered features to identify the player from a given line.

The data analysis and classification workflow will be as follows:

1) Creating a wordcloud for each player to get a visual description of word frequencies for each player

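2) Engineering the following features from the dataset
    a) A binary encoding of the "Play" column
    b) Separate act, scene and line numbers from the "ActSceneLine" column
    c) A sentence vector for each "PlayerLine"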

3) Exploratory data analysis using the engineered features

4) Splitting the dataset into training, validation and testing set

5) Training the following classifiers using the engineered features: SVM and RandomForest

6) Model evaluation

---

## **Step 1:** Importing the required packages and loading saved models

In [37]:
#Basic packages
import numpy as np
import pandas as pd
import math
import category_encoders as ce
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

import matplotlib.pyplot as plt
%matplotlib inline

#Wordcloud packages
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator

#NLP packages
import re
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

#Model storage
import pickle

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ISHRA\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ISHRA\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ISHRA\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


#### Loading the saved Shakespeare_Word2Vec.sav model using pickle

In [2]:
filename = '../models/Shakespeare_Word2Vec.sav'
# Use a context manager so the file handle is closed after loading
with open(filename, 'rb') as f:
    Shakespeare_Word2Vec_Model = pickle.load(f)

---

## **Step 2:** Reading the dataset

In [3]:
data = pd.read_csv("../data/raw/Shakespeare_data.csv")
data.head()

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
0,1,Henry IV,,,,ACT I
1,2,Henry IV,,,,SCENE I. London. The palace.
2,3,Henry IV,,,,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ..."
3,4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
4,5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"


A description of the columns of the dataset is as follows:

**Dataline**: Sequence of numbers to identify an index

**Play**: Name of the play from which we have the current play line

**PlayerLinenumber**: The number of a speech by a player. A sequence of contiguous lines spoken by the same player shares the same number, so this value works like a paragraph number for every instance of dialog by a player.

**ActSceneLine**: This is a dot separated value consisting of three sub-values. The first sub-value represents the act number. The second sub-value represents the scene number. The third sub-value represents the line number.

**Player**: Name of the player who is saying the current line

**PlayerLine**: The line that is being spoken by the current player
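
To make the ActSceneLine format concrete, here is a minimal sketch of how a single value such as `1.1.2` breaks down into its three parts. The helper `parse_act_scene_line` and the `ActSceneLine` tuple are hypothetical names used only for illustration, and the sketch assumes missing values arrive as `None` or an empty string.

```python
from typing import NamedTuple, Optional


class ActSceneLine(NamedTuple):
    act: int
    scene: int
    line: int


def parse_act_scene_line(value: Optional[str]) -> Optional[ActSceneLine]:
    """Split a dot-separated 'act.scene.line' string into three integers.

    Returns None for missing values (assumed here to be None or '').
    """
    if not value:
        return None
    act, scene, line = (int(part) for part in value.split("."))
    return ActSceneLine(act, scene, line)


parsed = parse_act_scene_line("1.1.2")
print(parsed.act, parsed.scene, parsed.line)  # 1 1 2
```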

---

## **Step 3:** Preprocessing

Let's take a look at the datatype of each column.

In [4]:
data.dtypes

Dataline              int64
Play                 object
PlayerLinenumber    float64
ActSceneLine         object
Player               object
PlayerLine           object
dtype: object

#### Removing the Dataline column
The Dataline column is a monotonically increasing sequence used to index the dataset. Since we will not be using this index, we can drop the column.

In [5]:
if 'Dataline' in data.columns:
    data.drop(['Dataline'], axis=1, inplace=True)

#### Removing tuples with missing values of "ActSceneLine" and "Player"

After taking a look at the dataset, we can be certain that the tuples having NaN values for "ActSceneLine" and "Player" contain lines that are not spoken by a player. These lines will not be useful since our goal is to identify players from spoken lines. So, we remove these tuples from the dataset.

In [6]:
data.dropna(axis=0, subset=['Player', 'ActSceneLine'], inplace=True)

#### Preprocessing PlayerLine

Now, we will convert the PlayerLine words to lowercase for the convenience of analysis.

In [7]:
data['PlayerLine'] = data['PlayerLine'].str.lower()

We will now remove the punctuation marks from the PlayerLine column since these punctuation marks will not be useful for the analysis.

In [8]:
data['PlayerLine'] = data['PlayerLine'].apply(lambda line: re.sub(r'[^A-Za-z0-9 ]', '', line))

At this point, we will tokenize the PlayerLines by words.

In [9]:
data['PlayerLine'] = data['PlayerLine'].apply(lambda line: word_tokenize(line))

Then, we will remove the stop words from the PlayerLine column, since stop words are common to all players and do not help to tell them apart.

In [10]:
# A set gives constant-time membership checks
stop_words = set(stopwords.words('english'))

data['PlayerLine'] = data['PlayerLine'].apply(lambda line: [w for w in line if w not in stop_words])

Finally, we will use lemmatization to find the root of each word, so that we create as much common lexical ground as possible regardless of sentence structure. We use lemmatization instead of stemming because it derives more accurate roots (lemmas) of words by performing a complete morphological analysis [1]. We will use the WordNet [2] lemmatizer from nltk to perform the lemmatization.

In [11]:
# Create the lemmatizer once instead of once per word
lemmatizer = WordNetLemmatizer()
data['PlayerLine'] = data['PlayerLine'].apply(lambda line: [lemmatizer.lemmatize(w) for w in line])
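
The preprocessing steps above can be sketched end to end on a single line. This is a simplified illustration, not the notebook's exact pipeline: it assumes whitespace splitting as a stand-in for `word_tokenize`, uses a tiny hypothetical stop word set (`TOY_STOP_WORDS`) instead of nltk's English list, and skips lemmatization.

```python
import re

# Hypothetical, tiny stop word set for illustration only;
# the notebook uses nltk's full English list.
TOY_STOP_WORDS = {"so", "as", "we", "are", "with", "a", "for", "to"}


def preprocess_line(line, stop_words=TOY_STOP_WORDS):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    lowered = line.lower()
    cleaned = re.sub(r"[^a-z0-9 ]", "", lowered)
    tokens = cleaned.split()  # stand-in for nltk.word_tokenize
    return [w for w in tokens if w not in stop_words]


print(preprocess_line("So shaken as we are, so wan with care,"))
# ['shaken', 'wan', 'care']
```

Note that because punctuation is removed before the stop word filter runs, contractions lose their apostrophes and may no longer match entries in a stop word list that contains them.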

## **Step 4:** Feature Engineering

#### Binary Encoding the Play column

The "Play" column is a categorical variable whose values are strings. Most classifiers cannot use raw strings directly, so we encode the "Play" column into binary values. We use a binary encoder instead of a one-hot encoder to reduce the number of columns. One-hot encoding produces as many columns as there are unique values in a column. As a result, the classifier can suffer from the curse of dimensionality: the more features we have, the more data we need. A binary encoder, on the other hand, generates roughly as many columns as the binary logarithm of the number of unique values, so we get exponentially fewer columns than with one-hot encoding. To further reduce the number of columns, we join the encoded columns into a single bitstring, so that each unique bitstring represents a unique value of the "Play" column. Then, we can drop the "Play" column.

In [12]:
playEncoder = ce.BinaryEncoder()

if 'Play' in data.columns:
    encodedPlay = playEncoder.fit_transform(data['Play'])

    data['PlayEncoded'] = encodedPlay.apply(lambda x: ''.join(x.astype(str)), axis=1)

    data.drop(['Play'], axis=1, inplace=True)

data.head()

Unnamed: 0,PlayerLinenumber,ActSceneLine,Player,PlayerLine,PlayEncoded
3,1.0,1.1.1,KING HENRY IV,"[shaken, wan, care]",1
4,1.0,1.1.2,KING HENRY IV,"[find, time, frighted, peace, pant]",1
5,1.0,1.1.3,KING HENRY IV,"[breathe, shortwinded, accent, new, broil]",1
6,1.0,1.1.4,KING HENRY IV,"[commenced, strand, afar, remote]",1
7,1.0,1.1.5,KING HENRY IV,"[thirsty, entrance, soil]",1
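
To illustrate why binary encoding needs far fewer columns than one-hot encoding, here is a minimal sketch of the idea in plain Python. It is not the `category_encoders` implementation: the function `binary_encode` is a hypothetical helper, and it assumes categories are numbered from 1 in order of first appearance.

```python
def binary_encode(categories):
    """Map each distinct category to a fixed-width bitstring.

    Categories are numbered 1..n in order of first appearance, and each
    number is written in binary using just enough bits to hold n.
    """
    ordinals = {}
    for value in categories:
        ordinals.setdefault(value, len(ordinals) + 1)
    width = len(ordinals).bit_length()
    return {value: format(code, f"0{width}b") for value, code in ordinals.items()}


# Hypothetical sample of play names, for illustration only
plays = ["Henry IV", "Hamlet", "Macbeth", "Othello", "King Lear", "Hamlet"]
codes = binary_encode(plays)
print(codes["Henry IV"], codes["King Lear"])  # 001 101
print(len(codes), "one-hot columns vs", len(codes["Henry IV"]), "binary columns")
```

With five distinct values, one-hot encoding would need five columns while the bitstring needs only three positions, and the gap widens as the number of categories grows.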


#### Label Encoding the Player column

The "Player" column is our target feature. For the convenience of classification, let us encode the "Player" column using numerical labels.

In [13]:
if data.dtypes['Player'] == 'object':
    playerEncoder = ce.OrdinalEncoder()
    
    data['Player'] = playerEncoder.fit_transform(data['Player'])

print(playerEncoder.category_mapping)
data.head()

[{'col': 'Player', 'mapping': KING HENRY IV      1
WESTMORELAND       2
FALSTAFF           3
PRINCE HENRY       4
POINS              5
                ... 
PERDITA          931
DORCAS           932
MOPSA            933
Shepard          934
NaN               -2
Length: 935, dtype: int64, 'data_type': dtype('O')}]


Unnamed: 0,PlayerLinenumber,ActSceneLine,Player,PlayerLine,PlayEncoded
3,1.0,1.1.1,1,"[shaken, wan, care]",1
4,1.0,1.1.2,1,"[find, time, frighted, peace, pant]",1
5,1.0,1.1.3,1,"[breathe, shortwinded, accent, new, broil]",1
6,1.0,1.1.4,1,"[commenced, strand, afar, remote]",1
7,1.0,1.1.5,1,"[thirsty, entrance, soil]",1
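
Since the classifier will predict integer labels, it helps to keep the inverse mapping around so predictions can be turned back into player names. The following is a small sketch of that idea in plain Python; `fit_label_mapping` is a hypothetical helper, and it assumes labels are numbered from 1 in order of first appearance, as in the mapping printed above.

```python
def fit_label_mapping(labels):
    """Assign integer codes 1..n to labels in order of first appearance."""
    mapping = {}
    for label in labels:
        mapping.setdefault(label, len(mapping) + 1)
    return mapping


players = ["KING HENRY IV", "KING HENRY IV", "WESTMORELAND", "FALSTAFF", "WESTMORELAND"]
to_code = fit_label_mapping(players)
# Inverse mapping, used to decode predicted labels back to names
to_name = {code: name for name, code in to_code.items()}

encoded = [to_code[p] for p in players]
print(encoded)      # [1, 1, 2, 3, 2]
print(to_name[3])   # FALSTAFF
```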


#### Decomposing the "ActSceneLine" column

In the ActSceneLine column, we have values that represent act.scene.line. The act, scene and line numbers might be useful to give a hierarchical insight. Therefore, we decompose the "ActSceneLine" column into three different columns. Then, we can drop the "ActSceneLine" column.

In [14]:
if not set(['Act', 'Scene', 'Line']).issubset(set(data.columns)):
    temp = data['ActSceneLine'].str.split('.', expand=True)
    temp.columns = ['Act', 'Scene', 'Line']

    data = pd.concat([data, temp], axis=1)

if 'ActSceneLine' in data.columns:
    data.drop(['ActSceneLine'], axis=1, inplace=True)
    
data.head()

Unnamed: 0,PlayerLinenumber,Player,PlayerLine,PlayEncoded,Act,Scene,Line
3,1.0,1,"[shaken, wan, care]",1,1,1,1
4,1.0,1,"[find, time, frighted, peace, pant]",1,1,1,2
5,1.0,1,"[breathe, shortwinded, accent, new, broil]",1,1,1,3
6,1.0,1,"[commenced, strand, afar, remote]",1,1,1,4
7,1.0,1,"[thirsty, entrance, soil]",1,1,1,5


#### Vectorizing the PlayerLine

PlayerLine values are now lists of tokens. A list of strings is not a suitable feature for classifiers, so we convert each token into a vector using the pretrained *Shakespeare_Word2Vec.sav* model. Then, we average the word vectors of each line to obtain a single vector that represents that line.

In [29]:
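def computeSentenceVector(frame):
    wv = Shakespeare_Word2Vec_Model.wv
    # Look up only the tokens that exist in the Word2Vec vocabulary
    vec = [wv[w] for w in frame['PlayerLine'] if w in wv]
    # Element-wise mean of the word vectors; zero vector if no token is known
    return np.mean(vec, axis=0) if vec else np.zeros(wv.vector_size)

In [None]:
# Store one sentence vector per row (the column name 'LineVector' is our choice)
data['LineVector'] = data.apply(computeSentenceVector, axis=1)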
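
The averaging step can be illustrated without the pretrained model. The sketch below uses a hypothetical two-dimensional embedding table (`toy_embeddings`) in place of Word2Vec, and it assumes that unknown tokens are skipped and that a line with no known tokens maps to a zero vector.

```python
def average_vectors(tokens, embeddings, dim):
    """Element-wise mean of the vectors of all tokens found in `embeddings`.

    Unknown tokens are skipped; a line with no known tokens maps to a zero vector.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(component) / len(known) for component in zip(*known)]


# Hypothetical 2-dimensional embeddings, for illustration only
toy_embeddings = {
    "shaken": [1.0, 0.0],
    "wan": [0.0, 1.0],
    "care": [2.0, 2.0],
}

print(average_vectors(["shaken", "wan", "care"], toy_embeddings, dim=2))  # [1.0, 1.0]
print(average_vectors(["unknown"], toy_embeddings, dim=2))                # [0.0, 0.0]
```

The important detail is that the mean is taken per dimension, so every line ends up with a vector of the same length regardless of how many tokens it contains.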

## **Step 5:** Exploring the Engineered Dataset using Visualization

### Wordcloud Generation

Wordclouds are an interesting technique to visualize word frequencies in a given text. Since our goal is to classify players from the other features, we will group the dataset by "Player" column and create a wordcloud for each player. This might be useful to explore the word usage frequencies for every player.

Since there are many players, visualizing all of them at once might not be feasible. Therefore, we take a subset of players and create wordclouds.

In [119]:
def showWordCloud(frame):
    cloudMask = np.array(Image.open("../data/external/william-shakespeare-black-silhouette.jpg"))
    
    playerLine = " ".join([w for line in frame['PlayerLine'] for w in line])
    
    wordCloud = WordCloud(max_font_size=100, background_color="white", mask=cloudMask)
    wordCloud.generate(playerLine)
    
    # Display the word cloud 
    ax = plt.gca()
    ax.set_title(frame['Player'].iloc[0])
    
    plt.imshow(wordCloud, interpolation='bilinear')
    plt.show()

In [None]:
# Group by a single column name so the group keys are plain label values
groupedData = data.groupby('Player')

groupKeys = list(groupedData.groups.keys())

for i in range(0, 10):
    showWordCloud(groupedData.get_group(groupKeys[i]))

From the wordclouds, we can see that players whose wordcloud only partially fills the silhouette of Shakespeare have fewer lines.
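
A wordcloud is a visual rendering of per-player word counts, so the same information can be inspected numerically. Here is a minimal sketch using only the standard library; `word_frequencies_by_player` and the sample `rows` are hypothetical and stand in for the grouped dataset.

```python
from collections import Counter, defaultdict


def word_frequencies_by_player(rows):
    """Count token occurrences per player from (player, tokens) pairs."""
    counts = defaultdict(Counter)
    for player, tokens in rows:
        counts[player].update(tokens)
    return counts


# Hypothetical rows for illustration only: (encoded player, token list)
rows = [
    (1, ["shaken", "wan", "care"]),
    (1, ["find", "time", "care"]),
    (2, ["peace"]),
]
freqs = word_frequencies_by_player(rows)
print(freqs[1].most_common(1))   # [('care', 2)]
print(sum(freqs[2].values()))    # 1
```

The total token count per player gives a quick numeric check of which players have too few lines to fill the silhouette.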

## **Step 6:** Classifying Players from the Rest of the Features

In [33]:
X = data[['PlayerLinenumber', 'Act', 'Scene', 'Line', 'PlayEncoded']]
y = data['Player']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

#### SVM Classifier

In [38]:
svmClassifier = SVC(class_weight='balanced')



#### RandomForest Classifier

## **Step 7:** Evaluation

## References

1. https://towardsdatascience.com/text-cleaning-methods-for-natural-language-processing-f2fc1796e8c7
2. Princeton University. "About WordNet." WordNet (https://wordnet.princeton.edu/). Princeton University, 2010.