# EECS 731 Project 2 - Classification
#### Author: Jace Kline
### Project Goal
The goal of this project is to classify a character (player) in a Shakespeare play given the line text, play, and data about the act, scene, and line where the character spoke the line.

### Setup
First, we import required Python 3 packages and our helper functions.

In [2]:
# General imports
import sys
sys.path.append('../src/')
from funcs import *

import re
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

Now we load the raw data file into a Pandas DataFrame object and print it to the screen.

In [3]:
df_orig = pd.read_csv("../data/raw/Shakespeare_data.csv")
df_orig

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
0,1,Henry IV,,,,ACT I
1,2,Henry IV,,,,SCENE I. London. The palace.
2,3,Henry IV,,,,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ..."
3,4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
4,5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"
...,...,...,...,...,...,...
111391,111392,A Winters Tale,38.0,5.3.180,LEONTES,"Lead us from hence, where we may leisurely"
111392,111393,A Winters Tale,38.0,5.3.181,LEONTES,Each one demand an answer to his part
111393,111394,A Winters Tale,38.0,5.3.182,LEONTES,Perform'd in this wide gap of time since first
111394,111395,A Winters Tale,38.0,5.3.183,LEONTES,We were dissever'd: hastily lead away.


## Data Preparation and Cleaning
### Removing Unknown 'Player' Rows
Since we want to train/test against the 'Player' label, we shall remove all records where the Player attribute is NaN.

In [4]:
df = df_orig.dropna(subset=['Player'])

Let's also clean up the indeces by resetting the index and removing the old 'index' and 'Dataline' columns.

In [5]:
df.reset_index(inplace=True)
df.drop(columns=['index','Dataline'],axis='columns',inplace=True)
df

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


Unnamed: 0,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
0,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
1,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"
2,Henry IV,1.0,1.1.3,KING HENRY IV,And breathe short-winded accents of new broils
3,Henry IV,1.0,1.1.4,KING HENRY IV,To be commenced in strands afar remote.
4,Henry IV,1.0,1.1.5,KING HENRY IV,No more the thirsty entrance of this soil
...,...,...,...,...,...
111384,A Winters Tale,38.0,5.3.180,LEONTES,"Lead us from hence, where we may leisurely"
111385,A Winters Tale,38.0,5.3.181,LEONTES,Each one demand an answer to his part
111386,A Winters Tale,38.0,5.3.182,LEONTES,Perform'd in this wide gap of time since first
111387,A Winters Tale,38.0,5.3.183,LEONTES,We were dissever'd: hastily lead away.


### Converting All Strings to Lowercase
We shall convert all strings to uppercase to avoid confusion and ambiguity in our text analysis.

In [6]:
str_cols = list(df.dtypes[df.dtypes == 'object'].keys())
str_cols

['Play', 'ActSceneLine', 'Player', 'PlayerLine']

In [7]:
for colname in str_cols:
    df[colname] = df[colname].apply(lambda x: str(x).lower())
df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[colname] = df[colname].apply(lambda x: str(x).lower())


Unnamed: 0,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
0,henry iv,1.0,1.1.1,king henry iv,"so shaken as we are, so wan with care,"
1,henry iv,1.0,1.1.2,king henry iv,"find we a time for frighted peace to pant,"
2,henry iv,1.0,1.1.3,king henry iv,and breathe short-winded accents of new broils
3,henry iv,1.0,1.1.4,king henry iv,to be commenced in strands afar remote.
4,henry iv,1.0,1.1.5,king henry iv,no more the thirsty entrance of this soil
...,...,...,...,...,...
111384,a winters tale,38.0,5.3.180,leontes,"lead us from hence, where we may leisurely"
111385,a winters tale,38.0,5.3.181,leontes,each one demand an answer to his part
111386,a winters tale,38.0,5.3.182,leontes,perform'd in this wide gap of time since first
111387,a winters tale,38.0,5.3.183,leontes,we were dissever'd: hastily lead away.


### Converting 'PlayerLinenumber' to an Integer value to conserve memory
Since the 'PlayerLinenumber' column is listing whole number values only, it is a good idea to replace this column's values with the integer equivalents.

In [8]:
df['PlayerLinenumber'] = df['PlayerLinenumber'].apply(lambda x: int(x))#.astype(np.short)
df['PlayerLinenumber'] = df['PlayerLinenumber'].astype(np.short)
df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['PlayerLinenumber'] = df['PlayerLinenumber'].apply(lambda x: int(x))#.astype(np.short)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['PlayerLinenumber'] = df['PlayerLinenumber'].astype(np.short)


Unnamed: 0,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
0,henry iv,1,1.1.1,king henry iv,"so shaken as we are, so wan with care,"
1,henry iv,1,1.1.2,king henry iv,"find we a time for frighted peace to pant,"
2,henry iv,1,1.1.3,king henry iv,and breathe short-winded accents of new broils
3,henry iv,1,1.1.4,king henry iv,to be commenced in strands afar remote.
4,henry iv,1,1.1.5,king henry iv,no more the thirsty entrance of this soil
...,...,...,...,...,...
111384,a winters tale,38,5.3.180,leontes,"lead us from hence, where we may leisurely"
111385,a winters tale,38,5.3.181,leontes,each one demand an answer to his part
111386,a winters tale,38,5.3.182,leontes,perform'd in this wide gap of time since first
111387,a winters tale,38,5.3.183,leontes,we were dissever'd: hastily lead away.


## Understanding the Data
### Querying Available Plays and Players
First, we shall get the play names in our dataset.

In [9]:
play_series = df['Play'].drop_duplicates()
play_names = list(play_series)
print("Number of plays: ", play_series.count())

Number of plays:  36


Now, we run a query to get the available players.

In [10]:
all_players = df['Player'].drop_duplicates()
players = list(all_players)
print("Number of players: ", all_players.count())

Number of players:  922


## Feature Engineering
To make our dataset more useful for training our model, we must perform transformations to our data. For each current feature in our dataset, we must look for ways to extract more meaningful information.

### Coversion of 'ActSceneLine' values to numerical values
Currently, the 'ActSceneLine' column is an object type, and therefore will serve little purpose in our numerically-inclined model. Hence, we must map this feature to an equivalent numerical form. We shall achieve this by creating three new features: 'Act', 'Scene', and 'Line', where the data type for each is an integer value. See the example below.

In [11]:
def actSceneLineConvert(asl): # string -> 3-tuple of integers
    regex = '([1-9]+)[.]([1-9]+)[.]([1-9]+)'
    m = re.search(regex, str(asl))
    if asl == asl and asl is not None and m.group(1):
        return (int(m.group(1)),int(m.group(2)),int(m.group(3)))
    else:
        return (0,0,0)

# Example
actSceneLineConvert('3.4.27')

(3, 4, 27)

Now, let us use this function that we defined to create our new features.

In [12]:
r,c = df.shape
newarr = np.zeros((r,3),dtype=np.short)

for i in range(0,r):
    try:
        newarr[i,0:3] = actSceneLineConvert(df.loc[i,'ActSceneLine'])
    except:
        pass

In [13]:
df_add = pd.DataFrame(newarr,columns=['Act','Scene','Line'])
df = pd.concat([df,df_add],axis='columns').drop('ActSceneLine', axis='columns')
cols = df.columns.tolist()
cols_ = cols[0:2] + cols[-3:] + cols[3:4] + cols[2:3]
df = df[cols_]
df

Unnamed: 0,Play,PlayerLinenumber,Act,Scene,Line,PlayerLine,Player
0,henry iv,1,1,1,1,"so shaken as we are, so wan with care,",king henry iv
1,henry iv,1,1,1,2,"find we a time for frighted peace to pant,",king henry iv
2,henry iv,1,1,1,3,and breathe short-winded accents of new broils,king henry iv
3,henry iv,1,1,1,4,to be commenced in strands afar remote.,king henry iv
4,henry iv,1,1,1,5,no more the thirsty entrance of this soil,king henry iv
...,...,...,...,...,...,...,...
111384,a winters tale,38,5,3,18,"lead us from hence, where we may leisurely",leontes
111385,a winters tale,38,5,3,181,each one demand an answer to his part,leontes
111386,a winters tale,38,5,3,182,perform'd in this wide gap of time since first,leontes
111387,a winters tale,38,5,3,183,we were dissever'd: hastily lead away.,leontes


### Conversion of 'Play' to Numerical values
Like above, we want to maximize the amount of numerical data available to our model and hence we shall convert each play to a unique integer value for in our dataset. We will store this new data in a new column called 'PlayNum'.

In [14]:
# Mapping of play name -> integer
zipper = zip(play_names,range(0,len(play_names)))
map_dict = dict(zipper)

# Apply the transformation over the 'Play' column
df['PlayNum'] = df['Play'].apply(lambda s: map_dict[str(s)])
df['PlayNum'] = df['PlayNum'].astype(np.short)

# Reorder the columns
cols = df.columns.tolist()
cols_ = ['Play','PlayNum'] + cols[1:-1]
df = df[cols_]
df

Unnamed: 0,Play,PlayNum,PlayerLinenumber,Act,Scene,Line,PlayerLine,Player
0,henry iv,0,1,1,1,1,"so shaken as we are, so wan with care,",king henry iv
1,henry iv,0,1,1,1,2,"find we a time for frighted peace to pant,",king henry iv
2,henry iv,0,1,1,1,3,and breathe short-winded accents of new broils,king henry iv
3,henry iv,0,1,1,1,4,to be commenced in strands afar remote.,king henry iv
4,henry iv,0,1,1,1,5,no more the thirsty entrance of this soil,king henry iv
...,...,...,...,...,...,...,...,...
111384,a winters tale,35,38,5,3,18,"lead us from hence, where we may leisurely",leontes
111385,a winters tale,35,38,5,3,181,each one demand an answer to his part,leontes
111386,a winters tale,35,38,5,3,182,perform'd in this wide gap of time since first,leontes
111387,a winters tale,35,38,5,3,183,we were dissever'd: hastily lead away.,leontes


#### Let's save our dataset
We have made a lot of changes to the 'df' dataset and should therefore save the changes into a new file for redundancy and backup.

In [15]:
df.to_csv('../data/transformed/shakespeare.csv')

## Model Ideas
Now that we have cleaned and transformed our dataset, we can consider ideas regarding our approach to choosing and training a predictive model.

### Key Considerations
Given that this dataset is over 100,000 rows, we must be cognizant of our memory usage and our model complexity. Given that our machine is not specifically built with data science in mind, we must operate within the memory and time constraints of our machine.

### Text-Based Approach
#### Overview
One property of our dataset as it stands is that the text field 'PlayerLine' holds the highest quantity of raw information in each row. By extracting textual patterns, word usage, and word frequency for each player, it shall be possible to predict the player based on these textual cues. The common approach in this type of endeavor is to employ a vectorizer that will do the following:
1. Tokenize all text to create new features based on words, bigrams, trigrams, etc.
2. Vectorize each row's text field into a vector representing the count or frequency of each token
After these steps, we shall use the vectorized data to train a model (e.g. Naive Bayes). Then, we shall evaluate the model with test data to determine the model's predictive success rate.

#### Splitting Dataset into Multiple Datasets (by 'Play')
A property of the current dataset is that it is quite large (over 100,000 rows). In the pursuit of a player classification technique over the text field, we shall split our dataset by play name into distinct datasets and form distinct models for each play. Since the 'Play' attribute shall be given to us in the final evaluation, we may use this information to conditionally choose which model to run on a given record to output the estimated 'Player'. By reducing our dataset into pieces and forming separate models, we increase the likelihood of choosing the correct player. Essentially, this approach is expressing that, first and foremost, the 'Play' attribute must be matched in the model. In addition to the predictive potential of this new approach, the smaller datasets allow for more intensive text analysis on each play due to more free memory on the system.

#### Testing this Approach
We shall employ the text-based approach on a subset of the data to determine its practicality and effectiveness potential. In particular, we shall test this on the data for "Henry IV".

##### Process
We will perform the following high-level steps to accomplish this vectorization process:
1. Define a vectorizer to be a ScikitLearn TF-IDF tokenizer class instance (Term frequency-inverse document frequency)
2. Tokenize our input text
    * We will specify the tokenizer to include single words as tokens
    * This will help to capture the context of the characters' language in a vectorized numerical form
3. Generate a new set of features to match the tokens found in all the text (from each row)
4. Run the vectorizer capability on each row to fill the corresponding normalized token frequencies

In [16]:
# Import required SciKit Learn constructs
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

In [17]:
# Get only the 'Henry IV' play data
henryiv = df[df['Play'] == 'henry iv']

# instantiate our vectorizer
vectorizer = TfidfVectorizer()

# instantiate our model
model = MultinomialNB()

# Assign independent, dependent variables for model
X_henryiv_text = henryiv['PlayerLine']
y_henryiv = henryiv['Player']

def printPerformance(dec):
    print("Model performance: {}%".format(dec*100))

def tfidfModelEval(X_text,y,vectorizer,model):
    # Map the text column to a vectorized array
    X = vectorizer.fit_transform(X_text).toarray()

    # Split the data into training and testing data
    X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=0)

    # Train the model on the training data
    clf = model.fit(X_train,y_train)

    # Evaluate and return the performance of the trained model on the testing data
    return clf.score(X_test,y_test)

printPerformance(tfidfModelEval(X_henryiv_text,y_henryiv,vectorizer,model))

Model performance: 27.250000000000004%


#### Results
As we can see from above, using a text-based vectorization approach only resulted in roughly 27% accuracy on the testing data. This low accuracy measure is an indicator that this paradigm is inherently flawed and the patterns that may be extracted from the text are quite sparse barring the use of more advanced methods. In addition to the low accuracy, it is simply not feasible to execute this type of classification model on the entire dataset due to memory constraints of our target machine. This leads us to pursue other options.

### Numbers-Exclusive Approach
#### Overview
Contary to the previous method, we shall use all of the numerical columns in our transformed dataset as features. These columns include 'PlayNum', 'PlayerLinenumber', 'Act', 'Scene', and 'Line'. My hypothesis is that, due to the conditional structure of the numerical data as it relates to the player output, we may use a Decision Tree model with success. In addition, since the memory overhead of this model is much lower, we may process the entire dataset in a single model.

In [18]:
# Import the Tree-based models
from sklearn.tree import DecisionTreeClassifier

In [19]:
feature_names = ['PlayNum','PlayerLinenumber','Act','Scene','Line']
X = df[feature_names].to_numpy()
y = df['Player']

def standardModelEval(X,y,model):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = model.fit(X_train, y_train)
    return clf.score(X_test,y_test)

In [20]:
model = DecisionTreeClassifier()
printPerformance(standardModelEval(X,y,model))

Model performance: 74.78813559322035%


#### Results
As we can see above, we achieved roughly 75% accuracy on our tree-based model. This model was very fast and required very little memory compared to the text vectorization model. These properties enabled us to train and test our entire dataset at one time. Although this percentage isn't perfect, the high speed and low memory allow us to work with this model over our entire dataset.

### Conclusion
The text-based analysis and modeling proved to be difficult in this project due to the sheer amount of data and the lack of textual patterns available to classify characters. On the contrary, utilization of the numerical features of our dataset and the deployment of a decision tree model proved to be not only much more accurate, but also fast and memory efficient in the context of our entire dataset. Although the text model proved ineffective in our limited trial, the idea may still be put to good use in conjuction to a tree-based model. Due to time constraints, we were not able to explore the concept of model combination, but this concept may prove useful in the context of this problem.