## EECS 731 Project 2: To be, or not to be

In this project, I read a [Shakespeare play dataset](https://www.kaggle.com/kingburrito666/shakespeare-plays) and train classifiers to predict which player speaks each line. In particular, I test how classification performance changes when feature subsets are used instead of the entire feature set.

In addition to the above link, the dataset used in this project can be found in the `data/raw/` directory.

### Python Imports

In [1]:
import pandas as pd

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from sklearn.naive_bayes import GaussianNB
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

### Creating the Classification Models

For all feature sets/subsets, I will be evaluating the performance of four different classification models:

- Naive Bayes
- Decision Tree
- Random Forest
- Neural Network (MLP)

In [2]:
class_NaiveBayes = GaussianNB()
class_DecisionTree = tree.DecisionTreeClassifier()
class_RandomForest = RandomForestClassifier(max_depth=6)
class_NeuralNetwork = MLPClassifier(max_iter=1000)

## Base Case

### Reading the Dataset

For the base case, I read all of the dataset's features so that they can be used for training and testing. After loading the dataset, I remove all rows that contain empty values as well as any duplicate rows. For this step, I keep the "Player" column alongside the feature columns so that features and players are filtered together and end up with the same number of rows.

In [3]:
player_dataset = pd.read_csv("../data/raw/Shakespeare_data.csv")
column_names = ["Play", "ActSceneLine", "PlayerLine", "PlayerLinenumber", "Player"]
columns = player_dataset.loc[:,column_names].dropna()
columns = columns.drop_duplicates()
columns.to_csv("../data/processed/base-case.csv")
columns

Unnamed: 0,Play,ActSceneLine,PlayerLine,PlayerLinenumber,Player
3,Henry IV,1.1.1,"So shaken as we are, so wan with care,",1.0,KING HENRY IV
4,Henry IV,1.1.2,"Find we a time for frighted peace to pant,",1.0,KING HENRY IV
5,Henry IV,1.1.3,And breathe short-winded accents of new broils,1.0,KING HENRY IV
6,Henry IV,1.1.4,To be commenced in strands afar remote.,1.0,KING HENRY IV
7,Henry IV,1.1.5,No more the thirsty entrance of this soil,1.0,KING HENRY IV
...,...,...,...,...,...
111390,A Winters Tale,5.3.179,"Is troth-plight to your daughter. Good Paulina,",38.0,LEONTES
111391,A Winters Tale,5.3.180,"Lead us from hence, where we may leisurely",38.0,LEONTES
111392,A Winters Tale,5.3.181,Each one demand an answer to his part,38.0,LEONTES
111393,A Winters Tale,5.3.182,Perform'd in this wide gap of time since first,38.0,LEONTES
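
To make the cleaning step above concrete, here is a minimal sketch on a hypothetical four-row frame (the rows are made up for illustration and are not taken from the dataset). It also shows that dropping duplicates on fewer columns can remove more rows, which is one plausible reason the row counts differ slightly between feature subsets.

```python
import pandas as pd

# Hypothetical toy rows that mimic the dataset's columns (not real data).
toy = pd.DataFrame({
    "Play": ["Henry IV"] * 4,
    "ActSceneLine": [None, "1.1.1", "1.1.1", "1.1.1"],
    "PlayerLine": ["ACT I", "So shaken", "So shaken", "Find we a time"],
    "Player": [None, "KING HENRY IV", "KING HENRY IV", "KING HENRY IV"],
})

# All columns: the row with missing values and one exact duplicate are removed.
cleaned_all = toy.dropna().drop_duplicates()

# Fewer columns: rows that differed only in PlayerLine now collapse as well.
cleaned_subset = toy[["Play", "ActSceneLine", "Player"]].dropna().drop_duplicates()

print(len(toy), len(cleaned_all), len(cleaned_subset))
```

Because `drop_duplicates` keeps the original index labels, the surviving rows still line up with the source file.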


### Extracting the Features and Players

In [4]:
feature_names = column_names[:-1]
features = columns.loc[:,feature_names].copy()  # explicit copy, so encoding the columns later does not warn
features

Unnamed: 0,Play,ActSceneLine,PlayerLine,PlayerLinenumber
3,Henry IV,1.1.1,"So shaken as we are, so wan with care,",1.0
4,Henry IV,1.1.2,"Find we a time for frighted peace to pant,",1.0
5,Henry IV,1.1.3,And breathe short-winded accents of new broils,1.0
6,Henry IV,1.1.4,To be commenced in strands afar remote.,1.0
7,Henry IV,1.1.5,No more the thirsty entrance of this soil,1.0
...,...,...,...,...
111390,A Winters Tale,5.3.179,"Is troth-plight to your daughter. Good Paulina,",38.0
111391,A Winters Tale,5.3.180,"Lead us from hence, where we may leisurely",38.0
111392,A Winters Tale,5.3.181,Each one demand an answer to his part,38.0
111393,A Winters Tale,5.3.182,Perform'd in this wide gap of time since first,38.0


In [5]:
players_name = column_names[-1]
players = columns.loc[:,players_name]
players

3         KING HENRY IV
4         KING HENRY IV
5         KING HENRY IV
6         KING HENRY IV
7         KING HENRY IV
              ...      
111390          LEONTES
111391          LEONTES
111392          LEONTES
111393          LEONTES
111394          LEONTES
Name: Player, Length: 105152, dtype: object

After separating the features from the players (the class labels), I label-encode both so that they are compatible with the classification models.

In [6]:
encoder = preprocessing.LabelEncoder()
for f in feature_names:
    features[f] = encoder.fit_transform(features[f])
features

Unnamed: 0,Play,ActSceneLine,PlayerLine,PlayerLinenumber
3,9,324,60240,0
4,9,435,23568,0
5,9,546,4998,0
6,9,657,73793,0
7,9,768,48893,0
...,...,...,...,...
111390,2,14601,41329,37
111391,2,14603,42772,37
111392,2,14604,22110,37
111393,2,14605,55480,37


In [7]:
players = encoder.fit_transform(players)
players

array([457, 457, 457, ..., 494, 494, 494])
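
The label-encoding step above reuses a single `encoder`, which is refit on every column, so after the loop it only remembers the last mapping. A possible variation, sketched here on hypothetical toy data, keeps one encoder per column so that integer codes can later be mapped back to names. The `encoders` dict and the toy frame are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data (not taken from the dataset).
toy = pd.DataFrame({
    "Play": ["Henry IV", "Henry IV", "A Winters Tale"],
    "Player": ["KING HENRY IV", "WESTMORELAND", "LEONTES"],
})

# One encoder per column, kept in a dict so each mapping stays available.
encoders = {}
encoded = toy.copy()
for col in toy.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(toy[col])

# Classes are sorted, so the codes follow the sorted order of the strings.
print(encoded["Player"].tolist())

# A fitted encoder can map integer codes back to the original names.
decoded = encoders["Player"].inverse_transform(encoded["Player"])
print(list(decoded))
```

With the encoders retained, predicted class indices from any of the classifiers could be decoded the same way via `inverse_transform`.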

### Creating the Train and Test Sets

In [8]:
features_train, features_test, players_train, players_test = train_test_split(features, players, test_size=0.50, random_state=0, shuffle=True)

### Performing Classification

In [9]:
players_pred = class_NaiveBayes.fit(features_train, players_train).predict(features_test)
print("Naive Bayes Accuracy: {:.2f}".format(accuracy_score(players_test, players_pred)))

players_pred = class_DecisionTree.fit(features_train, players_train).predict(features_test)
print("Decision Tree Accuracy: {:.2f}".format(accuracy_score(players_test, players_pred)))

players_pred = class_RandomForest.fit(features_train, players_train).predict(features_test)
print("Random Forest Accuracy: {:.2f}".format(accuracy_score(players_test, players_pred)))

players_pred = class_NeuralNetwork.fit(features_train, players_train).predict(features_test)
print("Neural Network Accuracy: {:.2f}".format(accuracy_score(players_test, players_pred)))

Naive Bayes Accuracy: 0.20
Decision Tree Accuracy: 0.66
Random Forest Accuracy: 0.18
Neural Network Accuracy: 0.02


For the base case, the Decision Tree classifier had the best performance. Note, however, that the Random Forest classifier was limited to a maximum depth of 6 in order to reduce computation time. Without that limit, the Random Forest classifier would most likely perform similarly to, if not better than, the Decision Tree classifier (though it would take much longer to run).
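
One way to check the depth claim would be to fit the forest with and without the limit and compare accuracies. The sketch below does this on small synthetic data from `make_classification`, which is a stand-in chosen to keep the run short and is not the Shakespeare dataset; `n_estimators=50` and the fixed `random_state` are likewise assumptions. Its numbers say nothing about the results reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the Shakespeare dataset).
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.50, random_state=0, shuffle=True)

depth_accuracy = {}
for depth in (6, None):  # None removes the depth limit on each tree
    forest = RandomForestClassifier(max_depth=depth, n_estimators=50, random_state=0)
    forest.fit(X_train, y_train)
    depth_accuracy[depth] = accuracy_score(y_test, forest.predict(X_test))

for depth, acc in depth_accuracy.items():
    print("max_depth={}: {:.2f}".format(depth, acc))
```

Running the same loop on `features_train` and `players_train` would answer the question for this dataset, at the cost of the longer run time mentioned above.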

## Subset Case 1 - Play + ActSceneLine

In the first feature subset, I only use the Play and ActSceneLine features from the original dataset. Note that the same general steps employed in the base case are also used for this subset case.

### Reading the Dataset

In [10]:
column_names = ["Play", "ActSceneLine", "Player"]
columns = player_dataset.loc[:,column_names].dropna()
columns = columns.drop_duplicates()
columns.to_csv("../data/processed/subset-case-1.csv")
columns

Unnamed: 0,Play,ActSceneLine,Player
3,Henry IV,1.1.1,KING HENRY IV
4,Henry IV,1.1.2,KING HENRY IV
5,Henry IV,1.1.3,KING HENRY IV
6,Henry IV,1.1.4,KING HENRY IV
7,Henry IV,1.1.5,KING HENRY IV
...,...,...,...
111390,A Winters Tale,5.3.179,LEONTES
111391,A Winters Tale,5.3.180,LEONTES
111392,A Winters Tale,5.3.181,LEONTES
111393,A Winters Tale,5.3.182,LEONTES


### Extracting the Features and Players

In [11]:
feature_names = column_names[:-1]
features = columns.loc[:,feature_names].copy()  # explicit copy, so encoding the columns later does not warn
features

Unnamed: 0,Play,ActSceneLine
3,Henry IV,1.1.1
4,Henry IV,1.1.2
5,Henry IV,1.1.3
6,Henry IV,1.1.4
7,Henry IV,1.1.5
...,...,...
111390,A Winters Tale,5.3.179
111391,A Winters Tale,5.3.180
111392,A Winters Tale,5.3.181
111393,A Winters Tale,5.3.182


In [12]:
players_name = column_names[-1]
players = columns.loc[:,players_name]
players

3         KING HENRY IV
4         KING HENRY IV
5         KING HENRY IV
6         KING HENRY IV
7         KING HENRY IV
              ...      
111390          LEONTES
111391          LEONTES
111392          LEONTES
111393          LEONTES
111394          LEONTES
Name: Player, Length: 104991, dtype: object

### Label Encoding

In [13]:
encoder = preprocessing.LabelEncoder()
for f in feature_names:
    features[f] = encoder.fit_transform(features[f])
features

Unnamed: 0,Play,ActSceneLine
3,9,324
4,9,435
5,9,546
6,9,657
7,9,768
...,...,...
111390,2,14601
111391,2,14603
111392,2,14604
111393,2,14605


In [14]:
players = encoder.fit_transform(players)
players

array([457, 457, 457, ..., 494, 494, 494])

### Creating Train and Test Sets

In [15]:
features_train, features_test, players_train, players_test = train_test_split(features, players, test_size=0.50, random_state=0, shuffle=True)

### Performing Classification

In [16]:
players_pred = class_NaiveBayes.fit(features_train, players_train).predict(features_test)
print("Naive Bayes Accuracy: {:.2f}".format(accuracy_score(players_test, players_pred)))

players_pred = class_DecisionTree.fit(features_train, players_train).predict(features_test)
print("Decision Tree Accuracy: {:.2f}".format(accuracy_score(players_test, players_pred)))

players_pred = class_RandomForest.fit(features_train, players_train).predict(features_test)
print("Random Forest Accuracy: {:.2f}".format(accuracy_score(players_test, players_pred)))

players_pred = class_NeuralNetwork.fit(features_train, players_train).predict(features_test)
print("Neural Network Accuracy: {:.2f}".format(accuracy_score(players_test, players_pred)))

Naive Bayes Accuracy: 0.23
Decision Tree Accuracy: 0.61
Random Forest Accuracy: 0.18
Neural Network Accuracy: 0.10


As in the base case, the Decision Tree classifier has the best performance. In fact, the accuracy of all four classifiers stays close to the base case. This suggests that the Play and ActSceneLine features alone carry most of the information that the full feature set provides for this task.
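
Since every case repeats the same steps (clean, encode, split, fit, score), they could be wrapped in one helper and called once per feature subset. The function below is a minimal sketch of that idea, not code from the original notebook: the name `evaluate_subset`, its parameters, and the toy data are all hypothetical.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier


def evaluate_subset(dataset, feature_names, label_name, models):
    """Clean, label-encode, split, and score each model on one feature subset."""
    columns = dataset.loc[:, feature_names + [label_name]].dropna().drop_duplicates()
    encoded = columns.copy()
    for name in encoded.columns:
        encoded[name] = LabelEncoder().fit_transform(encoded[name])
    features_train, features_test, labels_train, labels_test = train_test_split(
        encoded[feature_names], encoded[label_name],
        test_size=0.50, random_state=0, shuffle=True)
    scores = {}
    for model_name, model in models.items():
        predictions = model.fit(features_train, labels_train).predict(features_test)
        scores[model_name] = accuracy_score(labels_test, predictions)
    return scores


# Hypothetical toy data: the player is fully determined by the play.
toy = pd.DataFrame({
    "Play": ["Henry IV" if i % 2 == 0 else "A Winters Tale" for i in range(40)],
    "ActSceneLine": ["1.1.{}".format(i) for i in range(40)],
    "Player": ["KING HENRY IV" if i % 2 == 0 else "LEONTES" for i in range(40)],
})
models = {"Decision Tree": DecisionTreeClassifier(random_state=0)}
scores = evaluate_subset(toy, ["Play", "ActSceneLine"], "Player", models)
print(scores)
```

With the real data, the call would take `player_dataset`, a list such as `["Play", "ActSceneLine"]`, and a dict holding the four classifiers created earlier.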

## Subset Case 2 - PlayerLine + PlayerLinenumber

In the second subset, I only use the PlayerLine and PlayerLinenumber features from the original dataset. Once again, I use the same general steps for performing the classification.

### Reading the Dataset

column_names = ["PlayerLine", "PlayerLinenumber", "Player"]
columns = player_dataset.loc[:,column_names].dropna()
columns = columns.drop_duplicates()
columns.to_csv("../data/processed/subset-case-2.csv")
columns

### Extracting the Features and Players

In [17]:
feature_names = column_names[:-1]
features = columns.loc[:,feature_names].copy()  # explicit copy, so encoding the columns later does not warn
features

Unnamed: 0,Play,ActSceneLine
3,Henry IV,1.1.1
4,Henry IV,1.1.2
5,Henry IV,1.1.3
6,Henry IV,1.1.4
7,Henry IV,1.1.5
...,...,...
111390,A Winters Tale,5.3.179
111391,A Winters Tale,5.3.180
111392,A Winters Tale,5.3.181
111393,A Winters Tale,5.3.182


In [18]:
players_name = column_names[-1]
players = columns.loc[:,players_name]
players

3         KING HENRY IV
4         KING HENRY IV
5         KING HENRY IV
6         KING HENRY IV
7         KING HENRY IV
              ...      
111390          LEONTES
111391          LEONTES
111392          LEONTES
111393          LEONTES
111394          LEONTES
Name: Player, Length: 104991, dtype: object

### Label Encoding

In [19]:
encoder = preprocessing.LabelEncoder()
for f in feature_names:
    features[f] = encoder.fit_transform(features[f])
features

Unnamed: 0,Play,ActSceneLine
3,9,324
4,9,435
5,9,546
6,9,657
7,9,768
...,...,...
111390,2,14601
111391,2,14603
111392,2,14604
111393,2,14605


In [20]:
players = encoder.fit_transform(players)
players

array([457, 457, 457, ..., 494, 494, 494])

### Creating Train and Test Sets

In [21]:
features_train, features_test, players_train, players_test = train_test_split(features, players, test_size=0.50, random_state=0, shuffle=True)

### Performing Classification

In [22]:
players_pred = class_NaiveBayes.fit(features_train, players_train).predict(features_test)
print("Naive Bayes Accuracy: {:.2f}".format(accuracy_score(players_test, players_pred)))

players_pred = class_DecisionTree.fit(features_train, players_train).predict(features_test)
print("Decision Tree Accuracy: {:.2f}".format(accuracy_score(players_test, players_pred)))

players_pred = class_RandomForest.fit(features_train, players_train).predict(features_test)
print("Random Forest Accuracy: {:.2f}".format(accuracy_score(players_test, players_pred)))

players_pred = class_NeuralNetwork.fit(features_train, players_train).predict(features_test)
print("Neural Network Accuracy: {:.2f}".format(accuracy_score(players_test, players_pred)))

Naive Bayes Accuracy: 0.23
Decision Tree Accuracy: 0.61
Random Forest Accuracy: 0.18
Neural Network Accuracy: 0.13


In this case, the accuracy of all four classifiers drops sharply, reaching only 4% in the best case (Decision Tree and Random Forest). As such, unlike the first subset case, the features used for this subset cannot accurately represent the entire dataset on their own.

## Results

As the accuracy of the classification models across all test cases shows, the choice of features can have a significant impact on the end results. Namely, the first feature subset (Play + ActSceneLine) achieved performance similar to the full feature set, whereas the second feature subset (PlayerLine + PlayerLinenumber) performed much worse.