First, we import required Python 3 packages and our helper functions.

In [95]:
import sys
sys.path.append('../src/')
from funcs import *

import re
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

Now we load the raw data file into a Pandas DataFrame object and print it to the screen.

In [96]:
df_orig = pd.read_csv("../data/raw/Shakespeare_data.csv")
df_orig

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
0,1,Henry IV,,,,ACT I
1,2,Henry IV,,,,SCENE I. London. The palace.
2,3,Henry IV,,,,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ..."
3,4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
4,5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"
...,...,...,...,...,...,...
111391,111392,A Winters Tale,38.0,5.3.180,LEONTES,"Lead us from hence, where we may leisurely"
111392,111393,A Winters Tale,38.0,5.3.181,LEONTES,Each one demand an answer to his part
111393,111394,A Winters Tale,38.0,5.3.182,LEONTES,Perform'd in this wide gap of time since first
111394,111395,A Winters Tale,38.0,5.3.183,LEONTES,We were dissever'd: hastily lead away.


## Data Preparation and Cleaning
### Removing Unknown 'Player' Rows
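Since we want to train and test against the 'Player' label, we remove every record whose 'Player' value is NaN. As the first rows of the table above show, these records hold act headings, scene headings, and stage directions rather than spoken lines.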

In [97]:
df = df_orig.dropna(subset=['Player'])
df

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
3,4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
4,5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"
5,6,Henry IV,1.0,1.1.3,KING HENRY IV,And breathe short-winded accents of new broils
6,7,Henry IV,1.0,1.1.4,KING HENRY IV,To be commenced in strands afar remote.
7,8,Henry IV,1.0,1.1.5,KING HENRY IV,No more the thirsty entrance of this soil
...,...,...,...,...,...,...
111391,111392,A Winters Tale,38.0,5.3.180,LEONTES,"Lead us from hence, where we may leisurely"
111392,111393,A Winters Tale,38.0,5.3.181,LEONTES,Each one demand an answer to his part
111393,111394,A Winters Tale,38.0,5.3.182,LEONTES,Perform'd in this wide gap of time since first
111394,111395,A Winters Tale,38.0,5.3.183,LEONTES,We were dissever'd: hastily lead away.
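
As a minimal sketch of the filtering step, the toy example below drops unlabelled rows from a small hand-made frame. The frame `toy` and the name `labelled` are illustrative assumptions, not part of the notebook; the added `.copy()` is a suggestion that makes the filtered frame independent of its source, which can prevent pandas from warning when columns are assigned later.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the dataset's layout; the values are illustrative only.
toy = pd.DataFrame({
    "Player": [np.nan, "KING HENRY IV", "KING HENRY IV", np.nan],
    "PlayerLine": ["ACT I", "So shaken as we are,", "Find we a time", "Exeunt"],
})

# Keep only rows that have a speaker. The .copy() makes the result an
# independent frame rather than something derived from a slice of `toy`.
labelled = toy.dropna(subset=["Player"]).copy()

print(len(toy), "->", len(labelled))
```

The original row index is preserved, which is why the cleaned table above starts at index 3 rather than 0.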


### Converting All Strings to Uppercase
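We shall convert all strings to uppercase so that values differing only in letter case (for example, a player name written as 'Leontes' and 'LEONTES') are treated as identical in our text analysis.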

In [98]:
str_cols = list(df.dtypes[df.dtypes == 'object'].keys())
str_cols

['Play', 'ActSceneLine', 'Player', 'PlayerLine']

In [127]:
# Build a new DataFrame instead of assigning into a slice of df_orig;
# this avoids pandas' SettingWithCopyWarning.
df = df.assign(**{colname: df[colname].str.upper() for colname in str_cols})
df


Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
3,4,HENRY IV,1.0,1.1.1,KING HENRY IV,"SO SHAKEN AS WE ARE, SO WAN WITH CARE,"
4,5,HENRY IV,1.0,1.1.2,KING HENRY IV,"FIND WE A TIME FOR FRIGHTED PEACE TO PANT,"
5,6,HENRY IV,1.0,1.1.3,KING HENRY IV,AND BREATHE SHORT-WINDED ACCENTS OF NEW BROILS
6,7,HENRY IV,1.0,1.1.4,KING HENRY IV,TO BE COMMENCED IN STRANDS AFAR REMOTE.
7,8,HENRY IV,1.0,1.1.5,KING HENRY IV,NO MORE THE THIRSTY ENTRANCE OF THIS SOIL
...,...,...,...,...,...,...
111391,111392,A WINTERS TALE,38.0,5.3.180,LEONTES,"LEAD US FROM HENCE, WHERE WE MAY LEISURELY"
111392,111393,A WINTERS TALE,38.0,5.3.181,LEONTES,EACH ONE DEMAND AN ANSWER TO HIS PART
111393,111394,A WINTERS TALE,38.0,5.3.182,LEONTES,PERFORM'D IN THIS WIDE GAP OF TIME SINCE FIRST
111394,111395,A WINTERS TALE,38.0,5.3.183,LEONTES,WE WERE DISSEVER'D: HASTILY LEAD AWAY.


## Understanding the Data
### Querying Available Plays and Players
First, we shall get the play names in our dataset.

In [100]:
play_series = df[df['Play'].shift() != df['Play']]['Play']
play_names = list(play_series)
print("Number of plays: ", play_series.count())

Number of plays:  36
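
The shift-based filter above keeps the first row of every consecutive run of identical 'Play' values, so it counts plays correctly only when each play's rows are contiguous in the file. The toy sketch below (the series `plays` is an illustrative assumption) shows how the comparison works and uses `nunique()` as an order-independent cross-check.

```python
import pandas as pd

# Toy 'Play' column; the values are illustrative only.
plays = pd.Series(["HENRY IV", "HENRY IV", "MACBETH", "MACBETH", "HAMLET"])

# shift() moves every value down one row, so the comparison is True
# exactly where a row's play differs from the row above it.
is_first_row_of_play = plays.shift() != plays
run_starts = plays[is_first_row_of_play]

print(list(run_starts))   # plays in order of first appearance
print(plays.nunique())    # number of distinct plays, regardless of order
```

If the two counts disagree on the real data, some play's rows are split into more than one run.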


Now, we run a query to get the available players.

In [None]:
players_series = df['Player'].drop_duplicates()
players = list(players_series)
print("Number of players: ", players_series.count())

In [None]:
# Use df.groupby('Player') instead

# accum = np.empty((players_series.count(),2), dtype='object')
# i = 0
# for player in players:
#     accum[i]=[player,foldl(df['PlayerLine'][df['Player'] == player].values, lambda x, y: x + ' ' + y, '')]
#     i+=1
# pd.DataFrame(accum, columns=['Player','Words'])


In [None]:
# df.groupby('Player')
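
The note above suggests replacing the commented-out loop with `groupby`. A minimal sketch of that idea follows, on a toy frame (the frame `toy` and its rows are illustrative assumptions). Unlike the fold with an empty initial string, joining with `' '.join` does not leave a leading space.

```python
import pandas as pd

# Toy stand-in for df; the rows are illustrative only.
toy = pd.DataFrame({
    "Player": ["LEONTES", "KING HENRY IV", "LEONTES"],
    "PlayerLine": ["LEAD US FROM HENCE,", "SO SHAKEN AS WE ARE,", "HASTILY LEAD AWAY."],
})

# Concatenate every line spoken by each player into one string.
# 'Words' matches the column name used in the commented-out loop.
words = (
    toy.groupby("Player")["PlayerLine"]
       .agg(" ".join)
       .reset_index(name="Words")
)
print(words)
```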

## Feature Engineering
To make our dataset more useful for training our model, we must transform the data. For each existing feature, we look for ways to extract more meaningful information.

### Conversion of 'ActSceneLine' Values to Numerical Values
Currently, the 'ActSceneLine' column has the object dtype: each value is a string such as '1.1.1', which serves little purpose in a model that expects numerical input. Hence, we must map this feature to an equivalent numerical form. We shall achieve this by creating three new features, 'Act', 'Scene', and 'Line', each holding an integer value. See the example below.

In [103]:
# Example
actSceneLineConvert('3.4.27')

(3, 4, 27)
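
The helper itself is defined in funcs and is not shown here. As a hedged sketch of what such a conversion might look like, the hypothetical function below (the name `act_scene_line_convert` and its body are assumptions, not the notebook's implementation) splits the string on periods and casts each part to an integer. It expects a well-formed string, so missing values would need to be handled before calling it.

```python
def act_scene_line_convert(value):
    """Split an 'act.scene.line' string into a tuple of three ints.

    Hypothetical re-implementation; the notebook's own helper lives in funcs.
    """
    act, scene, line = value.split(".")
    return int(act), int(scene), int(line)


print(act_scene_line_convert("3.4.27"))  # (3, 4, 27)
```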

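Now, let us use this function to create our new features. We first inspect the 'ActSceneLine' column; note that it still contains at least one NaN value, which the conversion must handle.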

In [104]:
df['ActSceneLine']

3           1.1.1
4           1.1.2
5           1.1.3
6           1.1.4
7           1.1.5
           ...   
111391    5.3.180
111392    5.3.181
111393    5.3.182
111394    5.3.183
111395        NaN
Name: ActSceneLine, Length: 111389, dtype: object
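
One possible way to derive the three integer features while tolerating the NaN shown above is to split the column with pandas string methods. This is a sketch on a toy series, not the notebook's own code; the names `asl` and `parts` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df['ActSceneLine'], including one missing value.
asl = pd.Series(["1.1.1", "1.1.2", "5.3.183", np.nan], name="ActSceneLine")

# Split on '.' into three columns, then convert the pieces to numbers.
# The nullable 'Int64' dtype keeps integers while allowing missing values.
parts = asl.str.split(".", expand=True)
parts.columns = ["Act", "Scene", "Line"]
parts = parts.apply(pd.to_numeric).astype("Int64")

print(parts)
```

Because `parts` shares the index of the source series, it could be attached to the main frame with `df.join(parts)`.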

### Text Analysis
Since most of this dataset's classification potential lies in the spoken text itself, we shall use text vectorization to transform the text into numerical vectors usable by our chosen model. We will perform the following high-level steps:
1. Define a vectorizer as an instance of the scikit-learn TfidfVectorizer class (TF-IDF: term frequency-inverse document frequency)
2. Tokenize our input text
    * We will configure the vectorizer to treat single words, bigrams, and trigrams as tokens
    * This helps to capture the context of the characters' language in a vectorized numerical form
3. Generate a new set of features, one per distinct token found across all rows of text
4. Run the vectorizer on each row to fill in the corresponding normalized token frequencies

In [None]:
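# ngram_range=(1, 3) includes single words, two-word, and three-word sequences as tokens
tokenizer = TfidfVectorizer(ngram_range=(1, 3))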
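
The steps above can be sketched on a toy corpus as follows. The three strings stand in for 'PlayerLine' values and are illustrative only, and the variable names `vectorizer`, `tfidf`, and `vocab` are assumptions. Note that with its default settings TfidfVectorizer lowercases the text and ignores single-character tokens.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for df['PlayerLine']; these strings are illustrative only.
lines = [
    "SO SHAKEN AS WE ARE",
    "FIND WE A TIME",
    "WE WERE DISSEVER'D",
]

# ngram_range=(1, 3) asks for single words, bigrams, and trigrams.
vectorizer = TfidfVectorizer(ngram_range=(1, 3))

# fit_transform learns the vocabulary and returns a sparse matrix with
# one row per input line and one column per distinct token.
tfidf = vectorizer.fit_transform(lines)

# vocabulary_ maps each token to its column index in the matrix.
vocab = vectorizer.vocabulary_
print(tfidf.shape)
print("so shaken as" in vocab)
```

Each row of the resulting matrix is L2-normalized by default, which corresponds to the normalized token frequencies in step 4.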