## 1. Imports (Libraries)

<a id='imports'></a>

In [1]:
# Import basic libraries
import pandas as pd
import numpy as np
from ast import literal_eval

In [2]:
# Import visualisation libraries
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# Import NLP libraries
import re
import string
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [4]:
from gensim.models import Word2Vec

In [5]:
# Import sklearn libraries
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, RocCurveDisplay
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score
from sklearn.metrics import multilabel_confusion_matrix

In [6]:
# Uncomment if not installed

#!pip install scikit-multilearn

from skmultilearn.model_selection import iterative_train_test_split

In [7]:
from xgboost import XGBClassifier

In [8]:
# Uncomment if not installed

#!pip install shap

import shap

In [9]:
# some display adjustments to account for the fact that we have many columns
# and some columns contain many characters

np.set_printoptions(threshold=np.inf)
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.max_columns', None)

## 2. Imports (Data)

<a id='data_imports'></a>

Since we exported the cleaned datasets as csv files in Part II, we can simply import those files and start working on them immediately. The columns we are most concerned with are 'text_lemma' and the label columns ('Trope Name' in train_df, '10 Tropes' in test_df).

Data dictionary:

For test_df:
|column| datatype|explanation|
|:-|:-:|:-|
|<b>Episode Title</b>|*object*| The title of each episode from the first season of The Simpsons.|
|<b>Full Story</b>| *object*| The full story text of the episode.|
|<b>text_lemma</b>| *object*| Full Story after lemmatisation.|
|<b>10 Tropes</b>| *object*| The list of tropes (out of the 10 in train_df) tagged for the episode, stored as a string.|
<br>

For train_df:
|column| datatype|explanation|
|:-|:-:|:-|
|<b>Trope Name</b>|*object*| The name for each trope.|
|<b>Trope Description</b>| *object*| The description of each trope (e.g. What is the trope about?).|
|<b>Text Length</b>| *int64*| Word length of the trope description.|
|<b>text_lemma</b>| *object*| Trope Description after lemmatisation.|
<br>

For binarised_df:
|column| datatype|explanation|
|:-|:-:|:-|
|<b>Catchphrase</b>|*int32*| Binary column (1 = Present in text, 0 = Absent in text)|
|<b>Characterization Marches On</b>| *int32*| Binary column (1 = Present in text, 0 = Absent in text)|
|<b>Comically Missing the Point</b>| *int32*| Binary column (1 = Present in text, 0 = Absent in text)|
|<b>Couch Gag</b>| *int32*| Binary column (1 = Present in text, 0 = Absent in text)|
|<b>Disproportionate Retribution</b>| *int32*| Binary column (1 = Present in text, 0 = Absent in text)|
|<b>Establishing Character Moment</b>| *int32*| Binary column (1 = Present in text, 0 = Absent in text)|
|<b>Imagine Spot</b>| *int32*| Binary column (1 = Present in text, 0 = Absent in text)|
|<b>Karma Houdini</b>| *int32*| Binary column (1 = Present in text, 0 = Absent in text)|
|<b>Running Gag</b>| *int32*| Binary column (1 = Present in text, 0 = Absent in text)|
|<b>Special Guest</b>| *int32*| Binary column (1 = Present in text, 0 = Absent in text)|
<br>

In [10]:
test_df = pd.read_csv('simpsons_10_tropes.csv')
train_df = pd.read_csv('10_tropes.csv')

In [11]:
test_df

Unnamed: 0,Episode Title,Full Story,text_lemma,10 Tropes
0,Simpsons Roasting on an Open Fire,Homer hastily drives the Family Sedan with Marge and Maggie through a snowcovered street They ar...,homer hastily drive the family sedan with marge and maggie through a snowcovered street they be ...,"['Running Gag', 'Establishing Character Moment']"
1,Bart the Genius,The Simpson family is playing Scrabble in the living room in an effort to build Bart’s vocabular...,the simpson family be play scrabble in the living room in an effort to build bart ’ s vocabulary...,"['Imagine Spot', 'Comically Missing the Point', 'Couch Gag', 'Establishing Character Moment', 'C..."
2,Homers Odyssey,The episode begins in front of Springfield Elementary as Mrs Krabappel rounds up her class inclu...,the episode begin in front of springfield elementary a mr krabappel round up her class include b...,"['Karma Houdini', 'Characterization Marches On', 'Comically Missing the Point', 'Couch Gag', 'Sp..."
3,Theres No Disgrace Like Home,Bart and Lisa are fighting but it is not long until Homer quickly rushes in to break the melee u...,bart and lisa be fight but it be not long until homer quickly rush in to break the melee up he t...,"['Imagine Spot', 'Characterization Marches On', 'Disproportionate Retribution', 'Comically Missi..."
4,Bart the General,Bart and Lisa fight over Lisas cupcakesThe episode begins inside the Simpsons kitchen where Lisa...,bart and lisa fight over lisas cupcakesthe episode begin inside the simpson kitchen where lisa b...,"['Imagine Spot', 'Characterization Marches On', 'Disproportionate Retribution', 'Couch Gag', 'Ru..."
5,Moaning Lisa,Lisa feels depressedThe episode starts with a melancholic Lisa staring at herself in the bathroo...,lisa feel depressedthe episode start with a melancholic lisa star at herself in the bathroom mir...,"['Characterization Marches On', 'Comically Missing the Point', 'Couch Gag', 'Special Guest']"
6,The Call of the Simpsons,The episode begins with Homer and Bart outside doing yard work when Ned Flanders pulls up in his...,the episode begin with homer and bart outside do yard work when ned flanders pull up in his bran...,"['Comically Missing the Point', 'Couch Gag', 'Special Guest']"
7,The Telltale Head,Homer and Bart being chased by the mobThe episode begins with Homer and Bart walking on a sidewa...,homer and bart be chase by the mobthe episode begin with homer and bart walk on a sidewalk in do...,['Couch Gag']
8,Life on the Fast Lane,It is Marges birthday so Bart and Lisa prepare breakfast in bed for her During the surprise Home...,it be marge birthday so bart and lisa prepare breakfast in bed for her during the surprise homer...,"['Couch Gag', 'Special Guest', 'Establishing Character Moment']"
9,Homers Night Out,Bart purchases a miniature spy camera from a mailorder catalog which arrives six months later Ba...,bart purchase a miniature spy camera from a mailorder catalog which arrives six month later bart...,"['Karma Houdini', 'Comically Missing the Point', 'Couch Gag', 'Special Guest']"


In [12]:
test_df.dtypes

Episode Title    object
Full Story       object
text_lemma       object
10 Tropes        object
dtype: object

In [13]:
train_df

Unnamed: 0,Trope Name,Trope Description,Text Length,text_lemma
0,Imagine Spot,Okay Ralphie You win this time But well be backElliot JD be sensitive Dont act like youre at a ...,917,okay ralphie you win this time but well be backelliot jd be sensitive dont act like youre at a p...
1,Couch Gag,After seven seasons weve pretty much said everything you can say in this spot— Garfield Garfield...,864,after seven season weve pretty much say everything you can say in this spot— garfield garfield a...
2,Catchphrase,Catchphrase may refer to one of the following Character Catchphrase A phrase a character repeats...,642,catchphrase may refer to one of the follow character catchphrase a phrase a character repeat mul...
3,Comically Missing the Point,I started to walk down the street when I heard a voice saying Good evening Mr Dowd I turned and ...,1507,i start to walk down the street when i heard a voice say good even mr dowd i turn and there be t...
4,Running Gag,Thats not a running gag Thats a pun Its a running gag nowKermit the Frog No Fozzie Do not answer...,1633,thats not a run gag thats a pun it a run gag nowkermit the frog no fozzie do not answer that tel...
5,Disproportionate Retribution,But that was two years agoRevenge is a dish best served with an extra helping— Captain Young Tro...,2243,but that be two year agorevenge be a dish best serve with an extra helping— captain young troop ...
6,Establishing Character Moment,In just a handful of scenes weve established the full set of character archetypes to see us thro...,2597,in just a handful of scene weve establish the full set of character archetype to see u through u...
7,Karma Houdini,So is Satan just generous when people dont have a soul to sell Maybe if you do enough deeds in h...,5204,so be satan just generous when people dont have a soul to sell maybe if you do enough deed in hi...
8,Characterization Marches On,Nothing reconciles a past of animal abuse better than donut partiesAoi Asahina What the hecks ha...,1872,nothing reconciles a past of animal abuse well than donut partiesaoi asahina what the hecks happ...
9,Special Guest,Special but unknown guest starnote How can you have a guest star in a movieLadies and Gentlemen ...,728,special but unknown guest starnote how can you have a guest star in a movieladies and gentleman ...


In [14]:
train_df.dtypes

Trope Name           object
Trope Description    object
Text Length           int64
text_lemma           object
dtype: object

In [15]:
# Split each trope's lemmatised text into consecutive chunks of len(words)//30 words (one row per chunk)
train_df = (train_df.set_index(['Trope Name'])['text_lemma'].str.split()
            .apply(lambda x: pd.Series([' '.join(x[i:i + (len(x)//30)])
                                        for i in range(0, len(x), len(x)//30)]))
            .stack().reset_index().drop('level_1', axis=1))
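
The chunking step is easier to follow on a toy input. The sketch below is only an illustration under stated assumptions: `chunk_words` and `toy_text` are hypothetical names that do not appear in the notebook, and the function reproduces the slicing logic in plain Python rather than with pandas. It also shows that the chunk size is fixed at `len(words) // 30`, so a text whose length is not a multiple of 30 can yield more than 30 chunks, with a shorter final chunk.

```python
def chunk_words(text, n_parts=30):
    """Split text into consecutive chunks of len(words) // n_parts words each."""
    words = text.split()
    size = len(words) // n_parts  # would be 0 for texts shorter than n_parts words
    if size == 0:
        raise ValueError("text has fewer words than n_parts")
    # Step through the word list in strides of `size`; the last chunk may be shorter
    return [' '.join(words[i:i + size]) for i in range(0, len(words), size)]


# Hypothetical 65-word text: chunk size is 65 // 30 = 2
toy_text = ' '.join(f'w{i}' for i in range(65))
chunks = chunk_words(toy_text)

print(len(chunks))   # 33
print(chunks[0])     # w0 w1
print(chunks[-1])    # w64
```

The guard for `size == 0` is an addition for the sketch: with the slicing used in the cell above, a text shorter than 30 words would make `range` fail with a zero step.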

In [16]:
train_df.shape

(320, 2)

In [17]:
train_df.columns = ['Trope Name', 'text_lemma']

In [18]:
# Defining X and y
X = train_df['text_lemma']
y = train_df['Trope Name']


In [19]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    stratify = y,
                                                    train_size=0.8,
                                                    random_state=42)

print(f'The X train set is {X_train.shape[0]} rows long.')
print(f'The y train set is {y_train.shape[0]} rows long.')
print(f'The X test set is {X_test.shape[0]} rows long.')
print(f'The y test set is {y_test.shape[0]} rows long.')


The X train set is 256 rows long.
The y train set is 256 rows long.
The X test set is 64 rows long.
The y test set is 64 rows long.


In [20]:
# Fitting the model
pipe_cv_lr = Pipeline(steps=[('cvec', CountVectorizer()),
                               ('logreg', LogisticRegression(solver='liblinear'))])

pipe_cv_lr_params = {'cvec__max_features':[5000], #2000, 3000, 4000, 5000
                       'cvec__min_df':[3], #2, 3
                       'cvec__max_df':[.85], #.85, .90, .95
                       'cvec__ngram_range':[(1,2)], #(1,1), (1,2), (1,3), (2,2)
                       'logreg__C': [0.1], #0.05, 0.1, 1
                       'logreg__penalty': ['l2']} #'l1', 'l2'

gs_cv_lr = GridSearchCV(pipe_cv_lr, param_grid=pipe_cv_lr_params, cv=3)

gs_cv_lr.fit(X_train, y_train)


In [21]:
# Making predictions
y_pred_cv_lr_train = gs_cv_lr.predict(X_train)
y_pred_cv_lr = gs_cv_lr.predict(X_test)
y_pred_proba_cv_lr = gs_cv_lr.predict_proba(X_test)


In [22]:
# Metrics
pred_prob_train = gs_cv_lr.predict_proba(X_train)
auc_score_train = roc_auc_score(y_train, pred_prob_train, 
                                multi_class="ovr", average="micro")
pred_prob_test = gs_cv_lr.predict_proba(X_test)
auc_score_test = roc_auc_score(y_test, pred_prob_test, 
                                multi_class="ovr", average="micro")

print(f'ROC-AUC on training set: {auc_score_train}')
print(f'ROC-AUC on testing set: {auc_score_test}')


print(classification_report(y_test, y_pred_cv_lr))

ROC-AUC on training set: 0.9496053059895834
ROC-AUC on testing set: 0.7713487413194444
                               precision    recall  f1-score   support

                  Catchphrase       1.00      0.17      0.29         6
  Characterization Marches On       0.00      0.00      0.00         7
  Comically Missing the Point       0.00      0.00      0.00         7
                    Couch Gag       0.25      0.50      0.33         8
 Disproportionate Retribution       0.33      0.17      0.22         6
Establishing Character Moment       0.25      0.17      0.20         6
                 Imagine Spot       0.00      0.00      0.00         6
                Karma Houdini       0.35      1.00      0.52         6
                  Running Gag       0.50      0.67      0.57         6
                Special Guest       0.60      0.50      0.55         6

                     accuracy                           0.31        64
                    macro avg       0.33      0.32      0.2
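
As a rough illustration of what the micro-averaged one-vs-rest ROC-AUC measures, the sketch below one-hot encodes the true labels, flattens both the labels and the predicted probabilities, and scores them as a single binary problem. The class names and probabilities are made-up toy values, not results from this notebook, and the equivalence with `roc_auc_score(..., multi_class="ovr", average="micro")` is stated here as an assumption about how scikit-learn defines the micro average.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

# Hypothetical three-class problem with four samples
classes = ['a', 'b', 'c']
toy_true = ['a', 'b', 'c', 'a']
toy_proba = np.array([[0.6, 0.3, 0.1],
                      [0.2, 0.5, 0.3],
                      [0.3, 0.3, 0.4],
                      [0.4, 0.5, 0.1]])

# One column per class: 1 where the sample belongs to that class, else 0
onehot = label_binarize(toy_true, classes=classes)

# Micro average: pool every (sample, class) pair into one binary ranking problem
micro_auc = roc_auc_score(onehot.ravel(), toy_proba.ravel())
print(f'Micro-averaged OvR ROC-AUC: {micro_auc:.3f}')
```

Because every (sample, class) pair is pooled, this score can stay fairly high even when per-class recall is poor, which is worth keeping in mind when comparing it with the classification report.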

In [23]:
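# Split each episode's lemmatised text into consecutive chunks of len(words)//30 words;
# every chunk keeps the trope list of the episode it came from
test_df = (test_df.set_index(['10 Tropes'])['text_lemma'].str.split()
           .apply(lambda x: pd.Series([' '.join(x[i:i + (len(x)//30)])
                                       for i in range(0, len(x), len(x)//30)]))
           .stack().reset_index().drop('level_1', axis=1))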

In [24]:
test_df.columns = ['Trope Name', 'text_lemma']

In [25]:
# Defining X and y
X = test_df['text_lemma']
y = test_df['Trope Name']


In [26]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    stratify = y,
                                                    train_size=0.8,
                                                    random_state=42)

print(f'The X train set is {X_train.shape[0]} rows long.')
print(f'The y train set is {y_train.shape[0]} rows long.')
print(f'The X test set is {X_test.shape[0]} rows long.')
print(f'The y test set is {y_test.shape[0]} rows long.')


The X train set is 324 rows long.
The y train set is 324 rows long.
The X test set is 81 rows long.
The y test set is 81 rows long.


In [27]:
test_df['Trope Name'] = test_df['Trope Name'].apply(literal_eval)
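
The `literal_eval` call is needed because a list column written to csv comes back from `pd.read_csv` as a plain string. A minimal sketch with a hypothetical value (`raw` and `parsed` are illustrative names, not notebook variables):

```python
from ast import literal_eval

# How a list-valued cell looks after a csv round trip: one string, not a list
raw = "['Couch Gag', 'Special Guest']"
parsed = literal_eval(raw)  # safely evaluates the Python literal back into a list

print(type(raw).__name__, type(parsed).__name__)   # str list
print(parsed[0])                                   # Couch Gag
```

Without this conversion, a binariser would iterate over the characters of the string instead of over the trope names.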

In [28]:
mlb = MultiLabelBinarizer()
binarised_df = pd.DataFrame(mlb.fit_transform(test_df['Trope Name']),columns=mlb.classes_, index=test_df.index)
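
To make the binarisation step concrete, here is a small sketch on hypothetical labels (`toy_labels`, `toy_mlb` and `toy_matrix` are illustrative names). It shows that `MultiLabelBinarizer` creates one 0/1 column per distinct label, with the columns in sorted label order, and that a row with no labels becomes all zeros.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical label lists for three text chunks; the last chunk has no tropes
toy_labels = pd.Series([['Running Gag', 'Couch Gag'], ['Couch Gag'], []])

toy_mlb = MultiLabelBinarizer()
toy_matrix = pd.DataFrame(toy_mlb.fit_transform(toy_labels),
                          columns=toy_mlb.classes_,
                          index=toy_labels.index)

print(list(toy_mlb.classes_))   # ['Couch Gag', 'Running Gag']
print(toy_matrix)
```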


In [29]:
binarised_df

Unnamed: 0,Catchphrase,Characterization Marches On,Comically Missing the Point,Couch Gag,Disproportionate Retribution,Establishing Character Moment,Imagine Spot,Karma Houdini,Running Gag,Special Guest
0,0,0,0,0,0,1,0,0,1,0
1,0,0,0,0,0,1,0,0,1,0
2,0,0,0,0,0,1,0,0,1,0
3,0,0,0,0,0,1,0,0,1,0
4,0,0,0,0,0,1,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...
400,0,1,0,1,1,1,0,1,0,1
401,0,1,0,1,1,1,0,1,0,1
402,0,1,0,1,1,1,0,1,0,1
403,0,1,0,1,1,1,0,1,0,1


In [38]:
binarised_df.dtypes

Catchphrase                      int32
Characterization Marches On      int32
Comically Missing the Point      int32
Couch Gag                        int32
Disproportionate Retribution     int32
Establishing Character Moment    int32
Imagine Spot                     int32
Karma Houdini                    int32
Running Gag                      int32
Special Guest                    int32
dtype: object

In [30]:
test_df = test_df.join(binarised_df).drop('Trope Name', axis = 1)

In [31]:
test_df.head()

Unnamed: 0,text_lemma,Catchphrase,Characterization Marches On,Comically Missing the Point,Couch Gag,Disproportionate Retribution,Establishing Character Moment,Imagine Spot,Karma Houdini,Running Gag,Special Guest
0,homer hastily drive the family sedan with marge and maggie through a snowcovered street they be ...,0,0,0,0,0,1,0,0,1,0
1,tawanga the santa claus of the south sea lisa ’ s dance cause awe throughout the crowd lisa perf...,0,0,0,0,0,1,0,0,1,0
2,a disappointed look for bart action but a the pageant continue he grows bore and wonder aloud ho...,0,0,0,0,0,1,0,0,1,0
3,and lisa show marge their wish list marge be uncomfortable when lisa once again asks for a pony ...,0,0,0,0,0,1,0,0,1,0
4,with him request marge a grumble homer hand the phone over to marge and the two sister discus th...,0,0,0,0,0,1,0,0,1,0


In [32]:
X = test_df['text_lemma']
y = test_df.drop('text_lemma', axis = 1)

In [33]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    stratify = y,
                                                    train_size=0.8,
                                                    random_state=42)

print(f'The X train set is {X_train.shape[0]} rows long.')
print(f'The y train set is {y_train.shape[0]} rows long.')
print(f'The X test set is {X_test.shape[0]} rows long.')
print(f'The y test set is {y_test.shape[0]} rows long.')


The X train set is 324 rows long.
The y train set is 324 rows long.
The X test set is 81 rows long.
The y test set is 81 rows long.


In [34]:
pipe_cv_xgb = Pipeline(steps=[('cvec', CountVectorizer()),
                              ('xgb', MultiOutputClassifier(estimator=XGBClassifier(booster = 'gblinear',
                                                                                   eta = 0.05)))])

pipe_cv_xgb_params = {'cvec__max_features':[5000], #2000, 3000, 4000, 5000
                      'cvec__max_df':[.85], #.85, .90, .95
                      'cvec__ngram_range':[(1,2)], #(1,1), (1,2), (1,3), (2,2)
                     } 

gs_cv_xgb = GridSearchCV(pipe_cv_xgb, param_grid=pipe_cv_xgb_params, cv=3)

gs_cv_xgb.fit(X_train, y_train)


In [35]:
# Making predictions
y_pred_cv_xgb_train = gs_cv_xgb.predict(X_train)
y_pred_cv_xgb = gs_cv_xgb.predict(X_test)
y_pred_proba_cv_xgb = gs_cv_xgb.predict_proba(X_test)


In [36]:
gs_cv_xgb.score(X_train, y_train)

1.0

In [37]:
print(classification_report(y_test, y_pred_cv_xgb))

              precision    recall  f1-score   support

           0       0.83      0.28      0.42        18
           1       0.81      0.59      0.69        37
           2       0.75      0.80      0.77        45
           3       0.93      1.00      0.96        75
           4       0.80      0.77      0.79        31
           5       0.86      0.25      0.39        24
           6       1.00      0.33      0.50        18
           7       0.76      0.69      0.72        32
           8       1.00      0.33      0.50        18
           9       0.90      0.80      0.84        44

   micro avg       0.85      0.69      0.76       342
   macro avg       0.86      0.58      0.66       342
weighted avg       0.86      0.69      0.74       342
 samples avg       0.83      0.70      0.74       342
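
The rows labelled 0 to 9 in the report correspond to the label columns in order. One way to make such a report easier to read is the `target_names` argument of `classification_report`. The sketch below uses made-up toy arrays and names (`toy_names`, `toy_true`, `toy_pred`), not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import classification_report

# Hypothetical two-label problem; toy_names stands in for the trope column names
toy_names = ['Couch Gag', 'Running Gag']
toy_true = np.array([[1, 0], [1, 1], [0, 1], [1, 0]])
toy_pred = np.array([[1, 0], [1, 0], [0, 1], [1, 1]])

# target_names replaces the default integer row labels with readable ones
report = classification_report(toy_true, toy_pred,
                               target_names=toy_names, zero_division=0)
print(report)
```

In this notebook, passing `target_names=list(y_test.columns)` should label each row with its trope, assuming `y_test` still carries the binarised column names.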

