# Marvel Dialogue Classification
#### CS 345 Final Project
By: Preston Dunton  
December 8, 2020

<img src="https://blog.umhb.edu/wp-content/uploads/2019/06/mcu-1920x1080.jpg" alt="MCU Banner" width="60%" height="60%">

# Introduction
This notebook summarizes the project that can be found in [this repository](https://github.com/prestondunton/marvel-dialogue-nlp).




# Problem / Dataset

## The Problem



## About the Dataset
This repository contains a newly created dataset for training and testing models, along with several Jupyter notebooks that describe how each `.csv` was created.  These notebooks explain the process of parsing the `.pdf`s with the `pandas` library.  The final file, [mcu.csv](https://github.com/prestondunton/marvel-dialogue-nlp/blob/master/data/mcu.csv), contains the columns `character` and `line`, which hold the dialogue for several MCU movies, plus additional columns that provide extra features.  See [/data/MCU.ipynb](https://github.com/prestondunton/marvel-dialogue-nlp/blob/master/data/MCU.ipynb) for more details on those features.  For individual movies, the corresponding `.csv` files can be found in [/data/cleaned/](https://github.com/prestondunton/marvel-dialogue-nlp/blob/master/data/cleaned) and contain the columns `character` and `line`.  Each movie file was created using the same process, though the process was refined as more movies were processed.

The movie script `.pdf`s were obtained from [Script Slug](https://www.scriptslug.com/scripts/category/marvel), though other copies of the Marvel-released scripts can be found elsewhere online.  Not all of the MCU movie scripts were released, so this dataset contains only a subset of the movies in the MCU.  Transcripts exist for all 21 movies, but they can contain many errors, so they were not used.  Additionally, creating each `.csv` took considerable time, so the dataset currently contains only 5 movies (listed below).


| Movies Included                       |
| ------------------------------------- |
| Iron Man (2008)                       |
| The Avengers (2012)                   |
| Thor: Ragnarok (2017)                 |
| Guardians of the Galaxy Vol. 2 (2017) |
| Avengers: Endgame (2019)              |


# Methods

## Imports and Classes

In [1]:
import pandas as pd
import numpy as np

RANDOM_SEED = 42

In [2]:
from sklearn.pipeline import Pipeline

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

from sklearn import metrics

import matplotlib.pyplot as plt

In [3]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

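class StemCountVectorizer(CountVectorizer):
    """CountVectorizer whose analyzer stems every token with the Snowball stemmer."""

    def build_analyzer(self):
        analyzer = super().build_analyzer()
        # Build the stemmer once; constructing a new one for every token is needlessly slow.
        stemmer = SnowballStemmer('english', ignore_stopwords=True)

        return lambda document: [stemmer.stem(word) for word in analyzer(document)]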
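
To illustrate how overriding `build_analyzer` changes the vocabulary, here is a minimal sketch that avoids the NLTK dependency. `toy_stem` and `ToyStemCountVectorizer` are hypothetical names invented for this example, and the suffix stripping is a crude stand-in for the Snowball stemmer, not what the notebook uses.

```python
from sklearn.feature_extraction.text import CountVectorizer


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


class ToyStemCountVectorizer(CountVectorizer):
    """Same override pattern as StemCountVectorizer, with the toy stemmer."""

    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda document: [toy_stem(token) for token in analyzer(document)]


# Made-up sentences, for illustration only.
docs = ["Thor smashed the robots", "Thor smashes a robot"]

plain_vocab = sorted(CountVectorizer().fit(docs).vocabulary_)
stemmed_vocab = sorted(ToyStemCountVectorizer().fit(docs).vocabulary_)

print(plain_vocab)    # inflected forms stay separate features
print(stemmed_vocab)  # inflected forms collapse into shared stems
```

Collapsing inflected forms shrinks the feature space, which may help when each character has only a few hundred lines.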

## Dataset Import and Preprocessing

In [4]:
from sklearn.utils import shuffle

mcu = pd.read_csv("./data/mcu.csv")

min_line_count = 150

# Keep only characters with more than `min_line_count` lines of dialogue.
line_counts = mcu["character"].value_counts()
main_characters = line_counts[line_counts > min_line_count].index

mcu = mcu[mcu["character"].isin(main_characters)]

y = mcu["character"].to_numpy().astype(str)
X = mcu["line"].to_numpy().astype(str)

X, y = shuffle(X, y, random_state=RANDOM_SEED)

X, y

(array(['Got it.', 'And terrifying.', 'What’s the delta rate?', ...,
        'I’m sorry. He seemed like a good man.', 'Heimdall, come on.',
        'I am a king!'], dtype='<U606'),
 array(['BRUCE BANNER', 'PEPPER POTTS', 'TONY STARK', ..., 'STEVE ROGERS',
        'THOR', 'LOKI'], dtype='<U12'))
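
The preprocessing cell keeps only characters with more than `min_line_count` lines. A sketch of the same thresholding on a made-up frame follows; the rows, the `GUARD` character, and the threshold of 1 are invented for illustration, while the notebook itself uses 150.

```python
import pandas as pd

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "character": ["TONY STARK"] * 3 + ["THOR"] * 2 + ["GUARD"],
    "line": ["a", "b", "c", "d", "e", "f"],
})
toy_min_line_count = 1  # hypothetical threshold for the toy frame

# Count lines per character, then keep rows whose character exceeds the threshold.
line_counts = toy["character"].value_counts()
main_characters = line_counts[line_counts > toy_min_line_count].index
filtered = toy[toy["character"].isin(main_characters)]

kept = sorted(filtered["character"].unique())
print(kept, len(filtered))
```

The single-line `GUARD` character is dropped, leaving only classes with enough examples to train and evaluate on.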

## Models

In [5]:
cross_validator = StratifiedKFold(n_splits=5, random_state=RANDOM_SEED, shuffle=True)
score_method = "balanced_accuracy"

In [6]:
count_vectorizer = CountVectorizer()
stem_count_vectorizer = StemCountVectorizer()

tfidf_transformer = TfidfTransformer()

nb_classifier = MultinomialNB()
rf_classifier = RandomForestClassifier(n_jobs=-1, random_state=RANDOM_SEED)
svm_classifier = SVC()

In [16]:
count_params = {'vect__binary': [True, False],
                'vect__stop_words': [None, 'english', stopwords.words('english')],
                'vect__ngram_range': [(1, 1), (1, 2), (1, 3)]}

tfidf_params = {'tfidf__norm': ['l1', 'l2'],
                'tfidf__use_idf': [True, False]}

nb_params = {'clf__alpha': [1, 1e-1, 1e-2, 1e-3],
             'clf__fit_prior': [True, False]}

rf_params = {'clf__criterion': ["gini", "entropy"],
             'clf__max_depth': [None, 7, 8, 9, 10, 11, 12],
             'clf__max_features': [None, "sqrt", "log2"],
             'clf__class_weight': [None, 'balanced']}

# SVC requires C to be strictly positive, so the grid must not include 0.
svm_params = {'clf__C': [1e-2, 1e-1, 1, 10, 100],
              'clf__kernel': ['linear', 'poly', 'rbf'],
              'clf__degree': [2, 3, 4, 5, 6],
              'clf__gamma': ['scale', 'auto'],
              'clf__class_weight': [None, 'balanced']}
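
The number of candidates in a grid search is the product of the number of values per parameter, and the number of fits is that product times the number of folds. The sketch below checks this arithmetic for the Naive Bayes and random forest grids; the `grid_sizes` dictionary and `n_candidates` helper are written for this illustration and simply restate the list lengths from the cell above.

```python
from math import prod

# Number of values per parameter in each grid, read off the dictionaries above.
grid_sizes = {
    "count": [2, 3, 3],    # binary, stop_words, ngram_range
    "tfidf": [2, 2],       # norm, use_idf
    "nb": [4, 2],          # alpha, fit_prior
    "rf": [2, 7, 3, 2],    # criterion, max_depth, max_features, class_weight
}


def n_candidates(*groups):
    """Multiply the sizes of every parameter list in the named groups."""
    return prod(prod(grid_sizes[group]) for group in groups)


n_folds = 5  # matches StratifiedKFold(n_splits=5)

nb_candidates = n_candidates("count", "nb")
nb_tfidf_candidates = n_candidates("count", "tfidf", "nb")
rf_candidates = n_candidates("count", "rf")

for name, count in [("NB", nb_candidates),
                    ("NB + TFIDF", nb_tfidf_candidates),
                    ("RF", rf_candidates)]:
    print(f"{name}: {count} candidates, {count * n_folds} fits")
```

These counts agree with the "Fitting 5 folds for each of ... candidates" lines printed by the grid searches in the Models section.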

### Model 1 (Naive Bayes, no TFIDF, no stemming)

In [8]:
pipe1 = Pipeline([('vect', count_vectorizer), 
                  ('clf', nb_classifier)])

parameters1 = {**count_params, **nb_params}

grid1 = GridSearchCV(pipe1, parameters1, cv=cross_validator, scoring=score_method, n_jobs=-1, verbose=3)

grid1.fit(X,y)

grid1.best_params_

Fitting 5 folds for each of 144 candidates, totalling 720 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:    1.5s
[Parallel(n_jobs=-1)]: Done 104 tasks      | elapsed:    2.8s
[Parallel(n_jobs=-1)]: Done 264 tasks      | elapsed:    5.1s
[Parallel(n_jobs=-1)]: Done 488 tasks      | elapsed:    8.3s
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed:   11.7s finished


{'clf__alpha': 0.1,
 'clf__fit_prior': False,
 'vect__binary': True,
 'vect__ngram_range': (1, 3),
 'vect__stop_words': ['i', 'me', 'my', 'myself', 'we', ...]}  (NLTK English stop-word list, abbreviated)

### Model 2 (Naive Bayes, TFIDF, no stemming)

In [10]:
pipe2 = Pipeline([('vect', count_vectorizer),
                  ('tfidf', tfidf_transformer),
                  ('clf', nb_classifier)])

parameters2 = {**count_params, **tfidf_params, **nb_params}

grid2 = GridSearchCV(pipe2, parameters2, cv=cross_validator, scoring=score_method, n_jobs=-1, verbose=3)

grid2.fit(X,y)

grid2.best_params_

Fitting 5 folds for each of 576 candidates, totalling 2880 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    2.4s
[Parallel(n_jobs=-1)]: Done 504 tasks      | elapsed:    6.6s
[Parallel(n_jobs=-1)]: Done 952 tasks      | elapsed:   12.9s
[Parallel(n_jobs=-1)]: Done 1528 tasks      | elapsed:   21.3s
[Parallel(n_jobs=-1)]: Done 2232 tasks      | elapsed:   32.3s
[Parallel(n_jobs=-1)]: Done 2880 out of 2880 | elapsed:   42.1s finished


{'clf__alpha': 0.1,
 'clf__fit_prior': False,
 'tfidf__norm': 'l2',
 'tfidf__use_idf': True,
 'vect__binary': False,
 'vect__ngram_range': (1, 2),
 'vect__stop_words': ['i', 'me', 'my', 'myself', 'we', ...]}  (NLTK English stop-word list, abbreviated)

### Model 3 (Naive Bayes, no TFIDF, stemming)

In [11]:
pipe3 = Pipeline([('vect', stem_count_vectorizer),
                  ('clf', nb_classifier)])

parameters3 = {**count_params, **nb_params}

grid3 = GridSearchCV(pipe3, parameters3, cv=cross_validator, scoring=score_method, n_jobs=-1, verbose=3)

grid3.fit(X,y)

grid3.best_params_

Fitting 5 folds for each of 144 candidates, totalling 720 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:   19.2s
[Parallel(n_jobs=-1)]: Done 104 tasks      | elapsed:  3.3min
[Parallel(n_jobs=-1)]: Done 264 tasks      | elapsed:  9.2min
[Parallel(n_jobs=-1)]: Done 488 tasks      | elapsed: 16.6min
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed: 24.0min finished


{'clf__alpha': 0.1,
 'clf__fit_prior': False,
 'vect__binary': False,
 'vect__ngram_range': (1, 1),
 'vect__stop_words': None}
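
Besides `best_params_`, a fitted `GridSearchCV` exposes `best_score_` and `cv_results_`, which are needed to compare models. The sketch below shows how they might be read on a tiny made-up corpus; the sentences, labels, and reduced grid are invented for illustration and are not taken from the dataset or from the searches above.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up lines and labels, for illustration only.
X_toy = ["the suit needs more power", "run the numbers again",
         "charge the reactor now", "my suit is ready",
         "bring me my hammer", "the hammer is mine",
         "my home is far away", "lightning answers my call"]
y_toy = ["TONY STARK"] * 4 + ["THOR"] * 4

toy_pipe = Pipeline([("vect", CountVectorizer()), ("clf", MultinomialNB())])
toy_params = {"clf__alpha": [1, 1e-1], "vect__binary": [True, False]}

toy_grid = GridSearchCV(
    toy_pipe,
    toy_params,
    cv=StratifiedKFold(n_splits=2, shuffle=True, random_state=42),
    scoring="balanced_accuracy",
)
toy_grid.fit(X_toy, y_toy)

# One row per candidate, best-ranked first.
results = pd.DataFrame(toy_grid.cv_results_)
summary = results[["params", "mean_test_score", "std_test_score",
                   "rank_test_score"]].sort_values("rank_test_score")

print(toy_grid.best_params_)
print(toy_grid.best_score_)
print(summary)
```

`best_score_` is the mean cross-validated score of the best candidate, so the same attribute on `grid1` through `grid4` could be used to compare the pipelines under the shared `balanced_accuracy` metric.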

### Model 4 (Naive Bayes, TFIDF, stemming)

In [12]:
pipe4 = Pipeline([('vect', stem_count_vectorizer),
                  ('tfidf', tfidf_transformer),
                  ('clf', nb_classifier)])

parameters4 = {**count_params, **tfidf_params, **nb_params}

grid4 = GridSearchCV(pipe4, parameters4, cv=cross_validator, scoring=score_method, n_jobs=-1, verbose=3)

grid4.fit(X,y)

grid4.best_params_

Fitting 5 folds for each of 576 candidates, totalling 2880 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:   18.4s
[Parallel(n_jobs=-1)]: Done 104 tasks      | elapsed:  3.2min
[Parallel(n_jobs=-1)]: Done 264 tasks      | elapsed:  8.8min
[Parallel(n_jobs=-1)]: Done 488 tasks      | elapsed: 16.1min
[Parallel(n_jobs=-1)]: Done 776 tasks      | elapsed: 25.2min
[Parallel(n_jobs=-1)]: Done 1128 tasks      | elapsed: 36.7min
[Parallel(n_jobs=-1)]: Done 1544 tasks      | elapsed: 50.9min
[Parallel(n_jobs=-1)]: Done 2024 tasks      | elapsed: 67.0min
[Parallel(n_jobs=-1)]: Done 2568 tasks      | elapsed: 85.2min
[Parallel(n_jobs=-1)]: Done 2880 out of 2880 | elapsed: 95.7min finished


{'clf__alpha': 0.01,
 'clf__fit_prior': False,
 'tfidf__norm': 'l2',
 'tfidf__use_idf': True,
 'vect__binary': False,
 'vect__ngram_range': (1, 2),
 'vect__stop_words': None}

### Model 5 (Random Forest, no TFIDF, no stemming)

In [14]:
pipe5 = Pipeline([('vect', count_vectorizer), 
                  ('clf', rf_classifier)])

parameters5 = {**count_params, **rf_params}

grid5 = GridSearchCV(pipe5, parameters5, cv=cross_validator, scoring=score_method, n_jobs=-1, verbose=3)

grid5.fit(X,y)

grid5.best_params_

Fitting 5 folds for each of 1512 candidates, totalling 7560 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.


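ValueError: Invalid parameter clf_class_weight for estimator Pipeline(steps=[('vect', CountVectorizer()),
                ('clf', RandomForestClassifier(n_jobs=-1, random_state=42))]). Check the list of available parameters with `estimator.get_params().keys()`.

This error points to a parameter key spelled `clf_class_weight` (single underscore) at the time the cell was run. Pipeline parameters need a double underscore between the step name and the parameter name, as in the `clf__class_weight` key shown in `rf_params`, so this cell has to be re-run to produce results.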

### Model 6 (Random Forest, TFIDF, no stemming)

In [None]:
pipe6 = Pipeline([('vect', count_vectorizer),
                  ('tfidf', tfidf_transformer),
                  ('clf', rf_classifier)])

parameters6 = {**count_params, **tfidf_params, **rf_params}

grid6 = GridSearchCV(pipe6, parameters6, cv=cross_validator, scoring=score_method, n_jobs=-1, verbose=3)

grid6.fit(X,y)

grid6.best_params_

### Model 7 (Random Forest, no TFIDF, stemming)

In [15]:
pipe7 = Pipeline([('vect', stem_count_vectorizer),
                  ('clf', rf_classifier)])

parameters7 = {**count_params, **rf_params}

grid7 = GridSearchCV(pipe7, parameters7, cv=cross_validator, scoring=score_method, n_jobs=-1, verbose=3)

grid7.fit(X,y)

grid7.best_params_

Fitting 5 folds for each of 1512 candidates, totalling 7560 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.


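ValueError: Invalid parameter clf_class_weight for estimator Pipeline(steps=[('vect', StemCountVectorizer()),
                ('clf', RandomForestClassifier(n_jobs=-1, random_state=42))]). Check the list of available parameters with `estimator.get_params().keys()`.

This is the same `clf_class_weight` error reported for Model 5: the key was missing one underscore when the cell ran, so the cell has to be re-run with the `clf__class_weight` key shown in `rf_params`.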

### Model 8 (Random Forest, TFIDF, stemming)

In [None]:
pipe8 = Pipeline([('vect', stem_count_vectorizer),
                  ('tfidf', tfidf_transformer),
                  ('clf', rf_classifier)])

parameters8 = {**count_params, **tfidf_params, **rf_params}

grid8 = GridSearchCV(pipe8, parameters8, cv=cross_validator, scoring=score_method, n_jobs=-1, verbose=3)

grid8.fit(X,y)

grid8.best_params_

### Model 9 (SVM, no TFIDF, no stemming)

In [None]:
pipe9 = Pipeline([('vect', count_vectorizer), 
                  ('clf', svm_classifier)])

parameters9 = {**count_params, **svm_params}

grid9 = GridSearchCV(pipe9, parameters9, cv=cross_validator, scoring=score_method, n_jobs=-1, verbose=3)

grid9.fit(X,y)

grid9.best_params_

### Model 10 (SVM, TFIDF, no stemming)

In [None]:
pipe10 = Pipeline([('vect', count_vectorizer),
                  ('tfidf', tfidf_transformer),
                  ('clf', svm_classifier)])

parameters10 = {**count_params, **tfidf_params, **svm_params}

grid10 = GridSearchCV(pipe10, parameters10, cv=cross_validator, scoring=score_method, n_jobs=-1, verbose=3)

grid10.fit(X,y)

grid10.best_params_

### Model 11 (SVM, no TFIDF, stemming)

In [None]:
pipe11 = Pipeline([('vect', stem_count_vectorizer),
                  ('clf', svm_classifier)])

parameters11 = {**count_params, **svm_params}

grid11 = GridSearchCV(pipe11, parameters11, cv=cross_validator, scoring=score_method, n_jobs=-1, verbose=3)

grid11.fit(X,y)

grid11.best_params_

### Model 12 (SVM, TFIDF, stemming)

In [None]:
pipe12 = Pipeline([('vect', stem_count_vectorizer),
                  ('tfidf', tfidf_transformer),
                  ('clf', svm_classifier)])

parameters12 = {**count_params, **tfidf_params, **svm_params}

grid12 = GridSearchCV(pipe12, parameters12, cv=cross_validator, scoring=score_method, n_jobs=-1, verbose=3)

grid12.fit(X,y)

grid12.best_params_

# Results

# Conclusions

## Project Improvements

## What I Learned

## What I Would Still Like to Learn

# Sources

Ben-Hur, Asa. “CS345: Machine Learning Foundations and Practice.” GitHub, Colorado State University, 7 Dec. 2020, www.github.com/asabenhur/CS345. 

Shaikh, Javed. “Machine Learning, NLP: Text Classification Using Scikit-Learn, Python and NLTK.” Medium, Towards Data Science, 30 Oct. 2017, www.towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a. 

Starmer, Josh. “Naive Bayes, Clearly Explained!!!” YouTube, StatQuest with Josh Starmer, 3 June 2020, www.youtube.com/watch?v=O2L2Uv9pdDA. 

Starmer, Josh. “Support Vector Machines, Clearly Explained!!!” YouTube, StatQuest with Josh Starmer, 30 September 2019, www.youtube.com/watch?v=O2L2Uv9pdDA. 

