# Marvel Dialogue Classification
#### CS 345 Final Project
By: Preston Dunton  
December 8, 2020

<img src="https://blog.umhb.edu/wp-content/uploads/2019/06/mcu-1920x1080.jpg" alt="MCU Banner" width="60%" height="60%" align="center">

# Introduction
This notebook summarizes the project that can be found in [this repository](https://github.com/prestondunton/marvel-dialogue-nlp). It is my final project for CS 345, Machine Learning Foundations and Practice, at Colorado State University (Fall 2020).  The goals of this project were to apply what I learned this semester and to introduce myself to NLP problems and methods.


## The Problem
This project addresses an NLP classification problem.  The goal was to create a model that predicts a character's name given a line of their dialogue from a Marvel Cinematic Universe (MCU) movie.  Data was taken from scripts released by Marvel and transformed into labels (character names) and feature documents (their lines of dialogue).


## About the Dataset
This repository contains a newly created dataset to train and test models on, as well as several Jupyter notebooks that describe the process used to create each `.csv`.  These notebooks explain the process of parsing the `.pdf`s with the `pandas` library.  The final file, [mcu.csv](https://github.com/prestondunton/marvel-dialogue-nlp/blob/master/data/mcu.csv), contains columns `character` and `line` that hold the dialogue for several movies from the MCU.  Additional columns provide extra features for context, but they were not used in this project.  See [/data/MCU.ipynb](https://github.com/prestondunton/marvel-dialogue-nlp/blob/master/data/MCU.ipynb) for more details on those features.  For individual movies, the corresponding `.csv` files can be found in [/data/cleaned/](https://github.com/prestondunton/marvel-dialogue-nlp/blob/master/data/cleaned) and contain columns `character` and `line`.  Each movie file was created using the same partially automated process, though improvements were found as more movies were processed.

The movie script `.pdf`s were obtained from [Script Slug](https://www.scriptslug.com/scripts/category/marvel), though other copies of the Marvel released scripts can be found elsewhere online.  Only a few of the MCU movie scripts were released, so this dataset can only cover a subset of the movies in the MCU.  Transcripts exist for all 21 movies, though these transcripts can contain many errors, so they were not used.  Additionally, creating each `.csv` took quite a bit of time (approximately 12 hours per movie), so this dataset currently contains only 5 movies (listed below).


| MCU Movies on Script Slug             | Included in Project |
| ------------------------------------- | ------------------- |
| Iron Man (2008)                       | ✔️                 | 
| The Avengers (2012)                   | ✔️                 |
| Thor: Ragnarok (2017)                 | ✔️                 |
| Guardians of the Galaxy Vol. 2 (2017) | ✔️                 |
| Avengers: Endgame (2019)              | ✔️                 |
| Thor (2011)                           | ❌                  |
| Captain America (2011)                | ❌                  |
| Black Panther (2018)                  | ❌                  |


# Methods

To accomplish the task described above, 12 models were created.  These models employ different combinations of NLP techniques and different ML classifiers.  A summary of the architecture of the 12 models, as well as an explanation of this project's use of each technique and classifier, is below.

| Model # | Classifier    | Uses Stemming | Uses TF / IDF Transformation |
| ------- | ------------- | -------- | ---------------------------- |
| 1       | Naive Bayes   | ❌      | ❌                           |
| 2       | Naive Bayes   | ❌      | ✔️                           |
| 3       | Naive Bayes   | ✔️      | ❌                           |
| 4       | Naive Bayes   | ✔️      | ✔️                           |
| 5       | Random Forest | ❌      | ❌                           |
| 6       | Random Forest | ❌      | ✔️                           |
| 7       | Random Forest | ✔️      | ❌                           |
| 8       | Random Forest | ✔️      | ✔️                           |
| 9       | SVM           | ❌      | ❌                           |
| 10      | SVM           | ❌      | ✔️                           |
| 11      | SVM           | ✔️      | ❌                           |
| 12      | SVM           | ✔️      | ✔️                           |
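
The table above is the full combination of three classifiers with two on/off options. As a minimal sketch (the names `classifiers` and `model_table` are illustrative and do not appear in the project), the same numbering can be generated with `itertools.product`:

```python
from itertools import product

# Classifier names in the order used by the table above
classifiers = ["Naive Bayes", "Random Forest", "SVM"]

# product() varies the last option fastest, which reproduces the table's ordering:
# for each classifier, stemming off/on, and within that, TF / IDF off/on
model_table = [
    {"model": number, "classifier": clf, "stemming": stem, "tfidf": tfidf}
    for number, (clf, stem, tfidf) in enumerate(
        product(classifiers, [False, True], [False, True]), start=1
    )
]

for row in model_table:
    print(row)
```

Enumerating the combinations this way makes it easy to check that no architecture is skipped or duplicated.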


## Grid Searching

Scikit-learn provides an interface for model selection called `GridSearchCV`, which uses cross validation to select the best combination of hyperparameters for a given model.  The first 11 models all use `GridSearchCV` to select the best-performing parameters for the `CountVectorizer` (or the custom `StemCountVectorizer` if the model uses stemming), along with the parameters of the other pipeline steps.  Model 12 does not use `GridSearchCV`, because the estimated compute time for searching every combination of defined parameters was 90 hours.  Instead, model 12 reuses the parameters from model 10, which has the same architecture except for stemming.  Note also that the first 11 models were searched over the same parameter options for each shared component, so that they could be compared.
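
To make the mechanics concrete, here is a minimal sketch of a grid search over a text pipeline. The toy lines, labels, and the tiny two-by-two grid are invented for illustration and are not taken from `mcu.csv` or from the grids used in this project.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: four lines per character
toy_lines = [
    "i am iron man", "the suit needs more power",
    "jarvis run the numbers", "i built this suit in a cave",
    "i am thor son of odin", "the hammer is mine",
    "asgard is not a place", "loki is my brother",
]
toy_labels = ["TONY STARK"] * 4 + ["THOR"] * 4

toy_pipe = Pipeline([("vect", CountVectorizer()), ("clf", MultinomialNB())])

# Keys follow the "<step name>__<parameter>" convention used by Pipeline
toy_grid = {"vect__binary": [True, False], "clf__alpha": [1.0, 0.1]}

toy_cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
toy_search = GridSearchCV(toy_pipe, toy_grid, cv=toy_cv, scoring="balanced_accuracy")
toy_search.fit(toy_lines, toy_labels)

print(toy_search.best_params_)
print(len(toy_search.cv_results_["params"]), "candidates were evaluated")
```

Every combination in the grid is fit once per fold, so the cost grows multiplicatively with each added parameter option.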

## Stop Words

Removing unimportant words, called "stop words," from the feature documents is often an important step in NLP problems.  Stop words like `["the", "I", "we", "she", ...]` often don't provide any value to a model.  Three options were provided to `GridSearchCV` for removing stop words.  The first is `None`, which means no stop words are removed.  The other two are different stop word sets: the `english` set provided by Scikit-learn and the `english` set provided by NLTK.  The NLTK set was included because the documentation for Scikit-learn's [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) notes known issues with its built-in `english` set and recommends considering an alternative.
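
The effect of the `stop_words` option can be seen on a single line. This is a small sketch: the example sentence and the helper `vocabulary` are made up, and the short explicit list stands in for the NLTK set so that no corpus download is needed.

```python
from sklearn.feature_extraction.text import CountVectorizer

line = ["I am the king of Asgard"]

def vocabulary(stop_words):
    """Return the sorted vocabulary learned from `line` for a given stop word option."""
    vectorizer = CountVectorizer(stop_words=stop_words)
    vectorizer.fit(line)
    return sorted(vectorizer.vocabulary_)

# No stop word removal (the default tokenizer still drops one-letter tokens such as "I")
print(vocabulary(None))

# Scikit-learn's built-in English set
print(vocabulary("english"))

# Any explicit list of words, which is how the NLTK set is passed in this project
print(vocabulary(["the", "of"]))
```

Passing a list rather than the string `'english'` is what lets the grid search treat the NLTK set as just another option.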


## Stemming

Stemming is the process of taking derived words, such as verb conjugations and plurals, and transforming them back into their base versions.  For example, "fishing," "fished," and "fishes" might all be stemmed into "fish."  Stemming is useful in NLP because it allows our word count features to group derived words by their base words.  In this project, NLTK's Snowball Stemmer is used in a custom class that extends Scikit-learn's `CountVectorizer`.  See Javed Shaikh's article in the Sources section for the inspiration for this approach.
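
The "fish" example can be checked directly with NLTK. This sketch uses the stemmer on its own, without the `ignore_stopwords` option, so it does not need the NLTK stop word corpus.

```python
from nltk.stem import SnowballStemmer

# English Snowball stemmer, the same family of stemmer used in this project
stemmer = SnowballStemmer("english")

words = ["fishing", "fished", "fishes"]
stems = [stemmer.stem(word) for word in words]

print(stems)
```

All three derived forms collapse to the same stem, so a word count vector built on stems counts them as one feature.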


## TF / IDF Transformations


## Naive Bayes Classifiers

Naive Bayes Classification is based on Bayes's theorem, which says that 

$$
p(y|\textbf{x})=\frac{p(\textbf{x}|y)*p(y)}{p(\textbf{x})}
$$

where 

$p(y|\textbf{x})$ is our **posterior**, or probability of our label given an observation's features.

$p(\textbf{x}|y)$ is our **likelihood**, or the probability of our features given an observation's label.

$p(y)$ is our **prior**, or the probability of our label without seeing any of the observation's features.  

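Computing the likelihood $p(\textbf{x}|y)$ directly is impractical, so Naive Bayes assumes that all of our features are independent of one another given the label, which they often aren't.  In NLP especially, certain words are more likely to appear alongside related words.  Ignoring these dependencies is what makes the method "naive," and it lets us compute the likelihood as a simple product of per-feature probabilities.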

In this project, we compute the probability of a line belonging to a certain character by treating each word in the text as a feature.  We then select the character with the highest posterior given the features as our prediction.
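
As a worked illustration of the computation described above, the sketch below calculates the posterior by hand for a made-up four-line corpus and compares it with Scikit-learn's `MultinomialNB`. The lines, labels, and the smoothing value `alpha = 1.0` are assumptions chosen for the example, not values from this project.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data
lines = ["i am iron man", "i am groot", "we are groot", "i love iron suits"]
labels = np.array(["TONY", "GROOT", "GROOT", "TONY"])

vect = CountVectorizer()
counts = vect.fit_transform(lines).toarray()
classes = np.unique(labels)  # sorted: ['GROOT', 'TONY']

alpha = 1.0  # additive smoothing so unseen words do not get zero probability

# Prior: fraction of training lines belonging to each character
log_prior = np.log(np.array([(labels == c).mean() for c in classes]))

# Likelihood: smoothed per-class word frequencies
word_counts = np.array([counts[labels == c].sum(axis=0) for c in classes]) + alpha
log_likelihood = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))

# Posterior for a new line: add the log probabilities of its words to the log prior
query = vect.transform(["i am groot"]).toarray()
joint = query @ log_likelihood.T + log_prior
posterior = np.exp(joint - joint.max())
posterior /= posterior.sum()
manual_prediction = classes[posterior.argmax()]

# The same model fit by Scikit-learn
clf = MultinomialNB(alpha=alpha).fit(counts, labels)

print(dict(zip(classes, posterior[0])))
print(manual_prediction, clf.predict(query)[0])
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow on longer lines.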


## Support Vector Machine Classifiers


## Random Forest Classifiers




## Imports and Classes

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

from sklearn import metrics
from sklearn.utils import shuffle

import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

RANDOM_SEED = 42

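class StemCountVectorizer(CountVectorizer):
    """CountVectorizer that stems each analyzed token with NLTK's Snowball stemmer."""
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        # Build the stemmer once instead of once per document
        stemmer = SnowballStemmer('english', ignore_stopwords=True)

        return lambda document: [stemmer.stem(word) for word in analyzer(document)]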

### Dataset Import

In [4]:
mcu = pd.read_csv("./data/mcu.csv")

mcu.head()

(array(['Got it.', 'And terrifying.', 'What’s the delta rate?', ...,
        'I’m sorry. He seemed like a good man.', 'Heimdall, come on.',
        'I am a king!'], dtype='<U606'),
 array(['BRUCE BANNER', 'PEPPER POTTS', 'TONY STARK', ..., 'STEVE ROGERS',
        'THOR', 'LOKI'], dtype='<U12'))

## Preliminary Data Analysis

Word count distribution, line count distribution, and the number of characters chosen and why.

### Line Count Distributions

In [None]:
line_count = pd.DataFrame(mcu.groupby(["movie","character"]).line.nunique())
line_count.reset_index(inplace=True)
line_count = line_count.pivot(index="character", columns="movie", values="line")
line_count.fillna(0, inplace=True)
line_count["total"] = line_count.sum(axis=1)
line_count = line_count.astype("int64")
line_count.sort_values(by="total", ascending=False)

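# DESCRIBE_PERCENTILES is not defined anywhere in this notebook; the values below are assumed
DESCRIBE_PERCENTILES = [0.25, 0.5, 0.75, 0.9, 0.95]

line_count['total'].hist(bins=60)
pd.DataFrame(line_count['total']).describe(percentiles = DESCRIBE_PERCENTILES)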

### Words Per Line Distribution

In [None]:
mcu.hist(column="words", bins=60)
pd.DataFrame(mcu["words"]).describe(percentiles = DESCRIBE_PERCENTILES)

## Dataset Preprocessing

In [None]:
min_line_count = 150

is_main_character = mcu["character"].value_counts() > min_line_count
is_main_character = is_main_character.rename("is main character", axis=0)

main_character_rows = is_main_character[mcu["character"]]
main_character_rows = main_character_rows.reset_index(drop=True)

mcu = mcu[main_character_rows]

y = mcu["character"].to_numpy().astype(str)
X = mcu["line"].to_numpy().astype(str)

X, y = shuffle(X, y, random_state=RANDOM_SEED)

X, y
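
The filtering step above keeps only characters with more than `min_line_count` lines. The sketch below shows one compact way to express the same idea with `value_counts` and `isin`; the toy DataFrame and the threshold of 1 are hypothetical and chosen only so the result is easy to verify by eye.

```python
import pandas as pd

# Hypothetical stand-in for the mcu DataFrame
toy = pd.DataFrame({
    "character": ["THOR", "THOR", "THOR", "LOKI", "LOKI", "HEIMDALL"],
    "line": ["a", "b", "c", "d", "e", "f"],
})

toy_min_line_count = 1

# Count lines per character, then keep the characters above the threshold
counts = toy["character"].value_counts()
main_characters = counts[counts > toy_min_line_count].index

# Keep only the rows spoken by those characters
toy_main = toy[toy["character"].isin(main_characters)]

print(sorted(toy_main["character"].unique()))
print(len(toy_main), "rows kept")
```

Unlike the boolean mask built above, this version does not rely on the DataFrame having a default integer index.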

## Models

In [17]:
cross_validator = StratifiedKFold(n_splits=5, random_state=RANDOM_SEED, shuffle=True)
score_method = "balanced_accuracy"

In [18]:
count_vectorizer = CountVectorizer()
stem_count_vectorizer = StemCountVectorizer()

tfidf_transformer = TfidfTransformer()

nb_classifier = MultinomialNB()
rf_classifier = RandomForestClassifier(n_jobs=-1, random_state=RANDOM_SEED)
svm_classifier = SVC()

In [19]:
count_params = {'vect__binary': [True, False],
               'vect__stop_words': [None, 'english', stopwords.words('english')],
              'vect__ngram_range': [(1,1), (1,2), (1,3)]}

tfidf_params = {'tfidf__norm': ['l1', 'l2'],
              'tfidf__use_idf': [True, False]}

nb_params = {'clf__alpha': [1, 1e-1, 1e-2, 1e-3],
             'clf__fit_prior': [True, False]}

rf_params = {'clf__criterion': ["gini", "entropy"],
             'clf__max_depth': [None, 7, 8, 9, 10 ,11 ,12],
             'clf__max_features': [None, "sqrt", "log2"],
             'clf__class_weight': [None, 'balanced']}

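# Note: SVC requires C > 0, so candidates with C = 0 cannot be fit;
# the value is kept here so the grid size matches the recorded output below
svm_params = {'clf__C': [1e-2, 1e-1, 0, 1, 10, 100],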
              'clf__kernel': ['linear', 'poly', 'rbf'],
              'clf__degree': [2,3,4,5,6],
              'clf__gamma': ['scale', 'auto'],
              'clf__class_weight': [None, 'balanced']}
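
The size of each search follows directly from these dictionaries. As a rough check, the sketch below rebuilds the grids with a two-word placeholder in place of the NLTK stop word list (only the number of options matters) and counts the candidates with Scikit-learn's `ParameterGrid`. The helper `n_candidates` is illustrative and not part of the project.

```python
from sklearn.model_selection import ParameterGrid

# Stand-in for stopwords.words('english'); only the number of options matters here
placeholder_stop_words = ["the", "a"]

count_params = {"vect__binary": [True, False],
                "vect__stop_words": [None, "english", placeholder_stop_words],
                "vect__ngram_range": [(1, 1), (1, 2), (1, 3)]}
tfidf_params = {"tfidf__norm": ["l1", "l2"],
                "tfidf__use_idf": [True, False]}
nb_params = {"clf__alpha": [1, 1e-1, 1e-2, 1e-3],
             "clf__fit_prior": [True, False]}
rf_params = {"clf__criterion": ["gini", "entropy"],
             "clf__max_depth": [None, 7, 8, 9, 10, 11, 12],
             "clf__max_features": [None, "sqrt", "log2"],
             "clf__class_weight": [None, "balanced"]}
svm_params = {"clf__C": [1e-2, 1e-1, 0, 1, 10, 100],
              "clf__kernel": ["linear", "poly", "rbf"],
              "clf__degree": [2, 3, 4, 5, 6],
              "clf__gamma": ["scale", "auto"],
              "clf__class_weight": [None, "balanced"]}

def n_candidates(*grids):
    """Number of parameter combinations in the merged grids."""
    merged = {}
    for grid in grids:
        merged.update(grid)
    return len(ParameterGrid(merged))

sizes = {
    "Naive Bayes": n_candidates(count_params, nb_params),
    "Naive Bayes + TF/IDF": n_candidates(count_params, tfidf_params, nb_params),
    "Random Forest": n_candidates(count_params, rf_params),
    "Random Forest + TF/IDF": n_candidates(count_params, tfidf_params, rf_params),
    "SVM": n_candidates(count_params, svm_params),
    "SVM + TF/IDF": n_candidates(count_params, tfidf_params, svm_params),
}

for name, size in sizes.items():
    print(f"{name}: {size} candidates, {5 * size} fits with 5 folds")
```

These counts agree with the "Fitting 5 folds for each of ... candidates" lines in the recorded outputs, and they show why the SVM searches with TF / IDF are by far the most expensive.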

### Model 1 (Naive Bayes, no TFIDF, no stemming)

In [8]:
pipe1 = Pipeline([('vect', count_vectorizer), 
                  ('clf', nb_classifier)])

parameters1 = {**count_params, **nb_params}

grid1 = GridSearchCV(pipe1, parameters1, cv=cross_validator, scoring=score_method, n_jobs=-1, verbose=3)

grid1.fit(X,y)

grid1.best_params_

Fitting 5 folds for each of 144 candidates, totalling 720 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:    1.5s
[Parallel(n_jobs=-1)]: Done 104 tasks      | elapsed:    2.8s
[Parallel(n_jobs=-1)]: Done 264 tasks      | elapsed:    5.1s
[Parallel(n_jobs=-1)]: Done 488 tasks      | elapsed:    8.3s
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed:   11.7s finished


{'clf__alpha': 0.1,
 'clf__fit_prior': False,
 'vect__binary': True,
 'vect__ngram_range': (1, 3),
 'vect__stop_words': ['i',
  'me',
  'my',
  'myself',
  'we',
  'our',
  'ours',
  'ourselves',
  'you',
  "you're",
  "you've",
  "you'll",
  "you'd",
  'your',
  'yours',
  'yourself',
  'yourselves',
  'he',
  'him',
  'his',
  'himself',
  'she',
  "she's",
  'her',
  'hers',
  'herself',
  'it',
  "it's",
  'its',
  'itself',
  'they',
  'them',
  'their',
  'theirs',
  'themselves',
  'what',
  'which',
  'who',
  'whom',
  'this',
  'that',
  "that'll",
  'these',
  'those',
  'am',
  'is',
  'are',
  'was',
  'were',
  'be',
  'been',
  'being',
  'have',
  'has',
  'had',
  'having',
  'do',
  'does',
  'did',
  'doing',
  'a',
  'an',
  'the',
  'and',
  'but',
  'if',
  'or',
  'because',
  'as',
  'until',
  'while',
  'of',
  'at',
  'by',
  'for',
  'with',
  'about',
  'against',
  'between',
  'into',
  'through',
  'during',
  'before',
  'after',
  'above',
  'below',
 

In [None]:
model1 = Pipeline([('vect', CountVectorizer(binary=True, ngram_range = (1,3), stop_words = stopwords.words('english'))),
                  ('clf', MultinomialNB(alpha=0.1, fit_prior=False))])

### Model 2 (Naive Bayes, TFIDF, no stemming)

In [10]:
pipe2 = Pipeline([('vect', count_vectorizer),
                  ('tfidf', tfidf_transformer),
                  ('clf', nb_classifier)])

parameters2 = {**count_params, **tfidf_params, **nb_params}

grid2 = GridSearchCV(pipe2, parameters2, cv=cross_validator, scoring=score_method, n_jobs=-1, verbose=3)

grid2.fit(X,y)

grid2.best_params_

Fitting 5 folds for each of 576 candidates, totalling 2880 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    2.4s
[Parallel(n_jobs=-1)]: Done 504 tasks      | elapsed:    6.6s
[Parallel(n_jobs=-1)]: Done 952 tasks      | elapsed:   12.9s
[Parallel(n_jobs=-1)]: Done 1528 tasks      | elapsed:   21.3s
[Parallel(n_jobs=-1)]: Done 2232 tasks      | elapsed:   32.3s
[Parallel(n_jobs=-1)]: Done 2880 out of 2880 | elapsed:   42.1s finished


{'clf__alpha': 0.1,
 'clf__fit_prior': False,
 'tfidf__norm': 'l2',
 'tfidf__use_idf': True,
 'vect__binary': False,
 'vect__ngram_range': (1, 2),
 'vect__stop_words': ['i', 'me', 'my', 'myself', ...]}  # NLTK English stop word list, as in the Model 1 output

In [None]:
model2 = Pipeline([('vect', CountVectorizer(binary=False, ngram_range=(1,2), stop_words=stopwords.words('english'))),
                  ('tfidf', TfidfTransformer(norm='l2', use_idf=True)),
                  ('clf', MultinomialNB(alpha=0.1, fit_prior=False))])

### Model 3 (Naive Bayes, no TFIDF,  stemming)

In [11]:
pipe3 = Pipeline([('vect', stem_count_vectorizer),
                  ('clf', nb_classifier)])

parameters3 = {**count_params, **nb_params}

grid3 = GridSearchCV(pipe3, parameters3, cv=cross_validator, scoring=score_method, n_jobs=-1, verbose=3)

grid3.fit(X,y)

grid3.best_params_

Fitting 5 folds for each of 144 candidates, totalling 720 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:   19.2s
[Parallel(n_jobs=-1)]: Done 104 tasks      | elapsed:  3.3min
[Parallel(n_jobs=-1)]: Done 264 tasks      | elapsed:  9.2min
[Parallel(n_jobs=-1)]: Done 488 tasks      | elapsed: 16.6min
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed: 24.0min finished


{'clf__alpha': 0.1,
 'clf__fit_prior': False,
 'vect__binary': False,
 'vect__ngram_range': (1, 1),
 'vect__stop_words': None}

In [None]:
model3 = Pipeline([('vect', StemCountVectorizer(binary=False, ngram_range = (1,1), stop_words = None)),
                  ('clf', MultinomialNB(alpha=0.1, fit_prior=False))])

### Model 4 (Naive Bayes, TFIDF, stemming)

In [12]:
pipe4 = Pipeline([('vect', stem_count_vectorizer),
                  ('tfidf', tfidf_transformer),
                  ('clf', nb_classifier)])

parameters4 = {**count_params, **tfidf_params, **nb_params}

grid4 = GridSearchCV(pipe4, parameters4, cv=cross_validator, scoring=score_method, n_jobs=-1, verbose=3)

grid4.fit(X,y)

grid4.best_params_

Fitting 5 folds for each of 576 candidates, totalling 2880 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:   18.4s
[Parallel(n_jobs=-1)]: Done 104 tasks      | elapsed:  3.2min
[Parallel(n_jobs=-1)]: Done 264 tasks      | elapsed:  8.8min
[Parallel(n_jobs=-1)]: Done 488 tasks      | elapsed: 16.1min
[Parallel(n_jobs=-1)]: Done 776 tasks      | elapsed: 25.2min
[Parallel(n_jobs=-1)]: Done 1128 tasks      | elapsed: 36.7min
[Parallel(n_jobs=-1)]: Done 1544 tasks      | elapsed: 50.9min
[Parallel(n_jobs=-1)]: Done 2024 tasks      | elapsed: 67.0min
[Parallel(n_jobs=-1)]: Done 2568 tasks      | elapsed: 85.2min
[Parallel(n_jobs=-1)]: Done 2880 out of 2880 | elapsed: 95.7min finished


{'clf__alpha': 0.01,
 'clf__fit_prior': False,
 'tfidf__norm': 'l2',
 'tfidf__use_idf': True,
 'vect__binary': False,
 'vect__ngram_range': (1, 2),
 'vect__stop_words': None}

In [None]:
model4 = Pipeline([('vect', StemCountVectorizer(binary=False, ngram_range=(1,2), stop_words=None)),
                  ('tfidf', TfidfTransformer(norm='l2', use_idf=True)),
                  ('clf', MultinomialNB(alpha=0.01, fit_prior=False))])

### Model 5 (Random Forest, no TFIDF, no stemming)

In [20]:
pipe5 = Pipeline([('vect', count_vectorizer), 
                  ('clf', rf_classifier)])

parameters5 = {**count_params, **rf_params}

grid5 = GridSearchCV(pipe5, parameters5, cv=cross_validator, scoring=score_method, n_jobs=-1, verbose=3)

grid5.fit(X,y)

grid5.best_params_

Fitting 5 folds for each of 1512 candidates, totalling 7560 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:   20.4s
[Parallel(n_jobs=-1)]: Done 104 tasks      | elapsed:  6.9min
[Parallel(n_jobs=-1)]: Done 264 tasks      | elapsed:  8.3min
[Parallel(n_jobs=-1)]: Done 488 tasks      | elapsed:  8.8min
[Parallel(n_jobs=-1)]: Done 776 tasks      | elapsed:  9.4min
[Parallel(n_jobs=-1)]: Done 1128 tasks      | elapsed: 10.3min
[Parallel(n_jobs=-1)]: Done 1544 tasks      | elapsed: 11.4min
[Parallel(n_jobs=-1)]: Done 2024 tasks      | elapsed: 19.7min
[Parallel(n_jobs=-1)]: Done 2568 tasks      | elapsed: 22.5min
[Parallel(n_jobs=-1)]: Done 3176 tasks      | elapsed: 24.7min
[Parallel(n_jobs=-1)]: Done 3848 tasks      | elapsed: 32.8min
[Parallel(n_jobs=-1)]: Done 4584 tasks      | elapsed: 37.9min
[Parallel(n_jobs=-1)]: Done 5384 tasks      | elapsed: 40.0min
[Parallel(n_jobs=-1)]: Done 6248 tasks      | elapsed: 51.1min
[Parallel(n_jobs=-1)]: Done 7176 tasks      | 

{'clf__class_weight': 'balanced',
 'clf__criterion': 'gini',
 'clf__max_depth': None,
 'clf__max_features': 'sqrt',
 'vect__binary': False,
 'vect__ngram_range': (1, 2),
 'vect__stop_words': ['i', 'me', 'my', 'myself', ...]}  # NLTK English stop word list, as in the Model 1 output


In [None]:
model5 = Pipeline([('vect', CountVectorizer(binary=False, ngram_range = (1,2), stop_words = stopwords.words('english'))),
                  ('clf', RandomForestClassifier(class_weight='balanced', criterion="gini", max_depth=None, max_features="sqrt",
                                                 n_jobs=-1, random_state=RANDOM_SEED))])

### Model 6 (Random Forest, TFIDF, no stemming)

In [21]:
pipe6 = Pipeline([('vect', count_vectorizer),
                  ('tfidf', tfidf_transformer),
                  ('clf', rf_classifier)])

parameters6 = {**count_params, **tfidf_params, **rf_params}

grid6 = GridSearchCV(pipe6, parameters6, cv=cross_validator, scoring=score_method, n_jobs=-1, verbose=3)

grid6.fit(X,y)

grid6.best_params_

Fitting 5 folds for each of 6048 candidates, totalling 30240 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:   26.0s
[Parallel(n_jobs=-1)]: Done 104 tasks      | elapsed:  8.0min


KeyboardInterrupt: 

In [None]:
model6 = Pipeline([('vect', CountVectorizer(binary=True, ngram_range=(1,1), stop_words=stopwords.words('english'))),
                  ('tfidf', TfidfTransformer(norm='l2', use_idf=True)),
                  ('clf', RandomForestClassifier(class_weight='balanced', criterion="gini", max_depth=None, max_features="log2",
                                                 n_jobs=-1, random_state=RANDOM_SEED))])

### Model 7 (Random Forest, no TFIDF,  stemming)

In [22]:
pipe7 = Pipeline([('vect', stem_count_vectorizer),
                  ('clf', rf_classifier)])

parameters7 = {**count_params, **rf_params}

grid7 = GridSearchCV(pipe7, parameters7, cv=cross_validator, scoring=score_method, n_jobs=-1, verbose=3)

grid7.fit(X,y)

grid7.best_params_

Fitting 5 folds for each of 1512 candidates, totalling 7560 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:   36.7s
[Parallel(n_jobs=-1)]: Done 104 tasks      | elapsed:  9.2min
[Parallel(n_jobs=-1)]: Done 264 tasks      | elapsed: 16.4min
[Parallel(n_jobs=-1)]: Done 488 tasks      | elapsed: 24.6min
[Parallel(n_jobs=-1)]: Done 776 tasks      | elapsed: 34.4min
[Parallel(n_jobs=-1)]: Done 1128 tasks      | elapsed: 47.3min
[Parallel(n_jobs=-1)]: Done 1544 tasks      | elapsed: 62.1min
[Parallel(n_jobs=-1)]: Done 2024 tasks      | elapsed: 85.3min
[Parallel(n_jobs=-1)]: Done 2568 tasks      | elapsed: 105.5min
[Parallel(n_jobs=-1)]: Done 3176 tasks      | elapsed: 127.2min
[Parallel(n_jobs=-1)]: Done 3848 tasks      | elapsed: 155.7min
[Parallel(n_jobs=-1)]: Done 4584 tasks      | elapsed: 183.8min
[Parallel(n_jobs=-1)]: Done 5384 tasks      | elapsed: 211.4min
[Parallel(n_jobs=-1)]: Done 6248 tasks      | elapsed: 250.6min
[Parallel(n_jobs=-1)]: Done 7176 tasks  

{'clf__class_weight': 'balanced',
 'clf__criterion': 'gini',
 'clf__max_depth': 12,
 'clf__max_features': 'log2',
 'vect__binary': True,
 'vect__ngram_range': (1, 1),
 'vect__stop_words': ['i', 'me', 'my', 'myself', ...]}  # NLTK English stop word list, as in the Model 1 output

In [None]:
model7 = Pipeline([('vect', StemCountVectorizer(binary=True, ngram_range = (1,1), stop_words = stopwords.words('english'))),
                  ('clf', RandomForestClassifier(class_weight='balanced', criterion="gini", max_depth=12, max_features="log2",
                                                 n_jobs=-1, random_state=RANDOM_SEED))])

### Model 8 (Random Forest, TFIDF, stemming)

In [None]:
pipe8 = Pipeline([('vect', stem_count_vectorizer),
                  ('tfidf', tfidf_transformer),
                  ('clf', rf_classifier)])

parameters8 = {**count_params, **tfidf_params, **rf_params}

grid8 = GridSearchCV(pipe8, parameters8, cv=cross_validator, scoring=score_method, n_jobs=-1, verbose=3)

grid8.fit(X,y)

grid8.best_params_

### Model 9 (SVM, no TFIDF, no stemming)

In [None]:
pipe9 = Pipeline([('vect', count_vectorizer), 
                  ('clf', svm_classifier)])

parameters9 = {**count_params, **svm_params}

grid9 = GridSearchCV(pipe9, parameters9, cv=cross_validator, scoring=score_method, n_jobs=-1, verbose=3)

grid9.fit(X,y)

grid9.best_params_

In [None]:
model9 = Pipeline([('vect', CountVectorizer(binary=True, ngram_range = (1,1), stop_words = stopwords.words('english'))),
                  ('clf', SVC(C=0.1, class_weight='balanced', degree=2, gamma="scale", kernel='linear'))])

### Model 10 (SVM, TFIDF, no stemming)

In [None]:
pipe10 = Pipeline([('vect', count_vectorizer),
                  ('tfidf', tfidf_transformer),
                  ('clf', svm_classifier)])

parameters10 = {**count_params, **tfidf_params, **svm_params}

grid10 = GridSearchCV(pipe10, parameters10, cv=cross_validator, scoring=score_method, n_jobs=-1, verbose=3)

grid10.fit(X,y)

grid10.best_params_

In [None]:
model10 = Pipeline([('vect', CountVectorizer(binary=True, ngram_range = (1,2), stop_words = stopwords.words('english'))),
                    ('tfidf', TfidfTransformer(norm='l1', use_idf=True)),
                  ('clf', SVC(C=100, class_weight=None, degree=2, gamma="scale", kernel='linear'))])

### Model 11 (SVM, no TFIDF,  stemming)

In [23]:
pipe11 = Pipeline([('vect', stem_count_vectorizer),
                  ('clf', svm_classifier)])

parameters11 = {**count_params, **svm_params}

grid11 = GridSearchCV(pipe11, parameters11, cv=cross_validator, scoring=score_method, n_jobs=-1, verbose=3)

grid11.fit(X,y)

grid11.best_params_

Fitting 5 folds for each of 6480 candidates, totalling 32400 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:   18.8s
[Parallel(n_jobs=-1)]: Done 104 tasks      | elapsed:  3.5min
[Parallel(n_jobs=-1)]: Done 264 tasks      | elapsed:  9.7min
[Parallel(n_jobs=-1)]: Done 488 tasks      | elapsed: 17.4min
[Parallel(n_jobs=-1)]: Done 776 tasks      | elapsed: 26.8min
[Parallel(n_jobs=-1)]: Done 1128 tasks      | elapsed: 38.9min
[Parallel(n_jobs=-1)]: Done 1544 tasks      | elapsed: 52.9min
[Parallel(n_jobs=-1)]: Done 2024 tasks      | elapsed: 69.3min
[Parallel(n_jobs=-1)]: Done 2568 tasks      | elapsed: 88.4min
[Parallel(n_jobs=-1)]: Done 3176 tasks      | elapsed: 109.0min
[Parallel(n_jobs=-1)]: Done 3848 tasks      | elapsed: 131.8min
[Parallel(n_jobs=-1)]: Done 4584 tasks      | elapsed: 157.0min
[Parallel(n_jobs=-1)]: Done 5384 tasks      | elapsed: 184.3min
[Parallel(n_jobs=-1)]: Done 6248 tasks      | elapsed: 214.8min
[Parallel(n_jobs=-1)]: Done 7176 tasks   

{'clf__C': 0.1,
 'clf__class_weight': 'balanced',
 'clf__degree': 2,
 'clf__gamma': 'scale',
 'clf__kernel': 'linear',
 'vect__binary': True,
 'vect__ngram_range': (1, 1),
 'vect__stop_words': ['i', 'me', 'my', 'myself', ...]}  # NLTK English stop word list, as in the Model 1 output

### Model 12 (SVM, TFIDF, stemming)

**Reminder:** Model 12 was not trained using a GridSearch because the estimated training time was too large (90 hours).  Instead, we take the parameters from model 10 (which is the same architecture minus stemming) and apply stemming.  

In [None]:
model12 = Pipeline([('vect', StemCountVectorizer(binary=True, ngram_range = (1,2), stop_words = stopwords.words('english'))),
                    ('tfidf', TfidfTransformer(norm='l1', use_idf=True)),
                  ('clf', SVC(C=100, class_weight=None, degree=2, gamma="scale", kernel='linear'))])

# Results


Comparisons of TFIDF vs no TFIDF

Comparisons of Stemming vs no stemming

Comparisons of classifiers


## Cross Validation Scores

To compare the models generated, the cell below performs cross validation using all of the data and reports the balanced accuracy of each model.  Balanced accuracy was used because the number of examples in each class is unequal.  
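
The difference between plain and balanced accuracy is easiest to see on an imbalanced toy example. In this sketch the labels are invented, and the "model" simply predicts the majority character every time.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical imbalanced labels: 8 lines by THOR, 2 by LOKI
y_true = ["THOR"] * 8 + ["LOKI"] * 2

# A useless model that always predicts the majority class
y_pred = ["THOR"] * 10

plain = accuracy_score(y_true, y_pred)
balanced = balanced_accuracy_score(y_true, y_pred)

print("accuracy:", plain)
print("balanced accuracy:", balanced)
```

Plain accuracy rewards the majority-class guess, while balanced accuracy averages the recall of each class and so scores this model no better than chance.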

Using all of the data to estimate the balanced accuracies of the models poses a problem.  Because the hyperparameters were tuned using `GridSearchCV` on all of our data, the models are being evaluated on the same data that was used to select them.  This could inflate their balanced accuracies above what might be observed on new, unseen data.  One solution to this problem, nested cross validation, was tried but not used for the final models because it was too computationally expensive.
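
For reference, nested cross validation can be sketched by passing a `GridSearchCV` object to `cross_val_score`, so that hyperparameters are tuned only on each outer training split. The toy corpus, the single-parameter grid, and the fold counts below are assumptions made to keep the example small; this is not the configuration that was tried in the project.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: six lines per character
toy_lines = [
    "i am iron man", "the suit needs more power", "jarvis run the numbers",
    "i built this suit in a cave", "charge the arc reactor", "the armor is ready",
    "i am thor son of odin", "the hammer is mine", "asgard is not a place",
    "loki is my brother", "odin is my father", "bring me the hammer",
]
toy_labels = ["TONY STARK"] * 6 + ["THOR"] * 6

toy_pipe = Pipeline([("vect", CountVectorizer()), ("clf", MultinomialNB())])

# Inner loop: choose hyperparameters using only the outer training split
inner_cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
inner_search = GridSearchCV(toy_pipe, {"clf__alpha": [1.0, 0.1]},
                            cv=inner_cv, scoring="balanced_accuracy")

# Outer loop: score the tuned model on data the search never saw
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
nested_scores = cross_val_score(inner_search, toy_lines, toy_labels,
                                cv=outer_cv, scoring="balanced_accuracy")

print(nested_scores)
print("mean:", nested_scores.mean())
```

The whole grid search is repeated once per outer fold, which is why the cost multiplies so quickly on large grids.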

In [None]:
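# Every model in this list must be defined before running this cell;
# model8 and model11 have no definition cell above and must be built from
# grid8.best_params_ and grid11.best_params_ first
models = [model1, model2, model3, model4, model5, model6, model7, model8, model9, model10, model11, model12]

cv_score_table = pd.DataFrame()

for i, model in enumerate(models):
    results = cross_val_score(model, X, y, cv=cross_validator, scoring=score_method, n_jobs=-1)
    cv_score_table["model " + str(i+1)] = results

cv_score_table.index.name = "fold"

# Compute the summary rows from the fold scores only, so that the "mean" row
# is not included in "std" and the summary rows are not included in "max"
fold_scores = cv_score_table.copy()
cv_score_table.loc["mean"] = fold_scores.mean()
cv_score_table.loc["std"] = fold_scores.std()
cv_score_table.loc["max"] = fold_scores.max()

cv_score_table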

## Confusion Matrix

Below is a confusion matrix generated by training our best model, model **BLANK**, on a subset of the data and testing it on the remainder.  This gives a less optimistic estimate of balanced accuracy than the scores reported above, because the model is tested on data it was not fit on.

In [None]:
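from sklearn.model_selection import train_test_split

# The original leaves the best model unspecified (model BLANK above).
# best_model is a placeholder name: assign the chosen pipeline to it before running this cell.
best_model = None

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_SEED)

best_model.fit(X_train,y_train)
yhat = best_model.predict(X_test)

print("balanced_accuracy:", metrics.balanced_accuracy_score(y_test, yhat))

plot = metrics.plot_confusion_matrix(best_model, X_test, y_test,
                             values_format = 'd',
                             cmap=plt.cm.Blues)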

## Learning Curves

In [None]:
def plot_learning_curve(estimator, title, X, y, axes=None, cv=None,
                        n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
    
    plt.title(title)
    plt.xlabel("Training examples")
    plt.ylabel("Score")

    train_sizes, train_scores, test_scores = \
        learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs,
                       train_sizes=train_sizes, verbose=3)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    # Plot learning curve
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                         train_scores_mean + train_scores_std, alpha=0.1,
                         color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.1,
                         color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
                 label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
                 label="Cross-validation score")
    plt.legend(loc="best")

    return plt

In [None]:
from sklearn.model_selection import learning_curve

title = 

plot_learning_curve(, title, X, y, cv=cross_validator)

# Conclusions

## Project Improvements

## Takeaways

In this project I learned a lot about NLP techniques and about models I hadn't been exposed to before.  New to me were the concepts of turning a document into a word count vector, stemming, TF / IDF transformations, and stop words.  As far as models go, this was also the first time that I got to create Naive Bayes, SVM, and Random Forest models.  Using Scikit-learn in this context also helped establish a set of skills that I can apply to other projects.

## Moving Forward

Moving forward, I would like to continue to explore the models and techniques used in this project.  I believe that what hindered my progress was a lack of knowledge about which parameters matter most to a model's success.  Although I read and learned about what each parameter does, I was unsure which would have the biggest effect on a model.  This is why I performed such vast grid searches.  I wonder whether my time would have been better spent exploring more models and processing techniques, such as Word2Vec or neural networks, than waiting for these models to train.

I would also like to be more scientific and organized when I start my next project.  In this project, I began by just trying different things, whereas I would have liked to have started with more of a plan.  I would also like to think about how and on what machine I train, so that I do not waste time.  In future projects, I think the experience I gained here will be a big help in constructing a more robust methodology.

# Sources

Ben-Hur, Asa. “CS345: Machine Learning Foundations and Practice.” GitHub, Colorado State University, 7 Dec. 2020, www.github.com/asabenhur/CS345. 

Shaikh, Javed. “Machine Learning, NLP: Text Classification Using Scikit-Learn, Python and NLTK.” Medium, Towards Data Science, 30 Oct. 2017, www.towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a. 

Starmer, Josh. “Naive Bayes, Clearly Explained!!!” YouTube, StatQuest with Josh Starmer, 3 June 2020, www.youtube.com/watch?v=O2L2Uv9pdDA. 

Starmer, Josh. “Support Vector Machines, Clearly Explained!!!” YouTube, StatQuest with Josh Starmer, 30 Sept. 2019, www.youtube.com/watch?v=O2L2Uv9pdDA. 

