I based my initial selection of imports on the work we did in the NLP Practice breakfast hour challenge.

In [41]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction import text 
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier

import functions as fun
from lists import expanded_stopwords, expanded_proper_names

import time

In [42]:
whole_df = pd.read_csv('data/cleaned_all_text2022-06-29.csv')

whole_df.head()

| Unnamed: 0 | created_utc | selftext | subreddit_name | title | all_words | submission_length | title_length | submission_word_count | title_word_count | no_selftext | avg_word_length |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1656261965 | [removed] | startrek | Which version of Klingons will appear in SNW? | Which version of Klingons will appear in SNW? | 9.0 | 45.0 | 1.0 | 8.0 | 0.0 | 4.625 |
| 1 | 1656254308 | [removed] | startrek | On the Gorn and language | On the Gorn and language | 9.0 | 24.0 | 1.0 | 5.0 | 0.0 | 4.0 |
| 2 | 1656248567 | [removed] | startrek | What are some good things that can be said abo... | What are some good things that can be said abo... | 9.0 | 61.0 | 1.0 | 13.0 | 0.0 | 3.692308 |
| 3 | 1656238740 | [removed] | startrek | A Lord of the Rings reference in SNW 1x08 | A Lord of the Rings reference in SNW 1x08 | 9.0 | 41.0 | 1.0 | 9.0 | 0.0 | 3.333333 |
| 4 | 1656238132 | [removed] | startrek | The sword props used in SNW 1x08 are replicas ... | The sword props used in SNW 1x08 are replicas ... | 9.0 | 119.0 | 1.0 | 22.0 | 0.0 | 4.090909 |


In [43]:
df = whole_df[['subreddit_name', 'all_words']]

# Baseline Accuracy

Our baseline accuracy is 50.9%: the accuracy of always predicting the majority class, startrek.

In [44]:
df.subreddit_name.value_counts(normalize = True)

startrek    0.509214
starwars    0.490786
Name: subreddit_name, dtype: float64
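
As a minimal sketch of where the baseline figure comes from, the snippet below computes the majority-class share on a handful of made-up labels. The `labels` series is a hypothetical stand-in for `df['subreddit_name']`, not the real data.

```python
import pandas as pd

# Toy labels standing in for df['subreddit_name'] (hypothetical values).
labels = pd.Series(['startrek', 'startrek', 'starwars', 'startrek', 'starwars'])

# Share of each class; the largest share is the accuracy of always
# predicting the majority class.
class_shares = labels.value_counts(normalize=True)
baseline_accuracy = class_shares.max()
majority_class = class_shares.idxmax()

print(f'Majority class: {majority_class}')
print(f'Baseline accuracy: {baseline_accuracy:.3f}')
```

Any model worth keeping should beat this number on the test set.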

In [11]:
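# NLTK supplies the lemmatizer and stemmer used below
# (the lemmatizer needs the WordNet corpus: nltk.download('wordnet'))
from nltk.stem import WordNetLemmatizer, PorterStemmer

def lemmatize_text(text):
    # Lemmatize each whitespace-separated token and rejoin into one string
    split_text = text.split()
    lemmatizer = WordNetLemmatizer()
    return ' '.join([lemmatizer.lemmatize(word) for word in split_text])

def stem_text(text):
    # Stem each whitespace-separated token with the Porter stemmer
    split_text = text.split()
    p_stemmer = PorterStemmer()
    return ' '.join([p_stemmer.stem(word) for word in split_text])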

test_phrase1 = 'my computer computes computationally'
print(f'Test phrase: {test_phrase1}')
print(f'Test phrase lemmatized: {lemmatize_text(test_phrase1)}')
print(f'Test phrase stemmed: {stem_text(test_phrase1)}')
print('')

test_phrase2 = 'studies studying cries cry'
print(f'Test phrase: {test_phrase2}')
print(f'Test phrase lemmatized: {lemmatize_text(test_phrase2)}')
print(f'Test phrase stemmed: {stem_text(test_phrase2)}')

Test phrase: my computer computes computationally
Test phrase lemmatized: my computer computes computationally
Test phrase stemmed: my comput comput comput

Test phrase: studies studying cries cry
Test phrase lemmatized: study studying cry cry
Test phrase stemmed: studi studi cri cri


# Prepping Data for Modeling

Getting training and test data set up.

In [21]:
X = df['all_words']
y = df['subreddit_name']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

print('X_train shape:', X_train.shape)
print('y_train shape:', y_train.shape)
print('X_test shape:', X_test.shape)
print('y_test shape:', y_test.shape)
print('')
print('Dataframe shape:', df.shape)
print('')
print('y_train value counts:', y_train.value_counts(normalize = True))
print('y_test value counts:', y_test.value_counts(normalize = True))

X_train shape: (5209,)
y_train shape: (5209,)
X_test shape: (1737,)
y_test shape: (1737,)

Dataframe shape: (6946, 2)

y_train value counts: startrek    0.509119
starwars    0.490881
Name: subreddit_name, dtype: float64
y_test value counts: startrek    0.509499
starwars    0.490501
Name: subreddit_name, dtype: float64


# General Notes on Modeling
## Previous Work
I first developed several models without eliminating proper names. That made the task too easy, so per Hank's suggestion I removed the proper names and franchise-specific references that appeared in the top 150 words of the overall list and of each show's list. Even with that list of names and terms removed, classification still felt a little too easy. For that reason, in this notebook I worked from an expanded list of names and terms from the two franchises, which I removed from the models as stopwords.

I saved the earlier modeling, both without removing show-specific names and words and with the shorter list, but left it out of this notebook to avoid clutter. Please let me know if you want to see it.

## Notes On These Models
I used the best parameters from that first modeling process as the starting point for the models in this notebook.

In the course of tuning the models in this notebook, I used 'expanded_proper_names' and 'expanded_stopwords'. The first is the longest list of proper names and show specific references I developed. The other is that list combined with the scikit-learn 'english' stopwords. This enabled me to see if the 'english' stopwords helped the models or not.
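
A rough sketch of how two such lists can relate, assuming the combined list is a simple union with scikit-learn's built-in English stop words. The short `proper_names_sample` list and both variable names are hypothetical; the real lists live in the `lists` module and may be built differently.

```python
from sklearn.feature_extraction import text

# Hypothetical, shortened stand-in for the notebook's expanded_proper_names.
proper_names_sample = ['kirk', 'spock', 'vader', 'jedi', 'starfleet']

# Union with scikit-learn's built-in English stop words, as a sorted list
# (the vectorizers accept a list for stop_words).
combined_stopwords_sample = sorted(
    text.ENGLISH_STOP_WORDS.union(proper_names_sample))

print(len(proper_names_sample), len(combined_stopwords_sample))
```

Passing either list as `stop_words` in a grid then compares "names only" against "names plus English stop words" in a single search.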

In the models without the expanded stopwords, I checked lemmatized and stemmed word lists, but as a rule found that the stop words were more helpful. Between that and the way that I'm removing the stop words, which doesn't play nice with lemmatizing and stemming, I'm leaving those out in these models.

## Observations
### On Parameters
In previous modeling with no or fewer franchise-specific stopwords, the best models often used 'ngram_range' = (1,2) with the 'english' stopwords removed. Interestingly, and perhaps intuitively, with more of the franchise-specific words eliminated, the best 'ngram_range' is (1,1), and the models perform better with the 'english' stopwords *not* removed more often than I would have anticipated. Eliminating enough franchise-specific words would logically turn many two-word ngrams into gibberish, and it's not illogical that stopwords would become more helpful to modeling in the absence of these other words.
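
To illustrate the point about two-word ngrams, here is a small sketch on a made-up sentence. scikit-learn's vectorizers drop stop words before building ngrams, so removing names joins words that were never adjacent in the original text. The sentence and the two-name stop list are invented for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ['luke trained with yoda on dagobah']

# Without stop words, bigrams pair words that were adjacent in the text.
plain = CountVectorizer(ngram_range=(1, 2))
plain.fit(doc)

# With names removed, the vectorizer drops them first and then builds
# bigrams from what is left, joining words that were never adjacent.
filtered = CountVectorizer(ngram_range=(1, 2), stop_words=['luke', 'yoda'])
filtered.fit(doc)

print(sorted(plain.vocabulary_))
print(sorted(filtered.vocabulary_))
```

In the filtered vocabulary, "with on" appears as a bigram even though those words were separated by "yoda" in the source sentence.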

### On Tuning Decisions
GridSearchCV selects the parameters with the best mean cross-validated score on the training data it is fit on. I watched the test score when making parameter decisions, so in some cases I chose parameters that GridSearchCV didn't select as optimal.
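
The sketch below shows the two numbers involved, using a tiny invented corpus rather than the subreddit data. `best_score_` is the mean cross-validated score GridSearchCV uses to pick parameters, and the held-out test score is computed separately afterward.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus; the texts and labels are illustrative only.
texts = ['warp drive engage', 'the ship went to warp', 'beam me up now',
         'set phasers to stun', 'the bridge crew met', 'shields are holding',
         'use the force', 'the rebel base fell', 'a droid was lost',
         'the empire struck', 'pilot the falcon', 'a lightsaber duel began'] * 3
labels = (['trek'] * 6 + ['wars'] * 6) * 3

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, random_state=42, stratify=labels)

pipe = Pipeline([('tvec', TfidfVectorizer()), ('nb', MultinomialNB())])
gs = GridSearchCV(pipe, {'nb__alpha': [0.25, 1.0]}, cv=3)
gs.fit(X_train, y_train)

# best_params_ maximizes the mean cross-validated score (best_score_),
# computed only from folds of the training data.
print('Best parameters:', gs.best_params_)
print('Mean CV score:', gs.best_score_)
# The held-out test score is a separate number that the search never sees.
print('Test score:', gs.score(X_test, y_test))
```

Because the search never sees the test set, the parameters it picks are not guaranteed to give the highest test score.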

## Best Models
### Model 8, TfidfVectorizer, MultinomialNB, with 'no_selftext'
This was the best model as measured by score, with particular focus on test score. It's also the best fit model. I experimented with adding other features, but the 'no_selftext' feature was most predictive (in fact, other features made the model weaker).

The parameters:
* 'nb__alpha': 0.2 
* 'tvec__max_df': 0.7 
* 'tvec__max_features': 4800 
* 'tvec__ngram_range': (1, 1) 
* 'tvec__stop_words': expanded_proper_names
* 'no_selftext' added to dataframe

Training score: 0.9249376079861777\
Test score: 0.8733448474381117
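
One possible way to feed a numeric flag such as 'no_selftext' into the same model as the text is a `ColumnTransformer`. This is only a sketch under that assumption; the notebook may combine the features differently, and the four rows below are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up rows; the column names mirror the notebook's dataframe.
toy = pd.DataFrame({
    'all_words': ['the ship went to warp', 'use the force',
                  'beam me up now', 'pilot the falcon'],
    'no_selftext': [0, 1, 0, 1],
    'subreddit_name': ['startrek', 'starwars', 'startrek', 'starwars'],
})

# TF-IDF on the text column (selected by a string so the vectorizer
# receives a 1-D series); the 0/1 flag passes through unchanged. Both are
# non-negative, which MultinomialNB requires.
features = ColumnTransformer([
    ('tvec', TfidfVectorizer(), 'all_words'),
    ('flag', 'passthrough', ['no_selftext']),
])

model = Pipeline([('features', features), ('nb', MultinomialNB(alpha=0.2))])
X_toy = toy[['all_words', 'no_selftext']]
model.fit(X_toy, toy['subreddit_name'])

predictions = model.predict(X_toy)
print(predictions)
```

The transformed matrix has one column per vocabulary term plus one extra column for the flag.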

### Model 9, TfidfVectorizer, LogisticRegression, with 'no_selftext'
This was the best LogisticRegression model as measured by score, with particular focus on test score. I'm particularly interested in LogisticRegression models so that I can do some inference and see what I learn. Again, I experimented with adding other features, but the 'no_selftext' feature was most predictive (in fact, other features made the model weaker).

The parameters:
* 'log__C': 1.4
* 'tvec__max_df': 0.12
* 'tvec__max_features': 6500
* 'tvec__ngram_range': (1, 2)
* 'tvec__stop_words': expanded_stopwords
* 'no_selftext' added to dataframe

Training score: 0.9149548857746208
Test score: 0.8480138169257341

## Acknowledgements
I initially modeled my workflow on the NLP Breakfast Hour, though I think it wound up changing a lot from there.

**IMPORTANT:** I ran the following models with n_jobs = -1. If you want to run any of them without n_jobs = -1, you can change `fun.pipe_grid_njobs` to `fun.pipe_grid`.
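
The helper itself lives in the `functions` module and is not shown in this notebook. As a guess at its general shape, a helper like this could wrap the steps in a `Pipeline` and return an unfitted `GridSearchCV`. The name `pipe_grid_sketch` and its signature are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def pipe_grid_sketch(pipe_steps, grid_params, n_jobs=None):
    """Hypothetical stand-in for the helpers in the functions module.

    Wraps the steps in a Pipeline and returns an unfitted GridSearchCV.
    Pass n_jobs=-1 to use all available cores.
    """
    return GridSearchCV(Pipeline(pipe_steps), grid_params, n_jobs=n_jobs)


# Usage mirroring the cells below, with a one-value grid for brevity.
steps = [('tvec', TfidfVectorizer()), ('nb', MultinomialNB())]
gs_sketch = pipe_grid_sketch(steps, {'nb__alpha': [0.25]}, n_jobs=-1)
print(type(gs_sketch).__name__, gs_sketch.n_jobs)
```

With `n_jobs=None` the search runs on a single core, which is slower but avoids spawning worker processes.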

# Model 1, TfidfVectorizer, MultinomialNB
With proper names left in, my best model overall was a TfidfVectorizer/MultinomialNB model, so I started with that combination here.

Best parameters: {'nb__alpha': 0.25, 'tvec__max_df': 0.7, 'tvec__max_features': 4800, 'tvec__ngram_range': (1, 1), 'tvec__stop_words': expanded_proper_names}
Training score: 0.9199462468803993
Test score: 0.8514680483592401

In [29]:
pipe_params = [('tvec', TfidfVectorizer()), 
               ('nb', MultinomialNB())]

grid_params = {
    'tvec__ngram_range': [(1,1),(1,2)],
    'tvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'tvec__max_df': [0.1, 0.40, 0.9],
    'tvec__max_features': [2_000, 3_000, 4_900],
    'nb__alpha': [0.5, 1.0, 1.5]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print('Time to run:', time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run: 37.219505071640015
Best parameters: {'nb__alpha': 0.5, 'tvec__max_df': 0.4, 'tvec__max_features': 4900, 'tvec__ngram_range': (1, 1), 'tvec__stop_words': ['seven', 'clone', 'warp', 'borg', 'trilogy', 'contact', 'anakin', 'paramount', 'leia', 'kirk', 'wan', 'jedi', 'kenobi', 'snw', 'wars', 'vader', 'order', 'skywalker', 'klingon', 'starfleet', 'ds9', 'captain', 'maul', 'luke', 'obi', 'rebels', 'data', 'voyager', 'st', 'discovery', 'federation', 'pike', 'picard', 'mandalorian', 'klingons', 'star', 'tng', 'reva', 'strange', 'disney', 'worf', 'riker', 'empire', 'jurati', 'palpatine', 'yoda', 'force', 'darth', 'republic', 'lightsaber', 'sith', 'spock', 'boba', 'fett', 'inquisitor', 'trek', 'enterprise', 'tos', 'padme']}
Training score: 0.9187943943175274
Test score: 0.8497409326424871


In [30]:
pipe_params = [('tvec', TfidfVectorizer()), 
               ('nb', MultinomialNB())]

grid_params = {
    'tvec__ngram_range': [(1,1),(1,2)],
    'tvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'tvec__max_df': [0.1, 0.40, 0.9],
    'tvec__max_features': [2_000, 3_000, 4_900],
    'nb__alpha': [0.3, 0.5, 0.7]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print('Time to run:', time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run: 39.72764801979065
Best parameters: {'nb__alpha': 0.3, 'tvec__max_df': 0.9, 'tvec__max_features': 4900, 'tvec__ngram_range': (1, 1), 'tvec__stop_words': ['seven', 'clone', 'warp', 'borg', 'trilogy', 'contact', 'anakin', 'paramount', 'leia', 'kirk', 'wan', 'jedi', 'kenobi', 'snw', 'wars', 'vader', 'order', 'skywalker', 'klingon', 'starfleet', 'ds9', 'captain', 'maul', 'luke', 'obi', 'rebels', 'data', 'voyager', 'st', 'discovery', 'federation', 'pike', 'picard', 'mandalorian', 'klingons', 'star', 'tng', 'reva', 'strange', 'disney', 'worf', 'riker', 'empire', 'jurati', 'palpatine', 'yoda', 'force', 'darth', 'republic', 'lightsaber', 'sith', 'spock', 'boba', 'fett', 'inquisitor', 'trek', 'enterprise', 'tos', 'padme']}
Training score: 0.9207141485889806
Test score: 0.8508923431203224


In [31]:
pipe_params = [('tvec', TfidfVectorizer()), 
               ('nb', MultinomialNB())]

grid_params = {
    'tvec__ngram_range': [(1,1),(1,2)],
    'tvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'tvec__max_df': [0.1, 0.40, 0.9],
    'tvec__max_features': [2_000, 3_000, 4_900],
    'nb__alpha': [0.2, 0.3, 0.4]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print('Time to run:', time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run: 38.770002126693726
Best parameters: {'nb__alpha': 0.3, 'tvec__max_df': 0.9, 'tvec__max_features': 4900, 'tvec__ngram_range': (1, 1), 'tvec__stop_words': ['seven', 'clone', 'warp', 'borg', 'trilogy', 'contact', 'anakin', 'paramount', 'leia', 'kirk', 'wan', 'jedi', 'kenobi', 'snw', 'wars', 'vader', 'order', 'skywalker', 'klingon', 'starfleet', 'ds9', 'captain', 'maul', 'luke', 'obi', 'rebels', 'data', 'voyager', 'st', 'discovery', 'federation', 'pike', 'picard', 'mandalorian', 'klingons', 'star', 'tng', 'reva', 'strange', 'disney', 'worf', 'riker', 'empire', 'jurati', 'palpatine', 'yoda', 'force', 'darth', 'republic', 'lightsaber', 'sith', 'spock', 'boba', 'fett', 'inquisitor', 'trek', 'enterprise', 'tos', 'padme']}
Training score: 0.9207141485889806
Test score: 0.8508923431203224


In [32]:
pipe_params = [('tvec', TfidfVectorizer()), 
               ('nb', MultinomialNB())]

grid_params = {
    'tvec__ngram_range': [(1,1),(1,2)],
    'tvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'tvec__max_df': [0.1, 0.40, 0.9],
    'tvec__max_features': [2_000, 3_000, 4_900],
    'nb__alpha': [0.25, 0.3, 0.35]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print('Time to run:', time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run: 40.09021806716919
Best parameters: {'nb__alpha': 0.25, 'tvec__max_df': 0.9, 'tvec__max_features': 4900, 'tvec__ngram_range': (1, 1), 'tvec__stop_words': ['seven', 'clone', 'warp', 'borg', 'trilogy', 'contact', 'anakin', 'paramount', 'leia', 'kirk', 'wan', 'jedi', 'kenobi', 'snw', 'wars', 'vader', 'order', 'skywalker', 'klingon', 'starfleet', 'ds9', 'captain', 'maul', 'luke', 'obi', 'rebels', 'data', 'voyager', 'st', 'discovery', 'federation', 'pike', 'picard', 'mandalorian', 'klingons', 'star', 'tng', 'reva', 'strange', 'disney', 'worf', 'riker', 'empire', 'jurati', 'palpatine', 'yoda', 'force', 'darth', 'republic', 'lightsaber', 'sith', 'spock', 'boba', 'fett', 'inquisitor', 'trek', 'enterprise', 'tos', 'padme']}
Training score: 0.9218660011518526
Test score: 0.8514680483592401


In [34]:
pipe_params = [('tvec', TfidfVectorizer()), 
               ('nb', MultinomialNB())]

grid_params = {
    'tvec__ngram_range': [(1,1),(1,2)],
    'tvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'tvec__max_df': [0.1, 0.40, 0.9],
    'tvec__max_features': [4_700, 4_900, 5_100],
    'nb__alpha': [0.25]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print('Time to run:', time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run: 12.288998126983643
Best parameters: {'nb__alpha': 0.25, 'tvec__max_df': 0.9, 'tvec__max_features': 4900, 'tvec__ngram_range': (1, 1), 'tvec__stop_words': ['seven', 'clone', 'warp', 'borg', 'trilogy', 'contact', 'anakin', 'paramount', 'leia', 'kirk', 'wan', 'jedi', 'kenobi', 'snw', 'wars', 'vader', 'order', 'skywalker', 'klingon', 'starfleet', 'ds9', 'captain', 'maul', 'luke', 'obi', 'rebels', 'data', 'voyager', 'st', 'discovery', 'federation', 'pike', 'picard', 'mandalorian', 'klingons', 'star', 'tng', 'reva', 'strange', 'disney', 'worf', 'riker', 'empire', 'jurati', 'palpatine', 'yoda', 'force', 'darth', 'republic', 'lightsaber', 'sith', 'spock', 'boba', 'fett', 'inquisitor', 'trek', 'enterprise', 'tos', 'padme']}
Training score: 0.9218660011518526
Test score: 0.8514680483592401


In [35]:
pipe_params = [('tvec', TfidfVectorizer()), 
               ('nb', MultinomialNB())]

grid_params = {
    'tvec__ngram_range': [(1,1),(1,2)],
    'tvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'tvec__max_df': [0.1, 0.40, 0.9],
    'tvec__max_features': [4_800, 4_900, 5_000],
    'nb__alpha': [0.25]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print('Time to run:', time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run: 12.847345113754272
Best parameters: {'nb__alpha': 0.25, 'tvec__max_df': 0.9, 'tvec__max_features': 4800, 'tvec__ngram_range': (1, 1), 'tvec__stop_words': ['seven', 'clone', 'warp', 'borg', 'trilogy', 'contact', 'anakin', 'paramount', 'leia', 'kirk', 'wan', 'jedi', 'kenobi', 'snw', 'wars', 'vader', 'order', 'skywalker', 'klingon', 'starfleet', 'ds9', 'captain', 'maul', 'luke', 'obi', 'rebels', 'data', 'voyager', 'st', 'discovery', 'federation', 'pike', 'picard', 'mandalorian', 'klingons', 'star', 'tng', 'reva', 'strange', 'disney', 'worf', 'riker', 'empire', 'jurati', 'palpatine', 'yoda', 'force', 'darth', 'republic', 'lightsaber', 'sith', 'spock', 'boba', 'fett', 'inquisitor', 'trek', 'enterprise', 'tos', 'padme']}
Training score: 0.9199462468803993
Test score: 0.8514680483592401


In [36]:
pipe_params = [('tvec', TfidfVectorizer()), 
               ('nb', MultinomialNB())]

grid_params = {
    'tvec__ngram_range': [(1,1),(1,2)],
    'tvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'tvec__max_df': [0.1, 0.40, 0.9],
    'tvec__max_features': [4_700, 4_800],
    'nb__alpha': [0.25]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print('Time to run:', time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run: 8.708423852920532
Best parameters: {'nb__alpha': 0.25, 'tvec__max_df': 0.9, 'tvec__max_features': 4800, 'tvec__ngram_range': (1, 1), 'tvec__stop_words': ['seven', 'clone', 'warp', 'borg', 'trilogy', 'contact', 'anakin', 'paramount', 'leia', 'kirk', 'wan', 'jedi', 'kenobi', 'snw', 'wars', 'vader', 'order', 'skywalker', 'klingon', 'starfleet', 'ds9', 'captain', 'maul', 'luke', 'obi', 'rebels', 'data', 'voyager', 'st', 'discovery', 'federation', 'pike', 'picard', 'mandalorian', 'klingons', 'star', 'tng', 'reva', 'strange', 'disney', 'worf', 'riker', 'empire', 'jurati', 'palpatine', 'yoda', 'force', 'darth', 'republic', 'lightsaber', 'sith', 'spock', 'boba', 'fett', 'inquisitor', 'trek', 'enterprise', 'tos', 'padme']}
Training score: 0.9199462468803993
Test score: 0.8514680483592401


In [37]:
pipe_params = [('tvec', TfidfVectorizer()), 
               ('nb', MultinomialNB())]

grid_params = {
    'tvec__ngram_range': [(1,1),(1,2)],
    'tvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'tvec__max_df': [0.8, 0.9, 1.0],  # float 1.0 = proportion of documents; an int 1 would be read as a count of 1 document
    'tvec__max_features': [4_800],
    'nb__alpha': [0.25]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print('Time to run:', time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run: 4.433788061141968
Best parameters: {'nb__alpha': 0.25, 'tvec__max_df': 0.8, 'tvec__max_features': 4800, 'tvec__ngram_range': (1, 1), 'tvec__stop_words': ['seven', 'clone', 'warp', 'borg', 'trilogy', 'contact', 'anakin', 'paramount', 'leia', 'kirk', 'wan', 'jedi', 'kenobi', 'snw', 'wars', 'vader', 'order', 'skywalker', 'klingon', 'starfleet', 'ds9', 'captain', 'maul', 'luke', 'obi', 'rebels', 'data', 'voyager', 'st', 'discovery', 'federation', 'pike', 'picard', 'mandalorian', 'klingons', 'star', 'tng', 'reva', 'strange', 'disney', 'worf', 'riker', 'empire', 'jurati', 'palpatine', 'yoda', 'force', 'darth', 'republic', 'lightsaber', 'sith', 'spock', 'boba', 'fett', 'inquisitor', 'trek', 'enterprise', 'tos', 'padme']}
Training score: 0.9199462468803993
Test score: 0.8514680483592401


In [38]:
pipe_params = [('tvec', TfidfVectorizer()), 
               ('nb', MultinomialNB())]

grid_params = {
    'tvec__ngram_range': [(1,1),(1,2)],
    'tvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'tvec__max_df': [0.7, 0.8],
    'tvec__max_features': [4_800],
    'nb__alpha': [0.25]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print('Time to run:', time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run: 3.195868730545044
Best parameters: {'nb__alpha': 0.25, 'tvec__max_df': 0.7, 'tvec__max_features': 4800, 'tvec__ngram_range': (1, 1), 'tvec__stop_words': ['seven', 'clone', 'warp', 'borg', 'trilogy', 'contact', 'anakin', 'paramount', 'leia', 'kirk', 'wan', 'jedi', 'kenobi', 'snw', 'wars', 'vader', 'order', 'skywalker', 'klingon', 'starfleet', 'ds9', 'captain', 'maul', 'luke', 'obi', 'rebels', 'data', 'voyager', 'st', 'discovery', 'federation', 'pike', 'picard', 'mandalorian', 'klingons', 'star', 'tng', 'reva', 'strange', 'disney', 'worf', 'riker', 'empire', 'jurati', 'palpatine', 'yoda', 'force', 'darth', 'republic', 'lightsaber', 'sith', 'spock', 'boba', 'fett', 'inquisitor', 'trek', 'enterprise', 'tos', 'padme']}
Training score: 0.9199462468803993
Test score: 0.8514680483592401


In [39]:
pipe_params = [('tvec', TfidfVectorizer()), 
               ('nb', MultinomialNB())]

grid_params = {
    'tvec__ngram_range': [(1,1),(1,2),(1,3)],
    'tvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'tvec__max_df': [0.7],
    'tvec__max_features': [4_800],
    'nb__alpha': [0.25]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print('Time to run:', time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run: 4.024868011474609
Best parameters: {'nb__alpha': 0.25, 'tvec__max_df': 0.7, 'tvec__max_features': 4800, 'tvec__ngram_range': (1, 1), 'tvec__stop_words': ['seven', 'clone', 'warp', 'borg', 'trilogy', 'contact', 'anakin', 'paramount', 'leia', 'kirk', 'wan', 'jedi', 'kenobi', 'snw', 'wars', 'vader', 'order', 'skywalker', 'klingon', 'starfleet', 'ds9', 'captain', 'maul', 'luke', 'obi', 'rebels', 'data', 'voyager', 'st', 'discovery', 'federation', 'pike', 'picard', 'mandalorian', 'klingons', 'star', 'tng', 'reva', 'strange', 'disney', 'worf', 'riker', 'empire', 'jurati', 'palpatine', 'yoda', 'force', 'darth', 'republic', 'lightsaber', 'sith', 'spock', 'boba', 'fett', 'inquisitor', 'trek', 'enterprise', 'tos', 'padme']}
Training score: 0.9199462468803993
Test score: 0.8514680483592401


In [40]:
pipe_params = [('tvec', TfidfVectorizer()), 
               ('nb', MultinomialNB())]

grid_params = {
    'tvec__ngram_range': [(1,1)],
    'tvec__stop_words': [expanded_proper_names],
    'tvec__max_df': [0.7],
    'tvec__max_features': [4_800],
    'nb__alpha': [0.25]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print('Time to run:', time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run: 0.5145351886749268
Best parameters: {'nb__alpha': 0.25, 'tvec__max_df': 0.7, 'tvec__max_features': 4800, 'tvec__ngram_range': (1, 1), 'tvec__stop_words': ['seven', 'clone', 'warp', 'borg', 'trilogy', 'contact', 'anakin', 'paramount', 'leia', 'kirk', 'wan', 'jedi', 'kenobi', 'snw', 'wars', 'vader', 'order', 'skywalker', 'klingon', 'starfleet', 'ds9', 'captain', 'maul', 'luke', 'obi', 'rebels', 'data', 'voyager', 'st', 'discovery', 'federation', 'pike', 'picard', 'mandalorian', 'klingons', 'star', 'tng', 'reva', 'strange', 'disney', 'worf', 'riker', 'empire', 'jurati', 'palpatine', 'yoda', 'force', 'darth', 'republic', 'lightsaber', 'sith', 'spock', 'boba', 'fett', 'inquisitor', 'trek', 'enterprise', 'tos', 'padme']}
Training score: 0.9199462468803993
Test score: 0.8514680483592401


# Model 2, CountVectorizer, MultinomialNB 
Without proper names removed, the second best model was the optimized CountVectorizer/MultinomialNB model, so I worked through this next.

Time to run 0.4991631507873535
Best parameters: {'cvec__max_df': 0.34, 'cvec__max_features': 6600, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': expanded_proper_names, 'nb__alpha': 0.5}
Training score: 0.8871184488385486
Test score: 0.8457109959700633

This is the better fit of the two top-performing models, with performance measured by test score (accuracy).

In [48]:
pipe_params = [('cvec', CountVectorizer()), 
               ('nb', MultinomialNB())]

grid_params = {
    'cvec__ngram_range': [(1,1), (1,2)],
    'cvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'cvec__max_df': [0.34, 0.64, 0.94],
    'cvec__max_features': [2_500, 4_500, 6_500],
    'nb__alpha': [0.5, 1.0, 1.5]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print('Time to run', time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run 35.862080097198486
Best parameters: {'cvec__max_df': 0.34, 'cvec__max_features': 6500, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': ['seven', 'clone', 'warp', 'borg', 'trilogy', 'contact', 'anakin', 'paramount', 'leia', 'kirk', 'wan', 'jedi', 'kenobi', 'snw', 'wars', 'vader', 'order', 'skywalker', 'klingon', 'starfleet', 'ds9', 'captain', 'maul', 'luke', 'obi', 'rebels', 'data', 'voyager', 'st', 'discovery', 'federation', 'pike', 'picard', 'mandalorian', 'klingons', 'star', 'tng', 'reva', 'strange', 'disney', 'worf', 'riker', 'empire', 'jurati', 'palpatine', 'yoda', 'force', 'darth', 'republic', 'lightsaber', 'sith', 'spock', 'boba', 'fett', 'inquisitor', 'trek', 'enterprise', 'tos', 'padme'], 'nb__alpha': 0.5}
Training score: 0.8869264734114033
Test score: 0.8457109959700633


In [50]:
pipe_params = [('cvec', CountVectorizer()), 
               ('nb', MultinomialNB())]

grid_params = {
    'cvec__ngram_range': [(1,1), (1,2)],
    'cvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'cvec__max_df': [0.2, 0.3, 0.4],
    'cvec__max_features': [2_500, 4_500, 6_500],
    'nb__alpha': [0.5, 1.0, 1.5]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print('Time to run', time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run 35.79885125160217
Best parameters: {'cvec__max_df': 0.3, 'cvec__max_features': 6500, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': ['seven', 'clone', 'warp', 'borg', 'trilogy', 'contact', 'anakin', 'paramount', 'leia', 'kirk', 'wan', 'jedi', 'kenobi', 'snw', 'wars', 'vader', 'order', 'skywalker', 'klingon', 'starfleet', 'ds9', 'captain', 'maul', 'luke', 'obi', 'rebels', 'data', 'voyager', 'st', 'discovery', 'federation', 'pike', 'picard', 'mandalorian', 'klingons', 'star', 'tng', 'reva', 'strange', 'disney', 'worf', 'riker', 'empire', 'jurati', 'palpatine', 'yoda', 'force', 'darth', 'republic', 'lightsaber', 'sith', 'spock', 'boba', 'fett', 'inquisitor', 'trek', 'enterprise', 'tos', 'padme'], 'nb__alpha': 0.5}
Training score: 0.887310424265694
Test score: 0.8451352907311457


In [51]:
pipe_params = [('cvec', CountVectorizer()), 
               ('nb', MultinomialNB())]

grid_params = {
    'cvec__ngram_range': [(1,1), (1,2)],
    'cvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'cvec__max_df': [0.4],
    'cvec__max_features': [2_500, 4_500, 6_500],
    'nb__alpha': [0.5, 1.0, 1.5]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print('Time to run', time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run 12.410980939865112
Best parameters: {'cvec__max_df': 0.4, 'cvec__max_features': 6500, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': ['seven', 'clone', 'warp', 'borg', 'trilogy', 'contact', 'anakin', 'paramount', 'leia', 'kirk', 'wan', 'jedi', 'kenobi', 'snw', 'wars', 'vader', 'order', 'skywalker', 'klingon', 'starfleet', 'ds9', 'captain', 'maul', 'luke', 'obi', 'rebels', 'data', 'voyager', 'st', 'discovery', 'federation', 'pike', 'picard', 'mandalorian', 'klingons', 'star', 'tng', 'reva', 'strange', 'disney', 'worf', 'riker', 'empire', 'jurati', 'palpatine', 'yoda', 'force', 'darth', 'republic', 'lightsaber', 'sith', 'spock', 'boba', 'fett', 'inquisitor', 'trek', 'enterprise', 'tos', 'padme'], 'nb__alpha': 0.5}
Training score: 0.8865425225571127
Test score: 0.844559585492228


In [52]:
pipe_params = [('cvec', CountVectorizer()), 
               ('nb', MultinomialNB())]

grid_params = {
    'cvec__ngram_range': [(1,1), (1,2)],
    'cvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'cvec__max_df': [0.34, 0.35],
    'cvec__max_features': [2_500, 4_500, 6_500],
    'nb__alpha': [0.5, 1.0, 1.5]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print('Time to run', time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run 25.22023916244507
Best parameters: {'cvec__max_df': 0.34, 'cvec__max_features': 6500, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': ['seven', 'clone', 'warp', 'borg', 'trilogy', 'contact', 'anakin', 'paramount', 'leia', 'kirk', 'wan', 'jedi', 'kenobi', 'snw', 'wars', 'vader', 'order', 'skywalker', 'klingon', 'starfleet', 'ds9', 'captain', 'maul', 'luke', 'obi', 'rebels', 'data', 'voyager', 'st', 'discovery', 'federation', 'pike', 'picard', 'mandalorian', 'klingons', 'star', 'tng', 'reva', 'strange', 'disney', 'worf', 'riker', 'empire', 'jurati', 'palpatine', 'yoda', 'force', 'darth', 'republic', 'lightsaber', 'sith', 'spock', 'boba', 'fett', 'inquisitor', 'trek', 'enterprise', 'tos', 'padme'], 'nb__alpha': 0.5}
Training score: 0.8869264734114033
Test score: 0.8457109959700633


In [54]:
pipe_params = [('cvec', CountVectorizer()), 
               ('nb', MultinomialNB())]

grid_params = {
    'cvec__ngram_range': [(1,1), (1,2)],
    'cvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'cvec__max_df': [0.33, 0.34],
    'cvec__max_features': [2_500, 4_500, 6_500],
    'nb__alpha': [0.5, 1.0, 1.5]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print('Time to run', time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run 28.40887975692749
Best parameters: {'cvec__max_df': 0.33, 'cvec__max_features': 6500, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': ['seven', 'clone', 'warp', 'borg', 'trilogy', 'contact', 'anakin', 'paramount', 'leia', 'kirk', 'wan', 'jedi', 'kenobi', 'snw', 'wars', 'vader', 'order', 'skywalker', 'klingon', 'starfleet', 'ds9', 'captain', 'maul', 'luke', 'obi', 'rebels', 'data', 'voyager', 'st', 'discovery', 'federation', 'pike', 'picard', 'mandalorian', 'klingons', 'star', 'tng', 'reva', 'strange', 'disney', 'worf', 'riker', 'empire', 'jurati', 'palpatine', 'yoda', 'force', 'darth', 'republic', 'lightsaber', 'sith', 'spock', 'boba', 'fett', 'inquisitor', 'trek', 'enterprise', 'tos', 'padme'], 'nb__alpha': 0.5}
Training score: 0.8884622768285659
Test score: 0.8451352907311457


In [55]:
pipe_params = [('cvec', CountVectorizer()), 
               ('nb', MultinomialNB())]

grid_params = {
    'cvec__ngram_range': [(1,1), (1,2)],
    'cvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'cvec__max_df': [0.34],
    'cvec__max_features': [5_500, 6_500, 7_500],
    'nb__alpha': [0.5, 1.0, 1.5]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print('Time to run', time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run 13.449173927307129
Best parameters: {'cvec__max_df': 0.34, 'cvec__max_features': 6500, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': ['seven', 'clone', 'warp', 'borg', 'trilogy', 'contact', 'anakin', 'paramount', 'leia', 'kirk', 'wan', 'jedi', 'kenobi', 'snw', 'wars', 'vader', 'order', 'skywalker', 'klingon', 'starfleet', 'ds9', 'captain', 'maul', 'luke', 'obi', 'rebels', 'data', 'voyager', 'st', 'discovery', 'federation', 'pike', 'picard', 'mandalorian', 'klingons', 'star', 'tng', 'reva', 'strange', 'disney', 'worf', 'riker', 'empire', 'jurati', 'palpatine', 'yoda', 'force', 'darth', 'republic', 'lightsaber', 'sith', 'spock', 'boba', 'fett', 'inquisitor', 'trek', 'enterprise', 'tos', 'padme'], 'nb__alpha': 0.5}
Training score: 0.8869264734114033
Test score: 0.8457109959700633


In [56]:
pipe_params = [('cvec', CountVectorizer()), 
               ('nb', MultinomialNB())]

grid_params = {
    'cvec__ngram_range': [(1,1), (1,2)],
    'cvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'cvec__max_df': [0.34],
    'cvec__max_features': [6_000, 6_500, 7_000],
    'nb__alpha': [0.5, 1.0, 1.5]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print('Time to run', time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run 12.116826295852661
Best parameters: {'cvec__max_df': 0.34, 'cvec__max_features': 6500, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': ['seven', 'clone', 'warp', 'borg', 'trilogy', 'contact', 'anakin', 'paramount', 'leia', 'kirk', 'wan', 'jedi', 'kenobi', 'snw', 'wars', 'vader', 'order', 'skywalker', 'klingon', 'starfleet', 'ds9', 'captain', 'maul', 'luke', 'obi', 'rebels', 'data', 'voyager', 'st', 'discovery', 'federation', 'pike', 'picard', 'mandalorian', 'klingons', 'star', 'tng', 'reva', 'strange', 'disney', 'worf', 'riker', 'empire', 'jurati', 'palpatine', 'yoda', 'force', 'darth', 'republic', 'lightsaber', 'sith', 'spock', 'boba', 'fett', 'inquisitor', 'trek', 'enterprise', 'tos', 'padme'], 'nb__alpha': 0.5}
Training score: 0.8869264734114033
Test score: 0.8457109959700633


In [57]:
pipe_params = [('cvec', CountVectorizer()), 
               ('nb', MultinomialNB())]

grid_params = {
    'cvec__ngram_range': [(1,1), (1,2)],
    'cvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'cvec__max_df': [0.34],
    'cvec__max_features': [6_400, 6_500, 6_600],
    'nb__alpha': [0.5, 1.0, 1.5]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print('Time to run', time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run 12.709076166152954
Best parameters: {'cvec__max_df': 0.34, 'cvec__max_features': 6600, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': ['seven', 'clone', 'warp', 'borg', 'trilogy', 'contact', 'anakin', 'paramount', 'leia', 'kirk', 'wan', 'jedi', 'kenobi', 'snw', 'wars', 'vader', 'order', 'skywalker', 'klingon', 'starfleet', 'ds9', 'captain', 'maul', 'luke', 'obi', 'rebels', 'data', 'voyager', 'st', 'discovery', 'federation', 'pike', 'picard', 'mandalorian', 'klingons', 'star', 'tng', 'reva', 'strange', 'disney', 'worf', 'riker', 'empire', 'jurati', 'palpatine', 'yoda', 'force', 'darth', 'republic', 'lightsaber', 'sith', 'spock', 'boba', 'fett', 'inquisitor', 'trek', 'enterprise', 'tos', 'padme'], 'nb__alpha': 0.5}
Training score: 0.8871184488385486
Test score: 0.8457109959700633


In [58]:
pipe_params = [('cvec', CountVectorizer()), 
               ('nb', MultinomialNB())]

grid_params = {
    'cvec__ngram_range': [(1,1), (1,2)],
    'cvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'cvec__max_df': [0.34],
    'cvec__max_features': [6_600, 6_700, 6_800],
    'nb__alpha': [0.5, 1.0, 1.5]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print('Time to run', time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run 13.818279027938843
Best parameters: {'cvec__max_df': 0.34, 'cvec__max_features': 6800, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': ['seven', 'clone', 'warp', 'borg', 'trilogy', 'contact', 'anakin', 'paramount', 'leia', 'kirk', 'wan', 'jedi', 'kenobi', 'snw', 'wars', 'vader', 'order', 'skywalker', 'klingon', 'starfleet', 'ds9', 'captain', 'maul', 'luke', 'obi', 'rebels', 'data', 'voyager', 'st', 'discovery', 'federation', 'pike', 'picard', 'mandalorian', 'klingons', 'star', 'tng', 'reva', 'strange', 'disney', 'worf', 'riker', 'empire', 'jurati', 'palpatine', 'yoda', 'force', 'darth', 'republic', 'lightsaber', 'sith', 'spock', 'boba', 'fett', 'inquisitor', 'trek', 'enterprise', 'tos', 'padme'], 'nb__alpha': 0.5}
Training score: 0.8880783259742753
Test score: 0.844559585492228


In [59]:
pipe_params = [('cvec', CountVectorizer()), 
               ('nb', MultinomialNB())]

grid_params = {
    'cvec__ngram_range': [(1,1), (1,2)],
    'cvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'cvec__max_df': [0.34],
    'cvec__max_features': [6_600, 6_700],
    'nb__alpha': [0.5, 1.0, 1.5]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print('Time to run', time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run 9.655218839645386
Best parameters: {'cvec__max_df': 0.34, 'cvec__max_features': 6600, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': ['seven', 'clone', 'warp', 'borg', 'trilogy', 'contact', 'anakin', 'paramount', 'leia', 'kirk', 'wan', 'jedi', 'kenobi', 'snw', 'wars', 'vader', 'order', 'skywalker', 'klingon', 'starfleet', 'ds9', 'captain', 'maul', 'luke', 'obi', 'rebels', 'data', 'voyager', 'st', 'discovery', 'federation', 'pike', 'picard', 'mandalorian', 'klingons', 'star', 'tng', 'reva', 'strange', 'disney', 'worf', 'riker', 'empire', 'jurati', 'palpatine', 'yoda', 'force', 'darth', 'republic', 'lightsaber', 'sith', 'spock', 'boba', 'fett', 'inquisitor', 'trek', 'enterprise', 'tos', 'padme'], 'nb__alpha': 0.5}
Training score: 0.8871184488385486
Test score: 0.8457109959700633


In [60]:
pipe_params = [('cvec', CountVectorizer()), 
               ('nb', MultinomialNB())]

grid_params = {
    'cvec__ngram_range': [(1,1), (1,2)],
    'cvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'cvec__max_df': [0.34],
    'cvec__max_features': [6_600],
    'nb__alpha': [0.25, 0.5, 0.75]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print('Time to run', time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run 4.406274080276489
Best parameters: {'cvec__max_df': 0.34, 'cvec__max_features': 6600, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': ['seven', 'clone', 'warp', 'borg', 'trilogy', 'contact', 'anakin', 'paramount', 'leia', 'kirk', 'wan', 'jedi', 'kenobi', 'snw', 'wars', 'vader', 'order', 'skywalker', 'klingon', 'starfleet', 'ds9', 'captain', 'maul', 'luke', 'obi', 'rebels', 'data', 'voyager', 'st', 'discovery', 'federation', 'pike', 'picard', 'mandalorian', 'klingons', 'star', 'tng', 'reva', 'strange', 'disney', 'worf', 'riker', 'empire', 'jurati', 'palpatine', 'yoda', 'force', 'darth', 'republic', 'lightsaber', 'sith', 'spock', 'boba', 'fett', 'inquisitor', 'trek', 'enterprise', 'tos', 'padme'], 'nb__alpha': 0.25}
Training score: 0.8905740065271646
Test score: 0.8451352907311457


In [62]:
pipe_params = [('cvec', CountVectorizer()), 
               ('nb', MultinomialNB())]

grid_params = {
    'cvec__ngram_range': [(1,1), (1,2)],
    'cvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'cvec__max_df': [0.34],
    'cvec__max_features': [6_600],
    'nb__alpha': [0.4, 0.5, 0.6]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print('Time to run', time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run 4.289211988449097
Best parameters: {'cvec__max_df': 0.34, 'cvec__max_features': 6600, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': ['seven', 'clone', 'warp', 'borg', 'trilogy', 'contact', 'anakin', 'paramount', 'leia', 'kirk', 'wan', 'jedi', 'kenobi', 'snw', 'wars', 'vader', 'order', 'skywalker', 'klingon', 'starfleet', 'ds9', 'captain', 'maul', 'luke', 'obi', 'rebels', 'data', 'voyager', 'st', 'discovery', 'federation', 'pike', 'picard', 'mandalorian', 'klingons', 'star', 'tng', 'reva', 'strange', 'disney', 'worf', 'riker', 'empire', 'jurati', 'palpatine', 'yoda', 'force', 'darth', 'republic', 'lightsaber', 'sith', 'spock', 'boba', 'fett', 'inquisitor', 'trek', 'enterprise', 'tos', 'padme'], 'nb__alpha': 0.6}
Training score: 0.8861585717028221
Test score: 0.8451352907311457


In [63]:
# BEST PERFORMANCE, MODEL 2
pipe_params = [('cvec', CountVectorizer()), 
               ('nb', MultinomialNB())]

grid_params = {
    'cvec__ngram_range': [(1,1)],
    'cvec__stop_words': [expanded_proper_names],
    'cvec__max_df': [0.34],
    'cvec__max_features': [6_600],
    'nb__alpha': [0.5]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print('Time to run', time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run 0.4991631507873535
Best parameters: {'cvec__max_df': 0.34, 'cvec__max_features': 6600, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': ['seven', 'clone', 'warp', 'borg', 'trilogy', 'contact', 'anakin', 'paramount', 'leia', 'kirk', 'wan', 'jedi', 'kenobi', 'snw', 'wars', 'vader', 'order', 'skywalker', 'klingon', 'starfleet', 'ds9', 'captain', 'maul', 'luke', 'obi', 'rebels', 'data', 'voyager', 'st', 'discovery', 'federation', 'pike', 'picard', 'mandalorian', 'klingons', 'star', 'tng', 'reva', 'strange', 'disney', 'worf', 'riker', 'empire', 'jurati', 'palpatine', 'yoda', 'force', 'darth', 'republic', 'lightsaber', 'sith', 'spock', 'boba', 'fett', 'inquisitor', 'trek', 'enterprise', 'tos', 'padme'], 'nb__alpha': 0.5}
Training score: 0.8871184488385486
Test score: 0.8457109959700633


# Model 3, CountVectorizer & LogisticRegression
Without proper names removed, the optimized CountVectorizer and LogisticRegression combination ranked third, so I tuned it third.

Best parameters: {'cvec__max_df': 0.09, 'cvec__max_features': 6500, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': expanded_stopwords, 'log__C': 1.1}
Training score: 0.9537339220579766
Test score: 0.8215313759355211
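
The cells in this notebook call `fun.pipe_grid_njobs`, whose definition lives in `functions.py` and is not shown here. The sketch below is only a guess at what such a helper might look like, assuming it wraps the steps in a `Pipeline` and returns an unfitted `GridSearchCV`. The name `pipe_grid_sketch`, the `cv` and `n_jobs` defaults, and the toy data are all hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def pipe_grid_sketch(pipe_steps, grid_params, cv=5, n_jobs=-1):
    """Hypothetical stand-in for fun.pipe_grid_njobs (assumed behavior).

    Builds a Pipeline from (name, estimator) steps and wraps it in a
    GridSearchCV that scores every combination in grid_params.
    """
    pipe = Pipeline(pipe_steps)
    return GridSearchCV(pipe, param_grid=grid_params, cv=cv, n_jobs=n_jobs)


# Toy corpus (made up for illustration), four posts per class
toy_text = [
    'the ship left the station', 'the crew explored a new planet',
    'the ship crew met a new species', 'a planet station needs a crew',
    'the rebel pilot flew a fighter', 'the pilot trained with a master',
    'a fighter squadron joined the rebel fleet', 'the master taught the rebel',
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

gs_toy = pipe_grid_sketch(
    [('cvec', CountVectorizer()), ('log', LogisticRegression(max_iter=10_000))],
    {'cvec__ngram_range': [(1, 1), (1, 2)], 'log__C': [0.5, 1.0]},
    cv=2,      # small fold count because the toy corpus is tiny
    n_jobs=1,  # single process keeps the toy run simple
)
gs_toy.fit(toy_text, toy_labels)
print(gs_toy.best_params_)
```

The step names (`cvec`, `log`) are what make the `cvec__...` and `log__...` keys in `grid_params` resolve to the right estimator.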

In [64]:
pipe_params = [('cvec', CountVectorizer()), 
               ('log', LogisticRegression(max_iter = 10_000))]

grid_params = {
    'cvec__ngram_range': [(1,1),(1,2)],
    'cvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'cvec__max_df': [0.01, 0.19, 0.37],
    'cvec__max_features': [1_500, 3_000, 4_500],
    'log__C': [0.5, 1.0, 1.5]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print(time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

44.88155698776245
Best parameters: {'cvec__max_df': 0.19, 'cvec__max_features': 4500, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each', 'itself', 'riker', 'himself'

In [65]:
# The model was overfit, so drop the largest C (1.5) to favor stronger regularization

pipe_params = [('cvec', CountVectorizer()), 
               ('log', LogisticRegression(max_iter = 10_000))]

grid_params = {
    'cvec__ngram_range': [(1,1),(1,2)],
    'cvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'cvec__max_df': [0.01, 0.19, 0.37],
    'cvec__max_features': [1_500, 3_000, 4_500],
    'log__C': [0.5, 1.0]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print(time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

30.472047805786133
Best parameters: {'cvec__max_df': 0.19, 'cvec__max_features': 4500, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each', 'itself', 'riker', 'himself
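
The comment in the cell above relies on `C` being the inverse of regularization strength in scikit-learn's `LogisticRegression`: a smaller `C` penalizes large coefficients more heavily, which is one common way to rein in overfitting. Here is a small sketch of that effect on synthetic data. The dataset and the specific `C` values are illustrative assumptions and have nothing to do with the Reddit posts.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption for illustration): many features, few informative
X_demo, y_demo = make_classification(
    n_samples=200, n_features=50, n_informative=5, random_state=0
)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    # Default penalty is L2; smaller C means a stronger penalty
    model = LogisticRegression(C=C, max_iter=10_000).fit(X_demo, y_demo)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f'C={C}: coefficient L2 norm = {coef_norms[C]:.3f}')
```

The coefficient norm shrinks as `C` decreases, so the model has less freedom to fit noise in the training set.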

In [66]:
pipe_params = [('cvec', CountVectorizer()), 
               ('log', LogisticRegression(max_iter = 10_000))]

grid_params = {
    'cvec__ngram_range': [(1,1),(1,2)],
    'cvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'cvec__max_df': [0.01, 0.19, 0.37],
    'cvec__max_features': [4_500, 5_500],
    'log__C': [0.5, 1.0]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print(time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

19.166616916656494
Best parameters: {'cvec__max_df': 0.19, 'cvec__max_features': 5500, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each', 'itself', 'riker', 'himself

In [67]:
pipe_params = [('cvec', CountVectorizer()), 
               ('log', LogisticRegression(max_iter = 10_000))]

grid_params = {
    'cvec__ngram_range': [(1,1),(1,2)],
    'cvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'cvec__max_df': [0.01, 0.19, 0.37],
    'cvec__max_features': [5_500, 6_500, 7_500],
    'log__C': [0.5, 1.0]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print(time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

30.362706899642944
Best parameters: {'cvec__max_df': 0.19, 'cvec__max_features': 6500, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each', 'itself', 'riker', 'himself

In [68]:
pipe_params = [('cvec', CountVectorizer()), 
               ('log', LogisticRegression(max_iter = 10_000))]

grid_params = {
    'cvec__ngram_range': [(1,1),(1,2)],
    'cvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'cvec__max_df': [0.01, 0.19, 0.37],
    'cvec__max_features': [6_000, 6_500, 7_000],
    'log__C': [0.5, 1.0]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print(time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

31.755382776260376
Best parameters: {'cvec__max_df': 0.19, 'cvec__max_features': 6500, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each', 'itself', 'riker', 'himself

In [69]:
pipe_params = [('cvec', CountVectorizer()), 
               ('log', LogisticRegression(max_iter = 10_000))]

grid_params = {
    'cvec__ngram_range': [(1,1),(1,2)],
    'cvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'cvec__max_df': [0.01, 0.19, 0.37],
    'cvec__max_features': [6_250, 6_500, 6_750],
    'log__C': [0.5, 1.0]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print(time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

30.713257312774658
Best parameters: {'cvec__max_df': 0.19, 'cvec__max_features': 6500, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each', 'itself', 'riker', 'himself

In [70]:
pipe_params = [('cvec', CountVectorizer()), 
               ('log', LogisticRegression(max_iter = 10_000))]

grid_params = {
    'cvec__ngram_range': [(1,1),(1,2)],
    'cvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'cvec__max_df': [0.01, 0.19, 0.37],
    'cvec__max_features': [6_400, 6_500, 6_600],
    'log__C': [0.5, 1.0]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print(time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

31.442172288894653
Best parameters: {'cvec__max_df': 0.19, 'cvec__max_features': 6500, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each', 'itself', 'riker', 'himself

In [71]:
pipe_params = [('cvec', CountVectorizer()), 
               ('log', LogisticRegression(max_iter = 10_000))]

grid_params = {
    'cvec__ngram_range': [(1,1),(1,2)],
    'cvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'cvec__max_df': [0.09, 0.19, 0.29],
    'cvec__max_features': [6_500],
    'log__C': [0.5, 1.0]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print(time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

10.92202615737915
Best parameters: {'cvec__max_df': 0.09, 'cvec__max_features': 6500, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each', 'itself', 'riker', 'himself'

In [72]:
pipe_params = [('cvec', CountVectorizer()), 
               ('log', LogisticRegression(max_iter = 10_000))]

grid_params = {
    'cvec__ngram_range': [(1,1),(1,2)],
    'cvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'cvec__max_df': [0.04, 0.09, 0.14],
    'cvec__max_features': [6_500],
    'log__C': [0.5, 1.0]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print(time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

10.339927673339844
Best parameters: {'cvec__max_df': 0.09, 'cvec__max_features': 6500, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each', 'itself', 'riker', 'himself

In [73]:
pipe_params = [('cvec', CountVectorizer()), 
               ('log', LogisticRegression(max_iter = 10_000))]

grid_params = {
    'cvec__ngram_range': [(1,1),(1,2)],
    'cvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'cvec__max_df': [0.08, 0.09, 0.10],
    'cvec__max_features': [6_500],
    'log__C': [0.5, 1.0]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print(time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

11.070679903030396
Best parameters: {'cvec__max_df': 0.08, 'cvec__max_features': 6500, 'cvec__ngram_range': (1, 2), 'cvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each', 'itself', 'riker', 'himself

In [74]:
pipe_params = [('cvec', CountVectorizer()), 
               ('log', LogisticRegression(max_iter = 10_000))]

grid_params = {
    'cvec__ngram_range': [(1,1),(1,2)],
    'cvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'cvec__max_df': [0.09],
    'cvec__max_features': [6_500],
    'log__C': [0.9, 1.0, 1.1]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print(time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

5.323198318481445
Best parameters: {'cvec__max_df': 0.09, 'cvec__max_features': 6500, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each', 'itself', 'riker', 'himself'

In [75]:
pipe_params = [('cvec', CountVectorizer()), 
               ('log', LogisticRegression(max_iter = 10_000))]

grid_params = {
    'cvec__ngram_range': [(1,1),(1,2)],
    'cvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'cvec__max_df': [0.09],
    'cvec__max_features': [6_500],
    'log__C': [1.1, 1.2]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print(time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

3.89455509185791
Best parameters: {'cvec__max_df': 0.09, 'cvec__max_features': 6500, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each', 'itself', 'riker', 'himself',

In [76]:
# BEST MODEL 3
pipe_params = [('cvec', CountVectorizer()), 
               ('log', LogisticRegression(max_iter = 10_000))]

grid_params = {
    'cvec__ngram_range': [(1,1)],
    'cvec__stop_words': [expanded_stopwords],
    'cvec__max_df': [0.09],
    'cvec__max_features': [6_500],
    'log__C': [1.1]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print(time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

0.6170868873596191
Best parameters: {'cvec__max_df': 0.09, 'cvec__max_features': 6500, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each', 'itself', 'riker', 'himself

# Model 4, TfidfVectorizer, LogisticRegression
Of the 4 combinations I tried without proper names/show references removed, this was the least effective, so I tuned it last.

Best parameters: {'log__C': 1.4, 'tvec__max_df': 0.12, 'tvec__max_features': 6500, 'tvec__ngram_range': (1, 2), 'tvec__stop_words': expanded_stopwords}
Training score: 0.9383758878863505
Test score: 0.8307426597582038
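
The summary above names the stop-word list instead of printing it, which keeps the result readable. A minimal sketch of a helper that does the same for any parameter dictionary such as `gs.best_params_` follows; `summarize_params` and its `max_items` cutoff are hypothetical names introduced here, not part of `functions.py`.

```python
def summarize_params(params, max_items=5):
    """Return a copy of params with long collections replaced by a short label."""
    summary = {}
    for name, value in params.items():
        is_collection = isinstance(value, (set, frozenset, list, tuple))
        if is_collection and len(value) > max_items:
            # e.g. a stop-word frozenset with hundreds of entries
            summary[name] = f'<{type(value).__name__} of {len(value)} items>'
        else:
            # short values such as ngram_range=(1, 2) are kept as they are
            summary[name] = value
    return summary


# Made-up example dictionary shaped like gs.best_params_
example = {
    'log__C': 1.4,
    'tvec__ngram_range': (1, 2),
    'tvec__stop_words': frozenset({'a', 'the', 'is', 'to', 'of', 'and'}),
}
print(summarize_params(example))
```

With a fitted search, `print(summarize_params(gs.best_params_))` would show the tuned values without the full stop-word dump.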

In [77]:
pipe_params = [('tvec', TfidfVectorizer()), 
               ('log', LogisticRegression(max_iter=10_000))]

grid_params = {
    'tvec__ngram_range': [(1,1), (1,2)],
    'tvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'tvec__max_df': [0.02, 0.22, 0.42],
    'tvec__max_features': [900, 2_900, 4_900],
    'log__C': [0.5, 1.0, 1.5]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print("Time to run:", time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run: 39.4936580657959
Best parameters: {'log__C': 1.5, 'tvec__max_df': 0.22, 'tvec__max_features': 4900, 'tvec__ngram_range': (1, 1), 'tvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each', '
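
The next few cells repeat this search with a tighter `max_df` grid centred on the previous best value. A minimal sketch of that narrowing step is below, assuming a three-point grid whose spacing shrinks on each pass; `narrow_grid` and its arguments are hypothetical, and the notebook itself sets each grid by hand.

```python
def narrow_grid(best, step, shrink=0.5, low=None, high=None, ndigits=4):
    """Three-point grid centred on `best`, with the old spacing scaled by `shrink`."""
    new_step = step * shrink
    candidates = [best - new_step, best, best + new_step]
    if low is not None:
        candidates = [max(low, c) for c in candidates]   # clip at the lower bound
    if high is not None:
        candidates = [min(high, c) for c in candidates]  # clip at the upper bound
    # Round away float noise and drop duplicates created by clipping.
    return sorted({round(c, ndigits) for c in candidates})


# The first grid was [0.02, 0.22, 0.42] (spacing 0.2) and 0.22 scored best.
print(narrow_grid(0.22, 0.2))   # [0.12, 0.22, 0.32]
# The second grid had spacing 0.1 and 0.12 scored best.
print(narrow_grid(0.12, 0.1))   # [0.07, 0.12, 0.17]
```

Passing `high=1.0` keeps a proportion such as `max_df` from exceeding 1.0 when the best value sits at the edge of the grid.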

In [78]:
pipe_params = [('tvec', TfidfVectorizer()), 
               ('log', LogisticRegression(max_iter=10_000))]

grid_params = {
    'tvec__ngram_range': [(1,1), (1,2)],
    'tvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'tvec__max_df': [0.12, 0.22, 0.32],
    'tvec__max_features': [900, 2_900, 4_900],
    'log__C': [0.5, 1.0, 1.5]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print("Time to run:", time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run: 40.74934983253479
Best parameters: {'log__C': 1.5, 'tvec__max_df': 0.12, 'tvec__max_features': 4900, 'tvec__ngram_range': (1, 1), 'tvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each', 

In [79]:
pipe_params = [('tvec', TfidfVectorizer()), 
               ('log', LogisticRegression(max_iter=10_000))]

grid_params = {
    'tvec__ngram_range': [(1,1), (1,2)],
    'tvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'tvec__max_df': [0.07, 0.12, 0.17],
    'tvec__max_features': [900, 2_900, 4_900],
    'log__C': [0.5, 1.0, 1.5]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print("Time to run:", time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run: 38.62906289100647
Best parameters: {'log__C': 1.5, 'tvec__max_df': 0.12, 'tvec__max_features': 4900, 'tvec__ngram_range': (1, 1), 'tvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each', 

In [80]:
pipe_params = [('tvec', TfidfVectorizer()), 
               ('log', LogisticRegression(max_iter=10_000))]

grid_params = {
    'tvec__ngram_range': [(1,1), (1,2)],
    'tvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'tvec__max_df': [0.10, 0.12, 0.14],
    'tvec__max_features': [900, 2_900, 4_900],
    'log__C': [0.5, 1.0, 1.5]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print("Time to run:", time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run: 40.75169014930725
Best parameters: {'log__C': 1.5, 'tvec__max_df': 0.12, 'tvec__max_features': 4900, 'tvec__ngram_range': (1, 1), 'tvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each', 

In [81]:
pipe_params = [('tvec', TfidfVectorizer()), 
               ('log', LogisticRegression(max_iter=10_000))]

grid_params = {
    'tvec__ngram_range': [(1,1), (1,2)],
    'tvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'tvec__max_df': [0.11, 0.12, 0.13],
    'tvec__max_features': [900, 2_900, 4_900],
    'log__C': [0.5, 1.0, 1.5]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print("Time to run:", time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run: 42.18435192108154
Best parameters: {'log__C': 1.5, 'tvec__max_df': 0.12, 'tvec__max_features': 4900, 'tvec__ngram_range': (1, 1), 'tvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each', 

In [82]:
pipe_params = [('tvec', TfidfVectorizer()), 
               ('log', LogisticRegression(max_iter=10_000))]

grid_params = {
    'tvec__ngram_range': [(1,1), (1,2)],
    'tvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'tvec__max_df': [0.12],
    'tvec__max_features': [3_400, 4_900, 6_400],
    'log__C': [0.5, 1.0, 1.5]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print("Time to run:", time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run: 13.28843092918396
Best parameters: {'log__C': 1.5, 'tvec__max_df': 0.12, 'tvec__max_features': 6400, 'tvec__ngram_range': (1, 2), 'tvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each', 

In [83]:
pipe_params = [('tvec', TfidfVectorizer()), 
               ('log', LogisticRegression(max_iter=10_000))]

grid_params = {
    'tvec__ngram_range': [(1,1), (1,2)],
    'tvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'tvec__max_df': [0.12],
    'tvec__max_features': [5_400, 6_400, 7_400],
    'log__C': [0.5, 1.0, 1.5]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print("Time to run:", time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run: 13.605923175811768
Best parameters: {'log__C': 1.5, 'tvec__max_df': 0.12, 'tvec__max_features': 6400, 'tvec__ngram_range': (1, 2), 'tvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each',

In [84]:
pipe_params = [('tvec', TfidfVectorizer()), 
               ('log', LogisticRegression(max_iter=10_000))]

grid_params = {
    'tvec__ngram_range': [(1,1), (1,2)],
    'tvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'tvec__max_df': [0.12],
    'tvec__max_features': [5_900, 6_400, 6_900],
    'log__C': [0.5, 1.0, 1.5]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print("Time to run:", time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run: 13.74911093711853
Best parameters: {'log__C': 1.5, 'tvec__max_df': 0.12, 'tvec__max_features': 6400, 'tvec__ngram_range': (1, 2), 'tvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each', 

In [85]:
pipe_params = [('tvec', TfidfVectorizer()), 
               ('log', LogisticRegression(max_iter=10_000))]

grid_params = {
    'tvec__ngram_range': [(1,1), (1,2)],
    'tvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'tvec__max_df': [0.12],
    'tvec__max_features': [6_300, 6_400, 6_500],
    'log__C': [0.5, 1.0, 1.5]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print("Time to run:", time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run: 14.678189992904663
Best parameters: {'log__C': 1.5, 'tvec__max_df': 0.12, 'tvec__max_features': 6500, 'tvec__ngram_range': (1, 2), 'tvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each',

In [86]:
pipe_params = [('tvec', TfidfVectorizer()), 
               ('log', LogisticRegression(max_iter=10_000))]

grid_params = {
    'tvec__ngram_range': [(1,1), (1,2)],
    'tvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'tvec__max_df': [0.12],
    'tvec__max_features': [6_500, 6_600],
    'log__C': [0.5, 1.0, 1.5]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print("Time to run:", time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run: 9.958917140960693
Best parameters: {'log__C': 1.5, 'tvec__max_df': 0.12, 'tvec__max_features': 6500, 'tvec__ngram_range': (1, 2), 'tvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each', 

In [87]:
pipe_params = [('tvec', TfidfVectorizer()), 
               ('log', LogisticRegression(max_iter=10_000))]

grid_params = {
    'tvec__ngram_range': [(1,1), (1,2)],
    'tvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'tvec__max_df': [0.12],
    'tvec__max_features': [6_500],
    'log__C': [1.25, 1.5, 1.75]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print("Time to run:", time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run: 5.618391036987305
Best parameters: {'log__C': 1.75, 'tvec__max_df': 0.12, 'tvec__max_features': 6500, 'tvec__ngram_range': (1, 2), 'tvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each',

In [89]:
pipe_params = [('tvec', TfidfVectorizer()), 
               ('log', LogisticRegression(max_iter=10_000))]

grid_params = {
    'tvec__ngram_range': [(1,1), (1,2)],
    'tvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'tvec__max_df': [0.12],
    'tvec__max_features': [6_500],
    'log__C': [1.0]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print("Time to run:", time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run: 2.1659951210021973
Best parameters: {'log__C': 1.0, 'tvec__max_df': 0.12, 'tvec__max_features': 6500, 'tvec__ngram_range': (1, 2), 'tvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each',

In [90]:
pipe_params = [('tvec', TfidfVectorizer()), 
               ('log', LogisticRegression(max_iter=10_000))]

grid_params = {
    'tvec__ngram_range': [(1,1), (1,2)],
    'tvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'tvec__max_df': [0.12],
    'tvec__max_features': [6_500],
    'log__C': [1.25]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print("Time to run:", time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run: 2.173959970474243
Best parameters: {'log__C': 1.25, 'tvec__max_df': 0.12, 'tvec__max_features': 6500, 'tvec__ngram_range': (1, 2), 'tvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each',

In [91]:
pipe_params = [('tvec', TfidfVectorizer()), 
               ('log', LogisticRegression(max_iter=10_000))]

grid_params = {
    'tvec__ngram_range': [(1,1), (1,2)],
    'tvec__stop_words': [expanded_stopwords, expanded_proper_names],
    'tvec__max_df': [0.12],
    'tvec__max_features': [6_500],
    'log__C': [1.4]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print("Time to run:", time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run: 2.3524134159088135
Best parameters: {'log__C': 1.4, 'tvec__max_df': 0.12, 'tvec__max_features': 6500, 'tvec__ngram_range': (1, 2), 'tvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each',

In [92]:
#BEST VERSION OF MODEL 4
pipe_params = [('tvec', TfidfVectorizer()), 
               ('log', LogisticRegression(max_iter=10_000))]

grid_params = {
    'tvec__ngram_range': [(1,2)],
    'tvec__stop_words': [expanded_stopwords],
    'tvec__max_df': [0.12],
    'tvec__max_features': [6_500],
    'log__C': [1.4]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print("Time to run:", time.time()-t0)

print(f'Best parameters: {gs.best_params_}')
print(f'Training score: {gs.score(X_train, y_train)}')
print(f'Test score: {gs.score(X_test, y_test)}')

Time to run: 1.0933971405029297
Best parameters: {'log__C': 1.4, 'tvec__max_df': 0.12, 'tvec__max_features': 6500, 'tvec__ngram_range': (1, 2), 'tvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each',

# Model 5, CountVectorizer, RandomForest

Because this model performs so much worse than the ones above, I didn't spend much time trying to tune it.

In [94]:
#THIS MODEL IS EXTREMELY WELL FIT BUT IT'S NOT NEARLY AS ACCURATE AS THE ONES ABOVE.

pipe_params = [('cvec', CountVectorizer()),
    ('rf', RandomForestClassifier())]

grid_params = {
    'cvec__ngram_range': [(1,1), (1,2)],
    'cvec__stop_words': [expanded_proper_names, expanded_stopwords],
    'cvec__max_df': [0.1, 0.5, 1.0],
    'cvec__max_features': [1_000, 2_500, 5_000],
    'rf__n_estimators': [50, 100, 150],
    'rf__max_depth': [3, 5],
    'rf__min_samples_split': [3, 5],
    'rf__min_samples_leaf': [3, 5]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print("Time to run:", time.time()-t0)

print('Best parameters:', gs.best_params_)
print('Training score:', gs.score(X_train, y_train))
print('Test score:', gs.score(X_test, y_test))

Time to run: 618.7906250953674
Best parameters: {'cvec__max_df': 1.0, 'cvec__max_features': 2500, 'cvec__ngram_range': (1, 2), 'cvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each', 'itself', 'riker

In [95]:
pipe_params = [('cvec', CountVectorizer()),
    ('rf', RandomForestClassifier())]

grid_params = {
    'cvec__ngram_range': [(1,1), (1,2)],
    'cvec__stop_words': [expanded_proper_names, expanded_stopwords],
    'cvec__max_df': [0.1, 0.5, 1.0],
    'cvec__max_features': [2_250, 2_500],
    'rf__n_estimators': [200],
    'rf__max_depth': [7],
    'rf__min_samples_split': [3, 5],
    'rf__min_samples_leaf': [3, 5]
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print("Time to run:", time.time()-t0)

print('Best parameters:', gs.best_params_)
print('Training score:', gs.score(X_train, y_train))
print('Test score:', gs.score(X_test, y_test))

Time to run: 66.63195705413818
Best parameters: {'cvec__max_df': 0.1, 'cvec__max_features': 2250, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each', 'itself', 'riker

# Model 6, TfidfVectorizer, RandomForestClassifier

Again, this model is much weaker than the ones above, and no parameter stood out as likely to improve it dramatically, so I stopped tuning it fairly quickly.

In [98]:
pipe_params = [('tvec', TfidfVectorizer()),
    ('rf', RandomForestClassifier())]

grid_params = {
    'tvec__ngram_range': [(1,1), (1,2)],
    'tvec__stop_words': [expanded_proper_names, expanded_stopwords],
    'tvec__max_df': [0.1, 0.5, 1.0],
    'tvec__max_features': [1_000, 2_500, 5_000],
    'rf__n_estimators': [100], #50, 150
    'rf__max_depth': [3], #5, 7
    'rf__min_samples_split': [3], #, 5, 7
    'rf__min_samples_leaf': [3] #, 5, 7
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print("Time to run:", time.time()-t0)

print('Best parameters:', gs.best_params_)
print('Training score:', gs.score(X_train, y_train))
print('Test score:', gs.score(X_test, y_test))

Time to run: 16.51498818397522
Best parameters: {'rf__max_depth': 3, 'rf__min_samples_leaf': 3, 'rf__min_samples_split': 3, 'rf__n_estimators': 100, 'tvec__max_df': 1.0, 'tvec__max_features': 1000, 'tvec__ngram_range': (1, 1), 'tvec__stop_words': ['seven', 'clone', 'warp', 'borg', 'trilogy', 'contact', 'anakin', 'paramount', 'leia', 'kirk', 'wan', 'jedi', 'kenobi', 'snw', 'wars', 'vader', 'order', 'skywalker', 'klingon', 'starfleet', 'ds9', 'captain', 'maul', 'luke', 'obi', 'rebels', 'data', 'voyager', 'st', 'discovery', 'federation', 'pike', 'picard', 'mandalorian', 'klingons', 'star', 'tng', 'reva', 'strange', 'disney', 'worf', 'riker', 'empire', 'jurati', 'palpatine', 'yoda', 'force', 'darth', 'republic', 'lightsaber', 'sith', 'spock', 'boba', 'fett', 'inquisitor', 'trek', 'enterprise', 'tos', 'padme']}
Training score: 0.7066615473219428
Test score: 0.6655152561888313


In [100]:
pipe_params = [('tvec', TfidfVectorizer()),
    ('rf', RandomForestClassifier())]

grid_params = {
    'tvec__ngram_range': [(1,1), (1,2)],
    'tvec__stop_words': [expanded_proper_names, expanded_stopwords],
    'tvec__max_df': [0.1, 0.5, 1.0],
    'tvec__max_features': [1_000, 2_500, 5_000],
    'rf__n_estimators': [100, 150], #50, 150
    'rf__max_depth': [5, 7],
    'rf__min_samples_split': [3], #, 5, 7
    'rf__min_samples_leaf': [3] #, 5, 7
    
}

gs = fun.pipe_grid_njobs(pipe_params, grid_params)

t0 = time.time()
gs.fit(X_train, y_train)
print("Time to run:", time.time()-t0)

print('Best parameters:', gs.best_params_)
print('Training score:', gs.score(X_train, y_train))
print('Test score:', gs.score(X_test, y_test))

Time to run: 84.97272491455078
Best parameters: {'rf__max_depth': 5, 'rf__min_samples_leaf': 3, 'rf__min_samples_split': 3, 'rf__n_estimators': 100, 'tvec__max_df': 1.0, 'tvec__max_features': 2500, 'tvec__ngram_range': (1, 2), 'tvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'tho

# Model 7, VotingClassifier with DecisionTree, AdaBoost, and GradientBoosting

Best Parameters: {'cvec__max_df': 0.96, 'cvec__max_features': 3000, 'cvec__ngram_range': (1, 2), 'cvec__stop_words': expanded_stopwords, 'vote__ada__n_estimators': 400, 'vote__gb__n_estimators': 350, 'vote__tree__max_depth': None}
Training Score: 0.8957573430600884
Test Score: 0.7910189982728842
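
Parameter names such as `vote__ada__n_estimators` follow scikit-learn's double-underscore convention: pipeline step, then voting member, then the member's own parameter. The sketch below uses no data and only illustrates how those names resolve and how quickly a grid grows. The `_demo` names and the placeholder stop-word lists are assumptions for illustration.

```python
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              VotingClassifier)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

vote_demo = VotingClassifier([
    ("tree", DecisionTreeClassifier()),
    ("ada", AdaBoostClassifier()),
    ("gb", GradientBoostingClassifier()),
])
pipe_demo = Pipeline([("cvec", CountVectorizer()), ("vote", vote_demo)])

# Each "__" descends one level: pipeline step -> voting member -> parameter.
nested = sorted(k for k in pipe_demo.get_params() if k.endswith("__n_estimators"))
print(nested)

# set_params follows the same path down to the AdaBoost member.
pipe_demo.set_params(vote__ada__n_estimators=400)
ada_member = dict(pipe_demo.named_steps["vote"].estimators)["ada"]
print(ada_member.n_estimators)

# Seven parameters with two candidate values each, as in the first search below.
demo_grid = {
    "cvec__stop_words": [["placeholder_a"], ["placeholder_b"]],  # stand-ins for the real lists
    "cvec__ngram_range": [(1, 1), (1, 2)],
    "cvec__max_df": [0.5, 1.0],
    "cvec__max_features": [1_000, 2_000],
    "vote__tree__max_depth": [None, 5],
    "vote__ada__n_estimators": [50, 75],
    "vote__gb__n_estimators": [50, 75],
}
n_candidates = len(ParameterGrid(demo_grid))
print(n_candidates, "candidates,", n_candidates * 3, "fits with cv=3")
```

Because every added parameter multiplies the number of fits, the later cells narrow each list to one or two values around the previous best rather than widening the grid.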

In [101]:
vote = VotingClassifier([
    ('tree', DecisionTreeClassifier()),
    ('ada', AdaBoostClassifier()),
    ('gb', GradientBoostingClassifier())
    
])

vote_params = {
    'cvec__stop_words': [expanded_proper_names, expanded_stopwords],
    'cvec__ngram_range': [(1,1), (1,2)],
    'cvec__max_df': [0.5, 1.0],
    'cvec__max_features': [1_000, 2_000],
    'vote__tree__max_depth': [None, 5],
    'vote__ada__n_estimators': [50, 75],
    'vote__gb__n_estimators': [50, 75]
    
}

pipe = Pipeline([('cvec', CountVectorizer()),
                 ('vote', vote)])

gs = GridSearchCV(pipe, param_grid = vote_params, cv = 3, n_jobs = -1)

t0 = time.time()
gs.fit(X_train, y_train)
print(time.time()-t0)

print('Best Parameters:', gs.best_params_)
print('Training Score:', gs.score(X_train, y_train))
print('Test Score:', gs.score(X_test, y_test))

75.22982907295227
Best Parameters: {'cvec__max_df': 1.0, 'cvec__max_features': 2000, 'cvec__ngram_range': (1, 2), 'cvec__stop_words': ['seven', 'clone', 'warp', 'borg', 'trilogy', 'contact', 'anakin', 'paramount', 'leia', 'kirk', 'wan', 'jedi', 'kenobi', 'snw', 'wars', 'vader', 'order', 'skywalker', 'klingon', 'starfleet', 'ds9', 'captain', 'maul', 'luke', 'obi', 'rebels', 'data', 'voyager', 'st', 'discovery', 'federation', 'pike', 'picard', 'mandalorian', 'klingons', 'star', 'tng', 'reva', 'strange', 'disney', 'worf', 'riker', 'empire', 'jurati', 'palpatine', 'yoda', 'force', 'darth', 'republic', 'lightsaber', 'sith', 'spock', 'boba', 'fett', 'inquisitor', 'trek', 'enterprise', 'tos', 'padme'], 'vote__ada__n_estimators': 75, 'vote__gb__n_estimators': 75, 'vote__tree__max_depth': None}
Training Score: 0.8107122288347092
Test Score: 0.7328727691421992


In [102]:
vote = VotingClassifier([
    ('tree', DecisionTreeClassifier()),
    ('ada', AdaBoostClassifier()),
    ('gb', GradientBoostingClassifier())
    
])

vote_params = {
    'cvec__stop_words': [expanded_proper_names, expanded_stopwords],
    'cvec__ngram_range': [(1,1), (1,2)],
    'cvec__max_df': [0.9, 1.0],
    'cvec__max_features': [2_000, 3_000],
    'vote__tree__max_depth': [None, 5],
    'vote__ada__n_estimators': [75, 100],
    'vote__gb__n_estimators': [75, 100]
    
}

pipe = Pipeline([('cvec', CountVectorizer()),
                 ('vote', vote)])

gs = GridSearchCV(pipe, param_grid = vote_params, cv = 3, n_jobs = -1)

t0 = time.time()
gs.fit(X_train, y_train)
print(time.time()-t0)

print('Best Parameters:', gs.best_params_)
print('Training Score:', gs.score(X_train, y_train))
print('Test Score:', gs.score(X_test, y_test))

106.06975388526917
Best Parameters: {'cvec__max_df': 0.9, 'cvec__max_features': 3000, 'cvec__ngram_range': (1, 2), 'cvec__stop_words': ['seven', 'clone', 'warp', 'borg', 'trilogy', 'contact', 'anakin', 'paramount', 'leia', 'kirk', 'wan', 'jedi', 'kenobi', 'snw', 'wars', 'vader', 'order', 'skywalker', 'klingon', 'starfleet', 'ds9', 'captain', 'maul', 'luke', 'obi', 'rebels', 'data', 'voyager', 'st', 'discovery', 'federation', 'pike', 'picard', 'mandalorian', 'klingons', 'star', 'tng', 'reva', 'strange', 'disney', 'worf', 'riker', 'empire', 'jurati', 'palpatine', 'yoda', 'force', 'darth', 'republic', 'lightsaber', 'sith', 'spock', 'boba', 'fett', 'inquisitor', 'trek', 'enterprise', 'tos', 'padme'], 'vote__ada__n_estimators': 100, 'vote__gb__n_estimators': 100, 'vote__tree__max_depth': None}
Training Score: 0.8174313687847955
Test Score: 0.7501439263097294


In [103]:
vote = VotingClassifier([
    ('tree', DecisionTreeClassifier()),
    ('ada', AdaBoostClassifier()),
    ('gb', GradientBoostingClassifier())
    
])

vote_params = {
    'cvec__stop_words': [expanded_proper_names, expanded_stopwords],
    'cvec__ngram_range': [(1,1), (1,2)],
    'cvec__max_df': [0.95, 1.0],
    'cvec__max_features': [3_000, 4_000],
    #'vote__tree__max_depth': [None, 5],
    'vote__ada__n_estimators': [100, 150],
    'vote__gb__n_estimators': [100, 150]
    
}

pipe = Pipeline([('cvec', CountVectorizer()),
                 ('vote', vote)])

gs = GridSearchCV(pipe, param_grid = vote_params, cv = 3, n_jobs = -1)

t0 = time.time()
gs.fit(X_train, y_train)
print(time.time()-t0)

print('Best Parameters:', gs.best_params_)
print('Training Score:', gs.score(X_train, y_train))
print('Test Score:', gs.score(X_test, y_test))

78.07606506347656
Best Parameters: {'cvec__max_df': 0.95, 'cvec__max_features': 3000, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each', 'itself', 'riker', 'himself'

In [104]:
vote = VotingClassifier([
    ('tree', DecisionTreeClassifier()),
    ('ada', AdaBoostClassifier()),
    ('gb', GradientBoostingClassifier())
    
])

vote_params = {
    'cvec__stop_words': [expanded_proper_names, expanded_stopwords],
    'cvec__ngram_range': [(1,1), (1,2)],
    'cvec__max_df': [0.95, .96],
    'cvec__max_features': [3_000, 3_500],
    #'vote__tree__max_depth': [None, 5],
    'vote__ada__n_estimators': [150, 200],
    'vote__gb__n_estimators': [150, 200]
    
}

pipe = Pipeline([('cvec', CountVectorizer()),
                 ('vote', vote)])

gs = GridSearchCV(pipe, param_grid = vote_params, cv = 3, n_jobs = -1)

t0 = time.time()
gs.fit(X_train, y_train)
print(time.time()-t0)

print('Best Parameters:', gs.best_params_)
print('Training Score:', gs.score(X_train, y_train))
print('Test Score:', gs.score(X_test, y_test))

108.22899413108826
Best Parameters: {'cvec__max_df': 0.95, 'cvec__max_features': 3000, 'cvec__ngram_range': (1, 2), 'cvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each', 'itself', 'riker', 'himself

In [105]:
vote = VotingClassifier([
    ('tree', DecisionTreeClassifier()),
    ('ada', AdaBoostClassifier()),
    ('gb', GradientBoostingClassifier())
    
])

vote_params = {
    'cvec__stop_words': [expanded_stopwords],
    'cvec__ngram_range': [(1,1), (1,2)],
    'cvec__max_df': [0.96],
    'cvec__max_features': [3_500],
    #'vote__tree__max_depth': [None, 5],
    'vote__ada__n_estimators': [400, 500],
    'vote__gb__n_estimators': [300, 350]
    
}

pipe = Pipeline([('cvec', CountVectorizer()),
                 ('vote', vote)])

gs = GridSearchCV(pipe, param_grid = vote_params, cv = 3, n_jobs = -1)

t0 = time.time()
gs.fit(X_train, y_train)
print(time.time()-t0)

print('Best Parameters:', gs.best_params_)
print('Training Score:', gs.score(X_train, y_train))
print('Test Score:', gs.score(X_test, y_test))

24.841930389404297
Best Parameters: {'cvec__max_df': 0.96, 'cvec__max_features': 3500, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each', 'itself', 'riker', 'himself

In [106]:
# The previous cell's result is better than this one's

vote = VotingClassifier([
    ('tree', DecisionTreeClassifier()),
    ('ada', AdaBoostClassifier()),
    ('gb', GradientBoostingClassifier())
    
])

vote_params = {
    'cvec__stop_words': [expanded_stopwords],
    'cvec__ngram_range': [(1,1), (1,2)],
    'cvec__max_df': [0.96],
    'cvec__max_features': [3_500],
    #'vote__tree__max_depth': [None, 5],
    'vote__ada__n_estimators': [500, 600],
    'vote__gb__n_estimators': [350, 400]
    
}

pipe = Pipeline([('cvec', CountVectorizer()),
                 ('vote', vote)])

gs = GridSearchCV(pipe, param_grid = vote_params, cv = 3, n_jobs = -1)

t0 = time.time()
gs.fit(X_train, y_train)
print(time.time()-t0)

print('Best Parameters:', gs.best_params_)
print('Training Score:', gs.score(X_train, y_train))
print('Test Score:', gs.score(X_test, y_test))

27.3620388507843
Best Parameters: {'cvec__max_df': 0.96, 'cvec__max_features': 3500, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each', 'itself', 'riker', 'himself',

In [107]:
vote = VotingClassifier([
    ('tree', DecisionTreeClassifier()),
    ('ada', AdaBoostClassifier()),
    ('gb', GradientBoostingClassifier())
    
])

vote_params = {
    'cvec__stop_words': [expanded_stopwords],
    'cvec__ngram_range': [(1,2)],
    'cvec__max_df': [0.96],
    'cvec__max_features': [3_500],
    'vote__tree__max_depth': [5, 10, 15],
    'vote__ada__n_estimators': [400],
    'vote__gb__n_estimators': [350]
    
}

pipe = Pipeline([('cvec', CountVectorizer()),
                 ('vote', vote)])

gs = GridSearchCV(pipe, param_grid = vote_params, cv = 3, n_jobs = -1)

t0 = time.time()
gs.fit(X_train, y_train)
print(time.time()-t0)

print('Best Parameters:', gs.best_params_)
print('Training Score:', gs.score(X_train, y_train))
print('Test Score:', gs.score(X_test, y_test))

12.751996278762817
Best Parameters: {'cvec__max_df': 0.96, 'cvec__max_features': 3500, 'cvec__ngram_range': (1, 2), 'cvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each', 'itself', 'riker', 'himself

In [109]:
# Best version of Model 7. Of the cvec__max_features values searched, 3000 was selected (see the output below).
vote = VotingClassifier([
    ('tree', DecisionTreeClassifier()),
    ('ada', AdaBoostClassifier()),
    ('gb', GradientBoostingClassifier())
    
])

vote_params = {
    'cvec__stop_words': [expanded_stopwords],
    'cvec__ngram_range': [(1,2)],
    'cvec__max_df': [0.96],
    'cvec__max_features': [3_000, 3_500, 4_000],
    'vote__tree__max_depth': [None],
    'vote__ada__n_estimators': [400],
    'vote__gb__n_estimators': [350]
    
}

pipe = Pipeline([('cvec', CountVectorizer()),
                 ('vote', vote)])

gs = GridSearchCV(pipe, param_grid = vote_params, cv = 3, n_jobs = -1)

t0 = time.time()
gs.fit(X_train, y_train)
print(time.time()-t0)

print('Best Parameters:', gs.best_params_)
print('Training Score:', gs.score(X_train, y_train))
print('Test Score:', gs.score(X_test, y_test))

12.699783325195312
Best Parameters: {'cvec__max_df': 0.96, 'cvec__max_features': 3000, 'cvec__ngram_range': (1, 2), 'cvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each', 'itself', 'riker', 'himself

In [111]:
# The previous cell's result is better than this one's
vote = VotingClassifier([
    ('tree', DecisionTreeClassifier()),
    ('ada', AdaBoostClassifier()),
    ('gb', GradientBoostingClassifier())
    
])

vote_params = {
    'cvec__stop_words': [expanded_stopwords],
    'cvec__ngram_range': [(1,2)],
    'cvec__max_df': [0.96],
    'cvec__max_features': [3_400, 3_500, 3_600],
    'vote__tree__max_depth': [None],
    'vote__ada__n_estimators': [400],
    'vote__gb__n_estimators': [350]
    
}

pipe = Pipeline([('cvec', CountVectorizer()),
                 ('vote', vote)])

gs = GridSearchCV(pipe, param_grid = vote_params, cv = 3, n_jobs = -1)

t0 = time.time()
gs.fit(X_train, y_train)
print(time.time()-t0)

print('Best Parameters:', gs.best_params_)
print('Training Score:', gs.score(X_train, y_train))
print('Test Score:', gs.score(X_test, y_test))

12.455538749694824
Best Parameters: {'cvec__max_df': 0.96, 'cvec__max_features': 3400, 'cvec__ngram_range': (1, 2), 'cvec__stop_words': frozenset({'thereafter', 'empty', 'show', 'whole', 'this', 'wherein', 'nine', 'him', 'what', 'their', 'me', 'cannot', 'eg', 'alone', 'mostly', 'whether', 'vader', 'other', 'themselves', 'become', 'namely', 'against', 'much', 'fett', 'least', 'data', 'former', 'myself', 'i', 'than', 'be', 'down', 'rebels', 'the', 'sith', 'done', 'nevertheless', 'besides', 'to', 'above', 'get', 'however', 'captain', 'amount', 'lightsaber', 'anyway', 'ourselves', 'first', 'is', 'out', 'hasnt', 'those', 'from', 'along', 'not', 'if', 'will', 'at', 'whose', 'picard', 'one', 'beforehand', 'side', 'further', 'twelve', 'have', 'am', 'was', 'us', 'mill', 'already', 'a', 'its', 'she', 'klingons', 'there', 'always', 'could', 'our', 'by', 'yourself', 'too', 'without', 'though', 'eleven', 'being', 'up', 'force', 'under', 'worf', 'else', 'becomes', 'each', 'itself', 'riker', 'himself

# Model 8, TfidfVectorizer, MultinomialNB, Title and Submission (body) Lengths and Word Counts
Using the best TfidfVectorizer and MultinomialNB parameters from above:\
'nb__alpha': 0.25\
'tvec__max_df': 0.7\
'tvec__max_features': 4800\
'tvec__ngram_range': (1, 1)\
'tvec__stop_words': expanded_proper_names\
Training score: 0.9199462468803993\
Test score: 0.8514680483592401

Adding all four length and word-count columns dropped the scores to about 0.537 on both training and test data, barely above the 50.9% baseline.

Best result with just 'no_selftext' added, alpha = 0.2:\
Training score: 0.9249376079861777\
Test score: 0.8733448474381117

In [235]:
X = whole_df.drop(columns = 'subreddit_name')
y = whole_df['subreddit_name']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify = y)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

tvec = TfidfVectorizer(max_df=0.7, max_features=4_800, stop_words = expanded_proper_names)
vectorized_train = tvec.fit_transform(X_train['all_words'])
vectorized_test = tvec.transform(X_test['all_words'])

vectorized_train = pd.DataFrame(vectorized_train.todense(), columns = tvec.get_feature_names_out())
vectorized_test = pd.DataFrame(vectorized_test.todense(), columns = tvec.get_feature_names_out())

print('vectorized_train shape:', vectorized_train.shape)
print('vectorized_test shape:', vectorized_test.shape)

X_train = X_train.reset_index()
X_test = X_test.reset_index()

print('X_train reset index shape:', X_train.shape)
print('X_test reset index shape:', X_test.shape)

vectorized_train = pd.concat([X_train[['index', 'submission_length', 'title_length', 'submission_word_count', 'title_word_count']], vectorized_train], axis = 1)
vectorized_test = pd.concat([X_test[['index', 'submission_length', 'title_length', 'submission_word_count', 'title_word_count']], vectorized_test], axis = 1)

print('vectorized_train concatenated shape:', vectorized_train.shape)
print('vectorized_test concatenated shape:', vectorized_test.shape)

vectorized_train = vectorized_train.set_index('index')
vectorized_test = vectorized_test.set_index('index')

print('vectorized_train corrected index shape:', vectorized_train.shape)
print('vectorized_test corrected index shape:', vectorized_test.shape)

nb = MultinomialNB(alpha=0.25)

nb.fit(vectorized_train, y_train)

print('Training score:', nb.score(vectorized_train, y_train))
print('Test score:', nb.score(vectorized_test, y_test))

X_train shape: (5209, 10)
X_test shape: (1737, 10)
y_train shape: (5209,)
y_test shape: (1737,)
vectorized_train shape: (5209, 4800)
vectorized_test shape: (1737, 4800)
X_train reset index shape: (5209, 11)
X_test reset index shape: (1737, 11)
vectorized_train concatenated shape: (5209, 4805)
vectorized_test concatenated shape: (1737, 4805)
vectorized_train corrected index shape: (5209, 4804)
vectorized_test corrected index shape: (1737, 4804)
Training score: 0.5371472451526205
Test score: 0.5371329879101899
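
The cell above lines up the TF-IDF matrix with the extra columns by hand (`todense`, `reset_index`, `concat`, `set_index`). One possible alternative, sketched here on a few made-up rows, is scikit-learn's `ColumnTransformer`, which vectorizes the text column and passes a numeric column through in a single step. The toy DataFrame and the step names `tvec`, `flag`, `features`, and `nb` are assumptions for illustration. A plausible but unverified reason for the near-baseline score above is that raw lengths are far larger than TF-IDF weights, which lie between 0 and 1, so they may dominate what MultinomialNB sees.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up rows that mimic the notebook's column names.
toy = pd.DataFrame({
    "all_words": ["warp drive engage", "the force is strong",
                  "beam me up", "use the force"],
    "no_selftext": [0, 1, 0, 1],
    "subreddit_name": ["startrek", "starwars", "startrek", "starwars"],
})
X_toy = toy[["all_words", "no_selftext"]]
y_toy = toy["subreddit_name"]

features = ColumnTransformer([
    # A string (not a list) hands TfidfVectorizer the 1-D text column it expects.
    ("tvec", TfidfVectorizer(), "all_words"),
    # The binary flag is appended unchanged as one extra column.
    ("flag", "passthrough", ["no_selftext"]),
])
model = Pipeline([("features", features), ("nb", MultinomialNB(alpha=0.25))])
model.fit(X_toy, y_toy)

combined = model.named_steps["features"].transform(X_toy)
n_terms = len(model.named_steps["features"].named_transformers_["tvec"].vocabulary_)
print("combined shape:", combined.shape, "| vocabulary size:", n_terms)
print(model.predict(X_toy))
```

Keeping everything inside one pipeline also means the vectorizer is fit only on whatever data `fit` receives, so the same object could be dropped into `GridSearchCV` without extra index bookkeeping.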


In [227]:
# WITH JUST 'no_selftext'
X = whole_df.drop(columns = 'subreddit_name')
y = whole_df['subreddit_name']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify = y)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

tvec = TfidfVectorizer(max_df=0.7, max_features=4_800, stop_words = expanded_proper_names)
vectorized_train = tvec.fit_transform(X_train['all_words'])
vectorized_test = tvec.transform(X_test['all_words'])

vectorized_train = pd.DataFrame(vectorized_train.todense(), columns = tvec.get_feature_names_out())
vectorized_test = pd.DataFrame(vectorized_test.todense(), columns = tvec.get_feature_names_out())

print('vectorized_train shape:', vectorized_train.shape)
print('vectorized_test shape:', vectorized_test.shape)

X_train = X_train.reset_index()
X_test = X_test.reset_index()

print('X_train reset index shape:', X_train.shape)
print('X_test reset index shape:', X_test.shape)

vectorized_train = pd.concat([X_train[['index', 'no_selftext']], vectorized_train], axis = 1)
vectorized_test = pd.concat([X_test[['index', 'no_selftext']], vectorized_test], axis = 1)

print('vectorized_train concatenated shape:', vectorized_train.shape)
print('vectorized_test concatenated shape:', vectorized_test.shape)

vectorized_train = vectorized_train.set_index('index')
vectorized_test = vectorized_test.set_index('index')

print('vectorized_train corrected index shape:', vectorized_train.shape)
print('vectorized_test corrected index shape:', vectorized_test.shape)

nb = MultinomialNB(alpha=0.25)

nb.fit(vectorized_train, y_train)

print('Training score:', nb.score(vectorized_train, y_train))
print('Test score:', nb.score(vectorized_test, y_test))

X_train shape: (5209, 10)
X_test shape: (1737, 10)
y_train shape: (5209,)
y_test shape: (1737,)
vectorized_train shape: (5209, 4800)
vectorized_test shape: (1737, 4800)
X_train reset index shape: (5209, 11)
X_test reset index shape: (1737, 11)
vectorized_train concatenated shape: (5209, 4802)
vectorized_test concatenated shape: (1737, 4802)
vectorized_train corrected index shape: (5209, 4801)
vectorized_test corrected index shape: (1737, 4801)
Training score: 0.9235937799961605
Test score: 0.8716177317213587


In [228]:
# WITH JUST 'no_selftext'
# ADJUSTING ALPHA
X = whole_df.drop(columns = 'subreddit_name')
y = whole_df['subreddit_name']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify = y)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

tvec = TfidfVectorizer(max_df=0.7, max_features=4_800, stop_words = expanded_proper_names)
vectorized_train = tvec.fit_transform(X_train['all_words'])
vectorized_test = tvec.transform(X_test['all_words'])

vectorized_train = pd.DataFrame(vectorized_train.todense(), columns = tvec.get_feature_names_out())
vectorized_test = pd.DataFrame(vectorized_test.todense(), columns = tvec.get_feature_names_out())

print('vectorized_train shape:', vectorized_train.shape)
print('vectorized_test shape:', vectorized_test.shape)

X_train = X_train.reset_index()
X_test = X_test.reset_index()

print('X_train reset index shape:', X_train.shape)
print('X_test reset index shape:', X_test.shape)

vectorized_train = pd.concat([X_train[['index', 'no_selftext']], vectorized_train], axis = 1)
vectorized_test = pd.concat([X_test[['index', 'no_selftext']], vectorized_test], axis = 1)

print('vectorized_train concatenated shape:', vectorized_train.shape)
print('vectorized_test concatenated shape:', vectorized_test.shape)

vectorized_train = vectorized_train.set_index('index')
vectorized_test = vectorized_test.set_index('index')

print('vectorized_train corrected index shape:', vectorized_train.shape)
print('vectorized_test corrected index shape:', vectorized_test.shape)

nb = MultinomialNB(alpha=0.5)

nb.fit(vectorized_train, y_train)

print('Training score:', nb.score(vectorized_train, y_train))
print('Test score:', nb.score(vectorized_test, y_test))

X_train shape: (5209, 10)
X_test shape: (1737, 10)
y_train shape: (5209,)
y_test shape: (1737,)
vectorized_train shape: (5209, 4800)
vectorized_test shape: (1737, 4800)
X_train reset index shape: (5209, 11)
X_test reset index shape: (1737, 11)
vectorized_train concatenated shape: (5209, 4802)
vectorized_test concatenated shape: (1737, 4802)
vectorized_train corrected index shape: (5209, 4801)
vectorized_test corrected index shape: (1737, 4801)
Training score: 0.9170666154732194
Test score: 0.8658606793321819


In [229]:
# WITH JUST 'no_selftext'
# ADJUSTING ALPHA
X = whole_df.drop(columns = 'subreddit_name')
y = whole_df['subreddit_name']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify = y)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

tvec = TfidfVectorizer(max_df=0.7, max_features=4_800, stop_words = expanded_proper_names)
vectorized_train = tvec.fit_transform(X_train['all_words'])
vectorized_test = tvec.transform(X_test['all_words'])

vectorized_train = pd.DataFrame(vectorized_train.todense(), columns = tvec.get_feature_names_out())
vectorized_test = pd.DataFrame(vectorized_test.todense(), columns = tvec.get_feature_names_out())

print('vectorized_train shape:', vectorized_train.shape)
print('vectorized_test shape:', vectorized_test.shape)

X_train = X_train.reset_index()
X_test = X_test.reset_index()

print('X_train reset index shape:', X_train.shape)
print('X_test reset index shape:', X_test.shape)

vectorized_train = pd.concat([X_train[['index', 'no_selftext']], vectorized_train], axis = 1)
vectorized_test = pd.concat([X_test[['index', 'no_selftext']], vectorized_test], axis = 1)

print('vectorized_train concatenated shape:', vectorized_train.shape)
print('vectorized_test concatenated shape:', vectorized_test.shape)

vectorized_train = vectorized_train.set_index('index')
vectorized_test = vectorized_test.set_index('index')

print('vectorized_train corrected index shape:', vectorized_train.shape)
print('vectorized_test corrected index shape:', vectorized_test.shape)

nb = MultinomialNB(alpha=0.3)

nb.fit(vectorized_train, y_train)

print('Training score:', nb.score(vectorized_train, y_train))
print('Test score:', nb.score(vectorized_test, y_test))

X_train shape: (5209, 10)
X_test shape: (1737, 10)
y_train shape: (5209,)
y_test shape: (1737,)
vectorized_train shape: (5209, 4800)
vectorized_test shape: (1737, 4800)
X_train reset index shape: (5209, 11)
X_test reset index shape: (1737, 11)
vectorized_train concatenated shape: (5209, 4802)
vectorized_test concatenated shape: (1737, 4802)
vectorized_train corrected index shape: (5209, 4801)
vectorized_test corrected index shape: (1737, 4801)
Training score: 0.9220579765789979
Test score: 0.869314910765688


In [230]:
# BEST VERSION
# WITH JUST 'no_selftext'
# ADJUSTING ALPHA
X = whole_df.drop(columns = 'subreddit_name')
y = whole_df['subreddit_name']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify = y)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

tvec = TfidfVectorizer(max_df=0.7, max_features=4_800, stop_words = expanded_proper_names)
vectorized_train = tvec.fit_transform(X_train['all_words'])
vectorized_test = tvec.transform(X_test['all_words'])

vectorized_train = pd.DataFrame(vectorized_train.todense(), columns = tvec.get_feature_names_out())
vectorized_test = pd.DataFrame(vectorized_test.todense(), columns = tvec.get_feature_names_out())

print('vectorized_train shape:', vectorized_train.shape)
print('vectorized_test shape:', vectorized_test.shape)

X_train = X_train.reset_index()
X_test = X_test.reset_index()

print('X_train reset index shape:', X_train.shape)
print('X_test reset index shape:', X_test.shape)

vectorized_train = pd.concat([X_train[['index', 'no_selftext']], vectorized_train], axis = 1)
vectorized_test = pd.concat([X_test[['index', 'no_selftext']], vectorized_test], axis = 1)

print('vectorized_train concatenated shape:', vectorized_train.shape)
print('vectorized_test concatenated shape:', vectorized_test.shape)

vectorized_train = vectorized_train.set_index('index')
vectorized_test = vectorized_test.set_index('index')

print('vectorized_train corrected index shape:', vectorized_train.shape)
print('vectorized_test corrected index shape:', vectorized_test.shape)

nb = MultinomialNB(alpha=0.2)

nb.fit(vectorized_train, y_train)

print('Training score:', nb.score(vectorized_train, y_train))
print('Test score:', nb.score(vectorized_test, y_test))

X_train shape: (5209, 10)
X_test shape: (1737, 10)
y_train shape: (5209,)
y_test shape: (1737,)
vectorized_train shape: (5209, 4800)
vectorized_test shape: (1737, 4800)
X_train reset index shape: (5209, 11)
X_test reset index shape: (1737, 11)
vectorized_train concatenated shape: (5209, 4802)
vectorized_test concatenated shape: (1737, 4802)
vectorized_train corrected index shape: (5209, 4801)
vectorized_test corrected index shape: (1737, 4801)
Training score: 0.9249376079861777
Test score: 0.8733448474381117


In [231]:
# WITH JUST 'no_selftext'
# ADJUSTING ALPHA
X = whole_df.drop(columns = 'subreddit_name')
y = whole_df['subreddit_name']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify = y)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

tvec = TfidfVectorizer(max_df=0.7, max_features=4_800, stop_words = expanded_proper_names)
vectorized_train = tvec.fit_transform(X_train['all_words'])
vectorized_test = tvec.transform(X_test['all_words'])

vectorized_train = pd.DataFrame(vectorized_train.todense(), columns = tvec.get_feature_names_out())
vectorized_test = pd.DataFrame(vectorized_test.todense(), columns = tvec.get_feature_names_out())

print('vectorized_train shape:', vectorized_train.shape)
print('vectorized_test shape:', vectorized_test.shape)

X_train = X_train.reset_index()
X_test = X_test.reset_index()

print('X_train reset index shape:', X_train.shape)
print('X_test reset index shape:', X_test.shape)

vectorized_train = pd.concat([X_train[['index', 'no_selftext']], vectorized_train], axis = 1)
vectorized_test = pd.concat([X_test[['index', 'no_selftext']], vectorized_test], axis = 1)

print('vectorized_train concatenated shape:', vectorized_train.shape)
print('vectorized_test concatenated shape:', vectorized_test.shape)

vectorized_train = vectorized_train.set_index('index')
vectorized_test = vectorized_test.set_index('index')

print('vectorized_train corrected index shape:', vectorized_train.shape)
print('vectorized_test corrected index shape:', vectorized_test.shape)

nb = MultinomialNB(alpha=0.15)

nb.fit(vectorized_train, y_train)

print('Training score:', nb.score(vectorized_train, y_train))
print('Test score:', nb.score(vectorized_test, y_test))

X_train shape: (5209, 10)
X_test shape: (1737, 10)
y_train shape: (5209,)
y_test shape: (1737,)
vectorized_train shape: (5209, 4800)
vectorized_test shape: (1737, 4800)
X_train reset index shape: (5209, 11)
X_test reset index shape: (1737, 11)
vectorized_train concatenated shape: (5209, 4802)
vectorized_test concatenated shape: (1737, 4802)
vectorized_train corrected index shape: (5209, 4801)
vectorized_test corrected index shape: (1737, 4801)
Training score: 0.9272413131119217
Test score: 0.8733448474381117


In [232]:
# WITH JUST 'no_selftext'
# ADJUSTING ALPHA
X = whole_df.drop(columns = 'subreddit_name')
y = whole_df['subreddit_name']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify = y)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

tvec = TfidfVectorizer(max_df=0.7, max_features=4_800, stop_words = expanded_proper_names)
vectorized_train = tvec.fit_transform(X_train['all_words'])
vectorized_test = tvec.transform(X_test['all_words'])

vectorized_train = pd.DataFrame(vectorized_train.todense(), columns = tvec.get_feature_names_out())
vectorized_test = pd.DataFrame(vectorized_test.todense(), columns = tvec.get_feature_names_out())

print('vectorized_train shape:', vectorized_train.shape)
print('vectorized_test shape:', vectorized_test.shape)

X_train = X_train.reset_index()
X_test = X_test.reset_index()

print('X_train reset index shape:', X_train.shape)
print('X_test reset index shape:', X_test.shape)

vectorized_train = pd.concat([X_train[['index', 'no_selftext']], vectorized_train], axis = 1)
vectorized_test = pd.concat([X_test[['index', 'no_selftext']], vectorized_test], axis = 1)

print('vectorized_train concatenated shape:', vectorized_train.shape)
print('vectorized_test concatenated shape:', vectorized_test.shape)

vectorized_train = vectorized_train.set_index('index')
vectorized_test = vectorized_test.set_index('index')

print('vectorized_train corrected index shape:', vectorized_train.shape)
print('vectorized_test corrected index shape:', vectorized_test.shape)

nb = MultinomialNB(alpha=0.1)

nb.fit(vectorized_train, y_train)

print('Training score:', nb.score(vectorized_train, y_train))
print('Test score:', nb.score(vectorized_test, y_test))

X_train shape: (5209, 10)
X_test shape: (1737, 10)
y_train shape: (5209,)
y_test shape: (1737,)
vectorized_train shape: (5209, 4800)
vectorized_test shape: (1737, 4800)
X_train reset index shape: (5209, 11)
X_test reset index shape: (1737, 11)
vectorized_train concatenated shape: (5209, 4802)
vectorized_test concatenated shape: (1737, 4802)
vectorized_train corrected index shape: (5209, 4801)
vectorized_test corrected index shape: (1737, 4801)
Training score: 0.9301209445191015
Test score: 0.872769142199194
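
The two runs above differ only in alpha (0.15 versus 0.1), so a loop can compare several values in one pass. This is a minimal sketch on made-up toy text, not the notebook's data: the example sentences, the alpha list, and the names `mean_scores` and `best_alpha` are assumptions for illustration, and it leaves out the 'no_selftext' column. Scoring with cross-validation on the training data keeps the test set out of the alpha choice.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for X_train['all_words'] and y_train (made-up rows).
texts = pd.Series([
    "captain kirk on the bridge", "klingons attack the enterprise",
    "spock and the vulcan salute", "warp drive to the neutral zone",
    "picard orders tea earl grey", "the borg assimilate a starship",
    "luke trains with a lightsaber", "darth vader and the empire",
    "the jedi council meets yoda", "han solo flies the falcon",
    "the rebel alliance attacks", "a sith lord uses the force",
])
labels = pd.Series(["startrek"] * 6 + ["starwars"] * 6)

alphas = [0.05, 0.1, 0.15, 0.5, 1.0]  # hypothetical grid around the values tried above
mean_scores = {}
for alpha in alphas:
    # The vectorizer sits inside the pipeline so it is refit on each CV training fold.
    model = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=alpha))
    mean_scores[alpha] = cross_val_score(model, texts, labels, cv=3).mean()

best_alpha = max(mean_scores, key=mean_scores.get)
print("Mean CV accuracy by alpha:", mean_scores)
print("Best alpha:", best_alpha)
```

On the real data the same loop would take `X_train['all_words']` and `y_train` in place of the toy series.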


# Model 9, TfidfVectorizer, LogisticRegression, Adding Features
Using best TfidfVectorizer parameters from above
Best parameters: {'log__C': 1.4, 'tvec__max_df': 0.12, 'tvec__max_features': 6500, 'tvec__ngram_range': (1, 2), 'tvec__stop_words': expanded_stopwords}
Training score: 0.9383758878863505
Test score: 0.8307426597582038

The best result comes from adding only 'no_selftext':
Training score: 0.9149548857746208
Test score: 0.8480138169257341

In [207]:
X = whole_df.drop(columns = 'subreddit_name')
y = whole_df['subreddit_name']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify = y)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

tvec = TfidfVectorizer(max_df=0.12, max_features=6_500, ngram_range=(1,2), stop_words = expanded_stopwords)
vectorized_train = tvec.fit_transform(X_train['all_words'])
vectorized_test = tvec.transform(X_test['all_words'])

vectorized_train = pd.DataFrame(vectorized_train.todense(), columns = tvec.get_feature_names_out())
vectorized_test = pd.DataFrame(vectorized_test.todense(), columns = tvec.get_feature_names_out())

print('vectorized_train shape:', vectorized_train.shape)
print('vectorized_test shape:', vectorized_test.shape)

X_train = X_train.reset_index()
X_test = X_test.reset_index()

print('X_train reset index shape:', X_train.shape)
print('X_test reset index shape:', X_test.shape)

vectorized_train = pd.concat([X_train[['index', 'submission_length', 'title_length', 'submission_word_count', 'title_word_count']], vectorized_train], axis = 1)
vectorized_test = pd.concat([X_test[['index', 'submission_length', 'title_length', 'submission_word_count', 'title_word_count']], vectorized_test], axis = 1)

print('vectorized_train concatenated shape:', vectorized_train.shape)
print('vectorized_test concatenated shape:', vectorized_test.shape)

vectorized_train = vectorized_train.set_index('index')
vectorized_test = vectorized_test.set_index('index')

print('vectorized_train corrected index shape:', vectorized_train.shape)
print('vectorized_test corrected index shape:', vectorized_test.shape)

log = LogisticRegression(max_iter=10_000, C=1.4)

log.fit(vectorized_train, y_train)

print('Training score:', log.score(vectorized_train, y_train))
print('Test score:', log.score(vectorized_test, y_test))

X_train shape: (5209, 10)
X_test shape: (1737, 10)
y_train shape: (5209,)
y_test shape: (1737,)
vectorized_train shape: (5209, 6500)
vectorized_test shape: (1737, 6500)
X_train reset index shape: (5209, 11)
X_test reset index shape: (1737, 11)
vectorized_train concatenated shape: (5209, 6505)
vectorized_test concatenated shape: (1737, 6505)
vectorized_train corrected index shape: (5209, 6504)
vectorized_test corrected index shape: (1737, 6504)


In [212]:
pipe = Pipeline([('log', LogisticRegression(max_iter = 10_000))])

pipe_params = {
    'log__C': [0.5, 1, 1.5]
}

gs = GridSearchCV(pipe,
                  param_grid = pipe_params,
                  n_jobs=-1)

t0 = time.time()
gs.fit(vectorized_train, y_train)
print('Time to run:', time.time()-t0)

print('Best parameters:', gs.best_params_)
print('Training score:', gs.score(vectorized_train, y_train))
print('Test score:', gs.score(vectorized_test, y_test))

Best parameters: {'log__C': 1.5}
Training score: 0.9251295834133231
Test score: 0.82671272308578
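
The grid search above only tunes C, because the text was vectorized before the pipeline was built. One alternative is to put the vectorizer and the extra column into the pipeline with a ColumnTransformer, so that the TfidfVectorizer parameters and C are searched together and the vectorizer is refit on each cross-validation fold. This is a sketch under assumptions, not the notebook's method: the step names (`features`, `tvec`, `extra`), the parameter grid, and the toy rows are all made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

features = ColumnTransformer([
    # A plain string selects the column as 1-D text, which TfidfVectorizer expects.
    ("tvec", TfidfVectorizer(), "all_words"),
    # A list selects a 2-D block that is passed through unchanged.
    ("extra", "passthrough", ["no_selftext"]),
])
pipe = Pipeline([
    ("features", features),
    ("log", LogisticRegression(max_iter=10_000)),
])

pipe_params = {
    "features__tvec__ngram_range": [(1, 1), (1, 2)],
    "log__C": [0.5, 1, 1.5],
}

# Toy stand-in for whole_df (made-up rows).
toy = pd.DataFrame({
    "all_words": [
        "captain kirk on the bridge", "klingons attack the enterprise",
        "spock and the vulcan salute", "warp drive to the neutral zone",
        "picard orders tea earl grey", "the borg assimilate a starship",
        "luke trains with a lightsaber", "darth vader and the empire",
        "the jedi council meets yoda", "han solo flies the falcon",
        "the rebel alliance attacks", "a sith lord uses the force",
    ],
    "no_selftext": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    "subreddit_name": ["startrek"] * 6 + ["starwars"] * 6,
})

gs = GridSearchCV(pipe, param_grid=pipe_params, cv=3)
gs.fit(toy[["all_words", "no_selftext"]], toy["subreddit_name"])
print("Best parameters:", gs.best_params_)
```

On the real data, `gs.fit` would take the unvectorized `X_train` columns and `y_train`, and `gs.score` would work on `X_test` without any manual concatenation.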


In [226]:
# BEST RESULT: adding only 'no_selftext' to the TF-IDF features

X = whole_df.drop(columns = 'subreddit_name')
y = whole_df['subreddit_name']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify = y)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

tvec = TfidfVectorizer(max_df=0.12, max_features=6_500, ngram_range=(1,2), stop_words = expanded_stopwords)
vectorized_train = tvec.fit_transform(X_train['all_words'])
vectorized_test = tvec.transform(X_test['all_words'])

vectorized_train = pd.DataFrame(vectorized_train.todense(), columns = tvec.get_feature_names_out())
vectorized_test = pd.DataFrame(vectorized_test.todense(), columns = tvec.get_feature_names_out())

print('vectorized_train shape:', vectorized_train.shape)
print('vectorized_test shape:', vectorized_test.shape)

X_train = X_train.reset_index()
X_test = X_test.reset_index()

print('X_train reset index shape:', X_train.shape)
print('X_test reset index shape:', X_test.shape)

vectorized_train = pd.concat([X_train[['index', 'no_selftext']], vectorized_train], axis = 1)
vectorized_test = pd.concat([X_test[['index', 'no_selftext']], vectorized_test], axis = 1)

print('vectorized_train concatenated shape:', vectorized_train.shape)
print('vectorized_test concatenated shape:', vectorized_test.shape)

vectorized_train = vectorized_train.set_index('index')
vectorized_test = vectorized_test.set_index('index')

print('vectorized_train corrected index shape:', vectorized_train.shape)
print('vectorized_test corrected index shape:', vectorized_test.shape)

log = LogisticRegression(max_iter=10_000, C=1.4)

log.fit(vectorized_train, y_train)

print('Training score:', log.score(vectorized_train, y_train))
print('Test score:', log.score(vectorized_test, y_test))

X_train shape: (5209, 10)
X_test shape: (1737, 10)
y_train shape: (5209,)
y_test shape: (1737,)
vectorized_train shape: (5209, 6500)
vectorized_test shape: (1737, 6500)
X_train reset index shape: (5209, 11)
X_test reset index shape: (1737, 11)
vectorized_train concatenated shape: (5209, 6502)
vectorized_test concatenated shape: (1737, 6502)
vectorized_train corrected index shape: (5209, 6501)
vectorized_test corrected index shape: (1737, 6501)
Training score: 0.9149548857746208
Test score: 0.8480138169257341


In [225]:
X = whole_df.drop(columns = 'subreddit_name')
y = whole_df['subreddit_name']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify = y)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

tvec = TfidfVectorizer(max_df=0.12, max_features=6_500, ngram_range=(1,2), stop_words = expanded_stopwords)
vectorized_train = tvec.fit_transform(X_train['all_words'])
vectorized_test = tvec.transform(X_test['all_words'])

vectorized_train = pd.DataFrame(vectorized_train.todense(), columns = tvec.get_feature_names_out())
vectorized_test = pd.DataFrame(vectorized_test.todense(), columns = tvec.get_feature_names_out())

print('vectorized_train shape:', vectorized_train.shape)
print('vectorized_test shape:', vectorized_test.shape)

X_train = X_train.reset_index()
X_test = X_test.reset_index()

print('X_train reset index shape:', X_train.shape)
print('X_test reset index shape:', X_test.shape)

vectorized_train = pd.concat([X_train[['index', 'no_selftext', 'avg_word_length']], vectorized_train], axis = 1)
vectorized_test = pd.concat([X_test[['index', 'no_selftext', 'avg_word_length']], vectorized_test], axis = 1)

print('vectorized_train concatenated shape:', vectorized_train.shape)
print('vectorized_test concatenated shape:', vectorized_test.shape)

vectorized_train = vectorized_train.set_index('index')
vectorized_test = vectorized_test.set_index('index')

print('vectorized_train corrected index shape:', vectorized_train.shape)
print('vectorized_test corrected index shape:', vectorized_test.shape)

log = LogisticRegression(max_iter=10_000, C=1.4)

log.fit(vectorized_train, y_train)

print('Training score:', log.score(vectorized_train, y_train))
print('Test score:', log.score(vectorized_test, y_test))

X_train shape: (5209, 10)
X_test shape: (1737, 10)
y_train shape: (5209,)
y_test shape: (1737,)
vectorized_train shape: (5209, 6500)
vectorized_test shape: (1737, 6500)
X_train reset index shape: (5209, 11)
X_test reset index shape: (1737, 11)
vectorized_train concatenated shape: (5209, 6503)
vectorized_test concatenated shape: (1737, 6503)
vectorized_train corrected index shape: (5209, 6502)
vectorized_test corrected index shape: (1737, 6502)
Training score: 0.9130351315031676
Test score: 0.8474381116868164


In [224]:
X = whole_df.drop(columns = 'subreddit_name')
y = whole_df['subreddit_name']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify = y)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

tvec = TfidfVectorizer(max_df=0.12, max_features=6_500, ngram_range=(1,2), stop_words = expanded_stopwords)
vectorized_train = tvec.fit_transform(X_train['all_words'])
vectorized_test = tvec.transform(X_test['all_words'])

vectorized_train = pd.DataFrame(vectorized_train.todense(), columns = tvec.get_feature_names_out())
vectorized_test = pd.DataFrame(vectorized_test.todense(), columns = tvec.get_feature_names_out())

print('vectorized_train shape:', vectorized_train.shape)
print('vectorized_test shape:', vectorized_test.shape)

X_train = X_train.reset_index()
X_test = X_test.reset_index()

print('X_train reset index shape:', X_train.shape)
print('X_test reset index shape:', X_test.shape)

vectorized_train = pd.concat([X_train[['index', 'no_selftext', 'submission_length']], vectorized_train], axis = 1)
vectorized_test = pd.concat([X_test[['index', 'no_selftext', 'submission_length']], vectorized_test], axis = 1)

print('vectorized_train concatenated shape:', vectorized_train.shape)
print('vectorized_test concatenated shape:', vectorized_test.shape)

vectorized_train = vectorized_train.set_index('index')
vectorized_test = vectorized_test.set_index('index')

print('vectorized_train corrected index shape:', vectorized_train.shape)
print('vectorized_test corrected index shape:', vectorized_test.shape)

log = LogisticRegression(max_iter=10_000, C=1.4)

log.fit(vectorized_train, y_train)

print('Training score:', log.score(vectorized_train, y_train))
print('Test score:', log.score(vectorized_test, y_test))

X_train shape: (5209, 10)
X_test shape: (1737, 10)
y_train shape: (5209,)
y_test shape: (1737,)
vectorized_train shape: (5209, 6500)
vectorized_test shape: (1737, 6500)
X_train reset index shape: (5209, 11)
X_test reset index shape: (1737, 11)
vectorized_train concatenated shape: (5209, 6503)
vectorized_test concatenated shape: (1737, 6503)
vectorized_train corrected index shape: (5209, 6502)
vectorized_test corrected index shape: (1737, 6502)
Training score: 0.8892301785371473
Test score: 0.8411053540587219
