# Test Modelling: Classifying Subreddits Based On Text

In this notebook, I compare train/test scores on my data using the following four models. For each model, I used both CountVectorizer and TfidfVectorizer across a range of hyperparameters set with Pipeline and GridSearchCV, for a total of 8 model/vectorizer combinations. I toggled hyperparameter values in and out until I found optimized settings.

1. Logistic Regression
2. K-Nearest Neighbors
3. Naive Bayes: Bernoulli
4. Naive Bayes: Multinomial

Following the 8 models, I compare their scores to see which models I will use for the rest of this project. My X variable is the text scraped and cleaned in my first notebook (both submissions and comments), and my y variable is the subreddit whence the text came. 

### Library Imports

In [2]:
# Import basic libraries
import numpy as np
import pandas as pd

# Import modelling libraries
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

### Read In Cleaned Data & Prepare Variables For Modelling

In [3]:
# Import dataset
west_house = pd.read_csv('../data/west_house.csv')

In [None]:
# Check for nulls just to be safe
west_house.isnull().sum()

In [3]:
# Sets X variable as text column and sets y variable as subreddit column
# Prepares train/test split
X = west_house['text']
y = west_house['subreddit']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [4]:
# Sets a baseline accuracy score
y.value_counts(normalize=True)

1    0.50069
0    0.49931
Name: subreddit, dtype: float64
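The baseline here is the accuracy a model would get by always predicting the most frequent class, so with a roughly 50/50 split any useful model has to clear about 0.50. A minimal sketch of that idea, using made-up toy labels rather than the notebook's data (`y_toy` and `X_dummy` are hypothetical names):

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# Toy labels (hypothetical, not the notebook's data): 6 of class 1, 4 of class 0
y_toy = pd.Series([1] * 6 + [0] * 4, name="subreddit")

# Baseline accuracy = share of the most frequent class
baseline = y_toy.value_counts(normalize=True).max()

# The same number from a classifier that always predicts the majority class;
# the features are ignored, so a column of zeros is enough
X_dummy = np.zeros((len(y_toy), 1))
dummy = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_toy)
dummy_score = dummy.score(X_dummy, y_toy)

print(baseline, dummy_score)
```

Both numbers agree, which is why reading the largest value off `value_counts(normalize=True)` is enough for a baseline.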

## Logistic Regression

**Pipe & Grid CountVectorizer**

In [5]:
# Sets pipeline to transformer and estimator
pipe = Pipeline([('vec', CountVectorizer()),
                 ('model', LogisticRegression())])

# Sets pipeline parameters to toggle hyperparameter settings on/off
pipe_params = {
    'vec__max_features': [
                        #50, 
                        #100,
                        250
                        #500,
                        #600,
                        #1000, 
                        #5000
                        ],
    'vec__min_df': [
                   #1,
                   2, 
                   3, 
                   #5,
                   #10 
                   ],
    'vec__max_df': [
                   #.1,
                   .25, 
                   .33, 
                   #.5, 
                   #.9
                    ],
    'vec__ngram_range': [
                        (1, 1), 
                        (1, 2),
                        #(1, 3),
                        #(1, 4)
                        ]
                        }

# Passes transformer/estimator Pipeline into GridSearch 
grid = GridSearchCV(pipe, 
                       pipe_params,
                       cv=5)

# Fits GridSearch model to X_train and y_train
grid.fit(X_train, y_train)

# Assigns best estimator to variable and scores both train and test datasets
mod = grid.best_estimator_
print('Train: ', mod.score(X_train, y_train))
print('Test: ', mod.score(X_test, y_test))
grid.best_params_

Train:  0.7950133868808568
Test:  0.7772202709483191


{'vec__max_df': 0.25,
 'vec__max_features': 250,
 'vec__min_df': 2,
 'vec__ngram_range': (1, 1)}
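The pipeline/grid-search cell is repeated almost verbatim for every model and vectorizer in this notebook. One possible way to cut the repetition is a small helper like the sketch below. The function name `run_grid`, the toy corpus, and the small parameter grid are all hypothetical and only stand in for the real data and settings; the sketch also reports `grid.best_score_`, the mean cross-validated accuracy of the winning setting, which the cells in this notebook do not print.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline


def run_grid(vectorizer, estimator, params, X_train, X_test, y_train, y_test, cv=5):
    """Grid-search a vectorizer + estimator pipeline and report its scores."""
    pipe = Pipeline([("vec", vectorizer), ("model", estimator)])
    grid = GridSearchCV(pipe, params, cv=cv)
    grid.fit(X_train, y_train)
    best = grid.best_estimator_
    return {
        "train": best.score(X_train, y_train),
        "test": best.score(X_test, y_test),
        "cv": grid.best_score_,  # mean cross-validated accuracy of the best setting
        "params": grid.best_params_,
    }


# Toy corpus (hypothetical stand-in for the scraped subreddit text)
suffixes = ["bill", "policy", "debate", "staff", "press",
            "budget", "veto", "campaign", "office", "aide"]
texts = ([f"president senate vote {w}" for w in suffixes] * 2
         + [f"episode season finale {w}" for w in suffixes] * 2)
labels = [1] * 20 + [0] * 20

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, random_state=42, stratify=labels)

# Small illustrative grid; the real grids in this notebook are larger
params = {"vec__max_features": [10, 20],
          "vec__ngram_range": [(1, 1), (1, 2)]}

results = {}
for name, vec in [("CountVectorizer", CountVectorizer()),
                  ("TfidfVectorizer", TfidfVectorizer())]:
    results[name] = run_grid(vec, LogisticRegression(), params,
                             X_tr, X_te, y_tr, y_te)
    print(name, results[name])
```

Swapping `LogisticRegression()` for another estimator and passing a different `params` dictionary would cover the other model/vectorizer pairs without copying the cell.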

**Pipe & Grid TfidfVectorizer**

In [6]:
# Sets pipeline to transformer and estimator
pipe = Pipeline([('vec', TfidfVectorizer()),
                 ('model', LogisticRegression())])

# Sets pipeline parameters to toggle hyperparameter settings on/off
pipe_params = {
    'vec__max_features': [
                        #50, 
                        #100,
                        250
                        #500,
                        #600,
                        #1000, 
                        #5000
                        ],
    'vec__min_df': [
                   #1,
                   2, 
                   3, 
                   #5,
                   #10 
                   ],
    'vec__max_df': [
                   #.1,
                   .25, 
                   .33, 
                   #.5, 
                   #.9
                    ],
    'vec__ngram_range': [
                        (1, 1), 
                        (1, 2),
                        #(1, 3),
                        #(1, 4)
                        ]
                        }

# Passes transformer/estimator Pipeline into GridSearch 
grid = GridSearchCV(pipe, 
                       pipe_params,
                       cv=5)

# Fits GridSearch model to X_train and y_train
grid.fit(X_train, y_train)

# Assigns best estimator to variable and scores both train and test datasets
mod = grid.best_estimator_
print('Train: ', mod.score(X_train, y_train))
print('Test: ', mod.score(X_test, y_test))
grid.best_params_

Train:  0.7961847389558233
Test:  0.772704465629704


{'vec__max_df': 0.25,
 'vec__max_features': 250,
 'vec__min_df': 2,
 'vec__ngram_range': (1, 2)}

## KNN

**Pipe & Grid CountVectorizer**

In [7]:
# Sets pipeline to transformer and estimator
pipe = Pipeline([('vec', CountVectorizer()),
                 ('model', KNeighborsClassifier())])

# Sets pipeline parameters to toggle hyperparameter settings on/off
pipe_params = {
    'vec__max_features': [
                        #50, 
                        #100, 
                        500,
                        #600,
                        1000, 
                        #5000
                        ],
    'vec__min_df': [
                   1, 
                   3, 
                   5,
                   #10 
                   ],
    'vec__max_df': [
                   .1, 
                   .33, 
                   .5, 
                   #.9
                    ],
    'vec__ngram_range': [
                        (1, 1), 
                        (1, 2),
                        #(1, 3),
                        #(1, 4)
                        ]
                        }

# Passes transformer/estimator Pipeline into GridSearch 
grid = GridSearchCV(pipe, 
                       pipe_params,
                       cv=5)

# Fits GridSearch model to X_train and y_train
grid.fit(X_train, y_train)

# Assigns best estimator to variable and scores both train and test datasets
mod = grid.best_estimator_
print('Train: ', mod.score(X_train, y_train))
print('Test: ', mod.score(X_test, y_test))
grid.best_params_

Train:  0.7873159303882196
Test:  0.6909182137481185


{'vec__max_df': 0.33,
 'vec__max_features': 500,
 'vec__min_df': 3,
 'vec__ngram_range': (1, 2)}

**Pipe & Grid TfidfVectorizer**

In [8]:
# Sets pipeline to transformer and estimator
pipe = Pipeline([('vec', TfidfVectorizer()),
                 ('model', KNeighborsClassifier())])

# Sets pipeline parameters to toggle hyperparameter settings on/off
pipe_params = {
    'vec__max_features': [
                        #50, 
                        #100, 
                        500,
                        #600,
                        1000, 
                        #5000
                        ],
    'vec__min_df': [
                   1, 
                   3, 
                   5,
                   #10 
                   ],
    'vec__max_df': [
                   #.1, 
                   .33, 
                   .5, 
                   #.9
                    ],
    'vec__ngram_range': [
                        (1, 1), 
                        (1, 2),
                        #(1, 3),
                        #(1, 4)
                        ]
                        }

# Passes transformer/estimator Pipeline into GridSearch 
grid = GridSearchCV(pipe, 
                       pipe_params,
                       cv=5)

# Fits GridSearch model to X_train and y_train
grid.fit(X_train, y_train)

# Assigns best estimator to variable and scores both train and test datasets
mod = grid.best_estimator_
print('Train: ', mod.score(X_train, y_train))
print('Test: ', mod.score(X_test, y_test))
grid.best_params_

Train:  0.7655622489959839
Test:  0.6292022077270446


{'vec__max_df': 0.33,
 'vec__max_features': 500,
 'vec__min_df': 3,
 'vec__ngram_range': (1, 1)}

## Multinomial Naive Bayes

**Pipe & Grid CountVectorizer**

In [9]:
# Sets pipeline to transformer and estimator
pipe = Pipeline([('vec', CountVectorizer()),
                 ('model', MultinomialNB())])

# Sets pipeline parameters to toggle hyperparameter settings on/off
pipe_params = {
    'vec__max_features': [
                        #50, 
                        #100, 
                        #500,
                        600,
                        #1000, 
                        #5000
                        ],
    'vec__min_df': [
                   #1, 
                   #3, 
                   #5,
                   10 
                   ],
    'vec__max_df': [
                   .1, 
                   .33, 
                   #.5, 
                   #.9
                    ],
    'vec__ngram_range': [
                        (1, 1), 
                        (1, 2),
                        #(1, 3),
                        #(1, 4)
                        ]
                        }

# Passes transformer/estimator Pipeline into GridSearch 
grid = GridSearchCV(pipe, 
                       pipe_params,
                       cv=5)

# Fits GridSearch model to X_train and y_train
grid.fit(X_train, y_train)

# Assigns best estimator to variable and scores both train and test datasets
mod = grid.best_estimator_
print('Train: ', mod.score(X_train, y_train))
print('Test: ', mod.score(X_test, y_test))
grid.best_params_

Train:  0.8122489959839357
Test:  0.8023080782739589


{'vec__max_df': 0.33,
 'vec__max_features': 600,
 'vec__min_df': 10,
 'vec__ngram_range': (1, 2)}

**Pipe & Grid TfidfVectorizer**

In [10]:
# Sets pipeline to transformer and estimator
pipe = Pipeline([('vec', TfidfVectorizer()),
                 ('model', MultinomialNB())])

# Sets pipeline parameters to toggle hyperparameter settings on/off
pipe_params = {
    'vec__max_features': [
                        #50, 
                        #100, 
                        #500,
                        600,
                        #1000, 
                        #5000
                        ],
    'vec__min_df': [
                   #1, 
                   #3, 
                   #5,
                   10 
                   ],
    'vec__max_df': [
                   #.1, 
                   .33, 
                   #.5, 
                   #.9
                    ],
    'vec__ngram_range': [
                        (1, 1), 
                        (1, 2),
                        #(1, 3),
                        #(1, 4)
                        ]
                        }

# Passes transformer/estimator Pipeline into GridSearch 
grid = GridSearchCV(pipe, 
                       pipe_params,
                       cv=5)

# Fits GridSearch model to X_train and y_train
grid.fit(X_train, y_train)

# Assigns best estimator to variable and scores both train and test datasets
mod = grid.best_estimator_
print('Train: ', mod.score(X_train, y_train))
print('Test: ', mod.score(X_test, y_test))
grid.best_params_

Train:  0.8274765729585006
Test:  0.787255393878575


{'vec__max_df': 0.33,
 'vec__max_features': 600,
 'vec__min_df': 10,
 'vec__ngram_range': (1, 1)}

## Bernoulli Naive Bayes

**Pipe & Grid CountVectorizer**

In [11]:
# Sets pipeline to transformer and estimator
pipe = Pipeline([('vec', CountVectorizer()),
                 ('model', BernoulliNB())])

# Sets pipeline parameters to toggle hyperparameter settings on/off
pipe_params = {
    'vec__max_features': [
                        #50, 
                        #100, 
                        #500,
                        600,
                        #1000, 
                        #5000
                        ],
    'vec__min_df': [
                   #1, 
                   #3, 
                   5,
                   10 
                   ],
    'vec__max_df': [
                   .1, 
                   .33, 
                   #.5, 
                   #.9
                    ],
    'vec__ngram_range': [
                        (1, 1), 
                        (1, 2),
                        #(1, 3),
                        #(1, 4)
                        ]
                        }

# Passes transformer/estimator Pipeline into GridSearch 
grid = GridSearchCV(pipe, 
                       pipe_params,
                       cv=5)

# Fits GridSearch model to X_train and y_train
grid.fit(X_train, y_train)

# Assigns best estimator to variable and scores both train and test datasets
mod = grid.best_estimator_
print('Train: ', mod.score(X_train, y_train))
print('Test: ', mod.score(X_test, y_test))
grid.best_params_

Train:  0.7901606425702812
Test:  0.7757150025087808


{'vec__max_df': 0.33,
 'vec__max_features': 600,
 'vec__min_df': 5,
 'vec__ngram_range': (1, 2)}

**Pipe & Grid TfidfVectorizer**

In [12]:
# Sets pipeline to transformer and estimator
pipe = Pipeline([('vec', TfidfVectorizer()),
                 ('model', BernoulliNB())])

# Sets pipeline parameters to toggle hyperparameter settings on/off
pipe_params = {
    'vec__max_features': [
                        #50, 
                        #100, 
                        #500,
                        600,
                        #1000, 
                        #5000
                        ],
    'vec__min_df': [
                   #1, 
                   #3, 
                   5,
                   10 
                   ],
    'vec__max_df': [
                   #.1, 
                   .33, 
                   #.5, 
                   #.9
                    ],
    'vec__ngram_range': [
                        (1, 1), 
                        (1, 2),
                        #(1, 3),
                        #(1, 4)
                        ]
                        }

# Passes transformer/estimator Pipeline into GridSearch 
grid = GridSearchCV(pipe, 
                       pipe_params,
                       cv=5)

# Fits GridSearch model to X_train and y_train
grid.fit(X_train, y_train)

# Assigns best estimator to variable and scores both train and test datasets
mod = grid.best_estimator_
print('Train: ', mod.score(X_train, y_train))
print('Test: ', mod.score(X_test, y_test))
grid.best_params_

Train:  0.7901606425702812
Test:  0.7757150025087808


{'vec__max_df': 0.33,
 'vec__max_features': 600,
 'vec__min_df': 5,
 'vec__ngram_range': (1, 2)}

## Model Analysis

In the fine-tuning process above, I quickly noticed that KNN was performing much more poorly than the others, and I abandoned it without spending too much time adjusting the hyperparameters. The variance was high (a train/test gap of roughly 0.10 to 0.14) AND the scores were low, and mitigating the variance would almost certainly pull the higher (train) score down even further.

Of the two Naive Bayes models tried, Bernoulli performed slightly worse than Multinomial. I actually expected Bernoulli to perform much worse, since Bernoulli is meant to be used with binary X variables and neither CountVectorizer nor TfidfVectorizer converts data into strictly 1s and 0s. The likely explanation is that scikit-learn's BernoulliNB binarizes its input by default (binarize=0.0), so any nonzero count or TF-IDF weight is treated as a 1. That would also explain why the two Bernoulli runs above returned identical scores and best parameters.
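One way to check how BernoulliNB treats vectorizer output is to fit it on count features and on TF-IDF features for the same text and compare the predicted probabilities. The sketch below does this on a tiny made-up corpus (the documents and labels are hypothetical), relying on BernoulliNB's default `binarize=0.0`, which maps every value above zero to 1 before fitting.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

# Toy corpus (hypothetical); repeated words give counts above 1
docs = [
    "vote vote vote senate",
    "senate bill vote",
    "episode episode finale",
    "finale season episode season",
]
labels = [1, 1, 0, 0]

X_count = CountVectorizer().fit_transform(docs)
X_tfidf = TfidfVectorizer().fit_transform(docs)

# The raw matrices differ: integer counts vs. normalized TF-IDF weights
matrices_differ = not np.allclose(X_count.toarray(), X_tfidf.toarray())

# BernoulliNB's default binarize=0.0 maps every value > 0 to 1 before fitting,
# so both matrices collapse to the same presence/absence features
proba_count = BernoulliNB().fit(X_count, labels).predict_proba(X_count)
proba_tfidf = BernoulliNB().fit(X_tfidf, labels).predict_proba(X_tfidf)
same_probabilities = np.allclose(proba_count, proba_tfidf)

print(matrices_differ, same_probabilities)
```

Because both vectorizers produce nonzero entries in exactly the same positions, the model sees the same presence/absence matrix either way, which is consistent with the matching Bernoulli scores reported above.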

The remaining four estimator/transformer combinations above produced very similar optimal scores after several rounds of hyperparameter tweaking. Multinomial Naive Bayes with CountVectorizer had both train and test scores over 0.80, so that will be my Naive Bayes model. Though Logistic Regression scored slightly better on the test set with CountVectorizer than with TfidfVectorizer, the difference is marginal, and I'd like to use two completely different models (with different transformers) so that I can approach my data in two completely different ways. I will therefore carry Multinomial Naive Bayes with CountVectorizer and Logistic Regression with TfidfVectorizer through the rest of the project.
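To make the comparison easier to scan, the scores printed above can be gathered into one table. The sketch below copies the train/test values from the cell outputs in this notebook (rounded to four decimals) and adds the train/test gap as a rough indicator of variance; the column names are my own choice.

```python
import pandas as pd

# Scores copied from the cell outputs above, rounded to four decimals
scores = pd.DataFrame(
    [
        ("Logistic Regression", "CountVectorizer", 0.7950, 0.7772),
        ("Logistic Regression", "TfidfVectorizer", 0.7962, 0.7727),
        ("KNN", "CountVectorizer", 0.7873, 0.6909),
        ("KNN", "TfidfVectorizer", 0.7656, 0.6292),
        ("Multinomial NB", "CountVectorizer", 0.8122, 0.8023),
        ("Multinomial NB", "TfidfVectorizer", 0.8275, 0.7873),
        ("Bernoulli NB", "CountVectorizer", 0.7902, 0.7757),
        ("Bernoulli NB", "TfidfVectorizer", 0.7902, 0.7757),
    ],
    columns=["model", "vectorizer", "train", "test"],
)

# Train/test gap: a larger gap suggests more overfitting (higher variance)
scores["gap"] = scores["train"] - scores["test"]

# Rank the eight combinations by test accuracy
ranked = scores.sort_values("test", ascending=False).reset_index(drop=True)
print(ranked.round(4))
```

Ranked this way, Multinomial Naive Bayes with CountVectorizer sits at the top, and the two KNN rows sit at the bottom with the largest gaps.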