# Preprocessing & Modeling

Notebook to perform text preprocessing and modeling. Preprocessing and transformations will occur simultaneously with modeling, as we will use GridSearch and similar techniques to tune the hyperparameters of both the model and the transformations.

> **Data Science Problems**<br> 
1) *Given the text contained within the title and original post from r/woodworking and r/mtb can we predict which subreddit the post came from with >85% accuracy?*<br> 
2) Further, using the same model and hyperparameters can we achieve >80% accuracy using the two similar subreddits r/mtb and r/bicycling?

## Contents

- [Imports & Functions](#Imports-&-Functions)
- [Baseline Model & Importing Data](#Baseline-Model-&-Importing-Data)
- [Logistic Regression Model](#Logistic-Regression-Model)
- [X Transformation](#X-Transformation)
- [KNN Model](#KNN-Model)
- [Naive Bayes Model](#Naive-Bayes-Model)
- [Random Forest Model](#Random-Forest-Model)
- [VotingClassifier Model](#VotingClassifier-Model)

### Imports & Functions

In [72]:
# Key Imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# General Modeling Imports 
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix

# NLP Imports
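# The sklearn.feature_extraction.stop_words module is gone from recent scikit-learn releases
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS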
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.ensemble import VotingClassifier, RandomForestClassifier

In [28]:
# Function to calculate and display classification metrics; assumes a binary (0/1) y
def class_metrics(model, X, y):
    # Generate predictions
    preds = model.predict(X)
    # Get confusion matrix and unravel
    tn, fp, fn, tp = confusion_matrix(y,preds).ravel()
    # Accuracy
    print(f'Accuracy: {round((tp+tn)/len(y),3)}')
    # Sensitivity
    print(f'Sensitivity: {round(tp/(tp+fn),3)}')
    # Specificity
    print(f'Specificity: {round(tn/(tn+fp),3)}')
    # Precision
    print(f'Precision: {round(tp/(tp+fp),3)}')
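
As a quick sanity check on the formulas used in `class_metrics`, the sketch below recomputes the same four metrics from raw confusion-matrix counts. The helper name `metrics_from_counts` and the counts themselves are hypothetical and chosen only for illustration; they are not results from this notebook.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Return the four metrics printed by class_metrics, given raw counts."""
    total = tn + fp + fn + tp
    return {
        'accuracy': (tp + tn) / total,     # share of all posts classified correctly
        'sensitivity': tp / (tp + fn),     # share of actual 1's that were found
        'specificity': tn / (tn + fp),     # share of actual 0's that were found
        'precision': tp / (tp + fp),       # share of predicted 1's that are correct
    }

# Hypothetical counts for illustration only
example = metrics_from_counts(tn=40, fp=10, fn=5, tp=45)
for name, value in example.items():
    print(f'{name}: {round(value, 3)}')
```

Sensitivity and specificity are per-class recall, which is why comparing them tells us whether a model favors one subreddit over the other.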

In [36]:
# Analyzers so that we can stem in our pipelines
# Thanks joeln
# https://stackoverflow.com/questions/36182502/add-stemming-support-to-countvectorizer-sklearn/36191362

# PorterStemmer - CVEC
stemmer = PorterStemmer()
cvec_analyzer = CountVectorizer().build_analyzer()

def porter_cvec_words(doc):
    return (stemmer.stem(w) for w in cvec_analyzer(doc))

# PorterStemmer - TFIDF
tfidf_analyzer = TfidfVectorizer().build_analyzer()

def porter_tfidf_words(doc):
    return (stemmer.stem(w) for w in tfidf_analyzer(doc))

# WordNetLemmatizer - CVEC
lemm = WordNetLemmatizer()

def lemm_cvec_words(doc):
    return (lemm.lemmatize(w) for w in cvec_analyzer(doc))

# WordNetLemmatizer - TFIDF
def lemm_tfidf_words(doc):
    return (lemm.lemmatize(w) for w in tfidf_analyzer(doc))

### Baseline Model & Importing Data

Our data set has balanced classes and we will stratify on y when performing our train test split. This means that our baseline accuracy score will be 0.5, since that is the score we would get by predicting that every post came from one subreddit or the other.
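
The baseline reasoning can be sketched in a few lines: a majority-class baseline scores the share of the most common label. The helper name `baseline_accuracy` and the toy label lists are assumptions made for illustration, not the notebook's data.

```python
from collections import Counter

def baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Toy labels for illustration only
balanced_labels = [0, 1] * 10      # 10 posts from each subreddit
skewed_labels = [1, 1, 1, 0]       # 3 posts from one subreddit, 1 from the other

balanced_baseline = baseline_accuracy(balanced_labels)
skewed_baseline = baseline_accuracy(skewed_labels)
print(balanced_baseline, skewed_baseline)
```

With balanced classes the baseline is 0.5, so any model scoring well above that is learning something from the text.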

We won't model this out, but let's import our data and perform our train test split to get ready for modeling.

In [8]:
# Read in data
df = pd.read_csv('../data/subreddit_text.csv')
df.head()

Unnamed: 0,title,selftext,subreddit,text
0,Is it a bad idea to apply oil based polyuretha...,What are your experiences,1,Is it a bad idea to apply oil based polyuretha...
1,First project: a needlessly complicated learni...,,1,First project: a needlessly complicated learni...
2,"Welded this 2x2 steel tube TV console table, u...",,1,"Welded this 2x2 steel tube TV console table, u..."
3,Router means: Marble Run! Making toys for my S...,,1,Router means: Marble Run! Making toys for my S...
4,structural feasability,"Hey everyone, need some design advice from fol...",1,"structural feasability Hey everyone, need some..."


In [9]:
df.isnull().sum()

title            0
selftext     12241
subreddit        0
text             7
dtype: int64

We can see that over half of the 'selftext' values are empty and that 7 'text' values are empty, which would indicate that those posts have neither a title nor text. Let's look at these observations, since 'text' shouldn't be empty given that every row has a title. We will consider dropping those 7 observations since we have 20,000 total observations and we can't trust anything we learn from empty posts.

Additionally we will focus on the 'text' column to create our X variables as it has the information from both the 'title' and 'selftext' embedded within it.

In [10]:
df[df['text'].isnull()]

Unnamed: 0,title,selftext,subreddit,text
4434,Which is the better table saw? Help,,1,
4488,Looking for table saw input. Help,,1,
4492,2x4 45 degree cut off corner scrap as shelf su...,,1,
11276,It's been 22 years Bros. Time to start sending...,,0,
11333,"goodbye Stache, hello MEGA TRS",,0,
11336,Goodbye stache! Hello nukeproof,,0,
12453,Used bike w/upgrades. Good idea? Or no? Also f...,,0,


Clearly something went wrong when saving over our csv or during previous cleaning. Let's recreate the 'text' column to ensure everything is good to go.

In [13]:
# Fill NaNs with '' so that we can concatenate the strings
df['selftext'] = df['selftext'].fillna('')
df['text'] = df['title'] + ' ' + df['selftext']
df.isnull().sum()

title        0
selftext     0
subreddit    0
text         0
dtype: int64

In [14]:
# Set up our X and y variables
X = df['text']
y = df['subreddit']
# Split the data into the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    stratify=y,
                                                    random_state=33)

We are now ready to start our modeling, beginning with a Logistic Regression.

### Logistic Regression Model

For our Logistic Regression Model we will set up 2 pipelines and GridSearch with a five-fold cross-validation. The difference between our 2 pipelines is that one will use a CountVectorizer and prioritize whichever tokens appear the most often, while the other pipeline will use a TfidfVectorizer which prioritizes tokens that appear often in some documents but not in the whole corpus. Below are the hyperparameters that we will search over to optimize our model.

**Vectorizer Hyperparameters**
- max_features: Number of features (i.e. tokens) output by the vectorizer; we will test with 100 and 500 features
- stop_words: Whether or not to remove sklearn's English stop words; we will test both with and without them
- ngram_range: Whether to include only single-word tokens or also two-word tokens; we will test unigrams only, (1,1), and unigrams plus bigrams, (1,2)
- analyzer: Here we will include our custom analyzers from above so that we can test whether the default 'word' analyzer, a WordNetLemmatizer or a PorterStemmer works best. Note that scikit-learn only applies stop_words and ngram_range when the analyzer is not a callable, so those two settings only affect the 'word' option in this search

**Regression Hyperparameters**
- penalty: We will test both ridge (l2) and lasso (l1) regularization
- C: Inverse of the regularization strength; we will test the default (1), stronger regularization (0.1) and effectively no regularization (1e9)
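
To make the CountVectorizer vs TfidfVectorizer distinction concrete, here is a from-scratch sketch on a toy corpus. It assumes scikit-learn's default smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row; the corpus, the whitespace tokenizer and the helper names are made up for illustration, and this is not the library's implementation.

```python
import math
from collections import Counter

# Toy corpus for illustration only
docs = [
    'the table saw cut the board',
    'the bike trail was muddy',
    'the saw blade was dull',
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# CountVectorizer-style features: raw token counts per document
counts = [Counter(tokens) for tokens in tokenized]

# Document frequency: number of documents that contain each token
doc_freq = Counter(tok for tokens in tokenized for tok in set(tokens))

def idf(term):
    # Smoothed idf (assumed to match the TfidfVectorizer default)
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_row(count_row):
    # Weight each count by idf, then L2-normalize the row
    weights = {t: c * idf(t) for t, c in count_row.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

rows = [tfidf_row(c) for c in counts]
print(counts[0])
print({t: round(w, 3) for t, w in rows[0].items()})
```

In the first document 'the' has twice the raw count of 'table', but because 'the' appears in every document its idf is the minimum value of 1, so its relative weight drops once the Tfidf weighting is applied.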

#### Logistic Regression - CountVectorizer Transformation

In [37]:
# Set up pipeline
c_pipe = Pipeline([
    ('cvec',CountVectorizer()),
    ('lr',LogisticRegression(solver = 'liblinear'))
])

# Pipe parameters
c_pipe_params = {
    'cvec__max_features': [100, 500],
    'cvec__stop_words': [None,'english'],
    'cvec__ngram_range': [(1,1), (1,2)],
    'cvec__analyzer': ['word',porter_cvec_words,lemm_cvec_words],
    'lr__C': [0.1, 1, 1e9],
    'lr__penalty': ['l1','l2']
}

# Instantiate GridSearchCV
c_gs = GridSearchCV(c_pipe, 
                    c_pipe_params, 
                    cv=5,
                    n_jobs = 2) 

# Fit
c_gs.fit(X_train,y_train);

# Show metrics and best parameters
print(c_gs.best_params_)
class_metrics(c_gs,X_test,y_test)

{'cvec__analyzer': <function porter_cvec_words at 0x1a24321200>, 'cvec__max_features': 500, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': None, 'lr__C': 1, 'lr__penalty': 'l2'}
Accuracy: 0.921
Sensitivity: 0.919
Specificity: 0.922
Precision: 0.922


In [45]:
class_metrics(c_gs,X_train,y_train)

Accuracy: 0.938
Sensitivity: 0.931
Specificity: 0.946
Precision: 0.945


These seem to be very good results: our 4 classification metrics were all ~0.92, which is a strong score. Additionally, since in this case we don't have a true 'positive' class, we would like Sensitivity and Specificity to be similar so that we are classifying both subreddits with similar accuracy. The train and test metrics are at most 0.024 apart, so there does not appear to be significant overfitting.

This block of code took over 40 minutes to run, so going forward we will reuse some of the results from this GridSearch and assume that certain parameters will be the best for all the upcoming models. While this is not ideal, given computing power and time constraints we will go forward with the following hyperparameters fixed:
- Vectorizer analyzer: PorterStemmer
- Regression penalty: l2
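
The long run time is easier to understand once the size of the search is counted. The sketch below mirrors `c_pipe_params`, with the strings 'porter' and 'lemm' standing in for the two callable analyzers. The second count rests on an assumption taken from scikit-learn's documented behavior, namely that a callable analyzer bypasses `stop_words` and `ngram_range`; the helper name `effective_settings` is hypothetical.

```python
from itertools import product

# Mirror of c_pipe_params; 'porter' and 'lemm' are stand-ins for the
# callables porter_cvec_words and lemm_cvec_words
grid = {
    'cvec__max_features': [100, 500],
    'cvec__stop_words': [None, 'english'],
    'cvec__ngram_range': [(1, 1), (1, 2)],
    'cvec__analyzer': ['word', 'porter', 'lemm'],
    'lr__C': [0.1, 1, 1e9],
    'lr__penalty': ['l1', 'l2'],
}
cv_folds = 5

keys = list(grid)
combos = list(product(*grid.values()))
n_combos = len(combos)
n_fits = n_combos * cv_folds  # excludes the final refit on the full training set

def effective_settings(combo):
    params = dict(zip(keys, combo))
    if params['cvec__analyzer'] != 'word':
        # Assumption: a callable analyzer ignores these two settings
        params['cvec__stop_words'] = None
        params['cvec__ngram_range'] = (1, 1)
    return tuple(params.items())

n_distinct = len({effective_settings(c) for c in combos})
print(n_combos, n_fits, n_distinct)  # 144 720 72
```

Under that assumption only half of the 144 combinations are actually distinct, so restricting `stop_words` and `ngram_range` to the 'word' analyzer (for example with a list of two parameter grids) would roughly halve the search.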

**Best CountVectorizer Logistic Regression**

| Hyperparameter | Value |
|---|---|
| Stemming | Porter |
|Max Features|500|
|Ngram Range|(1,1)|
|Stop Words|None|
|C|1|
|Penalty|l2|

#### Logistic Regression - TfidfVectorizer Transformation

In [38]:
# Set up pipeline
lr_tf_pipe = Pipeline([
    ('tfidf',TfidfVectorizer(analyzer=porter_tfidf_words)),
    ('lr',LogisticRegression(solver = 'liblinear'))
])

# Pipe parameters
lr_tf_pipe_params = {
    'tfidf__max_features': [100, 500],
    'tfidf__stop_words': [None,'english'],
    'tfidf__ngram_range': [(1,1), (1,2)],
    'lr__C': [0.1, 1, 1e9]
}

# tfidf grid search
# Instantiate GridSearchCV.
lr_tf_gs = GridSearchCV(lr_tf_pipe, 
                    lr_tf_pipe_params, 
                    cv=5,
                    n_jobs=2) 

# Fit grid search
lr_tf_gs.fit(X_train,y_train);


# Show metrics and best parameters
print(lr_tf_gs.best_params_)
class_metrics(lr_tf_gs,X_test,y_test)

{'lr__C': 1, 'tfidf__max_features': 500, 'tfidf__ngram_range': (1, 1), 'tfidf__stop_words': None}
Accuracy: 0.919
Sensitivity: 0.917
Specificity: 0.92
Precision: 0.92


In [44]:
class_metrics(lr_tf_gs,X_train,y_train)

Accuracy: 0.935
Sensitivity: 0.926
Specificity: 0.944
Precision: 0.943


Our Logistic Regression using TFIDF vectorizing gave us very similar, if slightly worse, performance than our previous model using CountVectorizer. Our classification scores all round to 0.92 and are similar whether looking at overall accuracy or at accuracy for each class. Overfitting also does not appear to be an issue with this model. Additionally, it was interesting to see that the same hyperparameter values ended up being the best for both models.

**Best TfidfVectorizer Logistic Regression**

| Hyperparameter | Value |
|---|---|
| Stemming | Porter |
|Max Features|500|
|Ngram Range|(1,1)|
|Stop Words|None|
|C|1|
|Penalty|l2|

Overall it seems as though our Logistic Regressions are quite good, and if we had to choose between the 2 we would use the CountVectorizer transformation. Going forward, as we attempt different models, we will simplify our optimizing by fixing the following vectorizer hyperparameters:
- Analyzer: PorterStemmer
- Max Features: 500
- Ngram Range: (1,1)
- Stop Words: None

### X Transformation

Since we have decided to keep the same hyperparameters for our transformers we will perform the transformations now so that they do not need to happen as part of the grid search.

In [50]:
# CountVectorizer 
# Instantiate
cvec = CountVectorizer(analyzer=porter_cvec_words, max_features=500)
# Fit
cvec.fit(X_train)
# Transform
C_train = cvec.transform(X_train)
C_test = cvec.transform(X_test)

In [56]:
# Convert to dataframe
C_train = pd.DataFrame(C_train.toarray(),columns=cvec.get_feature_names())
C_test = pd.DataFrame(C_test.toarray(),columns=cvec.get_feature_names())

In [57]:
# TfidfVectorizer 
# Instantiate
tf = TfidfVectorizer(analyzer=porter_tfidf_words, max_features=500)
# Fit
tf.fit(X_train)
# Transform
Tf_train = tf.transform(X_train)
Tf_test = tf.transform(X_test)

In [58]:
# Convert to dataframe
Tf_train = pd.DataFrame(Tf_train.toarray(),columns=tf.get_feature_names())
Tf_test = pd.DataFrame(Tf_test.toarray(),columns=tf.get_feature_names())

### KNN Model

For our KNN Model we will set up 2 pipelines and GridSearch with a five-fold cross-validation just as we did for our logistic regression. We decided above to utilize standard hyperparameters for the vectorizers, so in this case we will only check our KNN hyperparameters.

Note that for our CountVectorizer GridSearch we will add a StandardScaler to put our features on the same scale. TfidfVectorizer already outputs values between 0 and 1 (each row is L2-normalized by default), so scaling is not necessary for that model.

**KNN Hyperparameter**
- n_neighbors: Number of neighbors that have a vote, we will test with 5, 15, 25
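
To make the role of the number of neighbors concrete, here is a minimal from-scratch sketch of the KNN decision rule: find the k closest training points by Euclidean distance and take a majority vote. The function name `knn_predict` and the 2-feature toy data are hypothetical; this is not scikit-learn's implementation, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Classify query by majority vote among its k nearest training points."""
    # Pair each training point's distance to the query with its label
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    # Count the labels of the k closest points
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy 2-feature data for illustration only
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]

pred_near_zero = knn_predict(train_X, train_y, (1, 1), k=3)
pred_near_one = knn_predict(train_X, train_y, (4, 4), k=5)
print(pred_near_zero, pred_near_one)
```

Because the vote depends entirely on distances, a feature measured on a larger scale would dominate the result, which is the reason for scaling the count features first.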

#### KNN Model - CountVectorizer Transformation

In [61]:
# Only running a GridSearch now; we no longer need a pipeline since we have
# already created our transformed X_train and X_test

# GridSearch parameters
knn_c_params = {
    'n_neighbors': [5,15,25]
}

# Instantiate GridSearchCV
knn_c_gs = GridSearchCV(KNeighborsClassifier(), 
                    knn_c_params, 
                    cv=5,
                    n_jobs = 2) 

# Scale Data
ss = StandardScaler()
C_train_sc = ss.fit_transform(C_train)
C_test_sc = ss.transform(C_test)

# Fit
knn_c_gs.fit(C_train_sc,y_train);

# Show metrics and best parameters
print(f'Best hyperparameter: {knn_c_gs.best_params_}\n')
print('Training Scores')
class_metrics(knn_c_gs,C_train_sc,y_train)
print('\nTest Scores')
class_metrics(knn_c_gs,C_test_sc,y_test)

Best hyperparameter: {'n_neighbors': 25}

Training Scores
Accuracy: 0.848
Sensitivity: 0.864
Specificity: 0.831
Precision: 0.837

Test Scores
Accuracy: 0.82
Sensitivity: 0.847
Specificity: 0.794
Precision: 0.804


Our initial KNN Model performed worse than our logistic regression, with test scores around 0.8 - 0.85 vs 0.92 for our logistic regression models. Additionally, our KNN model appears to be better at classifying the woodworking subreddit than the mtb subreddit, as our Sensitivity (i.e. success classifying 1's, or woodworking) is 0.05 higher than Specificity. While overfitting does not appear to be an issue here either, this model by itself is definitely not our best and may not be worth including in our Voting Classifier. Let's see if KNN performs better with a TFIDF Vectorizer.

#### KNN Model - TfidfVectorizer Transformation

In [63]:
# GridSearch parameters
knn_tf_params = {
    'n_neighbors': [5,15,25]
}

# Instantiate GridSearchCV
knn_tf_gs = GridSearchCV(KNeighborsClassifier(), 
                    knn_tf_params, 
                    cv=5,
                    n_jobs = 2) 

# Fit
knn_tf_gs.fit(Tf_train,y_train);

# Show metrics and best parameters
print(f'Best hyperparameter: {knn_tf_gs.best_params_}\n')
print('Training Scores')
class_metrics(knn_tf_gs,Tf_train,y_train)
print('\nTest Scores')
class_metrics(knn_tf_gs,Tf_test,y_test)

Best hyperparameter: {'n_neighbors': 5}

Training Scores
Accuracy: 0.835
Sensitivity: 0.857
Specificity: 0.813
Precision: 0.821

Test Scores
Accuracy: 0.737
Sensitivity: 0.787
Specificity: 0.686
Precision: 0.715


The KNN model performed even more poorly with the TFIDF Vectorizer, with our test scores in the 0.69 - 0.79 range and an even wider spread between sensitivity and specificity. Additionally, overfitting appears to be an issue with the KNN - Tfidf model, as test accuracy (0.737) is about 0.1 below training accuracy (0.835). Overall, KNN does not appear to be a good model for this dataset and problem. Interestingly, the best number of neighbors was 5 vs 25 for the previous KNN model.

Let's move on to Naive Bayes.

### Naive Bayes Model

For our Naive Bayes models we are going to keep the standard hyperparameters and use a MultinomialNB for our Cvec model and GaussianNB for our Tfidf model as those are the types that best fit the X data in the respective cases.
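
As a rough illustration of why MultinomialNB suits count features, the sketch below fits a multinomial Naive Bayes model from scratch with add-one (Laplace) smoothing. The function names, the 4-token vocabulary and the counts are all made up for illustration; this is a simplified sketch, not scikit-learn's implementation.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    """Return {class: (log_prior, per-feature log likelihoods)}."""
    n_features = len(X[0])
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        # Total count of each token across all documents of this class
        feature_totals = [sum(col) for col in zip(*rows)]
        total = sum(feature_totals)
        log_prior = math.log(len(rows) / len(X))
        log_likelihood = [
            math.log((count + alpha) / (total + alpha * n_features))
            for count in feature_totals
        ]
        model[c] = (log_prior, log_likelihood)
    return model

def predict_multinomial_nb(model, x):
    # Score each class: log prior plus count-weighted log likelihoods
    scores = {
        c: log_prior + sum(n * ll for n, ll in zip(x, log_likelihood))
        for c, (log_prior, log_likelihood) in model.items()
    }
    return max(scores, key=scores.get)

# Toy counts for the hypothetical tokens: saw, wood, trail, bike
X = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 1, 1, 2]]
y = [1, 1, 0, 0]  # 1 = woodworking, 0 = mtb, matching the notebook's encoding

model = fit_multinomial_nb(X, y)
pred_wood = predict_multinomial_nb(model, [1, 1, 0, 0])
pred_bike = predict_multinomial_nb(model, [0, 0, 1, 1])
print(pred_wood, pred_bike)
```

The likelihood terms are built directly from token counts, whereas GaussianNB instead models each feature as a continuous, normally distributed value.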

#### Multinomial Naive Bayes Model - CountVectorizer

In [64]:
# Cvec - Multinomial NB
# Instantiate
mnb = MultinomialNB()
# Fit
mnb.fit(C_train,y_train)
# Metrics
print('Training Scores')
class_metrics(mnb,C_train,y_train)
print('\nTest Scores')
class_metrics(mnb,C_test,y_test)

Training Scores
Accuracy: 0.92
Sensitivity: 0.919
Specificity: 0.92
Precision: 0.92

Test Scores
Accuracy: 0.914
Sensitivity: 0.925
Specificity: 0.903
Precision: 0.905


Our Cvec - MultinomialNB model is close to as good as our Logistic Regression. There are no signs of overfitting and our test scores are in the 0.90 - 0.93 range. One note of caution is that it is about 0.02 better at classifying the woodworking subreddit, whereas the logistic regression was equally good at both.

While not the best model so far, this is definitely worth including in our VotingClassifier later on.

#### Gaussian Naive Bayes Model - TfidfVectorizer

In [65]:
# Tfidf - Gaussian NB
# Instantiate
gnb = GaussianNB()
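# Fit on the Tfidf features so that training and scoring use the same representation
# (the scores recorded below were produced with a model fit on C_train)
gnb.fit(Tf_train,y_train)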
# Metrics
print('Training Scores')
class_metrics(gnb,Tf_train,y_train)
print('\nTest Scores')
class_metrics(gnb,Tf_test,y_test)

Training Scores
Accuracy: 0.746
Sensitivity: 1.0
Specificity: 0.493
Precision: 0.663

Test Scores
Accuracy: 0.737
Sensitivity: 0.999
Specificity: 0.475
Precision: 0.656


Very interestingly, our Gaussian NB model is missing essentially no woodworking posts, but it is classifying far too many posts as woodworking. We can see this in the test precision score of 0.66: only about 2/3 of the posts predicted as r/woodworking actually came from there. With balanced classes, a sensitivity of 0.999 and a specificity of 0.475 mean that roughly 3/4 of all posts are being predicted as r/woodworking.

These scores were produced by a model fit on the CountVectorizer features and evaluated on the Tfidf features, and that mismatch, rather than the Tfidf transformation itself, is a likely cause. Either way, this model as scored isn't of much use to us.
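
The relationship between the reported test sensitivity, specificity and precision can be checked with a little arithmetic. The sketch assumes balanced classes, as stated earlier in the notebook; the function name `rates_to_share_and_precision` is hypothetical.

```python
def rates_to_share_and_precision(sensitivity, specificity, positive_rate=0.5):
    """Return (share of posts predicted positive, precision)."""
    true_pos = sensitivity * positive_rate               # actual 1's predicted as 1
    false_pos = (1 - specificity) * (1 - positive_rate)  # actual 0's predicted as 1
    predicted_share = true_pos + false_pos
    return predicted_share, true_pos / predicted_share

# Test scores reported above for the Gaussian NB model
share, precision = rates_to_share_and_precision(0.999, 0.475)
print(round(share, 3), round(precision, 3))  # 0.762 0.656
```

The recovered precision matches the reported 0.656, and the share shows that about 76% of test posts are being labeled as r/woodworking even though only half of them are.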

Next we'll look at a Random Forest model.

### Random Forest Model

For our Random Forest Models we will once again GridSearch over a few hyperparameters. We will compare a random forest against a bagging model and optimize over a number of other hyperparameters.

**Random Forest Hyperparameters**
- n_estimators: Number of trees created, we will attempt 100 and 125
- max_depth: Maximum depth of each tree, we will test None (i.e. grow until all leaves are pure), 10, 25 and 50
- max_features: How many features are considered at each split. None means every split considers all the features, so the model is a bagged-trees model, while 'auto' uses sqrt(n_features) and is the standard for random forests
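
For a sense of scale, the sketch below counts the fits implied by this grid and works out how many of our 500 vectorizer features a tree would consider per split under the sqrt rule. It assumes the square root is truncated to an integer; the variable names are illustrative.

```python
import math
from itertools import product

# Mirror of the random forest grid described above
rf_grid = {
    'n_estimators': [100, 125],
    'max_depth': [None, 10, 25, 50],
    'max_features': [None, 'auto'],
}
cv_folds = 5

n_combos = len(list(product(*rf_grid.values())))
n_fits = n_combos * cv_folds  # excludes the final refit

# Features considered per split under the sqrt rule, for 500 features
n_features = 500
features_per_split = max(1, int(math.sqrt(n_features)))

print(n_combos, n_fits, features_per_split)  # 16 80 22
```

So the bagging setting evaluates all 500 features at every split, while the random forest setting evaluates only about 22, which both decorrelates the trees and makes each tree cheaper to build.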

#### Random Forest Model - CountVectorizer

In [70]:
# Adapted from GA DSI Lesson 6.03
rf = RandomForestClassifier(random_state=42)
rf_params = {
    'n_estimators': [100, 125],
    'max_depth': [None, 10, 25, 50],
    'max_features': [None, # bagging
                     'auto'] # random forest
}

rf_gs = GridSearchCV(rf, 
                  param_grid=rf_params,
                  cv=5,
                  n_jobs=2)

rf_gs.fit(C_train,y_train)

# Show metrics and best parameters
print(f'Best hyperparameter: {rf_gs.best_params_}\n')
print('Training Scores')
class_metrics(rf_gs,C_train,y_train)
print('\nTest Scores')
class_metrics(rf_gs,C_test,y_test)

Best hyperparameter: {'max_depth': None, 'max_features': 'auto', 'n_estimators': 125}

Training Scores
Accuracy: 0.987
Sensitivity: 0.995
Specificity: 0.978
Precision: 0.978

Test Scores
Accuracy: 0.917
Sensitivity: 0.932
Specificity: 0.903
Precision: 0.906


The Random Forest - CountVectorizer model performed quite well with our classification scores between 0.90 and 0.93, this is comparable to the Naive Bayes - CountVectorizer model and is in contention as the 3rd best model behind the 2 Logistic Regression Models. 

With near perfect training scores it does appear as though there is some overfitting; however, the resulting model is quite good and tree models are prone to overfitting. If this were the final model we might have more concern, but so far we have other models that have performed better, and this one is well worth including in our Voting Classifier.

Due to the computational requirements of a Random Forest, this cell took a very long time to run, so for our Tfidf random forest we will assume the max_depth and max_features from this GridSearch are also the best. Thus we will remove those from the GridSearch parameters, but add 150 to our n_estimators options to see if adding more estimators helps.

#### Random Forest Model - TfidfVectorizer

In [71]:
# Adapted from GA DSI Lesson 6.03
rf = RandomForestClassifier(random_state=42)
rf_tf_params = {
    'n_estimators': [100, 125, 150],
}

rf_tf_gs = GridSearchCV(rf, 
                  param_grid=rf_tf_params,
                  cv=5,
                  n_jobs=2)

rf_tf_gs.fit(Tf_train,y_train)

# Show metrics and best parameters
print(f'Best hyperparameter: {rf_tf_gs.best_params_}\n')
print('Training Scores')
class_metrics(rf_tf_gs,Tf_train,y_train)
print('\nTest Scores')
class_metrics(rf_tf_gs,Tf_test,y_test)

Best hyperparameter: {'n_estimators': 150}

Training Scores
Accuracy: 0.987
Sensitivity: 0.995
Specificity: 0.978
Precision: 0.978

Test Scores
Accuracy: 0.918
Sensitivity: 0.93
Specificity: 0.906
Precision: 0.909


The Random Forest - TfidfVectorizer model performed nearly identically to the previous random forest model. It seems as though different vectorization techniques do not have much of an effect for this model type given our data.

### VotingClassifier Model

For our voting classifier we need to include a few separate models and have them vote on each test observation to determine which subreddit a post came from. Let's first look at all of the models we created to decide which ones to include in the voting classifier. (*Note that we can only fit the voting classifier with 1 set of training data so we will organize on vectorizer first and then estimator*)

| **Vectorizer**    | **Estimator**             | **Test Accuracy** | **Test Sensitivity** | **Test Specificity** |
|-------------------|---------------------------|-------------------|----------------------|----------------------|
| *CountVectorizer* | *Logistic Regression*     | 0.921             | 0.919                | 0.922                |
| *CountVectorizer* | *KNN*                     | 0.820             | 0.847                | 0.794                |
| *CountVectorizer* | *Multinomial Naive Bayes* | 0.914             | 0.925                | 0.903                |
| *CountVectorizer* | *Random Forest*           | 0.917             | 0.932                | 0.903                |
| *TfidfVectorizer* | *Logistic Regression*     | 0.919             | 0.917                | 0.920                |
| *TfidfVectorizer* | *KNN*                     | 0.737             | 0.787                | 0.686                |
| *TfidfVectorizer* | *Gaussian Naive Bayes*    | 0.737             | 0.999                | 0.475                |
| *TfidfVectorizer* | *Random Forest*           | 0.918             | 0.930                | 0.906                |

Of the 8 models we have a clear split between the 5 that performed best and the 3 that did not perform well.

**Good Models**
- Cvec + Logistic Regression
- Cvec + Multinomial Naive Bayes
- Cvec + Random Forest
- Tfidf + Logistic Regression
- Tfidf + Random Forest

**Not Great Models**
- Cvec + KNN
- Tfidf + KNN
- Tfidf + Gaussian Naive Bayes

All of our 'good' models had scores above 0.9 for the 3 metrics we care most about:
- Accuracy: Total model accuracy
- Sensitivity: Accuracy for r/woodworking
- Specificity: Accuracy for r/mtb

We saw the best and most stable performance with both of our Logistic Regression models, and Random Forest also generally performed well, though it was better at classifying r/woodworking than r/mtb. Since we have 3 good models with CountVectorizer (and even within KNN the CountVectorizer model performed better), we will use those 3 for our VotingClassifier, using the hyperparameters we GridSearched for earlier and equal weights.
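
With equal weights and scikit-learn's default `voting='hard'`, each model casts one vote per post and the majority label wins. A minimal sketch of that rule follows; the function name `hard_vote` and the toy predictions are hypothetical, not output from our models.

```python
from collections import Counter

def hard_vote(predictions_per_model):
    """Return the majority label per observation across equally weighted models."""
    n_obs = len(predictions_per_model[0])
    final = []
    for i in range(n_obs):
        # Collect one vote from each model for observation i
        votes = Counter(preds[i] for preds in predictions_per_model)
        final.append(votes.most_common(1)[0][0])
    return final

# Toy predictions for 4 posts from three models (illustration only)
lr_preds = [1, 0, 1, 0]
mnb_preds = [1, 1, 0, 0]
rf_preds = [1, 0, 0, 1]

voted = hard_vote([lr_preds, mnb_preds, rf_preds])
print(voted)  # [1, 0, 0, 0]
```

With three voters and two classes there can never be a tie, and a single model's mistake is overruled whenever the other two agree.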

In [79]:
# Instantiate Voting Classifier
vote = VotingClassifier([
            ('lr',LogisticRegression(solver='liblinear')),
            ('mnb',MultinomialNB()),
            ('rf',RandomForestClassifier(n_estimators=125, random_state=42)) 
])
# Fit 
vote.fit(C_train,y_train)

# metrics
print('Training Scores')
class_metrics(vote,C_train,y_train)
print('\nTest Scores')
class_metrics(vote,C_test,y_test)

Training Scores
Accuracy: 0.951
Sensitivity: 0.946
Specificity: 0.957
Precision: 0.956

Test Scores
Accuracy: 0.923
Sensitivity: 0.922
Specificity: 0.923
Precision: 0.923


The Voting Classifier performed very well, scoring slightly higher than our previous best model (Logistic Regression + CountVectorizer) on the test data. It is also very stable, predicting both classes with essentially the same accuracy, ~0.92.

While not as interpretable as a stand-alone Logistic Regression, this will be our final model and the one we use to test more similar datasets (i.e. r/mtb and r/bicycling). We will use this model as it scored the best and most consistently of any of the models and intuitively it should be more robust to fitting on different data as the 3 model types within the voting classifier can cover each other's flaws to an extent. 

As for our first problem, creating a model to classify a text string as coming from 1 of 2 subreddits with >85% accuracy, these results indicate that we have achieved that goal. Now we will move on to the next notebook (similar_classes.ipynb) to address our second question.