# Similar Classes

Notebook for training and testing our final VotingClassifier model on two similar subreddits (r/mtb and r/bicycling) to see whether we can maintain an accuracy above 0.80.

> **Data Science Problems**<br> 
1) Given the text contained within the title and original post from r/woodworking and r/mtb, can we predict which subreddit the post came from with >85% accuracy?<br> 
2) *Further, using the same model and hyperparameters, can we achieve >80% accuracy using the two similar subreddits r/mtb and r/bicycling?*

## Contents

- [Imports & Functions](#Imports-&-Functions)
- [Importing Data & Cleaning](#Importing-Data-&-Cleaning)
- [VotingClassifier Model](#VotingClassifier-Model)

### Imports & Functions

In [1]:
# Key Imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# General Modeling Imports 
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix

# NLP Imports
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import VotingClassifier, RandomForestClassifier

In [2]:
# Function to calculate and display classification metrics; works for a binary (0/1) y
def class_metrics(model, X, y):
    # Generate predictions
    preds = model.predict(X)
    # Get confusion matrix and unravel
    tn, fp, fn, tp = confusion_matrix(y,preds).ravel()
    # Accuracy
    print(f'Accuracy: {round((tp+tn)/len(y),3)}')
    # Sensitivity
    print(f'Sensitivity: {round(tp/(tp+fn),3)}')
    # Specificity
    print(f'Specificity: {round(tn/(tn+fp),3)}')
    # Precision
    print(f'Precision: {round(tp/(tp+fp),3)}')
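
To make the four metrics concrete, here is a small dependency-free sketch that computes the same quantities by hand. The labels are made up for illustration and are not taken from the subreddit data; the cell counts follow the same `tn, fp, fn, tp` convention used in `class_metrics`.

```python
# Toy labels, made up for illustration: 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# Count the four confusion-matrix cells by hand
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

# Same formulas as class_metrics
metrics = {
    'accuracy': (tp + tn) / len(y_true),
    'sensitivity': tp / (tp + fn),
    'specificity': tn / (tn + fp),
    'precision': tp / (tp + fp),
}
for name, value in metrics.items():
    print(f'{name}: {round(value, 3)}')
```

Sensitivity is the accuracy on the class labelled 1 and specificity is the accuracy on the class labelled 0, which is how the two per-subreddit scores are read later in this notebook.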

In [3]:
# Analyzers so that we can stem in our pipelines
# Thanks joeln
# https://stackoverflow.com/questions/36182502/add-stemming-support-to-countvectorizer-sklearn/36191362

# PorterStemmer - CVEC
stemmer = PorterStemmer()
cvec_analyzer = CountVectorizer().build_analyzer()

def porter_cvec_words(doc):
    return (stemmer.stem(w) for w in cvec_analyzer(doc))
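
As a rough illustration of what a custom analyzer does, the sketch below mimics the same shape without scikit-learn or NLTK: lowercase the document, tokenize it, then stem each token lazily. The regex is CountVectorizer's default token pattern; `toy_stem` and `toy_analyzer` are hypothetical names, and `toy_stem` is a deliberately crude suffix stripper standing in for `PorterStemmer.stem`, not the Porter algorithm.

```python
import re

# CountVectorizer's default token pattern: runs of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def toy_stem(word):
    # Hypothetical stand-in for PorterStemmer.stem: strip one common suffix
    for suffix in ('ing', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def toy_analyzer(doc):
    # Lowercase, tokenize, then stem each token, mirroring porter_cvec_words
    return (toy_stem(w) for w in TOKEN_PATTERN.findall(doc.lower()))

tokens = list(toy_analyzer('Riding new trails on my bikes'))
print(tokens)
```

The real Porter stemmer applies many more rules, so its stems will differ from this toy version. What matters here is that the vectorizer receives one normalized token stream per document.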

### Importing Data & Cleaning

We will import the previously created csv that contains the r/mtb and r/bicycling posts, quickly check for missing values, then train-test split and CountVectorize our data.

In [4]:
# Read in data
df = pd.read_csv('../data/similar_subreddits.csv')
df.head()

| Unnamed: 0 | title | selftext | subreddit | text |
|---|---|---|---|---|
| 0 | Anyone done the Mt. Washington Century in New ... | | 1 | Anyone done the Mt. Washington Century in New ... |
| 1 | Built Up A 90s Cannondale Super V with SRAM NX... | | 1 | Built Up A 90s Cannondale Super V with SRAM NX... |
| 2 | [NBD] New bike for collegiate road racing! | | 1 | [NBD] New bike for collegiate road racing! |
| 3 | Best hybrid commuter bike for under £1000? | Recently had my Cube SL stolen and looking to ... | 1 | Best hybrid commuter bike for under £1000? Rec... |
| 4 | NBD Post | My first new(ish) bike as a college student! I... | 1 | NBD Post My first new(ish) bike as a college s... |


In [5]:
# Missing values
df.isnull().sum()

title            0
selftext     11269
subreddit        0
text             4
dtype: int64

It looks as though we have the same issues as in our initial dataset: many posts have no selftext, and a few rows are missing the combined text. Let's rebuild the text column and create our X and y variables.

In [7]:
# Fill NaNs with '' so that we can concatenate the strings
# (assign back rather than using inplace=True on a column selection)
df['selftext'] = df['selftext'].fillna('')
df['text'] = df['title'] + ' ' + df['selftext']
df.isnull().sum()

title        0
selftext     0
subreddit    0
text         0
dtype: int64

In [8]:
# Set up our X and y variables
X = df['text']
y = df['subreddit']
# Split the data into the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    stratify=y,
                                                    random_state=33)

Now we will transform our training and test data using CountVectorizer with the PorterStemmer analyzer and 500 max features, duplicating the procedure used for our final model. The vectorizer is fit on the training data only and then applied to both sets.

In [9]:
# CountVectorizer & PorterStemmer transformation 
# Instantiate
cvec = CountVectorizer(analyzer=porter_cvec_words, max_features=500)
# Fit
cvec.fit(X_train)
# Transform
X_train = cvec.transform(X_train)
X_test = cvec.transform(X_test)
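
The fit-then-transform pattern above can be sketched without scikit-learn. This is a simplified illustration, assuming whitespace tokenization and made-up documents: the vocabulary is built from the training documents only, capped at the most frequent terms (the role `max_features` plays), and words unseen at fit time are dropped when transforming test documents. `build_vocabulary` and `to_counts` are hypothetical helper names.

```python
from collections import Counter

# Made-up documents for illustration
train_docs = ['new bike new trail', 'bike ride on the trail', 'bike shop']
test_docs = ['new helmet for the bike']

def build_vocabulary(docs, max_features):
    # Count every whitespace token across the training corpus
    counts = Counter(token for doc in docs for token in doc.split())
    # Keep the max_features most frequent terms, stored in sorted order
    top_terms = sorted(term for term, _ in counts.most_common(max_features))
    return {term: idx for idx, term in enumerate(top_terms)}

def to_counts(doc, vocabulary):
    # One count per vocabulary term; tokens unseen at fit time are ignored
    row = [0] * len(vocabulary)
    for token in doc.split():
        if token in vocabulary:
            row[vocabulary[token]] += 1
    return row

vocabulary = build_vocabulary(train_docs, max_features=3)
test_rows = [to_counts(doc, vocabulary) for doc in test_docs]
print(vocabulary)
print(test_rows)
```

Fitting on the training split alone keeps information from the test posts out of the feature set, so the test scores stay an honest estimate.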

In [10]:
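# Convert to dataframe
# (get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2)
feature_names = cvec.get_feature_names_out()
X_train = pd.DataFrame(X_train.toarray(), columns=feature_names)
X_test = pd.DataFrame(X_test.toarray(), columns=feature_names)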

### VotingClassifier Model

We will now instantiate, fit and test our final model from the initial analysis. Note that our analysis yielded the following model and transformation hyperparameters:

**Transformation Hyperparameters**
- Include stop words
- PorterStemmer to stem words
- Unigrams only (single-word n-grams)
- 500 features

**Model Hyperparameters**
- Sklearn standard LogisticRegression with C=1 and l2 / ridge penalty and liblinear solver
- Standard Multinomial Naive Bayes model
- Random Forest model with 125 trees
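
Because no `voting` argument is passed, the VotingClassifier in the next cell uses scikit-learn's default of hard voting: each of the three models predicts a label and the majority wins. A minimal sketch of that rule, using made-up predictions and a hypothetical `hard_vote` helper:

```python
from collections import Counter

def hard_vote(predictions_per_model):
    # predictions_per_model: one list of labels per model, all the same length
    votes_per_sample = zip(*predictions_per_model)
    # For each sample, keep the label predicted by the most models
    return [Counter(votes).most_common(1)[0][0] for votes in votes_per_sample]

# Made-up predictions from three models on four posts
lr_preds = [1, 0, 1, 0]
mnb_preds = [1, 1, 0, 0]
rf_preds = [0, 1, 1, 0]

ensemble = hard_vote([lr_preds, mnb_preds, rf_preds])
print(ensemble)
```

With three voters and two classes there is always a strict majority, so no tie-breaking rule is needed in this setting.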

In [12]:
# Instantiate Voting Classifier
vote = VotingClassifier([
            ('lr',LogisticRegression(solver='liblinear')),
            ('mnb',MultinomialNB()),
            ('rf',RandomForestClassifier(n_estimators=125, random_state=42)) 
])
# Fit 
vote.fit(X_train,y_train)

# metrics
print('Training Scores')
class_metrics(vote,X_train,y_train)
print('\nTest Scores')
class_metrics(vote,X_test,y_test)

Training Scores
Accuracy: 0.834
Sensitivity: 0.869
Specificity: 0.798
Precision: 0.812

Test Scores
Accuracy: 0.763
Sensitivity: 0.797
Specificity: 0.728
Precision: 0.746


Our model performs much worse on similar subreddits. Overall test accuracy is 0.763, with r/bicycling accuracy (sensitivity) of 0.797 and r/mtb accuracy (specificity) of 0.728. These scores are well below the ~0.92 we achieved with the more distinct subreddits. The gap of roughly 0.07 between sensitivity and specificity also means the model is noticeably better at recognizing one subreddit than the other, which is not ideal for the problem at hand. Additionally, we see signs of overfitting, with a ~0.07 difference between train and test scores for those 3 metrics.
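
The differences quoted above can be checked directly from the printed scores. The sketch below only copies the reported numbers into dictionaries and subtracts them; nothing is recomputed from the data.

```python
# Scores copied from the output of the VotingClassifier cell
train = {'accuracy': 0.834, 'sensitivity': 0.869,
         'specificity': 0.798, 'precision': 0.812}
test = {'accuracy': 0.763, 'sensitivity': 0.797,
        'specificity': 0.728, 'precision': 0.746}

# Train-minus-test gap per metric (a rough overfitting signal)
gaps = {name: round(train[name] - test[name], 3) for name in train}

# Spread between the two per-subreddit accuracies on the test set
class_spread = round(test['sensitivity'] - test['specificity'], 3)

for name, gap in gaps.items():
    print(f'{name} gap: {gap}')
print(f'sensitivity - specificity (test): {class_spread}')
```

Accuracy, sensitivity and specificity each drop by about 0.07 from train to test, and the test-set spread between the two subreddits is also about 0.07.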

Overall, we have not been able to affirmatively answer our second problem: 0.763 falls short of the 0.80 target. This was expected, since our model would naturally have a more difficult time with data from 2 similar subreddits. Classifying 2 similar sets of data will generally be more difficult than classifying 2 very different sets, and in that light our accuracy score of 0.763 is pretty good.

In the future, some next steps would be to gather more posts and to first train the model on a couple of pairs of similar subreddits, optimizing for the hardest challenges before generalizing to easier ones.

Additionally, it would be interesting to build a classifier that can handle a greater number of subreddits and see how our performance varies in that case.