# Using Natural Language Processing to Predict Subreddit Submissions

## Problem Statement

Reddit is one of the most popular sites on the internet with over 11 million daily active users. The variety and breadth of subreddits allows ample oppurtinites to explore data collection and machine learning techniques. Here we will collect the submissions from three subreddits and build a variety of classifier models in an effort to predict from which subreddit a post is from.

### Subreddits

In order to build a rigorous model subreddits with overlapping themes were chosen.
- r/whiskey
- r/scotch
- r/bourbon

1. Not all whiskies are Bourbon.

2. Not all whiskies are Scotch.

3. Both Scotch and Bourbon are whiskies.

The following models will be compared to see if any can beat a baseline calculation of prediction.
- Random Forest Classifier
- Multinomial Naive Bayes
- AdaBoost Classifier
- Gradient BoostingClassifier

## Executive Summary

### Contents:
- [Imports and Options](#Imports-and-Options)
- [Data](#Data)
    * [Data Collection](#Data-Collection)
    * [Data Analysis](#Data-Analysis)
    * [Data Cleaning](#Data-Cleaning)
    * [Data Dictionary](#Data-Dictionary)
- [Modeling](#Modeling)
    * [Model Preperation](#Model-Preperation)
    * [Model Proceessing](#Model-Processing)
- [Results](#Results)
    * [Conclusions](#Conclusions) 
    * [Recommendations and Next Steps](#Recommendations-and-Next-Steps)

## Imports and Options

In [64]:
# Imports
import pandas as pd
import numpy as np
import matplotlib as plt
import time

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction import stop_words
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Options
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_colwidth', 100)

## Data

### Data Collection

Data was collected from reddit.com using the pushshift api.

https://github.com/pushshift/api

Data was collected using the reddit-subbmissions-pull.py file in the [code](./code) folder of this repository. For the purposes of this exercise five years of data was collected from 2014 to 2019.

Each subreddit's data is stored as a separate csv in the datasets folder

### Data Analysis

In [None]:
# Importing all three csv's as dataframes
df_whiskey = pd.read_csv('./datasets/whiskey_submissions.csv')
df_scotch  = pd.read_csv('./datasets/scotch_submissions.csv')
df_bourbon = pd.read_csv('./datasets/bourbon_submissions.csv')

print(df_whiskey.shape)
print(df_scotch.shape)
print(df_bourbon.shape)

In [71]:
df_bourbon.shape

(19380, 101)

In [None]:
# Merging all dataframes
df = pd.concat([df_whiskey, df_scotch, df_bourbon], ignore_index=True, sort=True)
df.reset_index().head(3)

### Data Cleaning

The pushshift api returns a large variety of meta data from each submission. For the purposes of comparing natural language processing  between classification models the following features were extracted.

- title
- selftext
- score
- num_comments
- subreddit

Even though score and number of comments are numerical sets of data they are an intrensical aspect of how a subreddit may operate. How active a subreddit is via it's karma score and general discussion could be used as a way to infer the the origin of a post.

#### Reasons for lack of stemming and lemmatization.
Whiskeys are primarily known by their type and distillery. Given the wide range of names for Scottish distilleries there was concern that applying a stemming or lemmatization transformation could have a negative effect on model performance.

Examples:
 - Glentauchers, Mulben
 - Laphroaig, Port Ellena
 - Benrinnes, Banffshire
 
Additionally both vectorizers will have a the n-gram range set to (1,2) in the hyperparameter search to account for distilleries with multiples words for their names or locations.

In [None]:
df = df[['title', 'selftext', 'score', 'num_comments', 'subreddit']]
df.head(10)

In [None]:
df.isnull().sum()

In [None]:
# Since reddit doesn't require a poster describe their post there are many nulls for selftext
# We will replace all nulls with a placeholder string that will be passed as a stopword
df.fillna('stopwordplaceholder', inplace=True)

# Creating a list to pass into stop_words parameter for Transformers
stopwordplaceholder = ['stopwordplaceholder']

df.head(5)

### Data Dictionary

| Variable           | Variable Name | Data Type                  | Description                                                                                                                              |
|--------------------|---------------|----------------------------|------------------------------------------------------------------------------------------------------------------------------------------|
| Title              | title         | str                        | Tite of the post. This is what will appear on the subreddit and is mandatory                                                             |
| Post Description   | selftext      | str                        | A user can provide a description for the post that elaborates on the point or is to begin a discussion. This is not required for a post. |
| Number of Comments | num_comments  | int                        | Number of comments in the post thread. Submission does not count as comment.                                                             |
| Score              | score         | int (positive or negative) | Reddit users and upvote or downvote a post based on its relevance or general enjoyment.                                                  |
| Subreddit          | subreddit     | str                        | The subreddit the submission is from.                                                                                                    |

## Modeling

### Model Preperation

Since models may require different transformation to the days we will train/test/split first.

Due to proccessing limitations inital analysis will be performed only on the 'title' aspect of the submission.

In [None]:
# Train/Test/Split
X = df['title']
y = df['subreddit']


X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [61]:
X_train.shape

(45782,)

In [63]:
X_test.shape

(15261,)

### Model Processing

Each of the models listed will be run with similar hyperparameters where applicable.

Additionally each model will be transformed by both a CountVectorizer and TFIDFVectorizer.


- [Random Forest Classifier](#Random-Forest-Classifier)
- [Multinomial Naive Bayes](#Multinomial-Naive-Bayes)
- [AdaBoost Classifier](#AdaBoost-Classifier)
- [Gradient Boost Classifier](#Gradient-Boost-Classifier)

In [None]:
# Creating a CrossVectorizer Pipe Parameter Dictionary
cvec_pipe_params = {
    'cvec__max_features'  : [2000, 3000, 5000],
    'cvec__min_df'        : [2, 3],
    'cvec__max_df'        : [.9],
    'cvec__ngram_range'   : [(1,1), (1,2)],
    'cvec__stop_words'    : [stopwordplaceholder],
    'cvec__strip_accents' : ['unicode']
}

In [None]:
# Creating a TFIDFVectorizer Pipe Parameter Dictionary
tfidf_pipe_params = {
    'tfidf__max_features' : [2000, 3000, 5000],
    'tfidf__min_df'       : [2, 3],
    'tfidf__max_df'       : [.9],
    'tfidf__ngram_range'  : [(1,1), (1,2)],
    'tfidf__stop_words'   : [stopwordplaceholder],
    'tfidf__strip_accents': ['unicode']
}

#### Random Forest Classifier

In [None]:
# Creating RandomForest Parameter Dictionary
rf_pipe_params = {
    'rf__n_estimators' : [50, 100, 150],
    'rf__max_depth'    : [None, 5],
    'rf__n_jobs'       : [6]
}

##### CountVectorizer

In [None]:
# Creating Pipline
rf_cvec_pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('rf', RandomForestClassifier())
])

# Creating GridSearch

# Temporary parameters dict
temp_params_dict = {**cvec_pipe_params, **rf_pipe_params}

gs_rf_cvec = GridSearchCV(rf_cvec_pipe,
                 temp_params_dict,
                 cv=5)

# Fitting Gridsearch Data
# Setting timer
t0 = time.time()

gs_rf_cvec.fit(X_train, y_train)

rf_cvec_time = time.time() - t0

# Calling Scores and Best Parameters
best_rf_cvec_train_score =  gs_rf_cvec.score(X_train, y_train)
best_rf_cvec_test_score  =  gs_rf_cvec.score(X_test, y_test)
best_rf_cvec_params      =  gs_rf_cvec.best_params_
print(rf_cvec_time)

#### TFIDFVectorizer

In [None]:
# Creating Pipline
rf_tfidf_pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('rf', RandomForestClassifier())
])

# Creating GridSearch

# Temporary parameters dict
temp_params_dict = {**tfidf_pipe_params, **rf_pipe_params}

gs_rf_tfidf = GridSearchCV(rf_tfidf_pipe,
                 temp_params_dict,
                 cv=5)

# Fitting Gridsearch Data
# Setting timer
t0 = time.time()

gs_rf_tfidf.fit(X_train, y_train)

rf_tfidf_time = time.time() - t0

# Calling Scores and Best Parameters
best_rf_tfidf_train_score =  gs_rf_tfidf.score(X_train, y_train)
best_rf_tfidf_test_score  =  gs_rf_tfidf.score(X_test, y_test)
best_rf_tfidf_params      =  gs_rf_tfidf.best_params_
print(rf_tfidf_time)

### Multinomial Naive Bayes

In [None]:
# Creating MultinomialNB Pipeline Params
mnb_pipe_params = {
    'mnb__alpha' : [.01, .1, 1]
}

#### CountVectorizer

In [None]:
# Creating Pipline
mnb_cvec_pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('mnb', MultinomialNB())
])

# Creating GridSearch

# Temporary parameters dict
temp_params_dict = {**cvec_pipe_params, **mnb_pipe_params}

gs_mnb_cvec = GridSearchCV(mnb_cvec_pipe,
                 temp_params_dict,
                 cv=5)

# Fitting Gridsearch Data
# Setting timer
t0 = time.time()

gs_mnb_cvec.fit(X_train, y_train)

mnb_cvec_time = time.time() - t0

# Calling Scores and Best Parameters
best_mnb_cvec_train_score =  gs_mnb_cvec.score(X_train, y_train)
best_mnb_cvec_test_score  =  gs_mnb_cvec.score(X_test, y_test)
best_mnb_cvec_params      =  gs_mnb_cvec.best_params_
print(mnb_cvec_time)

#### TFIDFVectorizer

In [None]:
# Creating Pipline
mnb_tfidf_pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('mnb', MultinomialNB())
])

# Creating GridSearch

# Temporary parameters dict
temp_params_dict = {**tfidf_pipe_params, **mnb_pipe_params}

gs_mnb_tfidf = GridSearchCV(mnb_tfidf_pipe,
                 temp_params_dict,
                 cv=5)

# Setting timer
t0 = time.time()

# Fitting Gridsearch Data
gs_mnb_tfidf.fit(X_train, y_train)

# Storing timer
mnb_tfidf_time = time.time() - t0

# Calling Scores and Best Parameters
best_mnb_tfidf_train_score =  gs_mnb_tfidf.score(X_train, y_train)
best_mnb_tfidf_test_score  =  gs_mnb_tfidf.score(X_test, y_test)
best_mnb_tfidf_params      =  gs_mnb_tfidf.best_params_
print(mnb_tfidf_time)

### AdaBoost Classifier

In [None]:
# Setting AdaBoostClassifier Pipeline Parameters
ada_pipe_params = {
    'ada__n_estimators'              : [50, 100, 150],
    'ada__base_estimator__max_depth' : [1, 2],
    'ada__learning_rate'             : [.9, 1]
}

#### CountVectorizer

In [None]:
# Creating Pipeline
ada_cvec_pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('ada', AdaBoostClassifier(base_estimator=DecisionTreeClassifier()))
])

# Creating Gridsearch paramaters dictionary
temp_params_dict = {**cvec_pipe_params, **ada_pipe_params}

# Creating GridsearchCV
gs_ada_cvec = GridSearchCV(ada_cvec_pipe,
                           temp_params_dict,
                           cv=5)

# Setting Timer
t0 = time.time()

# Fitting Gridsearch
gs_ada_cvec.fit(X_train, y_train)

# Storimg Timer
ada_cvec_time = time.time() - t0

# Calling scores and best parameters
best_ada_cvec_train_score = gs_ada_cvec.score(X_train, y_train)
best_ada_cvec_test_score  = gs_ada_cvec.score(X_test, y_test)
best_ada_cvec_params      = gs_ada_cvec.best_params_

#### TFIDFVectorizer

In [None]:
# Creating Pipeline
ada_tfidf_pipe = Pipeline([
    ('tfidf', CountVectorizer()),
    ('ada', AdaBoostClassifier(base_estimator=DecisionTreeClassifier()))
])

# Creating Gridsearch paramaters dictionary
temp_params_dict = {**tfidf_pipe_params, **ada_pipe_params}

# Creating GridsearchCV
gs_ada_tfidf = GridSearchCV(ada_tfidf_pipe,
                           temp_params_dict,
                           cv=5)

# Setting Timer
t0 = time.time()

# Fitting Gridsearch
gs_ada_tfidf.fit(X_train, y_train)

# Storimg Timer
ada_tfidf_time = time.time() - t0

# Calling scores and best parameters
best_ada_tfidf_train_score = gs_ada_tfidf.score(X_train, y_train)
best_ada_tfidf_test_score  = gs_ada_tfidf.score(X_test, y_test)
best_ada_tfidf_params      = gs_ada_tfidf.best_params_

### Gradient Boost Classifier

In [None]:
# Setting GradienBoostClassifier pipeline parameters
gb_pipe_params = {
    'gb__n_estimators'  : [50, 100, 150],
    'gb__max_depth'      : [1, 2],
    'gb__learning_rate' : [.9, 1]
}

#### CountVectorizer

In [None]:
# Creating pipeline
gb_cvec_pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('gb', GradientBoostingClassifier())
])

# Creating gridsearch parameters dictionary
temp_params_dict = {**cvec_pipe_params, **gb_pipe_params}

# Creating GridsearchCV
gs_gb_cvec = GridSearchCV(gb_cvec_pipe,
                          temp_params_dict,
                          cv=5)

# Setting Timer
t0 = time.time()

# Fitting Gridsearch
gs_gb_cvec.fit(X_train, y_train)

# Storing Timer
gb_cvec_time = time.time() - t0

# Calling scores and best parameters
best_gb_cvec_train_score = gs_gb_cvec.score(X_train, y_train)
best_gb_cvec_test_score  = gs_gb_cvec.score(X_test, y_test)
best_gb_cvec_params      = gs_gb_cvec.best_params_

#### TFIDFVectorizer

In [None]:
# Creating pipeline
gb_tfidf_pipe = Pipeline([
    ('tfidf', CountVectorizer()),
    ('gb', GradientBoostingClassifier())
])

# Creating gridsearch parameters dictionary
temp_params_dict = {**tfidf_pipe_params, **gb_pipe_params}

# Creating GridsearchCV
gs_gb_tfidf = GridSearchCV(gb_tfidf_pipe,
                          temp_params_dict,
                          cv=5)

# Setting Timer
t0 = time.time()

# Fitting Gridsearch
gs_gb_tfidf.fit(X_train, y_train)

# Storing Timer
gb_tfidf_time = int(round(time.time() - t0))

# Calling scores and best parameters
best_gb_tfidf_train_score = gs_gb_tfidf.score(X_train, y_train)
best_gb_tfidf_test_score  = gs_gb_tfidf.score(X_test, y_test)
best_gb_tfidf_params      = gs_gb_tfidf.best_params_

## Results

In [56]:
results_data= {'Model'           : [
                                    'Random Forest', 
                                    'Random Forest',
                                    'Multinomial Naive Bayes',
                                    'Multinomial Naive Bayes',
                                    'Ada Boost Classifier',
                                    'Ada Boost Classifier',
                                    'Gradient Boost Classifier',
                                    'Gradient Boost Classifier'
                                    ], 
               'Transformer'     : [
                                    'CountVectorizer',
                                    'TFIDFVectorizer',
                                    'CountVectorizer',
                                    'TFIDFVectorizer',
                                    'CountVectorizer',
                                    'TFIDFVectorizer',
                                    'CountVectorizer',
                                    'TFIDFVectorizer'
                                   ],
               'Training Score'  : [
                                    best_rf_cvec_train_score,
                                    best_rf_tfidf_train_score,
                                    best_mnb_cvec_train_score,
                                    best_mnb_tfidf_train_score,
                                    best_ada_cvec_train_score,
                                    best_ada_tfidf_train_score,
                                    best_gb_cvec_train_score,
                                    best_gb_tfidf_train_score
                                   ],
               'Testing Score'   : [
                                    best_rf_cvec_test_score,
                                    best_rf_tfidf_test_score,
                                    best_mnb_cvec_test_score,
                                    best_mnb_tfidf_test_score,
                                    best_ada_cvec_test_score,
                                    best_ada_tfidf_test_score,
                                    best_gb_cvec_test_score,
                                    best_gb_tfidf_test_score
                                   ],
               'Best Parameters' : [
                                    best_rf_cvec_params,
                                    best_rf_tfidf_params,
                                    best_mnb_cvec_params,
                                    best_mnb_tfidf_params,
                                    best_ada_cvec_params,
                                    best_ada_tfidf_params,
                                    best_gb_cvec_params,
                                    best_gb_tfidf_params
                                   ],
               'Run Time (min)'        : [
                                    int(round(rf_cvec_time   / 60)),
                                    int(round(rf_tfidf_time  / 60)),
                                    int(round(mnb_cvec_time  / 60)),
                                    int(round(mnb_tfidf_time / 60)),
                                    int(round(ada_cvec_time  / 60)),
                                    int(round(ada_tfidf_time / 60)),
                                    int(round(gb_cvec_time   / 60)),
                                    int(round(gb_tfidf_time  / 60))
                                   ]            
              }

results = pd.DataFrame(data=results_data)

# Exporting results table to CSV to build visualizations
results.to_csv('results.csv')

results

Unnamed: 0,Model,Transformer,Training Score,Testing Score,Best Parameters,Run Time (min)
0,Random Forest,CountVectorizer,0.985955,0.753489,"{'cvec__max_df': 0.9, 'cvec__max_features': 5000, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 1)...",48
1,Random Forest,TFIDFVectorizer,0.98589,0.757683,"{'rf__max_depth': None, 'rf__n_estimators': 100, 'rf__n_jobs': 6, 'tfidf__max_df': 0.9, 'tfidf__...",45
2,Multinomial Naive Bayes,CountVectorizer,0.761107,0.747395,"{'cvec__max_df': 0.9, 'cvec__max_features': 5000, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 2)...",3
3,Multinomial Naive Bayes,TFIDFVectorizer,0.761041,0.743464,"{'mnb__alpha': 0.01, 'tfidf__max_df': 0.9, 'tfidf__max_features': 5000, 'tfidf__min_df': 2, 'tfi...",3
4,Ada Boost Classifier,CountVectorizer,0.771504,0.761418,"{'ada__base_estimator__max_depth': 2, 'ada__learning_rate': 0.9, 'ada__n_estimators': 150, 'cvec...",81
5,Ada Boost Classifier,TFIDFVectorizer,0.771504,0.76168,"{'ada__base_estimator__max_depth': 2, 'ada__learning_rate': 0.9, 'ada__n_estimators': 150, 'tfid...",81
6,Gradient Boost Classifier,CountVectorizer,0.800708,0.774196,"{'cvec__max_df': 0.9, 'cvec__max_features': 3000, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 1)...",68
7,Gradient Boost Classifier,TFIDFVectorizer,0.799113,0.774523,"{'gb__learning_rate': 0.9, 'gb__max_depth': 2, 'gb__n_estimators': 150, 'tfidf__max_df': 0.9, 't...",68


### Conclusions

#### Baseline Scores

In [53]:
# Calculating baseline
baseline = y_test.value_counts(normalize=True)
baseline

whiskey    0.350108
Scotch     0.327960
bourbon    0.321932
Name: subreddit, dtype: float64

#### Insights
Each model performed very similarly no matter the vectorizer. Given the lack of structured data cleaning and only analyzing the submission title the lower scores seem reasonable given the general overlap that r/whiskey will have when compared to the more specialized r/bourbon and r/scotch.

The baseline scores were about .33 for each subreddit which is to be expected by the even distribution in sample size. Each model performed well above the baseline with an apporximate accuracy of 75% for predicting submissions.

##### Model Performance
![proj3-model-performance.png](images/proj3-model-performance.png)

While all models had similar accuracy scores the runtimes were very different. It's worth noting the very low processing required by the Multinomial Naive Bayes model means it should be considered in future modeling regimes.

##### Run Time
![proj3-time.png](images/proj3-time.png)

### Recommendations and Next Steps

#### Alter Stopwords Dictionary
- The default 'english' stopwords dictionary contains the word 'still'. I considered this to be problematic for comparing various whiskeys and their manufacturers.

#### Combine title and self text
- Expand on data set by combining the 'title' and 'self text' data. Use the 'stopwordplaceholder' as a way to prevent nulls. Adding this string to the edited stopwords dictionary will remove it from analysis.

#### Expand on Hyperparameters using AWS
- Given the process limitations of my local machine I was limited to the hyperparameters I could gridsearch across all my models. Spinning up a virtual machine to expand processing power will allow for larger and more powerful grid searches.

#### Fine Tune Pipelines
- After Expanded GridSearching a better idea of what number of features that will be ideal will be known. Pulling out the most influential keywords for comparison may show additional insights.

#### Compare just r/scotch and r/bourbon
- Removing r/whiskey data from the comparison may show larger differences in model scores