# Project 3 - A Tale of Two Subreddits 
---

## Modeling
---

Now that the data cleaning is complete, we can move on to modeling. As previously stated, this will mean creating a Random Forest Classifier and a Naive Bayes Classifier. Before moving on, recall that the baseline accuracy for this data is **50.5%**.
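
For reference, the baseline is the accuracy of always predicting the majority class. A minimal sketch of that calculation, using a small made-up label series (`toy_labels` is hypothetical, not the project data):

```python
import pandas as pd

# Hypothetical labels for illustration only (0 = Conservative, 1 = neoliberal)
toy_labels = pd.Series([0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0])

# Share of each class; the largest share is the majority-class baseline
class_shares = toy_labels.value_counts(normalize=True)
baseline_accuracy = class_shares.max()

print(class_shares)
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

Any model worth keeping should beat this number on unseen data.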

In [59]:
# But first, imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk

# Modeling imports
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

In [13]:
# Load in the cleaned data
data = pd.read_csv('../datasets/raw_text_data.csv')
data.head()

| | selftext | subreddit | title | text | Target_col |
|---|---|---|---|---|---|
| 0 | | Conservative | John Kerry Declares Biden Administration Will ... | John Kerry Declares Biden Administration Will... | 0 |
| 1 | | Conservative | Illegal Immigration and ‘Racial Equity’ | Illegal Immigration and ‘Racial Equity’ | 0 |
| 2 | [deleted] | Conservative | Of course it is | [deleted] Of course it is | 0 |
| 3 | [removed] | Conservative | The Future of the GOP | [removed] The Future of the GOP | 0 |
| 4 | | Conservative | Joe Biden to Ban New Fracking Leases on Federa... | Joe Biden to Ban New Fracking Leases on Feder... | 0 |


In [26]:
# Grab the needed data and split the data
X = data['title']
y = data['Target_col']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

In [27]:
X_train.shape

(6798,)

In [28]:
y_train.shape

(6798,)

In [5]:
# Add in the stop words list
stop_words = nltk.corpus.stopwords.words('english')
stop_words_plus = stop_words + ['removed', 'deleted']

### The Random Forest Classifier

To create a Random Forest Classifier, I will use a Pipeline that combines the TF-IDF word vectorizer with the estimator. This will allow me to tune the hyperparameters of both in a single grid search:

In [40]:
# Set up a counter and empty dict to store model params in
model_params_rf = {}
count_rf = 0

In [50]:
# Carry the stored params and counter over between runs
# (these self-assignments are no-ops; rerun the cell above to reset them)
model_params_rf = model_params_rf
count_rf = count_rf

# Set up the pipeline and params to tune
pipe_rf = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('rf', RandomForestClassifier())
])

params_rf = {
    'tvec__stop_words'     : [None, stop_words, stop_words_plus],
#     'tvec__min_df'         : [1.0, 2.0],
    'tvec__max_features'   : [2000],
    'tvec__ngram_range'    : [(1, 1), (1, 2)],
    'rf__n_estimators'     : [120, 140],
    'rf__max_depth'        : [15, 20, 30, 35],
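    'rf__max_features'     : ['sqrt'],  # the older 'auto' (shown in the results below) meant 'sqrt'; 'auto' was removed in scikit-learn 1.3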
    'rf__min_samples_split': [2, 3, 4],
    'rf__ccp_alpha'        : [0.0001, 0.001] 
}

# Perform gridsearch
grid_rf = GridSearchCV(pipe_rf, param_grid=params_rf, n_jobs=-1, cv=5, verbose=2)

grid_rf.fit(X_train, y_train)

# Stole this part from the lesson on Random Forests - all credit to Patrick here
count_rf += 1

# Note: best_score_ is the mean cross-validated accuracy, not accuracy on the full training set
grid_rf.best_params_['training_score'] = grid_rf.best_score_
model_params_rf[f'model_{count_rf}'] = grid_rf.best_params_
grid_rf.best_params_['test_score'] = grid_rf.score(X_test, y_test)

model_df = pd.DataFrame.from_dict(model_params_rf, orient='index')
model_df

Fitting 5 folds for each of 288 candidates, totalling 1440 fits


| | rf__ccp_alpha | rf__max_depth | rf__max_features | rf__min_samples_split | rf__n_estimators | tvec__max_features | tvec__ngram_range | tvec__stop_words | training_score | test_score |
|---|---|---|---|---|---|---|---|---|---|---|
| model_1 | 0.001 | 4 | auto | 3 | 120 | 2000 | (1, 1) | | 0.638569 | 0.632833 |
| model_2 | 0.001 | 5 | auto | 3 | 120 | 2000 | (1, 1) | | 0.639303 | 0.628861 |
| model_3 | 0.0001 | 6 | auto | 2 | 120 | 2000 | (1, 1) | | 0.643127 | 0.634157 |
| model_4 | 0.0001 | 9 | auto | 3 | 140 | 2000 | (1, 1) | | 0.651806 | 0.643866 |
| model_5 | 0.0001 | 15 | auto | 2 | 140 | 2000 | (1, 1) | [i, me, my, myself, we, our, ours, ourselves, ... | 0.668725 | 0.667255 |
| model_6 | 0.0001 | 30 | auto | 3 | 120 | 2000 | (1, 1) | [i, me, my, myself, we, our, ours, ourselves, ... | 0.686671 | 0.672109 |


This model is not performing well, even after several rounds of hyperparameter tuning. As you can see, the best run (`model_6`) reached a cross-validated accuracy of about 69% on the training set (the `training_score` column stores `best_score_`) and 67% on the test set. At least the model isn't overfit!
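
One caveat when reading the table above: `GridSearchCV.best_score_` is the mean score on the held-out folds for the best candidate, so it is a validation estimate, not accuracy on the full training set. A minimal sketch of the difference, on a tiny made-up corpus (`toy_texts`, `toy_labels`, and the small grid are hypothetical and only for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus for illustration only (0 = Conservative, 1 = neoliberal)
toy_texts = [
    "tax cuts and border security", "second amendment rights rally",
    "border wall funding vote", "tax relief for small business",
    "carbon tax and free trade", "zoning reform builds housing",
    "open trade lowers prices", "housing supply and zoning rules",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

toy_pipe = Pipeline([("tvec", TfidfVectorizer()), ("nb", MultinomialNB())])
toy_grid = GridSearchCV(
    toy_pipe, param_grid={"nb__alpha": [0.5, 1.0]}, cv=2, return_train_score=True
)
toy_grid.fit(toy_texts, toy_labels)

# Mean accuracy on the held-out folds for the best candidate
cv_accuracy = toy_grid.best_score_
# Mean accuracy on the training folds for the same candidate
fold_train_accuracy = toy_grid.cv_results_["mean_train_score"][toy_grid.best_index_]
# Accuracy of the refit best model on the full training data
full_train_accuracy = toy_grid.score(toy_texts, toy_labels)

print(cv_accuracy, fold_train_accuracy, full_train_accuracy)
```

Comparing the last two numbers against the first is a more direct overfitting check than comparing `best_score_` with the test score.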

To get a better idea of this model's performance, I will use sklearn's `classification_report`:

In [60]:
y_preds_rf = grid_rf.predict(X_test) 

In [63]:
print(classification_report(y_test, y_preds_rf, target_names=['Conservative', 'neoliberal']))

              precision    recall  f1-score   support

Conservative       0.64      0.79      0.71      1146
  neoliberal       0.72      0.55      0.63      1120

    accuracy                           0.67      2266
   macro avg       0.68      0.67      0.67      2266
weighted avg       0.68      0.67      0.67      2266



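The macro-averaged precision and recall of this model were 0.68 and 0.67, respectively. The per-class numbers are less even: the model recalls far more Conservative posts (0.79) than neoliberal posts (0.55).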
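
The averaged figures can be reproduced from the per-class rows of the report. A small sketch, using only the rounded values printed above (so the result is approximate):

```python
# Per-class values copied from the Random Forest classification report above
precision = {"Conservative": 0.64, "neoliberal": 0.72}
recall = {"Conservative": 0.79, "neoliberal": 0.55}

# The macro average is the unweighted mean over the classes
macro_precision = sum(precision.values()) / len(precision)
macro_recall = sum(recall.values()) / len(recall)

print(f"macro precision: {macro_precision:.2f}, macro recall: {macro_recall:.2f}")
```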

### The Naive Bayes Classifier

In [45]:
model_params_nb = {}
count_nb = 0

In [51]:
# Carry the stored params and counter over between runs
# (these self-assignments are no-ops; rerun the cell above to reset them)
model_params_nb = model_params_nb
count_nb = count_nb

# Set up the pipeline and params to tune
pipe_nb = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('nb', MultinomialNB())
])

params_nb = {
    'tvec__stop_words'     : [None, stop_words, stop_words_plus],
#     'tvec__min_df'         : [1.0, 2.0],
    'tvec__max_features'   : [2000, 4000, 5000, 6000],
    'tvec__ngram_range'    : [(1, 1), (1, 2)],
    'nb__alpha'            : [0, 0.4, 0.6, 0.8, 1],
    'nb__fit_prior'        : [True, False]
}

# Perform gridsearch
grid_nb = GridSearchCV(pipe_nb, param_grid=params_nb, n_jobs=-1, cv=5, verbose=1)

grid_nb.fit(X_train, y_train)

# Stole this part from the lesson on Random Forests - all credit to Patrick here
count_nb += 1

grid_nb.best_params_['training_score'] = grid_nb.best_score_
model_params_nb[f'model_{count_nb}'] = grid_nb.best_params_
grid_nb.best_params_['test_score'] = grid_nb.score(X_test, y_test)

model_df_nb = pd.DataFrame.from_dict(model_params_nb, orient='index')
model_df_nb

Fitting 5 folds for each of 240 candidates, totalling 1200 fits


| | nb__alpha | nb__fit_prior | tvec__max_features | tvec__ngram_range | tvec__stop_words | training_score | test_score |
|---|---|---|---|---|---|---|---|
| model_1 | 1 | True | 4000 | (1, 1) | [i, me, my, myself, we, our, ours, ourselves, ... | 0.721094 | 0.71624 |
| model_2 | 1 | True | 6000 | (1, 1) | [i, me, my, myself, we, our, ours, ourselves, ... | 0.724772 | 0.720653 |


This model certainly performed better than the random forest classifier, but its accuracy is still not fantastic at 72%. Once again, this model has a bias issue.

Below is the `classification_report` for this model:

In [65]:
y_preds_nb = grid_nb.predict(X_test) 

In [66]:
print(classification_report(y_test, y_preds_nb, target_names=['Conservative', 'neoliberal']))

              precision    recall  f1-score   support

Conservative       0.70      0.77      0.74      1146
  neoliberal       0.74      0.67      0.70      1120

    accuracy                           0.72      2266
   macro avg       0.72      0.72      0.72      2266
weighted avg       0.72      0.72      0.72      2266



For this model, the averaged precision and recall also increased to 0.72, the same value as the accuracy of the model. This also highlights the fact that the classes of the data were very well balanced.
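
Accuracy is the support-weighted average of the per-class recalls, which is why the weighted-average recall in the report equals the accuracy. A quick arithmetic check, using only the rounded values from the report above (so the result is approximate):

```python
# Per-class recall and support copied from the Naive Bayes report above
recall = {"Conservative": 0.77, "neoliberal": 0.67}
support = {"Conservative": 1146, "neoliberal": 1120}

# recall * support approximates the number of correctly classified posts per class
correct = sum(recall[name] * support[name] for name in recall)
accuracy = correct / sum(support.values())

print(f"{accuracy:.2f}")  # 0.72
```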

### The Kernel SVM Model

Since the previous two models hit a score ceiling, I will instead try to fit the data to an SVM model. In this case, the SVM will use a second-order polynomial kernel to better classify the data. The process for creating this model will be very similar to the workflows for the previous two models.

*Note:* Since the TF-IDF vectorizer already weights each term by its inverse document frequency across the corpus, I shouldn't need to scale the data beforehand, but I'm going to try this anyway to see if it will improve performance.
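
One reason extra scaling may be unnecessary: with its default `norm='l2'` setting, `TfidfVectorizer` already rescales every document vector to unit length. A minimal sketch on a few made-up titles (`toy_titles` is hypothetical):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical titles for illustration only
toy_titles = [
    "Biden signs new executive order",
    "Trump rally draws large crowd",
    "Zoning reform builds more housing",
]

# Default settings: IDF weighting followed by L2 normalization of each row
tfidf_matrix = TfidfVectorizer().fit_transform(toy_titles)

# Euclidean length of each document vector (dense conversion is fine at this size)
row_norms = np.linalg.norm(tfidf_matrix.toarray(), axis=1)
print(row_norms)
```

Applying `StandardScaler` afterwards rescales each column independently, so the rows generally no longer have unit length.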

In [69]:
# Import SVM
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

In [54]:
# Providing the basic framework for the model
model_params_svc = {}
count_svc = 0

In [72]:
# Carry the stored params and counter over between runs
# (these self-assignments are no-ops; rerun the cell above to reset them)
model_params_svc = model_params_svc
count_svc = count_svc

# Set up the pipeline and params to tune
pipe_svc = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('sc', StandardScaler(with_mean=False)), # This prevents an error with the sparse array
    ('svc', SVC(kernel='poly'))
])

params_svc = {
    'tvec__stop_words'     : [None, stop_words, stop_words_plus],
    'tvec__max_features'   : [6000, 8000],
    'tvec__ngram_range'    : [(1, 1), (1, 2)],
    'svc__C'               : np.linspace(0.01, 20, 10),
    'svc__gamma'           : ['scale', 'auto'],
    'svc__degree'          : [2]
}

# Perform gridsearch
grid_svc = GridSearchCV(pipe_svc, param_grid=params_svc, n_jobs=-1, cv=5, verbose=1)

grid_svc.fit(X_train, y_train)

# Stole this part from the lesson on Random Forests - all credit to Patrick here
count_svc += 1

grid_svc.best_params_['training_score'] = grid_svc.best_score_
model_params_svc[f'model_{count_svc}'] = grid_svc.best_params_
grid_svc.best_params_['test_score'] = grid_svc.score(X_test, y_test)

model_df_svc = pd.DataFrame.from_dict(model_params_svc, orient='index')
model_df_svc

Fitting 5 folds for each of 240 candidates, totalling 1200 fits


| | svc__C | svc__degree | svc__gamma | tvec__max_features | tvec__ngram_range | tvec__stop_words | training_score | test_score |
|---|---|---|---|---|---|---|---|---|
| model_1 | 2.222311 | 2 | scale | 6000 | (1, 1) | | 0.719181 | 0.736099 |
| model_2 | 2.231111 | 2 | scale | 8000 | (1, 1) | | 0.722711 | 0.740071 |
| model_3 | 17.778889 | 2 | scale | 8000 | (1, 2) | | 0.666959 | 0.675199 |


In [67]:
y_preds_svc = grid_svc.predict(X_test) 

In [68]:
print(classification_report(y_test, y_preds_svc, target_names=['Conservative', 'neoliberal']))

              precision    recall  f1-score   support

Conservative       0.73      0.77      0.75      1146
  neoliberal       0.75      0.71      0.73      1120

    accuracy                           0.74      2266
   macro avg       0.74      0.74      0.74      2266
weighted avg       0.74      0.74      0.74      2266



SVM did offer an improvement over the other two models, but only by about *2 percentage points* over Naive Bayes (0.74 versus 0.72 on the test set). Another thing I was not expecting from this model was the lower accuracy when `StandardScaler` was included in the pipeline - usually, scaling your data makes an SVM perform better. I suppose this reinforces my original hypothesis that the vectorized text data would not need to be scaled for SVM to perform well.
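
One way to compare the pipeline with and without the scaler in a single search is to let the grid swap the `sc` step for `'passthrough'`. This is only a sketch on a made-up corpus (`toy_texts` and `toy_labels` are hypothetical), not necessarily how the comparison above was run:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical toy corpus for illustration only (0 = Conservative, 1 = neoliberal)
toy_texts = [
    "tax cuts and border security", "second amendment rights rally",
    "border wall funding vote", "tax relief for small business",
    "carbon tax and free trade", "zoning reform builds housing",
    "open trade lowers prices", "housing supply and zoning rules",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

toy_pipe = Pipeline([
    ("tvec", TfidfVectorizer()),
    ("sc", StandardScaler(with_mean=False)),
    ("svc", SVC(kernel="poly", degree=2)),
])

# 'passthrough' tells the Pipeline to skip the step entirely
toy_params = {"sc": [StandardScaler(with_mean=False), "passthrough"]}

toy_grid = GridSearchCV(toy_pipe, param_grid=toy_params, cv=2)
toy_grid.fit(toy_texts, toy_labels)

# One row per candidate: scaled vs. unscaled
for step, score in zip(toy_grid.cv_results_["param_sc"],
                       toy_grid.cv_results_["mean_test_score"]):
    print(step, round(score, 3))
```

Both variants are then scored on the same folds, which makes the comparison fairer than two separate runs.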

For this model, the accuracy on the test set, the recall, and the precision were all 0.74. The accuracy on the test set was actually better than the cross-validated score on the training set (0.72)!

## Results
---

In the end, I was able to create a classification model that could predict on unseen subreddit posts from r/neoliberal and r/Conservative and classify them into these categories. However, the best performing SVM model only had an accuracy of 0.74 on unseen data.

I believe the main factor that kept the score down here was the fact that r/neoliberal and r/Conservative have similar post titles and descriptions. Recall that the top ten words from both subreddits included "Biden", "Trump" and "US." The models therefore had a difficult time differentiating between the two subreddits. Despite the difference in talking points and overall politics, the tone and vocabulary of both subreddits are very similar.
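
The vocabulary overlap can be checked directly by comparing the most frequent words per subreddit. A rough sketch on made-up titles (`toy_posts` and the `top_words` helper are hypothetical, not the analysis from the earlier notebook):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical titles for illustration only
toy_posts = {
    "Conservative": [
        "Biden signs border order", "Trump rally draws crowd",
        "Biden and Trump debate", "border policy news",
    ],
    "neoliberal": [
        "Biden trade deal", "Trump tariffs hurt trade",
        "Biden housing plan", "Trump zoning reform",
    ],
}

def top_words(titles, n=3):
    """Return the n most frequent non-stop-word terms in a list of titles."""
    cvec = CountVectorizer(stop_words="english")
    counts = np.asarray(cvec.fit_transform(titles).sum(axis=0)).ravel()
    totals = {term: counts[idx] for term, idx in cvec.vocabulary_.items()}
    return set(sorted(totals, key=totals.get, reverse=True)[:n])

# Words that rank among the most frequent in both subreddits
shared = top_words(toy_posts["Conservative"]) & top_words(toy_posts["neoliberal"])
print(shared)
```

The larger this shared set is relative to `n`, the less the most common words help a bag-of-words model tell the classes apart.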

## Conclusions and Next Steps

Based on the findings above, I would recommend that the Pushshift archiving team use this model to sort through the mixed-up 2020 data in *chunks*, so that the misclassifications the model produces can be caught and corrected before moving on. This should make the task of cleaning up the clerical error much simpler, although it isn't the cleanest solution.
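
A chunked workflow along these lines could also flag low-confidence predictions for manual review. The sketch below is only one possible approach: it uses a Naive Bayes pipeline because it exposes `predict_proba`, and the corpus, `chunk_size`, and `review_threshold` values are all hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical training data for illustration only (0 = Conservative, 1 = neoliberal)
train_texts = [
    "tax cuts and border security", "second amendment rights rally",
    "border wall funding vote", "tax relief for small business",
    "carbon tax and free trade", "zoning reform builds housing",
    "open trade lowers prices", "housing supply and zoning rules",
]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]
unsorted_posts = ["border security vote", "zoning and housing reform",
                  "completely unrelated words"]

model = Pipeline([("tvec", TfidfVectorizer()), ("nb", MultinomialNB())])
model.fit(train_texts, train_labels)

def classify_in_chunks(posts, chunk_size=2, review_threshold=0.6):
    """Yield lists of (post, predicted_label, needs_review), one chunk at a time."""
    for start in range(0, len(posts), chunk_size):
        chunk = posts[start:start + chunk_size]
        probabilities = model.predict_proba(chunk)
        predicted = model.named_steps["nb"].classes_[probabilities.argmax(axis=1)]
        # Flag predictions whose top class probability is below the threshold
        needs_review = probabilities.max(axis=1) < review_threshold
        yield list(zip(chunk, predicted, needs_review))

chunks = list(classify_in_chunks(unsorted_posts))
for chunk in chunks:
    for post, label, flag in chunk:
        print(f"{post!r}: label={label}, needs_review={flag}")
```

Posts flagged for review can be checked by hand before the next chunk is processed, which keeps errors from piling up.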

If there were more time to create further models, I might instead recommend using a more powerful model, such as a neural net, to more accurately classify the posts. The process for training a neural net would likely take a long time to complete, however, and I'm not familiar enough with neural nets to be able to make one from scratch at this time.

Another idea would be to introduce more complexity into the models. The bias on all of these models was high, so there could be a lot of room for score improvements if this is done. I think this is especially relevant for the Random Forest Classifier, as increasing the complexity of the model via `max_depth` led to good score improvements. Out of all three models, the Random Forest Classifier showed the most improvement with iterative hyperparameter tuning. The main issue with more tuning, however, is the *significant* time sink grid searching each model presents.
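
One option for reducing that time sink is `RandomizedSearchCV`, which samples a fixed number of candidates instead of trying every combination. This is a different technique from the exhaustive `GridSearchCV` used above, sketched here on a made-up corpus with hypothetical parameter values:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus for illustration only (0 = Conservative, 1 = neoliberal)
toy_texts = [
    "tax cuts and border security", "second amendment rights rally",
    "border wall funding vote", "tax relief for small business",
    "carbon tax and free trade", "zoning reform builds housing",
    "open trade lowers prices", "housing supply and zoning rules",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

toy_pipe = Pipeline([
    ("tvec", TfidfVectorizer()),
    ("rf", RandomForestClassifier(random_state=42)),
])

# 2 * 2 * 3 * 3 = 36 possible combinations in the full grid
toy_params = {
    "tvec__ngram_range": [(1, 1), (1, 2)],
    "rf__n_estimators": [10, 20],
    "rf__max_depth": [5, 15, 30],
    "rf__min_samples_split": [2, 3, 4],
}

# Only n_iter=5 of the 36 combinations are sampled and fitted
toy_search = RandomizedSearchCV(
    toy_pipe, param_distributions=toy_params, n_iter=5, cv=2, random_state=42
)
toy_search.fit(toy_texts, toy_labels)

print(len(toy_search.cv_results_["params"]), toy_search.best_params_)
```

The trade-off is that the best combination may not be sampled, so the result is a cheaper approximation of the full grid.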

One more thing to consider is the fact that the data used for these models is very perishable. Political discourse changes all the time as the issues facing the world change. I chose to scrape posts from early 2021 for this reason - I believe that this time period best captures the political discourse of late 2020, from the 2020 US elections to the COVID-19 pandemic. Trying to classify the 2020 posts with data from another year would result in low accuracy, and trying to classify posts from, say, 2018 based on 2021 data would also result in low accuracy.