# Reddit toxic comment classifier: <br />Random Forest
## Hyperparameter tuning with hyperopt

### John Burt


### Introduction:

The goal of my first Capstone project is to develop a toxic comment classifier. This notebook implements a Random Forest classifier and tunes its hyperparameters using hyperopt, a Bayesian hyperparameter optimization package.

### Load the data.

The comment data used in this analysis was prepared in three stages:

- Comments were [acquired using the Reddit Python API wrapper PRAW](https://github.com/johnmburt/springboard/blob/master/capstone_1/reddit_collect_comments_v1.ipynb) from 12 subs: 8 non-political and 4 political. Each model is trained on data from only one subreddit at a time, so that it is specialized to that subreddit.


- The raw comment metadata was [processed using PCA to produce a single toxicity score](https://github.com/johnmburt/springboard/blob/master/capstone_1/reddit_generate_PCA_score_v2.ipynb) based on the votes and number of replies. The toxicity score was calculated and normalized within each subreddit, then constrained to lie between -5 and +5 so that scores are comparable between subs. The score was then thresholded to generate binary "toxic" vs. "not toxic" labels for supervised model training: score <= -1 is "toxic", anything else is "not toxic".


- [Features for training the models were generated](https://github.com/johnmburt/springboard/blob/master/capstone_1/reddit_comment_create_model_features_v1.ipynb) and saved to two sample-aligned feature files for each subreddit. The models read these files as input.
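
The thresholding rule from the second stage can be sketched in a few lines. This is only an illustration of the stated rule (score <= -1 is toxic); the function name `label_toxic` and the example scores are made up here and are not taken from the linked notebooks.

```python
import numpy as np

TOXIC_THRESH = -1  # threshold stated above: score <= -1 is labelled toxic

def label_toxic(scores, thresh=TOXIC_THRESH):
    """Return 1 for toxic (score <= thresh) and 0 for not toxic."""
    return (np.asarray(scores) <= thresh).astype(int)

# Made-up scores, for illustration only
example_scores = [-5.0, -1.0, -0.99, 0.0, 3.2]
labels = label_toxic(example_scores)
print(labels.tolist())
```

Note that a score of exactly -1 falls on the toxic side of the threshold.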



In [4]:
# remove warnings
import warnings
warnings.filterwarnings('ignore')
# ---

%matplotlib inline
from matplotlib import pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')

import pandas as pd
pd.options.display.max_columns = 100

import numpy as np
import datetime
import time
import csv
import glob

# import helper functions
import sys
sys.path.append('./')
import capstone1_helper
import importlib
importlib.reload(capstone1_helper)

<module 'capstone1_helper' from 'C:\\Users\\john\\notebooks\\reddit\\capstone1_helper.py'>

## Define custom model and pipeline 



In [12]:
from hyperopt import tpe, hp, fmin, Trials
from sklearn.model_selection import cross_val_score
from time import time
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_transformer
from hyperopt import space_eval
    
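# Custom classifier that balances the training data
class RandomForest_bal(RandomForestClassifier):
    """Wrapper class that balances data by upsampling prior to training"""

    # No __init__(**kwargs) override: scikit-learn reads parameter names from
    # the __init__ signature, so a **kwargs-only constructor would hide them
    # from get_params() and clone(), which cross_val_score relies on.

    def set_params(self, **params):
        # Log each parameter set tried, then delegate to scikit-learn
        print(params)
        return super().set_params(**params)

    def fit(self, X, y, **fit_params):
        # Upsample so the classes are balanced, then fit as usual
        bal_X, bal_y = capstone1_helper.balance_classes_sparse(X, y, verbose=False)
        super().fit(bal_X, bal_y, **fit_params)
        return self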
    
def build_pipeline(classifier):
    """Create a pipeline to vectorize text,
        combine all features, and pass that to classifier
    """
    preprocessor = ColumnTransformer(
        transformers=[('tfidf', TfidfVectorizer(),'text')],
        remainder="passthrough"
        )        
    return Pipeline([('pre', preprocessor),
                     ('clf', classifier)])
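
The balancing step relies on `capstone1_helper.balance_classes_sparse`, whose source is not shown in this notebook. As a rough sketch of what "balancing by upsampling" can look like, the hypothetical function below resamples every smaller class with replacement until it matches the largest class. The name `upsample_minority` and the toy data are assumptions for illustration, not the helper's actual implementation.

```python
import numpy as np
from scipy import sparse
from sklearn.utils import resample

def upsample_minority(X, y, random_state=0):
    """Resample smaller classes (with replacement) up to the largest class size."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx_parts = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        if count < n_max:
            # Draw row indices with replacement until the class has n_max rows
            idx = resample(idx, replace=True, n_samples=n_max,
                           random_state=random_state)
        idx_parts.append(idx)
    idx_all = np.concatenate(idx_parts)
    return X[idx_all], y[idx_all]

# Toy data: 8 rows of class 0 and 2 rows of class 1, stored as a sparse matrix
X_toy = sparse.csr_matrix(np.arange(20).reshape(10, 2))
y_toy = np.array([0] * 8 + [1] * 2)
X_bal, y_bal = upsample_minority(X_toy, y_toy)
print(X_bal.shape, np.bincount(y_bal).tolist())
```

Because only the training folds pass through `fit`, the test data keeps its original class proportions.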
    
    

In [15]:
# best = fmin(fn=objective, space=paramspace, algo=tpe.suggest, max_evals=100, trials=trials)
best_params
print('\n  Best parameters:',best_params)


  Best parameters: {'clf__bootstrap': False, 'clf__max_depth': 74, 'clf__max_features': 'sqrt', 'clf__min_samples_leaf': 2, 'clf__min_samples_split': 6, 'clf__n_estimators': 2022, 'pre__tfidf__max_df': 0.9621722973702668, 'pre__tfidf__min_df': 4, 'pre__tfidf__ngram_range': (1, 1), 'pre__tfidf__stop_words': 'english', 'pre__tfidf__sublinear_tf': False, 'pre__tfidf__use_idf': True}
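
The keys in the best-parameter dictionary follow scikit-learn's nested naming convention: `clf__n_estimators` addresses the `n_estimators` parameter of the pipeline step named `clf`, and `pre__tfidf__min_df` reaches through the `pre` step into its `tfidf` transformer. A minimal sketch, using a plain `RandomForestClassifier` in place of the custom class and arbitrary example values:

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Same step names as build_pipeline: 'pre' (with a nested 'tfidf') and 'clf'
pipe = Pipeline([
    ("pre", ColumnTransformer(
        transformers=[("tfidf", TfidfVectorizer(), "text")],
        remainder="passthrough")),
    ("clf", RandomForestClassifier()),
])

# Double underscores route each value to the matching nested estimator
pipe.set_params(pre__tfidf__min_df=4, clf__n_estimators=50)

params = pipe.get_params()
print(params["pre__tfidf__min_df"], params["clf__n_estimators"])
```

This is why the hyperopt search space can be passed straight to `set_params` without any key translation.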


## Hyperparameter optimization using Bayesian methods



In [13]:
from scipy.sparse import hstack
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, balanced_accuracy_score
from sklearn.metrics import f1_score, roc_auc_score
from time import time
from hyperopt import space_eval

# source data folder 
srcdir = './data_for_models/'

# subs to use for this analysis
sub2use = ['gaming', 'politics']

# apply a threshold to determine toxic vs not toxic
thresh = -1

# results logfile path
logpath = srcdir + 'model_hyperopt_results_log.csv'

# name of model
modelname = 'RandomForestClassifier'
    
# specify parameters for text prep
processkwargs = {
    'stemmer':'snowball', # snowball stemmer
    'regexstr':None, # regex for character filtering (None: none supplied)
    'lowercase':False, # don't convert to lowercase
    'removestop':False, # don't remove stop words 
    'verbose':False
                }

# define model defaults for TF-IDF vectorizer
# (not passed to build_pipeline below; the vectorizer settings are tuned by hyperopt)
tfidfargs = {
    "analyzer":'word', 
    "max_features" : 10000,
    "max_df" : 0.5, # ignore terms that occur in more than half of the docs
    "min_df" : 5, # ignore terms that occur in fewer than 5 docs
    "ngram_range":(1, 3), # unigrams, bigrams and trigrams
    "stop_words" : 'english', # strip out English stop words
    "use_idf" : True
    }

# hyperopt objective function
def objective(params):
    """Return 1 - mean cross-validated balanced accuracy (hyperopt minimizes)."""
    global X_train, y_train
    clf = build_pipeline(RandomForest_bal())
    clf.set_params(**params)
    score = cross_val_score(clf, X_train, y_train, 
                            scoring='balanced_accuracy',n_jobs=3).mean()   
    return 1-score
   
# hyperopt parameter space
paramspace = {
    'pre__tfidf__stop_words': hp.choice('tfidf__stop_words', ['english', None]),
    'pre__tfidf__use_idf': hp.choice('tfidf__use_idf', [True, False]),
    'pre__tfidf__sublinear_tf': hp.choice('tfidf__sublinear_tf', [True, False]),
    'pre__tfidf__min_df': 1+hp.randint('tfidf__min_df', 5),
    'pre__tfidf__max_df': hp.uniform('tfidf__max_df', 0.5, 1.0),
    'pre__tfidf__ngram_range': hp.choice('tfidf__ngram_range', [(1, 1),(1, 2), (1, 3)]),
    
    'clf__n_estimators': 200+hp.randint('clf__n_estimators', 2000),
    'clf__max_depth': 10+hp.randint('clf__max_depth', 100),
    'clf__min_samples_leaf': 1+hp.randint('clf__min_samples_leaf', 5),
    # note: for classifiers 'auto' means the same as 'sqrt'; it was removed in scikit-learn 1.3
    'clf__max_features': hp.choice('clf__max_features', ['auto', 'sqrt']),
    'clf__bootstrap': hp.choice('clf__bootstrap', [True, False]),    
    'clf__min_samples_split': 2+hp.randint('clf__min_samples_split', 10),
    }

# loop through to tune model with comments from each specified subreddit 
for subname in sub2use:
    t0 = tstart = time()

    print('------------------------------------------------------')
    print('\nTuning model %s using sub %s'%(modelname,subname))
    
    print('  loading feature data')
    t0 = time()
    X_text, X_numeric, y = capstone1_helper.load_feature_data([subname], srcdir, 
                                             toxic_thresh=thresh, 
                                             text_prep_args=processkwargs)
    print('    done in %0.3fs,'%(time() - t0),'X_text, X_numeric, y:',X_text.shape, X_numeric.shape, y.shape)
    
    # combine X data into df so I can do train_test_split now, 
    #  before the text is further processed
    dvcols = [s for s in X_numeric.columns if 'dv_' in s ]
    cols2use = dvcols + ['u_comment_karma']
    X_df = pd.concat([pd.DataFrame({'text':X_text}),pd.DataFrame(X_numeric[cols2use])],axis=1,ignore_index=True)   
    X_df.columns = ['text'] + cols2use
    
    # Split into test and training data
    t0 = time()
    print('  train/test split')
    X_train, X_test, y_train, y_test = train_test_split(X_df, y,  test_size=0.1, random_state=42)
    print('    done in %0.3fs,'%(time() - t0),'X_train, X_test', X_train.shape, X_test.shape)
   
    # hyperparameter tuning:
    # The Trials object will store details of each iteration
    trials = Trials()

    # Run the hyperparameter search using the tpe algorithm
    t0 = time()
    print('  tune model')
    best = fmin(fn=objective, space=paramspace, algo=tpe.suggest, max_evals=100, trials=trials)
    print('    done in %0.3fs,'%(time() - t0))

    # Get the values of the optimal parameters
    best_params = space_eval(paramspace, best)
    print('\n  Best parameters:',best_params)   

    # test model
    clf = build_pipeline(RandomForest_bal())

    # set model with the optimal hyperparamters
    clf.set_params(**best_params)

    # fit the classifier with training data
    clf.fit(X_train, y_train)

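    # predict() already returns hard 0/1 labels, so the threshold below is a no-op;
    # it only matters if the predict_proba line is used instead
    # y_out = clf.predict_proba(X_test)[:,1]
    y_out = clf.predict(X_test)
    y_pred = (np.where(y_out>.5,1,0))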
    
    # log the results
    capstone1_helper.log_model_results(logpath, modelname, subname, y_test, y_pred)

    print('\n  model performance:')
    print('    Test set: #nontoxic =', (y_test==0).sum(), ' #toxic =', (y_test==1).sum())
    print('    overall accuracy: %1.3f'%(np.sum((y_pred==y_test))/y_test.shape[0]))
    print('    Precision: %1.3f'%(precision_score(y_test, y_pred)))
    print('    Recall: %1.3f'%(recall_score(y_test, y_pred)))
    print('    Balanced Accuracy: %1.3f'%(balanced_accuracy_score(y_test, y_pred)))
    print('    F1 Score: %1.3f'%(f1_score(y_test, y_pred)))
    # note: computed from hard labels, so this value equals balanced accuracy
    print('    ROC AUC: %1.3f'%(roc_auc_score(y_test, y_pred)))    
    
    print('\n  Total time to optimize model for sub %s = %3.1f min'%(subname, (time() - tstart)/60))
    

------------------------------------------------------

Tuning model RandomForestClassifier using sub gaming
  loading feature data
    done in 12.693s, X_text, X_numeric, y: (389947,) (389947, 101) (389947,)
  train/test split
    done in 0.432s, X_train, X_test (350952, 102) (38995, 102)
  tune model
{'bootstrap': False, 'max_depth': 41, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 6, 'n_estimators': 237}
{'bootstrap': False, 'max_depth': 52, 'max_features': 'auto', 'min_samples_leaf': 5, 'min_samples_split': 10, 'n_estimators': 1154}
{'bootstrap': True, 'max_depth': 84, 'max_features': 'sqrt', 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 1342}
{'bootstrap': False, 'max_depth': 106, 'max_features': 'sqrt', 'min_samples_leaf': 5, 'min_samples_split': 6, 'n_estimators': 1178}
{'bootstrap': True, 'max_depth': 91, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 4, 'n_estimators': 514}
{'bootstrap': False, 'max_depth': 109, 'ma

{'bootstrap': True, 'max_depth': 62, 'max_features': 'sqrt', 'min_samples_leaf': 5, 'min_samples_split': 10, 'n_estimators': 415}
{'bootstrap': True, 'max_depth': 64, 'max_features': 'sqrt', 'min_samples_leaf': 5, 'min_samples_split': 3, 'n_estimators': 1257}
{'bootstrap': True, 'max_depth': 60, 'max_features': 'auto', 'min_samples_leaf': 5, 'min_samples_split': 3, 'n_estimators': 1504}
{'bootstrap': False, 'max_depth': 33, 'max_features': 'auto', 'min_samples_leaf': 3, 'min_samples_split': 11, 'n_estimators': 1359}
{'bootstrap': False, 'max_depth': 40, 'max_features': 'sqrt', 'min_samples_leaf': 5, 'min_samples_split': 11, 'n_estimators': 2152}
{'bootstrap': True, 'max_depth': 40, 'max_features': 'auto', 'min_samples_leaf': 5, 'min_samples_split': 10, 'n_estimators': 1444}
{'bootstrap': True, 'max_depth': 44, 'max_features': 'auto', 'min_samples_leaf': 4, 'min_samples_split': 7, 'n_estimators': 511}
{'bootstrap': False, 'max_depth': 23, 'max_features': 'sqrt', 'min_samples_leaf': 3, '

### for r/gaming:

Best parameters: {'clf__bootstrap': False, 'clf__max_depth': 25, 'clf__max_features': 'sqrt', 'clf__min_samples_leaf': 1, 'clf__min_samples_split': 7, 'clf__n_estimators': 1305, 'pre__tfidf__max_df': 0.6610836362706193, 'pre__tfidf__min_df': 4, 'pre__tfidf__ngram_range': (1, 1), 'pre__tfidf__stop_words': 'english', 'pre__tfidf__sublinear_tf': False, 'pre__tfidf__use_idf': True}

  model performance:
    Test set: #nontoxic = 37871  #toxic = 1124
    overall accuracy: 0.939
    Precision: 0.134
    Recall: 0.205
    Balanced Accuracy: 0.583
    F1 Score: 0.162
    ROC AUC: 0.583
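
The ROC AUC reported here is identical to the balanced accuracy because the tuning cell passes hard 0/1 predictions to `roc_auc_score`. With a single operating point, the area under the ROC curve reduces to (TPR + TNR) / 2, which is the definition of balanced accuracy. The sketch below demonstrates this on made-up labels and scores; a threshold-independent AUC would need continuous scores such as those from `predict_proba`.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Made-up ground truth and classifier scores, for illustration only
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.35, 0.6, 0.7, 0.4, 0.65, 0.8, 0.9])
y_hard = (y_score > 0.5).astype(int)

auc_from_labels = roc_auc_score(y_true, y_hard)    # one operating point only
auc_from_scores = roc_auc_score(y_true, y_score)   # uses the full ranking
bal_acc = balanced_accuracy_score(y_true, y_hard)

print(f"AUC from hard labels: {auc_from_labels:.3f}")
print(f"Balanced accuracy:    {bal_acc:.3f}")
print(f"AUC from scores:      {auc_from_scores:.3f}")
```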

### for r/politics:

Best parameters: {'clf__bootstrap': False, 'clf__max_depth': 74, 'clf__max_features': 'sqrt', 'clf__min_samples_leaf': 2, 'clf__min_samples_split': 6, 'clf__n_estimators': 2022, 'pre__tfidf__max_df': 0.9621722973702668, 'pre__tfidf__min_df': 4, 'pre__tfidf__ngram_range': (1, 1), 'pre__tfidf__stop_words': 'english', 'pre__tfidf__sublinear_tf': False, 'pre__tfidf__use_idf': True}

  model performance:
    Test set: #nontoxic = 34124  #toxic = 2015
    overall accuracy: 0.929
    Precision: 0.271
    Recall: 0.165
    Balanced Accuracy: 0.570
    F1 Score: 0.205
    ROC AUC: 0.570
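
To put the overall accuracy figures in context, it helps to compare them with a trivial baseline that labels every comment as not toxic. The short calculation below uses only the test-set counts reported above; the dictionary and variable names are illustrative.

```python
# Test-set class counts copied from the results above
test_counts = {
    "gaming": {"nontoxic": 37871, "toxic": 1124},
    "politics": {"nontoxic": 34124, "toxic": 2015},
}

# Accuracy of a classifier that always predicts "not toxic"
baseline = {
    sub: c["nontoxic"] / (c["nontoxic"] + c["toxic"])
    for sub, c in test_counts.items()
}

for sub, acc in baseline.items():
    print(f"r/{sub}: always-nontoxic accuracy = {acc:.3f}")
```

Both tuned models score below this baseline on overall accuracy (0.939 and 0.929), which is expected with classes this imbalanced and is why balanced accuracy, precision, and recall are the more informative numbers here.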