# Reddit toxic comment classifier: <br />XGBoost
## Hyperparameter tuning with hyperopt

### John Burt


### Introduction:

The goal of my first Capstone project is to develop a toxic comment classifier. This notebook implements an Extreme Gradient Boosting (XGBoost) classifier and tunes its hyperparameters with hyperopt, a Bayesian hyperparameter optimization package.

**Note:** In several of my other tuning scripts the estimator is a pipeline, which lets me tune TfidfVectorizer parameters in addition to the classifier. With XGBClassifier that adds too much time to an already long tuning run, so here I tune only the XGBClassifier parameters (the process still takes most of a day!).
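For context, the pipeline-based tuning mentioned in the note relies on scikit-learn's `step__parameter` naming, which is also where the `clf__` prefixes in the parameter space later in this notebook come from. A minimal sketch, assuming a two-step pipeline with hypothetical step names `tfidf` and `clf`, a `LogisticRegression` stand-in for the much slower XGBoost model, and a toy corpus invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical step names; the stand-in classifier keeps the sketch fast.
pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# One flat dict addresses both steps through '<step>__<parameter>' keys,
# which is the shape of dict a tuner hands to the objective function.
params = {
    'tfidf__ngram_range': (1, 2),
    'tfidf__min_df': 1,
    'clf__C': 0.5,
}
pipe.set_params(**params)

# Toy corpus, invented for illustration only.
docs = ['you are great', 'you are awful', 'great post thanks', 'awful take, go away']
labels = [0, 1, 0, 1]
pipe.fit(docs, labels)

print(pipe.named_steps['tfidf'].ngram_range)
print(pipe.named_steps['clf'].C)
```

Because the vectorizer is refit for every candidate parameter set, each evaluation costs a full vectorize-plus-train cycle, which is the extra time the note refers to.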

### The data.

The comment data used in this analysis was prepared in three stages:

- Comments were [acquired using PRAW, the Python Reddit API Wrapper](https://github.com/johnmburt/springboard/blob/master/capstone_1/reddit_collect_comments_v1.ipynb), from 12 subs: 8 non-political and 4 political. Models are trained on data from only one subreddit at a time, so that each model is specialized to that subreddit.


- The raw comment metadata was [processed using PCA to produce a single toxicity score](https://github.com/johnmburt/springboard/blob/master/capstone_1/reddit_generate_PCA_score_v2.ipynb) based on the votes and number of replies. The score was calculated and normalized within each subreddit, then mapped to the range -5 to +5 so that it is comparable between subs. It was then thresholded to generate binary "toxic" vs. "not toxic" labels for supervised model training: a score <= -1 is labeled "toxic", anything else "not toxic".


- [Features for training the models were generated](https://github.com/johnmburt/springboard/blob/master/capstone_1/reddit_comment_create_model_features_v1.ipynb) and saved to two sample-aligned feature files for each subreddit. These files are the input to the models.
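The labeling rule in the second stage can be sketched as follows. The scores here are invented for illustration; only the threshold of -1 comes from the description above.

```python
import numpy as np

# Made-up toxicity scores on the -5 to +5 scale (illustration only).
scores = np.array([-4.2, -1.0, -0.9, 0.0, 3.5])

# Threshold from the text: score <= -1 is "toxic" (1), otherwise "not toxic" (0).
thresh = -1
labels = (scores <= thresh).astype(int)

print(labels)
```

Note that a score of exactly -1 falls on the toxic side of the threshold.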

In [5]:
import pandas as pd
pd.options.display.max_columns = 100

import numpy as np
import datetime
import time
import csv
import glob
from scipy.sparse import hstack
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, balanced_accuracy_score
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import cross_val_score
from time import time
from hyperopt import space_eval
from hyperopt import tpe, hp, fmin, Trials

from xgboost import XGBClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib



In [4]:
# remove warnings
import warnings
warnings.filterwarnings('ignore')
# ---

%matplotlib inline
from matplotlib import pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')

import pandas as pd
pd.options.display.max_columns = 100

import numpy as np
import datetime
import time
import csv
import glob

# import helper functions
import sys
sys.path.append('./')
import capstone1_helper
import importlib
importlib.reload(capstone1_helper)


## Hyperparameter optimization using Bayesian methods



In [None]:
from scipy.sparse import hstack
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, balanced_accuracy_score
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import cross_val_score
from time import time
from hyperopt import space_eval
from hyperopt import tpe, hp, fmin, Trials

from xgboost import XGBClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# source data folder 
srcdir = './data_for_models/'

# subs to use for this analysis
# sub2use = ['gaming', 'politics']
sub2use = ['politics','gaming']

# apply a threshold to determine toxic vs not toxic
thresh = -1

# results logfile path
logpath = srcdir + 'model_hyperopt_results_log.csv'

# name of model
modelname = 'XGBoost_3'
    
# specify parameters for text prep
processkwargs = {
    'stemmer':'snowball', # snowball stemmer
    'regexstr':None, # remove all but alphanumeric chars
    'lowercase':False, # don't convert to lowercase
    'removestop':False, # don't remove stop words 
    'verbose':False
                }

# define model defaults for TF-IDF vectorizer.
tfidfargs = {
    "analyzer":'word', 
    "max_features" : 10000,
    "max_df" : 0.5, # filters out terms that occur in more than half of the docs
    "min_df" : 5, # filters out terms that occur in fewer than 5 docs
    "ngram_range":(1, 3), # unigrams, bigrams, and trigrams
    "stop_words" : 'english', # strips out English stop words
    "use_idf" : True
    }

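# hyperopt objective function
def objective(params):
    global X_train_bal, y_train_bal
    # strip the pipeline-style 'clf__' prefix so the keys match XGBClassifier
    # argument names, and cast numpy integers to plain Python ints
    clf_params = {k.split('__', 1)[-1]: (int(v) if isinstance(v, np.integer) else v)
                  for k, v in params.items()}
    # apply the sampled params; without this every trial scores the same defaults
    clf = XGBClassifier(tree_method='gpu_hist', **clf_params)
    score = cross_val_score(clf, X_train_bal, y_train_bal, 
                            scoring='balanced_accuracy',n_jobs=2).mean()   
    # hyperopt minimizes, so return 1 - balanced accuracy as the loss
    return 1-score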
   
# hyperopt parameter space
paramspace = {
#     'pre__tfidf__stop_words': hp.choice('tfidf__stop_words', ['english', None]),
#     'pre__tfidf__use_idf': hp.choice('tfidf__use_idf', [True, False]),
#     'pre__tfidf__sublinear_tf': hp.choice('tfidf__sublinear_tf', [True, False]),
#     'pre__tfidf__min_df': 1+hp.randint('tfidf__min_df', 5),
#     'pre__tfidf__max_df': hp.uniform('tfidf__max_df', 0.5, 1.0),
#     'pre__tfidf__ngram_range': hp.choice('tfidf__ngram_range', [(1, 1),(1, 2), (1, 3)]),
    
    # hp.randint(label, n) draws an integer in [0, n); the offsets set the lower bounds
    'clf__learning_rate': hp.uniform('clf__learning_rate', 0.01, 1.0),
    'clf__max_depth': 3+hp.randint('clf__max_depth', 7), # 3..9
    'clf__n_estimators': 100+hp.randint('clf__n_estimators', 1000), # 100..1099
    'clf__min_child_weight': 1+hp.randint('clf__min_child_weight', 10), # 1..10
    'clf__scale_pos_weight': 1+hp.randint('clf__scale_pos_weight', 5), # 1..5
    }

# loop through to tune model with comments from each specified subreddit 
for subname in sub2use:
    t0 = tstart = time()

    print('------------------------------------------------------')
    print('\nTuning model %s using sub %s'%(modelname,subname))
    
    print('  loading feature data')
    t0 = time()
    X_text, X_numeric, y = capstone1_helper.load_feature_data([subname], srcdir, 
                                             toxic_thresh=thresh, 
                                             text_prep_args=processkwargs)
    print('    done in %0.3fs,'%(time() - t0),'X_text, X_numeric, y:',X_text.shape, X_numeric.shape, y.shape)
    
    # vectorize the text
    print('  vectorizing text')
    vectorizer = TfidfVectorizer(**tfidfargs)
    text_vec = vectorizer.fit_transform(X_text)

    # combine textvec + numeric
    dvcols = [s for s in X_numeric.columns if 'dv_' in s ]
    cols2use = dvcols + ['u_comment_karma']
    # numeric features must be >= 0 
    X_numeric[cols2use] = X_numeric[cols2use] - X_numeric[cols2use].min().min()
    # concat vector matrices as sparse array
    X = hstack([text_vec.tocsr(), csr_matrix(X_numeric[cols2use])] )
    X = X.tocsr()
    print('text_vec, X_numeric[cols2use], X',
          text_vec.shape, X_numeric[cols2use].shape, X.shape)
    
    # Split into test and training data
    t0 = time()
    print('  train/test split')
    X_train, X_test, y_train, y_test = train_test_split(X, y,  test_size=0.1, random_state=42)
    print('    done in %0.3fs,'%(time() - t0),'X_train, X_test', X_train.shape, X_test.shape)
    
    # upsample training data to balance classes
    print('  balancing training data')
    X_train_bal, y_train_bal = capstone1_helper.balance_classes_sparse(
        X_train, y_train, verbose=False)
   
    # hyperparameter tuning:
    # The Trials object will store details of each iteration
    trials = Trials()

    # Run the hyperparameter search using the tpe algorithm
    t0 = time()
    print('  tune model')
    best = fmin(fn=objective, space=paramspace, algo=tpe.suggest, max_evals=100, trials=trials)
    print('    done in %0.3fs,'%(time() - t0))

    # Get the values of the optimal parameters
    best_params = space_eval(paramspace, best)
    print('\n  Best parameters:',best_params)   

    # test model
    clf = XGBClassifier()

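    # set model with the optimal hyperparameters, stripping the pipeline-style
    # 'clf__' prefix so the keys match XGBClassifier parameter names
    clf.set_params(**{k.split('__', 1)[-1]: v for k, v in best_params.items()})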

    # fit the classifier with training data
    clf.fit(X_train, y_train)

    # predicted probability of the toxic class, thresholded at 0.5
    y_out = clf.predict_proba(X_test)[:,1]
    y_pred = (np.where(y_out>.5,1,0))
    
    # log the results
    capstone1_helper.log_model_results(logpath, modelname, subname, y_test, y_pred)

    print('\n  model performance:')
    print('    Test set: #nontoxic =', (y_test==0).sum(), ' #toxic =', (y_test==1).sum())
    print('    overall accuracy: %1.3f'%(np.sum((y_pred==y_test))/y_test.shape[0]))
    print('    Precision: %1.3f'%(precision_score(y_test, y_pred)))
    print('    Recall: %1.3f'%(recall_score(y_test, y_pred)))
    print('    Balanced Accuracy: %1.3f'%(balanced_accuracy_score(y_test, y_pred)))
    print('    F1 Score: %1.3f'%(f1_score(y_test, y_pred)))
    print('    ROC AUC: %1.3f'%(roc_auc_score(y_test, y_out)))    
    
    print('\n  Total time to optimize model for sub %s = %3.1f min'%(subname, (time() - tstart)/60))
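
One detail about the metrics printed above is worth a quick check: when `roc_auc_score` receives hard 0/1 predictions instead of scores, it reduces to balanced accuracy, so probabilities (for example from `predict_proba`) are needed for a threshold-independent ROC AUC. A small sketch with invented labels and scores:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Invented ground truth and predicted probabilities (illustration only).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.45, 0.7, 0.2, 0.6, 0.55])
y_hard = (y_prob > 0.5).astype(int)

auc_from_labels = roc_auc_score(y_true, y_hard)
auc_from_probs = roc_auc_score(y_true, y_prob)
bal_acc = balanced_accuracy_score(y_true, y_hard)

print('ROC AUC from hard labels:   %1.3f' % auc_from_labels)
print('Balanced accuracy:          %1.3f' % bal_acc)
print('ROC AUC from probabilities: %1.3f' % auc_from_probs)
```

With hard labels the ROC curve has a single interior point, so its area is the mean of the true positive and true negative rates, which is exactly balanced accuracy.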
    