# Reddit toxic comment classifier: <br/>Recurrent Neural Network
## Hyperparameter tuning with hyperopt

### John Burt


### Introduction:

The goal of my first Capstone project is to develop a toxic comment classifier. This notebook implements a Recurrent Neural Network (RNN) classifier and tunes its hyperparameters using hyperopt, a Bayesian hyperparameter optimization package.

The RNN model consists of 4 layers:
- Embedding layer that takes the integer-encoded, padded token sequences the LSTM expects
- Dropout layer (spatial dropout applied to the embedding output)
- Bidirectional LSTM
- Output (single node)

This is a common configuration for text classification.
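
As a rough sense of scale, the number of trainable weights in this architecture can be worked out by hand. The sketch below assumes the default sizes used by this notebook's `RNNClassifier` (a 25,000-word vocabulary, 16-dimensional embeddings, 16 LSTM units per direction) and the standard LSTM parameter formula; the helper name `lstm_params` is only for illustration.

```python
# Default layer sizes, assumed from the RNNClassifier defaults in this notebook
vocab_size = 25000   # n_most_common_words
n_embedding = 16     # length of each embedding vector
n_lstm = 16          # LSTM units per direction

def lstm_params(input_dim, units):
    """Weights in one LSTM: 4 gates, each with input weights,
    recurrent weights and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

embedding_params = vocab_size * n_embedding
bilstm_params = 2 * lstm_params(n_embedding, n_lstm)  # forward + backward LSTM
dense_params = 2 * n_lstm * 1 + 1                     # one output node: weights + bias
total_params = embedding_params + bilstm_params + dense_params

# dropout layers add no trainable weights
print(embedding_params, bilstm_params, dense_params, total_params)
# 400000 4224 33 404257
```

Almost all of the weights sit in the embedding layer, which is why the vocabulary size and embedding length are worth tuning.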

### Load the data.

The comment data used in this analysis was prepared in three stages:

- Comments were [acquired using PRAW, the Python Reddit API Wrapper](https://github.com/johnmburt/springboard/blob/master/capstone_1/reddit_collect_comments_v1.ipynb) from 12 subs: 8 non-political and 4 political. Models are trained on data from only one subreddit at a time, so that each model is specialized to that subreddit.


- The raw comment metadata was [processed using PCA to produce a single toxicity score](https://github.com/johnmburt/springboard/blob/master/capstone_1/reddit_generate_PCA_score_v2.ipynb) based on the votes and number of replies. The toxicity score was calculated and normalized within each subreddit and then scaled to the range -5 to +5, so that scores are comparable between subs. The score was then thresholded to generate binary "toxic" vs. "not toxic" labels for supervised model training: a score <= -1 is "toxic", anything else is "not toxic".


- [Features for training the models were generated](https://github.com/johnmburt/springboard/blob/master/capstone_1/reddit_comment_create_model_features_v1.ipynb) and saved to two sample-aligned feature files for each subreddit. These files are the input to the models.
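
The thresholding step in the second stage is simple enough to show directly. This is a minimal sketch of the labelling rule only: the function name `label_toxic` and the example scores are made up for illustration, and the actual labels in this notebook come from the helper module.

```python
def label_toxic(score, thresh=-1.0):
    """Return 1 ("toxic") when the toxicity score is at or below
    the threshold, otherwise 0 ("not toxic")."""
    return 1 if score <= thresh else 0

example_scores = [-4.2, -1.0, -0.99, 0.0, 3.5]  # made-up scores in the -5..+5 range
labels = [label_toxic(s) for s in example_scores]
print(labels)  # [1, 1, 0, 0, 0]
```

Note that a score exactly at the threshold counts as toxic.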



In [73]:
# remove warnings
import warnings
warnings.filterwarnings('ignore')
# ---

%matplotlib inline
from matplotlib import pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')

import pandas as pd
pd.options.display.max_columns = 100

import numpy as np
import datetime
import time
import csv
import glob

# import helper functions
import sys
sys.path.append('./')
import capstone1_helper
import importlib
importlib.reload(capstone1_helper)

<module 'capstone1_helper' from 'C:\\Users\\john\\notebooks\\reddit\\capstone1_helper.py'>

## Define RNN model pipeline

The pipeline transforms the raw text input into tokenized, padded sequences and then passes them to the model for training.
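
To make "tokenized, padded sequences" concrete, here is a small pure-Python stand-in for what the Keras `Tokenizer` and `pad_sequences` do with their default settings: words are ranked by frequency, index 0 is reserved for padding, unknown words are dropped, and sequences are padded and truncated on the left. The functions `fit_vocab` and `to_padded` and the example texts are hypothetical; this is a simplified sketch, not the Keras implementation, and it assumes the vocabulary is built from the training comments and reused for new text.

```python
from collections import Counter

def fit_vocab(texts, num_words):
    """Rank words by frequency (most frequent gets index 1);
    keep only indices below num_words, as Keras does."""
    counts = Counter(word for t in texts for word in t.lower().split())
    ranked = [w for w, _ in counts.most_common()]
    return {w: i for i, w in enumerate(ranked, start=1) if i < num_words}

def to_padded(texts, vocab, max_len):
    """Encode texts as index lists, drop unknown words,
    then left-truncate and left-pad with zeros to max_len."""
    rows = []
    for t in texts:
        seq = [vocab[w] for w in t.lower().split() if w in vocab]
        seq = seq[-max_len:]                         # keep the last max_len tokens
        rows.append([0] * (max_len - len(seq)) + seq)
    return rows

train_texts = ["good game", "bad game bad take"]     # made-up comments
vocab = fit_vocab(train_texts, num_words=10)
padded = to_padded(["good take", "bad game bad take again"], vocab, max_len=4)
print(padded)  # [[0, 0, 3, 4], [2, 1, 2, 4]]
```

Every row has the same length, which is what the embedding layer requires, and the word "again" is dropped because it never appeared in the training text.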


In [81]:
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, ClassifierMixin, TransformerMixin
from sklearn.metrics import balanced_accuracy_score

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Embedding, Dropout, LSTM, SpatialDropout1D, Bidirectional
from keras.models import Sequential 
from keras.callbacks import EarlyStopping
from keras.optimizers import SGD

class SequenceText(BaseEstimator, TransformerMixin):
    """Custom Transformer that converts text to 
    tokenized padded sequences"""
    
    def __init__(self, n_most_common_words=25000, max_seq_len=100,
                 lowercase=True, 
                 text_filter='"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'):
        # params are stored under their public names so that the inherited
        # get_params/set_params (and sklearn's clone, which ColumnTransformer
        # applies before fitting) actually carry the tuned values through
        self.n_most_common_words = n_most_common_words
        self.max_seq_len = max_seq_len
        self.lowercase = lowercase
        self.text_filter = text_filter
        
    def _as_text_list(self, X):
        """Input text can be passed in various forms"""
        if isinstance(X, pd.DataFrame):
            return list(X['text'].values)
        if isinstance(X, pd.Series):
            return list(X.values)
        return list(X)
    
    def _make_tokenizer(self, text):
        """Create a tokenizer and build its vocabulary from text"""
        tokenizer = Tokenizer(num_words=self.n_most_common_words, 
                              filters=self.text_filter, 
                              lower=self.lowercase)
        tokenizer.fit_on_texts(text)
        return tokenizer
    
    def fit(self, X, y=None):
        """Build the vocabulary from the training text"""
        self.tokenizer_ = self._make_tokenizer(self._as_text_list(X))
        return self 
    
    def transform(self, X, y=None):
        """Convert text to tokenized padded sequences"""
        text = self._as_text_list(X)
        # reuse the vocabulary learned in fit() so that word indices match
        # between training and test data; if fit() was never called, fall
        # back to building the vocabulary from the text passed in
        tokenizer = getattr(self, 'tokenizer_', None)
        if tokenizer is None:
            tokenizer = self._make_tokenizer(text)
        sequences = tokenizer.texts_to_sequences(text)
        # return padded sequences for RNN model training/testing
        return pd.DataFrame(pad_sequences(sequences, 
                                          maxlen=self.max_seq_len))    
    
# define RNN model
class RNNClassifier(BaseEstimator, ClassifierMixin):
    """RNN classifier model for text classification"""
    
    def __init__(self, n_most_common_words=25000, max_seq_len=100,
                 n_embedding=16, n_lstm=16, dropoutrate=.5):
        """initialize the classifier with defaults """
        # public names again, so that the inherited set_params updates
        # the same attributes that create_model reads
        self.n_most_common_words = n_most_common_words
        self.max_seq_len = max_seq_len
        self.n_embedding = n_embedding
        self.n_lstm = n_lstm
        self.dropoutrate = dropoutrate
        
    def create_model(self):
        numoutputs = 1 # this might be a set param someday
        # create the classifier model
        self._clf = Sequential()
        self._clf.add(Embedding(self.n_most_common_words, 
                                self.n_embedding, input_length=self.max_seq_len))
        self._clf.add(SpatialDropout1D(self.dropoutrate))
        self._clf.add(Bidirectional(LSTM(self.n_lstm, dropout=self.dropoutrate, 
                                         recurrent_dropout=self.dropoutrate)))
        self._clf.add(Dense(numoutputs, activation='relu'))
        self._clf.compile(optimizer='adam', loss='mean_squared_error', 
                          metrics=[ 'accuracy'])        
#         print(self._clf.summary())

    def fit(self, X, y=None):
        # these are generated outside the hyperopt loop
        global X_test_seq, y_test
        # local params for this fit run
        epochs = 1
        batch_size = 2000
        # always rebuild the model so it reflects the current params
        self.create_model()
        # train the classifier with this set of data
        self.history_ = self._clf.fit(X, y, 
                          epochs=epochs, 
                          batch_size=batch_size,
                          validation_data = (X_test_seq, y_test),
                          callbacks=[EarlyStopping(
                              monitor='val_loss',patience=1, 
                              min_delta=0.0001)],verbose=0)
        return self

    def predict(self, X, y=None):
        # returns the raw network output; callers threshold it at 0.5
        return self._clf.predict(X)  

    def score(self, X, y=None):
        # threshold the network output to class labels before scoring
        y_pred = np.where(self._clf.predict(X) > .5, 1, 0).ravel()
        return balanced_accuracy_score(y, y_pred)
    
# pipeline for RNN
def build_pipeline(classifier):
    """Create a pipeline to tokenize and sequence the text,
        then pass that to classifier
    """
    preprocessor = ColumnTransformer(
        transformers=[('seq', SequenceText(),'text')],
        remainder="passthrough"
        )        
    return Pipeline([('pre', preprocessor),
                     ('clf', classifier)])


## Hyperparameter optimization using Bayesian methods
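
hyperopt's `fmin` minimizes whatever the objective function returns, so the objective in this notebook returns 1 minus the balanced accuracy on the test set. The sketch below shows in plain Python why balanced accuracy (the mean of the per-class recalls) is a better target than overall accuracy when toxic comments are rare. The function `balanced_accuracy` is an illustrative stand-in for scikit-learn's `balanced_accuracy_score`, and the labels are made up.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # made-up labels: 8 non-toxic, 2 toxic
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # one of the two toxic comments is missed

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)
loss = 1 - bal_acc                        # "smaller is better", as fmin expects
print(accuracy, bal_acc, loss)            # 0.9 0.75 0.25
```

Overall accuracy looks high at 0.9 even though half of the toxic comments were missed; balanced accuracy exposes that, and the resulting loss of 0.25 is what the optimizer would try to reduce.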



In [82]:
from scipy.sparse import hstack
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import precision_score, recall_score, balanced_accuracy_score
from sklearn.metrics import f1_score, roc_auc_score
from time import time
from hyperopt import space_eval
from hyperopt import tpe, hp, fmin, Trials

# source data folder 
srcdir = './data_for_models/'

# subs to use for this analysis
sub2use = ['gaming', 'politics']

# apply a score threshold to determine toxic vs not toxic
thresh = -1

# results logfile path
logpath = srcdir + 'model_hyperopt_results_log.csv'

# name of model
modelname = 'RNN'

# hyperopt objective function: 
#  Sets model hyperparams, fits and predicts to generate score.
#  Note: the score must be "smaller is better"
def objective(params):
    # hyperopt depends on global scope datasets
    global X_train, y_train, X_test, y_test, X_test_seq
    # build the text transformer/RNN model pipeline
    clf = build_pipeline(RNNClassifier())
    # copy so that hyperopt's own params dict is not modified
    params = dict(params)
    # these model params must be set to hyperopt's chosen
    #   transformer params
    params['clf__n_most_common_words'] = params['pre__seq__n_most_common_words']
    params['clf__max_seq_len'] = params['pre__seq__max_seq_len']
    # set model params to hyperopt's chosen values
    clf.set_params(**params) 
    pre, rnn = clf.named_steps['pre'], clf.named_steps['clf']
    # tokenize the training text, then transform the test text with the
    #   same fitted preprocessor for the model training validation
    X_train_seq = pre.fit_transform(X_train)
    X_test_seq = pre.transform(X_test)
    # train the model
    rnn.fit(X_train_seq, y_train)
    # make test data prediction
    y_est = rnn.predict(X_test_seq)
    y_pred = np.where(y_est>.5,1,0).ravel()
    # create score
    score = balanced_accuracy_score(y_test, y_pred)
    return 1-score
            
# hyperopt parameter tuning space
paramspace = {
    'pre__seq__n_most_common_words': 5000+hp.randint('pre__seq__n_most_common_words', 30000),
    'pre__seq__max_seq_len': 50+hp.randint('pre__seq__max_seq_len', 200),
    'pre__seq__lowercase': hp.choice('pre__seq__lowercase', [True, False]),
    'pre__seq__text_filter': hp.choice('pre__seq__text_filter', ['"#$%&()*+,-./:;<=>?@[\\]^_`{|}~','']),
    'clf__n_embedding': 5+hp.randint('clf__n_embedding', 25),
    'clf__n_lstm': 5+hp.randint('clf__n_lstm', 25),
    'clf__dropoutrate': hp.uniform('clf__dropoutrate', 0.1, 0.8),
    }

# loop through to tune model with comments from each specified subreddit 
for subname in sub2use:
    t0 = tstart = time()

    print('------------------------------------------------------')
    print('\nTuning model %s using sub %s'%(modelname,subname))
    
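    # load feature data
    # NOTE: processkwargs (the helper's text preprocessing options) is not
    #   defined in this notebook and must be set before this cell is run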
    X_text, X_numeric, y = capstone1_helper.load_feature_data([subname], srcdir, 
                                             toxic_thresh=thresh, 
                                             text_prep_args=processkwargs)
    
    # Split into test and training data
    X_train, X_test, y_train, y_test = train_test_split(X_text, y,  
                                test_size=0.1, random_state=42)
    
    # Balance training data by upsampling
    Xy_train_df = capstone1_helper.balance_classes_df(
        pd.DataFrame({'X':X_train,'y':y_train}),'y')
    
    # split train X and y 
    X_train = pd.DataFrame({'text':Xy_train_df['X']})
    y_train = Xy_train_df['y']
    
    # prep X_test
    X_test = pd.DataFrame({'text':X_test.values})
    
    # hyperparameter tuning:
    
    # The Trials object will store details of each iteration
    trials = Trials()

    # Run the hyperparameter search using the tpe algorithm
    t0 = time()
    print('  tune model')
    best = fmin(fn=objective, space=paramspace, algo=tpe.suggest, 
                max_evals=100, trials=trials)
    print('    done in %0.3fs,'%(time() - t0))

    # Get the values of the optimal parameters
    best_params = space_eval(paramspace, best)
    print('\n  Best parameters:',best_params)   

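    # test model
    clf = build_pipeline(RNNClassifier())

    # set model with the optimal hyperparameters; the classifier must use
    #   the same vocabulary size and sequence length as the transformer
    best_params['clf__n_most_common_words'] = best_params['pre__seq__n_most_common_words']
    best_params['clf__max_seq_len'] = best_params['pre__seq__max_seq_len']
    clf.set_params(**best_params)
    pre, rnn = clf.named_steps['pre'], clf.named_steps['clf']

    # tokenize with the best transformer params; X_test_seq is also the
    #   validation data that RNNClassifier.fit reads from global scope
    X_train_seq = pre.fit_transform(X_train)
    X_test_seq = pre.transform(X_test)

    # fit the classifier with balanced training data
    rnn.fit(X_train_seq, y_train)

    # predict test data
    y_out = rnn.predict(X_test_seq)
    # binary threshold 
    y_pred = np.where(y_out>.5,1,0).ravel()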
    
    # log the results
    capstone1_helper.log_model_results(logpath, modelname, subname, y_test, y_pred)

    print('\n  model performance:')
    print('    Test set: #nontoxic =', (y_test==0).sum(), ' #toxic =', (y_test==1).sum())
    print('    overall accuracy: %1.3f'%(np.mean(np.ravel(y_pred)==np.asarray(y_test))))
    print('    Precision: %1.3f'%(precision_score(y_test, y_pred)))
    print('    Recall: %1.3f'%(recall_score(y_test, y_pred)))
    print('    Balanced Accuracy: %1.3f'%(balanced_accuracy_score(y_test, y_pred)))
    print('    F1 Score: %1.3f'%(f1_score(y_test, y_pred)))
    # ROC AUC is computed from the continuous model output, not the thresholded labels
    print('    ROC AUC: %1.3f'%(roc_auc_score(y_test, np.ravel(y_out))))    
    
    print('\n  Total time to optimize model for sub %s = %3.1f min'%(subname, (time() - tstart)/60))
    

------------------------------------------------------

Tuning model RNN using sub gaming
  tune model
  0%|                                                                            | 0/100 [01:42<?, ?it/s, best loss: ?]


KeyboardInterrupt: 