# Reddit toxic comment classifier: <br />Recurrent Neural Network
## K folds cross-validation over all subs

### John Burt


### Introduction:

The goal of my first Capstone project is to develop a toxic comment classifier. This notebook will train a Recurrent Neural Network classifier to detect toxic Reddit comments, using tuned hyperparameters, and test it with K folds cross-validation. The script will train and test all subreddit datasets in turn and will report performance statistics.

### Load the data.

The comment data used in this analysis was prepared in three stages:

- [acquired using Reddit Python API PRAW](https://github.com/johnmburt/springboard/blob/master/capstone_1/reddit_collect_comments_v1.ipynb) from 12 subs. 8 of the subs are non-political, and 4 are political in nature. Models are trained on data for only one subreddit at a time, so that they are specialized to that subreddit.


- The raw comment metadata was [processed using PCA to produce a single toxicity score](https://github.com/johnmburt/springboard/blob/master/capstone_1/reddit_generate_PCA_score_v2.ipynb) based on the votes and number of replies. Toxicity score was calculated and normalized within each subreddit and then ranged between -5 and +5 to create a toxicity score comparable between subs. The toxicity score was then thresholded to generate binary "toxic" vs. "not toxic" labels for supervised model training. The threshold applied was: score <= -1 = "toxic", otherwise "not toxic". 


- [Features for training the models were generated](https://github.com/johnmburt/springboard/blob/master/capstone_1/reddit_comment_create_model_features_v1.ipynb) and saved to two sample aligned feature files for each subreddit. These files are used by the models for input.



In [88]:
# remove warnings
import warnings
warnings.filterwarnings('ignore')
# ---

%matplotlib inline
from matplotlib import pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')

import pandas as pd
pd.options.display.max_columns = 100

import numpy as np
import datetime
import time
import csv
import glob

# import helper functions
import sys
sys.path.append('./')
import capstone1_helper
import importlib
importlib.reload(capstone1_helper)

<module 'capstone1_helper' from 'C:\\Users\\john\\notebooks\\reddit\\capstone1_helper.py'>

## Text data transformation for the model

In [89]:
from sklearn.base import BaseEstimator, TransformerMixin

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

class SequenceText( BaseEstimator, TransformerMixin ):
    """Custom Transformer that converts text to 
    tokenized padded sequences"""
    
    def __init__( self ):
        self._n_most_common_words = 25000 
        self._max_seq_len = 100 
        self._lowercase = True 
        self._text_filter = '"#$%&()*+,-./:;<=>?@[\]^_`{|}~' 
        
    def set_params(self,**params):
        self.__dict__.update(params)
    
    #Return self nothing else to do here    
    def fit( self, X, y = None ):
        return self 
    
    #Return self nothing else to do here    
    def fit_transform( self, X, y = None ):
        return self.transform(X) 
    
    #Method that describes what we need this transformer to do
    def transform( self, X, y = None ):
        """Convert text to tokenized padded sequences"""
        tokenizer = Tokenizer(num_words=self._n_most_common_words, 
                              filters=self._text_filter, 
                              lower=self._lowercase)
        # input text can be passed in various forms
        if type(X) == pd.DataFrame:
            text = list(X['text'].values)
        elif type(X) == pd.Series:
            text = list(X.values)
        else:
            text = list(X)
        tokenizer.fit_on_texts(text)
        sequences = tokenizer.texts_to_sequences(text)
        # return padded sequences for RNN model training/testing
        return pd.DataFrame(pad_sequences(sequences, 
                                          maxlen=self._max_seq_len))    
  

## Define the model

In [90]:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics import balanced_accuracy_score

from keras.layers import Dense, Embedding, Dropout, LSTM, SpatialDropout1D, Bidirectional
from keras.models import Sequential 
from keras.callbacks import EarlyStopping
from keras.optimizers import SGD

# define RNN model
class RNNClassifier(BaseEstimator, ClassifierMixin):
    """RNN classifier model for text classification"""
    
    def __init__(self):
        """initialize the classifier """
        self._n_most_common_words = 25000
        self._max_seq_len = 100
        self._n_embedding = 16
        self._n_lstm = 16
        self._dropoutrate = .5
        self._clf = None
        
    def create_model(self):
        numoutputs = 1 # this might be a set param someday
        # create the classifier model
        self._clf = Sequential()
        self._clf.add(Embedding(self._n_most_common_words, 
                                self._n_embedding, input_length=self._max_seq_len))
        self._clf.add(SpatialDropout1D(self._dropoutrate))
        self._clf.add(Bidirectional(LSTM(self._n_lstm, dropout=self._dropoutrate, 
                                         recurrent_dropout=self._dropoutrate)))
        self._clf.add(Dense(numoutputs, activation='relu'))
        self._clf.compile(optimizer='adam', loss='mean_squared_error', 
                          metrics=[ 'accuracy'])        
#         print(self._clf.summary())

    def set_params(self, **params):
        self.__dict__.update(params)
        self.create_model()

    def fit(self, X, y=None):
        # these are generated outside the hyperopt loop
        global X_test, y_test
        # local params for this fit run
        epochs = 2
        batch_size = 2000
        # create model if it isn't already made
        if self._clf is None:
            self.create_model()
        # train the classifier with this set of data
        return self._clf.fit(X, y, 
                          epochs=epochs, 
                          batch_size=batch_size,
                          validation_data = (X_test, y_test),
                          callbacks=[EarlyStopping(
                              monitor='val_loss',patience=1, 
                              min_delta=0.0001)],verbose=1)
#                               min_delta=0.0001), plot_losses],verbose=0)

    def predict(self, X, y=None):
        return self._clf.predict(X)  

    def score(self, X, y=None):
        return balanced_accuracy_score(y, self._clf.predict(X))


In [91]:
from time import time

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.metrics import precision_score, recall_score, balanced_accuracy_score
from sklearn.metrics import f1_score, roc_auc_score

# cross-validation of classifier model with text string data X, category labels in y
# ** NOTE: X and y must be passed as pandas objects
def cross_validate_classifier(clf, X, y, logpath, modelname, subname, balance=True):
    """Set up kfold to generate several train-test sets, 
        then train and test""" 
    
    global X_test, y_test
    
    if type(X) == pd.DataFrame:
        X = X.values
              
    kf = StratifiedKFold(n_splits=3, shuffle=True)
    i = 1
    accuracy = []
    for train_index, test_index in kf.split(X, y):

        # balance label category frequencies by upsampling
        if balance:
            X_train, y_train = capstone1_helper.balance_classes_np(
                X[train_index,:], y[train_index], verbose=False)
            
        # do not balance category frequencies
        else:
            X_train = X[train_index,:]
            y_train = y[train_index]
            
        # extract test set for this fold
        X_test = X[test_index,:]
        y_test = y[test_index]

        t0 = time()

        # train the model
        clf.fit(X_train, y_train)

        # generate predictions for test data
        y_est = clf.predict(X_test)
        y_pred = (np.where(y_est>.5,1,0))

        # log the results
        capstone1_helper.log_model_results(logpath, modelname, 
                                           subname, y_test, y_pred,
                                          time()-t0)
        
        # store the balanced accuracy stat
        accuracy.append(balanced_accuracy_score(y_test, y_pred))
        i += 1

    print("\n    Mean balanced accuracy over %d folds = %2.1f%%"%(
        len(accuracy), np.mean(accuracy)*100))
    

## Test all subs with optimized parameters

This script will validation test a model for all subreddit datasets, using hyperparameters optimized with hyperopt in a previous notebook. The model will be K folds cross-validated with data for each subreddit, and the results will be saved to a common logfile so that cross-model comparisons can be made.

In [92]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from scipy.sparse import vstack, hstack
from scipy.sparse.csr import csr_matrix

# source data folder 
srcdir = './data_for_models/'

# subs to use for this analysis
sub2use = ['aww', 'funny', 'todayilearned','askreddit',
           'photography', 'gaming', 'videos', 'science',
           'politics', 'politicaldiscussion',             
           'conservative', 'the_Donald']

# apply a threshold to determine toxic vs not toxic
thresh = -1

# results logfile path
logpath = srcdir + 'model_results_log.csv'

# name of model
modelname = 'RNN'

# specify parameters for text prep
processkwargs = {
    'stemmer':'snowball', # snowball stemmer
    'regexstr':None, # remove all but alphanumeric chars
    'lowercase':False, # make lowercase
    'removestop':False, # don't remove stop words 
    'verbose':False
                }
# text to sequence transformer for model input
text2seqargs = {
    "n_most_common_words" : 20000, 
    "max_seq_len" : 4, 
    "lowercase" : True, 
    "text_filter" : '"#$%&()*+,-./:;<=>?@[\]^_`{|}~', 
    }

# model optimized parameters
clfargs = {
    # these two must be same as text2seqargs!
    "n_most_common_words": text2seqargs['n_most_common_words'], 
    "max_seq_len" : text2seqargs['max_seq_len'],  
    "n_embedding" : 7, 
    "n_lstm" : 10, 
    "dropoutrate" : 0.44, 
    }

# validate using all subs 
for subname in sub2use:
    t0 = tstart = time()

    print('\n------------------------------------------------------')
    print('Testing model %s using sub %s'%(modelname,subname))
    
    # load feature data and pre-process comment text
    t0 = time()
    X_text, X_numeric, y = capstone1_helper.load_feature_data([subname], srcdir, 
                                             toxic_thresh=thresh, 
                                             text_prep_args=processkwargs)
    # sequence convert text
    text2seq = SequenceText()
    text2seq.set_params(**text2seqargs)
    X = text2seq.fit_transform(X_text)
                        
    # create clf 
    clf = RNNClassifier()
                
    # set model with the optimal hyperparamters
    clf.set_params(**clfargs, n_jobs=4)
                
    # do cross validaion
    t0 = time()
    print('  cross-validating')
    cross_validate_classifier(clf, X, y, logpath, 
                              modelname, subname, balance=True)
    print('    done in %0.1f min,'%((time() - t0)/60))
        



------------------------------------------------------
Testing model RNN using sub aww
  cross-validating
Train on 276402 samples, validate on 72454 samples
Epoch 1/2
Epoch 2/2
Train on 276402 samples, validate on 72453 samples
Epoch 1/2
Epoch 2/2
Train on 276404 samples, validate on 72452 samples
Epoch 1/2
Epoch 2/2

    Mean balanced accuracy over 3 folds = 64.0%
    done in 33.8 min,

------------------------------------------------------
Testing model RNN using sub funny
  cross-validating
Train on 329052 samples, validate on 85142 samples
Epoch 1/2

KeyboardInterrupt: 