# Reddit toxic comment classifier: <br />Multinomial Naive Bayes
## K-fold cross-validation over all subs

### John Burt

[To hide code cells, view this in nbviewer](https://nbviewer.jupyter.org/github/johnmburt/springboard/blob/master/capstone_1/reddit_toxicity_detection_model_MNB_v1.ipynb) 


### Introduction:

The goal of my first Capstone project is to develop a toxic comment classifier. This notebook trains a Multinomial Naive Bayes classifier to detect toxic Reddit comments, using tuned hyperparameters, and tests it with K-fold cross-validation. The script trains and tests on each subreddit dataset in turn and reports performance statistics.

### Load the data.

The comment data used in this analysis was prepared in three stages:

- Comments were [acquired using PRAW, the Python Reddit API Wrapper](https://github.com/johnmburt/springboard/blob/master/capstone_1/reddit_collect_comments_v1.ipynb), from 12 subs: 8 are non-political and 4 are political. Models are trained on data for only one subreddit at a time, so that each model is specialized to that subreddit.


- The raw comment metadata was [processed using PCA to produce a single toxicity score](https://github.com/johnmburt/springboard/blob/master/capstone_1/reddit_generate_PCA_score_v2.ipynb) based on the votes and number of replies. The score was calculated and normalized within each subreddit and then placed on a -5 to +5 range, so that it is comparable between subs. The score was then thresholded to generate binary "toxic" vs. "not toxic" labels for supervised model training: a score <= -1 is labeled "toxic", and anything else is labeled "not toxic".


- [Features for training the models were generated](https://github.com/johnmburt/springboard/blob/master/capstone_1/reddit_comment_create_model_features_v1.ipynb) and saved to two sample-aligned feature files for each subreddit. The models read these files as input.
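
The thresholding rule from the second step can be sketched in a few lines. The scores below are made up for illustration, and the variable names are assumptions rather than the names used in the linked notebooks.

```python
import numpy as np

# Hypothetical toxicity scores on the -5..+5 scale described above
scores = np.array([-4.2, -1.0, -0.3, 0.0, 2.7])

# Threshold from the text: score <= -1 is "toxic" (1), otherwise "not toxic" (0)
thresh = -1
labels = (scores <= thresh).astype(int)

print(labels.tolist())  # [1, 1, 0, 0, 0]
```

Note that a score of exactly -1 falls on the "toxic" side, because the comparison is inclusive.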



In [159]:
# remove warnings
import warnings
warnings.filterwarnings('ignore')
# ---

%matplotlib inline
from matplotlib import pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')

import pandas as pd
pd.options.display.max_columns = 100

import numpy as np
import datetime
import time
import csv
import glob

# import helper functions
import sys
sys.path.append('./')
import capstone1_helper
import importlib
importlib.reload(capstone1_helper)

<module 'capstone1_helper' from 'C:\\Users\\john\\notebooks\\reddit\\capstone1_helper.py'>

In [160]:
from time import time

from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold, StratifiedKFold

# cross-validation of classifier model with feature matrix X, category labels in y
# ** NOTE: X must support row indexing with integer arrays (a scipy CSR matrix
#    in this notebook); y is indexed by position with the same arrays
def cross_validate_classifier(clf, X, y, logpath, modelname, subname, balance=True):
    """Set up stratified k-fold to generate several train-test sets, 
        then train and test on each fold""" 
        
    kf = StratifiedKFold(n_splits=3, shuffle=True)
    accuracy = []
    print('    ',end='')
    for train_index, test_index in kf.split(X, y):

        # progress marker: one '*' per fold
        print('*',end='')

        # balance label categories by upsampling (training fold only)
        if balance:
            X_train, y_train = capstone1_helper.balance_classes_sparse(
                X[train_index,:], y[train_index], verbose=False)
        else:
            X_train = X[train_index,:]
            y_train = y[train_index]
            
        # extract test set for this fold
        X_test = X[test_index,:]
        y_test = y[test_index]

        # train the model
        clf.fit(X_train, y_train)

        # generate predictions for test data
        y_est = clf.predict(X_test)
        # binarize at 0.5 (a no-op for 0/1 class labels, needed for continuous outputs)
        y_pred = (np.where(y_est>.5,1,0))

        # log the results
        capstone1_helper.log_model_results(logpath, modelname, subname, y_test, y_pred)
        
        # store the balanced accuracy stat
        accuracy.append(balanced_accuracy_score(y_test, y_pred))

    print("\n    Mean balanced accuracy over %d folds = %2.1f%%"%(
        len(accuracy), np.mean(accuracy)*100))
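
The helper `capstone1_helper.balance_classes_sparse` is not shown in this notebook. As a rough idea of what balancing by upsampling can look like on a sparse matrix, here is a minimal sketch. The function `upsample_minority` is a hypothetical stand-in, not the helper's actual implementation, and the toy data is invented.

```python
import numpy as np
from scipy.sparse import csr_matrix

def upsample_minority(X, y, seed=0):
    """Hypothetical stand-in for balance_classes_sparse: resample rows with
    replacement so that every class has as many rows as the largest class."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    keep = []
    for c, n in zip(classes, counts):
        idx = np.flatnonzero(y == c)
        keep.append(idx)  # keep every original row of this class
        if n < n_max:
            # draw the missing rows from this class, with replacement
            keep.append(rng.choice(idx, size=n_max - n, replace=True))
    keep = np.concatenate(keep)
    return X[keep, :], y[keep]

# Toy data: 4 rows of class 0, 2 rows of class 1
X = csr_matrix(np.arange(12).reshape(6, 2))
y = np.array([0, 0, 0, 0, 1, 1])
X_bal, y_bal = upsample_minority(X, y)
print(np.bincount(y_bal))  # [4 4]
```

Upsampling only the training fold, as the loop above does, keeps duplicated rows out of the test fold, so the reported balanced accuracy is not inflated by repeated samples.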
    

## Test all subs with optimized parameters

This script validates a model on all subreddit datasets, using hyperparameters optimized with hyperopt in a previous notebook. The model is K-fold cross-validated on the data for each subreddit, and the results are saved to a common logfile so that cross-model comparisons can be made.
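
To illustrate the idea of a common logfile, the sketch below appends one row per result to a shared CSV and then averages the scores per model. The column layout, the function names `append_result` and `mean_by_model`, and all the numbers are assumptions made for illustration; the real log written by `capstone1_helper.log_model_results` may look different.

```python
import csv
import os
import tempfile
from collections import defaultdict

# Hypothetical column layout; the real log format may differ
FIELDS = ['model', 'sub', 'balanced_accuracy']

def append_result(logpath, model, sub, balanced_accuracy):
    """Append one row to a shared CSV log, writing the header on first use."""
    is_new = not os.path.exists(logpath)
    with open(logpath, 'a', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({'model': model, 'sub': sub,
                         'balanced_accuracy': balanced_accuracy})

def mean_by_model(logpath):
    """Average the logged score per model across all rows."""
    scores = defaultdict(list)
    with open(logpath, newline='') as f:
        for row in csv.DictReader(f):
            scores[row['model']].append(float(row['balanced_accuracy']))
    return {m: sum(v) / len(v) for m, v in scores.items()}

# Made-up results, written to a temporary folder for the demonstration
logpath = os.path.join(tempfile.mkdtemp(), 'model_results_log.csv')
append_result(logpath, 'ModelA', 'sub1', 0.70)
append_result(logpath, 'ModelA', 'sub2', 0.60)
append_result(logpath, 'ModelB', 'sub1', 0.80)
summary = mean_by_model(logpath)
print(summary)
```

Because every model notebook appends to the same file, a comparison across models only needs one pass over the log.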

In [161]:
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# source data folder 
srcdir = './data_for_models/'

# subs to use for this analysis
sub2use = ['aww', 'funny', 'todayilearned','askreddit',
           'photography', 'gaming', 'videos', 'science',
           'politics', 'politicaldiscussion',             
           'conservative', 'the_Donald']

# apply a threshold to determine toxic vs not toxic
thresh = -1

# results logfile path
logpath = srcdir + 'model_results_log.csv'

# name of model
modelname = 'MultinomialNB'

# specify parameters for text prep
processkwargs = {
    'stemmer':'snowball', # snowball stemmer
    'regexstr':None, # remove all but alphanumeric chars
    'lowercase':False, # don't convert to lowercase
    'removestop':False, # don't remove stop words 
    'verbose':False
                }

# Tfidf vectorizer optimized parameters for model
tfidfargs = {
    "analyzer":'word', 
    "max_features" : 10000,
    "max_df" : 0.53, # ignore terms that occur in more than 53% of the docs
    "min_df" : 2, # ignore terms that occur in only one document
    "ngram_range":(1, 2), # unigrams and bigrams
    "stop_words" : "english", # strip English stop words
    "use_idf" : False,
    "sublinear_tf" : False,
    }

# validate using all subs 
for subname in sub2use:
    t0 = tstart = time()

    print('\n------------------------------------------------------')
    print('Testing model %s using sub %s'%(modelname,subname))
    
    # load feature data and pre-process comment text
    t0 = time()
    X_text, X_numeric, y = capstone1_helper.load_feature_data([subname], srcdir, 
                                             toxic_thresh=thresh, 
                                             text_prep_args=processkwargs)
    # vectorize text
    vectorizer = TfidfVectorizer(**tfidfargs)
    text_vec = vectorizer.fit_transform(X_text)
    
    # combine textvec + numeric
    dvcols = [s for s in X_numeric.columns if 'dv_' in s ]
    cols2use = dvcols + ['u_comment_karma']
    # MultinomialNB requires non-negative features, so shift the numeric
    # columns by their global minimum
    X_numeric[cols2use] = X_numeric[cols2use] - X_numeric[cols2use].min().min()
    # concat vector matrices as sparse array
    X = hstack([text_vec.tocsr(), csr_matrix(X_numeric[cols2use])] )
    X = X.tocsr()
                    
    # create clf 
    clf = MultinomialNB()
                
    # set model with the optimal hyperparameters
    # (I just use defaults for MultinomialNB)
#     clf.set_params(**clfparams)
                
    # do cross validation
    t0 = time()
    print('  cross-validating')
    cross_validate_classifier(clf, X, y, logpath, modelname, subname, balance=True)
    print('    done in %0.1f min,'%((time() - t0)/60))
        



------------------------------------------------------
Testing model MultinomialNB using sub aww
  cross-validating
    ***
    Mean balanced accuracy over 3 folds = 71.6%
    done in 0.2 min,

------------------------------------------------------
Testing model MultinomialNB using sub funny
  cross-validating
    ***
    Mean balanced accuracy over 3 folds = 65.3%
    done in 0.2 min,

------------------------------------------------------
Testing model MultinomialNB using sub todayilearned
  cross-validating
    ***
    Mean balanced accuracy over 3 folds = 73.3%
    done in 0.2 min,

------------------------------------------------------
Testing model MultinomialNB using sub askreddit
  cross-validating
    ***
    Mean balanced accuracy over 3 folds = 58.6%
    done in 0.2 min,

------------------------------------------------------
Testing model MultinomialNB using sub photography
  cross-validating
    ***
    Mean balanced accuracy over 3 folds = 57.1%
    done in 0.1 min,

---