# Reddit toxic comment classifier: <br />Multinomial Naive Bayes
## K-fold cross-validation over all subs

### John Burt

[To hide code cells, view this in nbviewer](https://nbviewer.jupyter.org/github/johnmburt/springboard/blob/master/capstone_1/reddit_toxicity_detection_model_MNB_v1.ipynb) 


### Introduction:

The goal of my first Capstone project is to develop a toxic comment classifier. This notebook trains a Multinomial Naive Bayes classifier to detect toxic Reddit comments, using tuned hyperparameters, and tests it with K-fold cross-validation. The script trains and tests on each subreddit dataset in turn and logs performance statistics.

In [None]:
# remove warnings
import warnings
warnings.filterwarnings('ignore')
# ---

%matplotlib inline
from matplotlib import pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')

import pandas as pd
pd.options.display.max_columns = 100

import numpy as np
import datetime
import time
import csv
import glob

### Load the data.

The comment data used in this analysis was prepared in three stages:

- Comments were [acquired using the Reddit Python API wrapper PRAW](https://github.com/johnmburt/springboard/blob/master/capstone_1/reddit_collect_comments_v1.ipynb) from 12 subs: 8 are non-political and 4 are political. Each model is trained on data from a single subreddit, so that it is specialized to that subreddit.

- The raw comment metadata was [processed using PCA to produce a single toxicity score](https://github.com/johnmburt/springboard/blob/master/capstone_1/reddit_generate_PCA_score_v2.ipynb) based on the votes and number of replies. The score was calculated and normalized within each subreddit, then scaled to range between -5 and +5 so that scores are comparable between subs. The score was then thresholded to generate binary "toxic" vs. "not toxic" labels for supervised model training: score <= -1 is "toxic", anything else is "not toxic".

- [Features for training the models were generated](https://github.com/johnmburt/springboard/blob/master/capstone_1/reddit_comment_create_model_features_v1.ipynb) and saved to two sample-aligned feature files for each subreddit. The models read these files as input.

**Note** that this is a highly unbalanced dataset, usually with less than 10% of comments labelled "toxic". Most models need balanced data to train properly, which I achieve by up-sampling the under-represented "toxic" category.
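
To make the labelling rule and the resulting imbalance concrete, here is a minimal sketch using made-up scores (the values are illustrative, not taken from the dataset). It applies the same `score <= -1` threshold with `np.where` and reports the fraction of comments that end up labelled toxic.

```python
import numpy as np

# hypothetical toxicity scores in the -5..+5 range (illustrative values only)
scores = np.array([2.1, 0.3, -0.4, -1.0, 4.2, 1.7, -3.5, 0.0, 0.9, 3.3])
toxic_thresh = -1

# label 1 = "toxic" when score <= threshold, otherwise 0 = "not toxic"
labels = np.where(scores > toxic_thresh, 0, 1)

toxic_fraction = labels.mean()
print('labels:', labels)
print('toxic fraction: %.2f' % toxic_fraction)
```

Note that a score exactly equal to the threshold counts as toxic, because the comparison is a strict "greater than".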


In [None]:
def load_feature_data(subnames, srcdir, toxic_thresh=-1):
    """Load and prep the feature data from two matched data files"""
    
    # load all data csvs for listed subs into dataframes 
    base_dfs = []
    numeric_dfs = []
    for sub in subnames:
        base_dfs.append(pd.read_csv(srcdir+'features_text_'+sub+'.csv'))
        numeric_dfs.append(pd.read_csv(srcdir+'features_doc2vec_'+sub+'.csv'))
        
    # concat all sub dfs into one for each data type
    base_df = pd.concat(base_dfs, ignore_index=True)
    numeric_df = pd.concat(numeric_dfs, ignore_index=True)
    
    # add numeric metadata features from base df to numeric df
    numeric_df['u_comment_karma'] = base_df['u_comment_karma']

    # return comment text, numeric features, and binary label (1 = toxic: score <= toxic_thresh)
    return base_df['text'], numeric_df, np.where(base_df['pca_score']>toxic_thresh,0,1)


### Text preprocessing function

This function prepares text data for training. For most models, the text will be processed further at training time, but pre-processing can save time when training is iterated multiple times.
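
As a rough illustration of two of these steps, the sketch below lowercases a couple of made-up comments and applies a regex clean-up using pandas only. The regex is an assumption chosen for the example; it is not necessarily the pattern used when training the models.

```python
import re
import pandas as pd

# two made-up comments
comments = pd.Series(["You're SO wrong!!!", "Nice   photo :)"])

# lowercase, then replace every run of non-alphanumeric characters with one space
# (illustrative pattern, assumed for this example)
regex = re.compile(r'[^a-z0-9]+')
cleaned = comments.str.lower().str.replace(regex, ' ', regex=True).str.strip()

print(cleaned.tolist())
```

Passing `regex=True` explicitly matters here, because recent pandas versions treat the pattern as a literal string by default.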

In [None]:
import re
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords as sw

# function to prepare text for NLP analysis
def process_comment_text(comments, 
                         stemmer=None, 
                         regexstr=None, lowercase=True,
                         removestop=False,
                         verbose=True):
    """Helper function to pre-process text.
        Combines several preprocessing steps: lowercase, 
            remove stop, regex text cleaning, stemming"""
    
    if type(stemmer) == str:
        if stemmer.lower() == 'porter':
            stemmer = PorterStemmer()
        elif stemmer.lower() == 'snowball':
            stemmer = SnowballStemmer(language='english')
        else:
            stemmer = None
            
    processed = comments
    
    # make text lowercase
    if lowercase == True:
        if verbose: print('make text lowercase')
        processed = processed.str.lower()
        
    # remove stop words
    # NOTE: stop words w/ capitals not removed!
    if removestop == True:
        if verbose: print('remove stop words')
        stopwords = sw.words("english")
        processed = processed.map(lambda text: ' '.join([word for word in text.split() if word not in stopwords]))
        
    # apply regex expression
    if regexstr is not None:
        if verbose: print('apply regex expression')
        regex = re.compile(regexstr) 
        # regex=True is required for compiled patterns in recent pandas versions
        processed = processed.str.replace(regex,' ',regex=True)
        
    # stemming
    # NOTE: stemming makes all lowercase
    if stemmer is not None:
        if verbose: print('stemming')
        processed = processed.map(lambda x: ' '.join([stemmer.stem(y) for y in x.split(' ')]))
        
    if verbose: print('done')
         
    return processed


## Balance sample frequencies in training samples

The classifier may require balancing of sample frequencies between classes for best results. This function will up-sample to the specified number of samples per class.

The balance_classes_sparse function does sample balancing with sparse matrices, such as vectorized BOW data.
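
The idea can be sketched in a few lines. This is a simplified variant, not the function used in this notebook: it keeps every original row and draws the shortfall for each smaller class at random with replacement. The helper name `upsample_to_largest` and the toy data are assumptions made for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix

def upsample_to_largest(X, y, seed=0):
    """Randomly repeat rows so every class has as many rows as the largest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for c, n in zip(classes, counts):
        idx = np.flatnonzero(y == c)
        # keep every original row, then draw the shortfall with replacement
        extra = rng.choice(idx, size=target - n, replace=True)
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

# toy sparse matrix: 4 rows of class 0, 2 rows of class 1
X = csr_matrix(np.arange(12).reshape(6, 2))
y = np.array([0, 0, 0, 0, 1, 1])
X_bal, y_bal = upsample_to_largest(X, y)
print(X_bal.shape, np.bincount(y_bal))
```

Indexing the sparse matrix with an integer array keeps it sparse, so the duplicated rows never need to be densified.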

In [None]:
# ******************************************************************************************
# csr_matrix is imported from the public scipy.sparse namespace
# (the scipy.sparse.csr submodule is deprecated)
from scipy.sparse import vstack, hstack, csr_matrix

def balance_classes_sparse(X, y, samples_per_class=None, verbose=False):
    """Equalize number of samples so that all classes have equal numbers of samples.
    If samples_per_class==None, then upsample (randomly repeat) all classes to the largest class,
      Otherwise, set samples for all classes to samples_per_class."""
    
    def get_samples(arr, numsamples):
        if arr.shape[0] >= numsamples:
            index = np.arange(arr.shape[0])
            np.random.shuffle(index)
            return arr[index[:numsamples],:]
        else:
            samples = arr.copy()
            numrepeats = int(numsamples / arr.shape[0])
            lastsize = numsamples % arr.shape[0]
            for i in range(numrepeats-1):
                samples = vstack([samples,arr])
            if lastsize > 0:
                index = np.arange(arr.shape[0])
                np.random.shuffle(index)
                samples = vstack([samples, arr[index[:lastsize],:]])
            return samples   
    
    if verbose: 
        print('Balancing class sample frequencies:')
        
    # all class IDs
    classes =  pd.unique(y)
    classes = classes[~np.isnan(classes)]
    
    # get class with max samples
    if verbose: 
        print('\tOriginal sample frequencies:')
    if samples_per_class is None:
        samples_per_class = 0
        for c in classes:
            if verbose: 
                print('\t\tclass:',c,'#samples:',(np.sum(y==c)))
            samples_per_class = np.max([samples_per_class, np.sum(y==c)])
    if verbose: 
        print('\tNew samples_per_class:',samples_per_class)
                              
    # combine X and y
    Xy = csr_matrix(hstack([X, csr_matrix(np.reshape(y, (-1, 1)))]))
       
    # create a list of samples for each class with equal sample numbers 
    newdata = None
    for c in classes:
        if newdata is None:
            newdata = get_samples(Xy[y==c,:], samples_per_class)
        else:
            newdata = vstack([newdata, get_samples(Xy[y==c,:], samples_per_class)])
            
    if verbose:
        print('\ttotal new samples:',newdata.shape[0])
            
    # split features and labels again; flatten labels to a 1-D array
    return newdata[:,:-1], newdata[:,-1].toarray().ravel()

### Define special classifier class and create model pipeline

**Custom class with sample balancing:** Most classifiers benefit from balanced data, so I made a custom classifier that balances the data by upsampling at fit time. The toxic comment data is very unbalanced, so upsampling creates a large number of duplicated samples. To avoid pushing all of those duplicates through the pipeline, the duplication is done after the original samples have been processed and transformed.

**Creating training data in the pipeline:** The features used for training are mixed and come from three sources: comment text, comment metadata, and Doc2Vec vectors. These feature sets are combined in the pipeline using a ColumnTransformer, which keeps the text vectorizer (TfidfVectorizer) inside the pipeline so that hyperopt can optimize its parameters.
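
A minimal sketch of this arrangement, using a toy DataFrame with made-up values and a plain `MultinomialNB` (no balancing), shows how the text column is vectorized while a numeric column is passed through, and how nested parameter names reach the vectorizer. The column name `karma` and the data are assumptions for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# toy data: one text column plus one non-negative numeric column (made-up values)
X = pd.DataFrame({
    'text': ['you are an idiot', 'what a lovely photo',
             'idiot take as usual', 'lovely dog'],
    'karma': [3.0, 120.0, 1.0, 45.0],
})
y = [1, 0, 1, 0]

pre = ColumnTransformer(
    transformers=[('tfidf', TfidfVectorizer(), 'text')],  # vectorize the text column
    remainder='passthrough')                              # keep numeric columns as-is
pipe = Pipeline([('pre', pre), ('clf', MultinomialNB())])

# nested parameter names follow <step>__<transformer>__<param>
pipe.set_params(pre__tfidf__ngram_range=(1, 2), pre__tfidf__use_idf=False)
pipe.fit(X, y)

fitted_pre = pipe.named_steps['pre']
n_features = fitted_pre.transform(X).shape[1]
vocab_size = len(fitted_pre.named_transformers_['tfidf'].vocabulary_)
preds = pipe.predict(X)
print(n_features, vocab_size, preds)
```

The transformed matrix has one column per vocabulary term plus one for the passed-through numeric column. The input must stay a DataFrame, because the text column is selected by name.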

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer, make_column_transformer

from sklearn.naive_bayes import MultinomialNB
    
# Custom classifier that balances the training data
class MultinomialNB_bal(MultinomialNB):
    """Wrapper class that balances data by upsampling prior to training"""
    # no __init__ override: inheriting MultinomialNB's signature keeps
    # get_params/set_params exposing alpha, fit_prior, etc.
    def fit(self, X, y, **fit_params):
        # upsample the transformed training data, then fit as usual
        bal_X, bal_y = balance_classes_sparse(X, y, verbose=False)
        super().fit(bal_X, bal_y, **fit_params)
        return self
    
def build_pipeline(classifier):
    """Create a pipeline to vectorize text,
        combine all features, and pass that to classifier
    """
    preprocessor = ColumnTransformer(
        transformers=[('tfidf', TfidfVectorizer(),'text')],
        remainder="passthrough"
        )        
    return Pipeline([('pre', preprocessor),
                     ('clf', classifier)])
    

### Log results of each model test

This function logs the results of a model test to a CSV logfile. Every model notebook logs to the same file so that results can be compared.
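
The shared-logfile pattern comes down to "write the header only if the file does not exist yet, then always append". Here is a stripped-down sketch using only the standard library, with a throwaway file and made-up numbers. The helper name `append_row` and the second model name are hypothetical.

```python
import csv
import os
import tempfile

def append_row(logpath, header, row):
    """Append one row to a CSV log, writing the header only if the file is new."""
    is_new = not os.path.exists(logpath)
    with open(logpath, 'a', newline='') as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(header)
        writer.writerow(row)

# usage with a throwaway file and made-up results
logpath = os.path.join(tempfile.mkdtemp(), 'demo_log.csv')
header = ['model', 'sub', 'accuracy']
append_row(logpath, header, ['MultinomialNB', 'aww', '0.900'])
append_row(logpath, header, ['OtherModel', 'aww', '0.800'])

with open(logpath, newline='') as f:
    rows = list(csv.reader(f))
print(rows)
```

Because every notebook appends to the same file with the same column order, the results can later be loaded as a single table and grouped by model or sub.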

In [None]:
import csv
import os.path
from os import path
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, balanced_accuracy_score
from sklearn.metrics import f1_score, roc_auc_score
from time import time

def log_model_results(logpath, modelname, subname, y_test, y_pred):
    """Write to CSV log file containing results of model train/test runs"""
    # write the header labels
    if not os.path.exists(logpath):
        labels = (['date','model','sub','num_nontoxic','num_toxic',
                   'acc_nontoxic','acc_toxic','accuracy','precision',
                   'recall','balanced_acc','F1','roc_auc'])
        with open(logpath, 'a', newline='') as csvfile:
            csvwriter = csv.writer(csvfile, delimiter=',')
            csvwriter.writerow(labels)
            
    # create data row
    row = [datetime.datetime.now().strftime('%y%m%d_%H%M%S'),
          modelname, subname, 
          (y_test==0).sum(), (y_test==1).sum(),
           '%1.3f'%(((y_test==0) & (y_test==y_pred)).sum()/(y_test==0).sum()),
           '%1.3f'%(((y_test==1) & (y_test==y_pred)).sum()/(y_test==1).sum()),
           '%1.3f'%(np.sum((y_pred==y_test))/y_test.shape[0]),
           '%1.3f'%(precision_score(y_test, y_pred)),
           '%1.3f'%(recall_score(y_test, y_pred)),
           '%1.3f'%(balanced_accuracy_score(y_test, y_pred)),
           '%1.3f'%(f1_score(y_test, y_pred)),
           '%1.3f'%(roc_auc_score(y_test, y_pred))
          ]
    # write the data row
    with open(logpath, 'a', newline='') as csvfile:
        csvwriter = csv.writer(csvfile, delimiter=',')
        csvwriter.writerow(row)
    

## Testing all subs with optimized parameters

This script validates the model on every subreddit dataset, using hyperparameters optimized with hyperopt in a previous notebook. For each subreddit the model is K-fold cross-validated, and the results are saved to a common logfile so that cross-model comparisons can be made.
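
With so few toxic comments, it helps that stratified folds keep the class ratio roughly constant in every split. The sketch below uses made-up labels (10% toxic) to show how `StratifiedKFold` spreads the minority class across three test folds. The `random_state` is set only to make the example repeatable.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# made-up labels: 90 "not toxic" (0) and 10 "toxic" (1)
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

kf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
toxic_per_fold = [int(y[test_idx].sum()) for _, test_idx in kf.split(X, y)]
print(toxic_per_fold)
```

Each test fold receives its share of the toxic samples, so per-class accuracy is never computed on a fold with no toxic comments.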

In [None]:
from time import time

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold, StratifiedKFold

def print_prediction_results(y_est, y_target):
    """helper function to report accuracy results of a prediction run"""
    
    classes = np.unique(y_target)    
    print("Classifier results:")    
    for classid in classes:
        print("\taccuracy class %d = \t%d/%d = %2.1f%%"%(
            classid,
            (y_est[y_target==classid] == classid).sum(),
            len(y_est[y_target==classid]),
            100*(y_est[y_target==classid] == classid).sum() / len(y_est[y_target==classid])))        
    print("\taccuracy all =    \t%d/%d = %2.1f%%"%(
        (y_est == y_target).sum(), 
        len(y_est),
        100*(y_est == y_target).sum() / len(y_est)))
   
# cross-validation of classifier model with feature DataFrame X, category labels in y
def cross_validate_classifier(clf, X, y, logpath, modelname, subname):
    """Set up kfold to generate several train-test sets, 
        then train and test""" 
    
    y = np.asarray(y)
    kf = StratifiedKFold(n_splits=3, shuffle=True)
    accuracy = []
    for train_index, test_index in kf.split(X, y):
        # select rows by position; X stays a DataFrame so the
        # ColumnTransformer can find the 'text' column by name
        train_X, test_X = X.iloc[train_index], X.iloc[test_index]
        train_y, test_y = y[train_index], y[test_index]
        # fit the classifier with training data
        clf.fit(train_X, train_y)
        # generate predictions for test data
        y_est = clf.predict(test_X)
        y_pred = np.where(y_est>.5,1,0)
        # print/log results of the prediction test for this fold
        print_prediction_results(y_pred, test_y)
        log_model_results(logpath, modelname, subname, test_y, y_pred)
        accuracy.append((y_pred == test_y).sum() / len(y_pred))

    print("\nOverall accuracy = %2.1f%%"%(np.mean(accuracy)*100))
    

In [None]:
# source data folder 
srcdir = './data_for_models/'

# subs to use for this analysis
sub2use = ['aww', 'funny', 'todayilearned','askreddit',
           'photography', 'gaming', 'videos', 'science',
           'politics', 'politicaldiscussion',             
           'conservative', 'the_Donald']

# apply a threshold to determine toxic vs not toxic
thresh = -1

# results logfile path
logpath = srcdir + 'model_results_log.csv'

# name of model
modelname = 'MultinomialNB'

# specify parameters for text prep
processkwargs = {
    'stemmer':'snowball', # snowball stemmer
    'regexstr':None, # no regex cleaning
    'lowercase':False, # no explicit lowercasing (stemming lowercases anyway)
    'removestop':False, # don't remove stop words 
    'verbose':False
                }

# optimized parameters for model
best_params = {
    "pre__tfidf__analyzer":'word', 
    "pre__tfidf__max_features" : 10000,
    "pre__tfidf__max_df" : 0.53, # Filters out terms that occur in more than 53% of the docs
    "pre__tfidf__min_df" : 2, # Filters out terms that occur in only one document (min_df=2).
    "pre__tfidf__ngram_range":(1, 2), # unigrams and bigrams
    "pre__tfidf__stop_words" : None, #"english", # Strips out “stop words”
    "pre__tfidf__use_idf" : False,
    "pre__tfidf__sublinear_tf" : False
    }

# validate using all subs 
for subname in sub2use:
    t0 = tstart = time()

    print('------------------------------------------------------')
    print('\nTesting model %s using sub %s'%(modelname,subname))
    
    print('  loading feature data')
    t0 = time()
    # load only the current sub, so each model is trained on a single subreddit
    X_text, X_numeric, y = load_feature_data([subname], srcdir, toxic_thresh=thresh)
    print('    done in %0.3fs,'%(time() - t0),'X_text.shape, X_numeric.shape, y.shape:',X_text.shape, X_numeric.shape, y.shape)
    
    print('  preparing text for vectorization')
    X_text = process_comment_text(X_text, **processkwargs)
    print('    done in %0.3fs,'%(time() - t0))
    
    # combine X data into df so I can do train_test_split now, before the text is further processed
    dvcols = [s for s in X_numeric.columns if 'dv_' in s ]
    cols2use = dvcols + ['u_comment_karma']
    X_df = pd.concat([pd.DataFrame({'text':X_text}),pd.DataFrame(X_numeric[cols2use])],axis=1,ignore_index=True)   
    X_df.columns = ['text'] + cols2use
    
    # numeric features must be >= 0 
    X_df[cols2use] = X_df[cols2use] - X_df[cols2use].min().min()
    
    # Split into test and training data
    t0 = time()
    print('  train/test split')
    X_train, X_test, y_train, y_test = train_test_split(X_df, y,  test_size=0.1, random_state=42)
    print('    done in %0.3fs,'%(time() - t0),'X_train.shape, X_test.shape:', X_train.shape, X_test.shape)

    # build model pipeline (pass a classifier instance, not the class)
    clf = build_pipeline(MultinomialNB_bal())

    # set model with the optimal hyperparameters
    clf.set_params(**best_params)

    # do cross validation; X_train stays a DataFrame, y_train is already a numpy array
    cross_validate_classifier(clf, X_train, y_train, logpath, modelname, subname)
    
 

In [None]:
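# Final check on the held-out test set, using the optimized parameters defined above.
# NOTE: X_train/X_test/y_train/y_test are left over from the last sub in the loop.
print('\n  Best parameters:',best_params)   

# Build the model with the optimal hyperparameters
clf = build_pipeline(MultinomialNB_bal())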

clf.set_params(**best_params)

# fit the classifier with training data
clf.fit(X_train, y_train)

# y_out = clf.predict_proba(X_test)[:,1]
y_out = clf.predict(X_test)
y_pred = (np.where(y_out>.5,1,0))

print('\n  model performance:')
print('    Test set: #nontoxic =', (y_test==0).sum(), ' #toxic =', (y_test==1).sum())
print('    overall accuracy: %1.3f'%(np.sum((y_pred==y_test))/y_test.shape[0]))
print('    Precision: %1.3f'%(precision_score(y_test, y_pred)))
print('    Recall: %1.3f'%(recall_score(y_test, y_pred)))
print('    Balanced Accuracy: %1.3f'%(balanced_accuracy_score(y_test, y_pred)))
print('    F1 Score: %1.3f'%(f1_score(y_test, y_pred)))
print('    ROC AUC: %1.3f'%(roc_auc_score(y_test, y_pred)))    
