# Reddit comment toxicity classifier: Logistic Regression

### John Burt

[To hide code cells, view this in nbviewer](https://nbviewer.jupyter.org/github/johnmburt/springboard/blob/master/capstone_1/reddit_toxicity_detection_model_logregress_v1.ipynb) 


### Introduction:

The goal of my first Capstone project is to develop a toxic comment classifier. Logistic Regression is one of the simpler models and will serve as a baseline to compare with more complicated models.

## Load the data.

The comment data used in this analysis was [acquired using PRAW, the Python Reddit API Wrapper](https://github.com/johnmburt/springboard/blob/master/capstone_1/reddit_collect_comments_v1.ipynb), from 12 subreddits, 8 of them non-political and 4 political.

The raw comment data was [processed using PCA to produce a single toxicity score](https://github.com/johnmburt/springboard/blob/master/capstone_1/reddit_generate_PCA_score_v1.ipynb) based on the votes and number of replies. 

Then I [converted this score into an integer 0 to 4 range training label variable](https://github.com/johnmburt/springboard/blob/master/capstone_1/reddit_create_train-test_set.ipynb), with 0 being no/low toxicity and higher values indicating higher toxicity. 

Note that this is a highly imbalanced dataset: fewer than 10% of comments have a toxicity label above 0. I'll have to correct for this imbalance in models that require reasonably balanced categories.
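One common correction is to weight each class inversely to its frequency. The sketch below uses made-up label counts, not the real data, and applies the formula scikit-learn documents for `class_weight='balanced'`: `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in sorted(counts.items())}

# Made-up labels for illustration: 90% of samples have label 0
toy_labels = [0] * 90 + [1] * 6 + [2] * 2 + [3] * 1 + [4] * 1

weights = balanced_class_weights(toy_labels)
for label, weight in weights.items():
    print(f"label {label}: weight {weight:.2f}")
```

Rare labels receive large weights, so each misclassified toxic comment costs the model more than a misclassified non-toxic one.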


In [1]:
# remove warnings
import warnings
warnings.filterwarnings('ignore')
# ---

%matplotlib inline
from matplotlib import pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')

import pandas as pd
pd.options.display.max_columns = 100

import numpy as np
import datetime
import time
import csv
import glob


# source data folder 
srcdir = './data_labeled/'

df = pd.read_csv(srcdir+'comment_sample_train-test_data.csv').drop_duplicates()

print('\nTotal comment samples read:',df.shape[0])


Total comment samples read: 3251323


In [2]:
df.head()

Unnamed: 0.1,Unnamed: 0,comment_ID,sub_name,post_ID,parent_ID,time,age_re_post,age_re_now,u_id,u_name,u_created,u_comment_karma,u_link_karma,num_replies,controversy,score,text,score_sign,u_days,pca_score,label_neg-pos,label_neg-inv,label_bin
0,0.0,e2pe37x,aww,90bu6w,90bu6w,1532057000.0,5636.0,20164600.0,ktsxr,hppmoep,1421736000.0,64801.0,444.0,31.0,0.0,3864.0,He judged the hell out of you and decided you ...,positive,1276.86588,4.467621,2.0,0.0,0
1,1.0,e2p8yc3,aww,90bu6w,90bu6w,1532051000.0,81.0,20170150.0,1f5xz4a2,wcollins260,1527828000.0,139463.0,157372.0,209.0,0.0,10039.0,You may have saved his little life.,positive,48.880683,5.0,2.0,0.0,0
2,2.0,e2pbfft,aww,90bu6w,90bu6w,1532054000.0,2716.0,20167520.0,d7o70,firmkillernate,1379584000.0,64836.0,12482.0,197.0,0.0,21666.0,*Moisturize me*,positive,1764.699803,5.0,2.0,0.0,0
3,3.0,e2p9dox,aww,90bu6w,90bu6w,1532052000.0,527.0,20169710.0,bx40q,wyslan,1370334000.0,11596.0,36.0,662.0,0.0,43126.0,"Frogs drink through their skin, so you cooled ...",positive,1871.736076,5.0,2.0,0.0,0
4,4.0,e2pb7zl,aww,90bu6w,90bu6w,1532054000.0,2498.0,20167730.0,167k31,VioletVenable,1489587000.0,73050.0,135.0,54.0,0.0,15163.0,Its the happy wriggle that does me in.,positive,491.508611,5.0,2.0,0.0,0


In [5]:
import re
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords as sw

# function to prepare text for NLP analysis
def process_comment_text(comments, 
                         stemmer=None, 
                         regexstr=None, lowercase=True,
                         removestop=False,
                         verbose=True):
    """Helper function to pre-process text.

    Combines several preprocessing steps: lowercasing, stop word removal,
    regex text cleaning, and stemming.
    """
    
    if isinstance(stemmer, str):
        if stemmer.lower() == 'porter':
            stemmer = PorterStemmer()
        elif stemmer.lower() == 'snowball':
            stemmer = SnowballStemmer(language='english')
        else:
            stemmer = None
            
    processed = comments
    
    # make text lowercase
    if lowercase:
        if verbose: print('make text lowercase')
        processed = processed.str.lower()
        
    # remove stop words
    # NOTE: stop words w/ capitals not removed!
    if removestop:
        if verbose: print('remove stop words')
        # use a set so membership tests stay fast on millions of comments
        stopwords = set(sw.words("english"))
        processed = processed.map(lambda text: ' '.join([word for word in text.split() if word not in stopwords]))
        
    # apply regex expression
    if regexstr is not None:
        if verbose: print('apply regex expression')
        regex = re.compile(regexstr) 
        # regex=True is required for compiled patterns in recent pandas versions
        processed = processed.str.replace(regex, ' ', regex=True)
        
    # stemming
    # NOTE: stemming makes all lowercase
    if stemmer is not None:
        if verbose: print('stemming')
        processed = processed.map(lambda x: ' '.join([stemmer.stem(y) for y in x.split(' ')]))
        
    if verbose: print('done')
        
    return processed


## Create the X (pre-processed text) and y (label) variables for training and testing.

In [6]:
processkwargs = {
    'stemmer':'snowball', # snowball stemmer
    'regexstr':r'[^a-zA-Z0-9\s]', # remove all but alphanumeric chars and whitespace
    'lowercase':True, # make lowercase
    'removestop':False # don't remove stop words 
}

# the label used = 0-4 scale, w/ 4 = most toxic
y = df['label_neg-inv']

# process text, make that the text version of the training data
verbose = True
X_text = process_comment_text(df['text'], **processkwargs, verbose=verbose)


make text lowercase
apply regex expression
stemming
done
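
To make the effect of these settings concrete, here is a minimal sketch of the lowercase and regex steps applied to one made-up comment, using only the standard library. Stemming is left out because it needs NLTK, and `clean_text` is a hypothetical helper, not part of this notebook.

```python
import re

# Same character class as the 'regexstr' setting: anything that is not
# a letter, digit or whitespace character gets replaced by a space
pattern = re.compile(r'[^a-zA-Z0-9\s]')

def clean_text(text):
    """Lowercase the text, then replace punctuation with spaces."""
    return pattern.sub(' ', text.lower())

sample = "Frogs DON'T drink... do they?!"
cleaned = clean_text(sample)
print(cleaned.split())
```

Note that the apostrophe in "DON'T" is also replaced, so the contraction splits into two tokens.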


## Set up the gridsearch 

The gridsearch using GridSearchCV is pretty straightforward: it takes a single estimator object, a set of parameters to test and some train/test data, and then exhaustively trains and tests the estimator with every parameter value combination to determine the one that gives the best score. [See here for more about hyperparameter tuning and gridsearch.](http://scikit-learn.org/stable/modules/grid_search.html#grid-search)
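
To see how quickly the number of fits grows, this sketch enumerates a grid by hand with `itertools.product`. It only illustrates the counting; GridSearchCV does this enumeration internally. The grid values are the ones searched in this notebook, and the fold count of 3 is taken from the grid search log output.

```python
from itertools import product

# Same grid as the one searched in this notebook
parameters = {
    'tfidf__stop_words': ('english', None),
    'tfidf__ngram_range': ((1, 1), (1, 3), (3, 3)),
}

# Build every combination of one value per parameter
names = sorted(parameters)
candidates = [dict(zip(names, values))
              for values in product(*(parameters[name] for name in names))]

n_folds = 3  # number of folds reported in the grid search log
for candidate in candidates:
    print(candidate)
print(len(candidates), "candidates x", n_folds, "folds =",
      len(candidates) * n_folds, "fits")
```

Adding a third parameter with four values would multiply the candidate count by four, which is why it pays to tune only a few parameters at a time.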

The [Pipeline object](http://scikit-learn.org/stable/modules/pipeline.html#pipeline) is an estimator object that lets you chain several transformers followed by a final estimator, so that you can transform data and train a classifier in one step. You can chain as many steps as you want and even use your own custom estimator objects. In this case, I'll chain a TfidfVectorizer, which converts the comment text into TF-IDF features, with the logistic regression classifier. Once I create the pipeline object containing these two steps, I can pass it to GridSearchCV and tune both at the same time.


In [8]:
from time import time
import logging

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Tfidf vectorizer:
# define defaults: doing it this way allows us to define our own default params
tfidfargs = {
    "analyzer":'word', 
    "max_features" : None,
    "max_df" : 0.25, # Filters out terms that occur in more than 25% of the docs
    "min_df" : 2, # Filters out terms that occur in only one document
    "ngram_range":(1, 3), # unigrams, bigrams and trigrams
    "stop_words" : "english", # strips out English stop words; use None to keep them
    "use_idf" : True
    }

# Logistic regression defaults:
clfargs = {
    "penalty":'l2', 
    "class_weight" : 'balanced',
    "solver" : 'sag', # For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle multinomial loss
    "multi_class" : 'multinomial', # alt: 'ovr'
    "n_jobs": -1, # -1 = use all available CPU cores
    }

# Define a pipeline combining a text vectorizer with a logistic regression classifier
pipeline = Pipeline([    
    ('tfidf', TfidfVectorizer(**tfidfargs)),
    ('clf', LogisticRegression(**clfargs)),
])

# Define the parameters and values we want to test.
# Uncommenting more parameters will give better exploring power but will
#   increase processing time in a combinatorial way. I suggest tuning <= 3
#   parameters at a time.
# Note the naming format: pipelineobjectname__paramname
parameters = {
    'tfidf__stop_words': ('english', None),
    #'tfidf__analyzer': ('word', 'char', 'char_wb'),
    #'tfidf__max_df': (0.1, 0.25, 0.5, 0.75),
    #'tfidf__min_df': (1,2,5),
    #'tfidf__max_features': (None, 5000, 10000, 50000),
    'tfidf__ngram_range': ((1, 1), (1, 3), (3, 3)),  # unigrams only, unigrams to trigrams, or trigrams only
    #'tfidf__use_idf': (True, False),
}

# create grid search object to find the best parameters for both the 
#   feature extraction and the classifier.
# Note: n_jobs=-1 makes GridSearchCV run the fits in parallel on all processor cores.
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
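
The `min_df` and `max_df` settings prune the vocabulary by document frequency. As a rough illustration, the sketch below applies the same rule to a made-up corpus with plain whitespace tokenization: an integer `min_df` is a minimum document count, and a float `max_df` is a maximum fraction of documents. It is a simplified stand-in, not TfidfVectorizer's implementation, and `filter_vocabulary` is a hypothetical helper.

```python
from collections import Counter

def filter_vocabulary(docs, min_df=2, max_df=0.25):
    """Keep terms whose document count lies in [min_df, max_df * n_docs]."""
    doc_freq = Counter()
    for doc in docs:
        # count each term at most once per document
        doc_freq.update(set(doc.split()))
    max_count = max_df * len(docs)
    return sorted(term for term, count in doc_freq.items()
                  if min_df <= count <= max_count)

# Made-up corpus of 8 documents, so max_df=0.25 allows at most 2 documents
docs = [
    "the cat sat",
    "the dog sat",
    "the cat ran",
    "the bird flew",
    "the fish swam",
    "the dog barked",
    "the horse ran",
    "the cow mooed",
]

kept = filter_vocabulary(docs)
print(kept)
```

Here "the" is dropped for appearing in every document, and words such as "bird" are dropped for appearing in only one.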

## Perform the grid search

The grid search takes the pipeline object (containing the text vectorizer and the classifier) and the data, and tries every combination of the parameters I have defined. After fitting, the attribute "best\_estimator\_" holds a pipeline refit on all of the data with the parameter combination that gave the best cross-validated score.

In [None]:
print("Performing grid search...")
print("pipeline:", [name for name, _ in pipeline.steps])
print("parameters:")
print(parameters)
t0 = time()
grid_search.fit(X_text, y)
print("done in %0.3fs" % (time() - t0))
print()

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Performing grid search...
pipeline: ['tfidf', 'clf']
parameters:
{'tfidf__stop_words': ('english', None), 'tfidf__ngram_range': ((1, 1), (1, 3), (3, 3))}
Fitting 3 folds for each of 6 candidates, totalling 18 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


## Test the best classifier using k-folds cross validation


`cross_validate_classifier` uses stratified k-fold cross validation to partition the data into 5 non-overlapping test sets, each with roughly the same label proportions as the full dataset, and trains and tests the classifier on each one. If the classifier is stable, the accuracy should be similar across all sets.
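
The cross-validation code in this notebook uses `StratifiedKFold`, which keeps label proportions roughly equal across folds. That matters here because toxic labels are rare, and a plain random split could leave a fold with very few of them. The sketch below is a simplified, standard-library illustration of the idea (each class's shuffled indices are dealt round-robin into folds); it is not scikit-learn's algorithm, and `stratified_folds` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits=5, seed=0):
    """Assign sample indices to folds so class proportions stay equal."""
    rng = random.Random(seed)

    # group sample indices by their class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # shuffle within each class, then deal indices round-robin into folds
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

# Made-up labels: 90 non-toxic (0) and 10 toxic (1) comments
labels = [0] * 90 + [1] * 10
folds = stratified_folds(labels)
for fold_number, test_idx in enumerate(folds, start=1):
    counts = Counter(labels[i] for i in test_idx)
    print(fold_number, dict(counts))
```

Every fold ends up with the same 9:1 mix as the full toy dataset, so each test set contains some of the rare class.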


In [None]:
from sklearn.model_selection import KFold, StratifiedKFold

# cross-validation of classifier model with text string data X_text, category labels in y
def cross_validate_classifier(clf, X_text, y):

    # set up kfold to generate several train-test sets, 
    #  with shuffled indices for selecting from data
    kf = StratifiedKFold(n_splits=5, shuffle=True)

    i = 1
    accuracy = []
    for train_index, test_index in kf.split(X_text, y):
        print("\nk-fold train/test set #%d: "%(i))

        # fit the classifier with training data
        # (kf.split yields positional indices, so select rows with .iloc)
        clf.fit(X_text.iloc[train_index], y.iloc[train_index])

        # generate predictions for test data
        y_est = clf.predict(X_text.iloc[test_index])

        # compute the accuracy of the predictions on the test fold
        # print_prediction_results(y_est, y.iloc[test_index])
        accuracyscore = (y_est == y.iloc[test_index]).sum() / y_est.size
        print(accuracyscore)

        accuracy.append(accuracyscore)
        i += 1

    print("\nOverall accuracy = %2.1f%%"%(np.mean(accuracy)*100))

In [None]:
# cross-validate classifier using nonoverlapping subsets of data
print("\n***************************")
print("Cross-validate classifier:")
cross_validate_classifier(grid_search.best_estimator_, X_text, y)
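
When reading the accuracy figures, it helps to compare them with a majority-class baseline, because fewer than 10% of comments have a label above 0. The following is a minimal sketch with made-up labels, not results from this dataset; `accuracy` and `majority_baseline` are hypothetical helpers.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most common label."""
    most_common_label, _ = Counter(y_true).most_common(1)[0]
    return accuracy(y_true, [most_common_label] * len(y_true))

# Made-up labels: 92 comments with label 0 and 8 with higher labels
y_true = [0] * 92 + [1] * 5 + [2] * 3
baseline = majority_baseline(y_true)
print(f"majority-class baseline accuracy: {baseline:.2f}")
```

A model only adds value if its cross-validated accuracy clearly exceeds this baseline.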