# Initial Models Exploration

Attempts were first made to see how Naive Bayes, SVM, and Logistic Regression models would perform on the data set.  Data was stored in an AWS S3 bucket.  From the bucket, it was loaded, cleaned and features were generated off of the cleaned data.  Both text and numerical features were generated based off of data exploration.  

In [1]:
import pandas as pd

import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer 
from nltk.stem import LancasterStemmer 

nltk.download("stopwords")

import string
import re

import warnings
warnings.filterwarnings('ignore')

import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB

from sklearn import svm

from sklearn.linear_model import LogisticRegression

import numpy as np

from scipy import sparse
import datetime

import s3fs

from sklearn.feature_extraction.text import TfidfVectorizer

import io

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Load and split data

Read in test.csv and train.csv

In [6]:
test = pd.read_csv("s3://advancedml-koch-mathur-hinkson/test.csv")

In [2]:
train = pd.read_csv("s3://advancedml-koch-mathur-hinkson/train.csv")

Create a new column called "toxicity_category" in the train data frame categorizing comments as toxic ("1") or non-toxic ("0").

In [3]:
train['toxicity_category'] = train.target.apply(lambda x: 1 if x > 0.5 else 0)

Split train.csv into training (80%) and validation sets (20%).

In [4]:
msk = np.random.rand(len(train)) < 0.8
train_set = train[msk]
validation_set = train[~msk]

In [5]:
print(train_set.toxicity_category.value_counts())

0    1359251
1      84921
Name: toxicity_category, dtype: int64


In [6]:
print(validation_set.toxicity_category.value_counts())

0    339185
1     21517
Name: toxicity_category, dtype: int64


Create small sample ("train_sample1") from the train_set on which to run models.  Ensure that samples are iid by replacing after each draw.

In [7]:
train_sample1 = train_set.sample(frac=0.05, replace=True)

In [8]:
print(train_sample1.toxicity_category.value_counts())

0    67936
1     4273
Name: toxicity_category, dtype: int64


### Generate features

In [9]:
ls = LancasterStemmer()
ps = PorterStemmer() 

sw = set(stopwords.words('english'))
sw.add('')

def clean_text(text, stemming=None, remove_sw = True):
    '''
    This auxiliary function cleans text.
    
    Methods used for cleaning are: 
        (1) transform string of text to list of words,
        (2) cleaned (lowercase, remove punctuation) and remove stop words,
        (3) Porter stemming of cleaned (lowercase, remove punctuation) text, 
        (4) Lancaster stemming of cleaned (lowercase, remove punctuation), 
        (5) cleaned (lowercase, remove punctuation) without removing stop words.
    
    Inputs:
        text (string) - A string of text.
        stemming (parameter) - either Porter or Lancaster stemming method
        remove_sw (boolean) - True/False remove stop words
    
    Outputs:
        Cleaned text per the input parameters.
    '''

    t = text.replace("-", " ").split(" ")
    
    t = [w.lower() for w in t]
    
    if remove_sw == True:
        t = [w for w in t if w not in sw]
    
    if stemming == None:
        pass;
    elif stemming == "Porter":
        t = [ps.stem(w) for w in t]
    elif stemming == "Lancaster":
        t = [ls.stem(w) for w in t]
    else:
        print("Please enter a valid stemming type")
        
    t = [w.strip(string.punctuation) for w in t]

    return ' '.join(t)

In [10]:
def add_text_cleaning_cols(df):
    '''
    This function generates features and adds them to the data frame.
    
    Input:
        Data frame with raw text strings.
        
    Output:
        Data frame with added columns:
            (1) 'split' - (list) Transforms the string of text into a list of words
            (2) 'cleaned_w_stopwords' - (string) A string of text where words have been lowercased, 
                                        punctuation is removed, and stop words are removed
            (3) 'cleaned_no_stem' - (string) A string of text where words have been lowercased, and 
                                        punctuation is removed (stop words remain in text).
                                        
            
            (4) 'cleaned_porter' - (string) A string of text where words have been stemmed using the 
                                        Porter method on cleaned (lowercase, remove punctuation) text. 
            (5) 'cleaned_lancaster' - (string) A string of text where words have been stemmed using the
                                        Lancaster method on cleaned (lowercase, remove punctuation) text.
            (6) 'perc_upper' - (float) Percent of uppercase letters in the string of text.
            (7) 'num_exclam' - (integer) Number of times an exclamation point appears in text.
            (8) 'num_words' - (integer) Number of words in text.
            
    '''
    print(datetime.datetime.now())
    
    df['split'] = df["comment_text"].apply(lambda x: x.split(" "))
    df['cleaned_w_stopwords'] = df["comment_text"].apply(clean_text,args=(None,False),)

    print(datetime.datetime.now())
    df['cleaned_no_stem'] = df["comment_text"].apply(clean_text,)
    df['cleaned_porter'] = df["comment_text"].apply(clean_text,args=("Porter",),)
    df['cleaned_lancaster'] = df["comment_text"].apply(clean_text,args=("Lancaster",),)

    print(datetime.datetime.now())

    df['perc_upper'] = df["comment_text"].apply(lambda x: round((len(re.findall(r'[A-Z]',x)) / len(x)), 3))

    df['num_exclam'] = df["comment_text"].apply(lambda x:(len(re.findall(r'!',x))))
    
    df['num_words'] = df["split"].apply(lambda x: len(x))
    print("DONE")
        

    
    

In [11]:
add_text_cleaning_cols(train_sample1)

2019-05-27 20:40:36.194393
2019-05-27 20:40:38.012761
2019-05-27 20:41:36.698465
DONE


In [12]:
train_sample1.columns

Index(['id', 'target', 'comment_text', 'severe_toxicity', 'obscene',
       'identity_attack', 'insult', 'threat', 'asian', 'atheist', 'bisexual',
       'black', 'buddhist', 'christian', 'female', 'heterosexual', 'hindu',
       'homosexual_gay_or_lesbian', 'intellectual_or_learning_disability',
       'jewish', 'latino', 'male', 'muslim', 'other_disability',
       'other_gender', 'other_race_or_ethnicity', 'other_religion',
       'other_sexual_orientation', 'physical_disability',
       'psychiatric_or_mental_illness', 'transgender', 'white', 'created_date',
       'publication_id', 'parent_id', 'article_id', 'rating', 'funny', 'wow',
       'sad', 'likes', 'disagree', 'sexual_explicit',
       'identity_annotator_count', 'toxicity_annotator_count',
       'toxicity_category', 'split', 'cleaned_w_stopwords', 'cleaned_no_stem',
       'cleaned_porter', 'cleaned_lancaster', 'perc_upper', 'num_exclam',
       'num_words'],
      dtype='object')

In [34]:
train_sample1.shape

(72214, 57)

In [35]:
train_sample1.head(5)

Unnamed: 0,id,target,comment_text,severe_toxicity,obscene,identity_attack,insult,threat,asian,atheist,...,split,cleaned_w_stopwords,cleaned_no_stem,cleaned_porter,cleaned_lancaster,perc_upper,num_exclam,num_words,perc_stopwords,num_upper_words
375600,702857,0.0,You completely misstate how the Bulletin of t...,0.0,0.0,0.0,0.0,0.0,,,...,"[You, completely, misstate, how, the, , Bullet...",you completely misstate how the bulletin of t...,completely misstate bulletin atomic scientists...,complet misstat bulletin atom scientist move d...,complet misst bulletin atom sci mov doomsday c...,0.03,0,29,-2.69,0
1566597,6038615,0.3,The good thing about boondoggles is that they ...,0.0,0.0,0.0,0.3,0.0,,,...,"[The, good, thing, about, boondoggles, is, tha...",the good thing about boondoggles is that they ...,good thing boondoggles usually collapse weight...,good thing boondoggl usual collaps weight corr...,good thing boondoggl us collaps weight corruption,0.01,0,17,-2.353,0
275683,579993,0.0,Throwin your votes away eh,0.0,0.0,0.0,0.0,0.0,,,...,"[Throwin, your, votes, away, eh]",throwin your votes away eh,throwin votes away eh,throwin vote away eh,throwin vot away eh,0.038,0,5,-3.2,0
89607,352281,0.0,Mr. Sayre trunk line fibers typically carry 40...,0.0,0.0,0.0,0.0,0.0,,,...,"[Mr., Sayre, trunk, line, fibers, typically, c...",mr sayre trunk line fibers typically carry 40 ...,mr sayre trunk line fibers typically carry 40 ...,mr sayr trunk line fiber typic carri 40 separ ...,mr sayr trunk lin fib typ carry 40 sep 10 gb/s...,0.025,0,183,-2.902,3
1347516,5762238,0.0,MSNBC and CNN often mention facts ..............,0.0,0.0,0.0,0.0,0.0,,,...,"[MSNBC, and, CNN, often, mention, facts, , ......",msnbc and cnn often mention facts \n althoug...,msnbc cnn often mention facts \n although sto...,msnbc cnn often mention fact \n although stop...,msnbc cnn oft ment fact \n although stop watc...,0.085,0,24,-3.833,3


Due to memory issues, we needed to use a smaller training set.  We used a random iid sample of half of the train_sample1 frame to train NB model:

In [13]:
toxic = train_sample1[train_sample1.toxicity_category == 1]
nontoxic = train_sample1[train_sample1.toxicity_category == 0]

In [14]:
train_sample1.shape, toxic.shape, nontoxic.shape

((72209, 54), (4273, 54), (67936, 54))

Reshaping the dataset to be include an equal number of toxic and nontoxic samples

In [15]:
prepared_df = toxic.append(toxic).append(nontoxic.sample(len(toxic)*2))
prepared_df = prepared_df.sample(frac=1).reset_index(drop=True)

print(prepared_df.toxicity_category.value_counts())


1    8546
0    8546
Name: toxicity_category, dtype: int64


Because we are unable to train an NB model on categorical (text) and continuous (numerical) data at the same time, our action plan changed to running two independent models for each type of data and then running a thrid NB model on the resulting predict_proba from the other two trained models.

In [16]:
def run_model(model_df, train_perc=.80, addtl_feats =[''], model_type = "Multi", 
               should_print=False, see_inside=False, comments="comment_text",
             target='toxicity_category'):
    '''
    This function runs a single machine learning model as per the specified parameters.
    
    Input(s):
        model_df: source data frame
        train_perc: percentage that should be used for training set
        addtl_feats: (list) list of non text columns to include
        model_type: which machine learning model to use
        see_inside: returns the intermediate tokenized and vectorized arrays
        comments: source column for text data
        target: source column for y values
        
    Output(s):
    
    '''
    
    train_start = 0
    train_end = round(model_df.shape[0]*train_perc) 

    test_start = train_end
    test_end = model_df.shape[0]
    
    X_all = model_df[comments].values
    y_all = model_df[target].values
    
    # tokenizing text
#     count_vect = CountVectorizer()
#     X_all_counts = count_vect.fit_transform(X_all.astype('U'))
    #print(X_all_counts.shape)

    # calculating frequencies
    tfidf_vectorizer = TfidfVectorizer(use_idf=True)
    fitted_vectorizer=tfidf_vectorizer.fit(model_df[comments])
    X_all_tfidf =  fitted_vectorizer.transform(model_df[comments])


    print(X_all_tfidf.shape)
    
    if addtl_feats != ['']: # combine non-text and text features if necessary
        print("here")
#         others_all = model_df[addtl_feats].values.reshape(-1,1)

        others_all = model_df[addtl_feats].values.reshape(-1,len(addtl_feats))
        #print(others_all)
        newfeatures_all = sparse.hstack((X_all_tfidf, others_all.astype(float))).tocsr()
    else:
        newfeatures_all = X_all_tfidf
    
    
    X_train = newfeatures_all[train_start:train_end]
    y_train = model_df[train_start:train_end][target].values
    y_train=y_train.astype('int')
    

    X_test = newfeatures_all[test_start:test_end]
    y_test = model_df[test_start:test_end][target].values
    
    
    
    
    if model_type == 'Multi':
        clf = MultinomialNB().fit(X_train, y_train)
    if model_type == "Gauss":
        clf = GaussianNB().fit(X_train, y_train) 
    if model_type == "SVM":
        clf = svm.SVC(kernel='linear', probability=True, random_state=1008).fit(X_train, y_train) 
    if model_type == "LR":
        clf = LogisticRegression(penalty="l1",C=1e5).fit(X_train, y_train)
        
    preds_for_train = clf.predict(X_train)
    
    
   
    predicted = clf.predict(X_test)
    accuracy = np.mean(predicted == y_test)
    
    output = model_df[test_start:test_end]
    output['predicted'] = predicted
    output['y_test'] = y_test
    output['accuracy'] = output.predicted == output.y_test
    

#     y_scores_sorted, y_true_sorted = joint_sort_descending(np.array(y_scores), np.array(y_true))
#     precision = precision_score(y_true_sorted, preds)


    if should_print == True:

        print("The accuracy on the test set is {}%.".format(round(accuracy*100,2)))    
    
    if see_inside == True:
        return clf, accuracy, X_all_counts, X_all_tfidf
    else:
        return clf, accuracy, preds_for_train, predicted, output


In [17]:
clf1, accuracy, preds_for_train, predicted , output = run_model(prepared_df, comments = "cleaned_lancaster", should_print=False)

print("The unique values predicted in the training set include :" + str(np.unique(preds_for_train)))
print("The unique values predicted in the test set include :" + str(np.unique(predicted)))

(17092, 25851)
The unique values predicted in the training set include :[0 1]
The unique values predicted in the test set include :[0 1]


In [18]:
output.head()

Unnamed: 0,id,target,comment_text,severe_toxicity,obscene,identity_attack,insult,threat,asian,atheist,...,cleaned_w_stopwords,cleaned_no_stem,cleaned_porter,cleaned_lancaster,perc_upper,num_exclam,num_words,predicted,y_test,accuracy
13674,5293834,0.0,Preeeecisely.,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,preeeecisely,preeeecisely,preeeecisely,preeeecisely,0.077,0,1,1,0,False
13675,6043087,0.166667,tol·er·ance\nˈtäl(ə)rəns/Submit\nnoun\n1.\nthe...,0.0,0.0,0.0,0.0,0.0,,,...,tol·er·ance\nˈtäl(ə)rəns/submit\nnoun\n1.\nthe...,tol·er·ance\nˈtäl(ə)rəns/submit\nnoun\n1.\nthe...,tol·er·ance\nˈtäl(ə)rəns/submit\nnoun\n1.\nth ...,tol·er·ance\nˈtäl(ə)rəns/submit\nnoun\n1.\nthe...,0.012,0,67,0,0,True
13676,5415503,0.550725,"Oh sure, put a black guy in the role of ""Caesa...",0.014493,0.014493,0.565217,0.246377,0.0,0.0,0.0,...,oh sure put a black guy in the role of caesar ...,oh sure put black guy role caesar youd crying ...,oh sure put black guy role caesar youd cri any...,oh sure put black guy rol caesar youd cry anyo...,0.023,0,19,1,1,True
13677,5936191,0.685714,Great work getting more scum off the streets.,0.071429,0.114286,0.014286,0.628571,0.014286,,,...,great work getting more scum off the streets,great work getting scum streets,great work get scum streets,gre work get scum streets,0.022,0,8,1,1,True
13678,744227,0.8,There is an error there.. Something U can not...,0.0,0.0,0.0,0.8,0.0,,,...,there is an error there something u can not s...,error there something u see smart,error there someth u see smart,er there someth u see smart,0.038,0,16,0,1,False


In [19]:
output.y_test.value_counts()

0    1714
1    1704
Name: y_test, dtype: int64

In [20]:
output.predicted.value_counts()

1    1975
0    1443
Name: predicted, dtype: int64

In [21]:
accuracy

0.8089526038619076

In [22]:
targets = output[output.y_test == 1]
targets[targets.accuracy == True].shape[0] / targets.shape[0]

0.8879107981220657

In [25]:
nontargets = output[output.y_test == 0]
nontargets[nontargets.accuracy == True].shape[0] / nontargets.shape[0]

0.7304550758459744

In [23]:
output[output.y_test == 0].accuracy.value_counts()

True     1252
False     462
Name: accuracy, dtype: int64

### Naive Bayes

In [27]:
best_accuracy = 0
model_factors = []

for text in ['cleaned_w_stopwords', 'cleaned_no_stem', 'cleaned_porter',
    'cleaned_lancaster']:
    for tp in [0.6, 0.7, 0.8]:
  
        factors = [text, tp]
        print(factors)

        clf, accuracy, preds_for_train, predicted, output = run_model(prepared_df, train_perc = tp, comments = text, should_print=False)
        
        print("The unique values predicted for the training set include :" + str(np.unique(preds_for_train)))
        print("The unique values predicted for the test set include :" + str(np.unique(predicted)))
        
        targets = output[output.y_test == 1]
        target_accuracy = targets[targets.accuracy == True].shape[0] / targets.shape[0]
        
        nontargets = output[output.y_test == 0]
        nontarget_accuracy = nontargets[nontargets.accuracy == True].shape[0] / nontargets.shape[0]
        
        print("Accuracy: {} , Target Accuracy: {}, Nontarget Accuracy: {}".format(accuracy, target_accuracy, nontarget_accuracy))

        if target_accuracy > best_accuracy:
            model_factors = factors
            best_accuracy = target_accuracy

        print()



['cleaned_w_stopwords', 0.6]
(17092, 30051)
The unique values predicted for the training set include :[0 1]
The unique values predicted for the test set include :[0 1]
Accuracy: 0.8221442152991078 , Target Accuracy: 0.8695023148148148, Nontarget Accuracy: 0.7737355811889973

['cleaned_w_stopwords', 0.7]
(17092, 30051)
The unique values predicted for the training set include :[0 1]
The unique values predicted for the test set include :[0 1]
Accuracy: 0.8260530421216848 , Target Accuracy: 0.8893617021276595, Nontarget Accuracy: 0.7616987809673614

['cleaned_w_stopwords', 0.8]
(17092, 30051)
The unique values predicted for the training set include :[0 1]
The unique values predicted for the test set include :[0 1]
Accuracy: 0.8224107665301346 , Target Accuracy: 0.9002347417840375, Nontarget Accuracy: 0.7450408401400234

['cleaned_no_stem', 0.6]
(17092, 30048)
The unique values predicted for the training set include :[0 1]
The unique values predicted for the test set include :[0 1]
Accuracy

In [28]:
best_accuracy,model_factors 

(0.9019953051643192, ['cleaned_no_stem', 0.8])

In [29]:
clf, accuracy, preds_for_train, predicted, output = run_model(prepared_df, train_perc = 0.8, comments = 'cleaned_no_stem', should_print=False)


(17092, 30048)


In [30]:
targets = output[output.y_test == 1]
target_accuracy = targets[targets.accuracy == True].shape[0] / targets.shape[0]

nontargets = output[output.y_test == 0]
nontarget_accuracy = nontargets[nontargets.accuracy == True].shape[0] / nontargets.shape[0]

In [32]:
target_accuracy, nontarget_accuracy

(0.9019953051643192, 0.7450408401400234)

### SVM

In [58]:
best_accuracy_svm = 0
model_factors_svm = []

for text in ['cleaned_w_stopwords', 'cleaned_no_stem', 'cleaned_porter',
    'cleaned_lancaster']:
    for tp in [0.6, 0.7, 0.8]:
  
        factors = [text, tp]
        print(factors)

        clf, accuracy, preds_for_train, predicted, output = run_model(prepared_df, model_type="SVM", train_perc = tp, comments = text, should_print=False)
        
        print("The unique values predicted for the training set include :" + str(np.unique(preds_for_train)))
        print("The unique values predicted for the test set include :" + str(np.unique(predicted)))
        
        targets = output[output.y_test == 1]
        target_accuracy = targets[targets.accuracy == True].shape[0] / targets.shape[0]
        
        print("Accuracy: {} , Target Accuracy: {}".format(accuracy, target_accuracy))

        if target_accuracy > best_accuracy:
            model_factors_svm = factors
            best_accuracy_svm = target_accuracy

        print()

['cleaned_w_stopwords', 0.6]
(17328, 30033)
The unique values predicted for the training set include :[0 1]
The unique values predicted for the test set include :[0 1]
Accuracy: 0.874909825422017 , Target Accuracy: 0.8627733026467204

['cleaned_w_stopwords', 0.7]
(17328, 30033)
The unique values predicted for the training set include :[0 1]
The unique values predicted for the test set include :[0 1]
Accuracy: 0.8855328972681801 , Target Accuracy: 0.8772269558481797

['cleaned_w_stopwords', 0.8]
(17328, 30033)
The unique values predicted for the training set include :[0 1]
The unique values predicted for the test set include :[0 1]
Accuracy: 0.8993075591459896 , Target Accuracy: 0.906628242074928

['cleaned_no_stem', 0.6]
(17328, 30032)
The unique values predicted for the training set include :[0 1]
The unique values predicted for the test set include :[0 1]
Accuracy: 0.875486942721108 , Target Accuracy: 0.8593210586881473

['cleaned_no_stem', 0.7]
(17328, 30032)
The unique values predi

In [60]:
best_accuracy_svm, model_factors_svm

(0, [])

### Logistic Regression

In [68]:
best_accuracy_log = 0
model_factors_log = []

for text in ['cleaned_w_stopwords', 'cleaned_no_stem', 'cleaned_porter',
    'cleaned_lancaster']:
    for tp in [0.6, 0.7, 0.8]:
  
        factors = [text, tp]
        print(factors)

        clf, accuracy, preds_for_train, predicted, output = run_model(prepared_df, model_type="LR", train_perc = tp, comments = text, should_print=False)
        
        print("The unique values predicted for the training set include :" + str(np.unique(preds_for_train)))
        print("The unique values predicted for the test set include :" + str(np.unique(predicted)))
        
        targets = output[output.y_test == 1]
        target_accuracy = targets[targets.accuracy == True].shape[0] / targets.shape[0]
        
        print("Accuracy: {} , Target Accuracy: {}".format(accuracy, target_accuracy))

        if target_accuracy > best_accuracy:
            model_factors_log = factors
            best_accuracy_log = target_accuracy

        print()

['cleaned_w_stopwords', 0.6]
(17328, 30033)
The unique values predicted for the training set include :[0 1]
The unique values predicted for the test set include :[0 1]
Accuracy: 0.8482181503390565 , Target Accuracy: 0.8918296892980437

['cleaned_w_stopwords', 0.7]
(17328, 30033)
The unique values predicted for the training set include :[0 1]
The unique values predicted for the test set include :[0 1]
Accuracy: 0.856483262793382 , Target Accuracy: 0.9047250193648335

['cleaned_w_stopwords', 0.8]
(17328, 30033)
The unique values predicted for the training set include :[0 1]
The unique values predicted for the test set include :[0 1]
Accuracy: 0.8742065781881131 , Target Accuracy: 0.9342939481268011

['cleaned_no_stem', 0.6]
(17328, 30032)
The unique values predicted for the training set include :[0 1]
The unique values predicted for the test set include :[0 1]
Accuracy: 0.8492281056124658 , Target Accuracy: 0.8941311852704258

['cleaned_no_stem', 0.7]
(17328, 30032)
The unique values pre