# Naive Bayes Model

In this notebook, a Naive Bayes model is run on an i.i.d. sample of approximately 250K rows. The notebook was run on an AWS SageMaker ml.c5.4xlarge instance.

#### Import modules

In [2]:
import pandas as pd
import string
import re
import numpy as np
import datetime

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer 
from nltk.stem import LancasterStemmer 

import warnings
warnings.filterwarnings('ignore')

import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.naive_bayes import MultinomialNB
from sklearn import svm
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

In [3]:
import feature_generation_functions as fgf

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
import model_functions as mf

In [5]:
import pickle_functions as pf

### Load and shuffle data

In [5]:
# Reading directly from S3 requires s3fs and read access to the bucket.
train = pd.read_csv("s3://advancedml-koch-mathur-hinkson/train.csv")

In [6]:
train.shape

(1804874, 45)

Label comments as toxic ("1") or nontoxic ("0") using a 0.5 threshold on the `target` score

In [7]:
train['toxicity_category'] = train.target.apply(lambda x: 1 if x > 0.5 else 0)
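
The `apply` call above evaluates a Python lambda once per row. As a minimal sketch on a toy frame (the data here is illustrative, not from the original run), a vectorized comparison produces the same labels. Under the strict `>`, a score of exactly 0.5 is labeled nontoxic.

```python
import pandas as pd

# Toy stand-in for the `target` column; values chosen to straddle the threshold.
toy = pd.DataFrame({"target": [0.0, 0.3, 0.5, 0.50001, 0.9]})

# Row-by-row labeling, as in the cell above.
via_apply = toy.target.apply(lambda x: 1 if x > 0.5 else 0)

# Vectorized equivalent: boolean comparison cast to integers.
via_vector = (toy.target > 0.5).astype(int)

print(via_vector.tolist())
```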

In [8]:
train.shape

(1804874, 46)

Split into train_set and hold_out_set (roughly 80/20)

In [9]:
#Citation: https://stackoverflow.com/questions/24147278/how-do-i-create-test-and-train-samples-from-one-dataframe-with-pandas
msk = np.random.rand(len(train)) < 0.8
train_set = train[msk]
hold_out_set = train[~msk]
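
The mask above is drawn from NumPy's global random state, so the split changes on every run. Below is a minimal sketch of a repeatable variant, assuming a seeded generator and a toy frame standing in for `train`. The seed value and the toy data are illustrative choices, not part of the original run.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `train`; the real frame has about 1.8M rows.
toy = pd.DataFrame({"target": np.linspace(0, 1, 1000)})

# A seeded generator makes the ~80/20 mask repeatable across runs.
rng = np.random.default_rng(42)  # seed value is an arbitrary assumption
msk = rng.random(len(toy)) < 0.8

toy_train = toy[msk]
toy_hold_out = toy[~msk]

# The two pieces partition the frame: no row is lost or shared.
print(len(toy_train), len(toy_hold_out))
```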


In [10]:
print(train_set.toxicity_category.value_counts())

0    1358996
1      85262
Name: toxicity_category, dtype: int64


In [11]:
print(hold_out_set.toxicity_category.value_counts())

0    339440
1     21176
Name: toxicity_category, dtype: int64


Randomly sample 75% of train_set, with replacement, to create a smaller data frame (train_sample) to run NB on

In [12]:
train_sample = train_set.sample(frac=0.75, replace=True)

In [13]:
print(train_sample.toxicity_category.value_counts())

0    1018765
1      64429
Name: toxicity_category, dtype: int64
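
Because `replace=True` draws with replacement, `train_sample` can contain the same comment more than once. A small sketch on a toy frame (illustrative data, with `random_state` fixed only for repeatability) shows the effect next to the without-replacement alternative:

```python
import pandas as pd

toy = pd.DataFrame({"id": range(1000)})

# With replacement: rows may repeat, so there are fewer unique ids than rows.
with_repl = toy.sample(frac=0.75, replace=True, random_state=0)

# Without replacement: every sampled row is distinct.
without_repl = toy.sample(frac=0.75, replace=False, random_state=0)

print(len(with_repl), with_repl["id"].nunique())
print(len(without_repl), without_repl["id"].nunique())
```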


### Generate features

In [14]:
train_df = fgf.generate_NB_SVM_features(train_sample)

Cleaned with stopwords...Elapsed Time:  0.265 minutes
Cleaned without stopwords...Elapsed Time:  0.368 minutes
Stemmed (Porter)...Elapsed Time:  8.513 minutes
Stemmed (Lancaster)...Elapsed Time:  7.019 minutes

DONE GENERATING FEATURES


In [15]:
pf.write_pickle_to_s3bucket(filename='NB_final_1M', 
                            bucket_name='advancedml-koch-mathur-hinkson', 
                            df=train_df)

Pickled and sent to bucket!


In [16]:
hold_out_df = fgf.generate_NB_SVM_features(hold_out_set)

Cleaned with stopwords...Elapsed Time:  0.088 minutes
Cleaned without stopwords...Elapsed Time:  0.123 minutes
Stemmed (Porter)...Elapsed Time:  2.788 minutes
Stemmed (Lancaster)...Elapsed Time:  2.291 minutes

DONE GENERATING FEATURES


In [17]:
pf.write_pickle_to_s3bucket(filename='NB_final_holdout_350K', 
                            bucket_name='advancedml-koch-mathur-hinkson', 
                            df=hold_out_df)

Pickled and sent to bucket!


### Reshaping & Balancing

In [6]:
train_df = pf.read_pickle(filename='NB_final_1M', bucket_name='advancedml-koch-mathur-hinkson')


In [7]:
hold_out_df = pf.read_pickle(filename='NB_final_holdout_350K', bucket_name='advancedml-koch-mathur-hinkson')


In [8]:
toxic = train_df[train_df.toxicity_category == 1]
nontoxic = train_df[train_df.toxicity_category == 0]

In [9]:
train_df.shape, toxic.shape, nontoxic.shape

((1083194, 50), (64429, 50), (1018765, 50))

Reshape the dataset to include an equal number of toxic and nontoxic samples

In [31]:
quarter = len(toxic)
ten_percent = round((len(toxic) / 5) * 2)

In [32]:
ten_percent * 4

25772

In [11]:
random_df = train_df.sample(quarter*4)

In [12]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
prepared_50 = pd.concat([toxic, toxic, nontoxic.sample(len(toxic)*2)])
prepared_50 = prepared_50.sample(frac=1).reset_index(drop=True)
print(prepared_50.toxicity_category.value_counts())

1    128858
0    128858
Name: toxicity_category, dtype: int64
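
The balancing step above duplicates the toxic rows and draws an equal number of nontoxic rows. A minimal sketch of the same idea as a reusable function follows. The name `balance_by_oversampling` and its parameters are hypothetical, and the toy frame is illustrative.

```python
import pandas as pd

def balance_by_oversampling(df, label_col, minority_copies=2, random_state=None):
    """Hypothetical helper: repeat the label-1 rows and sample as many label-0 rows."""
    minority = df[df[label_col] == 1]
    majority = df[df[label_col] == 0]
    n_needed = len(minority) * minority_copies
    parts = [minority] * minority_copies
    parts.append(majority.sample(n_needed, random_state=random_state))
    # Shuffle so the classes are interleaved, then rebuild a clean index.
    combined = pd.concat(parts)
    return combined.sample(frac=1, random_state=random_state).reset_index(drop=True)

toy = pd.DataFrame({"toxicity_category": [1] * 10 + [0] * 90})
balanced = balance_by_oversampling(toy, "toxicity_category", random_state=0)
print(balanced["toxicity_category"].value_counts().to_dict())
```

One consequence worth keeping in mind: if duplicated rows are later divided into training and validation parts, copies of the same comment can land on both sides, which may inflate validation scores relative to a hold-out set.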


In [45]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
prepared_40 = pd.concat([toxic, toxic, toxic, nontoxic.sample(len(toxic) * 6)])
prepared_40 = prepared_40.sample(frac=.5).reset_index(drop=True)
print(prepared_40.toxicity_category.value_counts())

0    193524
1     96406
Name: toxicity_category, dtype: int64


In [46]:
96406 / (193524 + 96406)

0.3325147449384334
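
The observed share of about 0.33 follows from the construction: three copies of the toxic rows against six times as many nontoxic rows gives 3 / (3 + 6) = 1/3, and the random `frac=.5` subsample only perturbs that slightly. The arithmetic can be sketched as below; both function names are illustrative.

```python
from fractions import Fraction

def toxic_share(toxic_copies, nontoxic_multiple):
    """Toxic share when toxic rows are repeated `toxic_copies` times and
    nontoxic rows are sampled at `nontoxic_multiple` x len(toxic)."""
    return Fraction(toxic_copies, toxic_copies + nontoxic_multiple)

def nontoxic_multiple_for(share, toxic_copies):
    """Nontoxic multiple needed to reach a desired toxic share."""
    return toxic_copies * (1 - share) / share

print(toxic_share(3, 6))                         # share produced by the cell above
print(nontoxic_multiple_for(Fraction(2, 5), 3))  # multiple needed for a 40% share
```

Under these assumptions, a 40% toxic share with three toxic copies would need a nontoxic sample of 4.5 times `len(toxic)` rather than 6 times.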

### Naive Bayes

From the NB_iter4 notebook, we learned that `cleaned_no_stem_str` was the feature that yielded the best-performing model.

In [56]:
classifier, output, fitted_vectorizer = mf.run_model(model_df=prepared_50, 
                                                     model_type="MultiNB", 
                                                     comments = "cleaned_no_stem_str", 
                                                     train_perc=0.8, 
                                                     target="toxicity_category", 
                                                     see_inside=False)

fitting model now
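
The internals of `mf.run_model` are not shown in this notebook. As a rough illustration of what a multinomial Naive Bayes text classifier involves, the sketch below assumes a `CountVectorizer` bag-of-words representation and scikit-learn's default smoothing. The helper may be implemented differently, and the four toy comments are invented for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus (illustrative only): 1 = toxic, 0 = nontoxic.
texts = [
    "you are an idiot",
    "what a stupid idiot",
    "thanks for the article",
    "great article thanks",
]
labels = [1, 1, 0, 0]

# Turn each comment into a vector of word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Fit class-conditional word probabilities with Laplace smoothing (alpha=1.0).
clf = MultinomialNB()
clf.fit(X, labels)

# New text must go through the same fitted vectorizer before prediction.
preds = clf.predict(vectorizer.transform(["stupid idiot", "great article"]))
print(preds.tolist())
```

Reusing the fitted vectorizer at prediction time is the reason `run_model` returns `fitted_vectorizer` alongside the classifier.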


In [57]:
mf.get_metrics(output=output, detailed=True, should_print=True, round_to=3)

Overall Accuracy: 0.8231573637545351
Overall Precision: 0.7878411910669976
Overall Recall: 0.88539447693559
Overall F1 Score: 0.8337740494209903
ROC_AUC: 0.823

Target Accuracy: 0.88539447693559
Target Precision: 1.0
Target Recall: 0.88539447693559
Target F1 Score: 0.9392140347994002

Non-Target Accuracy: 0.7606904058466801
Non-Target Precision: 1.0
Non-Target Recall: 0.7606904058466801
Non-Target F1 Score: 0.8640819570785128

Strong Identity Accuracy: 0.9066666666666666
Strong Identity Precision: 0.9980818414322251
Strong Identity Recall: 0.908086096567772
Strong Identity F1 Score: 1.0

Obscenity Accuracy: 0.8784
Obscenity Precision: 0.9993932038834952
Obscenity Recall: 0.8788687299893276
Obscenity F1 Score: 1.0

Insults Accuracy: 0.8975253807106599
Insults Precision: 0.9990581032554307
Insults Recall: 0.8982692002328905
Insults F1 Score: 1.0

Threats Accuracy: 0.8932178932178932
Threats Precision: 0.9983870967741936
Threats Recall: 0.8945086705202312
Threats F1 Score: 1.0



{'Overall': {'Accuracy': 0.8231573637545351,
  'Precision': 0.7878411910669976,
  'Recall': 0.88539447693559,
  'F1': 0.8337740494209903,
  'ROC_AUC': 0.823},
 'Target': {'Accuracy': 0.88539447693559,
  'Precision': 1.0,
  'Recall': 0.88539447693559,
  'F1': 0.9392140347994002},
 'Non-Target': {'Accuracy': 0.8932178932178932,
  'Precision': 0.9983870967741936,
  'Recall': 0.8945086705202312,
  'F1': 1.0}}

In [58]:
hold_out_results = mf.run_model_test(model_df=hold_out_df, 
                                     clf=classifier, 
                                     vectorizer=fitted_vectorizer, 
                                     comments="cleaned_no_stem_str", target="toxicity_category")

Index(['id', 'target', 'comment_text', 'severe_toxicity', 'obscene',
       'identity_attack', 'insult', 'threat', 'asian', 'atheist', 'bisexual',
       'black', 'buddhist', 'christian', 'female', 'heterosexual', 'hindu',
       'homosexual_gay_or_lesbian', 'intellectual_or_learning_disability',
       'jewish', 'latino', 'male', 'muslim', 'other_disability',
       'other_gender', 'other_race_or_ethnicity', 'other_religion',
       'other_sexual_orientation', 'physical_disability',
       'psychiatric_or_mental_illness', 'transgender', 'white', 'created_date',
       'publication_id', 'parent_id', 'article_id', 'rating', 'funny', 'wow',
       'sad', 'likes', 'disagree', 'sexual_explicit',
       'identity_annotator_count', 'toxicity_annotator_count',
       'toxicity_category', 'cleaned_w_stopwords_str', 'cleaned_no_stem_str',
       'cleaned_porter_str', 'cleaned_lancaster_str', 'predicted', 'y_test'],
      dtype='object')


In [64]:
hold_out_results.to_csv("holdout_results", sep='|')

In [59]:
hold_out_results.toxicity_category.value_counts()

0    339440
1     21176
Name: toxicity_category, dtype: int64

In [60]:
hold_out_results[hold_out_results.predicted == 1].shape

(100085, 52)

In [61]:
hold_out_results[(hold_out_results.predicted == 1) & (hold_out_results.toxicity_category == 1)].shape

(17732, 52)
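
The row counts above are enough to compute the hold-out precision, recall, and F1 for the toxic class by hand. A short worked example using only the reported counts:

```python
# Counts reported in the cells above.
actual_positive = 21176      # toxic rows in the hold-out set
predicted_positive = 100085  # rows with predicted == 1
true_positive = 17732        # predicted == 1 and toxicity_category == 1

# Precision: of everything flagged toxic, how much really is toxic.
precision = true_positive / predicted_positive
# Recall: of everything truly toxic, how much was flagged.
recall = true_positive / actual_positive
# F1: harmonic mean of the two.
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(f1, 4))
```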

In [62]:
precision_score(hold_out_results.y_test, hold_out_results.predicted, pos_label=1)

0.17716940600489584
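
The hold-out precision is far below the 0.788 measured on the balanced `prepared_50` data. One likely contributor is class prevalence: the model was fit and validated on a 50/50 mix, while only about 6% of hold-out comments are toxic. The sketch below applies Bayes' rule to the hold-out counts reported above; the function name is illustrative.

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Precision implied by a true-positive rate, a false-positive rate,
    and the share of positives in the population."""
    tp = tpr * prevalence
    fp = fpr * (1 - prevalence)
    return tp / (tp + fp)

# Hold-out counts reported in the cells above.
toxic, nontoxic = 21176, 339440
predicted_positive, true_positive = 100085, 17732

tpr = true_positive / toxic
fpr = (predicted_positive - true_positive) / nontoxic

holdout_mix = precision_at_prevalence(tpr, fpr, toxic / (toxic + nontoxic))
balanced_mix = precision_at_prevalence(tpr, fpr, 0.5)
print(round(holdout_mix, 3), round(balanced_mix, 3))
```

With the same error rates, precision comes out near 0.177 at the hold-out prevalence and near 0.775 on a balanced mix, so most of the gap can be accounted for by the change in class balance rather than by a change in the classifier's behavior.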

In [63]:
hold_out_metrics = mf.get_metrics(output=hold_out_results, detailed=True, should_print=True, round_to=3)

Overall Accuracy: 0.762082103955454
Overall Precision: 0.17716940600489584
Overall Recall: 0.837363052512278
Overall F1 Score: 0.29246006547859577
ROC_AUC: 0.797

Target Accuracy: 0.837363052512278
Target Precision: 1.0
Target Recall: 0.837363052512278
Target F1 Score: 0.9114834995373703

Non-Target Accuracy: 0.7573856940843743
Non-Target Precision: 1.0
Non-Target Recall: 0.7573856940843743
Non-Target F1 Score: 0.8619458968328341

Strong Identity Accuracy: 0.8355525965379494
Strong Identity Precision: 0.9436728395061729
Strong Identity Recall: 0.8754473872584109
Strong Identity F1 Score: 1.0

Obscenity Accuracy: 0.8239482200647249
Obscenity Precision: 0.9936958234830575
Obscenity Recall: 0.8268852459016394
Obscenity F1 Score: 1.0

Insults Accuracy: 0.842698752677334
Insults Precision: 0.9814458900059136
Insults Recall: 0.8553114732976873
Insults F1 Score: 1.0

Threats Accuracy: 0.8333333333333334
Threats Precision: 0.9782135076252724
Threats Recall: 0.8471698113207548
Threats F1 Score: