# Naive Bayes Model

In this notebook, a Naive Bayes model is run on an i.i.d. sample of approximately 720K rows (721,934 rows in the run shown here).

### Import modules

In [1]:
import pandas as pd
import string
import re
import numpy as np
import datetime

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer 
from nltk.stem import LancasterStemmer 

import warnings
warnings.filterwarnings('ignore')

import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.naive_bayes import MultinomialNB
from sklearn import svm
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

In [2]:
import feature_generation_functions as fgf

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
import model_functions as mf

In [4]:
import pickle_functions as pf

### Load and shuffle data

In [5]:
train = pd.read_csv("s3://advancedml-koch-mathur-hinkson/train.csv")

In [6]:
train.shape

(1804874, 45)

Label each comment as toxic (1) if its target score is above 0.5, and as nontoxic (0) otherwise

In [7]:
train['toxicity_category'] = train.target.apply(lambda x: 1 if x > 0.5 else 0)

In [8]:
train.shape

(1804874, 46)
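The same labeling can be written as a vectorized comparison, which avoids calling a Python lambda once per row. The following is a minimal sketch on a made-up toy frame; only the column names `target` and `toxicity_category` come from the notebook. Note that a score of exactly 0.5 is labeled nontoxic, because the comparison is strict.

```python
import pandas as pd

# Toy stand-in for the `target` column (values are made up for illustration).
toy = pd.DataFrame({"target": [0.0, 0.2, 0.5, 0.51, 0.9]})

# Vectorized equivalent of the lambda: strictly greater than 0.5 maps to 1.
toy["toxicity_category"] = (toy["target"] > 0.5).astype(int)

labels = toy["toxicity_category"].tolist()
print(labels)  # [0, 0, 0, 1, 1]
```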

Split into train_set (roughly 80% of rows) and validation_set (the remaining roughly 20%) using a random boolean mask

In [9]:
#Citation: https://stackoverflow.com/questions/24147278/how-do-i-create-test-and-train-samples-from-one-dataframe-with-pandas
msk = np.random.rand(len(train)) < 0.8
train_set = train[msk]
validation_set = train[~msk]
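Because the mask is drawn without a seed, each run produces a different split. A seeded variant might look like the sketch below; the helper name `mask_split` and the `seed` parameter are assumptions for illustration and are not part of the notebook. As with the original mask, the split sizes are only approximately 80/20.

```python
import numpy as np
import pandas as pd

def mask_split(df, train_frac=0.8, seed=0):
    """Split df into two disjoint parts with a seeded random boolean mask."""
    rng = np.random.default_rng(seed)
    # Each row lands in the first part with probability train_frac.
    msk = rng.random(len(df)) < train_frac
    return df[msk], df[~msk]

# Toy frame standing in for `train`.
toy = pd.DataFrame({"id": range(1000)})
train_part, valid_part = mask_split(toy)
print(len(train_part), len(valid_part))
```

Calling `mask_split` again with the same seed returns the same rows, which makes downstream results easier to compare across runs.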

In [14]:
print(train_set.toxicity_category.value_counts())

0    1358850
1      85018
Name: toxicity_category, dtype: int64


In [15]:
print(validation_set.toxicity_category.value_counts())

0    339586
1     21420
Name: toxicity_category, dtype: int64
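The counts above can be turned into class shares to check that the random split kept the class balance similar in both sets. This is a small arithmetic sketch using the printed counts; the helper `toxic_share` is hypothetical.

```python
# Counts copied from the value_counts() outputs above.
train_counts = {0: 1358850, 1: 85018}
valid_counts = {0: 339586, 1: 21420}

def toxic_share(counts):
    """Fraction of rows labeled toxic (1)."""
    return counts[1] / (counts[0] + counts[1])

train_share = toxic_share(train_counts)
valid_share = toxic_share(valid_counts)
print(f"train: {train_share:.4f}, validation: {valid_share:.4f}")
```

Both shares come out just under 6%, so the two sets have a similar, heavily imbalanced class distribution.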


Randomly sample train_set, with replacement, to create a smaller data frame (train_sample) of half its size to run Naive Bayes on

In [16]:
train_sample = train_set.sample(frac=0.5, replace=True)

In [17]:
print(train_sample.toxicity_category.value_counts())

0    679613
1     42321
Name: toxicity_category, dtype: int64
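Since the sample is drawn with `replace=True`, the same row can be selected more than once, so the number of distinct comments in the sample is smaller than its row count. The sketch below illustrates this on a made-up toy frame; the `random_state` values are assumptions added for repeatability.

```python
import pandas as pd

# Toy frame standing in for `train_set`.
toy = pd.DataFrame({"id": range(10_000)})

# Same sample size, with and without replacement.
with_repl = toy.sample(frac=0.5, replace=True, random_state=0)
without_repl = toy.sample(frac=0.5, replace=False, random_state=0)

n_unique_with = with_repl["id"].nunique()
n_unique_without = without_repl["id"].nunique()
print(len(with_repl), n_unique_with, n_unique_without)
```

If duplicate rows are not wanted, passing `replace=False` (the pandas default) keeps every sampled row distinct.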


### Generate features

In [18]:
train_df = fgf.generate_NB_SVM_features(train_sample)

Cleaned with stopwords...Elapsed Time:  0.177 minutes
Cleaned without stopwords...Elapsed Time:  0.245 minutes
Stemmed (Porter)...Elapsed Time:  5.661 minutes
Stemmed (Lancaster)...Elapsed Time:  4.641 minutes

DONE GENERATING FEATURES


In [20]:
pf.write_pickle_to_s3bucket(filename='NB_final_720K', 
                            bucket_name='advancedml-koch-mathur-hinkson', 
                            df=train_df)

Pickled and sent to bucket!


### Reshaping

In [None]:
# train_df = pf.read_pickle(filename='NB_final_720K', bucket_name='advancedml-koch-mathur-hinkson')

In [30]:
toxic = train_df[train_df.toxicity_category == 1]
nontoxic = train_df[train_df.toxicity_category == 0]

In [31]:
train_df.shape, toxic.shape, nontoxic.shape

((721934, 50), (42321, 50), (679613, 50))

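Reshape the dataset into three variants containing 25%, 50%, and 75% toxic samples. Toxic rows are duplicated to reach the higher proportions and nontoxic rows are randomly downsampled, so every variant has the same total size (four times the number of toxic rows).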

In [32]:
quarter = len(toxic)

In [33]:
random_df = train_df.sample(quarter*4)

In [34]:
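# pd.concat is used here because DataFrame.append was removed in pandas 2.0.

# 25% toxic: one copy of the toxic rows plus three times as many nontoxic rows
prepared_25 = pd.concat([toxic, nontoxic.sample(len(toxic)*3)])
prepared_25 = prepared_25.sample(frac=1).reset_index(drop=True)  # shuffle rows
print(prepared_25.toxicity_category.value_counts())

# 50% toxic: two copies of the toxic rows plus twice as many nontoxic rows
prepared_50 = pd.concat([toxic, toxic, nontoxic.sample(len(toxic)*2)])
prepared_50 = prepared_50.sample(frac=1).reset_index(drop=True)
print(prepared_50.toxicity_category.value_counts())

# 75% toxic: three copies of the toxic rows plus an equal number of nontoxic rows
prepared_75 = pd.concat([toxic, toxic, toxic, nontoxic.sample(len(toxic))])
prepared_75 = prepared_75.sample(frac=1).reset_index(drop=True)
print(prepared_75.toxicity_category.value_counts())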

0    126963
1     42321
Name: toxicity_category, dtype: int64
1    84642
0    84642
Name: toxicity_category, dtype: int64
1    126963
0     42321
Name: toxicity_category, dtype: int64
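The three variants follow one pattern: repeat the toxic rows some number of times, draw a multiple of that many nontoxic rows, then shuffle. A parameterized sketch of that pattern is shown below; the helper `build_mix`, its parameters, and the toy data are hypothetical and not part of the notebook.

```python
import pandas as pd

def build_mix(toxic, nontoxic, toxic_copies, nontoxic_multiple, seed=0):
    """Stack copies of the toxic rows with a nontoxic sample, then shuffle."""
    parts = [toxic] * toxic_copies
    # Downsample nontoxic rows to a multiple of the toxic row count.
    parts.append(nontoxic.sample(len(toxic) * nontoxic_multiple, random_state=seed))
    mixed = pd.concat(parts)
    return mixed.sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy data: 10 toxic rows and 100 nontoxic rows.
toy_toxic = pd.DataFrame({"id": range(10), "toxicity_category": [1] * 10})
toy_nontoxic = pd.DataFrame({"id": range(10, 110), "toxicity_category": [0] * 100})

mix_25 = build_mix(toy_toxic, toy_nontoxic, toxic_copies=1, nontoxic_multiple=3)
mix_75 = build_mix(toy_toxic, toy_nontoxic, toxic_copies=3, nontoxic_multiple=1)

counts_25 = mix_25["toxicity_category"].value_counts().to_dict()
counts_75 = mix_75["toxicity_category"].value_counts().to_dict()
print(counts_25, counts_75)
```

One design consequence worth keeping in mind: the higher toxic proportions are reached by exact duplication, so the 50% and 75% variants contain no toxic comments beyond those already in the 25% variant.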


### Highlighted Model