# SVM -- Iteration 3

In this iteration of SVM, a preprocessed data set of 100K rows was split into a train set (80%) and a hold-out set (20%). Toxic comments are rare (somewhere around 6% of the data is labeled as toxic), and earlier iterations showed that the model could not learn from so few toxic examples. The training data was therefore resampled to 50% toxic comments and 50% nontoxic comments. On the hold-out set, recall is the most important metric because it measures how many of the comments of interest are identified. Recall has improved, but precision has decreased because the balanced training data no longer reflects the class proportions observed in the raw data.

In [1]:
import pandas as pd
import string
import re
import numpy as np
import datetime

import warnings
warnings.filterwarnings('ignore')

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer 
from nltk.stem import LancasterStemmer 

import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.naive_bayes import MultinomialNB,GaussianNB
from sklearn import svm
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

In [2]:
import pickle_functions as pf

In [3]:
import model_functions as mf

### Load and shuffle data

Read in the preprocessed training data from a pickle file.

In [4]:
train = pf.read_pickle(bucket_name='advancedml-koch-mathur-hinkson', filename='sub_train_df7_preprocessed')

In [5]:
train.columns

Index(['id', 'target', 'comment_text', 'severe_toxicity', 'obscene',
       'identity_attack', 'insult', 'threat', 'asian', 'atheist', 'bisexual',
       'black', 'buddhist', 'christian', 'female', 'heterosexual', 'hindu',
       'homosexual_gay_or_lesbian', 'intellectual_or_learning_disability',
       'jewish', 'latino', 'male', 'muslim', 'other_disability',
       'other_gender', 'other_race_or_ethnicity', 'other_religion',
       'other_sexual_orientation', 'physical_disability',
       'psychiatric_or_mental_illness', 'transgender', 'white', 'created_date',
       'publication_id', 'parent_id', 'article_id', 'rating', 'funny', 'wow',
       'sad', 'likes', 'disagree', 'sexual_explicit',
       'identity_annotator_count', 'toxicity_annotator_count', 'split',
       'cleaned_w_stopwords_str', 'cleaned_w_stopwords', 'cleaned_no_stem_str',
       'cleaned_no_stem', 'cleaned_porter_str', 'cleaned_porter',
       'cleaned_lancaster_str', 'cleaned_lancaster', 'bigrams_unstemmed',
       

In [6]:
drop_cols = ['split', 'cleaned_w_stopwords', 'cleaned_no_stem', 'cleaned_porter', 'cleaned_lancaster', 'bigrams_unstemmed',
       'perc_upper', 'num_exclam', 'num_words', 'perc_stopwords',
       'num_upper_words']

In [7]:
train = train.drop(drop_cols, axis = 1)

In [8]:
train.shape

(100000, 49)

In [9]:
train.columns

Index(['id', 'target', 'comment_text', 'severe_toxicity', 'obscene',
       'identity_attack', 'insult', 'threat', 'asian', 'atheist', 'bisexual',
       'black', 'buddhist', 'christian', 'female', 'heterosexual', 'hindu',
       'homosexual_gay_or_lesbian', 'intellectual_or_learning_disability',
       'jewish', 'latino', 'male', 'muslim', 'other_disability',
       'other_gender', 'other_race_or_ethnicity', 'other_religion',
       'other_sexual_orientation', 'physical_disability',
       'psychiatric_or_mental_illness', 'transgender', 'white', 'created_date',
       'publication_id', 'parent_id', 'article_id', 'rating', 'funny', 'wow',
       'sad', 'likes', 'disagree', 'sexual_explicit',
       'identity_annotator_count', 'toxicity_annotator_count',
       'cleaned_w_stopwords_str', 'cleaned_no_stem_str', 'cleaned_porter_str',
       'cleaned_lancaster_str'],
      dtype='object')

Create a new column called "toxicity_category" in the train data frame that labels each comment as toxic ("1", `target` above 0.5) or non-toxic ("0").

In [10]:
train['toxicity_category'] = train.target.apply(lambda x: 1 if x > 0.5 else 0)
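
The same labeling rule can also be written as a vectorized comparison, which avoids calling a Python lambda once per row. The snippet below is a minimal sketch on a made-up `toy` frame (not the project data); it shows that a score of exactly 0.5 falls on the non-toxic side of the strict `>` threshold.

```python
import pandas as pd

# Toy stand-in for the `target` score column (values are invented for illustration).
toy = pd.DataFrame({"target": [0.0, 0.2, 0.5, 0.51, 0.9]})

# Same rule as the lambda above: strictly greater than 0.5 is labeled toxic (1).
toy["toxicity_category"] = (toy["target"] > 0.5).astype(int)

print(toy)
```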

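Split the training data into a training set (80%) and a hold-out set (20%). The mask below is random and unseeded, so the split is only approximately 80/20 and re-running the cell produces a different split; counts printed by cells executed at different times therefore need not add up to 100,000.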

In [23]:
# https://stackoverflow.com/questions/24147278/how-do-i-create-test-and-train-samples-from-one-dataframe-with-pandas
msk = np.random.rand(len(train)) < 0.8
train_set = train[msk]
hold_out_set = train[~msk]
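
Because an unseeded mask changes on every run, re-executing the split after a model has been fitted can move rows that were used for training into the hold-out set. One common way to avoid this is a seeded, stratified split. The sketch below uses scikit-learn's `train_test_split` on an invented `toy` frame; the seed values and the 6% toxic rate are assumptions for illustration, and this is not the split the notebook actually used.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with roughly the class imbalance described above (about 6% toxic).
rng = np.random.default_rng(0)
toy = pd.DataFrame({"toxicity_category": (rng.random(10_000) < 0.06).astype(int)})

# A fixed random_state makes the split repeatable; stratify keeps the
# toxic/non-toxic ratio the same in both parts.
train_part, hold_out_part = train_test_split(
    toy,
    test_size=0.2,
    stratify=toy["toxicity_category"],
    random_state=42,
)

print(len(train_part), len(hold_out_part))
print(train_part["toxicity_category"].mean(), hold_out_part["toxicity_category"].mean())
```

With a fixed `random_state`, re-running the cell returns the same rows in each part, so the hold-out set stays disjoint from whatever the classifier was trained on.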

In [12]:
print(train_set.toxicity_category.value_counts())

0    75710
1     4115
Name: toxicity_category, dtype: int64


In [24]:
print(hold_out_set.toxicity_category.value_counts())

0    18803
1     1013
Name: toxicity_category, dtype: int64


In [15]:
toxic = train_set[train_set.toxicity_category == 1]
nontoxic = train_set[train_set.toxicity_category == 0]

In [16]:
train_set.shape, toxic.shape, nontoxic.shape

((79825, 50), (4115, 50), (75710, 50))

Reshape the data set to include an equal number of toxic and nontoxic samples.

In [17]:
quarter = len(toxic)

In [18]:
random_df = train_set.sample(quarter*4)

Double the number of toxic comments, then sample an equal number of non-toxic comments, to make a 50%-50% data set of toxic and nontoxic comments.

In [19]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
prepared_50 = pd.concat([toxic, toxic, nontoxic.sample(len(toxic)*2)])
prepared_50 = prepared_50.sample(frac=1).reset_index(drop=True)
print(prepared_50.toxicity_category.value_counts())

1    8230
0    8230
Name: toxicity_category, dtype: int64
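
The balancing step above can be wrapped in a small reusable function. This is a minimal sketch, not code from `model_functions`: the helper name `balance_by_duplication`, its parameters, and the `toy` frame are hypothetical, and it assumes the majority class has at least as many rows as the duplicated minority class.

```python
import pandas as pd

def balance_by_duplication(df, label_col, minority_copies=2, seed=0):
    """Repeat the positive rows and undersample the negative rows to a 50/50 mix."""
    minority = df[df[label_col] == 1]
    majority = df[df[label_col] == 0]
    n_needed = len(minority) * minority_copies
    balanced = pd.concat(
        [minority] * minority_copies
        + [majority.sample(n_needed, random_state=seed)]  # sampled without replacement
    )
    # Shuffle so the classes are interleaved, then renumber the rows.
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy data: 5 toxic rows and 95 non-toxic rows.
toy = pd.DataFrame({"toxicity_category": [1] * 5 + [0] * 95})
balanced = balance_by_duplication(toy, "toxicity_category")
counts = balanced["toxicity_category"].value_counts()
print(counts)
```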


### SVM

In [20]:
classifier, output, fitted_vectorizer = mf.run_model(model_df=prepared_50, 
                                                     model_type="SVM", 
                                                     comments = "cleaned_no_stem_str", 
                                                     train_perc=0.95, 
                                                     target="toxicity_category", 
                                                     see_inside=False)

fitting model now
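
The internals of `mf.run_model` are not shown in this notebook. As a rough illustration of what a text SVM pipeline of this kind typically involves, the sketch below fits a TF-IDF vectorizer and a linear SVM on a handful of invented sentences, then reuses the fitted vectorizer on new text. The choice of `TfidfVectorizer` and `LinearSVC` is an assumption, not a description of the project's helper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Invented toy comments; 1 = toxic, 0 = non-toxic.
texts = [
    "you are an idiot", "what a stupid idiot", "stupid and hateful comment",
    "thanks for the helpful article", "great article thanks", "a helpful and kind reply",
]
labels = [1, 1, 1, 0, 0, 0]

# Fit the vectorizer on training text only, then train the classifier.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(texts)
clf = LinearSVC()
clf.fit(X_train, labels)

# New text must go through the same fitted vectorizer (transform, not fit_transform).
X_new = vectorizer.transform(["stupid idiot", "helpful article thanks"])
predictions = clf.predict(X_new)
print(predictions)
```

Returning the fitted vectorizer alongside the classifier, as `run_model` does, is what allows the hold-out comments to be mapped into the same feature space later.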


In [21]:
mf.get_metrics(output=output, detailed=True, should_print=True, round_to=3)

Overall Accuracy: 0.8991494532199271
Overall Precision: 0.8983050847457628
Overall Recall: 0.9004854368932039
Overall F1 Score: 0.8993939393939395
ROC_AUC: 0.899

Target Accuracy: 0.9004854368932039
Target Precision: 1.0
Target Recall: 0.9004854368932039
Target F1 Score: 0.9476372924648786

Non-Target Accuracy: 0.8978102189781022
Non-Target Precision: 1.0
Non-Target Recall: 0.8978102189781022
Non-Target F1 Score: 0.9461538461538462

Strong Identity Accuracy: 0.8947368421052632
Strong Identity Precision: 1.0
Strong Identity Recall: 0.8947368421052632
Strong Identity F1 Score: 1.0

Obscenity Accuracy: 0.9354838709677419
Obscenity Precision: 1.0
Obscenity Recall: 0.9354838709677419
Obscenity F1 Score: 1.0

Insults Accuracy: 0.9090909090909091
Insults Precision: 1.0
Insults Recall: 0.9090909090909091
Insults F1 Score: 1.0

Threats Accuracy: 0.8181818181818182
Threats Precision: 1.0
Threats Recall: 0.8181818181818182
Threats F1 Score: 1.0



{'Overall': {'Accuracy': 0.8991494532199271,
  'Precision': 0.8983050847457628,
  'Recall': 0.9004854368932039,
  'F1': 0.8993939393939395,
  'ROC_AUC': 0.899},
 'Target': {'Accuracy': 0.9004854368932039,
  'Precision': 1.0,
  'Recall': 0.9004854368932039,
  'F1': 0.9476372924648786},
 'Non-Target': {'Accuracy': 0.8181818181818182,
  'Precision': 1.0,
  'Recall': 0.8181818181818182,
  'F1': 1.0}}

In [25]:
hold_out_results = mf.run_model_test(model_df=hold_out_set, 
                                     clf=classifier, 
                                     vectorizer=fitted_vectorizer, 
                                     comments="cleaned_no_stem_str", target="toxicity_category")

Index(['id', 'target', 'comment_text', 'severe_toxicity', 'obscene',
       'identity_attack', 'insult', 'threat', 'asian', 'atheist', 'bisexual',
       'black', 'buddhist', 'christian', 'female', 'heterosexual', 'hindu',
       'homosexual_gay_or_lesbian', 'intellectual_or_learning_disability',
       'jewish', 'latino', 'male', 'muslim', 'other_disability',
       'other_gender', 'other_race_or_ethnicity', 'other_religion',
       'other_sexual_orientation', 'physical_disability',
       'psychiatric_or_mental_illness', 'transgender', 'white', 'created_date',
       'publication_id', 'parent_id', 'article_id', 'rating', 'funny', 'wow',
       'sad', 'likes', 'disagree', 'sexual_explicit',
       'identity_annotator_count', 'toxicity_annotator_count',
       'cleaned_w_stopwords_str', 'cleaned_no_stem_str', 'cleaned_porter_str',
       'cleaned_lancaster_str', 'toxicity_category', 'predicted', 'y_test'],
      dtype='object')


In [26]:
hold_out_results.to_csv("holdout_results", sep='|')

In [27]:
hold_out_results.toxicity_category.value_counts()

0    18803
1     1013
Name: toxicity_category, dtype: int64

In [28]:
precision_score(hold_out_results.y_test, hold_out_results.predicted, pos_label=1)

0.3230452674897119

In [29]:
hold_out_metrics = mf.get_metrics(output=hold_out_results, detailed=True, should_print=True, round_to=3)

Overall Accuracy: 0.8968005651998385
Overall Precision: 0.3230452674897119
Overall Recall: 0.9299111549851925
Overall F1 Score: 0.47951132603715957
ROC_AUC: 0.912

Target Accuracy: 0.9299111549851925
Target Precision: 1.0
Target Recall: 0.9299111549851925
Target F1 Score: 0.963682864450128

Non-Target Accuracy: 0.8950167526458543
Non-Target Precision: 1.0
Non-Target Recall: 0.8950167526458543
Non-Target F1 Score: 0.9446003592276605

Strong Identity Accuracy: 0.8769230769230769
Strong Identity Precision: 0.9473684210526315
Strong Identity Recall: 0.9152542372881356
Strong Identity F1 Score: 1.0

Obscenity Accuracy: 0.9032258064516129
Obscenity Precision: 0.9649122807017544
Obscenity Recall: 0.9322033898305084
Obscenity F1 Score: 1.0

Insults Accuracy: 0.9241645244215938
Insults Precision: 0.9846153846153847
Insults Recall: 0.9361702127659575
Insults F1 Score: 1.0

Threats Accuracy: 0.8823529411764706
Threats Precision: 0.8823529411764706
Threats Recall: 1.0
Threats F1 Score: 1.0
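
The drop in precision can be checked by hand from the numbers above. The sketch below rebuilds the hold-out confusion matrix from the class counts and the two per-class recalls, assuming that "Target Recall" and "Non-Target Recall" are the recalls of the toxic and non-toxic classes respectively (the output of `mf.get_metrics` is not documented here, so that reading is an assumption). The recomputed precision, F1, and accuracy can then be compared with the "Overall" figures reported above.

```python
# Figures reported above for the hold-out set.
n_toxic = 1013
n_nontoxic = 18803
recall_toxic = 0.9299111549851925      # "Target Recall"
recall_nontoxic = 0.8950167526458543   # "Non-Target Recall"

# Recover the confusion-matrix cells from the recalls and class sizes.
true_pos = round(recall_toxic * n_toxic)
true_neg = round(recall_nontoxic * n_nontoxic)
false_pos = n_nontoxic - true_neg
false_neg = n_toxic - true_pos

precision = true_pos / (true_pos + false_pos)
f1 = 2 * true_pos / (2 * true_pos + false_pos + false_neg)
accuracy = (true_pos + true_neg) / (n_toxic + n_nontoxic)

print("TP, FP, FN, TN:", true_pos, false_pos, false_neg, true_neg)
print("precision:", precision, "F1:", f1, "accuracy:", accuracy)
```

Although only about 10% of non-toxic comments are misclassified, non-toxic comments outnumber toxic ones by roughly 18 to 1 in the hold-out set, so those errors produce about two false positives for every true positive.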

