# Toxic Comment Classification:
- Baseline Naive Bayes method
- [Kaggle Link](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge)
- Final project for CS 7650, Spring 2021 at Georgia Tech taught by Alan Ritter
  - Due 05/05/2021
- By Justin Chen

## Libraries

Mount Google Drive to access the data.

In [1]:
import os
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
os.chdir("drive/MyDrive/Colab Notebooks/CS7650/final")
os.listdir()

Mounted at /content/drive


['data', 'Preprocessing', 'Models', 'resources.gdoc']

Import the required libraries.

In [2]:
import pandas as pd
import numpy as np
import tqdm
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.multiclass import OneVsRestClassifier

In [3]:
np.random.seed(0)

## Read in Data

In [4]:
df_train = pd.read_csv('data/clean/train_clean_stop_stem.csv')
df_test = pd.read_csv('data/clean/test_clean_stop_stem.csv')
df_train.head()

| Unnamed: 0.1 | Unnamed: 0 | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0000997932d777bf | explan edit made usernam hardcor metallica fan... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 000103f0d9cfb60f | daww match background colour im seem stuck wit... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 000113f07ec002fd | hey man im realli tri edit war guy constant re... | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | 0001b41b1c6bb37e | cant make real suggest improv wonder section s... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | 0001d958c54c6e35 | you sir hero chanc rememb page that on | 0 | 0 | 0 | 0 | 0 | 0 |


In [5]:
print('{0} rows in train'.format(len(df_train)))
print('{0} rows in test'.format(len(df_test)))

159556 rows in train
63543 rows in test


In [6]:
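# Unused: manual mask that marks a random 25% of the training rows as validation.
# np.random.shuffle shuffles in place and returns None, so its result is not reassigned.
# val_mask = np.full(len(df_train), False)
# num_val = int(len(df_train) * 0.25)
# val_mask[:num_val] = True
# np.random.shuffle(val_mask)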
classes = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
X_train = df_train['comment_text']
y_train = df_train[classes]
X_test = df_test['comment_text']
y_test = df_test[classes]
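
The commented-out mask above and the unused `train_test_split` import suggest a train/validation split was planned. A minimal sketch of how that could look is given here, assuming a 25% hold-out; the `toy` frame and `toy_classes` list are hypothetical stand-ins for `df_train` and `classes`, not part of the notebook's data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df_train with two of the six label columns.
toy = pd.DataFrame({
    'comment_text': [f'comment {i}' for i in range(8)],
    'toxic': [0, 1, 0, 0, 1, 0, 0, 1],
    'insult': [0, 1, 0, 0, 0, 0, 0, 1],
})
toy_classes = ['toxic', 'insult']

# Hold out 25% of the rows for validation; random_state fixes the shuffle.
X_tr, X_val, y_tr, y_val = train_test_split(
    toy['comment_text'], toy[toy_classes], test_size=0.25, random_state=0)

print(len(X_tr), len(X_val))
```

With the real data, the same call on `df_train['comment_text']` and `df_train[classes]` would leave the test set untouched for a final evaluation.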

## Naive Bayes
- Either a TF-IDF or a count vectorizer can supply the features; the pipeline below uses `CountVectorizer`
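
The TF-IDF option mentioned above could be tried by swapping the vectorizer step of the pipeline. The sketch below is only an illustration on hypothetical toy sentences, and it uses a plain binary `MultinomialNB` without the `OneVsRestClassifier` wrapper to keep the example short. `MultinomialNB` is designed for counts, but it also accepts the fractional TF-IDF weights.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data, not taken from the competition set.
texts = ['you are great', 'you are awful idiot',
         'thanks for the edit', 'awful idiot go away']
labels = [0, 1, 0, 1]

# Same two-step pipeline shape, with TF-IDF weights instead of raw counts.
tfidf_nb = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])
tfidf_nb.fit(texts, labels)

pred = tfidf_nb.predict(['what an awful idiot'])
print(pred)
```

Whether TF-IDF helps on this dataset is an empirical question that the notebook does not answer; only the `CountVectorizer` variant is evaluated here.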

In [7]:
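class NaiveBayes():
  """Bag-of-words Multinomial Naive Bayes baseline, trained one label at a time."""

  def __init__(self):
    # eval() fits one binary problem per label, so OneVsRestClassifier
    # only ever wraps a single binary MultinomialNB here.
    self.pipeline = Pipeline([('vect', CountVectorizer()),
                              ('clf', OneVsRestClassifier(MultinomialNB()))
                              ])

  def fit(self, x, y):
    self.pipeline.fit(x, y)

  def predict(self, x):
    return self.pipeline.predict(x)

  def eval(self, train, train_label, test, test_label):
    """Fit and score each label in the global `classes` list, then print overall metrics."""
    # One column of binary predictions per label
    all_preds = np.zeros(shape=(test.shape[0], len(classes)))
    for i, c in enumerate(classes):
      # Refit the whole pipeline (vectorizer included) on this label only
      self.fit(train, train_label[c])
      pred = self.predict(test)
      acc = accuracy_score(test_label[c], pred)
      rec = recall_score(test_label[c], pred)
      prec = precision_score(test_label[c], pred)
      f1 = f1_score(test_label[c], pred)
      print(f'{c} label')
      print(f'Accuracy: {acc} Recall {rec} Precision {prec} F1 {f1}')
      print('-----------------------')
      all_preds[:,i] = pred
    # On the full label matrix, accuracy_score is exact-match (subset) accuracy:
    # a comment counts as correct only if all of its labels are predicted correctly.
    total_acc = accuracy_score(test_label, all_preds)
    # Micro averaging pools true/false positives and false negatives across labels
    total_rec = recall_score(test_label, all_preds, average='micro')
    total_prec = precision_score(test_label, all_preds, average='micro')
    total_f1 = f1_score(test_label, all_preds, average='micro')
    print('Total')
    print(f'Accuracy: {total_acc} Recall {total_rec} Precision {total_prec} F1 {total_f1}')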

In [8]:
NB = NaiveBayes()
NB.eval(X_train, y_train, X_test, y_test)

toxic label
Accuracy: 0.9241773287380199 Recall 0.6366174055829228 Precision 0.5981178648565257 F1 0.6167674196627426
-----------------------
severe_toxic label
Accuracy: 0.9862927466440048 Recall 0.43869209809264303 Precision 0.19491525423728814 F1 0.269907795473596
-----------------------
obscene label
Accuracy: 0.9530711486709788 Recall 0.5740991601192089 Precision 0.6004533862283933 F1 0.5869806094182826
-----------------------
threat label
Accuracy: 0.9936893127488472 Recall 0.018957345971563982 Precision 0.020202020202020204 F1 0.019559902200488997
-----------------------
insult label
Accuracy: 0.9497033504870718 Recall 0.4823460752845054 Precision 0.5375609756097561 F1 0.5084589357120887
-----------------------
identity_hate label
Accuracy: 0.9843885243063752 Recall 0.1601123595505618 Precision 0.22440944881889763 F1 0.18688524590163935
-----------------------
Total
Accuracy: 0.8812457705805518 Recall 0.5468340460753207 Precision 0.5423450540429607 F1 0.5445802994916885
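
Two patterns in these numbers are worth illustrating. The total accuracy (about 0.88) is lower than every per-label accuracy, because `accuracy_score` on a 2-D label matrix counts a comment as correct only when all of its labels match. Per-label accuracy can also be very high while F1 is near zero, as for `threat`, which is what one would expect when a label is rare. The sketch below reproduces both effects on small hypothetical arrays; none of the values come from the dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels and predictions for 4 comments and 2 labels.
y_true = np.array([[1, 0], [0, 0], [0, 0], [1, 1]])
y_pred = np.array([[1, 0], [0, 0], [0, 1], [1, 0]])

# Exact-match accuracy: a row is correct only if every label matches.
subset_acc = accuracy_score(y_true, y_pred)
# Accuracy of each label column taken on its own.
per_label_acc = (y_true == y_pred).mean(axis=0)
# Micro F1 pools true/false positives and false negatives over all labels.
micro_f1 = f1_score(y_true, y_pred, average='micro')
print(subset_acc, per_label_acc, micro_f1)

# A rare label: 1 positive in 100 comments, and a model that never predicts it.
rare_true = np.array([1] + [0] * 99)
rare_pred = np.zeros(100, dtype=int)
rare_acc = accuracy_score(rare_true, rare_pred)
rare_f1 = f1_score(rare_true, rare_pred, zero_division=0)
print(rare_acc, rare_f1)
```

In the toy case the first label is predicted perfectly and the second only half the time, yet only half of the rows match exactly. The rare-label case reaches 0.99 accuracy with an F1 of 0, which is why F1, precision, and recall are the more informative columns in the output above.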
