In [3]:
import sqlite3 as sql
import pandas as pd
import numpy as np
import re

seed = 101

Load the dataset from the previous notebook.

In [2]:
with sql.connect('../data/toxic.db') as conn:
    df = pd.read_sql_query('select * from toxic', conn)
df.head()

Unnamed: 0,rev_id,comment,year,logged_in,ns,sample,split,min,max,avg,y
0,2232.0,This:\n:One can make an analogy in mathematica...,2002,1,article,random,train,-1.0,1.0,0.4,0
1,4216.0,"""\n\n:Clarification for you (and Zundark's ri...",2002,1,user,random,train,0.0,2.0,0.5,0
2,8953.0,Elected or Electoral? JHK,2002,0,article,random,test,0.0,1.0,0.1,0
3,26547.0,"""This is such a fun entry. Devotchka\n\nI on...",2002,1,article,random,train,0.0,2.0,0.6,0
4,28959.0,Please relate the ozone hole to increases in c...,2002,1,article,random,test,-1.0,1.0,0.2,0


Remember to isolate the train, dev, and test sets.

In [6]:
idx_train = df['split'] == 'train'
idx_dev = df['split'] == 'dev'
idx_test = df['split'] == 'test'

Let's start with a basic model to serve as a baseline: a CountVectorizer feeding a MultinomialNB. We'll reuse the tokenizer we developed in the exploration step.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

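def tokenizer(text):
    # Keep runs of lowercase letters and digits; everything else is a separator.
    # CountVectorizer lowercases the text (lowercase=True) before calling this.
    return re.findall(r'[a-z0-9]+', text)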

vect = CountVectorizer(tokenizer=tokenizer)
clf = MultinomialNB()
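
The tokenizer pattern only matches lowercase letters and digits, so it relies on CountVectorizer lowercasing the text first (its default `lowercase=True`). A minimal sketch of the difference, using a made-up sample string that is not from the dataset:

```python
import re

def tokenizer(text):
    # Same pattern as the notebook's tokenizer: runs of lowercase letters and digits.
    return re.findall(r'[a-z0-9]+', text)

sample = "Hello, World 123!"  # hypothetical example string

# What the tokenizer effectively sees inside CountVectorizer (text already lowercased).
tokens_lowered = tokenizer(sample.lower())

# Called directly on mixed-case text, uppercase letters act as separators and are lost.
tokens_raw = tokenizer(sample)

print(tokens_lowered)  # ['hello', 'world', '123']
print(tokens_raw)      # ['ello', 'orld', '123']
```

If the tokenizer is ever reused outside the vectorizer, lowercasing the input first keeps the tokens consistent.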

In [10]:
X_train = df.loc[idx_train, 'comment'].values
X_dev = df.loc[idx_dev, 'comment'].values

y_train = df.loc[idx_train, 'y'].values
y_dev = df.loc[idx_dev, 'y'].values

X_train_vect = vect.fit_transform(X_train)
X_dev_vect = vect.transform(X_dev)

In [11]:
clf.fit(X_train_vect, y_train)
y_pred = clf.predict(X_dev_vect)
np.mean(y_dev == y_pred)  # accuracy on the dev set

0.8969434760956175

Nice, we got nearly 90% accuracy on the dev set. Is accuracy a good metric for this problem, though?

In [12]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_dev, y_pred)

array([[25211,  1039],
       [ 2272,  3606]], dtype=int64)
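
To put the accuracy figure in context, here is a rough sketch that compares it with a trivial model that always predicts class 0. The counts are copied from the confusion matrix above; the variable names are only for illustration.

```python
# Counts copied from the confusion matrix above (rows: true class, columns: predicted).
tn, fp = 25211, 1039
fn, tp = 2272, 3606

total = tn + fp + fn + tp

# Accuracy of the fitted model: correct predictions over all dev comments.
accuracy = (tn + tp) / total

# A model that always predicts 0 is right on every comment whose true class is 0.
majority_baseline = (tn + fp) / total

print(f"model accuracy:    {accuracy:.4f}")           # 0.8969
print(f"always-0 baseline: {majority_baseline:.4f}")  # 0.8170
```

The always-0 baseline already reaches about 82%, so most of the model's 90% comes from the class imbalance rather than from finding positive comments.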

In [13]:
from sklearn.metrics import classification_report

print(classification_report(y_dev, y_pred))

              precision    recall  f1-score   support

           0       0.92      0.96      0.94     26250
           1       0.78      0.61      0.69      5878

   micro avg       0.90      0.90      0.90     32128
   macro avg       0.85      0.79      0.81     32128
weighted avg       0.89      0.90      0.89     32128



In [23]:
# False negatives: comments labeled 1 that the model predicted as 0
idx_error = (y_dev != y_pred) & (y_dev == 1)
print(X_dev[idx_error][3])



==Werewolf: The Whatever==

:No reason for these minor elements of W:tF backstory to have their own articles.

Was that a deliberate typo? Either way, it's quite funny ).  
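
The mask used for this lookup selects false negatives. As a small sketch on hypothetical toy labels (not the notebook's data), the same pattern with the class flipped selects false positives:

```python
import numpy as np

# Hypothetical toy labels, only to illustrate how the boolean masks behave.
y_true_toy = np.array([1, 0, 1, 1, 0, 0])
y_pred_toy = np.array([1, 1, 0, 1, 0, 1])

wrong = y_true_toy != y_pred_toy

# False negatives: true label 1, predicted 0.
false_neg = wrong & (y_true_toy == 1)
# False positives: true label 0, predicted 1.
false_pos = wrong & (y_true_toy == 0)

print(np.flatnonzero(false_neg))  # [2]
print(np.flatnonzero(false_pos))  # [1 5]
```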


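Since the classes are heavily imbalanced and the positive class is the minority (5,878 of 32,128 dev comments, about 18%), accuracy flatters the model: it misses almost 40% of the positive comments (recall 0.61) while still scoring about 90% accuracy. Let's use F1 on the positive class as the evaluation metric, since we want a balance between precision _and_ recall.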
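
As a quick check on what F1 measures here, the sketch below recomputes precision, recall, and F1 for the positive class by hand, using the counts from the confusion matrix shown earlier. The variable names are illustrative.

```python
# Counts from the dev-set confusion matrix (rows: true class, columns: predicted).
tn, fp, fn, tp = 25211, 1039, 2272, 3606

# Precision: of the comments predicted positive, how many really are positive.
precision = tp / (tp + fp)
# Recall: of the truly positive comments, how many the model found.
recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.78 recall=0.61 f1=0.69
```

These match the row for class 1 in the classification report. In the notebook, `sklearn.metrics.f1_score(y_dev, y_pred)` should give the same value, since its default is binary F1 for label 1.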