## Introduction

This kernel shows how to use NBSVM (Naive Bayes - Support Vector Machine) to create a strong baseline for the [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) competition. NBSVM was introduced by Sida Wang and Chris Manning in the paper [Baselines and Bigrams: Simple, Good Sentiment and Topic Classification](https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf). In this kernel we use sklearn's logistic regression rather than an SVM; in practice the two give nearly identical results (sklearn uses the liblinear library behind the scenes).

If you're not familiar with naive bayes and bag of words matrices, I've made a preview available of one of fast.ai's upcoming *Practical Machine Learning* course videos, which introduces this topic. Here is a link to the section of the video which discusses this: [Naive Bayes video](https://youtu.be/37sFIak42Sc?t=3745).

In [10]:
import pandas as pd, numpy as np
import re
import string
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [11]:
df = pd.read_csv('data/toxicity_annotated_comments.tsv', sep='\t')

In [12]:
scores = pd.read_csv('data/toxicity_annotations.tsv',  sep='\t')
scores.drop_duplicates(subset='rev_id', inplace=True)

In [13]:
df = df.merge(scores, on='rev_id', how='inner')

In [14]:
len(df)

159686

In [15]:
df.drop(columns=['year', 'logged_in', 'split', 'ns', 'sample', 'worker_id'], inplace=True)

In [16]:
df.head()

|   | rev_id | comment | toxicity | toxicity_score |
|---|---|---|---|---|
| 0 | 2232.0 | This:NEWLINE_TOKEN:One can make an analogy in ... | 0 | 0.0 |
| 1 | 4216.0 | `NEWLINE_TOKENNEWLINE_TOKEN:Clarification for ... | 0 | 0.0 |
| 2 | 8953.0 | Elected or Electoral? JHK | 0 | 1.0 |
| 3 | 26547.0 | `This is such a fun entry. DevotchkaNEWLINE_... | 0 | 1.0 |
| 4 | 28959.0 | Please relate the ozone hole to increases in c... | 0 | 1.0 |


In [17]:
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')
def mr_clean(comment):
    comment = re.sub('NEWLINE_TOKEN', '', comment)   # drop the newline markers
    comment = re_tok.sub('', comment)                # remove punctuation
    comment = re.sub('_', ' ', comment)
    comment = re.sub(r'\s+', ' ', comment)           # collapse runs of whitespace (raw string avoids an invalid-escape warning)
    comment = comment.strip()
    return comment

In [18]:
df['comment'] = df['comment'].apply(mr_clean)
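
To see what the cleaning step does to a single comment, here is a small self-contained sketch. It repeats `mr_clean` as defined above and adds `mr_clean_spaced`, a hypothetical variant (not part of the original kernel) that substitutes a space instead of deleting markers and punctuation, so that words on either side of a `NEWLINE_TOKEN` are not glued together.

```python
import re
import string

re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')

def mr_clean(comment):
    # Same steps as the kernel's cleaning function
    comment = re.sub('NEWLINE_TOKEN', '', comment)
    comment = re_tok.sub('', comment)
    comment = re.sub('_', ' ', comment)
    comment = re.sub(r'\s+', ' ', comment)
    return comment.strip()

def mr_clean_spaced(comment):
    # Hypothetical variant: replace markers and punctuation with a space
    # rather than deleting them, then collapse the extra whitespace.
    comment = comment.replace('NEWLINE_TOKEN', ' ')
    comment = re_tok.sub(' ', comment)
    comment = re.sub(r'\s+', ' ', comment)
    return comment.strip()

sample = 'This:NEWLINE_TOKEN:One can make an analogy'
print(mr_clean(sample))         # ThisOne can make an analogy
print(mr_clean_spaced(sample))  # This One can make an analogy
```

The trade-off with the spaced variant is that contractions are split as well ("don't" becomes "don t"), so which behavior is preferable depends on the downstream tokenizer.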

In [19]:
df['toxic'] = df['toxicity_score'].apply(lambda x: int(x < 0))

In [20]:
df.drop(columns=['toxicity', 'toxicity_score'], inplace=True)

In [21]:
df.head()

|   | rev_id | comment | toxic |
|---|---|---|---|
| 0 | 2232.0 | ThisOne can make an analogy in mathematical te... | 0 |
| 1 | 4216.0 | Clarification for you and Zundarks right i sho... | 0 |
| 2 | 8953.0 | Elected or Electoral JHK | 0 |
| 3 | 26547.0 | This is such a fun entry DevotchkaI once had a... | 0 |
| 4 | 28959.0 | Please relate the ozone hole to increases in c... | 0 |


## Train/test split

In [22]:
from sklearn.model_selection import train_test_split
y = df.toxic
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.1)

The length of the comments varies a lot.

In [23]:
lens = X_train.comment.str.len()
lens.mean(), lens.std(), lens.max()

(374.31681707800749, 563.54040399226028, 5000)

This dataset has a single label to predict, `toxic`. The training and test sets have the following sizes:

In [25]:
len(X_train)

143717

In [26]:
len(X_test)

15969

In [27]:
COMMENT = 'comment'

## Building the model

We'll start by creating a *bag of words* representation, as a *term document matrix*. We'll use ngrams, as suggested in the NBSVM paper.

In [28]:
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')
def tokenize(s): return re_tok.sub(r' \1 ', s).split()
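
As a quick illustration of the tokenizer, the sketch below repeats the two definitions above and runs them on two made-up strings. The regex wraps each punctuation character in spaces so that it becomes its own token. The comments in this notebook have already had these characters stripped by the cleaning step, so for them `tokenize` reduces to a plain whitespace split.

```python
import re
import string

re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')

def tokenize(s):
    # Pad every punctuation character with spaces, then split on whitespace
    return re_tok.sub(r' \1 ', s).split()

print(tokenize("You're wrong, again!"))
# ['You', "'", 're', 'wrong', ',', 'again', '!']

cleaned = 'Elected or Electoral JHK'   # already punctuation-free
print(tokenize(cleaned) == cleaned.split())
# True
```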

It turns out that using TF-IDF gives even better priors than the binarized features used in the paper. I don't think this has been mentioned in any paper before, but it improves the leaderboard score (a loss, so lower is better) from 0.59 to 0.55.
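
For readers who want to see what the TF-IDF weighting amounts to, here is a minimal NumPy sketch of the formulas that, as far as I understand sklearn's documentation, apply with `sublinear_tf`, `smooth_idf` and the default L2 normalization: term frequency becomes `1 + ln(tf)`, the inverse document frequency is `ln((1 + n) / (1 + df)) + 1`, and each row is scaled to unit length. The tiny count matrix is made up for illustration, and the sketch ignores the `min_df`/`max_df` vocabulary pruning.

```python
import numpy as np

# Toy term counts: 3 documents (rows) x 3 terms (columns); invented data
counts = np.array([[2, 1, 0],
                   [0, 1, 1],
                   [1, 1, 0]], dtype=float)
n_docs = counts.shape[0]

# Sublinear term frequency: 1 + ln(tf) for non-zero counts, 0 otherwise
nonzero = counts > 0
tf = np.zeros_like(counts)
tf[nonzero] = 1 + np.log(counts[nonzero])

# Smoothed inverse document frequency
df = nonzero.sum(axis=0)                      # documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight, then L2-normalize each document vector
tfidf = tf * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)

print(idf)     # the rarest term (last column) gets the largest weight
print(tfidf)
```

A term that appears in every document (the middle column here) ends up with an IDF of exactly 1, so it is kept but not boosted.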

In [32]:
n = X_train.shape[0]
vec = TfidfVectorizer(ngram_range=(1,2), tokenizer=tokenize,
               min_df=3, max_df=0.9, strip_accents='unicode', use_idf=True,
               smooth_idf=True, sublinear_tf=True)
trn_term_doc = vec.fit_transform(X_train[COMMENT])
test_term_doc = vec.transform(X_test[COMMENT])

This creates a *sparse matrix* with only a small number of non-zero elements (*stored elements* in the representation below).

In [33]:
trn_term_doc, test_term_doc

(<143717x355752 sparse matrix of type '<class 'numpy.float64'>'
 	with 12662305 stored elements in Compressed Sparse Row format>,
 <15969x355752 sparse matrix of type '<class 'numpy.float64'>'
 	with 1366860 stored elements in Compressed Sparse Row format>)
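
To put "a small number of non-zero elements" in perspective, the arithmetic below uses only the shapes and stored-element count printed above for the training matrix. It is a back-of-the-envelope sketch; the dense-size figure assumes 8 bytes per `float64` value and ignores any overhead.

```python
# Numbers copied from the training matrix repr above
n_rows, n_cols = 143717, 355752
stored = 12662305

density = stored / (n_rows * n_cols)   # fraction of entries that are stored
avg_per_doc = stored / n_rows          # average stored entries per comment
dense_gb = n_rows * n_cols * 8 / 1e9   # size if held as a dense float64 array

print(f'density: {density:.4%}')
print(f'stored entries per comment: {avg_per_doc:.1f}')
print(f'dense size: {dense_gb:.0f} GB')
```

Only about 0.025% of the entries are stored, roughly 88 per comment, whereas a dense array of the same shape would need on the order of 400 GB.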

Here's the basic naive bayes feature equation:

In [34]:
def pr(y_i, y):
    # Smoothed per-feature sum over the rows of the global `x` whose label is y_i
    p = x[y==y_i].sum(0)
    return (p+1) / ((y==y_i).sum()+1)
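
The quantity the model actually uses is the log-count ratio built from two calls to this function, $r = \log\big(\mathrm{pr}(1, y) / \mathrm{pr}(0, y)\big)$, with one value per feature. The sketch below works through it on a tiny dense array. The data is invented, and `pr` takes `x` as an explicit argument here (the notebook version reads it from a global) so that the example is self-contained.

```python
import numpy as np

# 4 toy documents x 3 features; the first two documents are labelled 1
x = np.array([[1., 0., 1.],
              [1., 0., 0.],
              [0., 1., 1.],
              [0., 1., 0.]])
y = np.array([1, 1, 0, 0])

def pr(x, y_i, y):
    # Sum each feature over the documents of class y_i, with add-one smoothing
    p = x[y == y_i].sum(0)
    return (p + 1) / ((y == y_i).sum() + 1)

r = np.log(pr(x, 1, y) / pr(x, 0, y))
print(r)   # approximately [ 1.0986 -1.0986  0.    ]
```

Feature 0 occurs only in class 1 and gets a positive ratio, feature 1 occurs only in class 0 and gets a negative one, and feature 2 is equally common in both classes, so its ratio is zero.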

In [35]:
x = trn_term_doc
test_x = test_term_doc

Fit a model for one dependent variable at a time:

In [36]:
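def get_mdl(y):
    y = y.values
    r = np.log(pr(1,y) / pr(0,y))   # log-count ratio, one value per feature
    # dual=True is only supported by the liblinear solver, so request it explicitly
    m = LogisticRegression(C=4, dual=True, solver='liblinear')
    x_nb = x.multiply(r)            # scale each feature column by its ratio
    return m.fit(x_nb, y), r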
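
The function above depends on the global `x`. As a rough, self-contained sketch of the same idea, the version below passes the matrix in explicitly and runs on a toy sparse matrix. The names `log_count_ratio` and `fit_nblr` and the data are hypothetical, and `solver='liblinear'` is set explicitly because, to my knowledge, recent sklearn releases default to a solver that does not accept `dual=True`.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

def log_count_ratio(x, y):
    # Smoothed per-feature sums for each class, then their log ratio
    p = (x[y == 1].sum(0) + 1) / ((y == 1).sum() + 1)
    q = (x[y == 0].sum(0) + 1) / ((y == 0).sum() + 1)
    return np.asarray(np.log(p / q))          # shape (1, n_features)

def fit_nblr(x, y, C=4):
    r = log_count_ratio(x, y)
    model = LogisticRegression(C=C, dual=True, solver='liblinear')
    model.fit(x.multiply(r).tocsr(), y)       # scale features by r, then fit
    return model, r

# Toy data: feature 0 marks class 1, feature 1 marks class 0, feature 2 is noise
x_toy = csr_matrix(np.array([[1., 0., 1.],
                             [1., 0., 1.],
                             [1., 0., 1.],
                             [0., 1., 1.],
                             [0., 1., 1.],
                             [0., 1., 1.]]))
y_toy = np.array([1, 1, 1, 0, 0, 0])

model, r = fit_nblr(x_toy, y_toy)

new_docs = csr_matrix(np.array([[1., 0., 0.],
                                [0., 1., 0.]]))
probs = model.predict_proba(new_docs.multiply(r).tocsr())[:, 1]
print(probs)   # first value above 0.5, second below 0.5
```

Note that the same ratio `r` learned on the training data has to be applied to any matrix passed to `predict_proba`, which is why the function returns it alongside the model.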

In [46]:
label_cols = ['toxic']
preds = np.zeros((len(X_test), len(label_cols)))

for i, j in enumerate(label_cols):
    print('fit', j)
    m,r = get_mdl(X_train[j])
    preds[:,i] = m.predict_proba(test_x.multiply(r))[:,1]

fit toxic


In [47]:
from sklearn.metrics import log_loss
y_true = X_test['toxic'].values   # DataFrame.as_matrix was removed in pandas 1.0
log_loss(y_true, preds[:,0])

0.28089634788067197
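
For context on this number, binary log loss is the mean negative log-probability assigned to the true label. The sketch below is a hand-rolled NumPy version, meant only as an illustration and not as sklearn's implementation; the function name and the example values are made up. A model that always predicts 0.5 scores ln 2 ≈ 0.693, which gives a reference point for the 0.28 above.

```python
import numpy as np

def binary_log_loss(y_true, p, eps=1e-15):
    # Clip probabilities away from 0 and 1 so the logarithms stay finite
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

y_example = [1, 0, 0]
confident = binary_log_loss(y_example, [0.9, 0.2, 0.1])   # mostly right, confident
uninformed = binary_log_loss(y_example, [0.5, 0.5, 0.5])  # always predicts 0.5

print(confident, uninformed)
```

Confident wrong predictions are penalized heavily, so log loss rewards well-calibrated probabilities rather than just correct thresholded labels.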

In [51]:
from sklearn.metrics import accuracy_score
accuracy_score(y_true, preds[:, 0].round())

0.89805247667355503