## Introduction

This snippet shows how to use NBSVM (Naive Bayes - Support Vector Machine) to build a toxic content detector. NBSVM was introduced by Sida Wang and Chris Manning in the paper [Baselines and Bigrams: Simple, Good Sentiment and Topic Classification](https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf).
The model detects several types of toxicity, such as threats, obscenity, insults, and identity-based hate. It is trained on a dataset of comments from Wikipedia's talk page edits.

In [1]:
import pandas as pd, numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import joblib

## Looking at the data

The dataset comes from a Wikipedia talk page corpus in which human raters labeled comments for toxicity.
The corpus contains 63M comments from discussions relating to user pages and articles dating from 2004-2015.

Different platforms and sites can have different standards for screening toxic content, so the comments are tagged with the following six categories:
* toxic
* severe_toxic
* obscene
* threat
* insult
* identity_hate

The tagging was done via **crowdsourcing**, which means that different people rated different comments and the labels may not be 100% accurate.

The [source paper](https://arxiv.org/pdf/1610.08914.pdf) contains more interesting details about the dataset creation.

In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
subm = pd.read_csv('submission.csv')

In [3]:
train.head(10)

|   | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0001b41b1c6bb37e | "\nMore\nI can't make any real suggestions on ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0001d958c54c6e35 | You, sir, are my hero. Any chance you remember... | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 00025465d4725e87 | "\n\nCongratulations from me as well, use the ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0002bcb3da6cb337 | COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK | 1 | 1 | 1 | 0 | 1 | 0 |
| 7 | 00031b1e95af7921 | Your vandalism to the Matt Shirvington article... | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 00037261f536c51d | Sorry if the word 'nonsense' was offensive to ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 00040093b2687caa | alignment on this subject and which are contra... | 0 | 0 | 0 | 0 | 0 | 0 |


The length of the comments varies a lot.

In [4]:
lens = train.comment_text.str.len()
lens.mean(), lens.std(), lens.max()

(394.0732213246768, 590.7202819048923, 5000)

A few comments are missing; they are replaced with a placeholder string, otherwise sklearn will complain.

In [5]:
COMMENT = 'comment_text'
# Assign the result back; in-place fillna on a selected column is unreliable in recent pandas
train[COMMENT] = train[COMMENT].fillna("unknown")
test[COMMENT] = test[COMMENT].fillna("unknown")
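
As a small illustration of this step, here is a sketch on a toy frame (the rows are made up, not taken from the dataset). It shows that `fillna` only touches missing values, so an empty string stays an empty string.

```python
import pandas as pd

# Toy frame with hypothetical rows: one missing value and one empty string
toy = pd.DataFrame({"comment_text": ["fine comment", None, ""]})

# Replace missing values with the placeholder and assign the result back
toy["comment_text"] = toy["comment_text"].fillna("unknown")

n_missing = int(toy["comment_text"].isna().sum())
values = toy["comment_text"].tolist()
print(n_missing, values)
```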

## Building the model

I'll start with a *TF-IDF* transformation over word unigrams and bigrams.

In [6]:
import re, string
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')
def tokenize(s): return re_tok.sub(r' \1 ', s).split()
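
To see what this tokenizer does, here is a minimal sketch on a made-up sentence. The names `re_tok_demo` and `tokenize_demo` are hypothetical, and the sketch escapes the punctuation with `re.escape` and uses a shorter list of extra symbols, which is an assumption of this example rather than what the cell above does.

```python
import re
import string

# Same idea as re_tok, with the punctuation escaped for safety (sketch assumption)
re_tok_demo = re.compile(f"([{re.escape(string.punctuation)}“”‘’])")

def tokenize_demo(s):
    # Surround every punctuation mark with spaces, then split on whitespace
    return re_tok_demo.sub(r" \1 ", s).split()

tokens = tokenize_demo("Don't shout, please!")
print(tokens)
```

Each punctuation mark becomes a token of its own, so characters such as `!` can carry signal separately from the words around them.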

In [7]:
n = train.shape[0]
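# Boolean options are passed as True; recent scikit-learn versions validate parameter types
vec = TfidfVectorizer(ngram_range=(1,2), tokenizer=tokenize,
               min_df=3, max_df=0.9, strip_accents='unicode', use_idf=True,
               smooth_idf=True, sublinear_tf=True, stop_words='english')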

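In [8]:
# Fit on the training comments, then apply the same vocabulary to the test set
train_tdidf = vec.fit_transform(train[COMMENT])
test_tdidf = vec.transform(test[COMMENT])

In [9]:
# Save the fitted vectorizer (dumping it before fit_transform would store an unfitted object)
vectorizer = 'vectorizer.joblib'
joblib.dump(vec, vectorizer)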

This creates a *sparse matrix* with only a small number of non-zero elements.

In [10]:
train_tdidf, test_tdidf

(<159571x326105 sparse matrix of type '<class 'numpy.float64'>'
 	with 9291686 stored elements in Compressed Sparse Row format>,
 <153164x326105 sparse matrix of type '<class 'numpy.float64'>'
 	with 7741173 stored elements in Compressed Sparse Row format>)

Here's the basic Naive Bayes feature equation, a smoothed per-feature average over the documents of one class:

In [11]:
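def pr(y_i, y):
    # Sum each feature over the rows of class y_i (uses the global x assigned in the next cell)
    p = x[y==y_i].sum(0)
    # Add-one smoothing, normalized by the number of documents in the class
    return (p+1) / ((y==y_i).sum()+1)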
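
Written out, reading `x` as the TF-IDF matrix with rows $x_i$ and `y` as the 0/1 label vector, the function above and the ratio computed from it in `get_mdl` correspond to the following. The symbols $p_c$, $N_c$ and $r$ are notation introduced here for illustration.

```latex
p_c = \frac{1 + \sum_{i \,:\, y_i = c} x_i}{1 + N_c},
\qquad N_c = \left|\{\, i : y_i = c \,\}\right|,
\qquad r = \log \frac{p_1}{p_0}
```

Note that this code normalizes by the number of documents in each class, so $p_c$ is a smoothed average feature vector for class $c$, and $r$ has one entry per n-gram.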

In [12]:
x = train_tdidf
test_x = test_tdidf

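Fit one model per label. The classifier here is scikit-learn's `LogisticRegression` rather than an SVM; the Naive Bayes log-count ratio `r` rescales the TF-IDF features before fitting: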

In [13]:
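def get_mdl(y):
    y = y.values
    # Log-count ratio between the positive and negative class, one value per feature
    r = np.log(pr(1,y) / pr(0,y))
    m = LogisticRegression(C=0.1, solver='sag')
    # Scale each TF-IDF column by its ratio before fitting
    x_nb = x.multiply(r)
    return m.fit(x_nb, y), r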
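
The same recipe can be tried end to end on a handful of sentences. This is a minimal sketch under stated assumptions: the texts and labels are made up, the `toy_` names and the helpers `class_average` and `toxic_probability` are hypothetical, and the default solver is used instead of `sag` because the data is tiny.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up toy data: 1 = toxic, 0 = clean
toy_texts = [
    "you are an idiot",
    "stupid idiot go away",
    "thanks for the helpful edit",
    "great article thanks",
    "what a stupid idiot",
    "helpful and great work",
]
toy_labels = np.array([1, 1, 0, 0, 1, 0])

toy_vec = TfidfVectorizer()
toy_x = toy_vec.fit_transform(toy_texts)

def class_average(x, y, c):
    # Smoothed per-feature average over the rows with label c
    rows = np.flatnonzero(y == c)
    return (x[rows].sum(axis=0) + 1) / (len(rows) + 1)

# Log-count ratio as a (1, n_features) array
toy_r = np.asarray(
    np.log(class_average(toy_x, toy_labels, 1) / class_average(toy_x, toy_labels, 0))
).reshape(1, -1)

# Fit logistic regression on the ratio-scaled features
toy_clf = LogisticRegression(C=0.1).fit(toy_x.multiply(toy_r).tocsr(), toy_labels)

def toxic_probability(comment):
    # New text must go through the same vectorizer and the same ratio
    feats = toy_vec.transform([comment]).multiply(toy_r).tocsr()
    return float(toy_clf.predict_proba(feats)[0, 1])

p_toxic = toxic_probability("stupid idiot")
p_clean = toxic_probability("great helpful thanks")
print(p_toxic, p_clean)
```

With so little data the two probabilities stay close to 0.5, but the toxic phrase scores higher than the clean one, which is the behavior the ratio scaling is meant to encourage.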

In [14]:
class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
preds = np.zeros((len(test), len(class_names)))

for i, j in enumerate(class_names):
    print('fit', j)
    m,r = get_mdl(train[j])
    # Save the model; the ratio r is also needed to transform new text at prediction time
    filename = 'finalized_model_'+j+'.sav'
    joblib.dump(m, filename)
    preds[:,i] = m.predict_proba(test_x.multiply(r))[:,1]

fit toxic
fit severe_toxic
fit obscene
fit threat
fit insult
fit identity_hate
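
The loop above stores only the fitted classifiers. Scoring new text later also needs the fitted vectorizer and the ratio `r` for each label, so one option is to save them together. The sketch below is an assumption about how that could look, not part of the original pipeline: the bundle layout, the file name `toy_bundle.joblib`, the `demo_` names and the toy data are all hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up toy data: 1 = toxic, 0 = clean
demo_texts = ["stupid idiot", "thanks for the edit", "you idiot", "great article"]
demo_y = np.array([1, 0, 1, 0])

demo_vec = TfidfVectorizer()
demo_x = demo_vec.fit_transform(demo_texts)

# Smoothed class averages and their log ratio, shaped (1, n_features)
pos = (demo_x[np.flatnonzero(demo_y == 1)].sum(axis=0) + 1) / ((demo_y == 1).sum() + 1)
neg = (demo_x[np.flatnonzero(demo_y == 0)].sum(axis=0) + 1) / ((demo_y == 0).sum() + 1)
demo_r = np.asarray(np.log(pos / neg)).reshape(1, -1)

demo_clf = LogisticRegression(C=0.1).fit(demo_x.multiply(demo_r).tocsr(), demo_y)

# Persist everything needed for inference in one file, then load it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_bundle.joblib")
    joblib.dump({"vectorizer": demo_vec, "model": demo_clf, "r": demo_r}, path)
    bundle = joblib.load(path)

# Score a comment with the reloaded objects and with the in-memory ones
comment = ["you idiot"]
feats = bundle["vectorizer"].transform(comment).multiply(bundle["r"]).tocsr()
reloaded = float(bundle["model"].predict_proba(feats)[0, 1])
original = float(demo_clf.predict_proba(demo_vec.transform(comment).multiply(demo_r).tocsr())[0, 1])
print(reloaded, original)
```

For the real pipeline the same idea would mean keeping one `(model, r)` pair per label next to the saved vectorizer.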


And finally, create the submission file.

In [15]:
submid = pd.DataFrame({'id': subm["id"]})
submission = pd.concat([submid, pd.DataFrame(preds, columns = class_names)], axis=1)
submission.to_csv('submission.csv', index=False)