* Forked from **Jeremy Howard**'s NB-LR kernel: https://www.kaggle.com/jhoward/nb-svm-baseline-0-06-lb

## Introduction

This kernel shows how to use NBSVM (Naive Bayes - Support Vector Machine) to build a strong baseline for the [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) competition. NBSVM was introduced by Sida Wang and Chris Manning in the paper [Baselines and Bigrams: Simple, Good Sentiment and Topic Classification](https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf). This kernel uses sklearn's logistic regression rather than an SVM; in practice the two give nearly identical results (sklearn uses the liblinear library behind the scenes).


In [None]:
import pandas as pd, numpy as np
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_predict
import csv

In [None]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
subm = pd.read_csv('../input/sample_submission.csv')

## Looking at the data

The training data contains a row per comment, with an id, the text of the comment, and 6 different labels that we'll try to predict.

In [None]:
train.head()

Here are a couple of examples of comments: one toxic, and one with no labels.

The length of the comments varies a lot.

In [None]:
lens = train.comment_text.str.len()
lens.mean(), lens.std(), lens.max()

In [None]:
lens.hist();

We'll create a list of all the labels to predict, and we'll also create a 'none' label so we can see how many comments have no labels. We can then summarize the dataset.

In [None]:
label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
train['none'] = 1-train[label_cols].max(axis=1)
train.describe()
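
As a small illustration of the 'none' trick above, here is a sketch on a made-up frame with only two label columns. The frame `toy` and its values are invented for this example; a row gets `none = 1` only when every label in it is 0.

```python
import pandas as pd

# Made-up labels for three comments (not taken from the competition data).
toy = pd.DataFrame({"toxic": [1, 0, 0], "insult": [1, 0, 1]})

# max(axis=1) is 1 if any label is set, so 1 - max is 1 only for unlabeled rows.
toy["none"] = 1 - toy[["toxic", "insult"]].max(axis=1)

# The column mean is then the share of comments with no labels at all.
print(toy)
print(toy["none"].mean())
```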

In [None]:
# Add a readable label column: a comma-separated string of the labels set on each row.
# https://stackoverflow.com/questions/44464280/mapping-one-hot-encoded-target-values-to-proper-label-names
new_label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate','none']
label_names = np.array(new_label_cols)
label_matrix = train[new_label_cols].to_numpy()
# Use each one-hot row as a boolean mask over the label names.
train["target"] = [', '.join(label_names[row.astype(bool)]) for row in label_matrix]

In [None]:
train.shape

In [None]:
train.head()

In [None]:
train[label_cols].max(axis=1).describe()

In [None]:
len(train),len(test)

There are a few empty comments that we need to fill with a placeholder, otherwise sklearn will complain.

In [None]:
COMMENT = 'comment_text'
# Replace missing comments with a placeholder so the vectorizer only sees strings.
train[COMMENT] = train[COMMENT].fillna("unknown")
test[COMMENT] = test[COMMENT].fillna("unknown")

In [None]:
df = pd.concat([train['comment_text'], test['comment_text']], axis=0)

In [None]:
pd.concat([train, test], axis=0).drop_duplicates(subset='comment_text').drop("id",axis=1).to_csv('toxic_raw_text.csv.gz', index=False,compression="gzip",quoting=csv.QUOTE_ALL)

In [None]:
train[["id",'comment_text',"target"]].to_csv('train_toxic_raw_v0.csv.gz', index=False,compression="gzip",quoting=csv.QUOTE_ALL)

## Building the model

We'll start by creating a *bag of words* representation, as a *term document matrix*. We'll use unigrams and bigrams (`ngram_range=(1,2)`), as suggested in the NBSVM paper.

In [None]:
n = train.shape[0]
vec = CountVectorizer(ngram_range=(1,2),min_df=3, max_df=0.97,max_features = 60000) # could also try adding stop word removals, stemming, not lowercasing!

vec.fit(df.values)
trn_term_doc = vec.transform(train['comment_text'])
test_term_doc = vec.transform(test['comment_text'])

# trn_term_doc = vec.fit_transform(train[COMMENT])
# test_term_doc = vec.transform(test[COMMENT])
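
To make the term document matrix concrete, here is a minimal sketch on two made-up comments. The names `docs`, `toy_vec`, `toy_term_doc` and `toy_vocab` are illustrative and not part of the kernel. The pruning options (`min_df`, `max_df`, `max_features`) are left out because they would discard every term in such a tiny corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two invented comments, used only to show the shape of the output.
docs = ["you are great", "you are not great"]

# ngram_range=(1, 2) extracts single words and pairs of adjacent words.
toy_vec = CountVectorizer(ngram_range=(1, 2))
toy_term_doc = toy_vec.fit_transform(docs)  # sparse matrix: documents x terms

# Columns follow the alphabetically sorted vocabulary.
toy_vocab = sorted(toy_vec.vocabulary_)
print(toy_vocab)
print(toy_term_doc.toarray())
```

The bigram "are not" appears only in the second comment, which is the kind of signal a unigram-only model would miss.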

Here's the basic Naive Bayes feature equation:

In [None]:
def pr(y_i, y):
    p = x[y==y_i].sum(0)
    return (p+1) / ((y==y_i).sum()+1)
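
Read as an equation, `pr` computes a smoothed per-feature rate for one class. The symbols here are introduced only for illustration and do not appear in the code: $x_{jf}$ is the value of feature $f$ in document $j$, $y_j$ is that document's label, and $N_c$ is the number of training documents with label value $c$.

$$p_f(c) = \frac{1 + \sum_{j:\, y_j = c} x_{jf}}{1 + N_c}$$

The log-count ratio used later in `get_mdl` is then

$$r_f = \log \frac{p_f(1)}{p_f(0)}$$

Note that this version normalizes by the number of documents in each class, whereas the NBSVM paper normalizes each smoothed count vector by its L1 norm.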

We *binarize* the features as discussed in the NBSVM paper.

In [None]:
x=trn_term_doc.sign()
test_x = test_term_doc.sign()
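
As a quick check of what `sign()` does to a sparse count matrix, here is a sketch on made-up counts. The matrix `counts` is invented for this example.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy term counts: 2 documents x 3 terms (made-up numbers).
counts = csr_matrix(np.array([[3, 0, 1],
                              [0, 2, 0]]))

# sign() maps every positive count to 1 and leaves zeros untouched,
# so the result records presence or absence rather than frequency.
binary = counts.sign()
print(binary.toarray())
```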

Fit a model for one dependent variable (label) at a time:

In [None]:
def get_mdl(y):
    y = y.values
    r = np.log(pr(1,y) / pr(0,y))
    m = LogisticRegression(C=0.1, dual=True, solver='liblinear') # ORIG settings; dual=True is only supported by the liblinear solver
#     m = LogisticRegressionCV(Cs=12)
    x_nb = x.multiply(r)
    return m.fit(x_nb, y), r
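
The whole recipe can be tried end to end on a tiny invented dataset. This is a sketch under stated assumptions, not the kernel's pipeline: `x_toy`, `y_toy` and `log_count_ratio` are hypothetical names, the data is made up, and `dual=True` is dropped because it does not matter at this size.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

# Made-up binarized term document matrix: 6 documents x 4 terms.
x_toy = csr_matrix(np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
]))
y_toy = np.array([1, 1, 1, 0, 0, 0])

def log_count_ratio(x, y):
    """Smoothed per-term log ratio, mirroring pr() and get_mdl() above."""
    p1 = (x[y == 1].sum(0) + 1) / ((y == 1).sum() + 1)
    p0 = (x[y == 0].sum(0) + 1) / ((y == 0).sum() + 1)
    return np.asarray(np.log(p1 / p0))  # shape (1, n_terms)

r_toy = log_count_ratio(x_toy, y_toy)

# Scale each term column by its ratio, then fit a regularized logistic regression.
x_nb_toy = x_toy.multiply(r_toy).tocsr()
m_toy = LogisticRegression(C=0.1, solver='liblinear').fit(x_nb_toy, y_toy)

# Probability of the positive class for each training document.
proba = m_toy.predict_proba(x_nb_toy)[:, 1]
print(r_toy.round(3))
print(proba.round(3))
```

Terms that occur mostly in positive documents get a positive ratio and terms that occur mostly in negative documents get a negative one, so the scaled features already point toward the right class before the classifier is fit.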

In [None]:
preds = np.zeros((len(test), len(label_cols)))

In [None]:
# for i, j in enumerate(label_cols):
#     print('fit', j)
#     m,r = get_mdl(train[j])
#     preds[:,i] = m.predict_proba(test_x.multiply(r))[:,1]

And finally, create the submission file.

In [None]:
# submid = pd.DataFrame({'id': subm["id"]})
# submission = pd.concat([submid, pd.DataFrame(preds, columns = label_cols)], axis=1)
# submission.to_csv('submission.csv', index=False)