# Baseline Document Classifiers

To start the modelling process, we will build two linear classifiers on bag-of-words features. Linear models are a common choice for bag-of-words because the feature vectors are high-dimensional and very sparse, and linear models train quickly on sparse input. We start with a logistic regression paired with TF-IDF vectorization.

## TFIDF Bag-of-Words + Logistic Regression

#### Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegressionCV

#### Importing Training Data and Class Titles


In [2]:
docs = pd.read_csv('C:\\Users\\harri\\.kaggle\\competitions\\jigsaw-toxic-comment-classification-challenge\\train.csv\\train.csv')
X = docs.iloc[:,1]
y = docs.iloc[:,2:]
class_names = list(docs.columns)[2:]
print(class_names)

['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']


#### Initializing TFIDF 

Here we convert each document into a vector that encodes how often each word in the vocabulary occurs in that document. We then down-weight words that appear in many documents, as they are less useful for telling document classes apart. The vectorizer below also strips accents, removes English stop words, and keeps only the 10,000 most frequent terms.

In [3]:
tfidf = TfidfVectorizer(strip_accents = 'unicode',
                        analyzer = 'word',
                        stop_words = 'english',
                        max_features = 10000 )
X_tfidf = tfidf.fit_transform(X)
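
To make the down-weighting concrete, here is a minimal sketch on a made-up four-sentence corpus (the sentences and the `toy_` names are illustrative assumptions, not part of the competition data). It compares the inverse document frequency of a common word with that of a rare one and checks how sparse the resulting matrix is.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the competition data.
toy_docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "a rare word appears here",
]

toy_tfidf = TfidfVectorizer(analyzer="word")
toy_matrix = toy_tfidf.fit_transform(toy_docs)

# Map each vocabulary term to its inverse document frequency (IDF).
idf_by_term = {term: toy_tfidf.idf_[index]
               for term, index in toy_tfidf.vocabulary_.items()}

# "the" occurs in three of the four documents and "rare" in only one,
# so "the" receives the smaller IDF weight.
print("idf(the):", idf_by_term["the"])
print("idf(rare):", idf_by_term["rare"])

# Fraction of non-zero entries in the document-term matrix.
density = toy_matrix.nnz / (toy_matrix.shape[0] * toy_matrix.shape[1])
print("shape:", toy_matrix.shape, "density:", density)
```

Even on this tiny corpus most entries are zero. On the real comments the matrix is far sparser, which is why `fit_transform` returns a SciPy sparse matrix rather than a dense array.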

#### Fitting a Logistic Regression for Each Document Class Type

Here we take a one-vs-all approach to classification by fitting a separate logistic regression for each document class. For a given class, for example "toxic", the model is fed documents and asked to decide whether each one is "toxic" or "not toxic". Below we see the mean 5-fold cross-validated ROC-AUC for each class.

In [5]:
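CV_scores = []
for name in class_names:
    # Binary target for this class: 1 if the comment carries the label, else 0
    y_class = docs[name]
    # LogisticRegressionCV tunes the regularization strength internally
    clf = LogisticRegressionCV()
    # Mean ROC-AUC over 5 outer cross-validation folds
    score = np.mean(cross_val_score(clf, X_tfidf, y_class, cv = 5, scoring = 'roc_auc', n_jobs = -1))
    print(name, score)
    CV_scores.append(score)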

toxic 0.9665584030848896
severe_toxic 0.9820968963740692
obscene 0.9769778379928169
threat 0.9759224558498725
insult 0.9724199291295372
identity_hate 0.9714379317504337
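
The same "one binary model per label" idea can also be expressed with scikit-learn's `OneVsRestClassifier`, which accepts a label-indicator matrix and fits one estimator per column. The sketch below is only an illustration on synthetic data: the generated features stand in for the TF-IDF matrix, a plain `LogisticRegression` replaces `LogisticRegressionCV` to keep it fast, and a single hold-out split replaces cross-validation. All `toy` names are assumptions.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for the features and label columns (an assumption):
# 400 samples, 20 features, 3 binary labels that can co-occur.
X_toy, Y_toy = make_multilabel_classification(
    n_samples=400, n_features=20, n_classes=3, random_state=0
)
X_train, X_test, Y_train, Y_test = train_test_split(
    X_toy, Y_toy, test_size=0.25, random_state=0
)

# One independent logistic regression is fitted per label column.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_train, Y_train)

# Column j holds the predicted probability that label j applies.
probs = ovr.predict_proba(X_test)

# ROC-AUC per label, computed on the held-out split.
per_label_auc = [
    roc_auc_score(Y_test[:, j], probs[:, j]) for j in range(Y_test.shape[1])
]
print(per_label_auc)
```

The explicit loop used in this notebook has the advantage that each class can be scored and inspected on its own, while the wrapper is more compact when a single fitted object is needed for prediction.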


##### Average Score of all Categories: 0.974

In [6]:
np.mean(CV_scores)

0.9742355756969365

This is acceptable for a baseline: based on the above cross-validation, we scored in the 30th percentile. However, if we want to score in the top 25%, we need a ROC-AUC score of 0.9858. Next, let's try a Naive Bayes-Support Vector Machine (NB-SVM) approach.

## NB-SVM
