## Baseline Submission: Toxic Language Classification 
**w207 Spring 2018 - Final Project Baseline**

**Team: Paul, Walt, Yisang, Joe**



### Project Description 

Our challenge is to build a multi-headed model that’s capable of detecting different types of of toxicity like threats, obscenity, insults, and identity-based hate.  The toxic language data set is sourced from Wikipedia and available as a public kaggle data set. 

Our goal is to use various machine learning techniques used in class to develop high quality ML models and pipelines.  

1. Exercise and build upon concepts covered in class and test out at least 3 kinds of supervised models:
    a. Regression (LASSO, Logistic)
    b. Trees (RF, XGBoost)
    c. DeepLearning (Tensorflow)
2. Using stacking/ensembling methods for improving prediction metrics (K-Means, anomaly detection)
3. Using unsupervised methods for feature engineering/selection

For the baseline proposal, this file contains a first pass run through from data preprocessing to model evaluation using a regression model pipeline. 

https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge




### Data Ingestion

In [38]:
import numpy as np
import pandas as pd

#sklearn imports
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.grid_search import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer

#scipy imports
from scipy.sparse import hstack

#Visualization imports
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec 
import bokeh
#! pip install bokeh

#General imports
import pprint

# target classes
target_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

In [3]:
# read frames localy through csv
train_df = pd.read_csv("../data/train.csv")
test_df = pd.read_csv("../data/test.csv")

# Random index generator for splitting training data
# Note: Each rerun of cell will create new splits.
randIndexCut = np.random.rand(len(train_df)) < 0.7

#S plit up data
test_data = test_df["comment_text"]
dev_data, dev_labels = train_df[~randIndexCut]["comment_text"], train_df[~randIndexCut][target_names]
train_data, train_labels = train_df[randIndexCut]["comment_text"], train_df[randIndexCut][target_names]

print 'total training observations:', train_df.shape[0]
print 'training data shape:', train_data.shape
print 'training label shape:', train_labels.shape
print 'dev label shape:', dev_labels.shape
print 'labels names:', target_names

total training observations: 159571
training data shape: (111848,)
training label shape: (111848, 6)
dev label shape: (47723, 6)
labels names: ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']


### Exploratory Data Analysis

#### Class Imbalance

Let's see how imblanced the label set is in order to have a better understanding with the label quality of the given data set. 

In [4]:
from bokeh.io import push_notebook
from bokeh.plotting import figure, show, output_file, output_notebook

target_counts = train_labels.apply(np.sum,0)
target_counts

output_notebook()


p = figure(x_range=target_names)
p.vbar(x=target_names, top = target_counts, width=0.9)

show(p)

train_labels.head()

Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0,0,0,0,0,0
1,0,0,0,0,0,0
2,0,0,0,0,0,0
3,0,0,0,0,0,0
5,0,0,0,0,0,0


The data is fairly imbalanced when counting label occurrences. 

Ideas to consider
- Sampling methods
- Custom Cross Validation

### Feature Engineering/Selection (WIP)
....

### Modeling

#### Text Processing

In [54]:
pp = pprint.PrettyPrinter(indent=4)
# Basic Count Vectorizer
countVector = CountVectorizer(ngram_range=(1,1))
train_counts = countVector.fit_transform(train_data)

print("\nVocabulary size is: {}").format(len(countVector.vocabulary_))
ten_vocab_entries = {k: countVector.vocabulary_[k] for k in countVector.vocabulary_.keys()[:11]}
first_10 = pd.Series(ten_vocab_entries).to_frame()
first_10.columns = ['count']
print(pp.pprint(first_10.sort_values(by='count')))
print("Number of nonzero entries in matrix: {}").format(train_counts.nnz)

tfidfVector = TfidfVectorizer(ngram_range=(1,1), stop_words='english')
train_tfidf_counts = tfidfVector.fit_transform(train_data)

print("\nVocabulary (tfidf) size is: {}").format(len(tfidfVector.vocabulary_))
ten_vocab_entries = {k: tfidfVector.vocabulary_[k] for k in tfidfVector.vocabulary_.keys()[:11]}
first_10 = pd.Series(ten_vocab_entries).to_frame()
first_10.columns = ['count']
print(pp.pprint(first_10.sort_values(by='count')))
print("Number of nonzero entries in matrix (tfidf): {}").format(train_tfidf_counts.nnz)

#sample column wise sum, we can see that an observation can have multiple classes. 
count_df = pd.DataFrame(train_labels.apply(np.sum,1), columns = ["counts"])
count_df = count_df[((count_df["counts"] >= 1))]
count_df.head(10)


Vocabulary size is: 153132
             count
027597675      204
comically    32312
gaf          56312
gavan        56926
regularize  112653
sowell      125742
spiders     126265
trawling    136988
tsukino     137821
woods       148229
woody       148238
None
Number of nonzero entries in matrix: 4870048

Vocabulary (tfidf) size is: 152818
             count
027597675      204
comically    32252
gaf          56205
gavan        56819
regularize  112448
sowell      125513
spiders     126036
trawling    136722
tsukino     137555
woods       147922
woody       147931
None
Number of nonzero entries in matrix (tfidf): 2850571


Unnamed: 0,counts
6,4
12,1
16,1
42,4
44,1
51,2
55,4
65,3
79,2
86,2



### First Pass Logistic Regression with saga

In [57]:
from sklearn.metrics import auc
# SK-learn libraries for cross validation
from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split 
# Basic Logistic Regression Model/MultiLabel Edition

prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver='saga')
    cv_score = np.mean(cross_val_score(
        classifier, train_counts, train_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('CV score for class {} is {}'.format(name, cv_score))
    classifier.fit(train_counts, train_labels[name])
    
    
print("Mean ROC_AUC: {}").format(np.mean(scores_output))

CV score for class toxic is 0.774045289922
CV score for class severe_toxic is 0.762506292844
CV score for class obscene is 0.749179306563
CV score for class threat is 0.626406548539
CV score for class insult is 0.747288640982
CV score for class identity_hate is 0.667176912541
Mean ROC_AUC: 0.721100498565



### Here's the same using tfidf and saga

In [58]:
from sklearn.metrics import auc
# SK-learn libraries for cross validation
from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split 
# Basic Logistic Regression Model/MultiLabel Edition

prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver='saga') 
    #classifier = LogisticRegression(solver='saga') 
    cv_score = np.mean(cross_val_score(
        classifier, train_tfidf_counts, train_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('CV score for class {} is {}'.format(name, cv_score))
    classifier.fit(train_tfidf_counts, train_labels[name])
    
    
print("Mean ROC_AUC (tfidf): {}").format(np.mean(scores_output))

CV score for class toxic is 0.965523844593
CV score for class severe_toxic is 0.98533487365
CV score for class obscene is 0.983638985401
CV score for class threat is 0.978819522659
CV score for class insult is 0.973643130263
CV score for class identity_hate is 0.971053589786
Mean ROC_AUC (tfidf): 0.976335657725



### Original counts with saga and L1

In [None]:
from sklearn.metrics import auc
# SK-learn libraries for cross validation
from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split 
# Basic Logistic Regression Model/MultiLabel Edition

prediction_output = []
scores_output = []
for name in target_names:
    #classifier = LogisticRegression(solver='sag', ) 
    classifier = LogisticRegression(solver='saga', penalty='l1') 
    cv_score = np.mean(cross_val_score(
        classifier, train_counts, train_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('CV score for class {} is {}'.format(name, cv_score))
    classifier.fit(train_counts, train_labels[name])
    
    
print("Mean ROC_AUC: {}").format(np.mean(scores_output))

CV score for class toxic is 0.777955867638


### Tfidf with saga and L1

In [26]:
from sklearn.metrics import auc
# SK-learn libraries for cross validation
from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split 
# Basic Logistic Regression Model/MultiLabel Edition

prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver='saga',penalty='l1') 
    #classifier = LogisticRegression(solver='saga') 
    cv_score = np.mean(cross_val_score(
        classifier, train_tfidf_counts, train_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('CV score for class {} is {}'.format(name, cv_score))
    classifier.fit(train_tfidf_counts, train_labels[name])
    
    
print("Mean ROC_AUC (tfidf): {}").format(np.mean(scores_output))

CV score for class toxic is 0.774038186683
CV score for class severe_toxic is 0.762375099959
CV score for class obscene is 0.74914462031
CV score for class threat is 0.626464769357
CV score for class insult is 0.747331565877
CV score for class identity_hate is 0.667196255196
Mean ROC_AUC: 0.721091749564


#### Testing on Dev Data

In [7]:
from sklearn.metrics import auc, roc_curve
from sklearn import metrics

dev_Vector = CountVectorizer(ngram_range=(1,1))
dev_counts = countVector.fit_transform(dev_data)

pred_dt = pd.DataFrame()
scores_dev = []
for name in target_names:
    classifier = LogisticRegression(solver='sag') 
    classifier.fit(dev_counts, dev_labels[name])
    scores_dev.append(cv_score)
    output = classifier.predict(dev_counts)
    fpr, tpr, thresholds = metrics.roc_curve(dev_labels[name], output)
    print('Dev score for class {} is {}'.format(name, metrics.auc(fpr,tpr)))
    pred_dt[name] = classifier.predict_proba(dev_counts)[:, 1]
    
    
print("Mean(dev) ROC_AUC: {}").format(np.mean(scores_dev))

Dev score for class toxic is 0.655974378594
Dev score for class severe_toxic is 0.576845390847
Dev score for class obscene is 0.612902884437
Dev score for class threat is 0.50325746303
Dev score for class insult is 0.599985585204
Dev score for class identity_hate is 0.515875993381
Mean(dev) ROC_AUC: 0.673912502762


Score on dev set is worse than training set, thus evidence of overfitting and a need for performance improvement.

The target is multi-label since each observation can be classified as multiple fields.  This is an important distinction from multi-class where each prediction can only be one label.  

## Evaluation

In [8]:
count_df
train_labels["toxic"]

0         0
1         0
2         0
3         0
5         0
6         1
7         0
8         0
9         0
11        0
12        1
13        0
14        0
15        0
16        1
17        0
18        0
20        0
23        0
24        0
26        0
27        0
28        0
29        0
30        0
31        0
32        0
33        0
34        0
36        0
         ..
159530    0
159532    0
159533    0
159534    0
159535    0
159536    0
159537    0
159538    0
159541    1
159542    0
159544    0
159545    0
159546    1
159547    0
159548    0
159549    0
159551    0
159552    0
159553    0
159554    1
159556    0
159557    0
159560    0
159561    0
159562    0
159564    0
159566    0
159568    0
159569    0
159570    0
Name: toxic, Length: 111848, dtype: int64

### Submission

In [10]:
from sklearn.metrics import auc
# SK-learn libraries for cross validation
from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split 
# Basic Logistic Regression Model/MultiLabel Edition

prediction_submission = pd.DataFrame()
prediction_submission["id"] = test_df["id"]

# new vector object for all train data for submission
finalTrainVector = CountVectorizer()
finalTrainCount = finalTrainVector.fit_transform(train_df["comment_text"])

# TODO: Using pipelines can clean up repetitive processes
# test set up
#testVector = CountVectorizer()
testCount = finalTrainVector.transform(test_df["comment_text"])

for name in target_names:
    classifier = LogisticRegression(solver='sag') #sag is one kind of solver optimize for multi-label
    clf = classifier.fit(finalTrainCount, train_df[name])
    prediction_submission[name] = clf.predict_proba(testCount)[:, 1]
    #print(prediction_submission)

    
print(prediction_submission.head(10)) # print frame output 
#prediction_submission.to_csv("submission.csv")

                 id     toxic  severe_toxic   obscene        threat    insult  \
0  00001cee341fdb12  0.888785      0.002663  0.496331  2.394389e-05  0.432335   
1  0000247867823ef7  0.250779      0.069228  0.195475  3.602274e-02  0.194031   
2  00013b17ad220c46  0.432636      0.397718  0.430546  4.034219e-01  0.428046   
3  00017563c3f7919a  0.068287      0.001644  0.038307  1.863119e-04  0.025525   
4  00017695ad8997eb  0.424610      0.363699  0.421359  3.635901e-01  0.413221   
5  0001ea8717f6de06  0.340937      0.092327  0.259935  3.274066e-02  0.277824   
6  00024115d4cbde0f  0.124901      0.007612  0.072641  2.152088e-03  0.067184   
7  000247e83dcc1211  0.452093      0.319777  0.400319  2.503194e-01  0.417690   
8  00025358d4737918  0.006213      0.000003  0.001182  5.682402e-07  0.000462   
9  00026d1092fe71cc  0.044135      0.000409  0.017797  9.582382e-05  0.010237   

   identity_hate  
0       0.000398  
1       0.079262  
2       0.409133  
3       0.000478  
4       0.368

The frame contains the output for each class and is saved in a pandas data frame.  