## Baseline Submission: Toxic Language Classification 
**w207 Spring 2018 - Final Project Baseline**

**Team: Paul, Walt, Yisang, Joe**



### Project Description 

Our challenge is to build a multi-headed model that’s capable of detecting different types of of toxicity like threats, obscenity, insults, and identity-based hate.  The toxic language data set is sourced from Wikipedia and available as a public kaggle data set. 

Our goal is to use various machine learning techniques used in class to develop high quality ML models and pipelines.  

1. Exercise and build upon concepts covered in class and test out at least 3 kinds of supervised models:
    a. Regression (LASSO, Logistic)
    b. Trees (RF, XGBoost)
    c. DeepLearning (Tensorflow)
2. Using stacking/ensembling methods for improving prediction metrics (K-Means, anomaly detection)
3. Using unsupervised methods for feature engineering/selection

For the baseline proposal, this file contains a first pass run through from data preprocessing to model evaluation using a regression model pipeline. 

https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge




### Data Ingestion

In [1]:
import numpy as np
import pandas as pd

#sklearn imports
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.grid_search import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer

#scipy imports
from scipy.sparse import hstack

#Visualization imports
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec 
import bokeh
#! pip install bokeh

# target classes
target_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']



In [2]:
# read frames localy through csv
train_df = pd.read_csv("../data/train.csv")
test_df = pd.read_csv("../data/test.csv")

# Random index generator for splitting training data
# Note: Each rerun of cell will create new splits.
randIndexCut = np.random.rand(len(train_df)) < 0.7

#S plit up data
test_data = test_df["comment_text"]
dev_data, dev_labels = train_df[~randIndexCut]["comment_text"], train_df[~randIndexCut][target_names]
train_data, train_labels = train_df[randIndexCut]["comment_text"], train_df[randIndexCut][target_names]

print ('total training observations:', train_df.shape[0])
print ('training data shape:', train_data.shape)
print ('training label shape:', train_labels.shape)
print ('dev label shape:', dev_labels.shape)
print ('labels names:', target_names)

total training observations: 159571
training data shape: (111334,)
training label shape: (111334, 6)
dev label shape: (48237, 6)
labels names: ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']


### Exploratory Data Analysis

#### Class Imbalance

Let's see how imblanced the label set is in order to have a better understanding with the label quality of the given data set. 

In [3]:
from bokeh.io import push_notebook
from bokeh.plotting import figure, show, output_file, output_notebook

target_counts = train_labels.apply(np.sum,0)
target_counts

output_notebook()


p = figure(x_range=target_names)
p.vbar(x=target_names, top = target_counts, width=0.9)

show(p)

train_labels.head()

Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate
1,0,0,0,0,0,0
2,0,0,0,0,0,0
3,0,0,0,0,0,0
4,0,0,0,0,0,0
6,1,1,1,0,1,0


The data is fairly imbalanced when counting label occurrences. 

Ideas to consider
- Sampling methods
- Custom Cross Validation

### Feature Engineering/Selection (WIP)
....

### Modeling

#### Text Processing

In [4]:
# Basic Count Vectorizer
countVector = CountVectorizer(ngram_range=(1,1))
train_counts = countVector.fit_transform(train_data)

print("Vocabulary size is: {}".format(len(countVector.vocabulary_)))

print("Number of nonzero entries in matrix: {}".format(train_counts.nnz))

#sample column wise sum, we can see that an observation can have multiple classes. 
count_df = pd.DataFrame(train_labels.apply(np.sum,1), columns = ["counts"])
count_df = count_df[((count_df["counts"] >= 1))]
count_df.head(10)

Vocabulary size is: 153173
Number of nonzero entries in matrix: 4858960


Unnamed: 0,counts
6,4
12,1
43,3
44,1
51,2
55,4
56,3
58,2
59,1
65,3


#### First Pass Logistic Regression

In [5]:
from sklearn.metrics import auc
# SK-learn libraries for cross validation
from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split 
# Basic Logistic Regression Model/MultiLabel Edition

prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver='sag') 
    cv_score = np.mean(cross_val_score(
        classifier, train_counts, train_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('CV score for class {} is {}'.format(name, cv_score))
    classifier.fit(train_counts, train_labels[name])
    
    
print("Mean ROC_AUC: {}").format(np.mean(scores_output))



CV score for class toxic is 0.7931806764841953




CV score for class severe_toxic is 0.7716463967007033




CV score for class obscene is 0.7695848574305241




CV score for class threat is 0.6142712064318583




CV score for class insult is 0.764512541708876




CV score for class identity_hate is 0.6697962142801029
Mean ROC_AUC: {}


AttributeError: 'NoneType' object has no attribute 'format'

#### Testing on Dev Data

In [6]:
from sklearn.metrics import auc, roc_curve
from sklearn import metrics

dev_Vector = CountVectorizer(ngram_range=(1,1))
dev_counts = countVector.fit_transform(dev_data)

pred_dt = pd.DataFrame()
scores_dev = []
for name in target_names:
    classifier = LogisticRegression(solver='sag') 
    classifier.fit(dev_counts, dev_labels[name])
    scores_dev.append(cv_score)
    output = classifier.predict(dev_counts)
    fpr, tpr, thresholds = metrics.roc_curve(dev_labels[name], output)
    print('Dev score for class {} is {}'.format(name, metrics.auc(fpr,tpr)))
    pred_dt[name] = classifier.predict_proba(dev_counts)[:, 1]
    
    
print("Mean(dev) ROC_AUC: {}".format(np.mean(scores_dev)))



Dev score for class toxic is 0.6703810167570833
Dev score for class severe_toxic is 0.5867791509907171
Dev score for class obscene is 0.621809659059055
Dev score for class threat is 0.5066906141871892
Dev score for class insult is 0.6137888126414333
Dev score for class identity_hate is 0.5118937628785551
Mean(dev) ROC_AUC: 0.6697962142801029


Score on dev set is worse than training set, thus evidence of overfitting and a need for performance improvement.

The target is multi-label since each observation can be classified as multiple fields.  This is an important distinction from multi-class where each prediction can only be one label.  

## Evaluation

In [7]:
count_df
train_labels["toxic"]

1         0
2         0
3         0
4         0
6         1
7         0
8         0
10        0
12        1
14        0
18        0
19        0
23        0
24        0
26        0
27        0
29        0
30        0
31        0
32        0
33        0
39        0
40        0
41        0
43        1
44        1
45        0
46        0
49        0
51        1
         ..
159529    0
159530    0
159531    0
159532    0
159533    0
159534    0
159535    0
159536    0
159537    0
159538    0
159539    0
159541    1
159542    0
159543    0
159544    0
159546    1
159548    0
159549    0
159550    0
159551    0
159552    0
159553    0
159554    1
159558    0
159559    0
159560    0
159564    0
159565    0
159566    0
159568    0
Name: toxic, Length: 111334, dtype: int64

### Submission

In [8]:
from sklearn.metrics import auc
# SK-learn libraries for cross validation
from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split 
# Basic Logistic Regression Model/MultiLabel Edition

prediction_submission = pd.DataFrame()
prediction_submission["id"] = test_df["id"]

# new vector object for all train data for submission
finalTrainVector = CountVectorizer()
finalTrainCount = finalTrainVector.fit_transform(train_df["comment_text"])

# TODO: Using pipelines can clean up repeitive processes
# test set up
#testVector = CountVectorizer()
testCount = finalTrainVector.transform(test_df["comment_text"])

for name in target_names:
    classifier = LogisticRegression(solver='sag') #sag is one kind of solver optimize for multi-label
    clf = classifier.fit(finalTrainCount, train_df["toxic"])
    prediction_submission[name] = clf.predict_proba(testCount)[:, 1]
    #print(prediction_submission)

    
print(prediction_submission.head(10)) # print frame output 
#prediction_submission.to_csv("submission.csv")



                 id     toxic  severe_toxic   obscene    threat    insult  \
0  00001cee341fdb12  0.889874      0.889881  0.889580  0.889148  0.889253   
1  0000247867823ef7  0.250707      0.250631  0.250783  0.250673  0.250832   
2  00013b17ad220c46  0.432800      0.432654  0.432645  0.432660  0.432584   
3  00017563c3f7919a  0.068222      0.068144  0.068119  0.068206  0.068508   
4  00017695ad8997eb  0.424500      0.424078  0.424393  0.424709  0.424134   
5  0001ea8717f6de06  0.341316      0.340847  0.341085  0.340838  0.340882   
6  00024115d4cbde0f  0.124698      0.124975  0.124506  0.124726  0.124905   
7  000247e83dcc1211  0.452275      0.452389  0.451903  0.452156  0.452112   
8  00025358d4737918  0.006188      0.006213  0.006212  0.006213  0.006215   
9  00026d1092fe71cc  0.044157      0.044126  0.044130  0.044161  0.044084   

   identity_hate  
0       0.888660  
1       0.250915  
2       0.432634  
3       0.068281  
4       0.424510  
5       0.340689  
6       0.124845  


The frame contains the output for each class and is saved in a pandas data frame.  

In [9]:
import string
import spacy
from pprint import pprint
import pickle
from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# target classes
#target_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

In [10]:
# Forcing pandas to display all data (instead of cutting off columns & rows in the view)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [11]:
# Setting up X and y
X_train = train_df['comment_text']
y_train = train_df[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]
X_test = test_df['comment_text']

In [12]:
# Setting up stop words
stop_words = set(list(ENGLISH_STOP_WORDS) + ['wikipedia'])

In [13]:
# Creating text processing & modeling pipeline for gridsearching hyper-parameters
grid_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('model', DecisionTreeClassifier()),
])

In [14]:
# Setting up parameter list for gridsearching
param_list = [{'model': [KNeighborsClassifier()],'model__n_neighbors': [5, 15]}]

In [None]:
# Grid searching using the pipeline's parameters
g = GridSearchCV(grid_pipeline, param_list, cv=5, n_jobs=3, verbose=10, scoring='f1_weighted')
g.fit(X_train, y_train)

Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV] model=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'), model__n_neighbors=5 
[CV] model=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'), model__n_neighbors=5 
[CV] model=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'), model__n_neighbors=5 
