## Baseline Submission: Toxic Language Classification 
**w207 Spring 2018 - Final Project Baseline**

**Team: Paul, Walt, Yisang, Joe**



### Project Description 

Our challenge is to build a multi-headed model that’s capable of detecting different types of of toxicity like threats, obscenity, insults, and identity-based hate.  The toxic language data set is sourced from Wikipedia and available as a public kaggle data set. 

Our goal is to use various machine learning techniques used in class to develop high quality ML models and pipelines.  

1. Exercise and build upon concepts covered in class and test out at least 3 kinds of supervised models:
    a. Regression (LASSO, Logistic)
    b. Trees (RF, XGBoost)
    c. DeepLearning (Tensorflow)
2. Using stacking/ensembling methods for improving prediction metrics (K-Means, anomaly detection)
3. Using unsupervised methods for feature engineering/selection

For the baseline proposal, this file contains a first pass run through from data preprocessing to model evaluation using a regression model pipeline. 

https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge




### Data Ingestion

In [7]:
import numpy as np
import pandas as pd

#sklearn imports
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.grid_search import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer

#scipy imports
from scipy.sparse import hstack

#Visualization imports
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec 
import bokeh
#! pip install bokeh

# target classes
target_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

In [56]:
# read frames localy through csv
train_df = pd.read_csv("../data/train.csv")
test_df = pd.read_csv("../data/test.csv")

# Random index generator for splitting training data
# Note: Each rerun of cell will create new splits.
randIndexCut = np.random.rand(len(train_df)) < 0.7

#S plit up data
test_data = test_df["comment_text"]
dev_data, dev_labels = train_df[~randIndexCut]["comment_text"], train_df[~randIndexCut][target_names]
train_data, train_labels = train_df[randIndexCut]["comment_text"], train_df[randIndexCut][target_names]


print('total training observations:', train_df.shape[0])
print('training data shape:', train_data.shape)
print('training label shape:', train_labels.shape)
print('dev label shape:', dev_labels.shape)
print ('labels names:', target_names)

total training observations: 159571
training data shape: (111613,)
training label shape: (111613, 6)
dev label shape: (47958, 6)
labels names: ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']


### Exploratory Data Analysis

#### Class Imbalance

Let's see how imblanced the label set is in order to have a better understanding with the label quality of the given data set. 

In [49]:
from bokeh.io import push_notebook
from bokeh.plotting import figure, show, output_file, output_notebook

target_counts = train_labels.apply(np.sum,0)
target_counts

output_notebook()


p = figure(x_range=target_names)
p.vbar(x=target_names, top = target_counts, width=0.9)

show(p)

train_labels.head()

Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0,0,0,0,0,0
1,0,0,0,0,0,0
2,0,0,0,0,0,0
3,0,0,0,0,0,0
6,1,1,1,0,1,0


In [53]:
total_size = len(train_labels.toxic)

print(sum(train_labels.toxic))
total_normal=len(np.where((train_labels.toxic==0) & (train_labels.severe_toxic==0) & (train_labels.obscene==0) & 
             (train_labels.threat ==0 ) & (train_labels.insult==0) & (train_labels.identity_hate==0))[0])
print("Total normal = %d, total with issues = %d" % (total_normal, total_size-total_normal))
print(target_counts)
      

10656
Total normal = 100402, total with issues = 11299
toxic            10656
severe_toxic      1079
obscene           5820
threat             332
insult            5475
identity_hate      981
dtype: int64


The data is fairly imbalanced when counting label occurrences. 

Ideas to consider
- Sampling methods
- Custom Cross Validation

### Feature Engineering/Selection (WIP)
....

### Modeling

#### Text Processing

In [51]:
# Basic Count Vectorizer
countVector = CountVectorizer(ngram_range=(1,1))
train_counts = countVector.fit_transform(train_data)

print("Vocabulary size is: {}".format(len(countVector.vocabulary_)))

print("Number of nonzero entries in matrix: {}".format(train_counts.nnz))

#sample column wise sum, we can see that an observation can have multiple classes. 
count_df = pd.DataFrame(train_labels.apply(np.sum,1), columns = ["counts"])
count_df = count_df[((count_df["counts"] >= 1))]
count_df.head(10)

Vocabulary size is: 153726
Number of nonzero entries in matrix: 4866901


Unnamed: 0,counts
6,4
12,1
16,1
42,4
43,3
51,2
56,3
58,2
59,1
65,3


#### First Pass Logistic Regression

In [57]:
from sklearn.metrics import auc
# SK-learn libraries for cross validation
from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split 
# Basic Logistic Regression Model/MultiLabel Edition

prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver='sag') 
    cv_score = np.mean(cross_val_score(
        classifier, train_counts, train_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('CV score for class {} is {}'.format(name, cv_score))
    classifier.fit(train_counts, train_labels[name])
    
    
print("Mean ROC_AUC: {}").format(np.mean(scores_output))

ValueError: Found input variables with inconsistent numbers of samples: [111701, 111613]

#### Testing on Dev Data

In [6]:
from sklearn.metrics import auc, roc_curve
from sklearn import metrics

dev_Vector = CountVectorizer(ngram_range=(1,1))
dev_counts = countVector.fit_transform(dev_data)

pred_dt = pd.DataFrame()
scores_dev = []
for name in target_names:
    classifier = LogisticRegression(solver='sag') 
    classifier.fit(dev_counts, dev_labels[name])
    scores_dev.append(cv_score)
    output = classifier.predict(dev_counts)
    fpr, tpr, thresholds = metrics.roc_curve(dev_labels[name], output)
    print('Dev score for class {} is {}'.format(name, metrics.auc(fpr,tpr)))
    pred_dt[name] = classifier.predict_proba(dev_counts)[:, 1]
    
    
print("Mean(dev) ROC_AUC: {}").format(np.mean(scores_dev))

Dev score for class toxic is 0.675413012295
Dev score for class severe_toxic is 0.581701390358
Dev score for class obscene is 0.631987934387
Dev score for class threat is 0.503865476473
Dev score for class insult is 0.60758283931
Dev score for class identity_hate is 0.521895703782
Mean(dev) ROC_AUC: 0.676147422434


Score on dev set is worse than training set, thus evidence of overfitting and a need for performance improvement.

The target is multi-label since each observation can be classified as multiple fields.  This is an important distinction from multi-class where each prediction can only be one label.  

## Evaluation

In [7]:
count_df
train_labels["toxic"]

0         0
1         0
2         0
5         0
6         1
9         0
10        0
11        0
13        0
14        0
17        0
18        0
21        0
22        0
24        0
25        0
26        0
27        0
29        0
30        0
31        0
32        0
35        0
36        0
38        0
39        0
41        0
43        1
44        1
46        0
         ..
159535    0
159536    0
159537    0
159538    0
159539    0
159540    0
159541    1
159542    0
159544    0
159545    0
159546    1
159547    0
159549    0
159550    0
159551    0
159554    1
159555    0
159557    0
159558    0
159559    0
159560    0
159561    0
159562    0
159563    0
159564    0
159565    0
159566    0
159567    0
159569    0
159570    0
Name: toxic, dtype: int64

### Submission

In [8]:
from sklearn.metrics import auc
# SK-learn libraries for cross validation
from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split 
# Basic Logistic Regression Model/MultiLabel Edition

prediction_submission = pd.DataFrame()
prediction_submission["id"] = test_df["id"]

# new vector object for all train data for submission
finalTrainVector = CountVectorizer()
finalTrainCount = finalTrainVector.fit_transform(train_df["comment_text"])

# TODO: Using pipelines can clean up repeitive processes
# test set up
#testVector = CountVectorizer()
testCount = finalTrainVector.transform(test_df["comment_text"])

for name in target_names:
    classifier = LogisticRegression(solver='sag') #sag is one kind of solver optimize for multi-label
    clf = classifier.fit(finalTrainCount, train_df["toxic"])
    prediction_submission[name] = clf.predict_proba(testCount)[:, 1]
    #print(prediction_submission)

    
print(prediction_submission.head(10)) # print frame output 
#prediction_submission.to_csv("submission.csv")

                 id     toxic  severe_toxic   obscene    threat    insult  \
0  00001cee341fdb12  0.889472      0.889208  0.889490  0.889300  0.889859   
1  0000247867823ef7  0.250917      0.250663  0.250762  0.250821  0.250742   
2  00013b17ad220c46  0.432613      0.432735  0.432661  0.432668  0.432595   
3  00017563c3f7919a  0.068362      0.068237  0.068175  0.068064  0.068163   
4  00017695ad8997eb  0.424004      0.424707  0.424460  0.424676  0.424557   
5  0001ea8717f6de06  0.340763      0.340854  0.340989  0.340558  0.340962   
6  00024115d4cbde0f  0.124536      0.124778  0.124653  0.124630  0.124505   
7  000247e83dcc1211  0.452367      0.452056  0.452069  0.452164  0.452073   
8  00025358d4737918  0.006227      0.006221  0.006210  0.006212  0.006193   
9  00026d1092fe71cc  0.044093      0.044136  0.044161  0.044094  0.044203   

   identity_hate  
0       0.888301  
1       0.250729  
2       0.432661  
3       0.068241  
4       0.424652  
5       0.340869  
6       0.124839  


The frame contains the output for each class and is saved in a pandas data frame.  