## Baseline Submission: Toxic Language Classification 
**w207 Spring 2018 - Final Project Baseline**

**Team: Paul, Walt, Yisang, Joe**



### Project Description 

Our challenge is to build a multi-headed model capable of detecting different types of toxicity, such as threats, obscenity, insults, and identity-based hate. The toxic language data set is sourced from Wikipedia and is available as a public Kaggle data set.

Our goal is to apply machine learning techniques covered in class to develop high-quality ML models and pipelines.

1. Exercise and build upon concepts covered in class and test out at least three kinds of supervised models:
    a. Regression (LASSO, Logistic)
    b. Trees (RF, XGBoost)
    c. Deep learning (TensorFlow)
2. Use stacking/ensembling methods to improve prediction metrics (K-Means, anomaly detection)
3. Use unsupervised methods for feature engineering/selection

For the baseline proposal, this file contains a first-pass run-through from data preprocessing to model evaluation using a regression model pipeline.

https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge




### Data Ingestion

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd

#sklearn imports
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# sklearn.grid_search and sklearn.cross_validation are deprecated; sklearn.model_selection replaces both
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split


#scipy imports
from scipy.sparse import hstack

#Visualization imports
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec 
import bokeh
#! pip install bokeh

#General imports
import pprint

# target classes
target_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']



In [2]:
# read frames locally from csv
train_df = pd.read_csv("../data/train.csv")
test_df = pd.read_csv("../data/test.csv")

# Random index generator for splitting training data
# Note: Each rerun of cell will create new splits.
randIndexCut = np.random.rand(len(train_df)) < 0.7

# Split up data
test_data = test_df["comment_text"]
dev_data, dev_labels = train_df[~randIndexCut]["comment_text"], train_df[~randIndexCut][target_names]
train_data, train_labels = train_df[randIndexCut]["comment_text"], train_df[randIndexCut][target_names]

print('total training observations: {}'.format(train_df.shape[0]))
print('training data shape: {}'.format(train_data.shape))
print('training label shape: {}'.format(train_labels.shape))
print('dev label shape: {}'.format(dev_labels.shape))
print('labels names: {}'.format(target_names))

total training observations: 159571
training data shape: (112084,)
training label shape: (112084, 6)
dev label shape: (47487, 6)
labels names: ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
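
The split above changes on every rerun, which makes scores hard to compare across runs. Here is a minimal sketch of one way to make the same mask-based split reproducible, assuming a fixed seed is acceptable. The helper name `make_split_mask` and the seed value are hypothetical and not part of the original notebook.

```python
import numpy as np

def make_split_mask(n_rows, train_fraction=0.7, seed=207):
    """Return a boolean mask that is True for rows assigned to the training split.

    The seed value is an arbitrary assumption; any fixed integer works.
    """
    rng = np.random.RandomState(seed)  # fixed seed: the same mask on every rerun
    return rng.rand(n_rows) < train_fraction

mask_a = make_split_mask(1000)
mask_b = make_split_mask(1000)

# Rows are drawn independently, so the share is close to 0.7 but not exactly 0.7.
train_share = mask_a.mean()
```

In the notebook, the resulting mask would take the place of `randIndexCut`, for example `make_split_mask(len(train_df))`.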


### Exploratory Data Analysis

#### Class Imbalance

Let's see how imbalanced the label set is, to better understand the label quality of the given data set.

In [3]:
from bokeh.io import push_notebook
from bokeh.plotting import figure, show, output_file, output_notebook

target_counts = train_labels.apply(np.sum,0)
target_counts

output_notebook()


p = figure(x_range=target_names)
p.vbar(x=target_names, top = target_counts, width=0.9)

show(p)

train_labels.head()

Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate
2,0,0,0,0,0,0
3,0,0,0,0,0,0
4,0,0,0,0,0,0
5,0,0,0,0,0,0
10,0,0,0,0,0,0


The data is fairly imbalanced when counting label occurrences. 

Ideas to consider
- Sampling methods
- Custom Cross Validation
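
One way to approach the custom cross-validation idea is stratification, which keeps the share of positives roughly equal in every fold. The sketch below uses a synthetic label vector as a stand-in for a rare class such as `threat`; the sizes and variable names are illustrative assumptions, not values from the data set.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for one rare label: 20 positives among 1000 rows (2%).
labels = np.zeros(1000, dtype=int)
labels[:20] = 1
features = np.arange(1000).reshape(-1, 1)  # placeholder features; only the labels matter here

# Stratified folds spread the rare positives evenly across the test folds.
splitter = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
fold_positive_counts = [
    int(labels[test_idx].sum()) for _, test_idx in splitter.split(features, labels)
]
```

For a classifier with a binary target, passing an integer `cv` to `cross_val_score` already uses stratified folds by default, so an explicit splitter mainly matters when shuffling or a fixed seed is wanted.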

### Feature Engineering/Selection (WIP)
....

### Modeling

### Text Processing

In [4]:
pp = pprint.PrettyPrinter(indent=4)

basic=False

if (basic):
    # Basic Count Vectorizer
    countVector = CountVectorizer(ngram_range=(1,1))
    train_counts = countVector.fit_transform(train_data)
    # Use transform (not fit_transform) so dev data shares the vocabulary learned from train data
    dev_counts = countVector.transform(dev_data)

    print("\nVocabulary size is: {}".format(len(countVector.vocabulary_)))
    vocab_entries = {k: countVector.vocabulary_[k] for k in countVector.vocabulary_.keys()}
    vocab_entries = pd.Series(vocab_entries).to_frame()
    vocab_entries.columns = ['count']
    vocab_entries = vocab_entries.sort_values(by='count')

    print("Sample vocabulary from CountVectorizer:")
    print(pp.pprint(vocab_entries.head(10)))
    print("...")
    print(pp.pprint(vocab_entries.tail(10)))
    print("Number of nonzero entries in matrix: {}".format(train_counts.nnz))


    
tfidfVector = TfidfVectorizer(ngram_range=(1,1), stop_words='english')
train_tfidf_counts = tfidfVector.fit_transform(train_data)
dev_tfidf_counts = tfidfVector.transform(dev_data)

print("\nVocabulary (tfidf) size is: {}".format(len(tfidfVector.vocabulary_)))
vocab_entries = {k: tfidfVector.vocabulary_[k] for k in tfidfVector.vocabulary_.keys()}
vocab_entries = pd.Series(vocab_entries).to_frame()
vocab_entries.columns = ['count']
vocab_entries = vocab_entries.sort_values(by='count')

print("Sample vocabulary from TfidfVectorizer:")
print(pp.pprint(vocab_entries.head(10)))
print("...")
print(pp.pprint(vocab_entries.tail(10)))
print("Number of nonzero entries in matrix: {}".format(train_tfidf_counts.nnz))

# Zip the feature names with their idf weights and sort by weight, largest first
coefs = sorted(
    zip(tfidfVector.idf_, tfidfVector.get_feature_names()),
    key=lambda item: item[0], reverse=True
)
max_coeffs, max_features = zip(*coefs)

vocab_split=1

# Keep the top 1/vocab_split of the vocabulary by idf weight (vocab_split=1 keeps 100%)
n = int(len(tfidfVector.vocabulary_)/vocab_split)
subset_vocab = {word for word in max_features[:n]}

# Re-fit the vectorizer on train data and transform dev data, using the limited vocabulary
tfidfVector = TfidfVectorizer(ngram_range=(1,1), stop_words='english', vocabulary=subset_vocab)
train_tfidf_counts = tfidfVector.fit_transform(train_data)
dev_tfidf_counts = tfidfVector.transform(dev_data)

print("\nLimited vocabulary (tfidf) size is: {}".format(len(tfidfVector.vocabulary_)))
vocab_entries = {k: tfidfVector.vocabulary_[k] for k in tfidfVector.vocabulary_.keys()}
vocab_entries = pd.Series(vocab_entries).to_frame()
vocab_entries.columns = ['count']
vocab_entries = vocab_entries.sort_values(by='count')

print("Sample limited vocabulary from TfidfVectorizer:")
print(pp.pprint(vocab_entries.head(10)))
print("...")
print(pp.pprint(vocab_entries.tail(10)))
print("Number of nonzero entries in matrix: {}".format(train_tfidf_counts.nnz))

# Row-wise sum of the labels: an observation can belong to multiple classes.
count_df = pd.DataFrame(train_labels.apply(np.sum,1), columns = ["counts"])
count_df = count_df[((count_df["counts"] >= 1))]
count_df.head(10)


Vocabulary (tfidf) size is: 152876
Sample vocabulary from TfidfVectorizer:
            count
00              0
000             1
0000            2
00000           3
000000          4
0000000         5
00000000        6
0000000027      7
00000001        8
0000030422      9
None
...
            count
번역         152866
보호         152867
요청         152868
유헌         152869
잡아야        152870
척뉴넘        152871
천리마군       152872
편집         152873
ﬂute       152874
ｳｨｷﾍﾟﾃﾞｨｱ  152875
None
Number of nonzero entries in matrix: 2853957

Limited vocabulary (tfidf) size is: 152876
Sample limited vocabulary from TfidfVectorizer:
            count
00              0
000             1
0000            2
00000           3
000000          4
0000000         5
00000000        6
0000000027      7
00000001        8
0000030422      9
None
...
            count
번역         152866
보호         152867
요청         152868
유헌         152869
잡아야        152870
척뉴넘        152871
천리마군       152872
편집         152873
ﬂute    

Unnamed: 0,counts
12,1
16,1
42,4
43,3
44,1
51,2
55,4
56,3
58,2
59,1
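
Note that `vocabulary_` maps each term to its column index in the matrix, so the values printed under `count` above are column positions rather than frequencies. The small sketch below illustrates this on three made-up documents, and also shows how `idf_` can be looked up per term. The documents and the name `idf_by_term` are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up documents; the default tokenizer keeps words of two or more characters.
docs = ["you are kind", "you are rude", "kind words matter"]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

# vocabulary_ maps each term to its column index, assigned in sorted term order.
vocab = vectorizer.vocabulary_

# idf_ holds one inverse-document-frequency weight per column.
# Terms that appear in fewer documents get a larger weight.
idf_by_term = {term: vectorizer.idf_[col] for term, col in vocab.items()}
```

Here `rude` appears in one document and `you` in two, so `rude` receives the larger idf weight. Sorting by `idf_`, as the cell above does, therefore puts the rarest terms first.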


### MLPClassifier (Neural Net)

In [5]:
import time
from sklearn.metrics import auc
# SK-learn libraries for cross validation
#from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split 
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split 



print("Modelling with MLPClassifier")

# multiply the initial layer size by the factor of the vocab_split
initial_layer_n = 6*vocab_split
print("Using an initial layer of size {}".format(initial_layer_n))


# This is the same loop as with the below examples of Logistic Regression
prediction_output = []
scores_output = []
full_CV_start = time.time()
for name in target_names:
    label_CV_start = time.time()

    # This multi-layer perceptron uses three hidden layers of sizes (10, 8, 6) with tanh activation.
    # An earlier setup with two hidden layers of 6 units each took between 10 and 20 minutes of 3-fold
    # cross-validation per label, depending on the machine, with a mean AUC of 0.93 for train and dev.
    # Adding a hidden layer (10, 8, 6) didn't give better results, it just took longer. For the train data, each
    # label took between 20 and 30 minutes, and the dev labels each took around 15 minutes. The resulting score
    # was again 0.93 for both datasets.
    
    classifier = MLPClassifier(hidden_layer_sizes=(10,8,6), activation='tanh', learning_rate='adaptive')
    classifier.fit(train_tfidf_counts, train_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, train_tfidf_counts, train_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    label_CV_finish = time.time()
    print('Train data CV score for class {} is {:.2f}, after {:.2f} minutes.'.format(name, cv_score, 
                                                                                (label_CV_finish-label_CV_start)/60))
full_CV_finish = time.time()
print("Full cross-val across all labels took {:.2f} minutes.".format((full_CV_finish-full_CV_start)/60))

print("Mean Train ROC_AUC for MLPClassifier: {:.2f}".format(np.mean(scores_output)))

prediction_output = []
scores_output = []
full_CV_start = time.time()
for name in target_names:
    label_CV_start = time.time()
    classifier = MLPClassifier(hidden_layer_sizes=(10,8,6), activation='tanh', learning_rate='adaptive')
    classifier.fit(dev_tfidf_counts, dev_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, dev_tfidf_counts, dev_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    label_CV_finish = time.time()
    print('DEV data CV score for class {} is {:.2f}, after {:.2f} minutes.'.format(name, cv_score, 
                                                                                (label_CV_finish-label_CV_start)/60))
full_CV_finish = time.time()
print("Full cross-val across all labels took {:.2f} minutes.".format((full_CV_finish-full_CV_start)/60))
print("Mean DEV ROC_AUC for MLPClassifier: {:.2f}".format(np.mean(scores_output)))
      


Modelling with MLPClassifier
Using an initial layer of size  6
Train data CV score for class toxic is 0.93, after 28.17 minutes.
Train data CV score for class severe_toxic is 0.92, after 30.46 minutes.
Train data CV score for class obscene is 0.95, after 33.32 minutes.
Train data CV score for class threat is 0.93, after 22.63 minutes.
Train data CV score for class insult is 0.91, after 28.26 minutes.
Train data CV score for class identity_hate is 0.93, after 25.49 minutes.
Full cross-val across all labels took 168.32 minutes.
Mean Train ROC_AUC for MLPClassifier: 0.93
DEV data CV score for class toxic is 0.93, after 13.19 minutes.
DEV data CV score for class severe_toxic is 0.93, after 14.51 minutes.
DEV data CV score for class obscene is 0.95, after 14.89 minutes.
DEV data CV score for class threat is 0.93, after 11.71 minutes.
DEV data CV score for class insult is 0.91, after 17.88 minutes.
DEV data CV score for class identity_hate is 0.90, after 13.82 minutes.
Full cross-val across 
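
The per-label timings above cover both an explicit `fit` and the three fits that `cross_val_score` runs internally. Since `cross_val_score` fits clones of the estimator it is given, the explicit `fit` is not needed to obtain the CV score. The sketch below illustrates this on synthetic data, with `LogisticRegression` standing in for the much slower `MLPClassifier`; the data set and variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Small synthetic binary problem, standing in for one label column.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

classifier = LogisticRegression()
scores = cross_val_score(classifier, X, y, cv=3, scoring='roc_auc')

# cross_val_score trained clones, so the original estimator is still unfitted.
was_fitted = hasattr(classifier, "coef_")
```

If the fitted model itself is not used afterwards, dropping the separate `fit` call inside the loop would likely reduce the runtime per label.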


### First Pass Logistic Regression with sag

In [None]:
from sklearn.metrics import auc
# SK-learn libraries for cross validation
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
# Note: train_counts and dev_counts only exist if the Text Processing cell was run with basic=True
# Basic Logistic Regression Model/MultiLabel Edition

solver = 'sag'

print("Modelling with {} solver".format(solver))
prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver=solver)
    classifier.fit(train_counts, train_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, train_counts, train_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('Training data CV score for class {} is {}'.format(name, cv_score))
    
print("Mean Training ROC_AUC for {} solver: {}".format(solver, np.mean(scores_output)))

prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver=solver) 
    classifier.fit(dev_counts, dev_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, dev_counts, dev_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('Dev data CV score for class {} is {}'.format(name, cv_score))
        
print("Mean Dev ROC_AUC for {} solver: {}".format(solver, np.mean(scores_output)))


### First Pass Logistic Regression with saga

In [None]:
from sklearn.metrics import auc
# SK-learn libraries for cross validation
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
# Note: train_counts and dev_counts only exist if the Text Processing cell was run with basic=True
# Basic Logistic Regression Model/MultiLabel Edition

solver = 'saga'

print("Modelling with {} solver".format(solver))
prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver=solver)
    classifier.fit(train_counts, train_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, train_counts, train_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('Training data CV score for class {} is {}'.format(name, cv_score))
    
print("Mean Training ROC_AUC for {} solver: {}".format(solver, np.mean(scores_output)))

prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver=solver) 
    classifier.fit(dev_counts, dev_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, dev_counts, dev_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('Dev data CV score for class {} is {}'.format(name, cv_score))
        
print("Mean Dev ROC_AUC for {} solver: {}".format(solver, np.mean(scores_output)))


### Here's the same using tfidf and saga

In [None]:
from sklearn.metrics import auc
# SK-learn libraries for cross validation
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
# Basic Logistic Regression Model/MultiLabel Edition

solver = 'saga'

print("Modelling with {} solver".format(solver))
prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver=solver)
    classifier.fit(train_tfidf_counts, train_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, train_tfidf_counts, train_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('Training data CV score for class {} is {}'.format(name, cv_score))

    
print("Mean Training ROC_AUC for {} solver: {}".format(solver, np.mean(scores_output)))

prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver=solver) 
    classifier.fit(dev_tfidf_counts, dev_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, dev_tfidf_counts, dev_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('Dev data CV score for class {} is {}'.format(name, cv_score))

        
print("Mean Dev ROC_AUC for {} solver: {}".format(solver, np.mean(scores_output)))


### Original counts with saga and L1

In [None]:
from sklearn.metrics import auc
# SK-learn libraries for cross validation
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
# Note: train_counts and dev_counts only exist if the Text Processing cell was run with basic=True
# Basic Logistic Regression Model/MultiLabel Edition

solver = 'saga'

print("Modelling with {} solver".format(solver))
prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver=solver,penalty='l1')
    classifier.fit(train_counts, train_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, train_counts, train_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('Training data CV score for class {} is {}'.format(name, cv_score))

    
print("Mean Training ROC_AUC for {} solver: {}".format(solver, np.mean(scores_output)))

prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver=solver,penalty='l1') 
    classifier.fit(dev_counts, dev_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, dev_counts, dev_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('Dev data CV score for class {} is {}'.format(name, cv_score))

        
print("Mean Dev ROC_AUC for {} solver: {}".format(solver, np.mean(scores_output)))

### Tfidf with saga and L1

In [None]:
from sklearn.metrics import auc
# SK-learn libraries for cross validation
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
# Basic Logistic Regression Model/MultiLabel Edition

solver = 'saga'

print("Modelling with {} solver".format(solver))
prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver=solver,penalty='l1')
    classifier.fit(train_tfidf_counts, train_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, train_tfidf_counts, train_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('Training data CV score for class {} is {}'.format(name, cv_score))

    
print("Mean Training ROC_AUC for {} solver: {}".format(solver, np.mean(scores_output)))

prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver=solver,penalty='l1') 
    classifier.fit(dev_tfidf_counts, dev_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, dev_tfidf_counts, dev_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('Dev data CV score for class {} is {}'.format(name, cv_score))

        
print("Mean Dev ROC_AUC for {} solver: {}".format(solver, np.mean(scores_output)))

#### Testing on Dev Data

In [None]:
from sklearn import metrics

# Fit the vectorizer on train data only, then apply the same vocabulary to dev data
dev_vector = CountVectorizer(ngram_range=(1,1))
train_counts = dev_vector.fit_transform(train_data)
dev_counts = dev_vector.transform(dev_data)

pred_dt = pd.DataFrame()
scores_dev = []
for name in target_names:
    classifier = LogisticRegression(solver='sag')
    # Train on the training split so that the dev score measures generalization
    classifier.fit(train_counts, train_labels[name])
    # ROC AUC ranks observations, so score the probabilities rather than hard 0/1 predictions
    dev_proba = classifier.predict_proba(dev_counts)[:, 1]
    dev_score = metrics.roc_auc_score(dev_labels[name], dev_proba)
    scores_dev.append(dev_score)
    print('Dev score for class {} is {}'.format(name, dev_score))
    pred_dt[name] = dev_proba


print("Mean(dev) ROC_AUC: {}".format(np.mean(scores_dev)))

The score on the dev set is worse than on the training set, which is evidence of overfitting and shows a need for performance improvement.
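
ROC AUC measures how well a model ranks positives above negatives, so it is generally computed from scores such as `predict_proba` output. Thresholded 0/1 predictions discard most of the ranking information and can give a different value. The sketch below shows the difference on five made-up observations; all values and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Five made-up observations: three negatives followed by two positives.
y_true = np.array([0, 0, 0, 1, 1])
probabilities = np.array([0.1, 0.2, 0.4, 0.3, 0.9])

# Thresholding at 0.5 collapses the scores to hard labels: [0, 0, 0, 0, 1].
hard_labels = (probabilities >= 0.5).astype(int)

# 5 of the 6 positive/negative pairs are ranked correctly -> 5/6.
auc_from_probabilities = roc_auc_score(y_true, probabilities)

# Ties between the missed positive and the negatives count as half -> 0.75.
auc_from_hard_labels = roc_auc_score(y_true, hard_labels)
```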

The target is multi-label, since each observation can carry several labels at once. This is an important distinction from multi-class classification, where each observation is assigned exactly one label.
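
To make the distinction concrete, here is a minimal sketch of a multi-label target as an indicator matrix with one column per label, where a row may contain zero, one, or several ones. It uses `OneVsRestClassifier`, which fits one binary classifier per column, much like the per-label loops in this notebook. The toy features and labels are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy features: two numeric columns.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
              [0.0, 0.9], [0.9, 0.0], [1.0, 0.9], [0.1, 0.0]])

# Indicator matrix: column 0 is on when the first feature is large, column 1 when the second is.
Y = (X > 0.5).astype(int)
labels_per_row = Y.sum(axis=1)  # rows carry 0, 1, or 2 labels

# One binary logistic regression is fitted per label column.
model = OneVsRestClassifier(LogisticRegression()).fit(X, Y)

# One independent probability per label; rows are not forced to sum to 1.
proba = model.predict_proba(X)
```

In a multi-class problem the target would instead be a single column holding one class per row, and the predicted probabilities in each row would sum to 1.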

## Evaluation

In [None]:
count_df
train_labels["toxic"]

### Submission

In [None]:
from sklearn.metrics import auc
# SK-learn libraries for cross validation
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
# Basic Logistic Regression Model/MultiLabel Edition

prediction_submission = pd.DataFrame()
prediction_submission["id"] = test_df["id"]

# new vector object for all train data for submission
finalTrainVector = CountVectorizer()
finalTrainCount = finalTrainVector.fit_transform(train_df["comment_text"])

# TODO: Using pipelines can clean up repetitive processes
# test set up
#testVector = CountVectorizer()
testCount = finalTrainVector.transform(test_df["comment_text"])

for name in target_names:
    classifier = LogisticRegression(solver='sag') # sag (stochastic average gradient) is a solver suited to large data sets
    clf = classifier.fit(finalTrainCount, train_df[name])
    prediction_submission[name] = clf.predict_proba(testCount)[:, 1]
    #print(prediction_submission)

    
print(prediction_submission.head(10)) # print frame output 
#prediction_submission.to_csv("submission.csv")

The resulting pandas data frame holds the predicted probability for each class, with one row per test comment.
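
The TODO in the cell above suggests pipelines as a way to cut down the repeated vectorize-then-fit steps. A minimal sketch of what that could look like follows, assuming a TF-IDF plus logistic regression setup. The helper `make_text_pipeline`, the step names, the `max_iter` and `random_state` values, and the toy comments are all hypothetical and only serve the illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def make_text_pipeline():
    """Bundle vectorizing and classification so both are fitted in one call."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(stop_words="english")),
        ("clf", LogisticRegression(solver="saga", max_iter=1000, random_state=0)),
    ])

# Toy comments standing in for train_df["comment_text"] and one label column.
train_texts = [
    "you are an idiot", "what a stupid idiot",
    "i hate you stupid fool", "shut up fool",
    "thanks for the help", "great article thanks",
    "this edit was helpful", "please help improve the article",
]
train_targets = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = make_text_pipeline().fit(train_texts, train_targets)

# The pipeline applies the fitted vectorizer to new text before predicting.
probs = pipeline.predict_proba(["you are an idiot", "thanks for the help"])[:, 1]
```

Because the vectorizer lives inside the pipeline, passing the pipeline to `cross_val_score` would refit the vocabulary on each training fold, which avoids leaking vocabulary statistics from the held-out fold.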