## Baseline Submission: Toxic Language Classification 
**w207 Spring 2018 - Final Project Baseline**

**Team: Paul, Walt, Yisang, Joe**



### Project Description 

Our challenge is to build a multi-headed model that’s capable of detecting different types of of toxicity like threats, obscenity, insults, and identity-based hate.  The toxic language data set is sourced from Wikipedia and available as a public kaggle data set. 

Our goal is to use various machine learning techniques used in class to develop high quality ML models and pipelines.  

1. Exercise and build upon concepts covered in class and test out at least 3 kinds of supervised models:
    a. Regression (LASSO, Logistic)
    b. Trees (RF, XGBoost)
    c. DeepLearning (Tensorflow)
2. Using stacking/ensembling methods for improving prediction metrics (K-Means, anomaly detection)
3. Using unsupervised methods for feature engineering/selection

For the baseline proposal, this file contains a first pass run through from data preprocessing to model evaluation using a regression model pipeline. 

https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge




### Data Ingestion

In [4]:
import numpy as np
import pandas as pd

#sklearn imports
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.grid_search import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer

#scipy imports
from scipy.sparse import hstack

#Visualization imports
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec 
import bokeh
#! pip install bokeh

# target classes
target_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']



In [5]:
# read frames localy through csv
train_df = pd.read_csv("../data/train.csv")
test_df = pd.read_csv("../data/test.csv")

# Random index generator for splitting training data
# Note: Each rerun of cell will create new splits.
randIndexCut = np.random.rand(len(train_df)) < 0.7

#S plit up data
test_data = test_df["comment_text"]
dev_data, dev_labels = train_df[~randIndexCut]["comment_text"], train_df[~randIndexCut][target_names]
train_data, train_labels = train_df[randIndexCut]["comment_text"], train_df[randIndexCut][target_names]

print ('total training observations:', train_df.shape[0])
print ('training data shape:', train_data.shape)
print ('training label shape:', train_labels.shape)
print ('dev label shape:', dev_labels.shape)
print ('labels names:', target_names)

total training observations: 159571
training data shape: (111606,)
training label shape: (111606, 6)
dev label shape: (47965, 6)
labels names: ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']


### Exploratory Data Analysis

#### Class Imbalance

Let's see how imblanced the label set is in order to have a better understanding with the label quality of the given data set. 

In [6]:
from bokeh.io import push_notebook
from bokeh.plotting import figure, show, output_file, output_notebook

target_counts = train_labels.apply(np.sum,0)
target_counts

output_notebook()


p = figure(x_range=target_names)
p.vbar(x=target_names, top = target_counts, width=0.9)

show(p)

train_labels.head()

Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0,0,0,0,0,0
2,0,0,0,0,0,0
3,0,0,0,0,0,0
4,0,0,0,0,0,0
7,0,0,0,0,0,0


The data is fairly imbalanced when counting label occurrences. 

Ideas to consider
- Sampling methods
- Custom Cross Validation

### Feature Engineering/Selection (WIP)
....

### Modeling

#### Text Processing

In [7]:
# Basic Count Vectorizer
countVector = CountVectorizer(ngram_range=(1,1))
train_counts = countVector.fit_transform(train_data)

print("Vocabulary size is: {}".format(len(countVector.vocabulary_)))

print("Number of nonzero entries in matrix: {}".format(train_counts.nnz))

#sample column wise sum, we can see that an observation can have multiple classes. 
count_df = pd.DataFrame(train_labels.apply(np.sum,1), columns = ["counts"])
count_df = count_df[((count_df["counts"] >= 1))]
count_df.head(10)

Vocabulary size is: 153026
Number of nonzero entries in matrix: 4861415


Unnamed: 0,counts
12,1
16,1
42,4
43,3
51,2
55,4
56,3
58,2
59,1
105,4


#### First Pass Logistic Regression

In [8]:
from sklearn.metrics import auc
# SK-learn libraries for cross validation
from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split 
# Basic Logistic Regression Model/MultiLabel Edition

prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver='sag') 
    cv_score = np.mean(cross_val_score(
        classifier, train_counts, train_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('CV score for class {} is {}'.format(name, cv_score))
    classifier.fit(train_counts, train_labels[name])
    
    
print("Mean ROC_AUC: {}").format(np.mean(scores_output))



CV score for class toxic is 0.7910587208585081




CV score for class severe_toxic is 0.7708712496283959




CV score for class obscene is 0.7680118471136107




CV score for class threat is 0.6235014231416413




CV score for class insult is 0.7624033567392329




CV score for class identity_hate is 0.6657662849179168
Mean ROC_AUC: {}


AttributeError: 'NoneType' object has no attribute 'format'

#### Testing on Dev Data

In [9]:
from sklearn.metrics import auc, roc_curve
from sklearn import metrics

dev_Vector = CountVectorizer(ngram_range=(1,1))
dev_counts = countVector.fit_transform(dev_data)

pred_dt = pd.DataFrame()
scores_dev = []
for name in target_names:
    classifier = LogisticRegression(solver='sag') 
    classifier.fit(dev_counts, dev_labels[name])
    scores_dev.append(cv_score)
    output = classifier.predict(dev_counts)
    fpr, tpr, thresholds = metrics.roc_curve(dev_labels[name], output)
    print('Dev score for class {} is {}'.format(name, metrics.auc(fpr,tpr)))
    pred_dt[name] = classifier.predict_proba(dev_counts)[:, 1]
    
    
print("Mean(dev) ROC_AUC: {}".format(np.mean(scores_dev)))



Dev score for class toxic is 0.6638772853285859
Dev score for class severe_toxic is 0.5659558046251271
Dev score for class obscene is 0.6368878607060424
Dev score for class threat is 0.5069825515540877
Dev score for class insult is 0.6244210119360175
Dev score for class identity_hate is 0.5191347517029775
Mean(dev) ROC_AUC: 0.6657662849179168


Score on dev set is worse than training set, thus evidence of overfitting and a need for performance improvement.

The target is multi-label since each observation can be classified as multiple fields.  This is an important distinction from multi-class where each prediction can only be one label.  

## Evaluation

In [10]:
count_df
train_labels["toxic"]

0         0
2         0
3         0
4         0
7         0
10        0
11        0
12        1
13        0
14        0
15        0
16        1
17        0
19        0
20        0
21        0
23        0
24        0
28        0
29        0
30        0
32        0
33        0
34        0
35        0
36        0
40        0
41        0
42        1
43        1
         ..
159525    0
159526    0
159529    0
159532    0
159533    0
159535    0
159536    0
159537    0
159538    0
159541    1
159543    0
159544    0
159546    1
159548    0
159549    0
159551    0
159553    0
159554    1
159557    0
159558    0
159560    0
159561    0
159562    0
159563    0
159564    0
159566    0
159567    0
159568    0
159569    0
159570    0
Name: toxic, Length: 111606, dtype: int64

### Submission

In [11]:
from sklearn.metrics import auc
# SK-learn libraries for cross validation
from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split 
# Basic Logistic Regression Model/MultiLabel Edition

prediction_submission = pd.DataFrame()
prediction_submission["id"] = test_df["id"]

# new vector object for all train data for submission
finalTrainVector = CountVectorizer()
finalTrainCount = finalTrainVector.fit_transform(train_df["comment_text"])

# TODO: Using pipelines can clean up repeitive processes
# test set up
#testVector = CountVectorizer()
testCount = finalTrainVector.transform(test_df["comment_text"])

for name in target_names:
    classifier = LogisticRegression(solver='sag') #sag is one kind of solver optimize for multi-label
    clf = classifier.fit(finalTrainCount, train_df["toxic"])
    prediction_submission[name] = clf.predict_proba(testCount)[:, 1]
    #print(prediction_submission)

    
print(prediction_submission.head(10)) # print frame output 
#prediction_submission.to_csv("submission.csv")



                 id     toxic  severe_toxic   obscene    threat    insult  \
0  00001cee341fdb12  0.889560      0.890203  0.889134  0.889738  0.889625   
1  0000247867823ef7  0.250773      0.250720  0.250751  0.250866  0.250840   
2  00013b17ad220c46  0.432696      0.432653  0.432728  0.432663  0.432624   
3  00017563c3f7919a  0.068228      0.068120  0.068257  0.068294  0.068290   
4  00017695ad8997eb  0.424850      0.424369  0.424467  0.423950  0.423764   
5  0001ea8717f6de06  0.341078      0.340398  0.341074  0.340683  0.341017   
6  00024115d4cbde0f  0.124654      0.124575  0.124866  0.124799  0.124791   
7  000247e83dcc1211  0.452092      0.452361  0.452309  0.452347  0.452237   
8  00025358d4737918  0.006198      0.006223  0.006197  0.006222  0.006188   
9  00026d1092fe71cc  0.044070      0.044112  0.044138  0.044079  0.044196   

   identity_hate  
0       0.889800  
1       0.250779  
2       0.432622  
3       0.068168  
4       0.424935  
5       0.341079  
6       0.124660  


The frame contains the output for each class and is saved in a pandas data frame.  

In [12]:
import string
import spacy
from pprint import pprint
import pickle
import re
import string

from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# target classes
#target_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

In [13]:
train_df.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [14]:
def tokenize(s):
    pattern = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')
    return pattern.sub(r' \1 ', s).split()

In [15]:
X = train_df.comment_text.values
idx = np.arange(len(X))

In [16]:
train_df['comment_text'].fillna("unknown", inplace=True)

In [18]:
knn_pipe = Pipeline(steps=[
    ('tfidf',
     TfidfVectorizer(ngram_range=(1,2),
                     tokenizer=tokenize,
                     min_df=3,
                     max_df=0.9,
                     use_idf=1,
                     smooth_idf=1,
                     sublinear_tf=1,
                     stop_words='english')),
    ('knn',
     KNeighborsClassifier(algorithm='kd_tree'))
])

parameters = {'knn__n_neighbors': [1]}

In [None]:
best_models = {}

for i, col in enumerate(train_df.columns[2:]):
    y = train_df.loc[:, col].values
    clf = GridSearchCV(knn_pipe, parameters)
    clf.fit(X, y)
    print(clf.best_model_)
    print(clf.cv_results_)
    best_models[col] = clf.best_model_
    break

In [19]:
# Forcing pandas to display all data (instead of cutting off columns & rows in the view)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [25]:
# Setting up X and y
X_train = train_df['comment_text']
y_train = train_df[['toxic',
                    'severe_toxic',
                    'obscene',
                    'threat',
                    'insult',
                    'identity_hate']]
X_test = test_df['comment_text']

In [26]:
# Setting up stop words
stop_words = set(list(ENGLISH_STOP_WORDS) + ['wikipedia'])

In [27]:
# Creating text processing & modeling pipeline for gridsearching hyper-parameters
grid_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('model', DecisionTreeClassifier()),
])

In [28]:
# Setting up parameter list for gridsearching
param_list = [
    {'tfidf__max_features': [150, 200]},
    {'model': [KNeighborsClassifier()],
     'model__n_neighbors': [5, 15]},
    {'model': [DecisionTreeClassifier()],
     'model__min_samples_split': [.5, 1.0],
     'model__max_depth': [5, 15]},
    {'model': [RandomForestClassifier()],
     'model__min_samples_split': [.5, 1.0],
     'model__max_depth': [5, 15]}]

In [None]:
# Grid searching using the pipeline's parameters
g = GridSearchCV(grid_pipeline,
                 param_list,
                 cv=5,
                 n_jobs=3,
                 verbose=10,
                 scoring='f1_weighted')
g.fit(X_train, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV] tfidf__max_features=150 .........................................
[CV] tfidf__max_features=150 .........................................
[CV] tfidf__max_features=150 .........................................
[CV] . tfidf__max_features=150, score=0.338365826683341, total=  52.7s
[CV]  tfidf__max_features=150, score=0.2949467480747935, total=  52.9s
[CV] tfidf__max_features=150 .........................................


[Parallel(n_jobs=3)]: Done   2 tasks      | elapsed:  1.1min


[CV] tfidf__max_features=150 .........................................
[CV]  tfidf__max_features=150, score=0.2837927283010161, total=  52.2s
[CV] tfidf__max_features=200 .........................................
[CV] . tfidf__max_features=150, score=0.334215620603847, total=  52.1s
[CV] tfidf__max_features=200 .........................................
[CV]  tfidf__max_features=150, score=0.32657877178459954, total=  52.7s
[CV] tfidf__max_features=200 .........................................
[CV]  tfidf__max_features=200, score=0.36733524513617927, total= 1.1min
[CV] tfidf__max_features=200 .........................................
[CV]  tfidf__max_features=200, score=0.3440175780304799, total= 1.1min


[Parallel(n_jobs=3)]: Done   7 tasks      | elapsed:  3.4min


[CV] tfidf__max_features=200 .........................................
[CV]  tfidf__max_features=200, score=0.2785564710998293, total= 1.1min
[CV] model=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'), model__n_neighbors=5 
[CV]  tfidf__max_features=200, score=0.41644419911961117, total= 1.1min
[CV] model=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'), model__n_neighbors=5 
[CV]  tfidf__max_features=200, score=0.3385022634291978, total= 1.7min
[CV] model=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'), model__n_neighbors=5 
