## Baseline Submission: Toxic Language Classification 
**w207 Spring 2018 - Final Project Baseline**

**Team: Paul, Walt, Yisang, Joe**



### Project Description 

Our challenge is to build a multi-headed model that’s capable of detecting different types of of toxicity like threats, obscenity, insults, and identity-based hate.  The toxic language data set is sourced from Wikipedia and available as a public kaggle data set. 

Our goal is to use various machine learning techniques used in class to develop high quality ML models and pipelines.  

1. Exercise and build upon concepts covered in class and test out at least 3 kinds of supervised models:
    a. Regression (LASSO, Logistic)
    b. Trees (RF, XGBoost)
    c. DeepLearning (Tensorflow)
2. Using stacking/ensembling methods for improving prediction metrics (K-Means, anomaly detection)
3. Using unsupervised methods for feature engineering/selection

For the baseline proposal, this file contains a first pass run through from data preprocessing to model evaluation using a regression model pipeline. 

https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge




### Data Ingestion

In [55]:
%matplotlib inline
import numpy as np
import pandas as pd

#sklearn imports
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.grid_search import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split 


#scipy imports
from scipy.sparse import hstack

#Visualization imports
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec 
import bokeh
#! pip install bokeh

#General imports
import pprint

# target classes
target_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

In [56]:
# read frames localy through csv
train_df = pd.read_csv("../data/train.csv")
test_df = pd.read_csv("../data/test.csv")

# Random index generator for splitting training data
# Note: Each rerun of cell will create new splits.
randIndexCut = np.random.rand(len(train_df)) < 0.7

#S plit up data
test_data = test_df["comment_text"]
dev_data, dev_labels = train_df[~randIndexCut]["comment_text"], train_df[~randIndexCut][target_names]
train_data, train_labels = train_df[randIndexCut]["comment_text"], train_df[randIndexCut][target_names]

print 'total training observations:', train_df.shape[0]
print 'training data shape:', train_data.shape
print 'training label shape:', train_labels.shape
print 'dev label shape:', dev_labels.shape
print 'labels names:', target_names

total training observations: 159571
training data shape: (111661,)
training label shape: (111661, 6)
dev label shape: (47910, 6)
labels names: ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']


### Exploratory Data Analysis

#### Class Imbalance

Let's see how imblanced the label set is in order to have a better understanding with the label quality of the given data set. 

In [57]:
from bokeh.io import push_notebook
from bokeh.plotting import figure, show, output_file, output_notebook

target_counts = train_labels.apply(np.sum,0)
target_counts

output_notebook()


p = figure(x_range=target_names)
p.vbar(x=target_names, top = target_counts, width=0.9)

show(p)

train_labels.head()

Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0,0,0,0,0,0
1,0,0,0,0,0,0
2,0,0,0,0,0,0
4,0,0,0,0,0,0
5,0,0,0,0,0,0


The data is fairly imbalanced when counting label occurrences. 

Ideas to consider
- Sampling methods
- Custom Cross Validation

### Feature Engineering/Selection (WIP)
....

### Modeling

### Text Processing

In [58]:
pp = pprint.PrettyPrinter(indent=4)

basic=False

if (basic):
    # Basic Count Vectorizer
    countVector0 = CountVectorizer(ngram_range=(1,1))
    train_counts = countVector.fit_transform(train_data)
    dev_counts = countVector.fit_transform(dev_data)

    print("\nVocabulary size is: {}").format(len(countVector.vocabulary_))
    vocab_entries = {k: countVector.vocabulary_[k] for k in countVector.vocabulary_.keys()}
    vocab_entries = pd.Series(vocab_entries).to_frame()
    vocab_entries.columns = ['count']
    vocab_entries = vocab_entries.sort_values(by='count')

    print("Sample vocabulary from CountVectorizer:")
    print(pp.pprint(vocab_entries.head(10)))
    print("...")
    print(pp.pprint(vocab_entries.tail(10)))
    print("Number of nonzero entries in matrix: {}").format(train_counts.nnz)


    
tfidfVector = TfidfVectorizer(ngram_range=(1,1), stop_words='english')
train_tfidf_counts = tfidfVector.fit_transform(train_data)
dev_tfidf_counts = tfidfVector.transform(dev_data)

print("\nVocabulary (tfidf) size is: {}").format(len(tfidfVector.vocabulary_))
vocab_entries = {k: tfidfVector.vocabulary_[k] for k in tfidfVector.vocabulary_.keys()}
vocab_entries = pd.Series(vocab_entries).to_frame()
vocab_entries.columns = ['count']
vocab_entries = vocab_entries.sort_values(by='count')

print("Sample vocabulary from TfidfVectorizer:")
print(pp.pprint(vocab_entries.head(10)))
print("...")
print(pp.pprint(vocab_entries.tail(10)))
print("Number of nonzero entries in matrix: {}").format(train_tfidf_counts.nnz)

#sample column wise sum, we can see that an observation can have multiple classes. 
count_df = pd.DataFrame(train_labels.apply(np.sum,1), columns = ["counts"])
count_df = count_df[((count_df["counts"] >= 1))]
count_df.head(10)


Vocabulary (tfidf) size is: 153283
Sample vocabulary from TfidfVectorizer:
            count
00              0
000             1
0000            2
00000           3
000000          4
0000000         5
00000000        6
0000000027      7
00000001        8
00000003        9
None
...
            count
잡아야        153273
조선인민군      153274
척뉴넘        153275
칠지도        153276
편집         153277
ﬂute       153278
ａｎｏｎｔａｌｋ   153279
ｃｏｍ        153280
ｗｗｗ        153281
ｳｨｷﾍﾟﾃﾞｨｱ  153282
None
Number of nonzero entries in matrix: 2848397


Unnamed: 0,counts
6,4
12,1
16,1
44,1
55,4
58,2
59,1
65,3
86,2
105,4


### Starting to look at Keras -- this is not functional -- please ignore

In [38]:
from keras.wrappers.scikit_learn import KerasClassifier
from scipy import sparse
from types import FunctionType

class KerasClassifier(KerasClassifier):
    """ adds sparse matrix handling using batch generator"""
    
    def fit(self, x, y, **kwargs):
        """ adds sparse matrix handling """
        if not sparse.issparse(x):
            return super().fit(x, y, **kwargs)

        ############ adapted from KerasClassifier.fit   ######################   
        if self.build_fn is None:
            self.model = self.__call__(**self.filter_sk_params(self.__call__))
        elif not isinstance(self.build_fn, FunctionType):
            self.model = self.build_fn(**self.filter_sk_params(self.build_fn.__call__))
        else:
            self.model = self.build_fn(**self.filter_sk_params(self.build_fn))

        loss_name = self.model.loss
        if hasattr(loss_name, '__name__'):
            loss_name = loss_name.__name__
        if loss_name == 'categorical_crossentropy' and len(y.shape) != 2:
            y = to_categorical(y)
        ### fit => fit_generator
        fit_args = copy.deepcopy(self.filter_sk_params(Sequential.fit_generator))
        fit_args.update(kwargs)
        ############################################################
        self.model.fit_generator(
            self.get_batch(x, y, 500),
            samples_per_epoch=x.shape[0],**fit_args)
        return self                               

    def get_batch(self, x, y=None, batch_size=1000):
        """ batch generator to enable sparse input """
        index = np.arange(x.shape[0])
        start = 0
        while True:
            if start == 0 and y is not None:
                np.random.shuffle(index)
            batch = index[start:start+batch_size]
            if y is not None:
                yield x[batch].toarray(), y[batch]
            else:
                yield x[batch].toarray()
            start += batch_size
            if start >= x.shape[0]:
                start = 0

    def predict_proba(self, x):
        """ adds sparse matrix handling """
        if not sparse.issparse(x):
            return super().predict_proba(x)
      
        preds = self.model.predict_generator(
            self.get_batch(x, None, 500), val_samples=x.shape[0])
        return preds


### Starting to look at Keras -- this is not functional -- please ignore

In [54]:
from scipy import sparse
from types import *
import copy
from sklearn.metrics import auc
from sklearn.model_selection import KFold, cross_val_score, train_test_split 
from keras.models import Sequential
from keras.layers import Activation, Dense
import numpy as np
import time

print("Modelling with Keras")
# fix random seed for reproducibility
seed = 7
np.random.seed(seed)


def build_model(n_hidden=32):
    model = Sequential([
        Dense(n_hidden, input_dim=4),
        Activation("relu"),
        Dense(n_hidden),
        Activation("relu"),
        Dense(3),
        Activation("sigmoid")
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model

params = dict(batch_size=500)

prediction_output = []
scores_output = []
full_CV_start = time.time()
for name in target_names:
    # define 3-fold cross validation test harness
    kfold = KFold(n_splits=3, shuffle=True, random_state=seed)
    cvscores = []
    for train, test in kfold.split(dev_tfidf_counts, dev_labels):
        model=KerasClassifier(build_fn=build_model, verbose=0)
        model.fit(train_tfidf_counts, train_labels, epochs=150, verbose=0)
        # evaluate the model
        scores = model.evaluate(dev_tfidf_counts, dev_labels, verbose=0)
        print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
        cvscores.append(scores[1] * 100)
    print("Summary scores for label '{}': Mean {:.2f}  SD {:.2f})".format(name, numpy.mean(cvscores), numpy.std(cvscores)))

Modelling with Keras




KeyError: '[ 70210  33136  30869 102644 110653  87776  36686  19530  46891 102084\n  47362  71340  81013  38411   8450 105745   9102  99512  39504   2125\n 109594 101262  99342  10408  46840  10767  43585  12963  94497   8178\n  14571  32281  13212 103756 104097  71923  51686  19477    936  73356\n  27578  71936  84755  87360  30875  65836  86581  93746  53268  63037\n  50999 106091  75385 108438  81706  76184  26119  62583   5161  31516\n  35705  52694 103545  75954  24798  94683  53472    588  99985  91294\n  81625  66138  92253  37979   7061  84304  71558  32116  36172   2821\n  29981  64404  29031  49208  47785   7398  28023  89834 103604  94618\n  12057  61402  80837 109382    147  84992 103407  36610  52667  50547\n  81769 109333  62275   6328  47018  91881  75841  59972  46730  24650\n  97226  94984  95046  62626   9345  13486  53330  45043  42512  29833\n  64644  47124  57897  62675 110894   9503 111024  87858  45033  99042\n  12702  43368  53885  10788 102939 110035 108224   1617   6871  23132\n 101059  99278  82379  21687 105042  82167  28733  20015  92662  61638\n  82922  41070 108600   7627  20662  92612  62716  47975  20443  85793\n  29753  24619  77087 104248  70661  39349  56406  92674 106480  59720\n  38706  93042   7345  94611  58646  82551  77998  16341 106890    190\n  39439  38853  12758  13436  60526  86428  94762  25519  21482 107215\n   4971  30516  13568  89675  89613  52873  22989  73164  12756  46315\n  84060  82658  88121  95414  95176   6476  42361  24276  34172  12101\n  35332  41898  60493  86206  19633  21722  46530  46022  95234  15439\n 106820  87304  46071  15796  81794  20762  81035  79687  29565  77042\n  78199  23888  91923  11439  27907 109434  52795  36998  24204  37767\n   7766  17180  11260  92424  34396  50568  39520  38996 108645 106841\n  62303  57552  50915  23992  68497  59755  94009  42878  34181  17394\n  52494  20774  22230  47284  95428 109085  85853   4703  28114  63304\n  58966  80251  90884  26845  42517  91185  43413  11992  96239  60229\n 106053  66800  71287 110417 103016  10556  40975  34189  42049  72714\n  58846  23676  90383  71260  31694  21501  55290  42132  42762  32007\n  96839  54180  99666 105030  86158   8587  88641  46205   3525  41770\n  20264  75112   8802  85019  46652  62441 104796  40786  59592  88865\n  51301  27754  60720  46362  35707  15817  46989  79150  60863  64155\n   2559  34357  28339 102364  14922  43178  14552  43052  70886  50770\n  80552  44395  97301  14570  42777  43817  95756  80174 105534  37038\n  22365  81315  95860  15610  74811  88302  49701  39693  66658   9720\n  37008  25334  13547  86363  84378   5299   8360  25303  90851  77613\n  65806  81412  43142  50155  91151  34700  71196  15074  47787  91644\n  22603  72736  59188 107985   6494  35071  17610  39751 107446  91109\n  46161  95028  75332 101118  20402  61570  84907 108849  88638  41388\n  82022  64303  61879  18106 106162  79064  38160 103041  82314  49066\n  44234  16175  84926  21738  76934  66183  77852  65636  37544  47736\n  27863  31806   3075  54221  88553  82583  21245    328  28245  92988\n  86382  17093  61650  20702 108140  47547  38141 101241  15774  85741\n  34739  19090  57708  28628 104662  35545  21146   3107  21360  93668\n  24906  71362  50501  51306  34601  38350  36802  22203  53755  48204\n  17669  67852  37423  29010  58587  16526  99485  62233  21740  34476\n  54881  13855    127  17304   3715  66207 108127  94269  23188  25324\n  13848 103119 109951  34767  29918  50154  63719  24253 109071  57585\n   3384  39312 109177  79303 109546  42779 101672 105343  15832  14017] not in index'

### Starting to look at Keras -- this is not functional -- please ignore

In [53]:
from sklearn.model_selection import KFold, cross_val_score, train_test_split 

def build_model(n_hidden=32):
    model = Sequential([
        Dense(n_hidden, input_dim=4),
        Activation("relu"),
        Dense(n_hidden),
        Activation("relu"),
        Dense(3),
        Activation("sigmoid")
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model

param_grid = {
    "n_hidden": np.array([4, 8, 16]),
    "nb_epoch": np.array(range(50, 61, 5))
}

model = KerasClassifier(build_fn=build_model, verbose=0)
skf = KFold(n_splits=5).split(train_tfidf_counts, train_labels) # this yields (train_indices, test_indices)

In [15]:
print(train_matrix.shape)

(111708, 285189)


### MLPClassifier (Neural Net)

In [59]:
import time
from sklearn.metrics import auc
# SK-learn libraries for cross validation
#from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split 
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split 



print("Modelling with MLPClassifier")

# This is the same loop as with the below examples of Logistic Regression
prediction_output = []
scores_output = []
full_CV_start = time.time()
for name in target_names:
    label_CV_start = time.time()

    # This Multi-Layer Perceptron classifier will be setup with hidden layers of 6 and 6 each, with tanh activation
    # Running a 3-way cross-validation for a single label takes between 10 and 2 minutes, dependenging on the machine.
    # Note that while there was a typo in the below code (it's gone now), in the last run that failed the individual
    # label scores were around 93%.
    # The really long runs where when I had unknowingly allowed my machine to go to sleep, which accounts for the time.
    
    classifier = MLPClassifier(hidden_layer_sizes=(6,6), activation='tanh', learning_rate='adaptive')
    classifier.fit(train_tfidf_counts, train_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, train_tfidf_counts, train_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    label_CV_finish = time.time()
    print('Train data CV score for class {} is {}, after {} minutes.'.format(name, cv_score, 
                                                                                (label_CV_finish-label_CV_start)/60))
full_CV_finish = time.time()
print("Full cross-val across all labels took {} minutes.".format((full_CV_finish-full_CV_start)/60))

print("Mean Train ROC_AUC for MLPClassifier: {}".format(np.mean(scores_output)))

prediction_output = []
scores_output = []
full_CV_start = time.time()
for name in target_names:
    label_CV_start = time.time()
    classifier = MLPClassifier(hidden_layer_sizes=(6,6), activation='tanh', learning_rate='adaptive')
    classifier.fit(dev_tfidf_counts, dev_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, dev_tfidf_counts, dev_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    label_CV_finish = time.time()
    print('DEV data CV score for class {} is {}, after {} minutes.'.format(name, cv_score, 
                                                                                (label_CV_finish-label_CV_start)/60))
full_CV_finish = time.time()
print("Full cross-val across all labels took {} minutes.".format(full_CV_finish-full_CV_start))
print("Mean DEV ROC_AUC for MLPClassifier: {}".format(np.mean(scores_output)))
      


Modelling with MLPClassifier
Train data CV score for class toxic is 0.932057379203, after 20.1736497362 minutes.
Train data CV score for class severe_toxic is 0.925718157775, after 20.1235693018 minutes.
Train data CV score for class obscene is 0.953477908436, after 20.3226263007 minutes.
Train data CV score for class threat is 0.939294023398, after 35.3592856526 minutes.
Train data CV score for class insult is 0.913523731044, after 110.750495831 minutes.
Train data CV score for class identity_hate is 0.938434700338, after 15.2087560654 minutes.


NameError: name 'full_CV_Start' is not defined

### Keras (Neural Net with GPU support) -- this is not yet functioning, and quite broken.

In [51]:
from sklearn.metrics import auc
from sklearn.model_selection import KFold, cross_val_score, train_test_split 
from keras.models import Sequential
from keras.layers import Dense
import numpy as np
import time

print("Modelling with Keras")
# fix random seed for reproducibility
seed = 7
np.random.seed(seed)

prediction_output = []
scores_output = []
full_CV_start = time.time()
for name in target_names:
    # define 3-fold cross validation test harness
    kfold = KFold(n_splits=3, shuffle=True, random_state=seed)
    cvscores = []
    for train, test in kfold.split(dev_tfidf_counts, dev_labels):
        # create model
        model = Sequential()
        model.add(Dense(1000, batch_input_shape=train_tfidf_counts.shape, activation='relu'))
        model.add(Dense(6, input_dim=36, activation='relu'))
        model.add(Dense(6, input_dim=6, activation='sigmoid'))
        # Compile model
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        # Fit the model
        model.fit(train_tfidf_counts, train_labels, epochs=150, verbose=0)
        # evaluate the model
        scores = model.evaluate(dev_tfidf_counts, dev_labels, verbose=0)
        print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
        cvscores.append(scores[1] * 100)
    print("Summary scores for label '{}': Mean {:.2f}  SD {:.2f})".format(name, numpy.mean(cvscores), numpy.std(cvscores)))

Modelling with Keras


ValueError: Cannot feed value of shape (32, 152766) for Tensor u'dense_69_input:0', which has shape '(111395, 152766)'


### First Pass Logistic Regression with sag

In [None]:
from sklearn.metrics import auc
# SK-learn libraries for cross validation
from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split 
# Basic Logistic Regression Model/MultiLabel Edition

solver = 'sag'

print("Modelling with {} solver".format(solver))
prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver=solver)
    classifier.fit(train_counts, train_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, train_counts, train_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('Training data CV score for class {} is {}'.format(name, cv_score))
    
print("Mean Training ROC_AUC for {} solver: {}").format(solver, np.mean(scores_output))

prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver=solver) 
    classifier.fit(dev_counts, dev_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, dev_counts, dev_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('Dev data CV score for class {} is {}'.format(name, cv_score))
        
print("Mean Dev ROC_AUC for {} solver: {}").format(solver, np.mean(scores_output))


### First Pass Logistic Regression with saga

In [None]:
from sklearn.metrics import auc
# SK-learn libraries for cross validation
from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split 
# Basic Logistic Regression Model/MultiLabel Edition

solver = 'saga'

print("Modelling with {} solver".format(solver))
prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver=solver)
    classifier.fit(train_counts, train_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, train_counts, train_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('Training data CV score for class {} is {}'.format(name, cv_score))
    
print("Mean Training ROC_AUC for {} solver: {}").format(solver, np.mean(scores_output))

prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver=solver) 
    classifier.fit(dev_counts, dev_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, dev_counts, dev_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('Dev data CV score for class {} is {}'.format(name, cv_score))
        
print("Mean Dev ROC_AUC for {} solver: {}").format(solver, np.mean(scores_output))


### Here's the same using tfidf and saga

In [None]:
from sklearn.metrics import auc
# SK-learn libraries for cross validation
from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split 
# Basic Logistic Regression Model/MultiLabel Edition

solver = 'saga'

print("Modelling with {} solver".format(solver))
prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver=solver)
    classifier.fit(train_tfidf_counts, train_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, train_tfidf_counts, train_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('Training data CV score for class {} is {}'.format(name, cv_score))

    
print("Mean Training ROC_AUC for {} solver: {}").format(solver, np.mean(scores_output))

prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver=solver) 
    classifier.fit(dev_tfidf_counts, dev_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, dev_tfidf_counts, dev_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('Dev data CV score for class {} is {}'.format(name, cv_score))

        
print("Mean Dev ROC_AUC for {} solver: {}").format(solver, np.mean(scores_output))


### Original counts with saga and L1

In [None]:
from sklearn.metrics import auc
# SK-learn libraries for cross validation
from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split 
# Basic Logistic Regression Model/MultiLabel Edition

solver = 'saga'

print("Modelling with {} solver".format(solver))
prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver=solver,penalty='l1')
    classifier.fit(train_counts, train_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, train_counts, train_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('Training data CV score for class {} is {}'.format(name, cv_score))

    
print("Mean Training ROC_AUC for {} solver: {}").format(solver, np.mean(scores_output))

prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver=solver,penalty='l1') 
    classifier.fit(dev_counts, dev_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, dev_counts, dev_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('Dev data CV score for class {} is {}'.format(name, cv_score))

        
print("Mean Dev ROC_AUC for {} solver: {}").format(solver, np.mean(scores_output))

### Tfidf with saga and L1

In [None]:
from sklearn.metrics import auc
# SK-learn libraries for cross validation
from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split 
# Basic Logistic Regression Model/MultiLabel Edition

solver = 'saga'

print("Modelling with {} solver".format(solver))
prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver=solver,penalty='l1')
    classifier.fit(train_tfidf_counts, train_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, train_tfidf_counts, train_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('Training data CV score for class {} is {}'.format(name, cv_score))

    
print("Mean Training ROC_AUC for {} solver: {}").format(solver, np.mean(scores_output))

prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver=solver,penalty='l1') 
    classifier.fit(dev_tfidf_counts, dev_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, dev_tfidf_counts, dev_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('Dev data CV score for class {} is {}'.format(name, cv_score))

        
print("Mean Dev ROC_AUC for {} solver: {}").format(solver, np.mean(scores_output))

#### Testing on Dev Data

In [None]:
from sklearn.metrics import auc, roc_curve
from sklearn import metrics

dev_Vector = CountVectorizer(ngram_range=(1,1))
dev_counts = countVector.fit_transform(dev_data)

pred_dt = pd.DataFrame()
scores_dev = []
for name in target_names:
    classifier = LogisticRegression(solver='sag') 
    classifier.fit(dev_counts, dev_labels[name])
    scores_dev.append(cv_score)
    output = classifier.predict(dev_counts)
    fpr, tpr, thresholds = metrics.roc_curve(dev_labels[name], output)
    print('Dev score for class {} is {}'.format(name, metrics.auc(fpr,tpr)))
    pred_dt[name] = classifier.predict_proba(dev_counts)[:, 1]
    
    
print("Mean(dev) ROC_AUC: {}").format(np.mean(scores_dev))

Score on dev set is worse than training set, thus evidence of overfitting and a need for performance improvement.

The target is multi-label since each observation can be classified as multiple fields.  This is an important distinction from multi-class where each prediction can only be one label.  

## Evaluation

In [None]:
count_df
train_labels["toxic"]

### Submission

In [None]:
from sklearn.metrics import auc
# SK-learn libraries for cross validation
from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split 
# Basic Logistic Regression Model/MultiLabel Edition

prediction_submission = pd.DataFrame()
prediction_submission["id"] = test_df["id"]

# new vector object for all train data for submission
finalTrainVector = CountVectorizer()
finalTrainCount = finalTrainVector.fit_transform(train_df["comment_text"])

# TODO: Using pipelines can clean up repetitive processes
# test set up
#testVector = CountVectorizer()
testCount = finalTrainVector.transform(test_df["comment_text"])

for name in target_names:
    classifier = LogisticRegression(solver='sag') #sag is one kind of solver optimize for multi-label
    clf = classifier.fit(finalTrainCount, train_df[name])
    prediction_submission[name] = clf.predict_proba(testCount)[:, 1]
    #print(prediction_submission)

    
print(prediction_submission.head(10)) # print frame output 
#prediction_submission.to_csv("submission.csv")

The frame contains the output for each class and is saved in a pandas data frame.  