## Baseline Submission: Toxic Language Classification 
**w207 Spring 2018 - Final Project Baseline**

**Team: Paul, Walt, Yisang, Joe**



### Project Description 

Our challenge is to build a multi-headed model capable of detecting different types of toxicity, such as threats, obscenity, insults, and identity-based hate. The toxic language data set is sourced from Wikipedia and is available as a public Kaggle data set.

Our goal is to apply machine learning techniques covered in class to develop high-quality ML models and pipelines.

1. Exercise and build upon concepts covered in class by testing at least 3 kinds of supervised models:
    a. Regression (LASSO, Logistic)
    b. Trees (RF, XGBoost)
    c. Deep learning (TensorFlow)
2. Use stacking/ensembling methods to improve prediction metrics (K-Means, anomaly detection)
3. Use unsupervised methods for feature engineering/selection

For the baseline proposal, this file contains a first-pass run-through, from data preprocessing to model evaluation, using a regression model pipeline.

https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge




### Data Ingestion

In [46]:
%matplotlib inline
import numpy as np
import pandas as pd
import string
import time
import os.path
import pickle

#sklearn imports
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search is deprecated
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import label_binarize


from sklearn import metrics

# sklearn.cross_validation is deprecated; model_selection provides the same utilities
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split


#NLTK imports

import nltk
from nltk.corpus import stopwords as sw
from nltk.corpus import wordnet as wn
from nltk.tokenize import punkt as punkt
from nltk import wordpunct_tokenize
from nltk import WordNetLemmatizer
from nltk import sent_tokenize
from nltk import pos_tag

# These imports enable the use of NLTKPreprocessor in an sklearn Pipeline
from sklearn.base import BaseEstimator, TransformerMixin


#scipy imports
from scipy.sparse import hstack

#Visualization imports
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec 
import bokeh
#! pip install bokeh

#General imports
import pprint

# target classes
target_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

In [47]:
# Read data frames locally from CSV
train_df = pd.read_csv("../data/train.csv")
test_df = pd.read_csv("../data/test.csv")

np.random.seed(455)

# Random boolean mask for splitting the training data (roughly 70% train, 30% dev)
# Note: because the seed is set above in this cell, rerunning the cell reproduces the same split.
randIndexCut = np.random.rand(len(train_df)) < 0.7

# Split up data
test_data = test_df["comment_text"]
dev_data, dev_labels = train_df[~randIndexCut]["comment_text"], train_df[~randIndexCut][target_names]
train_data, train_labels = train_df[randIndexCut]["comment_text"], train_df[randIndexCut][target_names]

print 'total training observations:', train_df.shape[0]
print 'training data shape:', train_data.shape
print 'training label shape:', train_labels.shape

print 'dev label shape:', dev_labels.shape
print 'labels names:', target_names

total training observations: 159571
training data shape: (111906,)
training label shape: (111906, 6)
dev label shape: (47665, 6)
labels names: ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
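
The seeded mask split used above can be illustrated without pandas or NumPy. The following is a minimal sketch, written for Python 3 with only the standard library; the helper name `split_mask` is hypothetical, and because it uses Python's `random` module instead of `np.random`, it will not reproduce the exact rows chosen in the notebook.

```python
import random

def split_mask(n_rows, train_fraction=0.7, seed=455):
    """Return a list of booleans; True marks a training row."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    return [rng.random() < train_fraction for _ in range(n_rows)]

mask_first = split_mask(1000)
mask_again = split_mask(1000)

# Rows where the mask is True go to train, the rest go to dev
train_rows = [i for i, keep in enumerate(mask_first) if keep]
dev_rows = [i for i, keep in enumerate(mask_first) if not keep]

print("same split on rerun: {}".format(mask_first == mask_again))
print("train + dev rows: {}".format(len(train_rows) + len(dev_rows)))
```

Because each row is assigned independently, the train share is only approximately 70%, which is consistent with the 111,906 / 47,665 split printed above.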


In [48]:
print train_labels

        toxic  severe_toxic  obscene  threat  insult  identity_hate
0           0             0        0       0       0              0
1           0             0        0       0       0              0
2           0             0        0       0       0              0
3           0             0        0       0       0              0
6           1             1        1       0       1              0
7           0             0        0       0       0              0
8           0             0        0       0       0              0
9           0             0        0       0       0              0
10          0             0        0       0       0              0
11          0             0        0       0       0              0
12          1             0        0       0       0              0
13          0             0        0       0       0              0
15          0             0        0       0       0              0
16          1             0        0       0    

### Text Processing

In [49]:
nltk.download('stopwords')

class NLTKPreprocessor(BaseEstimator, TransformerMixin):
    """Text preprocessor using NLTK tokenization and Lemmatization

    This class is to be used in an sklearn Pipeline, prior to other processors like PCA/LSA/classification
    Attributes:
        lower: A boolean indicating whether text should be lowercased by preprocessor
                default: True
        strip: A boolean indicating whether text should be stripped of surrounding whitespace, underscores and '*'
                default: True
        stopwords: A set of words to be used as stop words and thus ignored during tokenization
                default: built-in English stop words
        punct: A set of punctuation characters that should be ignored
                default: string.punctuation
        lemmatizer: An object that should be used to lemmatize tokens
    """

    def __init__(self, stopwords=None, punct=None,
                 lower=True, strip=True):
        """Initialize method for NLTKPreprocessor instance

        Simple initialization of specified instance variables:

        Args:
            self 
            stopwords: set of words to ignore as stop words, or a default set for English will be used
            punct: set of punctuation characters to strip, or a default set will be used
            lower: indicator of whether to convert all characters to lowercase, defaults to True
            strip: indicator of whether to strip whitespace, defaults to True

        Returns:
            N/A: instance initializer

        """
        self.lower      = lower
        self.strip      = strip
        self.stopwords  = stopwords or set(sw.words('english'))
        self.punct      = punct or set(string.punctuation)
        self.lemmatizer = WordNetLemmatizer()
        

    def fit(self, X, y=None):
        """Fit model with X and optional y

        This function does nothing but return self, since as a processor in the sklearn Pipeline this preprocessor
        has nothing analogous to "fit" logic. The tokenization logic is independent of specific dataset training, 
        and is fully realized in the transform() function. 
        This function exists as implementation of sklearn.BaseEstimator, for use in Pipeline.

        Args:
            self 
            X (array-like): independent variable
            y (array-like): dependent variable
            
        Returns:
            NLTKPreprocessor: self
        """
        return self

    def inverse_transform(self, X):
        """Function exists as implementation of sklearn.BaseEstimator, for use in Pipeline.
        This is simply for complying with interface.

        Args:
            self 
            X (array-like): input documents
            
        Returns:
            string: joined documents
        """
        return [" ".join(doc) for doc in X]

    def transform(self, X):
        """Transform input X to produce output to be processed by next element in sklearn Pipeline

        This triggers the tokenization/lemmatization of the source documents.
        This is invoked by the sklearn Pipeline.

        Args:
            self 
            X: input documents to be tokenized
            
        Returns:
            list: tokenized documents reduced to simplest lemma form
        """
        return [
            list(self.tokenize(doc)) for doc in X
        ]

    
    def tokenize(self, document):
        """Tokenize an input document, converting from a block of text into sentences, into tagged tokens,
        generating a set of lemmas.

        This method does the preprocessing work of sentence-based tokenization and then reduces words to lemmas

        Args:
            self 
            document (str): raw text of a single document
            
        Returns:
            Iterator[str]: an iterator over the tokens produced from the input documents
        """
        # Break the document into sentences. This is necessary for part-of-speech tagging.
        for sent in sent_tokenize(unicode(document,'utf-8')):

            # Break the sentence into part of speech tagged tokens
            for token, tag in pos_tag(wordpunct_tokenize(sent)):
                # Apply preprocessing to the token
                token = token.lower() if self.lower else token
                token = token.strip() if self.strip else token
                token = token.strip('_') if self.strip else token
                token = token.strip('*') if self.strip else token

                # If stopword, ignore token and continue
                if token in self.stopwords:
                    continue

                # If punctuation, ignore token and continue
                if all(char in self.punct for char in token):
                    continue

                # Lemmatize the token and yield
                lemma = self.lemmatize(token, tag)
                yield lemma

                
    def lemmatize(self, token, tag):
        """Convert a token into the appropriate lemma

        Method uses the NLTK WordNetLemmatizer for part-of-speech tag-based lemmatization of words.

        Args:
            self 
            token: input word
            tag: part-of-speech tag
            
        Returns:
            string: lemma
        """
        tag = {
            'N': wn.NOUN,
            'V': wn.VERB,
            'R': wn.ADV,
            'J': wn.ADJ
        }.get(tag[0], wn.NOUN)

        return self.lemmatizer.lemmatize(token, tag)

def identity(arg):
    """ Simple identity function works as a passthrough.

        This function will be used with the Vectorizer classes, when tokenization will have been performed already.
        In this scenario, the Vectorizer class will call this function in place of its normal tokenizer,
        and this function will simply return the already-tokenized document.
        
        Args:
            arg (list): tokens of one document, as produced by NLTKPreprocessor
            
        Returns:
            list: the input tokens, unchanged
    """
    return arg

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/burgew/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
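
The token filtering inside `tokenize()` can be tried in isolation. The sketch below, written for Python 3, mirrors only the lowercase/strip/stop-word/punctuation steps; it deliberately skips sentence splitting, part-of-speech tagging and lemmatization so that it runs without NLTK. The stop-word set `TOY_STOPWORDS` and the function name `filter_tokens` are hypothetical stand-ins, not part of the notebook.

```python
import string

# Hypothetical, tiny stop-word list for illustration only;
# the notebook uses NLTK's full English list instead.
TOY_STOPWORDS = {"the", "a", "is", "you", "are"}
PUNCT = set(string.punctuation)

def filter_tokens(tokens, stopwords=TOY_STOPWORDS, punct=PUNCT):
    """Lowercase and strip tokens, then drop stop words and pure punctuation."""
    kept = []
    for token in tokens:
        # Same order as the class above: whitespace, then '_', then '*'
        token = token.lower().strip().strip("_").strip("*")
        if token in stopwords:
            continue
        # all() is True for an empty string, so emptied tokens are dropped too
        if all(char in punct for char in token):
            continue
        kept.append(token)
    return kept

sample = ["You", "are", "*REALLY*", "the", "worst", "!!!", "_editor_", "."]
print(filter_tokens(sample))
```

Note that stop words are matched after lowercasing and stripping, so a decorated token such as `*The*` would also be removed.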


### Final Text Preprocessing - training data

### Text Preprocessing
This block uses the NLTKPreprocessor to tokenize the input data and then the TfidfVectorizer to vectorize it. The NLTKPreprocessor ignores English stop words and lemmatizes where possible. The vectorizer builds unigram and bigram features and ignores terms occurring in fewer than 5 documents or in more than 70% of documents, which reduces the size of the vocabulary significantly. The vectorizer also limits the total number of features to 6,000 (max_features=6000), keeping the terms that occur most frequently across the corpus.

Note that the default tokenization in TfidfVectorizer is disabled here, since tokenization is handled by the NLTKPreprocessor. Timing the two steps separately made it clear that tokenization takes far more time than vectorization.
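
To make the effect of `min_df` and `max_features` concrete, here is a simplified sketch of vocabulary pruning in plain Python 3. It is not scikit-learn's implementation: it ignores `max_df` and n-grams, and the alphabetical tie-breaking is an assumption made for this example. The function name `build_vocabulary` and the toy documents are hypothetical.

```python
from collections import Counter

def build_vocabulary(tokenized_docs, min_df=2, max_features=3):
    """Keep terms found in at least min_df documents, then cap the list at
    max_features terms ranked by total occurrences across the corpus."""
    doc_freq = Counter()
    term_freq = Counter()
    for doc in tokenized_docs:
        term_freq.update(doc)
        doc_freq.update(set(doc))  # count each term once per document
    frequent = [t for t in term_freq if doc_freq[t] >= min_df]
    # Sort by descending corpus frequency, breaking ties alphabetically
    frequent.sort(key=lambda t: (-term_freq[t], t))
    return sorted(frequent[:max_features])

toy_docs = [
    ["stop", "vandalize", "page"],
    ["page", "edit", "page"],
    ["stop", "edit", "war"],
    ["thanks", "edit"],
]
print(build_vocabulary(toy_docs))
print(build_vocabulary(toy_docs, max_features=2))
```

Terms that appear in a single toy document ("vandalize", "war", "thanks") are dropped by the document-frequency filter before the feature cap is applied.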

In [50]:
pp = pprint.PrettyPrinter(indent=4)

np.random.seed(455)

# This preprocessor will be used to process data prior to vectorization
nltkPreprocessor = NLTKPreprocessor()
    
# Note that this vectorizer is created with a passthrough tokenizer (identity), no preprocessor and no lowercasing.
# This is to account for the NLTKPreprocessor already taking care of tokenization and stop-word removal.
# stop_words is left as None: the earlier value {'english'} is a one-element set that only removes the token "english".
tfidfVector = TfidfVectorizer(ngram_range=(1,2), min_df=5, max_df=.7, max_features=6000,
                              tokenizer=identity, preprocessor=None, lowercase=False, stop_words=None)

# Check if there is a serialized copy of the preprocessed training data, and if not, then perform text preprocessing
# and save the serialized result for reuse.
pickle_file_name = 'train_preproc_data.pickle'
if (not os.path.exists(pickle_file_name)):
    print "Starting preprocessing of training data..."
    start_train_preproc = time.time()
    nltkPreprocessor.fit(train_data)
    train_preproc_data = nltkPreprocessor.transform(train_data)
    finish_train_preproc = time.time()
    print "Completed tokenization/preprocessing of training data in {:.2f} seconds".format(finish_train_preproc-start_train_preproc)
    
    # Pickle data is binary, so the file is opened in binary mode
    with open(pickle_file_name,'wb') as pickle_file:
        pickle.dump(train_preproc_data,pickle_file)
else:
    # If the serialized file already exists, simply load it for the next step of the process.
    with open(pickle_file_name,'rb') as pickle_file:
        train_preproc_data = pickle.load(pickle_file)

# Check if there is a serialized copy of the vectorized counts, and if not regenerate the matrix and save the
# serialized result for reuse.        
pickle_file_name = 'train_tfidf_counts.6000.pickle'
if (not os.path.exists(pickle_file_name)):
    
    # Generating new TF-IDF train counts means we need to then re-apply LSA to the results, so remove the LSA results
    #remove_file('lsa_train_counts.pickle')
    print "Starting vectorization of training data..."
    start_train_vectors = time.time()
    train_tfidf_counts = tfidfVector.fit_transform(train_preproc_data)
    finish_train_vectors = time.time()
    print "Completed vectorization of training data in {:.2f} seconds".format(finish_train_vectors-start_train_vectors)
    
    # Pickle data is binary, so the file is opened in binary mode
    with open(pickle_file_name,'wb') as pickle_file:
        pickle.dump(train_tfidf_counts,pickle_file)
else:
    # If the serialized file already exists, simply load it for the next step of the process.
    with open(pickle_file_name,'rb') as pickle_file:
        train_tfidf_counts = pickle.load(pickle_file)
    
# Check if there is a serialized copy of the preprocessed dev data, and if not, then perform text preprocessing
# and save the serialized result for reuse.
pickle_file_name = 'dev_preproc_data.pickle'
if (not os.path.exists(pickle_file_name)):
    print "\nStarting preprocessing of dev data..."
    start_dev_preproc = time.time()
    nltkPreprocessor.fit(dev_data)
    dev_preproc_data = nltkPreprocessor.transform(dev_data)
    finish_dev_preproc = time.time()
    print "Completed tokenization/preprocessing of dev data in {:.2f} seconds".format(finish_dev_preproc-start_dev_preproc)

    # Pickle data is binary, so the file is opened in binary mode
    with open(pickle_file_name,'wb') as pickle_file:
        pickle.dump(dev_preproc_data,pickle_file)
else:
    # If the serialized file already exists, simply load it for the next step of the process.
    with open(pickle_file_name,'rb') as pickle_file:
        dev_preproc_data = pickle.load(pickle_file)
    
pickle_file_name = 'dev_tfidf_counts.6000.pickle'
if (not os.path.exists(pickle_file_name)):
    
    
    # Generating new TF-IDF dev counts means we need to then re-apply LSA to the results, so remove the LSA results
    #remove_file('lsa_dev_counts.pickle')
    
    print "Starting vectorization of dev data..."
    start_dev_vectors = time.time()
    dev_tfidf_counts = tfidfVector.transform(dev_preproc_data)
    finish_dev_vectors = time.time()
    print "Completed vectorization of dev data in {:.2f} seconds".format(finish_dev_vectors-start_dev_vectors)


    print("\nVocabulary (tfidf) size is: {}".format(len(tfidfVector.vocabulary_)))
    # vocabulary_ maps each term to its column index in the TF-IDF matrix (not to a count)
    vocab_entries = pd.Series(tfidfVector.vocabulary_).to_frame()
    vocab_entries.columns = ['index']
    vocab_entries = vocab_entries.sort_values(by='index')

    print("Sample vocabulary from TfidfVectorizer:")
    pp.pprint(vocab_entries.head(10))
    print("...")
    pp.pprint(vocab_entries.tail(10))
    print("Number of nonzero entries in matrix: {}".format(train_tfidf_counts.nnz))

    # Pickle data is binary, so the file is opened in binary mode
    with open(pickle_file_name,'wb') as pickle_file:
        pickle.dump(dev_tfidf_counts,pickle_file)
else:
    # If the serialized file already exists, simply load it for the next step of the process.
    with open(pickle_file_name,'rb') as pickle_file:
        dev_tfidf_counts = pickle.load(pickle_file)


# Print a sample of row-wise label sums; we can see that an observation can belong to multiple classes.
count_df = pd.DataFrame(train_labels.apply(np.sum,1), columns = ["counts"])
count_df = count_df[((count_df["counts"] >= 1))]
count_df.head(10)



    counts
6        4
12       1
16       1
42       4
43       3
44       1
51       2
55       4
56       3
58       2
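
The cell above repeats the same "load the pickle if it exists, otherwise compute and save it" pattern for every intermediate result. One way to factor that out is sketched below for Python 3; the helper name `load_or_compute` and the toy `expensive_step` function are hypothetical and only stand in for steps such as tokenization or vectorization.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Return the object pickled at path, computing and saving it first if absent."""
    if os.path.exists(path):
        with open(path, "rb") as handle:  # pickle files are binary
            return pickle.load(handle)
    result = compute()
    with open(path, "wb") as handle:
        pickle.dump(result, handle)
    return result

calls = []

def expensive_step():
    calls.append(1)  # record each real computation
    return [["toy", "tokens"], ["more", "tokens"]]

cache_path = os.path.join(tempfile.mkdtemp(), "toy_preproc.pickle")
first = load_or_compute(cache_path, expensive_step)   # computes and saves
second = load_or_compute(cache_path, expensive_step)  # loads from disk
print("computed {} time(s); results equal: {}".format(len(calls), first == second))
```

A caveat with any such cache: a stale pickle is silently reused if upstream parameters change, which is presumably why the notebook encodes settings such as the feature count in its file names.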


### Final LSA Feature Selection - training data

### PCA/LSA
Principal Component Analysis (PCA) and Latent Semantic Analysis (LSA) both use Singular Value Decomposition to reduce the dimensionality of a dataset. PCA is applied to a term-covariance matrix, whereas LSA is applied to a term-document matrix. As such, LSA is appropriate for machine learning algorithms that take their features from the scikit-learn TfidfVectorizer. Additionally, PCA as implemented in scikit-learn at the time of this project could not handle the sparse matrices produced by such vectorization tools, whereas TruncatedSVD can.
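
As a toy illustration of the idea, the sketch below estimates a single SVD component by power iteration on a small dense matrix, then reuses that one fitted component to project both the training rows and the dev rows. It is written for Python 3 with only the standard library and is not how `TruncatedSVD` computes its result; the function names and the matrices are hypothetical.

```python
import math

def fit_top_component(rows, iterations=100):
    """Estimate the leading right singular vector of a small dense matrix
    by power iteration on A^T A."""
    n_cols = len(rows[0])
    vec = [1.0] * n_cols
    for _ in range(iterations):
        # scores = A v, then new_vec = A^T scores
        scores = [sum(r[j] * vec[j] for j in range(n_cols)) for r in rows]
        new_vec = [sum(rows[i][j] * scores[i] for i in range(len(rows)))
                   for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in new_vec))
        vec = [x / norm for x in new_vec]
    return vec

def project(rows, component):
    """Project each row onto a previously fitted component."""
    return [sum(r[j] * component[j] for j in range(len(component))) for r in rows]

train = [[1.0, 1.0, 0.0], [2.0, 2.0, 0.0], [0.0, 0.0, 0.5]]
dev = [[3.0, 3.0, 0.0], [0.0, 0.0, 1.0]]

component = fit_top_component(train)   # fit on training rows only
train_reduced = project(train, component)
dev_reduced = project(dev, component)  # reuse the same component for dev
print([round(x, 3) for x in component])
print([round(x, 3) for x in dev_reduced])
```

The point of the example is the fit/transform split: the component is learned once from the training rows, so reduced dev features live in the same space as the reduced training features.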

In [51]:
# Set the number of principal components to identify for use in classification processes
target_components = 3000

# Check if there is a serialized copy of the Principal Components data for the training dataset, and if not then
# perform LSA processing and save the serialized result for reuse.
pickle_file_name = 'lsa_train_counts.3000.pickle'
if (not os.path.exists(pickle_file_name)):
    svd = TruncatedSVD(n_components=target_components, algorithm='arpack')
    print "Starting LSA on train counts with {} components...".format(target_components)
    train_start=time.time()
    lsa_train_counts = svd.fit_transform(train_tfidf_counts)
    train_stop=time.time()
    print "Train counts transform took {:.2f} minutes.".format((train_stop-train_start)/60)
    
    # Pickle data is binary, so the file is opened in binary mode
    with open(pickle_file_name,'wb') as pickle_file:
        pickle.dump(lsa_train_counts,pickle_file)
else:
    # If the serialized file already exists, simply load it for the next step of the process.
    with open(pickle_file_name,'rb') as pickle_file:
        lsa_train_counts = pickle.load(pickle_file)
 
# Check if there is a serialized copy of the Principal Components data for the dev dataset, and if not then
# perform LSA processing and save the serialized result for reuse.
pickle_file_name = 'lsa_dev_counts.3000.pickle'
if (not os.path.exists(pickle_file_name)):
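    print "Starting LSA on dev counts with {} components...".format(target_components)
    dev_start=time.time()
    # Reuse the SVD fitted on the training counts: refitting on dev would project the dev data into a different
    # component space than the one the classifier is trained on. This assumes svd was fitted above in this session.
    lsa_dev_counts = svd.transform(dev_tfidf_counts)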
    dev_stop=time.time()
    print "Dev counts transform took {:.2f} minutes.".format((dev_stop-dev_start)/60)
    
    # Pickle data is binary, so the file is opened in binary mode
    with open(pickle_file_name,'wb') as pickle_file:
        pickle.dump(lsa_dev_counts,pickle_file)
else:
    # If the serialized file already exists, simply load it for the next step of the process.
    with open(pickle_file_name,'rb') as pickle_file:
        lsa_dev_counts = pickle.load(pickle_file)

### Final MLPClassifier Training and Submission

### Text Classification with Neural Net (sklearn.MLPClassifier)
In choosing a neural net model for text classification, the output layer should have the same number of nodes as the number of classification labels. In this case there are 6 labels, so both the output layer and the final hidden layer have 6 nodes. The input layer normally has the same number of nodes as there are features, and ideally the size of the initial hidden layer falls between that and the number of classes.

In this case, we're limiting our feature set to 3,000 principal components (target_components above), and it was not possible to use an initial hidden layer anywhere near that size when running this process on a MacBook. Setting the initial hidden layer to 12 nodes at least kept it smaller than the number of features and larger than the number of output classes. This (12,6) model is the one that ended up producing the best (most accurate) results.

Note that, as a nod toward deeper learning, a (10,8,6) model was also tested, but it showed signs of overfitting, with a significantly higher accuracy score on training data than on dev data.
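
A quick way to see why a wide first hidden layer is expensive is to count the weights and biases of a fully connected network. The sketch below does this arithmetic for the 3,000 LSA components set by `target_components` and 6 output labels; the function name `mlp_parameter_count` is hypothetical, and the count covers only weights and biases, not optimizer state.

```python
def mlp_parameter_count(n_features, hidden_layer_sizes, n_outputs):
    """Count weights and biases in a fully connected network."""
    sizes = [n_features] + list(hidden_layer_sizes) + [n_outputs]
    total = 0
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix plus bias vector
    return total

small = mlp_parameter_count(3000, (12, 6), 6)
wide = mlp_parameter_count(3000, (3000, 6), 6)
print("(12, 6) network parameters: {}".format(small))
print("(3000, 6) network parameters: {}".format(wide))
```

Under these assumptions the (12, 6) network has 36,132 parameters, while a first hidden layer as wide as the input would need roughly 9 million.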

In [55]:

    
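# This MLPClassifier will be fit using training data and subsequently used to predict labels using dev
# data, for scoring. Note that scikit-learn only uses learning_rate='adaptive' when solver='sgd', so it has
# no effect with the 'adam' solver chosen here.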
classifier = MLPClassifier(hidden_layer_sizes=(12,6), solver='adam', early_stopping=False, activation='relu',
                           tol=1e-13, alpha=1, learning_rate='adaptive', learning_rate_init=0.01, )

# Fit using the training data and time the process for reference
full_train_start = time.time()
classifier.fit(lsa_train_counts, train_labels)
full_train_stop = time.time()

duration = (full_train_stop-full_train_start)/60
print('Fitting train data completed, after {:.2f} minutes.'.format(duration))

# Generate predictions using the dev LSA data and collect a series of scores
dev_pred = classifier.predict(lsa_dev_counts)
acc_score = metrics.accuracy_score(dev_labels, dev_pred)

# Note that, since this is multilabel data, accuracy_score above reports subset accuracy (all six labels must match).
# With average=None, the scores below are returned per label rather than averaged across labels.
precision_recall_fscore = metrics.precision_recall_fscore_support(dev_labels, dev_pred, average=None)
precision = metrics.precision_score(dev_labels, dev_pred, average=None)
recall = metrics.recall_score(dev_labels, dev_pred, average=None)

# Prediction probabilities will be saved for comparison with other models and processing by ensembles
predict_probs = classifier.predict_proba(lsa_dev_counts)

print("Accuracy score from dev predict: {}".format(acc_score))
print("Precision score from dev predict: {}".format(precision))
print("Recall score from dev predict: {}".format(recall))

# Fitting again with binarized labels and predicting again to support per-label roc_auc scores
binarized_train_labels = label_binarize(train_labels, classes=[0, 1, 2, 3, 4, 5])
binarized_dev_labels = label_binarize(dev_labels, classes=[0, 1, 2, 3, 4, 5])

# metrics.roc_curve works on a single binary target, so the ROC curve and AUC are computed one label at a time.
# The model is re-trained here on the binarized training labels and its predicted probabilities are used
# to derive the ROC AUC for each label. This is mainly for comparison with other models, since these numbers
# come from a separate fit and are not directly tied to the predictions scored above.
print("Re-fitting and scoring for per-label roc_auc scores...")
y_score = classifier.fit(lsa_train_counts, binarized_train_labels).predict_proba(lsa_dev_counts)
fpr = dict()
tpr = dict()
roc_auc = []
for ind, label in enumerate(target_names):
    fpr[ind], tpr[ind], _ = metrics.roc_curve(binarized_dev_labels[:, ind], y_score[:, ind])
    roc_auc.append(metrics.auc(fpr[ind], tpr[ind]))

print "ROC AUC score from dev predict: ", roc_auc

# To save the complete collection of scores, a pandas.DataFrame is created and written to
# "scoring.csv".
scoring_arr = np.asarray(precision_recall_fscore)
scoring_arr = np.vstack([scoring_arr,roc_auc])
scoring_submission = pd.DataFrame(data=scoring_arr, columns=target_names, index=['precision', 'recall', 
                                                                                 'fbeta_score', 'support', 'roc_auc'])
print("Precision, recall, fbeta_score, support and ROC AUC:")
print(scoring_submission)
scoring_submission.to_csv("scoring.csv")

# The predicted probabilities from the initial version of the model will be saved in CSV file "submission.csv"
prediction_submission = pd.DataFrame(data=predict_probs,columns=target_names)
print(prediction_submission[0:10]) # print frame output 
prediction_submission.to_csv("submission.csv")



Fitting train data completed, after 1.72 minutes.
Accuracy score from dev predict: 0.85456834155
Precision score from dev predict: [0.07733333 0.         0.01965602 0.         0.01552795 0.        ]
Recall score from dev predict: [0.03777681 0.         0.00623296 0.         0.00418936 0.        ]
Re-fitting and scoring for per-label roc_auc scores...
ROC AUC score from dev predict:  [0.6003087085964923, 0.6995730899650312, 0.6386008461199412, 0.6475760620854013, 0.6260208752175941, 0.6392196850895646]
Precision, recall, fbeta_score, support and ROC AUC:
                   toxic  severe_toxic      obscene      threat       insult  \
precision       0.077333      0.000000     0.019656    0.000000     0.015528   
recall          0.037777      0.000000     0.006233    0.000000     0.004189   
fbeta_score     0.050758      0.000000     0.009465    0.000000     0.006598   
support      4606.000000    480.000000  2567.000000  155.000000  2387.000000   
roc_auc         0.600309      0.699573  
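
The gap between the accuracy score and the per-label precision and recall above is easier to interpret with a toy case. For multilabel input, `accuracy_score` reports subset accuracy, so a model that never flags anything can still score well when most rows carry no labels. The sketch below uses made-up labels and hypothetical helper names to show the effect in plain Python 3; it is not a diagnosis of this particular model.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of rows whose full label vector is predicted exactly."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / float(len(y_true))

def precision_recall(y_true, y_pred, label):
    """Precision and recall for one label column (0.0 when undefined)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t[label] == 1 and p[label] == 1)
    fp = sum(1 for t, p in pairs if t[label] == 0 and p[label] == 1)
    fn = sum(1 for t, p in pairs if t[label] == 1 and p[label] == 0)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up toy labels: 9 clean rows and 1 toxic row, two label columns
y_true = [[0, 0]] * 9 + [[1, 0]]
y_pred = [[0, 0]] * 10  # a model that never flags anything

print(subset_accuracy(y_true, y_pred))
print(precision_recall(y_true, y_pred, label=0))
```

For imbalanced labels like these, per-label precision, recall and ROC AUC are generally more informative than accuracy alone.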

### Submission - based on test preprocessing, LSA feature selection and MLPClassifier training

### Submission

In [54]:
prediction_submission = pd.DataFrame()
prediction_submission["id"] = test_df["id"]

# new vector object for all train data for submission
finalTrainVector = CountVectorizer()
finalTrainCount = finalTrainVector.fit_transform(train_df["comment_text"])

# TODO: Using pipelines can clean up repetitive processes
# test set up
#testVector = CountVectorizer()
testCount = finalTrainVector.transform(test_df["comment_text"])

for name in target_names:
    # 'sag' (stochastic average gradient) is a solver suited to large datasets; one binary model is fit per label
    classifier = LogisticRegression(solver='sag')
    clf = classifier.fit(finalTrainCount, train_df[name])
    prediction_submission[name] = clf.predict_proba(testCount)[:, 1]
    #print(prediction_submission)

    
print(prediction_submission.head(10)) # print frame output 
# index=False omits the row index so that only the id and label columns are written
prediction_submission.to_csv("submission.csv", index=False)

                 id     toxic  severe_toxic   obscene        threat    insult  \
0  00001cee341fdb12  0.889067      0.002657  0.495938  2.411054e-05  0.431924   
1  0000247867823ef7  0.250678      0.068930  0.195161  3.597304e-02  0.194258   
2  00013b17ad220c46  0.432563      0.397684  0.430576  4.030110e-01  0.428336   
3  00017563c3f7919a  0.068290      0.001655  0.038274  1.869286e-04  0.025554   
4  00017695ad8997eb  0.424790      0.363867  0.421929  3.636448e-01  0.413012   
5  0001ea8717f6de06  0.340943      0.092428  0.259784  3.285916e-02  0.277697   
6  00024115d4cbde0f  0.124710      0.007629  0.072675  2.157967e-03  0.067466   
7  000247e83dcc1211  0.452275      0.318918  0.400078  2.501187e-01  0.417772   
8  00025358d4737918  0.006207      0.000003  0.001180  5.665929e-07  0.000461   
9  00026d1092fe71cc  0.044106      0.000410  0.017836  9.634092e-05  0.010188   

   identity_hate  
0       0.000399  
1       0.079082  
2       0.409280  
3       0.000478  
4       0.368

The data frame holds the predicted probability of each class for every test comment and is written to submission.csv.
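
For reference, the file layout this produces can be sketched with the standard `csv` module. The example below assumes the submission is expected to contain an `id` column followed by one probability column per label and no extra index column; the ids and probabilities are made up, and `write_submission` is a hypothetical helper, not part of the notebook.

```python
import csv
import io

TARGET_NAMES = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def write_submission(rows, handle):
    """Write (id, probabilities) pairs as CSV: an id column plus one column per label."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["id"] + TARGET_NAMES)
    for comment_id, probs in rows:
        writer.writerow([comment_id] + ["{:.6f}".format(p) for p in probs])

# Hypothetical ids and probabilities, for illustration only
toy_rows = [
    ("toy_id_1", [0.9, 0.0, 0.5, 0.0, 0.4, 0.0]),
    ("toy_id_2", [0.1, 0.0, 0.0, 0.0, 0.0, 0.0]),
]
buffer = io.StringIO()
write_submission(toy_rows, buffer)
print(buffer.getvalue())
```

With pandas, the same layout is obtained by passing `index=False` to `to_csv`, which prevents the row index from being written as an unnamed first column.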