# Toxic Comment Challenge - A Multilabel Classification Problem
#### Authored by Megan Yow
Jun 18, 2020  
V1 - initial run of notebook  
V2 - Made Submission Files for Binary and Chain Classification
V3 - Advanced Setting for saving output (no code changes made)

This kernel is inspired by:
- kernel by Jeremy Howard : _NB-SVM strong linear baseline + EDA (0.052 lb)_
- kernel by Issac : _logistic regression (0.055 lb)_
- _Solving Multi-Label Classification problems_, https://www.analyticsvidhya.com/blog/2017/08/introduction-to-multi-label-classification/
- **Heavily inspired by** the notebook by Rhodium Beng: _Classifying multi-label comments (0.9741 lb)_
- Submitting from a kernel: https://www.kaggle.com/dansbecker/submitting-from-a-kernel + Advanced Settings (save output for this version)

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns
import re

## Load training and test data

In [None]:
# import os
# os.chdir('../input/jigsaw-toxic-comment-classification-challenge/')

In [None]:
train_df = pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/train.csv.zip')
test_df = pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/test.csv.zip')

In [None]:
"{:.2f} MB".format(train_df.memory_usage(deep=True).sum() / 1024**2) # memory_usage returns bytes; divide by 1024**2 for MB

In [None]:
train_df.head()

## Examine the data (EDA)

In [None]:
test_df.head()

In the training data, each comment can be labelled with any of six categories: toxic, severe toxic, obscene, threat, insult and identity hate. A comment may carry several labels or none at all, so this is a multi-label classification problem.

In [None]:
cols_target = ['obscene','insult','toxic','severe_toxic','identity_hate','threat']

In [None]:
# check missing values in every column
train_df.isna().sum() # no missing data

In [None]:
train_df.describe() #tag rates of each column

In [None]:
unlabelled_in_all = train_df[(train_df['toxic']!=1) & (train_df['severe_toxic']!=1) & (train_df['obscene']!=1) & 
                            (train_df['threat']!=1) & (train_df['insult']!=1) & (train_df['identity_hate']!=1)]
print('Percentage of unlabelled comments is ', len(unlabelled_in_all)/len(train_df)*100)

In [None]:
test_df.isna().sum() # no missing data in test set as well

All rows in the training and test data contain comments, so there's no need to clean up null fields.

In [None]:
# let's see the total rows in train, test data and the numbers for the various categories
print('Total rows in test is {}'.format(len(test_df)))
print('Total rows in train is {}'.format(len(train_df)))
print(train_df[cols_target].sum())

As the percentage computed above shows, the majority of the comments in the training data carry none of these labels.

In [None]:
# Let's look at the character length for the rows in the training data and record these
train_df['char_length'] = train_df['comment_text'].apply(lambda x: len(str(x)))

In [None]:
# look at the histogram plot for text length
sns.set()
train_df['char_length'].hist()
plt.show()

Most comments are under 500 characters long, with some running up to 5,000 characters.

Next, let's examine the correlations among the target variables.

In [None]:
data = train_df[cols_target]

In [None]:
colormap = plt.cm.plasma
plt.figure(figsize=(7,7))
plt.title('Correlation among target labels',y=1.05,size=14)
sns.heatmap(data.astype(float).corr(),linewidths=0.1,vmax=1.0,square=True,cmap=colormap,
           linecolor='white',annot=True)
plt.show()

Indeed, some of the labels are highly correlated: insult-obscene has the highest correlation at 0.74, followed by toxic-obscene and toxic-insult.
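
For 0/1 columns, the Pearson correlation that `DataFrame.corr()` computes by default is the same quantity as the phi coefficient of the two labels. Here is a minimal pure-Python sketch on made-up label vectors. The values are illustrative assumptions, not taken from the dataset, and `pearson` is a hypothetical helper.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy labels for eight comments (1 = tagged, 0 = not tagged).
obscene = [1, 1, 1, 0, 0, 0, 0, 0]
insult = [1, 1, 0, 1, 0, 0, 0, 0]

# Two of the three obscene comments are also insults, so the correlation is positive.
print(round(pearson(obscene, insult), 3))
```

A value near 1 means the two tags almost always co-occur; a value near 0 means that knowing one tag says little about the other.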

What about the character length & distribution of the comment text in the test data?

In [None]:
test_df['char_length'] = test_df['comment_text'].apply(lambda x: len(str(x)))

In [None]:
plt.figure()
plt.hist(test_df['char_length'])
plt.show()

The shape of the character length distribution looks similar between the training data and the test data. My guess is that the training comments were clipped to 5,000 characters to make things easier for the people who labelled the comment categories.

## Clean up the comment text

In [None]:
def clean_text(text):
    """Lowercase, expand common contractions, and strip non-word characters."""
    text = text.lower()
    text = re.sub(r"what's", "what is ", text)
    # handle 'scuse before the generic 's rule, which would otherwise consume it
    text = re.sub(r"\'scuse", " excuse ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "cannot ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r'\W', ' ', text)   # replace non-word characters with spaces
    text = re.sub(r'\s+', ' ', text)  # collapse runs of whitespace
    text = text.strip(' ')
    return text

In [None]:
train_df[15:20]

In [None]:
cleaned_df = train_df.copy()
cleaned_df['comment_text'] = cleaned_df['comment_text'].map(clean_text)

In [None]:
cleaned_df[15:20]

In [None]:
test_df['comment_text'] = test_df['comment_text'].map(clean_text)


## Define X from entire train & test data for use in tokenization by Vectorizer

In [None]:
cleaned_df = cleaned_df.drop('char_length',axis=1)

In [None]:
X = cleaned_df.comment_text
y_all = cleaned_df[cols_target]
test_X = test_df.comment_text

In [None]:
print(X.shape, test_X.shape)

## Vectorize the data

In [None]:
# import and instantiate TfidfVectorizer (the feature extractor)
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(max_features=5000,stop_words='english') # keep the 5,000 most frequent terms
vect

In [None]:
# learn the vocabulary in the training data, then use it to create a document-term matrix
X_dtm = vect.fit_transform(X)
# examine the document-term matrix created from X
X_dtm

In [None]:
vect.get_feature_names()[400:405] # the first few hundred features are numbers; on scikit-learn >= 1.0 use get_feature_names_out()

In [None]:
X_dtm.shape

In [None]:
# transform the test data using the earlier fitted vocabulary, into a document-term matrix
test_X_dtm = vect.transform(test_X)
# examine the document-term matrix from test_X
test_X_dtm
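
The fit/transform split used above can be pictured with a much simpler stand-in. The sketch below uses plain token counts with no TF-IDF weighting and no stop-word removal, and the helpers `fit_vocabulary` and `transform` are hypothetical names, not part of scikit-learn. It only illustrates why the vocabulary is learned once on the training text and then reused for the test text.

```python
from collections import Counter

def fit_vocabulary(docs, max_features):
    """Keep the max_features most frequent tokens, mapped to column indices."""
    counts = Counter(token for doc in docs for token in doc.split())
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    top = sorted(counts, key=lambda t: (-counts[t], t))[:max_features]
    return {token: col for col, token in enumerate(sorted(top))}

def transform(docs, vocabulary):
    """Turn each document into a row of token counts; unseen tokens are dropped."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for token in doc.split():
            if token in vocabulary:
                row[vocabulary[token]] += 1
        rows.append(row)
    return rows

# Hypothetical toy corpora.
train_docs = ["you are great", "you are rude", "great article"]
test_docs = ["rude article", "you are awful"]

vocab = fit_vocabulary(train_docs, max_features=4)
train_rows = transform(train_docs, vocab)
test_rows = transform(test_docs, vocab)  # same columns as train_rows

print(vocab)
print(test_rows)
```

Because the test rows are built from the training vocabulary, both matrices have the same number of columns, and a word such as "awful" that never appeared in training is simply ignored.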

## Solving a multi-label classification problem
One way to approach a multi-label classification problem is to transform it into one or more single-label classification problems. This is known as 'problem transformation'. There are three common methods:
* _**Binary Relevance.**_ Probably the simplest method: it treats each label as a separate binary classification problem. The key assumption, though, is that the labels are uncorrelated.
* _**Classifier Chains.**_ The first classifier is trained on the input X. Each subsequent classifier is then trained on X plus the predictions of all previous classifiers in the chain. This method tries to exploit the correlation among the preceding target variables.
* _**Label Powerset.**_ This method transforms the problem into a multi-class problem whose classes are all the unique label combinations. With six labels, as here, Label Powerset in effect turns this into a 2^6 = 64-class problem. (Thanks to Joshua for pointing this out.)
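
To make two of these transformations concrete, here is a small sketch on a made-up target matrix. Three labels are used instead of six to keep it readable, and the rows of `Y` are hypothetical, not drawn from the dataset. It shows how Binary Relevance and Label Powerset reshape the same targets.

```python
labels = ["toxic", "obscene", "insult"]  # three labels to keep the toy small

# Hypothetical multi-label targets: one row per comment, one 0/1 flag per label.
Y = [
    [0, 0, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 1, 1],
]

# Binary Relevance: one independent binary target vector per label.
binary_targets = {name: [row[i] for row in Y] for i, name in enumerate(labels)}

# Label Powerset: each distinct label combination becomes one class id.
combo_to_class = {}
powerset_target = []
for row in Y:
    combo = tuple(row)
    class_id = combo_to_class.setdefault(combo, len(combo_to_class))
    powerset_target.append(class_id)

print(binary_targets["toxic"])   # target for the 'toxic' classifier alone
print(powerset_target)           # one multi-class target for all labels
print(2 ** len(labels))          # upper bound on the number of powerset classes
```

Only combinations that actually occur in the data become classes, so the real class count can be well below the 2^n upper bound.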

## Binary Relevance - build a multi-label classifier using Logistic Regression

In [None]:
# Evaluation Metrics
from sklearn.metrics import roc_auc_score # version 0.19.1

def evaluate(y_true, y_probs):
    macro_auc = roc_auc_score(y_true, y_probs, average='macro')
    micro_auc = roc_auc_score(y_true, y_probs, average='micro')
    return {'macro_auc': macro_auc, 'micro_auc': micro_auc}
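
The two averages only differ when several labels are scored at once. For a single label column they coincide. The following pure-Python sketch shows the difference on hypothetical scores; `binary_auc` is an illustrative pairwise implementation, not scikit-learn's.

```python
def binary_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical targets and scores: rows are comments, columns are two labels.
Y_true = [[1, 0], [0, 0], [1, 1], [0, 0]]
Y_prob = [[0.9, 0.2], [0.4, 0.3], [0.3, 0.8], [0.1, 0.1]]

columns_true = list(zip(*Y_true))
columns_prob = list(zip(*Y_prob))

# Macro: compute the AUC per label, then take the unweighted mean.
per_label = [binary_auc(t, p) for t, p in zip(columns_true, columns_prob)]
macro_auc = sum(per_label) / len(per_label)

# Micro: pool every (comment, label) cell into one long binary problem.
flat_true = [t for row in Y_true for t in row]
flat_prob = [p for row in Y_prob for p in row]
micro_auc = binary_auc(flat_true, flat_prob)

print(per_label, macro_auc, micro_auc)
```

Macro averaging gives rare labels such as threat the same weight as common ones, while micro averaging is dominated by whichever labels contribute the most positive cells.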

In [None]:
# import and instantiate the Logistic Regression model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
logreg = LogisticRegression() # default C=1.0; C is the inverse of regularization strength, as in SVMs

# create submission file
submission_binary = pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/sample_submission.csv.zip')

for label in cols_target:
    print('... Processing Tag: {}'.format(label))
    y = cleaned_df[label]
    # train the model using X_dtm & y
    logreg.fit(X_dtm, y)
    # compute the training accuracy
    y_pred_X = logreg.predict(X_dtm)
    print('Training accuracy is {}'.format(accuracy_score(y, y_pred_X)))
    # compute auc
    y_prob_X = logreg.predict_proba(X_dtm)[:,1]
    cleaned_df[label+'_prob'] = y_prob_X
    print('Training AUC is {}'.format(evaluate(np.array(y),y_prob_X)))
    # compute the predicted probabilities for test_X_dtm
    test_y_prob = logreg.predict_proba(test_X_dtm)[:,1]
    submission_binary[label] = test_y_prob

In [None]:
cols_probs = ['obscene_prob','insult_prob','toxic_prob','severe_toxic_prob','identity_hate_prob','threat_prob']

In [None]:
evaluate(np.array(y_all), cleaned_df[cols_probs])

In [None]:
submission_binary.to_csv('submission_binary.csv',index=False)

#### Binary Relevance with Logistic Regression classifier scored 0.074 on the public leaderboard.

## Classifier Chains - build a multi-label classifier using Logistic Regression

In [None]:
# create submission file
submission_chains = pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/sample_submission.csv.zip')

# create a function to add features
def add_feature(X, feature_to_add):
    '''
    Returns sparse feature matrix with added feature.
    feature_to_add can also be a list of features.
    '''
    from scipy.sparse import csr_matrix, hstack
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')

In [None]:
for label in cols_target:
    print('... Processing {}'.format(label))
    y = cleaned_df[label]
    # train the model using X_dtm & y
    logreg.fit(X_dtm,y)
    
    # compute the training accuracy
    y_pred_X = logreg.predict(X_dtm)
    print('Training Accuracy is {}'.format(accuracy_score(y,y_pred_X)))
    
    # compute the training AUC
    y_prob_X = logreg.predict_proba(X_dtm)[:,1]
    cleaned_df[label+'_prob2'] = y_prob_X
    print('Training AUC is {}'.format(evaluate(np.array(y),y_prob_X)))
    
    # make predictions from test_X
    test_y = logreg.predict(test_X_dtm)
    test_y_prob = logreg.predict_proba(test_X_dtm)[:,1]
    submission_chains[label] = test_y_prob
    
    # chain current label to X_dtm
    X_dtm = add_feature(X_dtm, y)
    print('Shape of X_dtm is now {}'.format(X_dtm.shape))
    # chain current label predictions to test_X_dtm
    test_X_dtm = add_feature(test_X_dtm, test_y)
    print('Shape of test_X_dtm is now {}'.format(test_X_dtm.shape))
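
The chaining step in the loop above can be pictured on plain Python lists. This is only a sketch: `add_column` is a hypothetical dense stand-in for the sparse `hstack` call, and every number is made up. It highlights that the training matrix grows with the true labels while the test matrix grows with predicted labels.

```python
def add_column(rows, column):
    """Return new rows with one extra value appended to each (dense stand-in for hstack)."""
    return [row + [value] for row, value in zip(rows, column)]

# Hypothetical toy features: two comments for training, one for testing.
train_rows = [[0.1, 0.0], [0.0, 0.7]]
test_rows = [[0.2, 0.3]]

# Training side: the true labels of earlier targets become extra features.
true_labels = {"obscene": [0, 1], "insult": [0, 1]}
# Test side: true labels are unknown, so the earlier models' predictions are used.
predicted_labels = {"obscene": [1], "insult": [0]}

for label in ["obscene", "insult"]:
    train_rows = add_column(train_rows, true_labels[label])
    test_rows = add_column(test_rows, predicted_labels[label])
    print(label, len(train_rows[0]), len(test_rows[0]))
```

One consequence of this design is that later classifiers see perfect label features during training but noisy ones at prediction time. scikit-learn's `ClassifierChain` has a `cv` option that trains on cross-validated predictions instead, which may be worth comparing.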

In [None]:
cols_probs = ['obscene_prob2','insult_prob2','toxic_prob2','severe_toxic_prob2','identity_hate_prob2','threat_prob2']

In [None]:
submission_chains.to_csv('submission_chains.csv', index=False)

## Label Powerset - a multi-class problem over the 63 non-empty label combinations

In [None]:
from itertools import chain, combinations
cols_target = ['obscene','insult','toxic','severe_toxic','identity_hate','threat']
def powerset(iterable):
    "powerset([1,2,3]) --> (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3); the empty set is skipped"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(1, len(s)+1))

In [None]:
list(powerset(cols_target))[10:15] # the empty (all-unlabelled) combination is excluded
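
As a quick sanity check on the class count, the non-empty combinations can be enumerated directly. This sketch repeats the same `itertools` recipe as `powerset()` so that it runs on its own; the variable names are illustrative.

```python
from itertools import chain, combinations

labels = ['obscene', 'insult', 'toxic', 'severe_toxic', 'identity_hate', 'threat']

# All non-empty subsets, using the same recipe as powerset() (r starts at 1).
non_empty = list(chain.from_iterable(
    combinations(labels, r) for r in range(1, len(labels) + 1)))

print(len(non_empty))      # number of non-empty label combinations
print(len(non_empty) + 1)  # adding the "no label at all" class
```

The count without the empty combination is 2^6 - 1. Including the all-unlabelled case, which covers most of the training comments, brings it back to 2^6.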