# About this Notebook

The goal of this notebook is to build a deep learning (DL) classifier that detects toxic comments. The data comes from a series of Kaggle competitions on classifying Wikipedia comments as toxic or nontoxic, and was provided by Google and Jigsaw.

Though the full dataset includes non-English comments, I will restrict myself to English-only comments for this iteration.

I will explore deep learning approaches, combining pretrained word embeddings with simple deep learning models like RNNs and 1D convolutions as additional benchmarks.

Next, we will explore deep learning models that have 'memory' using LSTMs (Long Short Term Memory) and GRUs (Gated Recurrent Units). 

Finally, we will approach state-of-the-art performance using pretrained models like BERT and XLNet.

For metrics, I will focus on both ROC and precision-recall curves. In addition, I will look at the confusion matrix and performance across different flavors of toxicity.

Credits:
- https://www.kaggle.com/tanulsingh077/deep-learning-for-nlp-zero-to-transformers-bert
- https://www.kaggle.com/jagangupta/stop-the-s-toxic-comments-eda
- https://www.kaggle.com/clinma/eda-toxic-comment-classification-challenge
- https://www.kaggle.com/abhi111/naive-bayes-baseline-and-logistic-regression

My approach to feature engineering and building the model is below:

Deep Learning:
1. Use standard tokenizers and compare with 'homegrown' version from above.
2. Use open source word embeddings for corpus as input to RNN models. Quantify how misspellings affect the standard tokenizers.
3. Find a way to input additional features like punctuation/capitalization from approach above to Deep Learning RNN models.
4. Try progressively more complicated deep learning sequence models approaching SOTA.
5. Use metrics from above.

Potential Modules:
1. Correct misspellings
2. Analytics for preprocessing
3. Analytics for model performance (use multi-labels, make easy way to look at specific examples)
4. Automatically generate a lookup table for common variations of words (particularly toxic words, e.g., 'mothafucka' -> 'motherfucker')




## Install requirements as needed

In [None]:
from tqdm import tqdm
import numpy as np
import pandas as pd
%matplotlib inline
  
pd.options.display.max_rows = 999

#Uncomment below if running in colab
#!pip install tokenizers
#!pip install transformers


## Install toxicity package

In [None]:
#Run below if toxicity package is not installed
#!pip install --upgrade git+https://github.com/jkchandalia/toxic-comment-classifier.git@fe5dfe51f09322c166cce0a56818f66a2a2fc5c7


In [None]:
from toxicity import constants, data, features, metrics, visualize, model, text_preprocessing, model_BERT, model_embeddings


## Load data

In [None]:
#Mount drive if using google colab nb; skipped when not running in Colab
try:
    from google.colab import drive
    drive.mount('/content/drive')
except ImportError:
    pass

In [None]:
#Use below for local
pre_path = './'
#Use below for paperspace
#pre_path = '/storage/'
#Use below for colab with drive mounted
#pre_path = '/content/drive/My Drive/toximeter_project/'
input_data_path = pre_path+constants.INPUT_PATH
df_train = data.load(input_data_path, filter=False)

train_full = df_train.copy()
#df_train = df_train.loc[:10000,:]
print("Sample Toxic Comments: ")
print(df_train.comment_text[df_train.toxic==1][1:2].values)
print("Breakdown of nontoxic/toxic comments: ")
df_train.toxic.value_counts()


In [None]:
xtrain, xvalid, ytrain, yvalid = model.make_train_test(df_train)

In [None]:
xtrain, ytrain = model_BERT.smart_sample(xtrain, ytrain)

In [None]:
len(xtrain)
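
The internals of `model_BERT.smart_sample` are not shown in this notebook. As a rough illustration of one common way to rebalance a skewed training set, the sketch below keeps every toxic example and randomly downsamples the nontoxic ones. The function name `downsample_majority` and its `ratio` parameter are hypothetical and are not part of the toxicity package.

```python
import random

def downsample_majority(texts, labels, ratio=1.0, seed=0):
    """Keep all positives and a random subset of negatives.

    `ratio` is the number of negatives kept per positive (hypothetical parameter).
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), int(len(pos) * ratio))
    keep = pos + rng.sample(neg, n_keep)
    rng.shuffle(keep)  # avoid having all positives first
    return [texts[i] for i in keep], [labels[i] for i in keep]

# Toy data: 10 toxic comments out of 100
texts = [f"comment {i}" for i in range(100)]
labels = [1 if i < 10 else 0 for i in range(100)]
x_bal, y_bal = downsample_majority(texts, labels)
print(len(x_bal), sum(y_bal))
```

With `ratio=1.0` the result has as many nontoxic as toxic examples; a larger ratio keeps more of the majority class.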

## Use Deep Learning

## Preprocess data

### Find the maximum number of words in any comment; this value is used for padding later

In [None]:
max_len = model_BERT.find_max_len(df_train['comment_text'])
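
As an illustration of what such a length check might look like, here is a minimal sketch that counts whitespace-separated tokens. It assumes plain whitespace splitting, which may differ from what `model_BERT.find_max_len` actually does; `max_word_count` is a hypothetical name.

```python
def max_word_count(comments):
    """Return the largest whitespace-separated token count among the comments."""
    return max((len(str(c).split()) for c in comments), default=0)

sample = ["you are great", "this is a much longer comment than the first", ""]
max_len_demo = max_word_count(sample)
print(max_len_demo)
```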

### Tokenize the input corpus

In [None]:
xtrain_pad, xvalid_pad, word_index = model_embeddings.tokenize(xtrain, xvalid, max_len)

In [None]:
word_index
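
To make the tokenize-and-pad step concrete, the sketch below builds a word index from the training texts and converts texts into fixed-length integer sequences. It is a simplified stand-in written in plain Python, not the implementation of `model_embeddings.tokenize`; the helper names, the lowercasing, the whitespace splitting, and the choice of left-padding with 0 are all assumptions.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer id, most frequent first; 0 is reserved for padding."""
    counts = Counter(w for t in texts for w in t.lower().split())
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

def texts_to_padded(texts, word_index, max_len):
    """Convert texts to id sequences of length max_len, left-padded with 0."""
    padded = []
    for t in texts:
        # Words missing from the index are dropped
        ids = [word_index[w] for w in t.lower().split() if w in word_index]
        ids = ids[-max_len:]  # keep the last max_len tokens if too long
        padded.append([0] * (max_len - len(ids)) + ids)
    return padded

train_demo = ["you are great", "you are not"]
idx_demo = build_word_index(train_demo)
padded_demo = texts_to_padded(["you are awful"], idx_demo, max_len=4)
print(idx_demo)
print(padded_demo)
```

The index is built from the training texts only, so words that appear only in validation text (here "awful") have no id and are skipped.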

## Convert the integer word index into semantically rich GloVe vectors

In [None]:
# load the GloVe vectors in a dictionary:
glove_embedding_path = pre_path + 'data/jigsaw-multilingual-toxic-comment-classification/'
embeddings_index = model_embeddings.create_embedding_index(glove_embedding_path)

print('Found %s word vectors.' % len(embeddings_index))

In [None]:
embeddings_index

In [None]:
# create an embedding matrix for the words we have in the dataset
output_path = pre_path + 'data/glove_embedding_for_subsample'
embedding_matrix = model_embeddings.create_embedding_matrix(word_index, embeddings_index, output_path)

In [None]:
#Load embeddings
input_path = output_path
embedding_matrix = model_embeddings.load_embeddings(input_path+'.npy')

In [None]:
embedding_matrix.shape
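
The embedding matrix ties the word index to the pretrained vectors: row `i` holds the vector for the word whose id is `i`. The sketch below shows the idea on toy data. It assumes row 0 is reserved for padding and that words without a pretrained vector stay all-zero; `build_embedding_matrix` is a hypothetical helper, not the package's `create_embedding_matrix`.

```python
import numpy as np

def build_embedding_matrix(word_index, embeddings_index, dim):
    """Row i holds the vector for word id i; padding and unknown words stay zero."""
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    for word, i in word_index.items():
        vec = embeddings_index.get(word)
        if vec is not None:
            matrix[i] = vec
    return matrix

# Toy 2-dimensional "embeddings"; "zzzz" stands in for a misspelled word with no vector
toy_index = {"you": 1, "are": 2, "zzzz": 3}
toy_vectors = {"you": np.array([0.1, 0.2]), "are": np.array([0.3, 0.4])}
toy_matrix = build_embedding_matrix(toy_index, toy_vectors, dim=2)
print(toy_matrix.shape)
```

Counting the all-zero rows in such a matrix is one quick way to see how many corpus words (for example, misspellings) have no pretrained vector.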

## LSTM Model

In [None]:
#IMP DATA FOR CONFIG
#AUTO = tf.data.experimental.AUTOTUNE

# Configuration
EPOCHS = 120
BATCH_SIZE = 100


In [None]:
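# Note: this assignment rebinds `model`, shadowing the imported `toxicity.model` module.
# Re-run the import cell before calling `model.make_train_test` again.
model = model_embeddings.build_model(word_index, embedding_matrix, max_len)
model.summary()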


    

## Callbacks

In [None]:
project_name = 'check_output_glove'
callbacks = model_BERT.make_callbacks(pre_path, project_name)

In [None]:
train_history = model.fit(
    xtrain_pad, 
    ytrain, 
    epochs=EPOCHS, 
    batch_size=BATCH_SIZE,
    callbacks=callbacks,
    validation_split=0.2)

In [None]:
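# Predicted probabilities for the validation set
scores = model.predict(xvalid_pad)
preds = scores > .5
# run_metrics is not defined in this notebook; it is assumed to be provided elsewhere,
# e.g., by the imported toxicity `metrics` module
run_metrics(preds, scores, yvalid)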
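
For reference, the two quantities behind the confusion matrix and the ROC curve can be computed by hand on a toy example. This is only an illustrative sketch in plain Python and is not the code behind `run_metrics`; the function names are hypothetical, and the pairwise AUC loop is quadratic, so it suits small examples only.

```python
def confusion_counts(y_true, scores, threshold=0.5):
    """Return (tn, fp, fn, tp) after thresholding the scores."""
    tn = fp = fn = tp = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s > threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_demo = [0, 0, 0, 1, 1]
s_demo = [0.1, 0.4, 0.7, 0.6, 0.9]
cm_demo = confusion_counts(y_demo, s_demo)
auc_demo = roc_auc(y_demo, s_demo)
print(cm_demo, round(auc_demo, 3))
```

Unlike the confusion matrix, the AUC does not depend on the 0.5 threshold, which makes it a useful complement when classes are imbalanced.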

# Summary

So far, with very little preprocessing, we have achieved high accuracy. This is somewhat misleading, however, because the training set is highly imbalanced (roughly 10% positive/toxic class).
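
A small worked example shows why accuracy alone is misleading here. With a synthetic 10% toxic rate, a classifier that labels everything nontoxic still reaches 90% accuracy while catching no toxic comments. The numbers are toy data, not results from this notebook.

```python
# 100 synthetic labels with 10% positives, and a classifier that always predicts 0
labels_demo = [1] * 10 + [0] * 90
always_zero = [0] * 100

accuracy = sum(t == p for t, p in zip(labels_demo, always_zero)) / len(labels_demo)
true_pos = sum(t == 1 and p == 1 for t, p in zip(labels_demo, always_zero))
recall = true_pos / sum(labels_demo)
print(accuracy, recall)
```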

Slightly older techniques, bag-of-words and tf-idf, have done better than simple deep learning models out of the box. This can be seen in the higher AUCs and accuracy of these models compared with the simple RNN model. In addition, training these models was extremely fast, even on a local machine, whereas the deep learning models required more than 10 minutes to train for even five epochs. Training the simple RNN also required adjusting the learning rate to get the network to learn; the first few attempts produced labels of all zeros.

The simple LSTM model improves dramatically over the simple RNN model even with only 5 epochs, showing that semantically rich word embeddings and memory already improve simple deep learning results. Though overall accuracy has decreased for the LSTM model relative to the Naive Bayes models, its AUC, precision-recall curve, and ROC curve are much better than those of the simple models. As we approach more state-of-the-art (SOTA) models and move beyond simple proof-of-concept training (trying different network parameters, experimenting with data preprocessing, optimizing hyperparameters, training until the results start to degrade, adding regularization, etc.), the results will likely improve even more dramatically.
