# About this Notebook

The goal of this notebook is to build a DL classifier to find toxic comments. The data has been taken from a series of Kaggle competitions to classify Wikipedia comments as toxic/nontoxic. The data has been sourced from Google and Jigsaw. 

Though the full dataset includes non-English comments, I will restrict myself to English-only comments for this iteration. 

I will first explore deep learning approaches that combine pretrained word embeddings with simple models such as RNNs and 1D convolutions, which serve as additional benchmarks. 

Next, we will explore deep learning models that have 'memory' using LSTMs (Long Short Term Memory) and GRUs (Gated Recurrent Units). 

Finally, we will approach state-of-the-art performance using pretrained models like BERT and XLNet.

For metrics, I will focus on both ROC and precision-recall curves. In addition, I will look at the confusion matrix and performance across different flavors of toxicity.
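As a rough illustration of what the ROC AUC measures, the sketch below computes it directly from its ranking definition: the probability that a randomly chosen toxic comment receives a higher score than a randomly chosen nontoxic one. The function name `roc_auc` and the toy labels and scores are made up for illustration and are not part of the `toxicity` package.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    This is O(n_pos * n_neg), so it is only meant for small examples.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): two nontoxic and two toxic comments
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

Because this quantity depends only on how comments are ranked, it does not change with the decision threshold, which is one reason to report it alongside threshold-dependent numbers such as accuracy.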

Credits:
- https://www.kaggle.com/tanulsingh077/deep-learning-for-nlp-zero-to-transformers-bert
- https://www.kaggle.com/jagangupta/stop-the-s-toxic-comments-eda
- https://www.kaggle.com/clinma/eda-toxic-comment-classification-challenge
- https://www.kaggle.com/abhi111/naive-bayes-baseline-and-logistic-regression

My approach to feature engineering and building the model is below:

Deep Learning:
1. Use standard tokenizers and compare with 'homegrown' version from above.
2. Use open source word embeddings for corpus as input to RNN models. Quantify how misspellings affect the standard tokenizers.
3. Find way to input additional features like punctuation/capitalization from approach above to Deep Learning RNN models.
4. Try progressively more complicated deep learning sequence models approaching SOTA.
5. Use metrics from above.

Potential Modules:
1. Correct misspellings
2. Analytics for preprocessing
3. Analytics for model performance (use multi-labels, make easy way to look at specific examples)
4. Automatically generate a lookup table for common variations of words (particularly toxic words, e.g., 'mothafucka' -> 'motherfucker')




## Install requirements as needed

In [None]:
from tqdm import tqdm
import numpy as np
import pandas as pd
%matplotlib inline
  
pd.options.display.max_rows = 999

#Uncomment below if running in colab
#!pip install tokenizers
#!pip install transformers


## Install toxicity package

In [1]:
#Run below if toxicity package is not installed
#!pip install --upgrade git+https://github.com/jkchandalia/toxic-comment-classifier.git@fe5dfe51f09322c166cce0a56818f66a2a2fc5c7


In [8]:
from toxicity import constants, data, features, metrics, visualize, model, text_preprocessing, model_BERT

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing TFDistilBertModel: ['vocab_projector', 'vocab_transform', 'vocab_layer_norm', 'activation_13']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-cased.
If your task is similar to the task the model of the ckeckpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


## Load data

In [3]:
#Mount drive if using google colab nb
from google.colab import drive
drive.mount('/content/drive')

ModuleNotFoundError: No module named 'google.colab'

In [5]:
#Use below for local
pre_path = './'
#Use below for paperspace
#pre_path = '/storage/'
#Use below for colab with drive mounted
#pre_path = '/content/drive/My Drive/toximeter_project/'
input_data_path = pre_path+constants.INPUT_PATH
df_train = data.load(input_data_path, filter=False)

train_full = df_train.copy()
#df_train = df_train.loc[:10000,:]
print("Sample Toxic Comments: ")
print(df_train.comment_text[df_train.toxic==1][1:2].values)
print("Breakdown of nontoxic/toxic comments: ")
df_train.toxic.value_counts()


Sample Toxic Comments: 
Breakdown of nontoxic/toxic comments: 


0    202165
1     21384
Name: toxic, dtype: int64

In [6]:
xtrain, xvalid, ytrain, yvalid = model.make_train_test(df_train)

In [7]:
len(xvalid)

44710

## Use Deep Learning

## Preprocess data

### Check the maximum number of words in a comment; this determines the padding length later

In [9]:
max_len = model_BERT.find_max_len(df_train['comment_text'])

Max length of comment text is: 2400


### First do Tokenization of input corpus

In [None]:
# using keras tokenizer here
token = text.Tokenizer(num_words=None)
token_toxic = text.Tokenizer(num_words=None)
token_nontoxic = text.Tokenizer(num_words=None)

token.fit_on_texts(list(xtrain) + list(xvalid))
token_toxic.fit_on_texts(df_train.comment_text.values[df_train.toxic==1])
token_nontoxic.fit_on_texts(df_train.comment_text.values[df_train.toxic==0])

xtrain_seq = token.texts_to_sequences(xtrain)
xvalid_seq = token.texts_to_sequences(xvalid)

#zero pad the sequences
xtrain_pad = sequence.pad_sequences(xtrain_seq, maxlen=max_len)
xvalid_pad = sequence.pad_sequences(xvalid_seq, maxlen=max_len)

word_index = token.word_index
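To make the padding step concrete, here is a minimal pure-Python sketch of what zero-padding to a fixed length does. It assumes the Keras defaults for `pad_sequences` (zeros added at the front, and overly long sequences truncated from the front); the helper name `pad_pre` is hypothetical.

```python
def pad_pre(seqs, maxlen, value=0):
    """Left-pad each sequence with `value` and keep only the last `maxlen` ids."""
    if maxlen < 1:
        raise ValueError("maxlen must be at least 1")
    padded = []
    for seq in seqs:
        seq = list(seq)[-maxlen:]                  # drop ids from the front if too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


demo = pad_pre([[1, 2, 3], [4, 5, 6, 7, 8]], maxlen=4)
print(demo)
```

Every row ends up with the same length, which is what lets the padded sequences be stacked into a single array for the embedding layer.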

In [None]:
word_toxic = token_toxic.word_index
word_nontoxic = token_nontoxic.word_index

In [None]:
print(len(word_toxic))
print(len(word_nontoxic))

42681
288956


Example of fitting a tokenizer line by line when the corpus is too big to fit into memory (here `t` is a Keras `Tokenizer` instance):

    with open('/Users/liling.tan/test.txt') as fin:
        for line in fin:
            # Fitting the tokenizer line-by-line.
            t.fit_on_texts(line.split())

    M = []

    with open('/Users/liling.tan/test.txt') as fin:
        for line in fin:
            # Converting the lines into matrix, line-by-line.
            m = t.texts_to_matrix([line], mode='count')[0]
            M.append(m)
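The same streaming idea can be sketched without Keras. The example below counts words one line at a time and then assigns ids by descending frequency, leaving id 0 free for padding. The function name `build_word_index`, the whitespace tokenization, and the toy corpus are assumptions for illustration only.

```python
import io
from collections import Counter


def build_word_index(lines):
    """Build a word -> id mapping from an iterable of lines, one line at a time."""
    counts = Counter()
    for line in lines:
        # Only the running counts are kept in memory, never the full corpus
        counts.update(line.lower().split())
    # Most frequent word gets id 1; ties are broken alphabetically
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}


# A file object would work the same way as this in-memory stand-in
corpus = io.StringIO("you are great\nyou are rude\nyou rock\n")
word_index_demo = build_word_index(corpus)
print(word_index_demo)
```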

## Use pretrained word embeddings

## Convert our integer word index into semantically rich GloVe vectors

In [None]:
# load the GloVe vectors in a dictionary:

embeddings_index = {}
# 'with' closes the file even if parsing fails part-way through
with open(pre_path + 'glove840b300dtxt/glove.840B.300d.txt','r',encoding='utf-8') as f:
    for line in tqdm(f):
        values = line.split(' ')
        word = values[0]
        coefs = np.asarray([float(val) for val in values[1:]])
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))
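For readers without the GloVe file at hand, the parsing logic can be tried on a tiny in-memory stand-in. This is only a sketch: the three-dimensional vectors and the helper name `load_vectors` are made up, and rows with the wrong number of fields are skipped rather than repaired.

```python
import io


def load_vectors(lines, dim):
    """Parse 'word v1 v2 ... v_dim' rows into a dict of word -> list of floats."""
    index = {}
    for line in lines:
        values = line.rstrip("\n").split(" ")
        if len(values) != dim + 1:
            continue  # skip malformed rows instead of guessing their meaning
        index[values[0]] = [float(v) for v in values[1:]]
    return index


fake_glove = io.StringIO("good 0.1 0.2 0.3\nbad -0.1 -0.2 -0.3\nbroken 0.5\n")
vectors = load_vectors(fake_glove, dim=3)
print(sorted(vectors))
```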

In [None]:


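from collections import defaultdict as ddict

words_not_in_corpus = ddict(int)
words_in_corpus = ddict(int)
# create an embedding matrix for the words we have in the dataset
embedding_matrix = np.zeros((len(word_index) + 1, 300))
# iterate over word_index (from `token`) so row numbers match the ids in xtrain_pad
for word, i in tqdm(word_index.items()):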
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        words_in_corpus[word]+=1
    else:
        words_not_in_corpus[word]+=1

In [None]:
print(len(words_not_in_corpus))
print(len(words_in_corpus))
max(words_not_in_corpus.values())
max(words_in_corpus.values())

#For the full dataset, more than half the 'words' are not found in the glove embeddings
#For the 10K sample dataset, only ~25% of the words are not found in the glove embeddings
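The fraction of distinct words missing from the embeddings can look alarming while most of the actual text is still covered, because rare misspellings make up a large share of the vocabulary but a small share of the tokens. The sketch below separates the two views. The counts are invented for illustration; with the Keras tokenizer, the per-word counts would presumably come from its `word_counts` attribute.

```python
def coverage(word_counts, known_words):
    """Return (share of distinct words covered, share of all tokens covered)."""
    vocab_hits = sum(1 for w in word_counts if w in known_words)
    token_hits = sum(c for w, c in word_counts.items() if w in known_words)
    vocab_cov = vocab_hits / len(word_counts)
    token_cov = token_hits / sum(word_counts.values())
    return vocab_cov, token_cov


# Hypothetical corpus counts and embedding vocabulary
word_counts_demo = {"the": 50, "idiot": 5, "idiottt": 1, "wikipedia": 4}
known = {"the", "idiot", "wikipedia"}
vocab_cov, token_cov = coverage(word_counts_demo, known)
print(vocab_cov, token_cov)
```

In this toy case a quarter of the vocabulary is missing, yet fewer than 2% of the tokens are affected.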


In [None]:
#Save embeddings so they can be easily loaded
np.save('/kaggle/working/glove_embedding_for_full_data', embedding_matrix)

In [None]:
import os
os.path.abspath('.')

'/Users/jkc/workspace/toxic-comment-classifier/exploration/DL_experiments'

In [None]:
#Load embeddings
embedding_matrix = np.load(pre_path+'data/embedding_for_lstm_all.npy')

In [None]:
embedding_matrix.shape

(300258, 300)

## Simple RNN Model

In [None]:
opt = Adam(learning_rate=0.0001)

In [None]:
model1 = Sequential()
model1.add(Embedding(len(word_index) + 1,
                 300,
                 input_length=max_len))
model1.add(SimpleRNN(100))
# sigmoid keeps the output in (0, 1), which binary_crossentropy expects
model1.add(Dense(1, activation='sigmoid'))
model1.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
    
model1.summary()

In [None]:
from keras.callbacks import ModelCheckpoint, TensorBoard, EarlyStopping
EPOCHS = 10
checkpoint_filepath = './checkpoint'
model_checkpoint_callback = ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='val_accuracy',
    mode='max',
    save_best_only=True)


my_callbacks = [
    model_checkpoint_callback,
    TensorBoard(log_dir='./logs'),
    EarlyStopping(monitor='val_loss', patience=3)
]
model_checkpoint_callback

<tensorflow.python.keras.callbacks.ModelCheckpoint at 0x21f435ad0>

In [None]:
model1.fit(xtrain_pad, 
           ytrain, 
           epochs=50, 
           batch_size=100, 
           callbacks=my_callbacks,
           validation_split=0.2,)

In [None]:
scores = model1.predict(xvalid_pad)[:, 0]
preds = scores>.5
run_metrics(preds, scores, yvalid)

## Simple LSTM Model

In [None]:
%%time
# A simple LSTM with glove embeddings and one dense layer
model = Sequential()
model.add(Embedding(len(word_index) + 1,
                 300,
                 weights=[embedding_matrix],
                 input_length=max_len,
                 trainable=False))

model.add(LSTM(100, activation="tanh",
    recurrent_activation="sigmoid", dropout=0.2, recurrent_dropout=0.1))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam',metrics=['accuracy', AUC(curve='PR')])
    
model.summary()


    

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 2400, 300)         90077400  
_________________________________________________________________
lstm (LSTM)                  (None, 100)               160400    
_________________________________________________________________
dense (Dense)                (None, 1)                 101       
Total params: 90,237,901
Trainable params: 160,501
Non-trainable params: 90,077,400
_________________________________________________________________
CPU times: user 1.99 s, sys: 550 ms, total: 2.54 s
Wall time: 5.92 s
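The parameter counts in the summary above can be reproduced by hand, which is a useful check that the layers are wired as intended. The sketch below uses the standard formulas for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights, and a bias), and a dense layer. The helper names are hypothetical; the 300258 rows are taken from the shape of the saved embedding matrix.

```python
def embedding_params(rows, dim):
    # One dim-sized vector per vocabulary row
    return rows * dim


def lstm_params(input_dim, units):
    # 4 gates x (input weights + recurrent weights + bias)
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units


emb = embedding_params(300258, 300)
lstm = lstm_params(300, 100)
dense = dense_params(100, 1)
print(emb, lstm, dense, emb + lstm + dense)
```

Since the embedding layer is frozen, only the LSTM and dense parameters are trainable, which matches the 160,501 trainable parameters reported.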


In [None]:
import os
# Create a callback for tensorboard
tb_callback = TensorBoard(log_dir=pre_path+'glove_lstm_frozen_10Ksample/Graph', histogram_freq=0, write_graph=True, write_images=True)

# Create a callback that saves the model's weights every epoch
checkpoint_path = pre_path+"training/glove_lstm_frozen_10Ksample/cp-{epoch:04d}.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)

cp_callback = ModelCheckpoint(
    filepath=checkpoint_path, 
    verbose=1, 
    save_weights_only=True,
    save_freq='epoch',
    period=5)

# Callback for early stopping if model isn't improving
es = EarlyStopping(
    monitor='val_loss', min_delta=0, patience=2, verbose=0, mode='auto',
    baseline=None, restore_best_weights=True
)
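The patience logic behind early stopping can be sketched in a few lines. This is a simplified illustration, not the Keras implementation: it stops once the validation loss has failed to improve for `patience` consecutive epochs and reports which epoch held the best loss. The function name and the loss values are made up. Note that in Keras a callback only takes effect when it is included in the `callbacks` list passed to `fit`.

```python
def early_stopping_epoch(val_losses, patience, min_delta=0.0):
    """Return (epoch at which training stops, epoch with the best loss), 0-based."""
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, best_epoch, wait = loss, epoch, 0   # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch            # never triggered


stopped_at, best_at = early_stopping_epoch([0.50, 0.40, 0.41, 0.42, 0.43], patience=2)
print(stopped_at, best_at)
```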




In [None]:
model.fit(xtrain_pad, 
          ytrain, 
          epochs=120, 
          batch_size=100,
          callbacks=[tb_callback, cp_callback],
          validation_split=0.2,)

Epoch 1/120
Instructions for updating:
use `tf.profiler.experimental.stop` instead.
Epoch 2/120
Epoch 3/120
Epoch 4/120
Epoch 5/120
Epoch 00005: saving model to /content/drive/My Drive/toximeter_project/training/glove_lstm_frozen_10Ksample/cp-0005.ckpt
Epoch 6/120
Epoch 7/120
Epoch 8/120
Epoch 9/120
Epoch 10/120
Epoch 00010: saving model to /content/drive/My Drive/toximeter_project/training/glove_lstm_frozen_10Ksample/cp-0010.ckpt
Epoch 11/120
Epoch 12/120
Epoch 13/120
Epoch 14/120
Epoch 15/120
Epoch 00015: saving model to /content/drive/My Drive/toximeter_project/training/glove_lstm_frozen_10Ksample/cp-0015.ckpt
Epoch 16/120
Epoch 17/120
Epoch 18/120
Epoch 19/120
Epoch 20/120
Epoch 00020: saving model to /content/drive/My Drive/toximeter_project/training/glove_lstm_frozen_10Ksample/cp-0020.ckpt
Epoch 21/120
Epoch 22/120
Epoch 23/120
Epoch 24/120
Epoch 25/120
Epoch 00025: saving model to /content/drive/My Drive/toximeter_project/training/glove_lstm_frozen_10Ksample/cp-0025.ckpt
Epoch 2

<tensorflow.python.keras.callbacks.History at 0x7f10accc16d8>

In [None]:
scores = model.predict(xvalid_pad)
preds = scores>.5
run_metrics(preds, scores, yvalid)
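What `run_metrics` computes internally is not shown in this notebook, so the following is only a generic sketch of how thresholded scores turn into a confusion matrix, precision, and recall. The function name `confusion_counts` and the toy data are hypothetical.

```python
def confusion_counts(labels, scores, threshold=0.5):
    """Count (tn, fp, fn, tp) after thresholding the scores."""
    tn = fp = fn = tp = 0
    for y, s in zip(labels, scores):
        pred = 1 if s > threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


labels_demo = [0, 0, 0, 0, 1, 1]
scores_demo = [0.1, 0.2, 0.7, 0.3, 0.9, 0.4]
tn, fp, fn, tp = confusion_counts(labels_demo, scores_demo)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
print((tn, fp, fn, tp), precision, recall)
```

Lowering the threshold trades precision for recall, which is the trade-off the precision-recall curve traces out.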

# Summary

So far, with very little preprocessing, we have achieved high accuracy. This is a little bit misleading however because the training set is highly imbalanced (roughly 10% positive/toxic class). 
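To see how misleading accuracy can be here, consider a classifier that labels every comment nontoxic. Using the class counts printed in the data-loading section, the arithmetic is:

```python
# Class counts from df_train.toxic.value_counts() earlier in this notebook
nontoxic, toxic = 202165, 21384
total = nontoxic + toxic

# A model that always predicts "nontoxic" is right on every nontoxic comment
baseline_accuracy = nontoxic / total
toxic_fraction = toxic / total
# ...but it never finds a single toxic comment
baseline_recall = 0 / toxic

print(round(baseline_accuracy, 4), round(toxic_fraction, 4), baseline_recall)
```

Any model therefore needs to beat roughly 90% accuracy before its accuracy says anything, which is why the ROC and precision-recall curves are the more informative metrics for this dataset.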

Slightly older techniques such as bag-of-words and tf-idf have done better than simple deep learning models out of the box. This can be seen in the higher AUCs and accuracy of those models compared with the simple RNN model. In addition, training them was extremely fast, even on a local machine, whereas the deep learning models required more than 10 minutes to train even five epochs. Training the simple RNN also required tuning the learning rate to get the network to learn; the first few attempts produced labels of all zeros. 

The simple LSTM model improves dramatically over the simple RNN model even with only 5 epochs, showing that semantically rich word embeddings and memory already improve simple deep learning results. Though overall accuracy has decreased in the LSTM model relative to the Naive Bayes models, its AUC and its precision-recall and ROC curves are much better than those of the simple models. As we approach more state-of-the-art (SOTA) models and move beyond simple proof-of-concept training (trying different network parameters, experimenting with data preprocessing, optimizing hyperparameters, training until the results start to degrade, adding regularization, and so on), the results will likely improve even further.


## Try a GRU Model

In [None]:
%%time
# GRU with glove embeddings, spatial dropout, and one dense output layer
model = Sequential()
model.add(Embedding(len(word_index) + 1,
                 300,
                 weights=[embedding_matrix],
                 input_length=max_len,
                 trainable=False))
model.add(SpatialDropout1D(0.3))
model.add(GRU(300))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam',metrics=['accuracy'])

model.summary()

In [None]:
# `nb_epoch` is the Keras 1 argument name; current versions use `epochs`
model.fit(xtrain_pad, ytrain, epochs=5, batch_size=64)

In [None]:
scores = model.predict(xvalid_pad)


## Bidirectional RNN Model

In [None]:
%%time
# A simple bidirectional LSTM with glove embeddings and one dense layer
model = Sequential()
model.add(Embedding(len(word_index) + 1,
                 300,
                 weights=[embedding_matrix],
                 input_length=max_len,
                 trainable=False))
model.add(Bidirectional(LSTM(300, dropout=0.3, recurrent_dropout=0.3)))

model.add(Dense(1,activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam',metrics=['accuracy'])
    
    
model.summary()
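The idea behind a bidirectional layer can be shown with a toy scalar recurrence: run the same kind of recurrent update once left to right and once right to left, then concatenate the two final states. This is a sketch only. The weights `w` and `u` are arbitrary constants, and a real `Bidirectional(LSTM(...))` layer learns separate weights for each direction; with Keras' default concatenation, wrapping `LSTM(300)` yields 600 output features.

```python
import math


def run_rnn(xs, w=0.5, u=0.8):
    """Toy recurrence h_t = tanh(w * x_t + u * h_{t-1}); returns the final state."""
    h = 0.0
    for x in xs:
        h = math.tanh(w * x + u * h)
    return h


def bidirectional(xs):
    forward = run_rnn(xs)                      # reads the sequence left to right
    backward = run_rnn(list(reversed(xs)))     # reads the sequence right to left
    return [forward, backward]                 # concatenated summary of both passes


out = bidirectional([1.0, 0.0, -1.0])
sym = bidirectional([1.0, 2.0, 1.0])
print(out, sym)
```

The two directions generally disagree because each sees the tokens in a different order; they coincide only when the sequence reads the same both ways, as in the second example.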

In [None]:
# `nb_epoch` is the Keras 1 argument name; current versions use `epochs`
model.fit(xtrain_pad, ytrain, epochs=5, batch_size=64)

In [None]:
scores = model.predict(xvalid_pad)


## Seq2seq Architecture

In [None]:
#TBD


In [None]:
import tensorflow as tf
# Import everything from tensorflow.keras (assumes TF 2.x) so layers, metrics,
# and optimizers all come from the same Keras implementation
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, GRU, SimpleRNN
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import BatchNormalization
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from tensorflow.keras.layers import GlobalMaxPooling1D, Conv1D, MaxPooling1D, Flatten, Bidirectional, SpatialDropout1D
from tensorflow.keras.preprocessing import sequence, text
from tensorflow.keras.callbacks import EarlyStopping, History, ModelCheckpoint, TensorBoard
from tensorflow.keras.metrics import Accuracy, AUC
from tensorflow.keras.optimizers import Adam



In [None]:
# using keras tokenizer here
token = text.Tokenizer(num_words=None)
token_toxic = text.Tokenizer(num_words=None)
token_nontoxic = text.Tokenizer(num_words=None)

token.fit_on_texts(list(xtrain) + list(xvalid))
token_toxic.fit_on_texts(df_train.comment_text.values[df_train.toxic==1])
token_nontoxic.fit_on_texts(df_train.comment_text.values[df_train.toxic==0])

xtrain_seq = token.texts_to_sequences(xtrain)
xvalid_seq = token.texts_to_sequences(xvalid)

#zero pad the sequences
xtrain_pad = sequence.pad_sequences(xtrain_seq, maxlen=max_len)
xvalid_pad = sequence.pad_sequences(xvalid_seq, maxlen=max_len)

word_index = token.word_index

In [None]:
#Load embeddings
embedding_matrix = np.load(pre_path+'data/embedding_for_lstm_all.npy')

## Simple LSTM Model

In [None]:
%%time
# A simple LSTM with glove embeddings and one dense layer
model = Sequential()
model.add(Embedding(len(word_index) + 1,
                 300,
                 weights=[embedding_matrix],
                 input_length=max_len,
                 trainable=False))

model.add(LSTM(100, activation="tanh",
    recurrent_activation="sigmoid"))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam',metrics=['accuracy', AUC(curve='PR')])
    
model.summary()


    

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 2400, 300)         90077400  
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               160400    
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 90,237,901
Trainable params: 160,501
Non-trainable params: 90,077,400
_________________________________________________________________
CPU times: user 1.75 s, sys: 227 ms, total: 1.98 s
Wall time: 1.32 s


In [None]:
import os
callbacks = make_callbacks('glove_lstm_all')



In [None]:
model.fit(xtrain_pad, 
          ytrain, 
          epochs=120, 
          batch_size=100,
          callbacks=callbacks,
          validation_split=0.2,)

Epoch 1/120
Instructions for updating:
use `tf.profiler.experimental.stop` instead.
Epoch 2/120
Epoch 3/120
Epoch 4/120
Epoch 5/120
Epoch 00005: saving model to /content/drive/My Drive/toximeter_project/training/glove_lstm_allcp-0005.ckpt
Epoch 6/120
Epoch 7/120
Epoch 8/120
Epoch 9/120
Epoch 10/120
Epoch 00010: saving model to /content/drive/My Drive/toximeter_project/training/glove_lstm_allcp-0010.ckpt
Epoch 11/120
Epoch 12/120
Epoch 13/120
Epoch 14/120
Epoch 15/120
Epoch 00015: saving model to /content/drive/My Drive/toximeter_project/training/glove_lstm_allcp-0015.ckpt
Epoch 16/120
Epoch 17/120
Epoch 18/120
Epoch 19/120
Epoch 20/120
Epoch 00020: saving model to /content/drive/My Drive/toximeter_project/training/glove_lstm_allcp-0020.ckpt
Epoch 21/120
Epoch 22/120
Epoch 23/120
Epoch 24/120

In [None]:
# xvalid_pad is the padded validation set built in the tokenization cell
y_pred = model.predict(
    xvalid_pad
)


In [None]:
from toxicity.metrics import run_metrics
# yvalid holds the validation labels returned by model.make_train_test
run_metrics(y_pred>.5, y_pred, yvalid, visualize=True)