# Deep Learning Model for Toxic Comments Classification


In this notebook, our task is to classify comments for toxicity. The comments were collected from Wikipedia talk pages. The dataset for this NLP tutorial comes from a recent Kaggle competition sponsored by Google. The training data consists of over 150k comments labeled by human annotators. The test data is a little over 150k comments, for which we need to predict the labels.

One of the challenges is the imbalance between the positive class (comments labeled toxic) and the negative class.
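To make the imbalance concrete, one could measure the fraction of positive rows per label. The sketch below uses a tiny made-up label matrix (the values are invented, not taken from the competition data); on the real data the same call would presumably be applied to the six label columns.

```python
import numpy as np

# Hypothetical toy label matrix: 8 comments x 3 labels (1 = positive).
# These values are invented for illustration only.
labels = np.array([
    [1, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
])

# Fraction of positive examples in each label column.
positive_rate = labels.mean(axis=0)
print(positive_rate)
```

A column whose rate is close to zero is one where accuracy alone says little, which is why a ranking metric such as ROC AUC is a more informative score here.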

# 1. Import Packages

As always, we first import what we need from NumPy, Pandas, scikit-learn, and Keras.

In [29]:
import numpy as np
np.random.seed(42)
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, log_loss
from sklearn.model_selection import KFold

from keras.models import Model
from keras.layers import Input, Dense, Embedding, SpatialDropout1D, concatenate, CuDNNGRU
from keras.layers import GRU, Bidirectional, GlobalAveragePooling1D, GlobalMaxPooling1D
from keras.preprocessing import text, sequence
from keras.layers import Dropout, BatchNormalization
from keras.callbacks import Callback
from keras import backend as K

# 2. Load Files

Let's load files before we move any further. We read training and test data files provided in csv format using Pandas dataframes. 

We also need to load pre-trained word vector embedding. Later we will describe why and how we use the word embedding in our recurrent neural net model.

In [30]:
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
submission = pd.read_csv('data/sample_submission.csv')

EMBEDDING_FILE_glove = 'vectors/glove.840B.300d.txt'

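Next, we extract the comment text and the six label columns as NumPy arrays, replacing missing comments with a placeholder string: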

In [31]:
X_train = train["comment_text"].fillna("fillna").values
y_train = train[["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]].values
X_test = test["comment_text"].fillna("fillna").values

# 3. Model Parameters

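It's time to set some of the most important model parameters: max_features caps the vocabulary size, maxlen is the number of tokens kept per comment, and embed_size is the dimension of the word vectors (300, matching the GloVe file loaded above). These parameters usually need tuning.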

In [32]:
max_features = 200000
maxlen = 300
embed_size = 300

# 4. Tokenizer
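Now that the basics are set up, let's dive into the actual process of building the deep learning model.
First, we tokenize the texts: each comment is converted into a sequence of integer word indices, and the sequences are padded (or truncated) to a fixed length of maxlen. We do this because a neural network operates on numbers rather than raw strings, and batches need inputs of equal length.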

In [33]:
tokenizer = text.Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(X_train) + list(X_test))
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)
x_train = sequence.pad_sequences(X_train, maxlen=maxlen)
x_test = sequence.pad_sequences(X_test, maxlen=maxlen)
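To see what tokenizing and padding produce, here is a minimal pure-Python sketch. It is a simplified stand-in, not the Keras implementation: the helper names fit_word_index and to_padded_sequences are hypothetical, and the only behavior it mirrors is frequency-ordered indices starting at 1, left padding with 0, and keeping the last maxlen tokens of long texts.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer, most frequent word first (index 1)."""
    counts = Counter(word for t in texts for word in t.lower().split())
    # Index 0 is reserved for padding, so real words start at 1.
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def to_padded_sequences(texts, word_index, maxlen):
    """Convert texts to fixed-length integer lists, left-padded with 0."""
    rows = []
    for t in texts:
        seq = [word_index[w] for w in t.lower().split() if w in word_index]
        seq = seq[-maxlen:]  # keep only the last maxlen tokens
        rows.append([0] * (maxlen - len(seq)) + seq)
    return rows

toy_texts = ["you are nice", "you are not nice at all"]
toy_index = fit_word_index(toy_texts)
toy_padded = to_padded_sequences(toy_texts, toy_index, maxlen=4)
print(toy_padded)
```

The short text is padded on the left with zeros, while the long text loses its first two tokens, so every row ends up with exactly maxlen entries.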

# 5. Word Vector Embedding

## 5.1 GloVe Embedding

In [34]:
embeddings_index = dict()

# Each line holds a word followed by its vector components.
with open(EMBEDDING_FILE_glove, encoding='utf8') as f:
    for line in f:
        # Note: use split(' ') instead of split() if you get an error.
        values = line.split(' ')
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

# Mean and standard deviation of all pre-trained values, used to
# initialize rows for words that have no pre-trained vector.
all_embs = np.stack(list(embeddings_index.values()))
emb_mean, emb_std = all_embs.mean(), all_embs.std()

# Create the weight matrix. Tokenizer indices start at 1, so a full
# vocabulary needs len(word_index) + 1 rows (row 0 is the padding index).
word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index) + 1)

embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))

for word, i in word_index.items():
    if i >= nb_words: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector
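The same idea can be checked on a toy scale. The sketch below assumes a made-up three-word set of pre-trained vectors and a made-up tokenizer index (all names prefixed with toy_ are hypothetical); it shows that known words receive their pre-trained row while unknown words keep a random row.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_dim = 3

# Hypothetical pre-trained vectors for a three-word vocabulary.
toy_vectors = {
    "good": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "bad": np.array([-0.1, -0.2, -0.3], dtype="float32"),
    "ugly": np.array([0.0, -0.5, 0.5], dtype="float32"),
}
# Tokenizer-style index: starts at 1, row 0 is left for padding.
toy_word_index = {"good": 1, "bad": 2, "zzz": 3}

stacked = np.stack(list(toy_vectors.values()))
mean, std = stacked.mean(), stacked.std()

# Start from random rows so out-of-vocabulary words still get a vector.
toy_matrix = rng.normal(mean, std, (len(toy_word_index) + 1, toy_dim))
for word, i in toy_word_index.items():
    vec = toy_vectors.get(word)
    if vec is not None:
        toy_matrix[i] = vec  # overwrite with the pre-trained vector

print(toy_matrix.shape)
```

Here "zzz" has no pre-trained vector, so row 3 stays at its random initialization, drawn with the same mean and standard deviation as the known vectors.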


# 6. RNN Model

Now, we have reached a point where we can develop our recurrent neural net model using Keras.


Here are the steps:

1. We create an embedding layer initialized with the embedding matrix that we built above.
2. Right after the embedding, we add a spatial dropout layer.
3. We use a bidirectional CuDNNGRU as our recurrent layer (if you are not using a GPU, replace CuDNNGRU with GRU).
4. Next, we apply global average pooling and global max pooling and concatenate the results.
5. Finally, we pass the concatenated output into a dense layer with 6 sigmoid outputs, one for each class label.
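The pooling-and-concatenation step can be illustrated in isolation with NumPy. This is only a sketch on a made-up array standing in for the recurrent layer's output; the shapes are chosen for readability and are not those of the model.

```python
import numpy as np

# Hypothetical recurrent output: batch of 2 sequences, 4 timesteps, 3 features.
seq_out = np.arange(24, dtype="float32").reshape(2, 4, 3)

avg_pool = seq_out.mean(axis=1)  # shape (2, 3): average over timesteps
max_pool = seq_out.max(axis=1)   # shape (2, 3): maximum over timesteps
pooled = np.concatenate([avg_pool, max_pool], axis=1)  # shape (2, 6)
print(pooled.shape)
```

Both poolings collapse the time axis, so the concatenated vector has twice as many features as the recurrent output. With a bidirectional layer of 64 units per direction (128 features), the dense layer therefore receives 256 values per comment.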

In [35]:
def get_model():
    inp = Input(shape=(maxlen, ))
    x = Embedding(embedding_matrix.shape[0], embed_size, weights=[embedding_matrix])(inp)
    x = SpatialDropout1D(0.5)(x)
    x = Bidirectional(CuDNNGRU(64, return_sequences=True))(x)
    avg_pool = GlobalAveragePooling1D()(x)
    max_pool = GlobalMaxPooling1D()(x)
    conc = concatenate([avg_pool, max_pool])
    outp = Dense(6, activation="sigmoid")(conc)
    
    model = Model(inputs=inp, outputs=outp)
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

    return model

model = get_model()
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_6 (InputLayer)            (None, 300)          0                                            
__________________________________________________________________________________________________
embedding_6 (Embedding)         (None, 300, 300)     60000000    input_6[0][0]                    
__________________________________________________________________________________________________
spatial_dropout1d_6 (SpatialDro (None, 300, 300)     0           embedding_6[0][0]                
__________________________________________________________________________________________________
bidirectional_6 (Bidirectional) (None, 300, 128)     140544      spatial_dropout1d_6[0][0]        
__________________________________________________________________________________________________
global_ave

# 7. ROC AUC Metric

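The Keras version used here does not provide a ROC AUC metric, so we write our own training loop. After each epoch it computes the mean column-wise ROC AUC on the validation set and stops as soon as that score no longer improves. Note that the variables named loss in the code actually hold this AUC score, so higher is better.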

In [36]:
from sklearn.metrics import roc_auc_score

def _train_model(model, batch_size, train_x, train_y, val_x, val_y):
    # Note: the "loss" variables below hold the mean validation ROC AUC,
    # so higher is better.
    best_loss = -1
    best_weights = None
    best_epoch = 0

    current_epoch = 0

    while True:
        # Train for one epoch, then score the validation set.
        model.fit(train_x, train_y, batch_size=batch_size, epochs=1)
        y_pred = model.predict(val_x, batch_size=batch_size)

        # Mean column-wise ROC AUC over the six labels.
        total_loss = 0
        for j in range(6):
            loss = roc_auc_score(val_y[:, j], y_pred[:, j])
            total_loss += loss

        total_loss /= 6.

        print("Epoch {0} loss {1} best_loss {2}".format(current_epoch, total_loss, best_loss))

        current_epoch += 1
        if total_loss > best_loss or best_loss == -1:
            # New best score: remember the weights.
            best_loss = total_loss
            best_weights = model.get_weights()
            best_epoch = current_epoch
        else:
            # Stop at the first epoch that does not improve on the best score.
            if current_epoch - best_epoch == 1:
                break

    # Restore the weights from the best epoch.
    model.set_weights(best_weights)
    return model
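For intuition about the score itself, ROC AUC can be read as the fraction of (positive, negative) pairs that the model ranks in the right order. The sketch below is a simplified pure-Python stand-in for roc_auc_score, usable only on tiny inputs; the function names pairwise_auc and mean_column_auc and the toy data are hypothetical.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. Quadratic in the number of samples, so only
    suitable for small illustrations.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_column_auc(y_true_rows, y_score_rows):
    """Average the per-label AUC over all label columns."""
    n_cols = len(y_true_rows[0])
    aucs = [
        pairwise_auc([r[j] for r in y_true_rows], [r[j] for r in y_score_rows])
        for j in range(n_cols)
    ]
    return sum(aucs) / n_cols

# Made-up labels and scores: 4 comments x 2 labels.
toy_true = [[1, 0], [0, 0], [1, 1], [0, 1]]
toy_score = [[0.9, 0.2], [0.3, 0.1], [0.4, 0.8], [0.5, 0.7]]
toy_auc = mean_column_auc(toy_true, toy_score)
print(toy_auc)
```

In the first column three of the four positive/negative pairs are ordered correctly (0.75) and in the second all four are (1.0), so the mean is 0.875. Because the score depends only on ranking, it is not distorted by the class imbalance the way accuracy is.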

In [37]:
def train_folds(X, y, fold_count, batch_size, get_model_func):
    fold_size = len(X) // fold_count
    models = []
    for fold_id in range(0, fold_count):
        fold_start = fold_size * fold_id
        fold_end = fold_start + fold_size

        # The last fold absorbs the remainder when len(X) is not divisible by fold_count.
        if fold_id == fold_count - 1:
            fold_end = len(X)

        train_x = np.concatenate([X[:fold_start], X[fold_end:]])
        train_y = np.concatenate([y[:fold_start], y[fold_end:]])

        val_x = X[fold_start:fold_end]
        val_y = y[fold_start:fold_end]

        model = _train_model(get_model_func(), batch_size, train_x, train_y, val_x, val_y)
        models.append(model)

    return models
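The fold boundaries are easy to check by hand. The following sketch isolates the index arithmetic in a hypothetical helper, fold_bounds, with the last fold extended to the end of the data so that no rows are dropped when the sample count is not divisible by the number of folds.

```python
def fold_bounds(n_samples, fold_count):
    """Return (start, end) index pairs for contiguous validation folds."""
    fold_size = n_samples // fold_count
    bounds = []
    for fold_id in range(fold_count):
        start = fold_size * fold_id
        end = start + fold_size
        if fold_id == fold_count - 1:
            end = n_samples  # last fold absorbs the remainder
        bounds.append((start, end))
    return bounds

toy_bounds = fold_bounds(11, 3)
print(toy_bounds)
```

With 11 samples and 3 folds, the first two validation folds have 3 rows each and the last one has 5. Since the folds are contiguous slices, this scheme implicitly assumes the rows are not ordered in a way that correlates with the labels.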

# 8. Train Model

Now is the time to actually train the model. 

In [38]:
nr_folds = 5
batch_size = 32

In [39]:
models = train_folds(x_train, y_train, nr_folds, batch_size, get_model)

Epoch 1/1
Epoch 0 loss 0.9875802368268326 best_loss -1
Epoch 1/1
Epoch 1 loss 0.9894354568612592 best_loss 0.9875802368268326
Epoch 1/1
Epoch 2 loss 0.9899473566953375 best_loss 0.9894354568612592
Epoch 1/1
Epoch 3 loss 0.9885625357296844 best_loss 0.9899473566953375
Epoch 1/1
Epoch 0 loss 0.9860735579645251 best_loss -1
Epoch 1/1
Epoch 1 loss 0.9886888878369549 best_loss 0.9860735579645251
Epoch 1/1
Epoch 2 loss 0.9885022739827657 best_loss 0.9886888878369549
Epoch 1/1
Epoch 0 loss 0.9865644894489897 best_loss -1
Epoch 1/1
Epoch 1 loss 0.9877616814238852 best_loss 0.9865644894489897
Epoch 1/1
Epoch 2 loss 0.987168218123233 best_loss 0.9877616814238852
Epoch 1/1
Epoch 0 loss 0.9854899909067051 best_loss -1
Epoch 1/1
Epoch 1 loss 0.9883896936574522 best_loss 0.9854899909067051
Epoch 1/1
Epoch 2 loss 0.9882708127540151 best_loss 0.9883896936574522
Epoch 1/1
Epoch 0 loss 0.9862962253492219 best_loss -1
Epoch 1/1
Epoch 1 loss 0.9877997708214594 best_loss 0.9862962253492219
Epoch 1/1
Epoch 

# 9. Create Submission

Finally, we create predictions for the test set and write the submission file. The predictions from the folds are ensembled using the geometric mean, and the NumPy prediction array for each fold is also saved.
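As a small worked example of geometric-mean ensembling, the sketch below combines two made-up prediction arrays (the values are invented and assumed to be strictly positive). It also shows an equivalent log-space form, which is an optional variant rather than what this notebook uses.

```python
import numpy as np

# Hypothetical predictions from two folds for 2 comments x 2 labels.
fold_a = np.array([[0.9, 0.1], [0.4, 0.2]])
fold_b = np.array([[0.4, 0.1], [0.9, 0.8]])

# Geometric mean: multiply element-wise, then take the n-th root.
geo_mean = (fold_a * fold_b) ** (1.0 / 2)

# Equivalent form in log space, which avoids underflow with many folds.
geo_mean_log = np.exp(np.mean(np.log([fold_a, fold_b]), axis=0))
print(geo_mean)
```

Compared with the arithmetic mean, the geometric mean is pulled toward the lower prediction: 0.9 and 0.4 combine to 0.6 rather than 0.65.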

In [40]:
CLASSES = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
import os
if not os.path.exists("nlp_tut"):
    os.mkdir("nlp_tut")

In [41]:
print("Predicting results...")
test_predicts_list = []
for fold_id, model in enumerate(models):
    model_path = os.path.join("nlp_tut", "model{0}_weights.npy".format(fold_id))
    np.save(model_path, model.get_weights())

    test_predicts_path = os.path.join("nlp_tut", "test_predicts{0}.npy".format(fold_id))
    test_predicts = model.predict(x_test, batch_size=256)
    test_predicts_list.append(test_predicts)
    np.save(test_predicts_path, test_predicts)

test_predicts = np.ones(test_predicts_list[0].shape)
for fold_predict in test_predicts_list:
    test_predicts *= fold_predict

test_predicts **= (1. / len(test_predicts_list))

test_ids = test["id"].values
test_ids = test_ids.reshape((len(test_ids), 1))

test_predicts = pd.DataFrame(data=test_predicts, columns=CLASSES)
test_predicts["id"] = test_ids
test_predicts = test_predicts[["id"] + CLASSES]
submit_path = os.path.join("nlp_tut", "submit.csv")
test_predicts.to_csv(submit_path, index=False)

print("Finished creating submission!")

Predicting results...
Finished creating submission!


# Acknowledgements

This notebook is based on what I learned from a number of wonderful people who shared their insights, models, and code.