# Neural Network Training Overview
As mentioned in the project README, I used two different model components to generate the final ensemble predictions. This notebook covers the Recurrent Neural Network (RNN) approach with trainable word embeddings. Specifically, I used three different word embeddings:

- Global Vectors for Word Representation (GloVe) trained on 840B common crawl tokens with 2.2M vocab and represented by 300-d vectors: https://nlp.stanford.edu/projects/glove/ 
- FastText Crawl trained on 600B tokens with 2M word vectors and 300-d vector representation: https://www.kaggle.com/yekenot/fasttext-crawl-300d-2m
- GloVe trained on 2B tweets with 27B tokens, 1.2M word vectors and a 200-d vector representation: https://nlp.stanford.edu/projects/glove/

The three RNN models all used the same architecture, consisting of a single bi-directional Gated Recurrent Unit (GRU) layer with spatial dropout prior to the GRU layer and max / average pooling following the GRU layer. Thus, the three models are identical other than the word embeddings used for training.

Individually, each model scored an AUC above 0.9805, which on its own would place around the 50th percentile of the leaderboard. When ensembled with each other and with the linear models shown in the Linear Model notebook, the AUC rose above 0.9850, which placed in roughly the top 20–25%.

The code in this notebook was run on Google's Datalab cloud computing environment to take advantage of the training speed of a GPU (Tesla K80). As a result, a few code blocks are specific to Google Cloud, such as importing "objects" (text or csv files) from Google Cloud Storage and installing Keras. I will note where the Google Cloud specific code occurs.

In [1]:
# Google cloud does not come preinstalled with Keras
!pip install keras

Collecting keras
  Using cached Keras-2.1.4-py2.py3-none-any.whl
Installing collected packages: keras
Successfully installed keras-2.1.4


In [2]:
import google.datalab.storage as store # required for accessing objects in Google Storage
from io import BytesIO, StringIO # required for opening files from Google Storage
import pandas as pd
import numpy as np
pd.options.mode.chained_assignment = None
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [3]:
from keras.optimizers import Nadam
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, GRU, Embedding, Dropout, Activation, concatenate
from keras.layers import Bidirectional, GlobalMaxPooling1D, SpatialDropout1D, GlobalAveragePooling1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers
from keras.callbacks import Callback, History

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


In [4]:
# trained on an 8-core instance with Tesla K80 GPU
import os
os.environ['OMP_NUM_THREADS'] = '8'

# Loading Data
The training and testing data provided by Kaggle were stored in a Google Cloud Storage bucket and are loaded into train_object and test_object using Google's Datalab storage API.

In [5]:
train_object = store.Object(bucket='marks_toxic_comments', key='train.csv').download()
test_object = store.Object(bucket='marks_toxic_comments', key='test.csv').download()

In [6]:
train = pd.read_csv(BytesIO(train_object))
test = pd.read_csv(BytesIO(test_object))

del train_object
del test_object

There are six binary classes to predict, shown in the list_classes list below. The training and testing comments are each a 1-d vector of text entries (words, sentences, or paragraphs) of varying lengths. Depending on its content, a comment can belong to several categories at once, i.e. it can be toxic, severe_toxic, and obscene at the same time.

In [7]:
list_classes = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
y_train = train[list_classes]

train_comments = train.comment_text
test_comments = test.comment_text

# Tokenization
The code below uses Keras' Tokenizer to build a token feature set capped at 30,000 features (max_features), keeping the most frequent tokens in the dataset. The tokenizer is fit on both the training and testing comments to improve accuracy, since the two datasets come from slightly different distributions owing to changes made during the competition. Finally, every example is truncated or padded to 100 tokens, as set by maxlen.

In [8]:
max_features = 30000
maxlen = 100

tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_comments) + list(test_comments))

train_tokenized = tokenizer.texts_to_sequences(train_comments)
test_tokenized = tokenizer.texts_to_sequences(test_comments)

X_train = pad_sequences(train_tokenized, maxlen=maxlen)
X_test = pad_sequences(test_tokenized, maxlen=maxlen)
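
To make the cut-or-pad step concrete, here is a minimal NumPy-only sketch of what pad_sequences does with its default settings, which (to my understanding) pad and truncate at the start of each sequence. The helper name pad_or_truncate and the toy sequences are assumptions for illustration and are not part of the notebook.

```python
import numpy as np

def pad_or_truncate(sequences, maxlen, value=0):
    """Hypothetical helper mirroring pad_sequences' default 'pre' behaviour."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # empty comment: the row stays all padding
        trimmed = seq[-maxlen:]             # 'pre' truncation keeps the last maxlen tokens
        out[row, -len(trimmed):] = trimmed  # 'pre' padding leaves the zeros on the left
    return out

# Toy token-index sequences (assumed values, not taken from the dataset)
toy_sequences = [[5, 8, 2], [7, 1, 9, 4, 3, 6]]
padded = pad_or_truncate(toy_sequences, maxlen=4)
print(padded)
```

The short sequence is left-padded with zeros and the long one keeps only its final four tokens, so with maxlen = 100 the end of a long comment is what reaches the model.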

# Word Embeddings
Next, the word embeddings are loaded from the various text files. When using Google Datalab, the .txt files were stored in a Storage bucket and downloaded using the data storage API.

In [9]:
embedding_object = store.Object(bucket='marks_toxic_comments', key='glove.twitter.27B.200d.txt').download()

The code cells below build an embeddings dictionary with each word as the key and its 200-d or 300-d vector (depending on the embedding type) as the value. Each text file needs slightly different handling to load the words and vectors correctly, so three different code blocks are provided, one per embedding.

In [10]:
# Load 'glove.twitter.27B.200d.txt' (the embedding used in this run)
embed_size = 200
embeddings_index = {}

for line in BytesIO(embedding_object):
    # str() on a bytes line gives its repr, e.g. b'word 0.1 ...\n' with a literal
    # backslash-n, so the b' prefix and the trailing \n' are stripped below
    values = str(line).rstrip().rsplit(' ')    
    values[-1] = values[-1].replace("\\n'", "")
    values[-1] = values[-1].replace('\\n"', "")
    
    word = values[0].replace("b'", "")
    coefs = np.asarray(values[1:], dtype=np.float32)
    embeddings_index[word] = coefs
    
print('Loaded %s word vectors.' % len(embeddings_index))

Loaded 1193514 word vectors.


In [None]:
#Uncomment to load 'glove.840B.300d.txt'
#embed_size=300
#embeddings_index = {}
#
#for line in BytesIO(embedding_object):
#    values = str(line).rstrip().rsplit(' ')
#    values[-1] = values[-1].replace("\\n'", "")
#    values[-1] = values[-1].replace('\\n"', "")
#    
#    word = values[0].replace("b'", "")
#    coefs = np.asarray(values[1:], dtype=np.float32)
#    embeddings_index[word] = coefs
#    
#print('Loaded %s word vectors.' % len(embeddings_index))

In [None]:
#Uncomment to load 'crawl-300d-2M.vec'
#embed_size=300
#embeddings_index = {}
#
#count=0
#for line in BytesIO(embedding_object):
#  values = str(line).rstrip().rsplit(' ')[:-1]
#  if len(values) == embed_size + 1:
#    word = values[0].replace("b'", "")
#    coefs = np.asarray(values[1:], dtype=np.float32)
#    embeddings_index[word] = coefs
#    
#print('Loaded %s word vectors.' % len(embeddings_index))
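
The three loading cells above differ mainly in how they clean up each line. As a hedged alternative, the sketch below decodes each line as UTF-8 instead of working on its string repr, and skips any line whose field count does not match the expected dimension (which also drops a count/dimension header line such as the one the FastText cell filters out with its length check). The function name parse_embeddings and the toy bytes are assumptions for illustration. This is not the code that produced the counts reported in this notebook.

```python
import numpy as np
from io import BytesIO

def parse_embeddings(raw_bytes, embed_size):
    """Hypothetical helper: build {word: vector} from GloVe-style text bytes."""
    index = {}
    for line in BytesIO(raw_bytes):
        parts = line.decode('utf-8').rstrip().split(' ')
        if len(parts) != embed_size + 1:
            continue  # header rows and malformed lines are skipped
        index[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return index

# Toy file: a "count dimension" header followed by two 3-d vectors (assumed data)
toy_file = b"2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n"
toy_index = parse_embeddings(toy_file, embed_size=3)
print(sorted(toy_index), toy_index['hello'])
```

Because the keys are decoded text rather than repr fragments, they should line up more directly with the tokenizer's vocabulary, though I have not measured whether that changes the match rate.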

In [11]:
import gc

del embedding_object
gc.collect();

Next, the 30,000 tokens selected during the tokenization step are matched with their vector representations from the embeddings dictionary. The result is a 30,000 x 200 array (for the 200-d Twitter embeddings) with one row per token and one column per embedding dimension. Tokens that are missing from the embedding set keep an all-zero row. In this run, about 84% of the tokens matched an embedding vector.

In [12]:
word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
embedding_matrix = np.zeros((nb_words, embed_size))

count=0
for word, i in word_index.items():
    if i >= max_features: 
        continue
        
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: 
        count += 1
        embedding_matrix[i] = embedding_vector
        
del embedding_vector
gc.collect()
        
print('Added %s embeddings,' % count, '%s percent of total' % str(round(100*count / max_features, 1)))

Added 25332 embeddings, 84.4 percent of total
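
The same matching logic can be followed on a toy vocabulary. The sketch below uses made-up words and 2-d vectors (all names prefixed with toy_ are assumptions) to show two details: row 0 is never filled because Keras word indices start at 1, and out-of-vocabulary tokens keep their zero row.

```python
import numpy as np

# Keras-style word_index: ranks start at 1, most frequent word first (assumed toy data)
toy_word_index = {'the': 1, 'cat': 2, 'zzyzx': 3, 'sat': 4}
toy_embeddings = {
    'the': np.array([0.1, 0.2]),
    'cat': np.array([0.3, 0.4]),
    'sat': np.array([0.5, 0.6]),
}
toy_max_features = 4  # only indices 1..3 are usable, as with Tokenizer(num_words=4)
toy_embed_size = 2

toy_matrix = np.zeros((toy_max_features, toy_embed_size))
matched = 0
for word, idx in toy_word_index.items():
    if idx >= toy_max_features:
        continue  # rank is beyond the feature cap ('sat' here)
    vector = toy_embeddings.get(word)
    if vector is not None:
        toy_matrix[idx] = vector
        matched += 1

print(toy_matrix)
print('matched', matched, 'of', toy_max_features - 1, 'usable indices')
```

Here 'zzyzx' has no pretrained vector, so row 3 stays at zero. Since the embedding layer is trainable, such rows can still be updated during training.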


# Keras GRU Model with Embeddings
With the embedding matrix created and the training and testing datasets transformed into n_samples x 100 arrays of token indices, a Recurrent Neural Network (RNN) can be trained on the data. The architecture is shown below: an embedding layer with trainable weights, spatial dropout, a bidirectional GRU, and a combination of average and max pooling. The dense layer at the end corresponds to the six output variables. The Nadam optimizer was used, as it showed very slight improvements over the standard Adam optimizer.

### Creating Model Architecture

In [13]:
def create_model(output_num):
    inp = Input(shape=(maxlen, ))
    X = Embedding(max_features, embed_size, weights=[embedding_matrix], trainable=True)(inp)
    X = SpatialDropout1D(0.2)(X)
    X = Bidirectional(GRU(80, return_sequences=True))(X)
    avg_pool = GlobalAveragePooling1D()(X)
    max_pool = GlobalMaxPooling1D()(X)
    conc = concatenate([avg_pool, max_pool])
    X = Dense(output_num, activation='sigmoid')(conc)
    
    model = Model(inputs=inp, outputs=X)
    optimizer = Nadam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, schedule_decay=0.01)

    model.compile(loss='binary_crossentropy',
                  optimizer=optimizer,
                  metrics=None)
    
    return model
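
To see how the tensor shapes evolve after the GRU layer, the following NumPy sketch stands in random numbers for the GRU output and applies the same pooling and concatenation. It assumes the default Bidirectional merge mode, which concatenates the forward and backward states. The batch size of 2 is arbitrary.

```python
import numpy as np

batch, timesteps, gru_units = 2, 100, 80

# Stand-in for the Bidirectional GRU output with return_sequences=True:
# forward and backward states are concatenated, giving 2 * gru_units features per timestep.
rng = np.random.default_rng(0)
gru_out = rng.standard_normal((batch, timesteps, 2 * gru_units))

avg_pool = gru_out.mean(axis=1)  # what GlobalAveragePooling1D computes: mean over time
max_pool = gru_out.max(axis=1)   # what GlobalMaxPooling1D computes: max over time
conc = np.concatenate([avg_pool, max_pool], axis=1)

print(avg_pool.shape, max_pool.shape, conc.shape)
```

Each comment is therefore summarised by a 320-d vector (160 averaged features plus 160 max features) before the final six-unit sigmoid layer.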


### ROC Evaluation 
The metric specified by Kaggle is the mean ROC AUC score across the six categories. The callback below calculates and reports the ROC AUC score on the validation data after each epoch.

In [14]:
from sklearn.metrics import roc_auc_score

class RocAucEvaluation(Callback):
    def __init__(self, validation_data=(), interval=1):
        super(RocAucEvaluation, self).__init__()

        self.interval = interval
        self.X_val, self.y_val = validation_data

    def on_epoch_end(self, epoch, logs=None):
        if epoch % self.interval == 0:
            y_pred = self.model.predict(self.X_val, verbose=0)
            score = roc_auc_score(self.y_val, y_pred)
            print("ROC-AUC - epoch: %d - score: %.6f \n" % (epoch+1, score))
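
For reference, the sketch below spells out what a mean column-wise AUC computes, using only NumPy. To my understanding, roc_auc_score called on 2-d multilabel arrays with its default macro averaging returns the same unweighted mean of per-column AUCs. The helper names auc_binary and mean_columnwise_auc and the toy arrays are assumptions for illustration. The pairwise comparison is quadratic in memory, so it suits small examples only.

```python
import numpy as np

def auc_binary(y_true, y_score):
    """Hypothetical helper: share of (positive, negative) pairs ranked correctly, ties count half."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

def mean_columnwise_auc(y_true, y_score):
    """Hypothetical helper: unweighted mean of the per-label AUCs."""
    per_label = [auc_binary(y_true[:, k], y_score[:, k]) for k in range(y_true.shape[1])]
    return float(np.mean(per_label))

# Two labels, four samples (assumed toy data)
toy_true = np.array([[1, 0], [0, 0], [1, 1], [0, 1]])
toy_pred = np.array([[0.9, 0.2], [0.1, 0.75], [0.4, 0.7], [0.3, 0.8]])
print(mean_columnwise_auc(toy_true, toy_pred))
```

In the toy data the first label is ranked perfectly (AUC 1.0) and the second has one misordered pair out of four (AUC 0.75), so the mean is 0.875.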

### Cross-Validation
The function below performs n-fold cross-validation for the specified model. It uses stratified splitting because of significant class imbalances across the output categories. Since StratifiedKFold takes a single label vector, the folds are stratified on the first label column (toxic) only.

In [15]:
def model_cross_val(X, y, n_folds=5, batch_size=256, epochs = 2, output_num=6, verbose=1):
    
    from sklearn.model_selection import StratifiedKFold
    skf = StratifiedKFold(n_splits=n_folds, shuffle=False)  # random_state only applies when shuffle=True
    
    index=0
    for train_indices, val_indices in skf.split(X, y[:,0]):
        index+=1
        print("Training on fold " + str(index) + "/" + str(n_folds))
        X_train_split, X_val = X[train_indices], X[val_indices]
        y_train_split, y_val = y[train_indices], y[val_indices]

        RocAuc = RocAucEvaluation(validation_data=(X_val, y_val), interval=1)
    
        model = create_model(output_num=output_num)
        model.fit(X_train_split, y_train_split, 
                  batch_size=batch_size, 
                  epochs=epochs, 
                  validation_data=(X_val, y_val),
                  callbacks=[RocAuc],
                  verbose=verbose)
  
    return model  # the model fitted on the final fold
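
The idea behind stratified splitting can be shown with a deliberately simplified sketch: each class's samples are dealt round-robin into folds, so every fold keeps the overall positive rate. This is not scikit-learn's actual algorithm. The helper name stratified_fold_ids and the toy labels are assumptions for illustration.

```python
import numpy as np

def stratified_fold_ids(labels, n_folds):
    """Hypothetical helper: deal each class's samples round-robin into folds."""
    fold_ids = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        members = np.flatnonzero(labels == cls)
        fold_ids[members] = np.arange(len(members)) % n_folds
    return fold_ids

# 10% positives, standing in for an imbalanced label column (assumed toy data)
toy_labels = np.array([1] * 10 + [0] * 90)
toy_folds = stratified_fold_ids(toy_labels, n_folds=5)

for fold in range(5):
    in_fold = toy_folds == fold
    print(fold, in_fold.sum(), toy_labels[in_fold].mean())
```

Every fold ends up with 20 samples and a 10% positive rate. Without stratification, a rare label could be nearly absent from a validation fold, which would make its AUC unstable.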

### Train on Full Data and Predict
After tuning the hyperparameters through cross-validation, the function below trains the model on all of the training data and creates predictions that can be uploaded and scored through the Kaggle interface.

In [None]:
def model_predict(X_train, y_train, X_test, list_classes, ids, batch_size=256, epochs=2, output_num=6, verbose=1):
    model = create_model(output_num=output_num)
    model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, verbose=verbose)
    predictions = model.predict(X_test, batch_size=1024)
    if output_num == 6:
        results_df = pd.DataFrame(predictions, columns=list_classes)
        results_df = pd.concat([pd.Series(ids, name='id'), results_df], axis=1)
    else:
        results_df = predictions
    
    return model, results_df

# Cross-Validation Training
The output below shows an example experimental run to optimize the neural network hyperparameters. I used a fractional factorial experimental design to efficiently explore hyperparameters including GRU units, maxlen of tokens, number of features, optimizers, and dropout values.

In the end, the architecture shown above was the best-performing model through cross-validation and Kaggle leaderboard results. The cross-validation was performed using a mini-batch size of 128 or 256 to speed up training time (~2 minutes per epoch on the GPU), but was decreased to 32 when training the final model for predictions. This decrease in batch size led to slower training times due to a loss in vectorization, but also provided higher overall accuracy.

In [27]:
model_fitted = model_cross_val(X_train, y_train.values, batch_size=256, n_folds=5, verbose=1)

Training on fold 1/5
Train on 76680 samples, validate on 19171 samples
Epoch 1/2
ROC-AUC - epoch: 1 - score: 0.972810 

Epoch 2/2
ROC-AUC - epoch: 2 - score: 0.978433 

Training on fold 2/5
Train on 76680 samples, validate on 19171 samples
Epoch 1/2
ROC-AUC - epoch: 1 - score: 0.975653 

Epoch 2/2
ROC-AUC - epoch: 2 - score: 0.980987 

Training on fold 3/5
Train on 76681 samples, validate on 19170 samples
Epoch 1/2
ROC-AUC - epoch: 1 - score: 0.975887 

Epoch 2/2
ROC-AUC - epoch: 2 - score: 0.979889 

Training on fold 4/5
Train on 76681 samples, validate on 19170 samples
Epoch 1/2
ROC-AUC - epoch: 1 - score: 0.973174 

Epoch 2/2
ROC-AUC - epoch: 2 - score: 0.978874 

Training on fold 5/5
Train on 76682 samples, validate on 19169 samples
Epoch 1/2
ROC-AUC - epoch: 1 - score: 0.973463 

Epoch 2/2
ROC-AUC - epoch: 2 - score: 0.979768 



# Create Predictions
The model below is trained on all of the training data and then used to create predictions for the test set.

In [30]:
full_model, results_df = model_predict(X_train, y_train, X_test, list_classes, batch_size=256, ids=test['id'])

Epoch 1/2
Epoch 2/2


# Creating Submission

In [31]:
bucket = store.Bucket(name='marks_toxic_comments')
output_object = bucket.object('submission_38.csv')
output_object.write_stream(results_df.to_csv(index=False), 'text/csv')