## MVP Siamese LSTM Net

This is a baseline siamese LSTM net. The purpose is to build out the architecture and see whether the net can reach a validation score as good as the classifiers'.

Ideas Implemented:
* Add BatchNormalization - theoretically speeds up training
  * https://arxiv.org/pdf/1502.03167.pdf
* Add EarlyStopping
* Add validation AUC

In [34]:
# data manipulation
import utils
import pandas as pd
import numpy as np
import logging

# Keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential, Model
from keras.layers import Dense, Flatten, LSTM, Conv1D, MaxPooling1D, Dropout, Activation, Input, Add, concatenate, BatchNormalization
from keras.layers.embeddings import Embedding
from keras.utils.vis_utils import model_to_dot
from keras.callbacks import ModelCheckpoint, TensorBoard, Callback, EarlyStopping

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# plotting
from IPython.display import SVG

In [2]:
X_train = utils.load('X_train')
y_train = utils.load('y_train')

## Tokenize and Encode vocabulary

1. Limit the vocab to 20,000 words.
2. Clean questions only and do not lemmatize.
3. Limit the question length to 100 tokens.

In [19]:
vocabulary_size = 20000
max_q_len = 100

X_train_stack = utils.clean_questions(utils.stack_questions(X_train))

tokenizer = Tokenizer(num_words= vocabulary_size)
tokenizer.fit_on_texts(X_train_stack)

sequences = tokenizer.texts_to_sequences(X_train_stack)
data = pad_sequences(sequences, maxlen=max_q_len)

print(data.shape)
data[:,0].sum()

(606398, 100)


7212
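The first-column sum is small for 606,398 rows because `pad_sequences` pads and truncates at the front by default (`padding='pre'`, `truncating='pre'`) and index 0 is reserved for padding, so column 0 is non-zero only for questions that fill all 100 positions. A minimal pure-Python sketch of that default behavior follows; `pad_pre` is a hypothetical helper for illustration, not the Keras implementation.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad or left-truncate one sequence to exactly maxlen items.

    Mirrors the defaults padding='pre', truncating='pre' of Keras
    pad_sequences; pad_pre itself is a hypothetical helper.
    """
    seq = list(seq)[-maxlen:]          # keep only the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq


short = pad_pre([5, 9, 2], maxlen=5)
long_ = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [0, 0, 5, 9, 2]
print(long_)  # [3, 4, 5, 6, 7]
```

Because the padding sits at the front, the real tokens are always the last things the LSTM reads.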

## Embedding Matrix

1. Calculates the embedding matrix utilizing spaCy `en_core_web_lg` word vectors.
  * https://spacy.io/models/en#en_core_web_lg
  * GloVe vectors trained on Common Crawl

In [4]:
try:
    embedding_matrix = utils.load('embedding_matrix')
except Exception:
    # no cached matrix available: build one row per vocabulary index.
    # row 0 stays all zeros because index 0 is reserved for padding
    embedding_matrix = np.zeros((vocabulary_size, 300))
    for word, index in tokenizer.word_index.items():
        # word_index is ordered by frequency, so all later indices are out of range too
        if index > vocabulary_size - 1:
            break
        else:
            embedding_vector = utils.nlp(word).vector
            if embedding_vector is not None:
                embedding_matrix[index] = embedding_vector

    utils.save(embedding_matrix, 'embedding_matrix')

## Define the batch to pass into the network

Split the stacked data into a question 1 array and a question 2 array, with one row per pair.

**Need to clean up this cell in the next model**

In [24]:
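# separate the stacked rows into question 1 and question 2 of each pair
# (assumes utils.stack_questions interleaves them: even rows = question 1, odd rows = question 2)
odd_idx = [i for i in range(data.shape[0]) if i % 2 == 1]
even_idx = [i for i in range(data.shape[0]) if i % 2 == 0]

data_1 = data[even_idx]
data_2 = data[odd_idx]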

# split the data set into a validation set
data_train, data_val, label_train, label_val = train_test_split(np.hstack([data_1, data_2]), 
                                                                y_train, 
                                                                stratify=y_train, 
                                                                test_size = 0.33,
                                                                random_state=42)

# split the concatenation back into 2 data sets for the siamese network
data_1_train = data_train[:, :max_q_len]
data_2_train = data_train[:, max_q_len:]
data_1_val = data_val[:, :max_q_len]
data_2_val = data_val[:, max_q_len:]

print(f'Train major class: {len(label_train[label_train == 0]) / len(label_train):.2}')
print(f'Val major class: {len(label_val[label_val == 0]) / len(label_val):.2}')

Train major class: 0.63
Val major class: 0.63
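A small sketch of the split-and-rejoin pattern from the cell above, on a toy array. It assumes `utils.stack_questions` interleaves rows (question 1 of a pair on even rows, question 2 on odd rows); that layout is an assumption, not something this notebook confirms. The toy values and the short `max_q_len` are made up for illustration.

```python
import numpy as np

max_q_len = 3  # toy length; the notebook uses 100

# Toy stacked array: rows alternate question 1, question 2 for each pair
# (assumed layout of utils.stack_questions).
stacked = np.array([
    [0, 1, 2],   # pair 0, question 1
    [0, 3, 4],   # pair 0, question 2
    [5, 6, 7],   # pair 1, question 1
    [0, 0, 8],   # pair 1, question 2
])

q1 = stacked[0::2]   # even rows
q2 = stacked[1::2]   # odd rows

# Join each pair into one row so a single split keeps both questions together...
pairs = np.hstack([q1, q2])

# ...then cut the columns back apart after splitting.
q1_back = pairs[:, :max_q_len]
q2_back = pairs[:, max_q_len:]

print(pairs.shape)  # (2, 6)
```

Slicing with a step (`[0::2]`, `[1::2]`) avoids building the index lists, and checking that `q1` and `q2` differ is a cheap guard against feeding the same question to both legs.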


## Calculate AUC after each epoch

In [38]:
"""
Check the AUC score on a validation set every `interval` epochs.
Useful for choosing the number of epochs.
"""

class IntervalEvaluation(Callback):
    def __init__(self, validation_data=(), interval=10):
        super().__init__()

        self.interval = interval
        self.X_val, self.y_val = validation_data

    def on_epoch_end(self, epoch, logs=None):
        # epoch is 0-based, so this also fires after the first epoch
        if epoch % self.interval == 0:
            y_pred = self.model.predict(self.X_val, verbose=0)
            score = roc_auc_score(self.y_val, y_pred)
            # logging.info prints nothing unless the logging level is INFO or lower
            logging.info("interval evaluation - epoch: {:d} - score: {:.6f}".format(epoch, score))

## Build out legs of the siamese network

The architecture is the following:

0. Input - (100,) word tensor
1. Embedding Layer - outputs (100, 300) **not trainable**
2. LSTM - default settings, outputs (300,)
3. Concatenate the outputs of the two legs - outputs (600,)
4. BatchNormalization
5. Dropout - 20%
6. Dense - outputs (100,), activation `tanh` -- somewhat random decision
7. BatchNormalization
8. Dropout - 20%
9. Dense - outputs (1,), activation `sigmoid`
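As a sanity check on the layer sizes listed above, the parameter counts can be worked out by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the first two results match the `embedding_2` and `lstm_2` rows that `model.summary()` reports in this notebook, while the dense count is computed here and does not appear in the truncated summary. Both legs share the same embedding and LSTM layers, so their weights are counted once.

```python
vocabulary_size = 20000
embedding_dim = 300
lstm_units = 300
dense_units = 100

# Embedding: one row of embedding_dim weights per vocabulary index.
embedding_params = vocabulary_size * embedding_dim

# LSTM: 4 gate blocks, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# First Dense layer on the concatenated (600,) vector: weights plus biases.
dense_params = (2 * lstm_units) * dense_units + dense_units

print(embedding_params)  # 6000000
print(lstm_params)       # 721200
print(dense_params)      # 60100
```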

In [32]:
# Creating word embedding layer
embedding_layer = Embedding(vocabulary_size, 300, input_length=100, 
                                     weights=[embedding_matrix], trainable=False)

# Creating LSTM Encoder
# Bidirectional(LSTM(self.number_lstm_units, dropout=self.rate_drop_lstm, recurrent_dropout=self.rate_drop_lstm))
lstm_layer = LSTM(300)

# Creating LSTM Encoder layer for First Sentence
sequence_1_input = Input(shape=(100,), dtype='int32')
embedded_sequences_1 = embedding_layer(sequence_1_input)
x1 = lstm_layer(embedded_sequences_1)

# Creating LSTM Encoder layer for Second Sentence
sequence_2_input = Input(shape=(100,), dtype='int32')
embedded_sequences_2 = embedding_layer(sequence_2_input)
x2 = lstm_layer(embedded_sequences_2)



In [33]:
# Merge the two LSTM-encoded sentence vectors and pass them
# through dense layers with batch normalization and dropout

merged = concatenate([x1, x2])
merged = BatchNormalization()(merged)
merged = Dropout(.2)(merged)
merged = Dense(100, activation='tanh')(merged)
merged = BatchNormalization()(merged)
merged = Dropout(0.2)(merged)
preds = Dense(1, activation='sigmoid')(merged)

model = Model(inputs=[sequence_1_input, sequence_2_input], outputs=preds)
model.compile(loss='binary_crossentropy', optimizer='nadam', metrics=['acc'])
# SVG(model_to_dot(model, show_shapes=True).create(prog='dot', format='svg'))

model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_3 (InputLayer)            (None, 100)          0                                            
__________________________________________________________________________________________________
input_4 (InputLayer)            (None, 100)          0                                            
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, 100, 300)     6000000     input_3[0][0]                    
                                                                 input_4[0][0]                    
__________________________________________________________________________________________________
lstm_2 (LSTM)                   (None, 300)          721200      embedding_2[0][0]                
          

In [39]:
# Callbacks

file_path = '../data/keras_models/mvp_batch_norm{epoch:02d}-{val_loss:.2f}.hdf5'
model_checkpoint = ModelCheckpoint(filepath=file_path, save_best_only=True)


tensorboard = TensorBoard(log_dir='../data/tensorboard')

early_stopping = EarlyStopping(monitor='val_loss', 
                               min_delta=0, 
                               patience=3, 
                               verbose=1, 
                               mode='auto', 
                               restore_best_weights=True)

calc_auc = IntervalEvaluation(([data_1_val, data_2_val], label_val), interval=1)
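To make the `patience=3` setting concrete, here is a simplified pure-Python sketch of patience-based early stopping on a loss that should decrease. The function `early_stop_epoch` and the loss values are hypothetical and for illustration only; this is not the Keras implementation, and it ignores details such as `mode` and weight restoration.

```python
def early_stop_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-based.

    Simplified sketch of patience-based early stopping; not the Keras code.
    stop_epoch is None if training would run through every epoch.
    """
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # improvement: remember it and reset the patience counter
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Hypothetical validation losses, for illustration only.
losses = [0.52, 0.48, 0.47, 0.49, 0.48, 0.50, 0.46]
stop, best = early_stop_epoch(losses, patience=3)
print(stop, best)  # 5 2
```

In this toy run training stops after three epochs without improvement, and with `restore_best_weights=True` the model would keep the weights from the best epoch rather than the last one. The later, lower loss is never reached, which is the trade-off of a small patience.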

In [None]:
model.fit([data_1_train, data_2_train], label_train,
          validation_data=([data_1_val, data_2_val], label_val),
          epochs=20, batch_size=64, shuffle=True,
          callbacks=[model_checkpoint, tensorboard, early_stopping, calc_auc])

Train on 203143 samples, validate on 100056 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20

## Results

The best average validation accuracy from the classification models is 0.783667; the siamese network reaches a very similar validation accuracy.

### Next Steps
Implement one or many of the ideas below.

Future Ideas:
* Explore LSTM settings
* Dropout rates
* Adding or removing the dense layers
* Add BatchNormalization - theoretically speeds up training
  * https://arxiv.org/pdf/1502.03167.pdf
* Add EarlyStopping

Change the validation scoring to AUC, since the majority class represents 63% of the data, so always predicting the majority class already yields 63% accuracy.
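A toy illustration of why AUC is the more informative score here. The sketch computes accuracy and ROC AUC in pure Python for a "model" that gives every pair the same score; the helper functions and the data are made up for illustration, and `auc` uses the pairwise-ranking definition of ROC AUC rather than calling scikit-learn.

```python
def accuracy(y_true, y_prob, threshold=0.5):
    preds = [1 if p >= threshold else 0 for p in y_prob]
    return sum(p == t for p, t in zip(preds, y_true)) / len(y_true)


def auc(y_true, y_prob):
    """ROC AUC as the fraction of (positive, negative) pairs ranked
    correctly, counting ties as half. O(n^2), fine for a toy example."""
    pos = [p for p, t in zip(y_prob, y_true) if t == 1]
    neg = [p for p, t in zip(y_prob, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels with the notebook's 63% / 37% class split.
y_true = [0] * 63 + [1] * 37

# A "model" that gives every pair the same low score.
constant = [0.1] * 100

print(accuracy(y_true, constant))  # 0.63
print(auc(y_true, constant))       # 0.5
```

The constant predictor looks respectable on accuracy but scores exactly chance level on AUC, because AUC only rewards ranking duplicates above non-duplicates.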

## Example Code

The gist below shows how to calculate AUC at the end of each epoch on the validation data. **Add this to the next model**.

https://gist.github.com/smly/d29d079100f8d81b905e