# Attention Sandwich model

In the [forked model](https://www.kaggle.com/nicksexton/different-embeddings-with-attention-fork-fork) we've inherited some code for an attention mechanism that sits on top of the two LSTMs that make up the recurrent part of the model.

In this notebook, I'm going to implement an attention mechanism that sits *between* the two LSTMs (or more, if we decide to continue stacking). This has the advantage that at each point in its sequence, the second LSTM (which we'll call LSTM_q) can look at a much wider context than just its own hidden state and the current output of the previous LSTM (which we'll call LSTM_p). It also means that the two LSTMs needn't be aligned (i.e., they don't need to have the same number of timesteps).

LSTM_p will be bidirectional, and will have the same number of timesteps (Tp) as the maximum sequence length. Its outputs feed into an attention mechanism that computes a context vector, which is fed into LSTM_q. LSTM_q has its own, independently chosen sequence length (Tq).

This frees up LSTM_q so that, across Tq timesteps, it can learn its own articulated representation of what a question is. (For example, if Tq=3, the steps might correspond to representations of the beginning, middle, and end of the question.) The important point is that the model develops its own representations of Quora questions, rather than following any of my own preconceived ideas.

Finally, the states of LSTM_q across the Tq timesteps are concatenated and fed into a fully connected layer, and then into a classifier.

The model is quite slow to train (~200s per epoch on my GPU), but I think it produces promising enough results from a single set of embeddings to be worth experimenting with.

The code for the attention model is adapted from Andrew Ng's Deep Learning Coursera course, with some changes to make it applicable to a sequence-to-binary-classification model.

In [1]:
import os
import time
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm
import math
from sklearn.model_selection import train_test_split
from sklearn import metrics

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, CuDNNLSTM, Embedding, Dropout, Activation, CuDNNGRU, Conv1D, Concatenate, Flatten, RepeatVector, Dot, LSTM
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.optimizers import Adam, RMSprop
from keras.models import Model
from keras import backend as K
from keras.engine.topology import Layer
from keras import initializers, regularizers, constraints, optimizers, layers

Using TensorFlow backend.


In [2]:
train_df = pd.read_csv("../input/train.csv")
test_df = pd.read_csv("../input/test.csv")
print("Train shape : ",train_df.shape)
print("Test shape : ",test_df.shape)

Train shape :  (1306122, 3)
Test shape :  (56370, 2)


Next steps are as follows:
 * Split the training dataset into train and validation samples. Cross-validation is time-consuming, so we use a simple train/validation split.
 * Fill the missing values in the text column with '_##_'.
 * Tokenize the text column and convert it to integer sequences.
 * Pad the sequences as needed: if a text has more than 'maxlen' words, truncate it to 'maxlen'; if it has fewer, pad the remaining positions with zeros.
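As a rough illustration of the padding step, the sketch below mimics what `pad_sequences` does to a single sequence with its default settings, which pad and truncate at the *start* of the sequence. The helper name `pad_or_truncate` and the toy token ids are made up for this example.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Mimic Keras' default pad_sequences behaviour for one sequence."""
    if len(seq) >= maxlen:
        # 'pre' truncation: drop tokens from the start, keep the last maxlen
        return list(seq[-maxlen:])
    # 'pre' padding: prepend zeros so the real tokens sit at the end
    return [value] * (maxlen - len(seq)) + list(seq)

short_q = pad_or_truncate([5, 8, 2], maxlen=5)
long_q = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short_q)  # [0, 0, 5, 8, 2]
print(long_q)   # [3, 4, 5, 6, 7]
```

So for short questions the zeros come first, and the last timesteps of LSTM_p see the actual words.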

In [3]:
## split to train and val
train_df, val_df = train_test_split(train_df, test_size=0.08, random_state=2018)

## some config values 
embed_size = 300 # how big is each word vector
max_features = 95000 # how many unique words to use (i.e. number of rows in the embedding matrix)
maxlen = 70 # max number of words in a question to use


## fill up the missing values
train_X = train_df["question_text"].fillna("_##_").values
val_X = val_df["question_text"].fillna("_##_").values
test_X = test_df["question_text"].fillna("_##_").values

## Tokenize the sentences
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_X))
train_X = tokenizer.texts_to_sequences(train_X)
val_X = tokenizer.texts_to_sequences(val_X)
test_X = tokenizer.texts_to_sequences(test_X)

## Pad the sentences 
train_X = pad_sequences(train_X, maxlen=maxlen)
val_X = pad_sequences(val_X, maxlen=maxlen)
test_X = pad_sequences(test_X, maxlen=maxlen)

## Get the target values
train_y = train_df['target'].values
val_y = val_df['target'].values

In [4]:
#shuffling the data
#np.random.seed(2018)
#trn_idx = np.random.permutation(len(train_X))
#val_idx = np.random.permutation(len(val_X))

#train_X = train_X[trn_idx]
#val_X = val_X[val_idx]
#train_y = train_y[trn_idx]
#val_y = val_y[val_idx]

# Code for the Attention Layer

In [44]:
K.clear_session()

In [45]:
# custom softmax activation function
def softmax(x, axis=1):
    """Softmax activation function.
    # Arguments
        x : Tensor.
        axis: Integer, axis along which the softmax normalization is applied.
    # Returns
        Tensor, output of softmax transformation.
    # Raises
        ValueError: In case `dim(x) == 1`.
    """
    ndim = K.ndim(x)
    if ndim == 2:
        return K.softmax(x)
    elif ndim > 2:
        e = K.exp(x - K.max(x, axis=axis, keepdims=True))
        s = K.sum(e, axis=axis, keepdims=True)
        return e / s
    else:
        raise ValueError('Cannot apply softmax to a tensor that is 1D')
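To see why the softmax has to run over axis 1, here is a small NumPy sketch (NumPy standing in for the Keras backend, with toy shapes). The energies have shape (m, Tp, 1), so normalising over the last axis would turn every weight into 1, whereas normalising over axis 1 gives a distribution over the Tp timesteps.

```python
import numpy as np

def softmax_np(x, axis=1):
    # Subtract the max for numerical stability, as in the Keras version above
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

# Toy energies: batch of m=2 examples, Tp=4 timesteps, 1 energy per timestep
energies = np.random.RandomState(0).randn(2, 4, 1)

alphas = softmax_np(energies, axis=1)        # normalise across timesteps
last_axis = softmax_np(energies, axis=-1)    # normalise across the size-1 axis

print(alphas.sum(axis=1).ravel())   # each example's weights sum to 1
print(last_axis.ravel())            # every entry is 1: no attention at all
```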

In [46]:
#one_step_attention

def one_step_attention(h, s_prev):
    """
    Performs one step of attention: outputs a context vector computed as a dot product of the attention weights
    "alphas" and the hidden states "h" of the Bi-LSTM.
    
    Arguments:
    h -- hidden state output of the Bi-LSTM, tensor of shape (m, Tp, 2*n_p)
    s_prev -- previous hidden state of the (post-attention) LSTM, tensor of shape (m, n_q)
    
    Returns:
    context -- context vector, input of the next (post-attention) LSTM cell
    """
    
    # Use repeator to repeat s_prev to shape (m, Tp, n_q) so that it can be concatenated with all hidden states "h"
    s_prev = repeator(s_prev)
    # Use concatenator to concatenate h and s_prev on the last axis
    concat = concatenator([h, s_prev])
    # Use densor1 to propagate concat through a small fully-connected layer to compute the "intermediate energies" e
    e = densor1(concat)
    # Use densor2 to propagate e through a second small fully-connected layer to compute the "energies"
    energies = densor2(e)
    # Use activator on "energies" to compute the attention weights "alphas"
    alphas = activator(energies)
    # Use dotor with "alphas" and "h" to compute the context vector given to the next (post-attention) LSTM cell
    context = dotor([alphas, h])
    
    return context
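The shapes in one attention step are easier to follow with concrete arrays. The following is a NumPy sketch of the same computation, not the Keras code itself: the sizes are toy values, and the weight matrices `W1`, `W2` are random stand-ins for the trained `densor1` and `densor2` layers.

```python
import numpy as np

rng = np.random.RandomState(0)
m, Tp, n_p, n_q = 2, 5, 4, 3           # toy sizes, far smaller than the notebook's

h = rng.randn(m, Tp, 2 * n_p)          # Bi-LSTM outputs, one vector per timestep
s_prev = rng.randn(m, n_q)             # previous hidden state of LSTM_q

# Hypothetical random weights in place of densor1 (30 tanh units) and densor2 (1 relu unit)
W1, b1 = rng.randn(2 * n_p + n_q, 30), np.zeros(30)
W2, b2 = rng.randn(30, 1), np.zeros(1)

s_rep = np.repeat(s_prev[:, None, :], Tp, axis=1)     # RepeatVector(Tp): (m, Tp, n_q)
concat = np.concatenate([h, s_rep], axis=-1)          # (m, Tp, 2*n_p + n_q)
e = np.tanh(concat @ W1 + b1)                         # (m, Tp, 30)
energies = np.maximum(e @ W2 + b2, 0.0)               # (m, Tp, 1)
ex = np.exp(energies - energies.max(axis=1, keepdims=True))
alphas = ex / ex.sum(axis=1, keepdims=True)           # softmax over the Tp axis
context = np.einsum('mti,mtj->mij', alphas, h)        # Dot(axes=1): (m, 1, 2*n_p)

print(context.shape)
```

The context vector is a weighted average of the Bi-LSTM outputs, and its leading length-1 axis is what lets it be passed to LSTM_q as a one-timestep sequence.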

In [66]:
# setting the sequence length for LSTMs 1 (Tp) and 2 (Tq)
Tp = maxlen
Tq = 6
n_p = 256 # hidden state size of LSTM 1
n_q = 128 # hidden state size of LSTM 2




In [56]:
# Define shared layers as global variables
repeator = RepeatVector(Tp)
concatenator = Concatenate(axis=-1)
densor1 = Dense(30, activation = "tanh") # was 10
densor2 = Dense(1, activation = "relu")
activator = Activation(softmax, name='attention_weights') # We are using the custom softmax (axis=1) defined earlier in this notebook
dotor = Dot(axes = 1)
post_activation_LSTM_cell = CuDNNLSTM(n_q, return_state = True, name='LSTM_q')

We will use these layers $T_q$ times in a `for` loop to generate an output. The algorithm consists of the following steps:

1. Propagate the input through a [Bidirectional](https://keras.io/layers/wrappers/#bidirectional) [LSTM](https://keras.io/layers/recurrent/#lstm) to get the hidden states $h^{<1>}, \dots, h^{<T_p>}$ (called `a` in the model code).
2. Iterate for $t = 0, \dots, T_q-1$: 
    1. Call `one_step_attention()` on $[h^{<1>}, h^{<2>}, \dots, h^{<T_p>}]$ and $s^{<t-1>}$ to get the context vector $context^{<t>}$.
    2. Give $context^{<t>}$ to the post-attention LSTM cell, also passing in the previous hidden state $s^{<t-1>}$ and cell state $c^{<t-1>}$ of this LSTM using `initial_state=[previous hidden state, previous cell state]`. Get back the new hidden state $s^{<t>}$ and the new cell state $c^{<t>}$.
    3. Append the cell state $c^{<t>}$ and hidden state $s^{<t>}$ to a list of states.
3. Concatenate the list of states, pass this into a fully connected layer, and then into a sigmoid binary classifier.
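The steps above fix the width of the vector that reaches the fully connected layer. The sketch below only tracks per-example shapes through the loop; it runs no model, and the function name `sandwich_shapes` is invented for this illustration.

```python
def sandwich_shapes(Tp, Tq, n_p, n_q):
    """Track per-example tensor shapes through the attention sandwich."""
    a_shape = (Tp, 2 * n_p)            # step 1: Bi-LSTM output, forward + backward
    states = []
    for t in range(Tq):                # step 2: Tq attention + LSTM_q steps
        context_shape = (1, 2 * n_p)   # 2.1: one context vector per step
        s_shape, c_shape = (n_q,), (n_q,)   # 2.2: new hidden and cell state
        states.append(c_shape)         # 2.3: both states are kept
        states.append(s_shape)
    concat_width = sum(shape[0] for shape in states)   # step 3: concatenation
    return a_shape, len(states), concat_width

a_shape, n_states, width = sandwich_shapes(Tp=70, Tq=6, n_p=256, n_q=128)
print(a_shape, n_states, width)  # (70, 512) 12 1536
```

With the settings used in this notebook, the classifier head therefore sees 2 * Tq * n_q = 1536 features per question.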


In [64]:
def build_attention_sandwich_model(Tp, Tq, n_p, n_q, embed_size, max_features, embed_matrix):
    """
    Arguments:
    Tp -- length of the first LSTM's sequence (ie. max sequence length)
    Tq -- length of the second LSTM's sequence
    n_p -- hidden state size of the Bi-LSTM
    n_q -- hidden state size of the post-attention LSTM
    embed_size -- number of embedding dimensions
    max_features -- number of embedded features (i.e. tokenized word count)
    embed_matrix -- pretrained embedding matrix

    Returns:
    model -- Keras model instance
    """
    
    # Define the inputs of the model with shape (Tp,)
    # Define s0 and c0, the initial hidden and cell states for LSTM_q, each of shape (n_q,)
    print ("Tp:", Tp)
    print ("embed size", embed_size)
    print ("max features", max_features)
    X = Input(shape=(Tp,), name='input') 
    s0 = Input(shape=(n_q,), name='s0')
    c0 = Input(shape=(n_q,), name='c0')
    s = s0
    c = c0
    
    E = Embedding(max_features, embed_size, weights=[embed_matrix], trainable=False)(X)
    
    # Initialize empty list of hidden cell state vectors
    states = []
    
    # Define the pre-attention Bi-LSTM. 
    a = Bidirectional(CuDNNLSTM(n_p, return_sequences=True, input_shape=(Tp, embed_size), name='LSTM_p1'))(E)
    # 2: Iterate for Tq steps
    for t in range(Tq):
    
        # 2.A: Perform one step of the attention mechanism to get back the context vector at step t
        context = one_step_attention(a, s)
        
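        # 2.B: Apply the post-attention LSTM_q cell to the "context" vector.
        # With return_state=True the layer returns [output, hidden state, cell state];
        # the output is the last hidden state, so the middle value is discarded.
        s, _, c = post_activation_LSTM_cell(inputs=context, initial_state=[s, c])
        
        # 2.C: Append the cell state and hidden state to the "states" list
        states.append(c)
        states.append(s)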
    
    # 3: Flatten the output of LSTM_q
    f = Concatenate()(states)
    x = Dense(64, activation="relu")(f)
    x = Dense(1, activation="sigmoid")(x)
    
    # Create model instance taking three inputs and returning the sigmoid output
    model = Model(inputs=[X, s0, c0], outputs=x)
    
    return model

# Building the model

In [58]:
# a look at the available embeddings
!ls ../input/embeddings/

glove.840B.300d			paragram_300_sl999
GoogleNews-vectors-negative300	wiki-news-300d-1M


**Glove Embeddings:**

For now, let's use the GloVe embeddings for the attention sandwich model

In [59]:
EMBEDDING_FILE = '../input/embeddings/glove.840B.300d/glove.840B.300d.txt'
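def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float32')

# Each line of the GloVe file is a word followed by its vector; build a word -> vector lookup
with open(EMBEDDING_FILE) as f:
    embeddings_index = dict(get_coefs(*o.split(" ")) for o in f)

# np.stack expects a sequence, so materialise the dict view as a list
all_embs = np.stack(list(embeddings_index.values()))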
emb_mean,emb_std = all_embs.mean(), all_embs.std()
embed_size = all_embs.shape[1]

word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= max_features: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector
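The same construction can be followed on a toy vocabulary. In the sketch below the three-dimensional "GloVe" lines, the word ids, and the `toy_*` names are all made up; only the logic mirrors the cell above.

```python
import numpy as np

# Two made-up embedding lines in the "<word> <floats>" text format
glove_lines = ["cat 0.1 0.2 0.3", "dog 0.4 0.5 0.6"]

def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float32')

toy_index = dict(get_coefs(*line.split(" ")) for line in glove_lines)

toy_word_index = {"cat": 1, "unicorn": 2, "dog": 3}   # tokenizer ids start at 1
toy_max_features = 3

rng = np.random.RandomState(0)
toy_matrix = rng.normal(0.0, 1.0, (toy_max_features, 3))   # random init for every row
random_init = toy_matrix.copy()

for word, i in toy_word_index.items():
    if i >= toy_max_features:
        continue                      # "dog" (id 3) is outside the vocabulary cap
    vec = toy_index.get(word)
    if vec is not None:
        toy_matrix[i] = vec           # "cat" gets its pretrained vector

print(toy_matrix)
```

Row 0 (the padding id) and the row for "unicorn", which has no pretrained vector, keep their random initial values; only the row for "cat" is overwritten.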

In [60]:
model = build_attention_sandwich_model(Tp, Tq, n_p, n_q, embed_size, max_features, embedding_matrix)
opt = RMSprop(lr=1e-3)
model.compile(opt, loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

Tp: 70
embed size 300
max features 95000
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input (InputLayer)              (None, 70)           0                                            
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, 70, 300)      28500000    input[0][0]                      
__________________________________________________________________________________________________
s0 (InputLayer)                 (None, 96)           0                                            
__________________________________________________________________________________________________
bidirectional_1 (Bidirectional) (None, 70, 512)      1142784     embedding_2[0][0]                
____________________________________________________________________
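The parameter counts shown in the summary can be reproduced by hand. This sketch assumes that each CuDNNLSTM holds four gates with separate input and recurrent bias vectors (8 * units bias terms in total), which is the assumption under which the arithmetic matches the numbers printed above. The helper name is made up.

```python
def cudnn_lstm_params(input_dim, units):
    """Parameter count for one CuDNNLSTM direction (assumed layout, see text)."""
    kernel = input_dim * 4 * units      # input weights for the 4 gates
    recurrent = units * 4 * units       # recurrent weights for the 4 gates
    bias = 8 * units                    # separate input and recurrent biases
    return kernel + recurrent + bias

embedding_params = 95000 * 300                    # max_features * embed_size
bilstm_params = 2 * cudnn_lstm_params(300, 256)   # forward + backward LSTM_p

print(embedding_params)  # 28500000
print(bilstm_params)     # 1142784
```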

The following code evaluates the model on the validation set, reporting F1 scores at a range of thresholds.

In [61]:
from sklearn import metrics

def calc_f1_scores(model, dev_x, dev_y):
    
    s0_val = np.zeros((dev_x.shape[0], n_q))
    c0_val = np.zeros((dev_x.shape[0], n_q))

    pred_glove_dev_Y = model.predict([dev_x, s0_val, c0_val], batch_size=1024, verbose=1)

    best_thresh = -1 # init value
    best_f1 = 0

    for thresh in np.arange(0.1, 0.501, 0.01):

        thresh = np.round(thresh, 2)
    
        f1 = metrics.f1_score(dev_y, (pred_glove_dev_Y>thresh).astype(int))
        print("F1 score at threshold {0} is {1}".format(thresh, f1))
        if f1 > best_f1:
            best_f1 = f1
            best_thresh = thresh

        
    print("Best F1 score was at threshold {0}, {1}".format(best_thresh, best_f1))
    return (best_thresh, best_f1, pred_glove_dev_Y)
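For intuition about the threshold sweep, here is a dependency-free sketch on six made-up predictions. It computes F1 by hand rather than through `sklearn.metrics`, and steps through integer hundredths to avoid floating-point drift in the thresholds (the function above handles that with `np.round`).

```python
def f1_score(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels and predicted probabilities
y_true = [0, 0, 1, 1, 0, 1]
probs = [0.05, 0.30, 0.35, 0.80, 0.15, 0.20]

best_thresh, best_f1 = -1, 0.0
for step in range(10, 51):              # thresholds 0.10, 0.11, ..., 0.50
    thresh = step / 100
    preds = [int(p > thresh) for p in probs]
    f1 = f1_score(y_true, preds)
    if f1 > best_f1:                    # strict '>' keeps the lowest best threshold
        best_thresh, best_f1 = thresh, f1

print(best_thresh, round(best_f1, 4))   # 0.15 0.8571
```

Because the positive class is rare in this dataset, the best threshold tends to sit well below 0.5, which is why the sweep starts at 0.1.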

In [62]:
# initialize LSTM_q's initial hidden state (s0) and cell state (c0) to zeros
s0 = np.zeros((train_X.shape[0], n_q))
c0 = np.zeros((train_X.shape[0], n_q))

s0_val = np.zeros((val_X.shape[0], n_q))
c0_val = np.zeros((val_X.shape[0], n_q))

In [63]:
model.fit([train_X, s0, c0], train_y, batch_size=512, epochs=3, 
#          class_weight = {0: 1., 1: 2.},
          validation_data=([val_X, s0_val, c0_val], val_y))

Train on 1201632 samples, validate on 104490 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7ffa73af8c18>

In [65]:
print ("Attention Sandwich model")
(best_thresh, best_f1, pred_glove_val_y) = calc_f1_scores (model, val_X, val_y)

s0_test = np.zeros((test_X.shape[0], n_q))
c0_test = np.zeros((test_X.shape[0], n_q))
pred_test_y = model.predict([test_X, s0_test, c0_test], batch_size=1024, verbose=1)

Attention Sandwich model
F1 score at threshold 0.1 is 0.6311949261940843
F1 score at threshold 0.11 is 0.6381563861499656
F1 score at threshold 0.12 is 0.6428571428571429
F1 score at threshold 0.13 is 0.6482096276974152
F1 score at threshold 0.14 is 0.6536168162380294
F1 score at threshold 0.15 is 0.6580589891078203
F1 score at threshold 0.16 is 0.6630650769995031
F1 score at threshold 0.17 is 0.6662060301507537
F1 score at threshold 0.18 is 0.668826631090782
F1 score at threshold 0.19 is 0.6712566758895824
F1 score at threshold 0.2 is 0.6746532977407382
F1 score at threshold 0.21 is 0.6760507794514241
F1 score at threshold 0.22 is 0.67787598856915
F1 score at threshold 0.23 is 0.6795620928202029
F1 score at threshold 0.24 is 0.6801762114537445
F1 score at threshold 0.25 is 0.6814855345051638
F1 score at threshold 0.26 is 0.6813004762890867
F1 score at threshold 0.27 is 0.6840122767857143
F1 score at threshold 0.28 is 0.6842327744117233
F1 score at threshold 0.29 is 0.6832006822057988


## Results

Good results were obtained with the following hyperparameters (one column per run):
<table>
    <tr>
        <td> Tp: </td>
        <td> 70 (maxlen) </td>
        <td> 70 (maxlen) </td>
    </tr>
    <tr>
        <td> Tq: </td>
        <td> 5 </td>
        <td> 6 </td>
    </tr>
    <tr>
        <td> n_p: </td>
        <td> 256 </td>
        <td> 256 </td>
    </tr>
    <tr>
        <td> n_q: </td>
        <td> 128 </td>
        <td> 128 </td>
        <td> 160 </td>
    </tr>
    <tr>
        <td> Optimizer </td>
        <td> RMSprop </td>
        <td> RMSprop </td>
    </tr>
    <tr>
        <td> Learning Rate </td>
        <td> 1e-3 </td>
        <td> 1e-3 </td>
    </tr>
    <tr>
        <td> Batch Size </td>
        <td> 512 </td>
        <td> 512 </td>
    </tr>
    <tr>
        <td> Epochs </td>
        <td> 3 </td>
        <td> 3 </td>
    </tr>
    <tr>
        <td> F1 Score </td>
        <td> 0.6859 </td>
        <td> 0.6862 </td>
    </tr>
</table>


In [None]:
pred_test_y = (pred_test_y > best_thresh).astype(int)
out_df = pd.DataFrame({"qid":test_df["qid"].values})
out_df['prediction'] = pred_test_y
out_df.to_csv("submission.csv", index=False)