<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Imports" data-toc-modified-id="Imports-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Generate-dataset" data-toc-modified-id="Generate-dataset-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Generate dataset</a></span></li><li><span><a href="#Vanilla-RNNS" data-toc-modified-id="Vanilla-RNNS-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Vanilla RNNS</a></span></li><li><span><a href="#Attention-layer" data-toc-modified-id="Attention-layer-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Attention layer</a></span></li><li><span><a href="#RNN-with-attention" data-toc-modified-id="RNN-with-attention-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>RNN with attention</a></span></li><li><span><a href="#Conclusions" data-toc-modified-id="Conclusions-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#References" data-toc-modified-id="References-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>References</a></span></li></ul></div>

# Introduction
<hr style="border:2px solid black"> </hr>

<div class="alert alert-warning">
<font color=black>

**What?** Adding an attention layer to an RNN

</font>
</div>

# Imports
<hr style="border:2px solid black"> </hr>

In [1]:
from pandas import read_csv
import numpy as np
from keras import Model
from keras.layers import Layer
import keras.backend as K
from keras.layers import Input, Dense, SimpleRNN
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.metrics import mean_squared_error

# Generate dataset
<hr style = "border:2px solid black" ></hr>

<div class="alert alert-info">
<font color=black>

- We’ll use a very simple example: a Fibonacci sequence, where each number is the sum of the previous two.
- We’ll construct each training example from `t` consecutive time steps and use the value at step `t+1` as the target.
    
</font>
</div>
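
The windowing described above can be sketched with plain Python lists. This is a minimal illustration, not the notebook's own `get_fib_XY` (which additionally shuffles, splits and reshapes the data); the helper names `fib_sequence` and `make_windows` are hypothetical.

```python
def fib_sequence(n):
    """Return the first n terms 1, 2, 3, 5, ... (same starting point as the notebook)."""
    seq, prev, curr = [], 0, 1
    for _ in range(n):
        prev, curr = curr, prev + curr
        seq.append(curr)
    return seq


def make_windows(seq, time_steps):
    """Pair each run of `time_steps` values with the value that follows it."""
    n_rows = len(seq) - time_steps
    X = [seq[i:i + time_steps] for i in range(n_rows)]
    Y = [seq[i + time_steps] for i in range(n_rows)]
    return X, Y


seq = fib_sequence(8)
X, Y = make_windows(seq, time_steps=3)
for window, target in zip(X, Y):
    print(window, "->", target)
```

Each row of `X` holds three consecutive Fibonacci numbers and the matching entry of `Y` is the next number in the sequence.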

In [2]:
def get_fib_seq(n, scale_data=True):
    # Get the Fibonacci sequence
    seq = np.zeros(n)
    fib_n1 = 0.0
    fib_n = 1.0
    for i in range(n):
        seq[i] = fib_n1 + fib_n
        fib_n1 = fib_n
        fib_n = seq[i]
    scaler = None  # stays None when scale_data is False
    if scale_data:
        scaler = MinMaxScaler(feature_range=(0, 1))
        seq = np.reshape(seq, (n, 1))
        seq = scaler.fit_transform(seq).flatten()
    return seq, scaler


fib_seq = get_fib_seq(10, False)[0]
print(fib_seq)

[ 1.  2.  3.  5.  8. 13. 21. 34. 55. 89.]


In [3]:
def get_fib_XY(total_fib_numbers, time_steps, train_percent, scale_data=True):
    dat, scaler = get_fib_seq(total_fib_numbers, scale_data)
    Y_ind = np.arange(time_steps, len(dat), 1)
    Y = dat[Y_ind]
    rows_x = len(Y)
    X = dat[0:rows_x]
    for i in range(time_steps-1):
        temp = dat[i+1:rows_x+i+1]
        X = np.column_stack((X, temp))
    # random permutation with fixed seed
    rand = np.random.RandomState(seed=13)
    idx = rand.permutation(rows_x)
    split = int(train_percent*rows_x)
    train_ind = idx[0:split]
    test_ind = idx[split:]
    trainX = X[train_ind]
    trainY = Y[train_ind]
    testX = X[test_ind]
    testY = Y[test_ind]
    trainX = np.reshape(trainX, (len(trainX), time_steps, 1))
    testX = np.reshape(testX, (len(testX), time_steps, 1))
    return trainX, trainY, testX, testY, scaler


trainX, trainY, testX, testY, scaler = get_fib_XY(12, 3, 0.7, False)
print('trainX = ', trainX)
print('trainY = ', trainY)

trainX =  [[[ 8.]
  [13.]
  [21.]]

 [[ 5.]
  [ 8.]
  [13.]]

 [[ 2.]
  [ 3.]
  [ 5.]]

 [[13.]
  [21.]
  [34.]]

 [[21.]
  [34.]
  [55.]]

 [[34.]
  [55.]
  [89.]]]
trainY =  [ 34.  21.   8.  55.  89. 144.]


# Vanilla RNNS
<hr style = "border:2px solid black" ></hr>

In [4]:
# Set up parameters
time_steps = 20
hidden_units = 2
epochs = 30

# Create a traditional RNN network


def create_RNN(hidden_units, dense_units, input_shape, activation):
    model = Sequential()
    model.add(SimpleRNN(hidden_units, input_shape=input_shape,
              activation=activation[0]))
    model.add(Dense(units=dense_units, activation=activation[1]))
    model.compile(loss='mse', optimizer='adam')
    return model


model_RNN = create_RNN(hidden_units=hidden_units, dense_units=1, input_shape=(time_steps, 1),
                       activation=['tanh', 'tanh'])
model_RNN.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 simple_rnn (SimpleRNN)      (None, 2)                 8         
                                                                 
 dense (Dense)               (None, 1)                 3         
                                                                 
Total params: 11
Trainable params: 11
Non-trainable params: 0
_________________________________________________________________




In [5]:
# Generate the dataset
trainX, trainY, testX, testY, scaler = get_fib_XY(1200, time_steps, 0.7)

model_RNN.fit(trainX, trainY, epochs=epochs, batch_size=1, verbose=2)

Epoch 1/30
826/826 - 2s - loss: 0.0033 - 2s/epoch - 3ms/step
Epoch 2/30
826/826 - 2s - loss: 0.0032 - 2s/epoch - 2ms/step
Epoch 3/30
826/826 - 2s - loss: 0.0031 - 2s/epoch - 2ms/step
Epoch 4/30
826/826 - 2s - loss: 0.0029 - 2s/epoch - 2ms/step
Epoch 5/30
826/826 - 2s - loss: 0.0027 - 2s/epoch - 2ms/step
Epoch 6/30
826/826 - 2s - loss: 0.0025 - 2s/epoch - 2ms/step
Epoch 7/30
826/826 - 2s - loss: 0.0023 - 2s/epoch - 2ms/step
Epoch 8/30
826/826 - 2s - loss: 0.0021 - 2s/epoch - 2ms/step
Epoch 9/30
826/826 - 2s - loss: 0.0019 - 2s/epoch - 2ms/step
Epoch 10/30
826/826 - 2s - loss: 0.0016 - 2s/epoch - 2ms/step
Epoch 11/30
826/826 - 2s - loss: 0.0014 - 2s/epoch - 2ms/step
Epoch 12/30
826/826 - 2s - loss: 0.0012 - 2s/epoch - 2ms/step
Epoch 13/30
826/826 - 2s - loss: 0.0010 - 2s/epoch - 2ms/step
Epoch 14/30
826/826 - 2s - loss: 8.7533e-04 - 2s/epoch - 2ms/step
Epoch 15/30
826/826 - 2s - loss: 7.2097e-04 - 2s/epoch - 2ms/step
Epoch 16/30
826/826 - 2s - loss: 5.6902e-04 - 2s/epoch - 2ms/step
Epoch

<keras.callbacks.History at 0x7faf162350a0>

In [6]:
# Evaluate model
train_mse = model_RNN.evaluate(trainX, trainY)
test_mse = model_RNN.evaluate(testX, testY)

# Print error
print("Train set MSE = ", train_mse)
print("Test set MSE = ", test_mse)

Train set MSE =  6.494698027381673e-05
Test set MSE =  2.0355368178570643e-05


# Attention layer
<hr style = "border:2px solid black" ></hr>

<div class="alert alert-info">
<font color=black>


- In Keras, we can create a custom layer that implements attention by subclassing the `Layer` class.
- We need to write the `__init__` method and override the following methods:
    - `build()`: The Keras guide recommends adding weights in this method, once the shape of the inputs is known, so the weights are created lazily. The layer's `add_weight()` method can be used to add the weights and biases of the attention layer.
    - `call()`: This method implements the mapping from inputs to outputs, i.e. the forward pass. We’ll implement a Bahdanau-style attention in our `call()` method.

</font>
</div>    
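
As a sketch, the computation performed by the `call()` method in the next cell can be written out for a sequence of hidden states $h_1, \dots, h_T$, each of dimension $d$. The symbols $e_t$, $\alpha_t$ and $c$ are notation chosen here to mirror the variable names `e`, `alpha` and `context` in the code.

$$
e_t = \tanh\left(h_t^\top W + b_t\right), \qquad
\alpha_t = \frac{\exp(e_t)}{\sum_{k=1}^{T} \exp(e_k)}, \qquad
c = \sum_{t=1}^{T} \alpha_t \, h_t
$$

Here $W \in \mathbb{R}^{d \times 1}$ and $b \in \mathbb{R}^{T \times 1}$ are the two weights created in `build()`. Note that this is a simplified variant: the score depends only on the hidden state $h_t$, whereas the original Bahdanau formulation also conditions on a decoder state.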

In [7]:
# Add attention layer to the deep learning network
class attention(Layer):
    def __init__(self, **kwargs):
        super(attention, self).__init__(**kwargs)

    def build(self, input_shape):
        self.W = self.add_weight(name='attention_weight', shape=(input_shape[-1], 1),
                                 initializer='random_normal', trainable=True)
        self.b = self.add_weight(name='attention_bias', shape=(input_shape[1], 1),
                                 initializer='zeros', trainable=True)
        super(attention, self).build(input_shape)

    def call(self, x):
        
        # Alignment scores. Pass them through tanh function
        e = K.tanh(K.dot(x, self.W)+self.b)
        
        # Remove dimension of size 1
        e = K.squeeze(e, axis=-1)
        
        # Compute the weights
        alpha = K.softmax(e)
        
        # Restore the trailing axis so that alpha broadcasts against x
        alpha = K.expand_dims(alpha, axis=-1)
        
        # Compute the context vector
        context = x * alpha
        context = K.sum(context, axis=1)
        
        return context
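
To make the forward pass in `call()` above easier to follow, here is a minimal sketch of the same three steps (scores, softmax, weighted sum) for a single sequence, written with plain Python lists so that it runs without Keras. The function name `attention_forward` and the toy values are assumptions made for illustration only.

```python
import math


def attention_forward(h, W, b):
    """h: T x d hidden states, W: d weights, b: T biases -> (alpha, context)."""
    # Alignment score per time step: e_t = tanh(h_t . W + b_t)
    e = [math.tanh(sum(h_i * w_i for h_i, w_i in zip(h_t, W)) + b_t)
         for h_t, b_t in zip(h, b)]
    # Softmax over the time steps (shifted by the max for numerical stability)
    m = max(e)
    exp_e = [math.exp(v - m) for v in e]
    total = sum(exp_e)
    alpha = [v / total for v in exp_e]
    # Context vector: attention-weighted sum of the hidden states
    d = len(h[0])
    context = [sum(a * h_t[j] for a, h_t in zip(alpha, h)) for j in range(d)]
    return alpha, context


# Toy input: T = 3 time steps, d = 2 hidden units (arbitrary values)
h = [[0.1, 0.2], [0.3, -0.1], [0.5, 0.4]]
W = [0.0, 0.0]        # zero weights and biases give uniform attention
b = [0.0, 0.0, 0.0]
alpha, context = attention_forward(h, W, b)
print(alpha)
print(context)
```

With zero weights and biases every time step receives the same attention weight, so the context vector is simply the mean of the hidden states; training moves `W` and `b` away from this uniform starting point.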

# RNN with attention
<hr style = "border:2px solid black" ></hr>

In [8]:
def create_RNN_with_attention(hidden_units, dense_units, input_shape, activation):
    x = Input(shape=input_shape)
    # return_sequences=True -> return the hidden state at every time step, not only the last one
    RNN_layer = SimpleRNN(
        hidden_units, return_sequences=True, activation=activation)(x)
    attention_layer = attention()(RNN_layer)
    outputs = Dense(dense_units, trainable=True,
                    activation=activation)(attention_layer)
    model = Model(x, outputs)
    model.compile(loss='mse', optimizer='adam')
    return model


model_attention = create_RNN_with_attention(hidden_units=hidden_units, dense_units=1,
                                            input_shape=(time_steps, 1), activation='tanh')
model_attention.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 20, 1)]           0         
                                                                 
 simple_rnn_1 (SimpleRNN)    (None, 20, 2)             8         
                                                                 
 attention (attention)       (None, 2)                 22        
                                                                 
 dense_1 (Dense)             (None, 1)                 3         
                                                                 
Total params: 33
Trainable params: 33
Non-trainable params: 0
_________________________________________________________________
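
The parameter counts in this summary can be checked by hand. The sketch below assumes the usual weight layout of a `SimpleRNN` (input kernel, recurrent kernel, bias) and of a `Dense` layer (kernel, bias), together with the two `add_weight` shapes of the custom attention layer; the helper names are hypothetical.

```python
def simple_rnn_params(units, input_dim):
    # input kernel (input_dim x units) + recurrent kernel (units x units) + bias (units)
    return units * input_dim + units * units + units


def dense_params(units, input_dim):
    # kernel (input_dim x units) + bias (units)
    return input_dim * units + units


def attention_params(time_steps, features):
    # W has shape (features, 1) and b has shape (time_steps, 1)
    return features + time_steps


time_steps, hidden_units = 20, 2
rnn = simple_rnn_params(hidden_units, input_dim=1)
attn = attention_params(time_steps, hidden_units)
dense = dense_params(1, input_dim=hidden_units)
print(rnn, attn, dense, rnn + attn + dense)  # 8 22 3 33
```

The attention layer's 22 parameters are dominated by the per-time-step bias, which is why its size grows with `time_steps` rather than with the number of hidden units alone.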


In [9]:
model_attention.fit(trainX, trainY, epochs=epochs, batch_size=1, verbose=2)

Epoch 1/30
826/826 - 3s - loss: 0.0013 - 3s/epoch - 3ms/step
Epoch 2/30
826/826 - 2s - loss: 0.0013 - 2s/epoch - 2ms/step
Epoch 3/30
826/826 - 2s - loss: 0.0013 - 2s/epoch - 2ms/step
Epoch 4/30
826/826 - 2s - loss: 0.0012 - 2s/epoch - 2ms/step
Epoch 5/30
826/826 - 2s - loss: 0.0012 - 2s/epoch - 2ms/step
Epoch 6/30
826/826 - 2s - loss: 0.0012 - 2s/epoch - 2ms/step
Epoch 7/30
826/826 - 2s - loss: 0.0011 - 2s/epoch - 2ms/step
Epoch 8/30
826/826 - 2s - loss: 0.0011 - 2s/epoch - 2ms/step
Epoch 9/30
826/826 - 2s - loss: 0.0011 - 2s/epoch - 2ms/step
Epoch 10/30
826/826 - 2s - loss: 9.9490e-04 - 2s/epoch - 2ms/step
Epoch 11/30
826/826 - 2s - loss: 9.5451e-04 - 2s/epoch - 2ms/step
Epoch 12/30
826/826 - 2s - loss: 8.9651e-04 - 2s/epoch - 2ms/step
Epoch 13/30
826/826 - 2s - loss: 8.3155e-04 - 2s/epoch - 2ms/step
Epoch 14/30
826/826 - 2s - loss: 7.7556e-04 - 2s/epoch - 2ms/step
Epoch 15/30
826/826 - 2s - loss: 6.9762e-04 - 2s/epoch - 2ms/step
Epoch 16/30
826/826 - 2s - loss: 6.2822e-04 - 2s/epoch 

<keras.callbacks.History at 0x7faf16c0f8b0>

In [10]:
# Evaluate model
train_mse_attn = model_attention.evaluate(trainX, trainY)
test_mse_attn = model_attention.evaluate(testX, testY)

# Print error
print("Train set MSE with attention = ", train_mse_attn)
print("Test set MSE with attention = ", test_mse_attn)

Train set MSE with attention =  5.882055120309815e-05
Test set MSE with attention =  8.229778359236661e-06


# Conclusions
<hr style = "border:2px solid black" ></hr>

<div class="alert alert-info">
<font color=black>

- Even for this simple example, the mean squared error on the test set is lower with the attention layer (about 8.2e-06, versus 2.0e-05 for the vanilla RNN). Hyper-parameter tuning and model selection may improve the results further.
    
- The same attention layer can also be tried with an LSTM or an encoder-decoder network.

</font>
</div>
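
To put a number on the comparison, the relative reduction in MSE can be computed from the values printed by the two evaluation cells. This is a small arithmetic sketch; the helper `relative_reduction` is hypothetical, and the figures come from a single training run, so they are indicative only.

```python
# MSE values copied from the evaluation outputs of the two models
train_mse_rnn = 6.494698027381673e-05
test_mse_rnn = 2.0355368178570643e-05
train_mse_attn = 5.882055120309815e-05
test_mse_attn = 8.229778359236661e-06


def relative_reduction(baseline, new):
    """Fraction by which `new` is smaller than `baseline`."""
    return (baseline - new) / baseline


train_gain = relative_reduction(train_mse_rnn, train_mse_attn)
test_gain = relative_reduction(test_mse_rnn, test_mse_attn)
print(f"Train MSE reduction: {train_gain:.1%}")
print(f"Test MSE reduction:  {test_gain:.1%}")
```

In this run the reduction is roughly 9% on the training set and roughly 60% on the test set.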

# References
<hr style="border:2px solid black"> </hr>

<div class="alert alert-warning">
<font color=black>

- https://machinelearningmastery.com/adding-a-custom-attention-layer-to-recurrent-neural-network-in-keras/
- https://machinelearningmastery.com/what-is-attention/
- [Attention layer to LSTM](https://machinelearningmastery.com/attention-long-short-term-memory-recurrent-neural-networks/)
- [Attention layer to Encoder-Decoder](https://machinelearningmastery.com/encoder-decoder-attention-sequence-to-sequence-prediction-keras/)
    
</font>
</div>