# Code Attention

The goal of this demo is to teach you how to code an encoder-decoder model with an attention mechanism!
Since this is just a demo, we will use generated data, the same generated data we used to demonstrate the encoder-decoder. You'll be able to tackle a real problem during the exercise; the goal here is to focus on building the model and the training loop.

## Import libraries

In [29]:
# Import TensorFlow and the other libraries used in this demo
import tensorflow as tf 
import pathlib 
import pandas as pd 
import os
import io
import warnings
warnings.filterwarnings('ignore')
import json
from random import randint
from numpy import array
from numpy import argmax
from numpy import array_equal
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense

## Generate data

We will generate random input and target data for the purpose of the demonstration.

In [30]:
input_dim = 100
input_seq_len = 10
target_seq_len = 5

In [31]:
# generate a sequence of random integers from 2 to n_unique - 1 (inclusive)
def generate_sequence(length, n_unique):
	return [randint(2, n_unique-1) for _ in range(length)]

In [32]:
generate_sequence(input_seq_len,input_dim)

[65, 13, 20, 68, 85, 85, 4, 20, 26, 26]

In [33]:
# prepare data
def get_dataset(n_in, n_out, cardinality, n_samples, printing=False):
  X1, y = list(), list()
  for _ in range(n_samples):
    # generate source sequence
    source = generate_sequence(n_in, cardinality)
    source_pad = source
    if printing:
      print("source:", source_pad)
    # define the target sequence
    # we add the <start> token at the beginning of each sequence
    # here we'll simply consider that the start token will be coded
    # by the index 0
    target = source[:n_out]
    target.reverse()
    target = [0] + target
    if printing:
      print("target:", target)
    # store
    X1.append(source_pad)
    y.append(target)
  return array(X1), array(y)

In [34]:
# named sample_input to avoid shadowing Python's built-in input()
sample_input, sample_target = get_dataset(input_seq_len,target_seq_len,input_dim,1,True)

source: [46, 49, 66, 45, 49, 28, 84, 84, 82, 63]
target: [0, 49, 45, 66, 49, 46]


The data we are generating consists of a random sequence of numbers (they could very well represent encoded letters, words, sentences or anything you could think of).

The target is built using the first elements of the input in reversed order. We add a special token at the beginning of every target sequence for teacher forcing.
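
To make this construction concrete, here is a minimal sketch that rebuilds the target printed above from its source. The helper name make_target is an assumption introduced for illustration only; it is not part of the demo's code.

```python
def make_target(source, n_out, start_token=0):
    """Reverse the first n_out elements of source and prepend the start token."""
    # source[:n_out] is a copy, so the original source list is left untouched
    return [start_token] + source[:n_out][::-1]

source = [46, 49, 66, 45, 49, 28, 84, 84, 82, 63]
target = make_target(source, n_out=5)
print(target)  # [0, 49, 45, 66, 49, 46]
```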

Now that we understand this, let's create the training data and validation data.

In [35]:
X_train, y_train = get_dataset(input_seq_len,target_seq_len,input_dim,10000)
X_val, y_val = get_dataset(input_seq_len,target_seq_len,input_dim,5000)

Let's turn the training set into a shuffled, batched dataset.

In [36]:
BATCH_SIZE = 128
train_batch = tf.data.Dataset.from_tensor_slices((X_train,y_train)).shuffle(len(X_train)).batch(BATCH_SIZE)
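
Roughly speaking, shuffle followed by batch draws the samples in a random order and groups them in chunks of BATCH_SIZE, the last chunk possibly being smaller. The NumPy sketch below mimics that behavior under simplifying assumptions: the function shuffled_batches and the toy arrays are hypothetical, and the actual tf.data implementation differs.

```python
import numpy as np

def shuffled_batches(X, y, batch_size, seed=0):
    """Yield (X, y) batches in a random order; the last batch may be smaller."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

# toy data: 1000 sequences of 10 inputs and 6 targets (values are irrelevant here)
X = np.arange(1000 * 10).reshape(1000, 10)
y = np.arange(1000 * 6).reshape(1000, 6)

batches = list(shuffled_batches(X, y, batch_size=128))
print(len(batches), batches[0][0].shape, batches[-1][0].shape)
# 8 (128, 10) (104, 10)
```

With the 10000 training samples and BATCH_SIZE = 128 used in this demo, the same arithmetic gives 79 batches per epoch.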

## Create the encoder decoder with attention

In what follows we will code a model that reproduces the following architecture for an encoder-decoder model with Bahdanau-style attention.

![bahdanau](https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Deep+Learning/attention/Attention-encoder-decoder.drawio.png)

### Create encoder model

In this step we will define the encoder model.

The goal of the encoder is to create a representation of the input data: it extracts information from the input data, which will then be interpreted by the decoder model.

The encoder receives sequence inputs and outputs sequences with a given depth of representation (the dimension we have usually called channels so far).
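
As an illustration of how a sequence of integers gains a channel dimension, here is a minimal NumPy sketch of an embedding lookup. The random table stands in for trainable embedding weights and is an assumption made for the example; it is not how Keras initializes its layer.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 100, 32

# one row of embed_dim values per possible token (random stand-in for learned weights)
table = rng.normal(size=(vocab_size, embed_dim))

# a batch of 1 sequence of 10 tokens
tokens = np.array([[65, 13, 20, 68, 85, 85, 4, 20, 26, 26]])

# the lookup replaces each integer by its row in the table
embedded = table[tokens]
print(embedded.shape)  # (1, 10, 32)
```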

In [37]:
# let's start by defining the number of units needed for the embedding and
# the GRU layers

n_embed = 32
n_gru = 32

In [38]:
class encoder_maker(tf.keras.Model):
  def __init__(self, in_vocab_size, embed_dim, n_units):
    super().__init__()
    # instantiate an embedding layer
    self.n_units = n_units
    self.embed = tf.keras.layers.Embedding(input_dim=in_vocab_size,
                                      output_dim=embed_dim)
    # instantiate GRU layer
    self.gru = tf.keras.layers.GRU(units=n_units,
                              return_sequences=True,
                              return_state=True)
  def __call__(self, input_batch):
    # each output will be saved as a class attribute so we can easily access
    # them to control the shapes throughout the demo
    self.embed_out = self.embed(input_batch)
    self.gru_out, self.gru_state = self.gru(self.embed_out)#, initial_state=initial_state)

    return self.gru_out, self.gru_state


That's it, it does not need to be any more complicated than this. Note that we preserved the sequential nature of the data (return_sequences = True), and that we also output the final hidden state of the GRU (a GRU has no separate cell state), which will serve as the initial state for the decoder!

Let's try it out on an input to see what comes out!

In [39]:
encoder = encoder_maker(input_dim, n_embed, n_gru)

In [40]:
encoder_output, encoder_state = encoder(tf.expand_dims(X_train[0],0))

In [41]:
encoder_output

<tf.Tensor: shape=(1, 10, 32), dtype=float32, numpy=
array([[[-1.63941365e-02,  9.70075652e-03, -1.08925989e-02,
          3.06798983e-03, -2.46530259e-03, -3.24438722e-03,
          1.00749983e-02,  6.95796916e-04, -3.49409529e-03,
         -5.35424566e-03, -4.43874579e-03,  1.00749237e-02,
          7.29551772e-03, -6.67085266e-03, -1.57688204e-02,
         -7.24462187e-03, -5.81210013e-03, -1.13154585e-02,
          3.73269990e-03, -4.04958799e-03, -7.74870813e-03,
         -6.87822513e-03,  5.98435709e-03, -2.97807511e-02,
          1.00458157e-03,  4.80343215e-03,  7.21000042e-03,
          2.84949690e-03,  3.60998698e-03, -1.88084392e-04,
         -7.55666243e-03,  1.11027174e-02],
        [-1.08593851e-02,  2.59605353e-03, -1.62866805e-02,
         -4.22731275e-03, -4.47256025e-05,  1.09004055e-03,
         -1.86565565e-03, -2.59361975e-03,  3.89671233e-03,
         -1.25934202e-02, -1.30948974e-02, -9.14700236e-03,
          2.23703310e-03,  1.16870622e-03,  1.03402119e-02,
   

In [42]:
encoder_state

<tf.Tensor: shape=(1, 32), dtype=float32, numpy=
array([[-0.00085207,  0.00528938,  0.02444724, -0.01053277,  0.00595644,
        -0.0025592 ,  0.00435245,  0.02582046, -0.00502042,  0.00940072,
        -0.01358306, -0.01176863,  0.00713067,  0.01072467,  0.00600605,
        -0.00553971, -0.00364128,  0.01307214, -0.00430903,  0.00132449,
        -0.00613171, -0.02432938, -0.00032585, -0.01560974,  0.01670919,
         0.01907848, -0.00613984, -0.0002509 , -0.00697608, -0.01444875,
        -0.00729234,  0.01350213]], dtype=float32)>

The first output has a shape of (1,10,32), which is expected because we applied the encoder to 1 input sequence of 10 elements (we chose return_sequences = True for the gru layer) and 32 channels since we have 32 units on the gru layer.

The second output is the gru state, which has shape (1,32) for one input sequence and 32 units on the gru layer.
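
To see where these two shapes come from, here is a simplified NumPy sketch of a recurrent layer that returns both the full sequence and the final state. It deliberately uses a plain tanh recurrence instead of the GRU gates, and the weights W and U are random placeholders, so it only illustrates the shapes and the fact that the returned state is the last element of the output sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, embed_dim, n_units = 1, 10, 32, 32

x = rng.normal(size=(batch, seq_len, embed_dim))       # embedded input sequence
W = rng.normal(scale=0.1, size=(embed_dim, n_units))   # input weights (placeholder)
U = rng.normal(scale=0.1, size=(n_units, n_units))     # recurrent weights (placeholder)

h = np.zeros((batch, n_units))  # initial state
outputs = []
for t in range(seq_len):
    # simple tanh recurrence, NOT the actual GRU equations
    h = np.tanh(x[:, t, :] @ W + h @ U)
    outputs.append(h)

outputs = np.stack(outputs, axis=1)  # what return_sequences=True gives back
state = h                            # what return_state=True gives back

print(outputs.shape, state.shape)             # (1, 10, 32) (1, 32)
print(np.allclose(outputs[:, -1, :], state))  # True
```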

### Create the Attention layer

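Let's now create the attention layer. It scores each element of the encoder output against the current state, turns these scores into weights with a softmax, and returns the weighted sum of the encoder output, which we call the context vector.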

In [43]:
class Bahdanau_attention_maker(tf.keras.layers.Layer):
  def __init__(self, attention_units):
    super().__init__()

    # The attention layer contains three dense layers
    self.W1 = tf.keras.layers.Dense(units=attention_units)
    self.W2 = tf.keras.layers.Dense(units=attention_units)
    self.V = tf.keras.layers.Dense(units=1)

  def __call__(self, enc_out, state):
    # the choice of name of the arguments here is not random, enc_out
    # will represent the encoder output which will be used to create
    # the attention weights and then used to create the context vector once we
    # apply the attention weights
    # the state will be a hidden state from a recurrent unit coming either
    # from the encoder at first, and from the decoder as we make further 
    # predictions
    self.W1_out = self.W1(enc_out) # shape (batch_size,10,attention_units)

    # If you have taken a close look at the model's schema you will have noticed
    # that we are going to sum the outputs from W1 and W2, though their shapes
    # are incompatible
    # the enc_out is (batch_size,10,32) -> W1 -> (batch_size,10,attention_units)
    # the state is (batch_size,32) -> W2 -> (batch_size,attention_units)
    # thus we need to artificially add a dimension to the state along axis 1
    self.state = tf.expand_dims(state, axis = 1)
    self.W2_out = self.W2(self.state) # shape (batch_size,1,attention_units)

    self.sum = self.W1_out + self.W2_out  # shape (batch_size,10,attention_units)
    self.sum_scale = tf.nn.tanh(self.sum) # shape (batch_size,10,attention_units)

    self.score = self.V(self.sum_scale) # shape (batch_size,10,1)

    self.attention_weights = tf.nn.softmax(self.score, axis=1) # shape (batch_size,10,1)

    self.weighted_enc_out = enc_out * self.attention_weights # shape (batch_size,10,32)

    self.context_vector = tf.reduce_sum(self.weighted_enc_out, axis=1) # shape (batch_size,32)

    return self.context_vector, self.attention_weights

In [44]:
attention_layer = Bahdanau_attention_maker(8)

In [45]:
attention_layer(encoder_output, encoder_state)

(<tf.Tensor: shape=(1, 32), dtype=float32, numpy=
 array([[-0.01388177,  0.0008263 ,  0.00609155, -0.00237381, -0.0008466 ,
         -0.00252712,  0.00335837,  0.0111282 ,  0.00649133,  0.00201039,
         -0.00846961, -0.00334105,  0.01006896,  0.00112595, -0.00367409,
          0.00145448, -0.00064366,  0.00469347, -0.00394646, -0.00235454,
         -0.00551376,  0.00033716, -0.00133298, -0.00139826, -0.00287167,
         -0.00075292, -0.00091411, -0.00043784, -0.00254322, -0.00717109,
         -0.00899556,  0.00648828]], dtype=float32)>,
 <tf.Tensor: shape=(1, 10, 1), dtype=float32, numpy=
 array([[[0.10051886],
         [0.10073073],
         [0.10188538],
         [0.10225762],
         [0.10266352],
         [0.10044947],
         [0.09940907],
         [0.0978561 ],
         [0.09729224],
         [0.09693695]]], dtype=float32)>)
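
The same computation can be written in a few lines of NumPy, which may help to check the shapes by hand. This is only a sketch under stated assumptions: the matrices W1, W2 and v are random placeholders, the bias terms of the dense layers are omitted, and the function name bahdanau_attention is introduced here for illustration.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bahdanau_attention(enc_out, state, W1, W2, v):
    """enc_out: (batch, seq_len, n_units), state: (batch, n_units)."""
    proj_enc = enc_out @ W1                      # (batch, seq_len, attention_units)
    proj_state = state[:, None, :] @ W2          # (batch, 1, attention_units)
    score = np.tanh(proj_enc + proj_state) @ v   # (batch, seq_len, 1)
    weights = softmax(score, axis=1)             # one weight per input position
    context = (weights * enc_out).sum(axis=1)    # (batch, n_units)
    return context, weights

rng = np.random.default_rng(0)
batch, seq_len, n_units, attention_units = 2, 10, 32, 8
enc_out = rng.normal(size=(batch, seq_len, n_units))
state = rng.normal(size=(batch, n_units))
W1 = rng.normal(size=(n_units, attention_units))
W2 = rng.normal(size=(n_units, attention_units))
v = rng.normal(size=(attention_units, 1))

context, weights = bahdanau_attention(enc_out, state, W1, W2, v)
print(context.shape, weights.shape)           # (2, 32) (2, 10, 1)
print(np.allclose(weights.sum(axis=1), 1.0))  # True
```

Because the softmax is taken along the sequence axis, the weights of each sequence sum to 1, which is also what the ten values printed above do (each close to 0.1 before any training).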

### Create decoder

The goal of the decoder is to use the encoder output and the previous target element to predict the next target element!
This means that, step by step, it outputs a sequence with as many elements as the target (this is where the target prefixed with the <start> token comes in, it will serve as input), and each output must have a number of channels equal to the number of possible values for target elements.

Here we can't use the standard Sequential framework to build the model because the initial state of the decoder has to be set to the encoder state.

In addition to this, the same model (with the same weights) will be used in two different ways: with teacher forcing during training, and feeding on its own predictions during inference (prediction on new unknown data). We'll detail the reason for this in what follows.

In [46]:
class decoder_maker(tf.keras.Model):
  def __init__(self, tar_vocab_size, embed_dim, n_units):
    super().__init__()
    # The decoder contains an embedding layer to play with the teacher forcing
    # input, which comes from the target data
    # A gru layer
    # A dense layer to make the predictions
    # And an attention layer
    self.embed = tf.keras.layers.Embedding(input_dim=tar_vocab_size, 
                                    output_dim=embed_dim)
    self.gru = tf.keras.layers.GRU(units=n_units, return_sequences=True,
                                   return_state=True)
    self.pred = tf.keras.layers.Dense(units=tar_vocab_size,activation="softmax")
    self.attention = Bahdanau_attention_maker(attention_units=n_units)

  def __call__(self, dec_in, enc_out, state):
    # first let's apply the attention layer
    self.context_vector, self.attention_weights = self.attention(enc_out,state)

    # now the decoder will ingest one sequence element from the teacher forcing
    # this will be of shape (batch_size, 1)
    self.embed_out = self.embed(dec_in) # shape (batch_size,1,embed_dim)

    # then we need to concatenate the embedding output and the context vector
    # though their shapes are incompatible
    # embed out (batch_size, 1, embed_dim)
    # context vector (batch_size, n_units) where n_units was defined in the encoder
    # so we need to add one dimension along axis 1
    self.context_vector_expanded = tf.expand_dims(self.context_vector, axis=1)
    # shape (batch_size,1,n_units)
    self.concat = tf.keras.layers.concatenate([self.embed_out,
                                               self.context_vector_expanded])
    # shape (batch_size,1, embed_dim + n_units)
    
    # now we get to apply the gru layer
    self.gru_out, self.gru_state = self.gru(self.concat) 
    # shapes (batch_size, 1, n_units) and (batch_size, n_units)

    # let's reshape the gru output before feeding it to the dense layer
    self.gru_out_reshape = tf.reshape(self.gru_out, shape=(-1,
                                                           self.gru_out.shape[2]))

    # now let's make a prediction
    self.pred_out = self.pred(self.gru_out_reshape) # shape (batch_size, tar_vocab_size)

    return self.pred_out, self.gru_state, self.attention_weights
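
The two shape manipulations in the decoder (expanding the context vector before the concatenation, then flattening the GRU output) can be checked with a small NumPy sketch. The arrays are filled with zeros and the batch size of 4 is arbitrary; only the shapes matter here.

```python
import numpy as np

batch_size, embed_dim, n_units = 4, 32, 32

embed_out = np.zeros((batch_size, 1, embed_dim))   # embedded decoder input
context_vector = np.zeros((batch_size, n_units))   # output of the attention layer

# add a length-1 sequence axis so both tensors have three dimensions
context_expanded = context_vector[:, None, :]
print(context_expanded.shape)  # (4, 1, 32)

# concatenate along the last (channel) axis
concat = np.concatenate([embed_out, context_expanded], axis=-1)
print(concat.shape)  # (4, 1, 64)

# the recurrent layer returns one step per sequence; drop the length-1 axis
gru_out = np.zeros((batch_size, 1, n_units))
gru_out_reshape = gru_out.reshape(-1, gru_out.shape[2])
print(gru_out_reshape.shape)  # (4, 32)
```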

Let's now try and use the decoder using the encoder output, the encoder state and the first element of the teacher forcing

In [47]:
decoder = decoder_maker(tar_vocab_size=input_dim, embed_dim=n_embed, n_units=n_gru)

In [48]:
decoder_input = tf.expand_dims(tf.expand_dims(y_train[0][0], axis=0), axis=0) # the teacher forcing is
# the first element of the target sequence which corresponds to the <start> token
# we use expand dim to artificially add the batch size dimension

In [49]:
decoder_input

<tf.Tensor: shape=(1, 1), dtype=int64, numpy=array([[0]])>

In [50]:
decoder(decoder_input,encoder_output,encoder_state)

(<tf.Tensor: shape=(1, 100), dtype=float32, numpy=
 array([[0.0100098 , 0.00999187, 0.01010906, 0.0098903 , 0.01004692,
         0.01004892, 0.00997533, 0.01002377, 0.01002064, 0.009948  ,
         0.00996668, 0.00998718, 0.01004508, 0.00991621, 0.01001625,
         0.01001755, 0.0099857 , 0.00996253, 0.01001414, 0.01004905,
         0.01008093, 0.00995257, 0.01008273, 0.00997252, 0.01001269,
         0.00998041, 0.00991401, 0.00998602, 0.01005298, 0.01002457,
         0.01004699, 0.01004811, 0.00999949, 0.01005429, 0.01000264,
         0.00997646, 0.01002604, 0.01000415, 0.01002245, 0.01002534,
         0.01002086, 0.01004102, 0.0099708 , 0.01000799, 0.00997931,
         0.0099346 , 0.01006611, 0.01005744, 0.01004413, 0.01000439,
         0.00995415, 0.01003723, 0.00990904, 0.01002757, 0.00990901,
         0.01008085, 0.00995578, 0.00999056, 0.00999638, 0.00993594,
         0.01005547, 0.01000185, 0.01005805, 0.00989978, 0.00999695,
         0.0100296 , 0.01008035, 0.01002653, 0.00998

Everything worked well. All that remains is to apply the decoder again to the second element of the teacher forcing sequence, replacing the encoder state with the decoder state, to produce the subsequent predictions.

## Training the encoder decoder model

We are almost there, but contrary to the classic encoder-decoder architecture, using attention forces us to code the training steps manually, because the attention weights (and therefore the weighted encoder output) have to be recomputed from the latest decoder state for each prediction.

In [51]:
optimizer = tf.keras.optimizers.Adam()
loss_function = tf.keras.losses.SparseCategoricalCrossentropy()
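
As a reminder of what this loss measures, here is a minimal NumPy sketch of sparse categorical cross-entropy: the mean of minus the log of the probability assigned to the true class. The function below is a simplified stand-in written for illustration (it ignores the small clipping Keras applies to probabilities).

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, y_prob):
    """Mean of -log(probability assigned to the true class)."""
    picked = y_prob[np.arange(len(y_true)), y_true]
    return float(-np.log(picked).mean())

n_classes = 100
y_true = np.array([49, 45, 66])

# a model that knows nothing spreads its probability evenly over all classes
uniform = np.full((len(y_true), n_classes), 1.0 / n_classes)
uniform_loss = sparse_categorical_crossentropy(y_true, uniform)
print(round(uniform_loss, 4))  # 4.6052
```

An untrained model that predicts an almost uniform distribution over the 100 possible values therefore starts with a loss of about ln(100) ≈ 4.61 per predicted element.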

In [52]:
import os
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)

In [53]:
def train_step(inp, targ):#, enc_initial_state):
  loss = 0

  with tf.GradientTape() as tape: # we use the gradient tape to track all
  # the different operations happening in the network in order to be able
  # to compute the gradients later

    enc_output, enc_state = encoder(inp)#,enc_initial_state) # the input sequence is fed to the 
    # encoder to produce the encoder output and the encoder state

    dec_state = enc_state # the initial state used in the decoder is the encoder
    # state

    dec_input = tf.expand_dims(targ[:,0], axis=1) # the first decoder input
    # is the first sequence element of the target batch, which in our case
    # represents the <start> token for each sequence in the batch. This is
    # what we call the teacher forcing!

    # Everything is set up for the first step, now we need to loop over the
    # teacher forcing sequence to produce the predictions, we already have 
    # defined the first step (element 0) so we will loop from 1 to targ.shape[1]
    # which is the target sequence length
    for t in range(1, targ.shape[1]):
      # passing dec_input, dec_state and enc_output to the decoder
      # in order to produce the prediction, the new state, and the attention
      # weights which we will not need explicitly here
      pred, dec_state, _ = decoder(dec_input, enc_output, dec_state)

      loss += loss_function(targ[:, t], pred) # we compare the prediction
      # produced by teacher forcing with the next element of the target and
      # increment the loss

      # The new decoder input becomes the next element of the target sequence
      # which we just attempted to predict (teacher forcing)
      dec_input = tf.expand_dims(targ[:, t], 1)

  batch_loss = (loss / int(targ.shape[1])) # we divide the summed loss by the
  # target sequence's length to get an average loss per element; note that
  # targ.shape[1] also counts the <start> token, for which no loss is computed

  variables = encoder.trainable_variables + decoder.trainable_variables # here
  # we concatenate the lists of trainable variables for the encoder and the
  # decoder

  gradients = tape.gradient(loss, variables) # compute the gradient based on the
  # loss and the trainable variables

  optimizer.apply_gradients(zip(gradients, variables)) # then update the model's
  # parameters

  return batch_loss
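
The indexing used in the loop above can be summarized as follows: at step t the decoder receives the true element t-1 and is asked to predict the true element t. The small sketch below lists these pairs for one target sequence; the helper teacher_forcing_pairs is hypothetical and only serves the illustration.

```python
def teacher_forcing_pairs(target):
    """Return (decoder input, expected prediction) for each decoding step."""
    return [(target[t - 1], target[t]) for t in range(1, len(target))]

pairs = teacher_forcing_pairs([0, 49, 45, 66, 49, 46])
for step, (dec_in, expected) in enumerate(pairs, start=1):
    print("step", step, "input:", dec_in, "expected:", expected)
# step 1 input: 0 expected: 49
# step 2 input: 49 expected: 45
# step 3 input: 45 expected: 66
# step 4 input: 66 expected: 49
# step 5 input: 49 expected: 46
```

Note that a target of length 6 (including the <start> token) yields 5 prediction steps.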

In [54]:
import time
EPOCHS = 10

for epoch in range(EPOCHS):
  start = time.time()

  total_loss = 0

  for (batch, (inp, targ)) in enumerate(train_batch):
    batch_loss = train_step(inp, targ)
    total_loss += batch_loss

    if batch % 10 == 0:
      print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                   batch,
                                                   batch_loss.numpy()))
  
  # saving (checkpoint) the model every epoch
  checkpoint.save(file_prefix = checkpoint_prefix)

  print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                      total_loss))
  print('Time taken for 1 epoch {} sec'.format(time.time() - start))

  enc_input = X_val
  #classic encoder input

  dec_input = tf.zeros(shape=(len(X_val),1))
  # the first decoder input is the special token 0

  enc_out, enc_state = encoder(enc_input)#, initial_state)
  # we compute once and for all the encoder output and the encoder
  # state (a GRU has a single hidden state)

  dec_state = enc_state
  # The encoder state will serve as the initial state for the
  # decoder

  pred = []  # we'll store the predictions in here

  # we loop over the expected length of the target, but actually the loop can run
  # for as many steps as we wish, which is the advantage of the encoder decoder
  # architecture
  for i in range(y_val.shape[1]-1):
    dec_out, dec_state, attention_w = decoder(dec_input, enc_out, dec_state)
    # the decoder state is updated and we get the first prediction probability 
    # vector
    decoded_out = tf.expand_dims(tf.argmax(dec_out, axis=-1), axis=1)
    # we decode the softmax vector into an index
    pred.append(tf.expand_dims(dec_out,axis=1)) # update the prediction list
    dec_input = decoded_out # the previous pred will be used as the new input

  pred = tf.concat(pred, axis=1).numpy()
  print("\n val loss :", loss_function(y_val[:,1:],pred),"\n")

Epoch 1 Batch 0 Loss 3.8377
Epoch 1 Batch 10 Loss 3.8359
Epoch 1 Batch 20 Loss 3.8363
Epoch 1 Batch 30 Loss 3.8337
Epoch 1 Batch 40 Loss 3.8321
Epoch 1 Batch 50 Loss 3.8298
Epoch 1 Batch 60 Loss 3.8225
Epoch 1 Batch 70 Loss 3.8067
Epoch 1 Loss 302.2648
Time taken for 1 epoch 12.676093101501465 sec

 val loss : tf.Tensor(4.5288196, shape=(), dtype=float32) 

Epoch 2 Batch 0 Loss 3.7625
Epoch 2 Batch 10 Loss 3.7373
Epoch 2 Batch 20 Loss 3.6997
Epoch 2 Batch 30 Loss 3.6930
Epoch 2 Batch 40 Loss 3.6447
Epoch 2 Batch 50 Loss 3.6297
Epoch 2 Batch 60 Loss 3.6223
Epoch 2 Batch 70 Loss 3.6046
Epoch 2 Loss 289.9551
Time taken for 1 epoch 20.49025797843933 sec

 val loss : tf.Tensor(4.3099375, shape=(), dtype=float32) 

Epoch 3 Batch 0 Loss 3.5790
Epoch 3 Batch 10 Loss 3.5631
Epoch 3 Batch 20 Loss 3.5155
Epoch 3 Batch 30 Loss 3.4761
Epoch 3 Batch 40 Loss 3.4861
Epoch 3 Batch 50 Loss 3.4497
Epoch 3 Batch 60 Loss 3.4529
Epoch 3 Batch 70 Loss 3.4275
Epoch 3 Loss 275.5563
Time taken for 1 epoch 20.50

Nice! The training is over, and it looks as though the model performs really well both on train and validation sets!

## Make predictions with the inference model

To make predictions on the validation set, we cannot use teacher forcing: the model has to rely on its own predictions!

In [55]:
enc_input = X_val
#classic encoder input

dec_input = tf.zeros(shape=(len(X_val),1))
# the first decoder input is the special token 0

#initial_state = encoder.state_initializer(len(X_val))

enc_out, enc_state = encoder(enc_input)#, initial_state)
# we compute once and for all the encoder output and the encoder
# state (a GRU has a single hidden state)

dec_state = enc_state
# The encoder state will serve as the initial state for the
# decoder

pred = []  # we'll store the predictions in here

# we loop over the expected length of the target, but actually the loop can run
# for as many steps as we wish, which is the advantage of the encoder decoder
# architecture
for i in range(y_val.shape[1]-1):
  dec_out, dec_state, attention_w = decoder(dec_input, enc_out, dec_state)
  # the decoder state is updated and we get the first prediction probability 
  # vector
  decoded_out = tf.expand_dims(tf.argmax(dec_out, axis=-1), axis=1)
  # we decode the softmax vector into an index
  pred.append(decoded_out) # update the prediction list
  dec_input = decoded_out # the previous pred will be used as the new input

pred = tf.concat(pred, axis=-1).numpy()
for i in range(10):
  print("pred:", pred[i,:].tolist())
  print("true:", y_val[i,:].tolist()[1:])
  print("\n")

pred: [51, 91, 60, 65, 68]
true: [51, 91, 60, 65, 68]


pred: [33, 98, 88, 13, 65]
true: [33, 98, 88, 13, 65]


pred: [68, 14, 76, 70, 80]
true: [68, 14, 76, 70, 80]


pred: [50, 69, 60, 37, 32]
true: [50, 69, 60, 37, 32]


pred: [4, 97, 70, 74, 86]
true: [4, 97, 70, 74, 86]


pred: [24, 39, 66, 11, 36]
true: [24, 39, 66, 11, 36]


pred: [48, 88, 55, 11, 72]
true: [48, 88, 55, 11, 72]


pred: [6, 82, 8, 68, 18]
true: [6, 82, 8, 68, 18]


pred: [14, 49, 65, 90, 2]
true: [14, 49, 65, 90, 2]


pred: [41, 55, 19, 9, 43]
true: [41, 55, 19, 9, 43]




The results look very good: on these ten validation samples, every predicted sequence matches the true one exactly! This is a clear improvement over the plain encoder-decoder! Attention must be really powerful!

The fact that the model reuses the encoder output at each step with different weights helps the model achieve better predictions with less training (that is, in fewer epochs).
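
Rather than comparing sequences by eye, one could quantify the quality of the predictions. The sketch below computes a per-element accuracy and an exact-match rate with NumPy. The function sequence_metrics is an assumption introduced for illustration, and the third predicted row contains a deliberately altered value (a made-up mistake) to show how the two metrics differ.

```python
import numpy as np

def sequence_metrics(pred, true):
    """Return (per-element accuracy, share of sequences predicted entirely right)."""
    correct = pred == true
    return float(correct.mean()), float(correct.all(axis=1).mean())

pred = np.array([[51, 91, 60, 65, 68],
                 [33, 98, 88, 13, 65],
                 [68, 14, 76, 70, 81]])  # last value altered on purpose (hypothetical)
true = np.array([[51, 91, 60, 65, 68],
                 [33, 98, 88, 13, 65],
                 [68, 14, 76, 70, 80]])

token_acc, exact_match = sequence_metrics(pred, true)
print(round(token_acc, 3), round(exact_match, 3))  # 0.933 0.667
```

With the arrays from this demo, the call would be sequence_metrics(pred, y_val[:, 1:]), since the first column of y_val holds the <start> token.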

I hope you found this demonstration useful! Now it is time for you to apply what you have learned to a real world automatic translation problem!