# Code Encoder Decoder

The goal of this demo is to teach you how to code an encoder-decoder model!
Since this is just a demo, we will use generated data; you will tackle a real problem during the exercise. The goal here is to focus on building the model and the training loop.

## Import libraries

In [34]:
# Import TensorFlow and the other libraries used in this demo
import tensorflow as tf 
import pathlib 
import pandas as pd 
import os
import io
import warnings
warnings.filterwarnings('ignore')
import json
from random import randint
from numpy import array
from numpy import argmax
from numpy import array_equal
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense

## Generate data

We will generate random input and target data for the purpose of the demonstration.

In [35]:
input_dim = 100
input_seq_len = 10
target_seq_len = 5

In [36]:
# generate a sequence of random integers
def generate_sequence(length, n_unique):
	return [randint(1, n_unique-1) for _ in range(length)]

In [37]:
generate_sequence(input_seq_len,input_dim)

[37, 86, 61, 43, 95, 23, 20, 38, 4, 40]

In [38]:
# prepare data for the LSTM
def get_dataset(n_in, n_out, cardinality, n_samples, printing=False):
  X1, X2, y = list(), list(), list()
  for _ in range(n_samples):
    # generate source sequence
    source = generate_sequence(n_in, cardinality)
    if printing:
      print("source:", source)
    # build the target: the first n_out elements of the source, reversed
    target = source[:n_out]
    target.reverse()
    if printing:
      print("target:", target)
    # build the decoder input: the target shifted right by one position,
    # starting with the start token 0
    target_in = [0] + target[:-1]
    if printing:
      print("padded target:", target_in)
    # store
    X1.append(source)
    X2.append(target_in)
    y.append(target)
  return array(X1), array(X2), array(y)

In [39]:
# named `source` rather than `input` to avoid shadowing the Python builtin
source, padded_target, target = get_dataset(input_seq_len, target_seq_len, input_dim, 1, True)

source: [8, 47, 44, 46, 28, 38, 29, 35, 72, 71]
target: [28, 46, 44, 47, 8]
padded target: [0, 28, 46, 44, 47]


The data we generate consists of random sequences of integers (they could just as well represent encoded letters, words, sentences, or anything else you can think of).

The target is built from the first elements of the input, in reversed order.

We also create a padded target sequence for teacher forcing: at each step, the decoder receives the true previous element of the target as input to help it predict the next element. Since the generated values range from 1 to 99, the value 0 is free to serve as the start token.
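To make the shift used for teacher forcing concrete, here is a minimal sketch in plain Python. The helper name `shift_right` is an assumption made for illustration (it is not part of the notebook); it reproduces what `get_dataset` does when it builds the padded target.

```python
def shift_right(target, start_token=0):
    """Build the decoder input: start token first, last target element dropped."""
    return [start_token] + list(target[:-1])


target = [28, 46, 44, 47, 8]
decoder_input = shift_right(target)

# At each step the decoder sees the true previous element and must predict
# the current one.
for step, (given, expected) in enumerate(zip(decoder_input, target)):
    print(f"step {step}: decoder sees {given:>2} -> must predict {expected}")
```

The decoder input and the target have the same length, which is what lets the training model predict every position of the target in a single pass.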

Now that we understand this, let's create the training data and validation data.

In [40]:
X_train, padded_y_train, y_train = get_dataset(input_seq_len,target_seq_len,input_dim,10000)
X_val, padded_y_val, y_val = get_dataset(input_seq_len,target_seq_len,input_dim,5000)

## Create encoder model

In this step we will define the encoder model.

The goal of the encoder is to create a representation of the input data, to extract information from the input data which will then be interpreted by the decoder model.

In general, an encoder receives sequences as input and outputs sequences with a given depth of representation (the dimension we previously called channels).

In [41]:
# let's start by defining the number of units needed for the embedding and
# the lstm layers

n_embed = 32
n_lstm = 16

In [42]:
# the shape must be a tuple, hence the trailing comma
encoder_input = tf.keras.Input(shape=(input_seq_len,))
encoder_embed = tf.keras.layers.Embedding(input_dim=input_dim, output_dim=n_embed)
encoder_lstm = tf.keras.layers.LSTM(n_lstm, return_state=True)

encoder_embed_output = encoder_embed(encoder_input)
# with return_state=True the LSTM returns [last output, hidden state, cell state]
encoder_output = encoder_lstm(encoder_embed_output)

encoder = tf.keras.Model(inputs = encoder_input, outputs = encoder_output)

That's it; it does not need to be any more complicated than this. Note, though, that we did not preserve the sequential nature of the data: since `return_sequences` is not set, the LSTM only returns its last output, together with its hidden state and cell state, which will serve as the initial state for the decoder!

Let's try it out on an input to see what comes out!

In [43]:
encoder(tf.expand_dims(X_train[0],0))

[<tf.Tensor: shape=(1, 16), dtype=float32, numpy=
 array([[-0.0008015 , -0.01416984,  0.01448857, -0.00112979,  0.00277521,
         -0.01476327,  0.00390315,  0.00686706, -0.01231197,  0.00438643,
          0.00354464,  0.00281679,  0.01053815,  0.0159905 ,  0.01118115,
         -0.00448248]], dtype=float32)>,
 <tf.Tensor: shape=(1, 16), dtype=float32, numpy=
 array([[-0.0008015 , -0.01416984,  0.01448857, -0.00112979,  0.00277521,
         -0.01476327,  0.00390315,  0.00686706, -0.01231197,  0.00438643,
          0.00354464,  0.00281679,  0.01053815,  0.0159905 ,  0.01118115,
         -0.00448248]], dtype=float32)>,
 <tf.Tensor: shape=(1, 16), dtype=float32, numpy=
 array([[-0.00157401, -0.02897191,  0.02902606, -0.00225902,  0.00572838,
         -0.02893112,  0.00783362,  0.01341706, -0.02487618,  0.0086576 ,
          0.00697065,  0.00555785,  0.02056631,  0.03217805,  0.02257848,
         -0.00900368]], dtype=float32)>]
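In the output above, the first two tensors are identical. This is expected: when `return_sequences` is not set, the output of a Keras LSTM layer is its last hidden state, so `return_state=True` yields the output, the hidden state and the cell state, with the output equal to the hidden state. The sketch below is a deliberately simplified scalar LSTM in plain Python that illustrates where the three values come from. It is only an illustration under strong assumptions: a single unit, and hypothetical weights `w`, `u`, `b` shared by all gates, which a real layer does not do.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def toy_lstm(sequence, w=0.5, u=0.1, b=0.0):
    """Scalar, single-unit LSTM. Returns (output, hidden state, cell state)."""
    h, c = 0.0, 0.0
    for x in sequence:
        z = w * x + u * h + b
        i = sigmoid(z)        # input gate
        f = sigmoid(z)        # forget gate
        o = sigmoid(z)        # output gate
        g = math.tanh(z)      # candidate cell value
        c = f * c + i * g     # the cell state accumulates information
        h = o * math.tanh(c)  # the hidden state is a gated view of the cell
    output = h  # without return_sequences, the output is the last hidden state
    return output, h, c


output, h, c = toy_lstm([0.1, -0.3, 0.2])
print(output == h)  # True
```

Whatever the length of the input sequence, only the final `h` and `c` are returned: this fixed-size summary is all the decoder receives from the encoder.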

## Create decoder

The goal of the decoder is to use the encoder output and the previous target element to predict the next target element!
This means its output is a sequence with as many elements as the target (this is where the padded target comes in: it serves as input), and it must have a number of channels equal to the number of possible values for target elements.
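To illustrate what "one channel per possible value" means, here is a small plain-Python sketch of a softmax followed by an argmax. The four-value vocabulary and the scores in `logits` are made up for the example; in the notebook the vocabulary has `input_dim` values.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for one decoding step over a vocabulary of 4 values
logits = [0.2, 1.5, -0.3, 0.9]
probs = softmax(logits)

# The predicted element is the index of the largest probability
predicted_token = max(range(len(probs)), key=probs.__getitem__)
print(predicted_token)  # 1
```

The decoder produces one such probability vector per position of the target sequence.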

Here we can't use the standard Sequential framework to build the model because the initial state of the decoder has to be set to the encoder states.

In addition to this, two versions of the same model (with the same weights) have to be prepared, one of them for training, and one of them for inference (prediction on new unknown data). We'll detail the reason for this in what follows.

### Decoder for training

Training the decoder requires the teacher forcing mechanism, which provides the model with the correct previous element of the output sequence so that it can predict the next element.

In [44]:
# the shape must be a tuple, hence the trailing comma
decoder_input = tf.keras.Input(shape=(target_seq_len,))
decoder_embed = tf.keras.layers.Embedding(input_dim=input_dim,output_dim=n_embed)
decoder_lstm = tf.keras.layers.LSTM(n_lstm, return_sequences=True, return_state=True)
decoder_pred = tf.keras.layers.Dense(input_dim, activation="softmax")

decoder_embed_output = decoder_embed(decoder_input) # teacher forcing happens here
# the decoder input is actually the padded target we created earlier, remember
# if target is: [91, 47, 89, 21, 62]
# the decoder input will be: [0, 91, 47, 89, 21]
decoder_lstm_output, _, _ = decoder_lstm(decoder_embed_output, initial_state=encoder_output[1:])
# in the step described above the decoder receives the encoder state as its
# initial state.
decoder_output = decoder_pred(decoder_lstm_output)
# then the dense layer will convert the vector representation for each element
# in the sequence into a probability distribution across all possible tokens
# in the vocabulary!

decoder = tf.keras.Model(inputs = [encoder_input,decoder_input], outputs = decoder_output)
# all we need to do is put the model together using the input output framework!

Let's try out the decoder model on some input sequences!

In [45]:
decoder([tf.expand_dims(X_train[0],0),tf.expand_dims(padded_y_train[0],0)])

<tf.Tensor: shape=(1, 5, 100), dtype=float32, numpy=
array([[[0.00993765, 0.01003179, 0.01005074, 0.00998255, 0.01002816,
         0.00986233, 0.00999714, 0.00993832, 0.01003569, 0.00992759,
         0.01005428, 0.01008545, 0.01012676, 0.00999361, 0.01012079,
         0.00996771, 0.01009511, 0.01004484, 0.01006847, 0.00988823,
         0.00993143, 0.0100653 , 0.00998847, 0.00998062, 0.00996088,
         0.0099541 , 0.00996993, 0.01003942, 0.01001204, 0.01008488,
         0.01000296, 0.00995553, 0.01000181, 0.00999019, 0.01004315,
         0.01002374, 0.01003929, 0.00997067, 0.01001312, 0.0099583 ,
         0.00996789, 0.01012802, 0.01003536, 0.01004526, 0.01000515,
         0.01009611, 0.01004796, 0.01001268, 0.01008893, 0.00995894,
         0.01001169, 0.01008057, 0.01000783, 0.01000184, 0.01001132,
         0.00988387, 0.0100054 , 0.00994856, 0.00998683, 0.00993286,
         0.01003086, 0.00998121, 0.01007184, 0.01000163, 0.00999537,
         0.00999809, 0.01002349, 0.00998077, 0.010

### Decoder for inference (prediction)

Contrary to the training case, for inference we have access to neither the target nor the padded target. The decoder input will be a sequence starting with $0$, which is the special start token in our case, followed by the predictions of the decoder as they come.
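The feedback mechanism described above can be sketched independently of Keras. In the plain-Python example below, `greedy_decode` and `toy_step` are hypothetical names introduced for illustration: `toy_step` is a stand-in for one call to the inference decoder, taking the previous token and a state and returning scores over a vocabulary of 10 values together with the new state.

```python
def greedy_decode(step_fn, state, start_token=0, max_len=5):
    """Feed each prediction back as the next input, starting from start_token."""
    token, output = start_token, []
    for _ in range(max_len):
        scores, state = step_fn(token, state)
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        output.append(token)
    return output


def toy_step(token, state, vocab_size=10):
    """Toy decoder step: the next token is (previous token + state) mod vocab_size."""
    scores = [0.0] * vocab_size
    scores[(token + state) % vocab_size] = 1.0
    return scores, state  # this toy state never changes


print(greedy_decode(toy_step, state=3))  # [3, 6, 9, 2, 5]
```

Each prediction depends on the previous one, which is why an early mistake can propagate through the rest of the decoded sequence.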

In [46]:
decoder_state_input_h = Input(shape=(n_lstm,))
decoder_state_input_c = Input(shape=(n_lstm,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
# at the first inference step, these inputs will be respectively the
# hidden state and cell state of the encoder model
# for the following steps, they will be the hidden and cell states of the
# decoder itself; since the decoder input sequence is unknown, we have to
# predict step by step using a loop

decoder_input_inf = tf.keras.Input(shape=(1,))
decoder_embed_output = decoder_embed(decoder_input_inf)
# the decoder input here is of shape 1 because we will feed the elements in the 
# sequence one by one

decoder_outputs, state_h, state_c = decoder_lstm(decoder_embed_output, initial_state=decoder_states_inputs)
# the lstm layer works in the same way, the output from the embedding is used
# and the decoder state is used as described above

decoder_states = [state_h, state_c]
# we store the lstm states in a specific object as we'll have to use them as 
# initial state for the next inference step

decoder_outputs = decoder_pred(decoder_outputs)
# the lstm output is then converted to a probability distribution over the
# target vocabulary

decoder_inf = Model(inputs = [decoder_input_inf, decoder_states_inputs], 
                     outputs = [decoder_outputs, decoder_states])
# Finally we wrap up the model building by setting up the inputs and outputs

Here is an example of how this version of the model produces predictions; we'll need to write a loop for this!

In [47]:
enc_input = tf.expand_dims(X_train[0],0)
#classic encoder input

dec_input = tf.zeros(shape=(1,1))
# the first decoder input is the special token 0

enc_out, state_h_inf, state_c_inf = encoder(enc_input)
# we compute once and for all the encoder output and the encoder
# h state and c state

dec_state = [state_h_inf, state_c_inf]
# The encoder h state and c state will serve as initial states for the
# decoder

pred = []  # we'll store the predictions in here

# we loop over the expected length of the target, but actually the loop can run
# for as many steps as we wish, which is the advantage of the encoder decoder
# architecture
for i in range(target_seq_len):
  dec_out, dec_state = decoder_inf([dec_input, dec_state])
  # the decoder state is updated and we get the probability vector for the
  # next element
  decoded_out = tf.argmax(dec_out, axis=-1)
  # we decode the softmax vector into an index
  pred.append(decoded_out) # update the prediction list
  dec_input = decoded_out # the previous pred will be used as the new input

pred

[<tf.Tensor: shape=(1, 1), dtype=int64, numpy=array([[41]])>,
 <tf.Tensor: shape=(1, 1), dtype=int64, numpy=array([[41]])>,
 <tf.Tensor: shape=(1, 1), dtype=int64, numpy=array([[41]])>,
 <tf.Tensor: shape=(1, 1), dtype=int64, numpy=array([[41]])>,
 <tf.Tensor: shape=(1, 1), dtype=int64, numpy=array([[41]])>]

## Training the encoder decoder model

We are almost there! The difficult part was building the model; the training step will be much easier.
All we have to do is `compile` the model to assign a loss function and an optimizer, then call the `fit` method!
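As a rough illustration of what the sparse categorical cross-entropy loss measures, the plain-Python sketch below computes the mean of `-log(p[true label])` over the positions of one sequence. It ignores the numerical clipping a library implementation applies, and the uniform predictions are an assumption standing in for an untrained model over 100 possible values.

```python
import math


def sparse_categorical_crossentropy(y_true, y_prob):
    """Mean of -log(probability assigned to the true label) over all positions."""
    losses = [-math.log(probs[label]) for label, probs in zip(y_true, y_prob)]
    return sum(losses) / len(losses)


vocab_size = 100
uniform = [1.0 / vocab_size] * vocab_size  # every value is equally likely

y_true = [28, 46, 44, 47, 8]         # integer labels, no one-hot encoding needed
y_prob = [uniform] * len(y_true)     # one probability vector per position

loss = sparse_categorical_crossentropy(y_true, y_prob)
print(round(loss, 3))  # 4.605, which is ln(100)
```

A loss close to ln(100) therefore means the model is doing no better than guessing uniformly; training should bring it well below that value.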

In [48]:
decoder.compile(
    optimizer="Adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()]
)

In [49]:
decoder.fit(x=[X_train,padded_y_train],y=y_train,epochs=50, validation_data=([X_val,padded_y_val],y_val))

Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x7f763cb92ed0>

Nice! The training is over, and it looks as though we could have trained the model even longer, since it has not yet started to overfit!

## Make predictions with the inference model

You may have noticed that we used the exact same layers for the training model and the inference model, so they share the same weights. The difference is that the inference model can be used on new data, where teacher forcing is no longer possible!

In [58]:
enc_input = X_val
#classic encoder input

dec_input = tf.zeros(shape=(len(X_val),1))
# the first decoder input is the special token 0

enc_out, state_h_inf, state_c_inf = encoder(enc_input)
# we compute once and for all the encoder output and the encoder
# h state and c state

dec_state = [state_h_inf, state_c_inf]
# The encoder h state and c state will serve as initial states for the
# decoder

pred = []  # we'll store the predictions in here

# we loop over the expected length of the target, but actually the loop can run
# for as many steps as we wish, which is the advantage of the encoder decoder
# architecture
for i in range(target_seq_len):
  dec_out, dec_state = decoder_inf([dec_input, dec_state])
  # the decoder state is updated and we get the probability vector for the
  # next element
  decoded_out = tf.argmax(dec_out, axis=-1)
  # we decode the softmax vector into an index
  pred.append(decoded_out) # update the prediction list
  dec_input = decoded_out # the previous pred will be used as the new input

pred = tf.concat(pred, axis=-1).numpy()
for i in range(10):
  print("pred:", pred[i,:])
  print("true:", y_val[i,:])
  print("\n")

pred: [20 84 70 68 68]
true: [20 84 70 68 68]


pred: [98 61 55 20 47]
true: [98 61 65 66 78]


pred: [50 33 63 85 46]
true: [50 33 63 28 17]


pred: [99 81 58 79 51]
true: [12 19 68 70 60]


pred: [54 83 22 99 76]
true: [54 83 22 24  2]


pred: [84  8 34 72 48]
true: [84  2 16 83 60]


pred: [28 15 54 62 13]
true: [82 42 85 68 65]


pred: [94 55 34  5 61]
true: [94 55 50 28 76]


pred: [ 2  3 77 14 22]
true: [ 2  3 77 79 54]


pred: [37 41 78 93 12]
true: [37 45 14 66 30]




The results do not look too bad. However, it looks as though once the model makes a mistake on one of the predictions, the rest of the sequence is also poorly predicted!

This behaviour can be explained in the following way: the information taken from the encoder is only taken into account directly in the first decoding step, which means that everything that happens after this step depends on what information the decoder feeds itself from that point onwards.
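The degradation along the sequence can be quantified on the ten examples printed above. The sketch below, in plain Python, computes the accuracy at each position and the share of sequences predicted entirely correctly; the helper names `per_position_accuracy` and `exact_match_rate` are assumptions made for illustration.

```python
# The ten (pred, true) pairs printed above
pred = [[20, 84, 70, 68, 68], [98, 61, 55, 20, 47], [50, 33, 63, 85, 46],
        [99, 81, 58, 79, 51], [54, 83, 22, 99, 76], [84, 8, 34, 72, 48],
        [28, 15, 54, 62, 13], [94, 55, 34, 5, 61], [2, 3, 77, 14, 22],
        [37, 41, 78, 93, 12]]
true = [[20, 84, 70, 68, 68], [98, 61, 65, 66, 78], [50, 33, 63, 28, 17],
        [12, 19, 68, 70, 60], [54, 83, 22, 24, 2], [84, 2, 16, 83, 60],
        [82, 42, 85, 68, 65], [94, 55, 50, 28, 76], [2, 3, 77, 79, 54],
        [37, 45, 14, 66, 30]]


def per_position_accuracy(pred, true):
    """Fraction of correct predictions at each position of the sequence."""
    n = len(pred)
    return [sum(p[j] == t[j] for p, t in zip(pred, true)) / n
            for j in range(len(pred[0]))]


def exact_match_rate(pred, true):
    """Fraction of sequences predicted correctly at every position."""
    return sum(p == t for p, t in zip(pred, true)) / len(pred)


print(per_position_accuracy(pred, true))  # [0.8, 0.6, 0.4, 0.1, 0.1]
print(exact_match_rate(pred, true))       # 0.1
```

On this small sample the accuracy drops steadily from the first position to the last, which is consistent with errors propagating through the decoder.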

Nevertheless, the encoder-decoder framework has made major advances possible, especially for predicting sequences of arbitrary length. Tomorrow we'll learn about a solution to the problem of predictions worsening along the sequence!

I hope you found this demonstration useful! Now it is time for you to apply what you have learned to a real-world automatic translation problem!