## <font color='darkblue'>Preface</font>
([article source](https://machinelearningmastery.com/implementing-the-transformer-decoder-from-scratch-in-tensorflow-and-keras/)) <b><font size='3ptx'>There are many similarities between the Transformer encoder and decoder, such as in their implementation of multi-head attention, layer normalization and a fully connected feed-forward network as their final sub-layer.</font></b>

Having implemented the [Transformer encoder](https://github.com/johnklee/ml_articles/blob/master/mlmastery/Implementing_the_transformer_encoder_from_scratch_in_tensorflow_and_keras/notebook.ipynb), we will now proceed to apply our knowledge in implementing the Transformer decoder, as a further step towards implementing the complete Transformer model.

In this tutorial, you will discover how to implement the Transformer decoder from scratch in TensorFlow and Keras. After completing this tutorial, you will know:
* The layers that form part of the Transformer decoder.
* How to implement the Transformer decoder from scratch.  

### <font color='darkgreen'>Tutorial Overview</font>
This tutorial is divided into three parts; they are:
* [**Recap of the Transformer Architecture**](#sect1)
* [**Implementing the Transformer Decoder From Scratch**](#sect2)
    * The Decoder Layer
    * The Transformer Decoder
* [**Testing Out the Code**](#sect3)

### <font color='darkgreen'>Prerequisites</font>
For this tutorial, we assume that you are already familiar with:
* [The Transformer model](https://machinelearningmastery.com/the-transformer-model/)
* [The scaled dot-product attention](https://machinelearningmastery.com/?p=13364&preview=true)
* [The multi-head attention](https://machinelearningmastery.com/?p=13351&preview=true)
* [The Transformer positional encoding](https://machinelearningmastery.com/the-transformer-positional-encoding-layer-in-keras-part-2/)
* [The Transformer encoder](https://machinelearningmastery.com/?p=13389&preview=true)

<a id='sect1'></a>
## <font color='darkblue'>Recap of the Transformer Architecture</font>
[Recall](https://machinelearningmastery.com/the-transformer-model/) having seen that the Transformer architecture follows an encoder-decoder structure: the encoder, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations; **the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step, to generate an output sequence**.

![Transformer decoder in arch](images/1.PNG)

In generating an output sequence, the Transformer does not rely on recurrence and convolutions.

We had seen that the decoder part of the Transformer shares many similarities in its architecture with the encoder. This tutorial will be exploring these similarities. 

### <font color='darkgreen'>The Transformer Decoder</font>
Similar to the [Transformer encoder](https://github.com/johnklee/ml_articles/blob/master/mlmastery/Implementing_the_transformer_encoder_from_scratch_in_tensorflow_and_keras/notebook.ipynb), the Transformer decoder also consists of a stack of identical layers. The Transformer decoder, however, implements an additional multi-head attention block, for a total of three main sub-layers:
* The first sub-layer comprises a multi-head attention mechanism that receives the queries, keys and values as inputs.
* The second sub-layer comprises a second multi-head attention mechanism. 
* The third sub-layer comprises a fully-connected feed-forward network. 

![Transformer decoder in arch](images/1.PNG)

<br/>

Each of these three sub-layers is also followed by layer normalization, where the input to the layer normalization step is the sum of the corresponding sub-layer input (<font color='brown'>through a residual connection</font>) and the sub-layer output.
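The Add & Norm step can be illustrated with a minimal NumPy sketch. This is not the `AddNormalization` class from the encoder tutorial: the helper names `layer_norm` and `add_and_norm` are hypothetical, and the learned scale and shift parameters of a real layer normalization are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the last (feature) axis; learned scale/shift omitted in this sketch
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(sublayer_input, sublayer_output):
    # Residual connection (sum) followed by layer normalization
    return layer_norm(sublayer_input + sublayer_output)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 8))             # (batch_size, sequence_length, d_model)
sublayer_out = rng.normal(size=(2, 5, 8))  # stand-in for an attention or feed-forward output
y = add_and_norm(x, sublayer_out)
print(y.shape)  # (2, 5, 8)
```

The shape is unchanged by the step, which is what allows the sub-layers to be stacked.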

On the decoder side, the queries, keys and values that are fed into the first multi-head attention block also represent the same input sequence. However, this time round, it is the target sequence that is embedded and augmented with positional information before being supplied to the decoder. The second multi-head attention block, on the other hand, receives the encoder output in the form of keys and values, and the normalized output of the first decoder attention block as the queries. In both cases, the dimensionality of the queries and keys remains equal to $d_{k}$, whereas the dimensionality of the values remains equal to $d_{v}$.
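To make the roles in the second attention block concrete, here is a single-head NumPy sketch in which the queries come from the decoder side and the keys and values from the encoder side. The function `scaled_dot_product_attention` and the sequence lengths are illustrative assumptions, and the inputs are treated as already projected to $d_{k}$ and $d_{v}$.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    # scores has shape (batch, query_length, key_length)
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    # Numerically stable softmax over the key axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
d_k, d_v = 64, 64
dec_len, enc_len = 5, 7  # hypothetical target and source lengths

queries = rng.normal(size=(1, dec_len, d_k))  # from the first decoder attention block
keys = rng.normal(size=(1, enc_len, d_k))     # from the encoder output
values = rng.normal(size=(1, enc_len, d_v))   # from the encoder output

output, weights = scaled_dot_product_attention(queries, keys, values)
print(output.shape)   # (1, 5, 64)
print(weights.shape)  # (1, 5, 7)
```

The output has one row per decoder position, while each row of the attention weights spans the encoder positions.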

Vaswani et al. introduce regularization into the model on the decoder side too, by applying dropout to the output of each sub-layer (before the layer normalization step), as well as to the positional encodings before these are fed into the decoder. 

Let’s now see how to implement the Transformer decoder from scratch in TensorFlow and Keras.

<a id='sect2'></a>
## <font color='darkblue'>Implementing the Transformer Decoder From Scratch</font>

### <font color='darkgreen'>The Decoder Layer</font>
Since we have already implemented the required sub-layers when we covered the [implementation of the Transformer encoder](https://github.com/johnklee/ml_articles/blob/master/mlmastery/Implementing_the_transformer_encoder_from_scratch_in_tensorflow_and_keras/notebook.ipynb), we will create a class for the decoder layer that makes use of these sub-layers straight away:
```python
from tensorflow.keras.layers import Layer, Dropout
from multihead_attention import MultiHeadAttention
from encoder import AddNormalization, FeedForward

class DecoderLayer(Layer):
    def __init__(self, h, d_k, d_v, d_model, d_ff, rate, **kwargs):
        super(DecoderLayer, self).__init__(**kwargs)
        self.multihead_attention1 = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout1 = Dropout(rate)
        self.add_norm1 = AddNormalization()
        self.multihead_attention2 = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout2 = Dropout(rate)
        self.add_norm2 = AddNormalization()
        self.feed_forward = FeedForward(d_ff, d_model)
        self.dropout3 = Dropout(rate)
        self.add_norm3 = AddNormalization()
        ...
```

<br/>

Notice here that since my code for the different sub-layers had been saved into several Python scripts (<font color='brown'>namely,</font> <font color='olive'>multihead_attention.py</font> <font color='brown'>and</font> <font color='olive'>encoder.py</font>), it was necessary to import them to be able to use the required classes. 

As we had done for the Transformer encoder, we will proceed to create the class method, <font color='blue'>call()</font>, that implements all of the decoder sub-layers:
```python
...
def call(self, x, encoder_output, lookahead_mask, padding_mask, training):
    # Multi-head attention layer
    multihead_output1 = self.multihead_attention1(x, x, x, lookahead_mask)
    # Expected output shape = (batch_size, sequence_length, d_model)

    # Add in a dropout layer
    multihead_output1 = self.dropout1(multihead_output1, training=training)

    # Followed by an Add & Norm layer
    addnorm_output1 = self.add_norm1(x, multihead_output1)
    # Expected output shape = (batch_size, sequence_length, d_model)

    # Followed by another multi-head attention layer
    multihead_output2 = self.multihead_attention2(addnorm_output1, encoder_output, encoder_output, padding_mask)

    # Add in another dropout layer
    multihead_output2 = self.dropout2(multihead_output2, training=training)

    # Followed by another Add & Norm layer
    addnorm_output2 = self.add_norm2(addnorm_output1, multihead_output2)

    # Followed by a fully connected layer
    feedforward_output = self.feed_forward(addnorm_output2)
    # Expected output shape = (batch_size, sequence_length, d_model)

    # Add in another dropout layer
    feedforward_output = self.dropout3(feedforward_output, training=training)

    # Followed by another Add & Norm layer
    return self.add_norm3(addnorm_output2, feedforward_output)
```

<br/>

The multi-head attention sub-layers can also receive a padding mask or a look-ahead mask. As a brief reminder of what we had said in a [previous tutorial](https://machinelearningmastery.com/how-to-implement-scaled-dot-product-attention-from-scratch-in-tensorflow-and-keras), <b>the padding mask is necessary to suppress the zero padding in the input sequence from being processed along with the actual input values</b>. The look-ahead mask prevents the decoder from attending to succeeding words, such that the prediction for a particular word can only depend on known outputs for the words that come before it.
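The two masks can be sketched in NumPy as follows. The helper names `padding_mask` and `lookahead_mask` are hypothetical, and the sketch assumes the convention that a value of 1 marks a position to be suppressed and that token id 0 denotes padding.

```python
import numpy as np

def padding_mask(token_ids):
    # 1.0 where the token is zero padding, 0.0 for actual input values
    return (token_ids == 0).astype(np.float32)

def lookahead_mask(length):
    # 1.0 strictly above the diagonal: position i may not attend to any j > i
    return np.triu(np.ones((length, length), dtype=np.float32), k=1)

tokens = np.array([[3, 7, 2, 0, 0]])  # a made-up sequence padded to length 5
pad = padding_mask(tokens)
look = lookahead_mask(4)

print(pad)
# [[0. 0. 0. 1. 1.]]
print(look)
# [[0. 1. 1. 1.]
#  [0. 0. 1. 1.]
#  [0. 0. 0. 1.]
#  [0. 0. 0. 0.]]
```

Under this convention, a masked position is typically given a large negative score before the softmax so that its attention weight becomes effectively zero.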

The same <font color='blue'>call()</font> class method can also receive a training flag to only apply the [**Dropout**](https://keras.io/api/layers/regularization_layers/dropout/) layers during training, when the value of this flag is set to `True`.
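The effect of the training flag can be illustrated with a small NumPy sketch of inverted dropout. The `dropout` function below is a hypothetical stand-in rather than the Keras layer itself: during training it zeroes a random subset of units and rescales the rest by `1 / (1 - rate)`, and otherwise it returns its input unchanged.

```python
import numpy as np

def dropout(x, rate, training, rng):
    # Outside of training, dropout is a no-op
    if not training:
        return x
    # Keep each unit with probability (1 - rate) and rescale the survivors
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones((4, 6))

train_out = dropout(x, rate=0.5, training=True, rng=rng)
eval_out = dropout(x, rate=0.5, training=False, rng=rng)

print(np.unique(train_out))      # values are drawn from {0.0, 2.0}
print(np.array_equal(eval_out, x))  # True
```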

We will be creating the following <b><font color='blue'>Decoder</font></b> class to implement the Transformer decoder:
```python
from positional_encoding import PositionEmbeddingFixedWeights

class Decoder(Layer):
    def __init__(self, vocab_size, sequence_length, h, d_k, d_v, d_model, d_ff, n, rate, **kwargs):
        super(Decoder, self).__init__(**kwargs)
        self.pos_encoding = PositionEmbeddingFixedWeights(sequence_length, vocab_size, d_model)
        self.dropout = Dropout(rate)
        self.decoder_layer = [DecoderLayer(h, d_k, d_v, d_model, d_ff, rate) for _ in range(n)]
        ...
```

<br/>

As in the Transformer encoder, the first multi-head attention block on the decoder side receives the input sequence after it has undergone a process of word embedding and positional encoding. For this purpose, an instance of the <b><font color='blue'>PositionEmbeddingFixedWeights</font></b> class (<font color='brown'>covered in this tutorial</font>) is initialized and assigned to the `pos_encoding` attribute.
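As a reminder of what the positional part of this step computes, the following NumPy sketch builds the sinusoidal encoding matrix described by Vaswani et al. It is an illustration only: the function name `sinusoidal_encoding` is hypothetical, an even `d_model` is assumed, and the internals of `PositionEmbeddingFixedWeights` may differ.

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model, n=10000):
    # One row per position, one column per model dimension (d_model assumed even)
    positions = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = positions / np.power(n, 2 * i / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)  # even indices use sine
    encoding[:, 1::2] = np.cos(angles)  # odd indices use cosine
    return encoding

P = sinusoidal_encoding(seq_len=5, d_model=512)
print(P.shape)  # (5, 512)
```

This matrix is added to the word embeddings of the target sequence, so the result keeps the shape `(batch_size, sequence_length, d_model)`.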

The final step is to create a class method, <font color='blue'>call()</font>, that applies word embedding and positional encoding to the input sequence and feeds the result, together with the encoder output, to the decoder layers:
```python
...
def call(self, output_target, encoder_output, lookahead_mask, padding_mask, training):
    # Generate the positional encoding
    pos_encoding_output = self.pos_encoding(output_target)
    # Expected output shape = (number of sentences, sequence_length, d_model)

    # Add in a dropout layer
    x = self.dropout(pos_encoding_output, training=training)

    # Pass on the positional encoded values to each decoder layer
    for i, layer in enumerate(self.decoder_layer):
        x = layer(x, encoder_output, lookahead_mask, padding_mask, training)

    return x
```

<br/>

The code listing for the full Transformer decoder is the following:

```python
from tensorflow.keras.layers import LayerNormalization, Layer, Dense, ReLU, Dropout
from multihead_attention import MultiHeadAttention
from positional_encoding import PositionEmbeddingFixedWeights
from encoder import AddNormalization, FeedForward

# Implementing the Decoder Layer
class DecoderLayer(Layer):
    def __init__(self, h, d_k, d_v, d_model, d_ff, rate, **kwargs):
        super(DecoderLayer, self).__init__(**kwargs)
        self.multihead_attention1 = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout1 = Dropout(rate)
        self.add_norm1 = AddNormalization()
        self.multihead_attention2 = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout2 = Dropout(rate)
        self.add_norm2 = AddNormalization()
        self.feed_forward = FeedForward(d_ff, d_model)
        self.dropout3 = Dropout(rate)
        self.add_norm3 = AddNormalization()

    def call(self, x, encoder_output, lookahead_mask, padding_mask, training):
        # Multi-head attention layer
        multihead_output1 = self.multihead_attention1(x, x, x, lookahead_mask)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in a dropout layer
        multihead_output1 = self.dropout1(multihead_output1, training=training)

        # Followed by an Add & Norm layer
        addnorm_output1 = self.add_norm1(x, multihead_output1)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Followed by another multi-head attention layer
        multihead_output2 = self.multihead_attention2(addnorm_output1, encoder_output, encoder_output, padding_mask)

        # Add in another dropout layer
        multihead_output2 = self.dropout2(multihead_output2, training=training)

        # Followed by another Add & Norm layer
        addnorm_output2 = self.add_norm2(addnorm_output1, multihead_output2)

        # Followed by a fully connected layer
        feedforward_output = self.feed_forward(addnorm_output2)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in another dropout layer
        feedforward_output = self.dropout3(feedforward_output, training=training)

        # Followed by another Add & Norm layer
        return self.add_norm3(addnorm_output2, feedforward_output)

# Implementing the Decoder
class Decoder(Layer):
    def __init__(self, vocab_size, sequence_length, h, d_k, d_v, d_model, d_ff, n, rate, **kwargs):
        super(Decoder, self).__init__(**kwargs)
        self.pos_encoding = PositionEmbeddingFixedWeights(sequence_length, vocab_size, d_model)
        self.dropout = Dropout(rate)
        self.decoder_layer = [DecoderLayer(h, d_k, d_v, d_model, d_ff, rate) for _ in range(n)]

    def call(self, output_target, encoder_output, lookahead_mask, padding_mask, training):
        # Generate the positional encoding
        pos_encoding_output = self.pos_encoding(output_target)
        # Expected output shape = (number of sentences, sequence_length, d_model)

        # Add in a dropout layer
        x = self.dropout(pos_encoding_output, training=training)

        # Pass on the positional encoded values to each decoder layer
        for i, layer in enumerate(self.decoder_layer):
            x = layer(x, encoder_output, lookahead_mask, padding_mask, training)

        return x
```

<a id='sect3'></a>
## <font color='darkblue'>Testing Out the Code</font>
We will be working with the parameter values specified in the paper, [Attention Is All You Need, by Vaswani et al. (2017)](https://arxiv.org/abs/1706.03762):
```python
h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_ff = 2048  # Dimensionality of the inner fully connected layer
d_model = 512  # Dimensionality of the model sub-layers' outputs
n = 6  # Number of layers in the decoder stack

batch_size = 64  # Batch size from the training process
dropout_rate = 0.1  # Frequency of dropping the input units in the dropout layers
...
```
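These values are related: in the paper, each head works with $d_{k} = d_{v} = d_{model} / h$ dimensions, so concatenating the outputs of all heads restores the model dimensionality. A quick arithmetic check, where `concat_width` is a name introduced here purely for illustration:

```python
h, d_k, d_v, d_model = 8, 64, 64, 512

# Width of the concatenated head outputs before the final linear projection
concat_width = h * d_v

print(d_model // h)             # 64
print(concat_width == d_model)  # True
```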

<br/>

As for the input sequence, we will work with dummy data for the time being, until we arrive at the stage of training the complete Transformer model in a separate tutorial, at which point we will use actual sentences:

```python
...
dec_vocab_size = 20 # Vocabulary size for the decoder
input_seq_length = 5  # Maximum length of the input sequence

input_seq = random.random((batch_size, input_seq_length))
enc_output = random.random((batch_size, input_seq_length, d_model))
...
```

<br/>

Next, we will create a new instance of the <b><font color='blue'>Decoder</font></b> class, assign it to the `decoder` variable, and subsequently pass in the input arguments and print the result. We will set the padding and look-ahead masks to `None` for the time being, but we shall return to these when we implement the complete Transformer model:
```python
...
decoder = Decoder(dec_vocab_size, input_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate)
print(decoder(input_seq, enc_output, None, None, True))
```

<br/>

Tying everything together produces the following code listing:

```python
from numpy import random

dec_vocab_size = 20  # Vocabulary size for the decoder
input_seq_length = 5  # Maximum length of the input sequence
h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_ff = 2048  # Dimensionality of the inner fully connected layer
d_model = 512  # Dimensionality of the model sub-layers' outputs
n = 6  # Number of layers in the decoder stack

batch_size = 64  # Batch size from the training process
dropout_rate = 0.1  # Frequency of dropping the input units in the dropout layers

input_seq = random.random((batch_size, input_seq_length))
enc_output = random.random((batch_size, input_seq_length, d_model))

decoder = Decoder(dec_vocab_size, input_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate)
# Look-ahead and padding masks are both None here; the final argument is the training flag
print(decoder(input_seq, enc_output, None, None, True))
```

```
tf.Tensor(
[[[-0.8248985   0.2964239   0.7608083  ...  0.17464584  0.23421288
    0.26576504]
  [-0.77121806  0.2589691   0.8694546  ...  0.22020744  0.25140268
    0.30782634]
  [-0.7686267   0.14425175  0.8768898  ...  0.21052054  0.27436918
    0.34159267]
  [-0.83794403  0.05393364  0.79679465 ...  0.19523147  0.28727317
    0.3666878 ]
  [-0.9168807   0.09301145  0.7091867  ...  0.19346212  0.26375723
    0.365894  ]]

 [[-0.83108205  0.17180996  0.89795715 ...  0.02750048  0.24393825
   -0.19888619]
  [-0.76181316  0.1321857   1.0282778  ...  0.06248236  0.273029
   -0.15914586]
  [-0.7523107  -0.00998755  1.054501   ...  0.06388298  0.31676665
   -0.11417438]
  [-0.82676876 -0.10075451  0.97691804 ...  0.03817829  0.33013073
   -0.09794228]
  [-0.907696   -0.04062966  0.9192459  ...  0.02174829  0.30881098
   -0.11598752]]

 [[-0.8936668   0.09203106  0.3527665  ... -0.40788966  0.42008328
   -0.29435515]
  [-0.84474474  0.04377091  0.48532057 ... -0.35225812  0.42709574
   -0.2
```

Running this code produces an output of shape `(batch size, sequence length, model dimensionality)`. Note that you will likely see different values due to the random initialization of the input sequence and of the weights of the Dense layers.

## <font color='darkblue'>Further Reading</font>
This section provides more resources on the topic if you are looking to go deeper.