<img src='https://raw.githubusercontent.com/huggingface/transformers/master/docs/source/imgs/transformers_logo_name.png'>
<p style='text-align:center'><b>Author: </b> Tamoghna Saha</p>

# 🤗 Introduction

__🤗 Transformers__ provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet, etc.) for __Natural Language Understanding (NLU)__ and __Natural Language Generation (NLG)__, with over 2,000 pre-trained models in 100+ languages available in __TensorFlow 2.0__ and __PyTorch__. The two frameworks integrate seamlessly, so you can train a model with one and then load it for inference with the other.

🤗 Transformers provides APIs to quickly download and use those pre-trained models on a given text, fine-tune them on your own datasets, and then share them with the community on our [model hub](https://huggingface.co/models).

## Why should I use 🤗 Transformers?

1. _Easy-to-use state-of-the-art models_:
    * High performance on NLU and NLG tasks.
    * Low barrier to entry for educators and practitioners.
    * Few user-facing abstractions with just three classes to learn.
    * A unified API for using all our pre-trained models.
    
2. _Lower compute costs, smaller carbon footprint_:
    * Researchers can share trained models instead of always retraining.
    * Practitioners can reduce compute time and production costs.
    * Dozens of architectures with over 2,000 pre-trained models, some in more than 100 languages.
    
3. _Choose the right framework for every part of a model's lifetime_:
    * Train state-of-the-art models in 3 lines of code.
    * Move a single model between TF2.0/PyTorch frameworks at will.
    * Seamlessly pick the right framework for training, evaluation, production.
    
4. _Easily customize a model or an example to your needs_:
    * Examples for each architecture to reproduce the results by the official authors of said architecture.
    * Expose the models' internals as consistently as possible.
    * Model files can be used independently of the library for quick experiments.

🤗 Transformers provides the following tasks out of the box:

* Sentiment analysis
* Text generation (in English)
* Named entity recognition (NER)
* Question answering
* Filling masked text
* Summarization
* Translation

<details><summary>By translation, I didn't mean this...</summary>
<img src='https://i.pinimg.com/originals/83/42/b6/8342b62b4cdbbb32e05f107348bbc69d.gif'>
So let's help our beloved Joey 🤗 in translation.
</details>

## What was the need for 🤗 Transformers?

__Recurrent neural networks (RNNs)__ can look at previous inputs to predict the next possible word. But an RNN suffers from a __short window of reference__ caused by the vanishing gradient problem, which makes it difficult to capture the context of a story as the story gets longer. This is still true for __Gated Recurrent Unit (GRU)__ and __Long Short-Term Memory (LSTM)__ networks, although they have a bigger capacity for longer-term memory than a plain RNN.

Not only that, __RNNs are slow to train__. Their recurrent, step-by-step processing cannot take full advantage of modern graphics processing units (GPUs), which were designed for parallel computation. What's even worse, __LSTMs are even slower to train__.

The attention mechanism, __in theory__, has an infinite window to reference from, and is therefore capable of using the entire context of the story. In terms of training, Transformers are faster because they can process all positions of a sequence in parallel. Let's find out more!

# 🤗 Transformers Architecture

## High-level look

<img src='./source/the_transformer.png'>
<img src='./source/down_arrow.png' width="100" height="100">
<img src='./source/the_transformer_encoders_decoders.png'>
<img src='./source/down_arrow.png' width="100" height="100">
<img src='./source/the_transformer_encoder_decoder_stack.png'>
<img src='./source/down_arrow.png' width="100" height="100">
<img src='./source/transformer_encoder_decoder_detail.png'>

I believe this looks familiar and "professional" to you!

<img src='./source/transformer_model_architecture.png' width="400" height="400">

The breakdown of this "professional" diagram is like this!

<img src='./source/transformer_breakdown.png'>

## Encoder - in depth!

Now, we will deep dive into the Encoder section. This is the "professional" view.

<img src='./source/transformer_encoder_architecture.png' width=400 height=400>

We pass a sentence as an input (of course), but a machine can only understand 0s and 1s (again, of course).

<details><summary>So we need to translate the words in a sentence into...</summary>
<img src='./source/The-Matrix.jpg' width=200 height=200>
</details>

### Inputs

This model is trained on a corpus of ~30,000 unique words. Each of these words has a unique ID, known as its vocabulary index.
<img src='./source/vocab.png' width=175 height=175>
So, the words in the sentence are transformed into their corresponding vocabulary indices.
<img src='./source/converted_tokens.png'>
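
To make the lookup concrete, here is a minimal sketch of turning words into vocabulary indices. The tiny `vocab` dictionary, the `<unk>` fallback token and the `to_indices` helper are all made up for illustration; a real tokenizer is more involved than splitting on spaces.

```python
# Hypothetical toy vocabulary; a real one has roughly 30,000 entries.
vocab = {"<unk>": 0, "i": 1, "am": 2, "a": 3, "student": 4}


def to_indices(sentence, vocab):
    """Map each whitespace-separated word to its vocabulary index.

    Words missing from the vocabulary fall back to the "<unk>" index.
    """
    return [vocab.get(word, vocab["<unk>"]) for word in sentence.lower().split()]


token_ids = to_indices("I am a student", vocab)
print(token_ids)  # [1, 2, 3, 4]
```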

### Input Embedding

The next step is to convert each input word into its corresponding word embedding. _A word embedding is the vector representation of a word in the vocabulary._
<img src='./source/embedding.png'>
For simplicity of explanation, I used an embedding dimension `d` of 3 here, but in reality it is typically __512, 768__ or even __1024__. A larger dimension gives the model more capacity to represent each word, at the cost of more computation.

Each of these dimensions captures "some" linguistic feature about that word. Since the model decides these features itself during training, it can be non-trivial to find out what exactly each of the dimensions represents.

These vectors are randomly initialized, and IT IS THESE that get tuned during model training and ultimately generate the contextual representation of the words to be leveraged at inference time.
<img src='./source/before-after_embed.png' width=450 height=450>
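
As a rough sketch of what the embedding layer does, the snippet below builds a randomly initialized embedding matrix and looks up one row per token. The sizes (`vocab_size = 5`, `d = 3`) and the `token_ids` are toy values assumed for illustration, not taken from any real model.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
vocab_size, d = 5, 3  # toy sizes; real models use e.g. d = 512

# One randomly initialized row per vocabulary entry; training would tune these.
embedding_matrix = rng.normal(size=(vocab_size, d))

token_ids = np.array([1, 2, 3, 4])             # hypothetical vocabulary indices
word_embeddings = embedding_matrix[token_ids]  # row lookup, one vector per token

print(word_embeddings.shape)  # (4, 3)
```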

### Positional Encoding

In recurrent networks like LSTMs and GRUs, the network processes the input sequentially, token after token. The hidden state at position `t+1` depends on the hidden state from position `t`. This way, the network has a reference to identify the relative positions of each token by accumulating information. However, __a Transformer has no notion of word order. That's why it is faster, but we still need information about each word's position.__

<p style='color:red; text-align: center'><b>Here is why position matters!</b></p>
<img src='./source/order_matters.png' width=450 height=450>

__Hence the requirement of positional encoding.__

<p style='color:blue; text-align: center; font-size: 20px'><b>So how do we do it?</b></p>

> __Strategy 1__: Add a vector of position IDs `(0,1,2,...,(N-1))` to the word vectors.
<img src='./source/pos_encode_strategy_1.png' width=450 height=450>
<details><summary>But there is a problem ...</summary>
    <p style='color:red'>Adding numbers like these will distort the word embedding values, especially for words appearing in the later part of the text.</p>
</details>

> __Strategy 2__: Add a vector of fractional position IDs (`0*1/(N-1)`, `1*1/(N-1)`, `2*1/(N-1)`,...,`(N-1)*1/(N-1)`) to the word vectors.
<img src='./source/pos_encode_strategy_2.png' width=450 height=450>
<details><summary>But there is a problem ...</summary>
    <p style='color:red'>Different sentences have different numbers of words, so the fractional values in these positional vectors will be <b>different for the same position in different sentences</b>. A positional vector needs to be constant for its corresponding position.</p>
</details>
</details>
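
To see the Strategy 2 problem with actual numbers, here is a small sketch. The helper name `fractional_positions` is made up for this illustration; it simply computes `k / (N - 1)` for every position `k`.

```python
def fractional_positions(n_words):
    """Strategy 2: position k is encoded as k / (N - 1)."""
    return [k / (n_words - 1) for k in range(n_words)]


short = fractional_positions(3)   # [0.0, 0.5, 1.0]
longer = fractional_positions(5)  # [0.0, 0.25, 0.5, 0.75, 1.0]

# The second word (index 1) gets a different value in each sentence.
print(short[1], longer[1])  # 0.5 0.25
```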

<b style='color:green'>Implemented Strategy</b>

Hence, the authors propose a cyclic solution where __sine and cosine functions with different frequencies__ are added to each word embedding. The formula goes like this:

$$\begin{gather*}
PE_{(pos, 2i)} = \sin \left({\frac{pos}{10000^{\frac{2i}{d_{model}} } }} \right)
\end{gather*}$$
$$\begin{gather*}
PE_{(pos, 2i+1)} = \cos \left({\frac{pos}{10000^{\frac{2i}{d_{model}} } }} \right)
\end{gather*}$$

Let us try understanding the `sin` part of the formula to compute the position embeddings:

<img src='./source/pos_encode_sin_1.png' width=450 height=450>

Here `pos` refers to the position of the __word__ in the sequence. `P0` refers to the position embedding of the first word, and `d` is the size of the word/token embedding (here, d=5). Finally, `i` refers to each of the 5 individual dimensions of the embedding (i.e. 0, 1, 2, 3, 4).

While `d` is fixed, `pos` and `i` vary. Let us try understanding the latter two.

__`pos`__

<img src='./source/pos_encode_sin_2.png' width=450 height=450>

If we plot a sine curve and vary `pos` (on the x-axis), we land on different position values on the y-axis. Therefore, words at different positions will have different position embedding values.

There is a problem though. Since the `sin` curve repeats at intervals, you can see in the figure above that `P0` and `P6` have the same position embedding value, despite being at two very different (word) positions. This is where the `i` part of the equation comes into play.

__`i`__

<img src='./source/pos_encode_sin_3.png' width=450 height=450>

If you vary `i` in the equation above, you will get a bunch of curves with varying frequencies. Reading off the position embedding values against different frequencies ends up giving different values at different embedding dimensions for `P0` and `P6`.

For every <span style='color:red;'><b>odd index</b></span> on the position vector, we pass the <span style='color:red;'><b>cosine function</b></span> and for every <span style='color:blue;'><b>even index</b></span>, the <span style='color:blue;'><b>sine function</b></span>.

In [1]:
import numpy as np
import plotly.graph_objects as go


def get_angles(pos, i, d_model):
    # Dimensions 2i and 2i+1 share one frequency, hence the integer division i // 2.
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    return pos * angle_rates


def positional_encoding(word_pos, d_model):
    # get the matrix of word pos and angle rate based on index of positional vector
    angle_rads = get_angles(np.arange(word_pos)[:, np.newaxis], 
                            np.arange(d_model)[np.newaxis, :],
                            d_model)

    # apply sin to even indices in the array; 2i
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])

    # apply cos to odd indices in the array; 2i+1
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    
    # final positional encoded vector
    pos_encoding = angle_rads[np.newaxis, ...]

    return pos_encoding

In [5]:
sentence = 'I am a student from Kolkata'
splitted_sentence = sentence.split(' ')
word_pos_list = len(splitted_sentence)
model_dim = 128

position_encoded_vector = positional_encoding(word_pos_list, model_dim)

trace = go.Heatmap(
    z=position_encoded_vector[0], y=splitted_sentence,
    hovertemplate='Position Vector Index (X): %{x}<br>Word Index (Y): %{y}<br>Position Vector Value (Z): %{z}<extra></extra>'
)
data = [trace]

layout = go.Layout(xaxis=go.layout.XAxis(
    title=go.layout.xaxis.Title(
        text='Position Vector',
    )),
yaxis=go.layout.YAxis(
    title=go.layout.yaxis.Title(
        text='Words',
    )
))
fig = go.Figure(data, layout=layout)

fig.show(config= {'displaylogo': False})

<p style='color:purple'><b>Final Step/Summary</b></p>

Finally, add these positional vectors to their corresponding input embedding vectors. This successfully gives the network information on the position of each word.

<img src='./source/transformer_positional_encoding.png' width=500 height=500>
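
The final addition step can be sketched as follows. This is a compact, self-contained version under toy assumptions: the helper `sinusoidal_encoding`, the sizes (`n_words = 6`, `d_model = 8`) and the random `word_embeddings` are all stand-ins for illustration.

```python
import numpy as np


def sinusoidal_encoding(n_positions, d_model):
    """Return an (n_positions, d_model) matrix of sine/cosine position values."""
    pos = np.arange(n_positions)[:, np.newaxis]
    i = np.arange(d_model)[np.newaxis, :]
    # Dimensions 2i and 2i+1 share the same frequency.
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    angles[:, 0::2] = np.sin(angles[:, 0::2])  # even indices: sine
    angles[:, 1::2] = np.cos(angles[:, 1::2])  # odd indices: cosine
    return angles


rng = np.random.default_rng(0)
n_words, d_model = 6, 8
word_embeddings = rng.normal(size=(n_words, d_model))  # stand-in for learned embeddings

# Element-wise sum: the shape stays (n_words, d_model).
positional_embeddings = word_embeddings + sinusoidal_encoding(n_words, d_model)
print(positional_embeddings.shape)  # (6, 8)
```

Because the encoding is added rather than concatenated, the embedding dimension is unchanged and the later layers need no extra handling for position.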

### Multi-Headed Self-Attention Mechanism

#### Self-Attention

Self-attention allows the model to associate each word in the input with the other words.

_Example #1_
<img src='./source/self-attention_visual.png' width=500 height=500>

_Example #2_

Say the following sentence is an input sentence we want to translate:

`The animal didn't cross the street because it was too tired`

What does `it` in this sentence refer to? Is it referring to `street` or `animal`? As the model processes each word, self-attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.

__So how is it working?__
<img src='./source/self_attn_full.png' width=350 height=350>

<details><summary>What motivated this architecture?</summary>
    <p>This design is partially motivated by the way retrieval systems work.</p>
    <img src='./source/q_k.png' width=750 height=750>
    <img src='./source/down_arrow.png' width="100" height="100">
    <img src='./source/q_k_v.png' width=750 height=750>
</details>

__Step 1: 3 Linear Components__

<img src='./source/self_attn_full_1.png' width=350 height=350>

We feed the positional embedding input into __3 distinct linear layers, each consisting of a randomly initialized weight matrix__, to create 3 vectors - <span style='color:purple'><b>Query</b></span>, <span style='color:orange'><b>Key</b></span> and <span style='color:blue'><b>Value</b></span>.

<img src='./source/transformer_self_attention_vectors.png'  width=600 height=600>

Multiplying <span style='color:green'><b>$X_1$</b></span> by the <span style='color:purple'><b>$W^Q$</b></span> weight matrix produces <span style='color:purple'><b>q1</b></span>, the "query" vector associated with that word. Likewise, we end up creating a "key" and a "value" projection of each word in the input sentence.

__NOTE__ : These new vectors are smaller in dimension than the positional embedding vector. We will come back to this.
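
Step 1 can be sketched with plain matrix multiplications. The sizes below (4 words, 512-d input, 64-d projections) follow the numbers used later in this notebook, while the random `X` and weight matrices are stand-ins assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, d_model, d_k = 4, 512, 64

X = rng.normal(size=(n_words, d_model))  # positional embeddings, one row per word

# Three separate, randomly initialized weight matrices (learned during training).
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q  # queries
K = X @ W_K  # keys
V = X @ W_V  # values

print(Q.shape, K.shape, V.shape)  # (4, 64) (4, 64) (4, 64)
```

Each row of `Q`, `K` and `V` is the 64-d projection of one word, which is why these vectors are smaller than the 512-d input.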

__Step 2 : Getting the Attention Weight__

<img src='./source/self_attn_full_2.png' width=350 height=350>

<img src='./source/score_matrix.png'>

Now, the <span style='color:purple'><b>Query</b></span> matrix is multiplied by the __transpose of__ the <span style='color:orange'><b>Key</b></span> matrix to generate a __score matrix__, which determines how much focus each word should put on the other words. __A higher score means more focus. This is how the queries are mapped to the keys.__

<img src='./source/before-after-attention-filter.png'>

Then, the scores are scaled down by dividing them by the __square root of the dimension of the key vector__. This allows for more stable gradients, as multiplying values can have exploding effects. Next, you take the __`softmax`__ of the __scaled scores__ to get the __attention weights (or filters)__, which are probability values between 0 and 1.

By applying a softmax, the higher scores get enhanced and the lower scores get suppressed. This allows the model to be more confident about which words to attend to.
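
Here is a minimal sketch of Step 2, assuming random `Q` and `K` matrices as stand-ins for the real projections. The `softmax` helper is written by hand so that the snippet depends only on NumPy.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


rng = np.random.default_rng(0)
n_words, d_k = 4, 64
Q = rng.normal(size=(n_words, d_k))
K = rng.normal(size=(n_words, d_k))

scores = Q @ K.T                                     # (4, 4) score matrix
scaled_scores = scores / np.sqrt(d_k)                # divide by sqrt of key dimension
attention_weights = softmax(scaled_scores, axis=-1)  # each row sums to 1

print(attention_weights.shape)  # (4, 4)
```

Row `i` of `attention_weights` tells you how much word `i` focuses on every word in the sentence, including itself.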

__Step 3 : Mapping the Attention Weight with Original Matrix__

<img src='./source/self_attn_full_3.png' width=350 height=350>

Then you take the __attention weights__ and __multiply__ them by your <span style='color:blue'><b>Value</b></span> matrix to get an __output matrix__ <span style='color:pink'><b>Z</b></span>. The higher softmax scores keep the values of the words the model has learned are more important, while the lower scores drown out the irrelevant words.

<br>
<details><summary>Why is this multiplication done?</summary>
    <p>The best way to explain the reason for implementing this technique is in the context of computer vision.</p><br><p>Imagine encountering Yahiko from the 6 path of Pain. In reality, the entire view is like this.</p>
    <img src='./source/yahiko_original.png'>
    <p>But you need to focus on Yahiko.</p>
    <img src='./source/yahiko_focused.png'>
    <p>This is achieved using the following way.</p>
    <img src='./source/yahiko_complete_view.png'>
</details>

<p style='color:purple'><b>Final Step/Summary</b></p>

So, this is how self-attention works! The following formula gives you the summary:

<img src='./source/self-attention_matrix_calculation.png'>
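
Putting the three steps together, the whole computation `softmax(Q Kᵀ / sqrt(d_k)) V` fits in a few lines. This is a sketch with random inputs; the function name `scaled_dot_product_attention` is my own choice for this illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Return the output matrix Z and the attention weights."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # Step 2
    Z = weights @ V                                     # Step 3
    return Z, weights


rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 64)) for _ in range(3))
Z, weights = scaled_dot_product_attention(Q, K, V)
print(Z.shape)  # (4, 64)
```

Each row of `Z` is a weighted average of the rows of `V`, with the weights given by how strongly that word attends to every other word.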

#### Multi-Headed

The paper further refined the self-attention layer by adding a mechanism called __multi-headed__ attention.

<img src="./source/transformer_multihead_attn.png" width=250 height=250>

Each self-attention process we learned above is called a __head__. Running several of them in parallel gives us multi-headed attention. In the paper, there are __8 heads__, so the `512 d` input is split into 8 `64 d` vectors. In the case of BERT, there are 12 `64 d` vectors, resulting in `(12*64=)768 d` vectors.

<br>
<details><summary>Why is this technique implemented?</summary>
    <p>In theory, <b>each head would learn something different</b> therefore giving the encoder model <b>more representation power</b>. Another visual example will help.</p><br>You are encountering all the members of the 6 path of pain.
    <img src="./source/6_path_of_pain_original.png">
    <p>Now, we decide to focus on 2 individuals at a time to process the entire scenario, thereby keeping an eye on everyone.</p>
    <div class="row" style="display: flex;">
      <div class="column" style="flex: 33.33%; padding: 5px;">
        <img src="./source/6_path_of_pain_attn_1.png" alt="attn1" style="width:100%">
      </div>
      <div class="column">
        <img src="./source/6_path_of_pain_attn_2.png" alt="attn2" style="width:100%">
      </div>
      <div class="column">
        <img src="./source/6_path_of_pain_attn_3.png" alt="attn3" style="width:100%">
      </div>
    </div>
</details>

In this multi-headed attention computation, each head has its own <span style='color:purple'><b>Query</b></span>, <span style='color:orange'><b>Key</b></span> and <span style='color:blue'><b>Value</b></span> weight matrices, which are _randomly initialized and independent of one another_. Each set projects the positional embeddings into a different representation subspace.

<img src='./source/transformer_attention_heads_qkv.png'  width=750 height=750>

Now, if we perform the same self-attention calculation as outlined in the previous section, 8 different times with different weight matrices, we end up with 8 different <span style='color:pink'><b>Z</b></span> matrices.

<img src='./source/transformer_attention_heads_z.png' width=750 height=750>

An example to clearly understand it.

<img src='./source/encoder-self-attention-example.png'>

However, this leaves us with a bit of a challenge. The upcoming feed-forward neural network (FFNN) layer is not expecting 8 matrices - __it’s expecting a single matrix (a vector for each word)__. Hence, we __concatenate the matrices__, then pass the result through another __linear layer__ (again consisting of a weight matrix) <span style='color:pink'><b>$W^O$</b></span> to get the original vector dimension back (for example, 512).

<img src='./source/transformer_attention_heads_weight_matrix_o.png' width=750 height=750>
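
The whole multi-headed computation can be sketched as a loop over heads followed by a concatenation and one last projection. The weights here are random stand-ins (scaled by an arbitrary 0.02 to keep the numbers small), and the helper names are assumptions for this illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(Q, K, V):
    """Scaled dot-product attention for a single head."""
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V


rng = np.random.default_rng(0)
n_words, d_model, n_heads = 4, 512, 8
d_k = d_model // n_heads  # 64 dimensions per head

X = rng.normal(size=(n_words, d_model))  # positional embeddings

heads = []
for _ in range(n_heads):
    # Every head gets its own Query, Key and Value weight matrices.
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) * 0.02 for _ in range(3))
    heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))  # one Z matrix, (4, 64)

Z_concat = np.concatenate(heads, axis=-1)                # (4, 512)
W_O = rng.normal(size=(n_heads * d_k, d_model)) * 0.02   # output projection
Z = Z_concat @ W_O                                       # back to (4, 512)

print(Z_concat.shape, Z.shape)  # (4, 512) (4, 512)
```

Real implementations usually compute all heads in one batched matrix multiplication instead of a Python loop; the loop is used here only because it mirrors the explanation.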

<p style='color:purple'><b>Recap</b></p>

A quick recap of the operations and steps performed in multi-headed self-attention mechanism.

<img src='./source/transformer_multi-headed_self-attention-recap.png'>

### Residual Connections and Layer Normalization

The multi-headed attention output vector is added to the original positional input embedding. This is called a __residual connection__. Its purpose is to __preserve the original context__ and to give the gradients a direct path through the network, thereby _tackling the Vanishing Gradient problem_.

The output of the residual connection goes through a __layer normalization__. This is placed after each sub-layer (self-attention, FFNN) for each encoder.

<img src='./source/transformer_resideual_layer_norm.png' width=600 height=600>

Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons using __batch normalization__. However, the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks. So, researchers __transposed batch normalization into layer normalization__.

Just to have a better understanding of this, take a look at this visual.

<img src='./source/diff_batch_layer_norm.png' width=650 height=650>

In batch normalization, the statistics are computed across the batch for each feature. In contrast, in layer normalization, the statistics are computed across the features of each example and are independent of the other examples.
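
The difference boils down to the axis over which the mean and variance are taken. The sketch below assumes a toy `(examples, features)` matrix and leaves out the learnable gain and bias that a real normalization layer has. It also shows the residual connection followed by layer normalization.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each row (one example or word vector) over its features."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def batch_norm(x, eps=1e-5):
    """Normalize each feature (column) over the batch dimension."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 6))  # (examples, features)
sublayer_output = rng.normal(size=(4, 6))        # stand-in for attention/FFNN output

# Residual connection first, then layer normalization.
normalized = layer_norm(x + sublayer_output)

print(normalized.mean(axis=1))     # one value per example, each very close to 0
print(batch_norm(x).mean(axis=0))  # one value per feature, each very close to 0
```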

### Feed-Forward Neural Network

The penultimate layer in the block is a position-wise feed-forward network. The same network is applied separately to each word vector in the sentence (up to the capped sentence length), so each position is transformed independently of every other position. This network consists of __2 linear layers *(2 1D convolutions with kernel size 1)* with a ReLU activation in between__.

The output of this network then goes through a residual connection followed by layer normalization.

__But, WHY do we need this layer?__

Its __main purpose__ is to process the output from one attention layer in a way that _better fits_ the input of the next attention layer.

Layers of this kind usually appear near the end of a network.

After the attention layer, the latent representation of each word contains information from other words. However, we want to consolidate a __unique representation__ for each word. This is done using a localized layer, which does not consider neighbors or other positions, and simply transforms the local representation on its own.
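
As a sketch, the feed-forward block is just two matrix multiplications with a ReLU in between, applied to every row. The inner size `d_ff = 2048` is the value reported in the paper; the random weights and the 0.02 scaling are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, d_model, d_ff = 4, 512, 2048

# Randomly initialized stand-ins for the learned parameters.
W1 = rng.normal(size=(d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)


def feed_forward(x):
    """Two linear layers with a ReLU in between, applied row by row."""
    hidden = np.maximum(0, x @ W1 + b1)  # ReLU activation
    return hidden @ W2 + b2


X = rng.normal(size=(n_words, d_model))  # output of the attention sub-layer
out = feed_forward(X)
print(out.shape)  # (4, 512)

# Each position is transformed on its own: feeding a single word alone
# gives the same result as feeding it as part of the sentence.
print(np.allclose(feed_forward(X[1:2]), out[1:2], atol=1e-10))  # True
```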

<span style='color:brown; font-size:150%;'><b>Encoder - Wrap up!</b></span>

That wraps up the encoder layer. All of these operations are to encode the input to a continuous representation with attention information. This will help the decoder focus on the appropriate words in the input during the decoding process. You can __stack the encoder `N times`__ to further encode the information, where each layer has the opportunity to learn different attention representations therefore potentially boosting the predictive power of the transformer network.

Based on this, <span style='color:gold; background-color:black;'><b>BERT</b></span> came into the picture. 

<br>
<details><summary>Not this BERT actually...</summary>
<img src='./source/bert_bert.jpg' width=500>
</details>

A simple example of BERT implementation is done in the image below:

<img src='./source/BERT-classification-spam.png'>

In July 2020, __Google__ researchers published <span style='color:gold; background-color:black;'><b>BigBird</b></span> (again inspired by __Sesame Street__).

<img src='./source/bigbird_architecture.png' width=750>

This Transformer-based model is designed to handle longer input sequences. It incorporates a __sparse attention mechanism__ which enables it to process sequences up to __8 times longer__ than what was possible with BERT. Using this, the researchers reduced the attention complexity from $O(n^2)$ (in BERT) to just $O(n)$ in the sequence length $n$.

Link to the paper can be found [here](https://arxiv.org/pdf/2007.14062.pdf).

## Decoder - in depth!

Now, we will deep dive into the Decoder section. This is the "professional" view.

<img src='./source/transformer_decoder_architecture.png' width=250 height=250>

The decoder's job is to generate text sequences. The decoder has sub-layers similar to the encoder's: it has 2 multi-headed attention layers, a pointwise feed-forward layer, residual connections, and layer normalization. These sub-layers behave similarly to the layers in the encoder but __each multi-headed attention layer has a different job__. The decoder is capped off with a linear layer that acts as a classifier, and a softmax to get the word probabilities.

__The decoder is autoregressive__. This is how it operates:
* It begins with a special token `<start>`.
* The vector for this token is processed together with the __encoder outputs that contain the attention information__ to generate a possible word.
* Then it takes the __previous output(s) as input(s)__, again together with the _encoder outputs_.
* Then it generates the next possible word, and this process goes on.
* The decoder stops decoding when it generates `<eos>` (short for *end-of-sentence*) token as an output.
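
The loop described above can be sketched like this. Both `greedy_decode` and `toy_next_token` are made-up names: the latter is a stand-in for a trained decoder and simply replays a fixed sentence, so only the control flow is meaningful here.

```python
def greedy_decode(next_token_fn, encoder_output, max_len=10):
    """Generate tokens one at a time until <eos> (or max_len) is reached."""
    output = ["<start>"]
    while len(output) < max_len:
        # The decoder sees the encoder output plus everything generated so far.
        next_token = next_token_fn(encoder_output, output)
        output.append(next_token)
        if next_token == "<eos>":
            break
    return output


def toy_next_token(encoder_output, generated_so_far):
    """Hypothetical stand-in for a trained decoder: replays a fixed sentence."""
    target = ["I", "am", "fine", "<eos>"]
    return target[len(generated_so_far) - 1]


decoded = greedy_decode(toy_next_token, encoder_output=None)
print(decoded)  # ['<start>', 'I', 'am', 'fine', '<eos>']
```

The `max_len` cap is a safety net I added so the loop always terminates, even if the model never produces `<eos>`.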

### Output Embedding and Positional Encoding

The beginning of the decoder is pretty much the same as the encoder. The input goes through an embedding layer and positional encoding layer to get positional embeddings.

### First Multi-Headed Self Attention Mechanism

This multi-headed attention layer operates slightly differently from the encoder one. Since the decoder is __autoregressive__ and generates the sequence word by word, one needs to __prevent it from conditioning on future tokens__. For example, when computing attention scores on the word "am", one should not have access to the word "fine", because that word is a future word that is generated afterwards. The word "am" should only have access to itself and the words before it.

__This is true for all other words, where they can only attend to previous words.__

So, when Ross says...
<img src='./source/ross_is_fine.jpg' width=300 height=300>
... he IS fine!

<img src='./source/decoder_first_attention_1.png'>

__*So, how do we prevent computing attention scores for future words?*__

<img src='./source/transformer_self_attn.png' width=300 height=300>

This is done using a __look-ahead mask__. The mask is a matrix of the same size as the attention scores, filled with __0’s and negative infinities__. When you add the mask to the __scaled attention scores__, you get a matrix of scores whose top-right triangle is filled with negative infinities.

<img src='./source/decoder_first_attention_2.png'>

Once you take the __`softmax`__ of the masked scores, the negative infinities become 0, leaving zero attention weight for future tokens.

<img src='./source/decoder_first_attention_3.png'>
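
Here is a small sketch of the masking trick, with random numbers standing in for the scaled attention scores of a 4-word sequence.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


n = 4
rng = np.random.default_rng(0)
scaled_scores = rng.normal(size=(n, n))  # stand-in for Q @ K.T / sqrt(d_k)

# True strictly above the diagonal, i.e. at the "future" positions.
future = np.triu(np.ones((n, n), dtype=bool), k=1)
look_ahead_mask = np.where(future, -np.inf, 0.0)

masked_scores = scaled_scores + look_ahead_mask
attention_weights = softmax(masked_scores, axis=-1)

print(np.round(attention_weights, 2))
```

The printed matrix is lower triangular: the first word can only attend to itself, the second word to the first two, and so on, while every row still sums to 1.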

This component also has __multiple heads__; the mask is applied in each of them, and the outputs are then concatenated. Again, an example for clear understanding.

<img src='./source/decoder-self-attention-example.png'>

Then, similar to the encoder, the model employs residual connections followed by layer normalization.

### Second Multi-Headed Attention Mechanism

For this layer, the inputs are:
* <span style='color:purple'><b>Query</b></span> - output of the masked multi-headed attention layer of decoder
* <span style='color:orange'><b>Key</b></span> - Encoder's output
* <span style='color:blue'><b>Value</b></span> - Encoder's output

This process matches the encoder's output to the decoder's output, allowing the decoder to decide __which encoder section is relevant to put a focus on__. In other words, the decoder predicts the next word by looking at the encoder output and self-attending to its own output.
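
The only change from self-attention is where the three inputs come from, which a short sketch makes visible. The sequence lengths (5 source words, 3 target words) and the random matrices are assumptions for illustration, and the learned Query/Key/Value projections are left out to keep it short.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


rng = np.random.default_rng(0)
n_source, n_target, d_k = 5, 3, 64

encoder_output = rng.normal(size=(n_source, d_k))  # supplies Key and Value
decoder_hidden = rng.normal(size=(n_target, d_k))  # masked self-attention output, supplies Query

Q, K, V = decoder_hidden, encoder_output, encoder_output

weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # (3, 5): target word x source word
context = weights @ V                               # (3, 64)

print(weights.shape, context.shape)  # (3, 5) (3, 64)
```

Row `t` of `weights` shows how much the `t`-th target position focuses on each source word, which is exactly the "which encoder section is relevant" decision.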

Hence, this layer is also called __encoder-decoder attention__ or __source-target attention__. The following picture will help you to understand this.

<img src='./source/self_attn-enc_dec_attn-difference.png' width=750 height=750>

An example to understand it better.

<img src='./source/encoder-decoder-self-attention-example.png'>

Then, again, the model performs residual connection followed by layer normalization.

### Feed-Forward Neural Network

Just like in the encoder, the output of the encoder-decoder attention is fed to an FFNN, which processes it into a form suitable for the final layer.

### Linear Classifier and Softmax Function

The decoder stack outputs a vector of floats. How do we turn that into a word? That is the job of the final __Linear layer__ which is followed by a __`softmax` layer__.

* The __Linear layer__ is a simple fully connected neural network that projects the vector produced by the stack of decoders into a much, much larger vector called a __logits vector__. So this layer is basically a __classifier__. The classifier is as big as the number of classes you have. In this example, the layer has ~30,000 classes for ~30,000 words. This makes the logits vector ~30,000 cells wide, with each cell corresponding to the score of a unique word.
* The __`softmax` layer__ then turns those scores into probabilities _(all positive, all add up to 1.0)_. The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.
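
These two layers can be sketched as follows, assuming a toy 5-word vocabulary and random weights in place of the trained classifier.

```python
import numpy as np

vocab = ["<eos>", "I", "am", "a", "student"]  # hypothetical toy vocabulary
rng = np.random.default_rng(0)
d_model = 8

decoder_output = rng.normal(size=d_model)   # vector for the current time step
W = rng.normal(size=(d_model, len(vocab)))  # Linear layer weights (random stand-in)
b = np.zeros(len(vocab))

logits = decoder_output @ W + b       # one score per word in the vocabulary
probs = np.exp(logits - logits.max())
probs /= probs.sum()                  # softmax: all positive, sums to 1.0

predicted_word = vocab[int(np.argmax(probs))]  # pick the most probable word
print(predicted_word, round(float(probs.sum()), 6))
```

With a real ~30,000-word vocabulary only the width of `W` changes; the steps stay the same.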

<img src='./source/transformer_decoder_output_softmax.png' width=600 height=600>

### Optimizer and Loss Function

The authors of the paper used the __Adam__ optimizer with a __custom learning rate__ that varied over the course of training. This is achieved using the formula:

<img src="./source/optimizer.png" width=500>

where `warmup_steps` = 4000.
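
In code, the schedule from the paper, `d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)`, can be sketched like this. The function name is my own, and `d_model = 512` is assumed as the default.

```python
def transformer_learning_rate(step, d_model=512, warmup_steps=4000):
    """Learning rate that grows linearly during warm-up, then decays as 1/sqrt(step)."""
    step = max(step, 1)  # avoid dividing by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


for step in (1, 1000, 4000, 16000):
    print(step, transformer_learning_rate(step))
```

The rate peaks at `step == warmup_steps`, where both terms inside `min` are equal, and falls off on either side.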

As for the loss function, the paper uses __Categorical Cross Entropy__.

<img src="./source/transformer_logits_output_and_label.png" width=500>
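
A minimal sketch of categorical cross entropy for a single output position is shown below. The 5-word probability vectors are invented for illustration, and the small `eps` is there only to avoid `log(0)`.

```python
import numpy as np


def categorical_cross_entropy(target_one_hot, predicted_probs, eps=1e-12):
    """Negative log-probability that the model assigns to the correct word."""
    return float(-np.sum(target_one_hot * np.log(predicted_probs + eps)))


target = np.array([0, 0, 1, 0, 0])  # the correct word is at index 2

confident = np.array([0.01, 0.02, 0.93, 0.02, 0.02])  # mostly right
untrained = np.array([0.2, 0.2, 0.2, 0.2, 0.2])       # uniform guess

loss_confident = categorical_cross_entropy(target, confident)
loss_untrained = categorical_cross_entropy(target, untrained)
print(loss_confident < loss_untrained)  # True
```

The loss shrinks as the predicted distribution puts more probability on the correct word, which is what training pushes the model towards.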


### Final view of the decoder output

If we go back to the translation example, the output from the decoder will be as follows:

__Ground Truth__
<img src="./source/output_target_probability_distributions.png" width=500>

__Predicted Answer__
<img src="./source/output_trained_model_probability_distributions.png" width=500>

<span style='color:brown; font-size:150%;'><b>Decoder - Wrap up!</b></span>

That wraps up the decoder layer. Now, the decoder will be able to map the relevant information with the encoder output, capturing the context and generating the result. You can __stack the decoder `N times`__ just like it was done in encoder to further process and decode the information, where each layer has the opportunity to learn different attention representations therefore potentially boosting the predictive power of the transformer network.

__The OpenAI GPT-2 model uses these decoder-only blocks.__ Here is a sample output of GPT-2.

<img src='./source/gpt-2-autoregression-2.gif'>

# 🤗 Transformers - Wrap up!

So, we covered how each component in the encoder and decoder works individually. Let's take a look at how they work together.

For simplicity, we take a stack of 2 encoders and 2 decoders performing French-to-English translation.

<img src='./source/transformer_decoding_1.gif'>

After finishing the encoding phase, we begin the decoding phase. Each step in the decoding phase outputs an element from the output sequence.

The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did. And just like we did with the encoder inputs, we embed and add positional encoding to those decoder inputs to indicate the position of each word.

This step is repeated until the model outputs the `<eos>` token, indicating the end of the process (here, translation).

<img src='./source/transformer_decoding_2.gif'>

That's it! These are the entire mechanics of the Transformer.

Now, it should be easier for you to go through the original paper [Attention is all you need](https://arxiv.org/pdf/1706.03762.pdf).

Take a look at how [TensorFlow](https://www.tensorflow.org/tutorials/text/transformer) has implemented it, with code snippets, for an even more detailed understanding of the model.

<br>
<details><summary>But for Joey... </summary>
    <img src='./source/joey_french.png'>
    <p>Looks like we can train a model to translate but Joey is impossible. So all we can say is...</p>
    <img src='./source/good-job-joey.jpg' width=500>
    <div style="text-align: center;">
        <span style='color:blue; font-size:200%;'><br><b>La Fin.</b></span>
    </div>
</details>