# 04. ENCODER

1. Seq2seq
2. Transformer
3. Encoder
4. Self-attention
5. FFN
6. Embeddings and Softmax
7. Positional Encodings
8. Batches and Paddings
9. Exercise
10. References

# 1. Seq2seq

Seq2seq is a family of approaches to sequence transformation problems, which arise, e.g., in NLP, ASR, and ML4SE.

Currently, the most common approaches are Transformer-based: encoder-decoder, decoder-only, or encoder-only models.

An encoder-decoder model consists of:
- an encoder and
- a decoder.

The *encoder* maps an input sequence $(x_1, ..., x_n)$ to a sequence of continuous representations $z = (z_1, ..., z_n)$.

The *decoder* maps $z$ to an output sequence $(y_1, ..., y_m)$ one element at a time.

The model is *autoregressive*: at each step, it receives the previously generated elements (tokens) as additional input when generating the next one.
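The autoregressive loop can be sketched with a toy example. This is only an illustration, not a real model: `greedy_decode`, `toy_step`, and the token ids are hypothetical names and values, and the encoder output $z$ is omitted to keep the sketch short. A real decoder would also receive $z$ at every step.

```python
import torch


def greedy_decode(step_fn, bos_id, eos_id, max_len=20):
    """Generate tokens one at a time, feeding the prefix back in at each step."""
    tokens = [bos_id]
    for _ in range(max_len):
        logits = step_fn(torch.tensor(tokens))   # scores for the next token
        next_id = int(logits.argmax())           # pick the highest-scoring token
        tokens.append(next_id)
        if next_id == eos_id:                    # stop at the end-of-sequence token
            break
    return tokens


def toy_step(tokens, vocab_size=6):
    """Hypothetical stand-in for a decoder: always predicts (last token + 1)."""
    logits = torch.zeros(vocab_size)
    logits[(int(tokens[-1]) + 1) % vocab_size] = 1.0
    return logits


decoded = greedy_decode(toy_step, bos_id=0, eos_id=4)
print(decoded)   # [0, 1, 2, 3, 4]
```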

```python
from IPython.display import IFrame

IFrame(src='http://jalammar.github.io/images/seq2seq_3.mp4', width=800, height=None)
```

# 2. Transformer

Earlier encoder-decoder architectures were based on [RNNs](https://pytorch.org/tutorials/intermediate/char_rnn_generation_tutorial.html). In 2017, the Transformer architecture was proposed in [Vaswani et al - Attention Is All You Need](https://arxiv.org/abs/1706.03762) (Google Brain).

![paper](res/04_attention_is_all_you_need.png)

The Transformer consists of an encoder and a decoder, each of which is a stack of layers (image source: lena-voita.github.io):

![](https://lena-voita.github.io/resources/lectures/seq2seq/transformer/model-min.png)

#### More
- [Vaswani et al - Attention Is All You Need](https://arxiv.org/abs/1706.03762)
- [Jay Alammar - The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
- [Huang et al - The Annotated Transformer](http://nlp.seas.harvard.edu/annotated-transformer/)
- [Lena Voita - Seq2seq and Attention](https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html)
- [Brandon Rohrer - Transformers from Scratch](https://e2eml.school/transformers.html)
- [Peter Bloem - Transformers from Scratch](https://peterbloem.nl/blog/transformers)

# 3. Encoder

The encoder consists of $N$ (e.g., $N = 6$) identical consecutive layers.
Each layer has two sublayers:
- a multi-head self-attention mechanism, and
- a position-wise fully connected feed-forward network.

![](http://jalammar.github.io/images/t/Transformer_encoder.png)

The input sequence is tokenized. Each token corresponds to a vector (embedding).

![](http://jalammar.github.io/images/t/encoder_with_tensors_2.png)

In addition, the architecture uses a *residual connection* for each of the two sublayers, followed by [layer normalization](https://arxiv.org/abs/1607.06450).

More formally, the output of each sublayer is `LayerNorm(x + Sublayer(x))`, where `Sublayer(x)` is a function implemented by the sublayer itself.
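The formula `LayerNorm(x + Sublayer(x))` can be sketched as a small PyTorch module. This is a minimal illustration under stated assumptions: the class name `SublayerConnection` is made up for this example, a plain linear layer stands in for the real sublayer, and the dropout on the sublayer output follows the original paper rather than the text above.

```python
import torch
import torch.nn as nn


class SublayerConnection(nn.Module):
    """Residual connection followed by layer normalization (Post-LN)."""

    def __init__(self, d_model: int, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)   # assumption: dropout as in the paper

    def forward(self, x, sublayer):
        # sublayer is any callable mapping (..., d_model) -> (..., d_model)
        return self.norm(x + self.dropout(sublayer(x)))


torch.manual_seed(0)
block = SublayerConnection(d_model=512, dropout=0.0)
x = torch.randn(2, 10, 512)             # (batch, sequence length, d_model)
y = block(x, nn.Linear(512, 512))       # a linear layer stands in for the sublayer
print(y.shape)                          # torch.Size([2, 10, 512])
```

Because the sublayer output is added to `x`, the sublayer must preserve the last dimension, which is why all sublayers share the same size.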

![](https://www.researchgate.net/publication/334288604/figure/fig1/AS:778232232148992@1562556431066/The-Transformer-encoder-structure_W640.jpg)

All sublayers in the model, as well as the embedding layers, produce outputs of the same dimension $d_{model}$ (e.g. $d_{model} = 512$).
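To check these shapes quickly, one option is PyTorch's built-in `nn.TransformerEncoderLayer` and `nn.TransformerEncoder`. The sketch below only illustrates the hyperparameters mentioned above ($N = 6$, $d_{model} = 512$). The values `nhead=8` and `dim_feedforward=2048` are taken from the examples later in this notebook, and the input is random data rather than real embeddings.

```python
import torch
import torch.nn as nn

# One encoder layer: multi-head self-attention + position-wise FFN,
# each wrapped in a residual connection and layer normalization.
layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8, dim_feedforward=2048, batch_first=True
)
encoder = nn.TransformerEncoder(layer, num_layers=6)   # N = 6 identical layers
encoder.eval()                                         # turn off dropout

x = torch.randn(2, 10, 512)      # (batch, sequence length, d_model)
with torch.no_grad():
    z = encoder(x)
print(z.shape)                   # torch.Size([2, 10, 512])
```

The output has the same shape as the input: one $d_{model}$-dimensional vector per input position.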

#### More
- [Ulf Mertens - The Transformer encoder](https://github.com/mertensu/transformer-tutorial/blob/master/transformer_encoder.ipynb)

# 4. Self-attention

Attention can be described as a function mapping a *query* and a set of *key-value* pairs to some output.

![](http://jalammar.github.io/images/t/transformer_self_attention_vectors.png)

The query, keys, values, and output are vectors.
The output vector is computed as a weighted sum of the values, where the weight assigned to each value is computed from the query and the corresponding key.

![](https://jalammar.github.io/images/t/self-attention-output.png)

In practice, the calculations are done in matrix form (simultaneously for a set of queries, packed together into a matrix $Q$). The keys and values are also packed together into matrices $K$ and $V$.

Thus,
$$Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Dividing by $\sqrt{d_k}$ keeps the dot products from growing with the dimension $d_k$.
Without the scaling, large dot products push the softmax function into regions where its gradients are very small, which slows down training.

In matrix notation:

![](http://jalammar.github.io/images/t/self-attention-matrix-calculation-2.png)
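The attention formula translates almost line by line into PyTorch. The following is a minimal sketch, not a reference implementation: the function name, the optional `mask` argument, and the tensor sizes in the usage example are assumptions made for illustration.

```python
import math
import torch


def scaled_dot_product_attention(q, k, v, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    q: (..., n_q, d_k), k: (..., n_k, d_k), v: (..., n_k, d_v).
    mask: optional boolean tensor broadcastable to (..., n_q, n_k);
          False marks query-key pairs that must not be attended to.
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (..., n_q, n_k)
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ v, weights


torch.manual_seed(0)
q = torch.randn(2, 5, 64)    # 5 queries of dimension d_k = 64
k = torch.randn(2, 7, 64)    # 7 keys
v = torch.randn(2, 7, 32)    # 7 values of dimension d_v = 32
out, weights = scaled_dot_product_attention(q, k, v)
print(out.shape, weights.shape)   # torch.Size([2, 5, 32]) torch.Size([2, 5, 7])
```

Each output row is a weighted sum of the value vectors, with one weight per key.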

#### Multi-Head Attention

Multi-head attention runs several attention mechanisms (heads) in parallel, each of which can learn to focus on different features.
This lets the model collect information from different representation subspaces.

![](https://i.ytimg.com/vi/mMa2PmYJlCo/mqdefault.jpg)

Using learned linear transformations (projections), we obtain $h$ query-key-value triplets (of dimensions $d_k$, $d_k$, and $d_v$, respectively).

The attention mechanism is then applied to each of these triplets in parallel, yielding output values of dimension $d_v$.
These output values are concatenated and projected once more.

![](https://repository-images.githubusercontent.com/283979760/0f00ed80-d368-11ea-979d-78033d0a1cee)

More formally:

$$MultiHead(Q, K, V ) = concat(head_1, ..., head_h)W^O$$

where $head_i = Attention(QW^Q_i, KW^K_i, VW^V_i)$ and the following projection matrices are used:

$W^Q_i \in \mathbb{R}^{d_{model} \times d_k}$,

$W^K_i \in \mathbb{R}^{d_{model} \times d_k}$,

$W^V_i \in \mathbb{R}^{d_{model} \times d_v}$,

$W^O \in \mathbb{R}^{hd_v \times d_{model}}$.

The dimensions can be, for example, $h = 8$, $d_k = d_v = \frac{d_{model}}{h} = 64$.
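A compact way to implement this is to stack the $h$ per-head projection matrices into one linear layer and then split the result into heads. The sketch below assumes $d_k = d_v = d_{model} / h$ and omits biases, masking, and dropout. The class and attribute names are chosen for this example only.

```python
import math
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int = 512, h: int = 8):
        super().__init__()
        assert d_model % h == 0, "d_model must be divisible by h"
        self.h = h
        self.d_k = d_model // h                  # here d_k = d_v = d_model / h
        # Each layer stacks the h per-head matrices W_i^Q (W_i^K, W_i^V).
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)   # W^O

    def _split_heads(self, x):
        # (batch, n, d_model) -> (batch, h, n, d_k)
        b, n, _ = x.shape
        return x.view(b, n, self.h, self.d_k).transpose(1, 2)

    def forward(self, q, k, v):
        q = self._split_heads(self.w_q(q))
        k = self._split_heads(self.w_k(k))
        v = self._split_heads(self.w_v(v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        heads = torch.softmax(scores, dim=-1) @ v            # (batch, h, n_q, d_k)
        b, _, n_q, _ = heads.shape
        # Concatenate the heads: (batch, n_q, h * d_k), then project with W^O.
        concat = heads.transpose(1, 2).reshape(b, n_q, self.h * self.d_k)
        return self.w_o(concat)


torch.manual_seed(0)
mha = MultiHeadAttention(d_model=512, h=8)
x = torch.randn(2, 10, 512)
out = mha(x, x, x)       # self-attention: queries, keys, values from the same input
print(out.shape)         # torch.Size([2, 10, 512])
```

With $d_k = d_{model} / h$, the total cost is close to that of a single attention head of full dimension.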

The transformer uses multi-head attention in three different ways:

1. In the encoder-decoder attention layers, the *queries* come from the previous decoder layer, and the *keys* and *values* come from the encoder output. This allows each position in the decoder to attend to all positions in the input sequence.

2. The encoder contains self-attention layers.
Here, all *keys*, *values*, and *queries* come from the same place: the output of the previous encoder layer. Each position in the encoder can attend to all positions in the previous encoder layer.

3. The decoder contains self-attention layers.
Here, each position in the decoder is allowed to attend to all positions to its left, up to and including the current position.
This restriction preserves the autoregressive property of the decoder (the next token is predicted from the previous tokens only).
The constraint is implemented inside scaled dot-product attention by masking out (setting to $-\infty$) all values at the softmax input that correspond to disallowed connections.
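The masking in decoder self-attention can be illustrated on a tiny example. The sketch below uses all-zero scores as a stand-in for real attention scores (an assumption for illustration), so every allowed position gets the same weight.

```python
import torch

n = 4
scores = torch.zeros(n, n)     # stand-in attention scores (rows: queries, columns: keys)

# True on and below the diagonal: position i may attend to positions 0..i.
allowed = torch.tril(torch.ones(n, n, dtype=torch.bool))

masked = scores.masked_fill(~allowed, float("-inf"))
weights = torch.softmax(masked, dim=-1)
print(weights)
# Row 0 is [1, 0, 0, 0], row 1 is [0.5, 0.5, 0, 0], ..., row 3 is [0.25, 0.25, 0.25, 0.25]
```

After softmax, entries set to $-\infty$ become exactly zero, so no information flows from future positions.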

# 5. FFN

Each encoder and decoder layer contains a feed-forward network (FFN). This sublayer is applied to each position separately and identically.

![](https://jalammar.github.io/images/t/encoder_with_tensors_2.png)

It consists of two linear transforms with a [ReLU](https://arxiv.org/abs/1803.08375) activation between them.

$$FFN(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

Although the linear transforms are the same for different positions,
they use different parameters from layer to layer.

The inner layer has a higher dimensionality.
For example, the input and output dimensionality is $d_{model} = 512$, the inner layer dimensionality is $d_{ff} = 2048$.

In particular, in the Transformer the token at each position follows its own path through the encoder.
These paths depend on each other in the self-attention sublayers, but not in the FFN sublayers.
This allows the FFN to process all positions in parallel.
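The FFN formula and its position-wise behavior can be checked with a short sketch. The class name `PositionwiseFFN` is chosen for this example, and dropout is left out for brevity.

```python
import torch
import torch.nn as nn


class PositionwiseFFN(nn.Module):
    """FFN(x) = max(0, x W_1 + b_1) W_2 + b_2, applied to every position."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)    # W_1, b_1
        self.linear2 = nn.Linear(d_ff, d_model)    # W_2, b_2

    def forward(self, x):
        return self.linear2(torch.relu(self.linear1(x)))


torch.manual_seed(0)
ffn = PositionwiseFFN()
x = torch.randn(2, 10, 512)          # (batch, sequence length, d_model)
y = ffn(x)
print(y.shape)                       # torch.Size([2, 10, 512])

# Position-wise: feeding a single position alone gives the same result
# as that position's output within the full sequence.
y_single = ffn(x[:, 3:4, :])
same = torch.allclose(y[:, 3:4, :], y_single, atol=1e-4)
```

Since `nn.Linear` acts only on the last dimension, no explicit loop over positions is needed.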

# 6. Embeddings and softmax

To transform input/output tokens into vectors (of dimension $d_{model}$), learnable embeddings are used.

To transform outputs into predicted probabilities of the next token, a learnable linear transformation and softmax are used.

In a bit more detail, the output vector is passed through a linear layer (a fully connected layer), followed by softmax.

The linear layer projects the output into a much larger vector (a logits vector).
The size of this vector is the size of the vocabulary.

Then the softmax layer turns the coordinates into probabilities.
With greedy decoding, the most probable token is selected.

![](./res/04_transformer_decoder_output_softmax.png)
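Both ends of the model can be sketched in a few lines: an embedding lookup at the input, and a linear layer with softmax at the output. The vocabulary size, the token ids, and the random vector standing in for the decoder output are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, vocab_size = 512, 1000            # vocab_size is an illustrative value

# Input side: token ids -> vectors of dimension d_model.
embedding = nn.Embedding(vocab_size, d_model)
token_ids = torch.tensor([[5, 42, 7]])     # (batch, sequence length), made-up ids
x = embedding(token_ids)
print(x.shape)                             # torch.Size([1, 3, 512])

# Output side: vector of dimension d_model -> probabilities over the vocabulary.
to_logits = nn.Linear(d_model, vocab_size)
h = torch.randn(1, d_model)                # stand-in for the decoder output at one position
logits = to_logits(h)                      # (1, vocab_size)
probs = torch.softmax(logits, dim=-1)      # non-negative, sums to 1
next_token = probs.argmax(dim=-1)          # greedy choice of the next token
```

Taking the argmax is only one decoding strategy. Sampling from `probs` is a common alternative.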

# 7. Positional encodings

In order to use the information about the order of tokens in the sequence, the Transformer uses positional encoding. Positional encoding is a vector that encodes the token's ordinal number. It has the same length as the embeddings ($d_{model}$), and is added to the embedding at the input.

For example, you can use the following functions:
$$PE(pos, 2i) = \sin(\frac{pos}{10000^{\frac{2i}{d_{model}}}})$$
$$PE(pos, 2i+1) = \cos(\frac{pos}{10000^{\frac{2i}{d_{model}}}})$$

where $pos$ is the position index, and $i$ indexes pairs of coordinates: coordinate $2i$ uses the sine and coordinate $2i+1$ uses the cosine of the same frequency.

![](https://jalammar.github.io/images/t/attention-is-all-you-need-positional-encoding.png)
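The sinusoidal formulas can be computed for all positions at once. The following is a minimal sketch: the function name and `max_len` are chosen for this example, and $d_{model}$ is assumed to be even.

```python
import math
import torch


def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) tensor of encodings; d_model must be even."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)        # 0, 2, 4, ...
    inv_freq = torch.exp(-math.log(10000.0) * two_i / d_model)      # 1 / 10000^(2i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * inv_freq)    # even coordinates
    pe[:, 1::2] = torch.cos(pos * inv_freq)    # odd coordinates
    return pe


pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)       # torch.Size([50, 512])
print(pe[0, :4])      # tensor([0., 1., 0., 1.]) since sin(0) = 0 and cos(0) = 1
```

The encoding is added to the token embeddings, e.g. `x + pe[:x.size(1)]` for a batch `x` of shape `(batch, length, d_model)`.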

Other options:
- [learnable](https://aclanthology.org/2022.findings-aacl.42.pdf)
- [relative](https://arxiv.org/abs/1803.02155)
- [RoPE](https://arxiv.org/abs/2104.09864)

#### More

- [Lilian Weng - The Transformer Family](https://lilianweng.github.io/posts/2023-01-27-the-transformer-family-v2/)
- [LongRoPE](https://arxiv.org/abs/2402.13753)

# 8. Batches and Paddings

![](./res/04_padding.png)

# 9. Exercise

Implement Encoder from scratch in PyTorch. Be prepared to answer questions about the code.

# 10. References

- [Lilian Weng - The Transformer Family](https://lilianweng.github.io/posts/2023-01-27-the-transformer-family-v2/)
- [nn.labml.ai - Paper Implementations](https://nn.labml.ai/)
- [Post LayerNorm - Pre LayerNorm](https://arxiv.org/abs/2002.04745)
- [LayerNorm - RMSNorm](https://arxiv.org/abs/1910.07467)
- [Attention modifications](https://arxiv.org/abs/2305.13245)
- [Andrej Karpathy - minGPT](https://github.com/karpathy/minGPT)
- [pytorch - NLP from Scratch](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html)