# 04. ENCODER

1. Seq2seq
2. Transformer
3. Encoder
4. Self-attention
5. FFN
6. Embeddings and softmax
7. Positional encodings
8. Batches and Paddings
9. Residual stream
10. Exercise
11. References

# 1. Seq2seq

Seq2seq is a family of machine learning approaches designed for sequence transformation tasks, such as machine translation, automatic speech recognition (ASR), and code generation.

A standard _encoder-decoder_ model consists of two main components:
- The _encoder_ processes an input sequence $(x_1, ..., x_n)$ and converts it into a sequence of continuous representations $z = (z_1, ..., z_n)$.
- The _decoder_ then generates the output sequence $(y_1, ..., y_m)$ from this representation $z$, one element at a time.

The generation process is _autoregressive_, meaning that at each step, the decoder produces the next output element conditioned on both the encoded representation $z$ and the previously generated tokens.
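The autoregressive loop can be sketched in a few lines of plain Python. This is only an illustration: `greedy_decode`, `toy_next_token`, and the `<bos>`/`<eos>` markers are hypothetical names, and the toy "decoder" simply emits the encoded input in reverse order instead of running a trained model.

```python
from typing import Callable, List, Sequence

BOS, EOS = "<bos>", "<eos>"  # hypothetical start/end markers


def greedy_decode(
    z: Sequence[str],
    next_token: Callable[[Sequence[str], List[str]], str],
    max_len: int = 10,
) -> List[str]:
    """Generate tokens one at a time, conditioning on z and on the prefix so far."""
    output = [BOS]
    for _ in range(max_len):
        token = next_token(z, output)  # depends on z AND on previously generated tokens
        if token == EOS:
            break
        output.append(token)
    return output[1:]  # drop the start marker


def toy_next_token(z: Sequence[str], prefix: List[str]) -> str:
    """Stand-in for a trained decoder: emits the elements of z in reverse order."""
    step = len(prefix) - 1  # number of tokens generated so far
    return z[-1 - step] if step < len(z) else EOS


result = greedy_decode(["a", "b", "c"], toy_next_token)
print(result)
```

In a real model, `next_token` would run the decoder on the prefix and pick a token from the predicted distribution.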

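```python
from IPython.display import IFrame

# Embed the seq2seq animation. An explicit pixel height is passed (450 is an
# arbitrary choice), since height=None would be rendered as height="None".
IFrame(src='http://jalammar.github.io/images/seq2seq_3.mp4', width=800, height=450)
```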

# 2. Transformer

Early encoder-decoder models were primarily based on Recurrent Neural Networks (RNNs). This paradigm shifted in 2017 with the introduction of the Transformer architecture in the seminal paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762) by Vaswani et al. from Google Brain.


![paper](res/04_attention_is_all_you_need.png)

The Transformer consists of an encoder and a decoder, each of which is a stack of layers (figure from lena-voita.github.io):

![](https://lena-voita.github.io/resources/lectures/seq2seq/transformer/model-min.png)

#### More
- [Vaswani et al - Attention Is All You Need](https://arxiv.org/abs/1706.03762)
- [Jay Alammar - The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
- [Huang et al - The Annotated Transformer](http://nlp.seas.harvard.edu/annotated-transformer/)
- [Lena Voita - Seq2seq and Attention](https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html)
- [Brandon Rohrer - Transformers from Scratch](https://e2eml.school/transformers.html)
- [Peter Bloem - Transformers from Scratch](https://peterbloem.nl/blog/transformers)

# 3. Encoder

The encoder is a stack of $N$ identical layers (e.g., $N = 6$).
Each layer has two sublayers:
- a multi-head self-attention mechanism, and
- a position-wise fully connected feed-forward network.

![](http://jalammar.github.io/images/t/Transformer_encoder.png)

The input sequence is first tokenized, and each token is mapped to a vector (its embedding).

![](http://jalammar.github.io/images/t/encoder_with_tensors_2.png)

In addition, the architecture uses a *residual connection* for each of the two sublayers, followed by [layer normalization](https://arxiv.org/abs/1607.06450).

More formally, the output of each sublayer is `LayerNorm(x + Sublayer(x))`, where `Sublayer(x)` is a function implemented by the sublayer itself.
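The formula above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: `layer_norm` omits the learnable gain and bias, and the sublayer is a hypothetical stand-in (a small random linear map) rather than attention or an FFN.

```python
import numpy as np


def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each position (last axis) to zero mean and unit variance.
    The learnable gain and bias are omitted for brevity."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def residual_block(x: np.ndarray, sublayer) -> np.ndarray:
    """Compute LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))        # 4 positions, d_model = 8 (toy size)
W = rng.normal(size=(8, 8)) * 0.1  # hypothetical sublayer weights
out = residual_block(x, lambda h: h @ W)
print(out.shape)
```

Because the sublayer output is added to `x`, the sublayer must return a tensor of the same shape as its input.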

![](https://www.researchgate.net/publication/334288604/figure/fig1/AS:778232232148992@1562556431066/The-Transformer-encoder-structure_W640.jpg)

All sublayers in the model, as well as the embedding layers, produce outputs of the same dimension $d_{model}$ (e.g., $d_{model} = 512$).

#### More
- [Ulf Mertens - The Transformer encoder](https://github.com/mertensu/transformer-tutorial/blob/master/transformer_encoder.ipynb)

# 4. Self-attention

Attention can be described as a function mapping a *query* and a set of *key-value* pairs to some output.

![](http://jalammar.github.io/images/t/transformer_self_attention_vectors.png)

The query, keys, values, and output are vectors.
The output vector is computed as a weighted sum of the values, where the weight assigned to each value is computed from the query and the corresponding key.

![](https://jalammar.github.io/images/t/self-attention-output.png)

In practice, the computation is done in matrix form, simultaneously for a set of queries packed together into a matrix $Q$. The keys and values are likewise packed into matrices $K$ and $V$.

Thus,
$$Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$$

Dividing by $\sqrt{d_k}$ keeps the dot products from growing too large as $d_k$ increases.
(When the dot products are large in magnitude, the softmax saturates and its gradients become very small, which slows down training.)

In matrix notation:

![](http://jalammar.github.io/images/t/self-attention-matrix-calculation-2.png)
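The scaled dot-product attention formula can be written in a few lines of NumPy. This is a sketch under simple assumptions: the helper names `softmax` and `attention` are chosen for illustration, and the inputs are random matrices rather than real projections of token embeddings.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray):
    """Return softmax(Q K^T / sqrt(d_k)) V together with the attention weights."""
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)  # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)                  # each row sums to 1
    return weights @ V, weights


rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 64))  # 3 queries, d_k = 64
K = rng.normal(size=(5, 64))  # 5 keys
V = rng.normal(size=(5, 64))  # 5 values, d_v = 64
output, weights = attention(Q, K, V)
print(output.shape, weights.shape)
```

Each row of `weights` is a probability distribution over the keys, so each output row is a weighted average of the value vectors.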

#### Multi-Head Attention

Multi-head attention runs several attention mechanisms (heads) in parallel, each of which can learn to focus on different features.
This way, information is gathered from different representation subspaces.

To do this, learned linear projections produce $h$ different query-key-value triplets (with dimensions $d_k$, $d_k$, and $d_v$, respectively).

Attention is then applied to each triplet in parallel, yielding output values of dimension $d_v$.
These outputs are concatenated and projected once more.

![](https://repository-images.githubusercontent.com/283979760/0f00ed80-d368-11ea-979d-78033d0a1cee)

More formally:

$$MultiHead(Q, K, V ) = concat(head_1, ..., head_h)W^O$$

where $head_i = Attention(QW^Q_i, KW^K_i, V W^V_i)$ and the following projection matrices are used

$W^Q_i \in R^{d_{model} \times d_k}$,

$W^K_i \in R^{d_{model} \times d_k}$,

$W^V_i \in R^{d_{model} \times d_v}$,

$W^O \in R^{hd_v \times d_{model}}$.

The dimensions can be, for example, $h = 8$, $d_k = d_v = \frac{d_{model}}{h} = 64$.
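To make the shapes concrete, here is a minimal NumPy sketch of multi-head self-attention with $h = 8$ and $d_{model} = 512$. The names `W_q`, `W_k`, `W_v`, `W_o` and the random initialization are assumptions for illustration; a practical implementation would usually compute all heads with one batched matrix product instead of a Python loop.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)) @ V


def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """W_q, W_k: (h, d_model, d_k); W_v: (h, d_model, d_v); W_o: (h * d_v, d_model)."""
    heads = [
        attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i])  # one head: (n, d_v)
        for i in range(W_q.shape[0])
    ]
    return np.concatenate(heads, axis=-1) @ W_o        # (n, h * d_v) -> (n, d_model)


d_model, h = 512, 8
d_k = d_v = d_model // h  # 64
rng = np.random.default_rng(0)
scale = d_model ** -0.5   # keeps the toy activations in a reasonable range
W_q = rng.normal(size=(h, d_model, d_k)) * scale
W_k = rng.normal(size=(h, d_model, d_k)) * scale
W_v = rng.normal(size=(h, d_model, d_v)) * scale
W_o = rng.normal(size=(h * d_v, d_model)) * scale

X = rng.normal(size=(10, d_model))  # 10 positions
out = multi_head_attention(X, X, X, W_q, W_k, W_v, W_o)  # self-attention: Q = K = V = X
print(out.shape)
```

Since $d_k = d_v = d_{model} / h$, the total cost is similar to that of a single head with full dimensionality.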

The transformer uses multi-head attention in three different ways:

1. (cross) In the encoder-decoder attention layers, the *queries* come from the _previous_ decoder layer, and the *keys* and *values* come from the encoder output. This allows each position in the decoder to attend to all positions in the input sequence.

2. (self) The encoder contains self-attention layers.
Here, all *keys*, *values*, and *queries* come from the same place: the output of the previous encoder layer. Each position in the encoder can attend to all positions in the previous encoder layer.

3. (masked) The decoder contains self-attention layers.
Here, each position in the decoder is allowed to attend to all previous positions (to its left) up to and including the current position.
This restriction is needed to preserve the autoregressive property of the decoder (predicting the next token based only on the previous tokens).
The constraint is implemented inside scaled dot-product attention by masking out (setting to $-\infty$) all values at the softmax input that correspond to disallowed connections.
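The masking in the third case can be sketched in NumPy as follows. The helper names `causal_mask` and `masked_attention_weights` are hypothetical, and the queries and keys are random; the point is only to show how $-\infty$ at the softmax input yields zero attention weight.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)  # exp(-inf) == 0, so masked entries get zero weight
    return e / e.sum(axis=axis, keepdims=True)


def causal_mask(n: int) -> np.ndarray:
    """Boolean (n, n) matrix: True where position i must NOT attend to position j (j > i)."""
    return np.triu(np.ones((n, n), dtype=bool), k=1)


def masked_attention_weights(Q: np.ndarray, K: np.ndarray) -> np.ndarray:
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = np.where(causal_mask(len(Q)), -np.inf, scores)  # block future positions
    return softmax(scores, axis=-1)


rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 16))
K = rng.normal(size=(4, 16))
weights = masked_attention_weights(Q, K)
print(np.round(weights, 2))
```

The resulting weight matrix is lower triangular: the first position can only attend to itself, the second to the first two positions, and so on.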

# 5. FFN

Both the encoder and decoder layers contain a _Feed-Forward Network (FFN)_.
This is a sublayer that is applied identically and independently to each position.

![](https://jalammar.github.io/images/t/encoder_with_tensors_2.png)

The FFN consists of two linear transformations with a [ReLU](https://arxiv.org/abs/1803.08375) activation function in between:

$$FFN(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

While the same transformation is applied at every position within a layer, the parameters $W_1, b_1, W_2, b_2$ differ from layer to layer.
The inner layer has a higher dimensionality; for instance, with an input and output dimension of $d_{model} = 512$, the inner layer dimension is often $d_{ff} = 2048$.

A key characteristic of the Transformer is that the token at each position follows its own path through the encoder. While the self-attention layers create dependencies between these positions, the FFN layers operate independently. This allows the FFN computations for all positions to be processed in parallel.
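A minimal NumPy sketch of the FFN makes the position-wise behavior visible. The weights here are random and the scaling factors are arbitrary assumptions; the check at the end shows that applying the FFN to one position alone gives the same result as the corresponding row of the batched computation.

```python
import numpy as np


def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every row (position) of x."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2


d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1 = rng.normal(size=(d_model, d_ff)) * d_model ** -0.5
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * d_ff ** -0.5
b2 = np.zeros(d_model)

X = rng.normal(size=(10, d_model))    # 10 positions
out = ffn(X, W1, b1, W2, b2)          # all positions at once
single = ffn(X[3], W1, b1, W2, b2)    # position 3 on its own

print(out.shape, np.allclose(out[3], single))
```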

# 6. Embeddings and softmax

Learnable embeddings are used to convert input and output tokens into vectors of dimension $d_{model}$.

To generate predictions, the decoder's output is converted into probabilities for the next token using a learnable linear transformation followed by a softmax function.

In detail, the output vector is passed through a linear (fully connected) layer that projects it into a much larger logits vector, whose size equals the size of the vocabulary.
The softmax function then converts these logits into probabilities. With greedy decoding, the token with the highest probability is selected as the prediction.

![](./res/04_transformer_decoder_output_softmax.png)
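The two ends of the pipeline can be sketched in NumPy. Everything here is a toy assumption: the four-word vocabulary, the random `embedding` and `W_out` matrices, and the use of the last embedded token as a stand-in for the decoder's output vector (a real model would pass it through the layer stack first).

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


vocab = ["<pad>", "the", "cat", "sat"]  # hypothetical toy vocabulary
d_model = 8
rng = np.random.default_rng(0)

# Input side: the embedding is a lookup table with one row per token.
embedding = rng.normal(size=(len(vocab), d_model))
token_ids = np.array([1, 2, 3])         # "the cat sat"
x = embedding[token_ids]                # (3, d_model)

# Output side: linear projection to vocabulary size, then softmax.
W_out = rng.normal(size=(d_model, len(vocab)))
b_out = np.zeros(len(vocab))
h = x[-1]                               # stand-in for the decoder output vector
logits = h @ W_out + b_out              # (vocab_size,)
probs = softmax(logits)
next_id = int(np.argmax(probs))         # greedy choice

print(x.shape, probs.shape, vocab[next_id])
```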

# 7. Positional encodings

To make use of the order of tokens in the sequence, the Transformer uses positional encodings; without them, self-attention has no notion of token order. A positional encoding is a vector that encodes the token's position in the sequence. It has the same length as the embeddings ($d_{model}$) and is added to the embedding at the input.

For example, you can use the following functions:
$$PE(pos, 2i) = \sin(\frac{pos}{10000^{\frac{2i}{d_{model}}}})$$
$$PE(pos, 2i+1) = \cos(\frac{pos}{10000^{\frac{2i}{d_{model}}}})$$

where $pos$ is the position index and $i$ indexes the sine/cosine pairs, so $2i$ and $2i+1$ are the coordinate indices of the encoding vector.
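The sinusoidal formulas can be computed for all positions at once in NumPy. This sketch assumes an even $d_{model}$; the function name `positional_encoding` and the `max_len` parameter are chosen for illustration.

```python
import numpy as np


def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix of sinusoidal encodings (d_model must be even)."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]  # the values 2i: 0, 2, 4, ...
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even coordinates: PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)               # odd coordinates:  PE(pos, 2i+1)
    return pe


pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)
```

The resulting matrix is added row by row to the token embeddings, so row `pos` is shared by whatever token appears at that position.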

![](https://jalammar.github.io/images/t/attention-is-all-you-need-positional-encoding.png)

Other options:
- [learnable](https://aclanthology.org/2022.findings-aacl.42.pdf)
- [relative](https://arxiv.org/abs/1803.02155)
- [RoPE](https://arxiv.org/abs/2104.09864)

#### More

- [Lilian Weng - The Transformer Family](https://lilianweng.github.io/posts/2023-01-27-the-transformer-family-v2/)
- [LongRoPE](https://arxiv.org/abs/2402.13753)

# 8. Batches and Paddings

![](./res/04_padding.png)

# 9. Residual stream

[Source: [A Mathematical Framework for Transformer Circuits](https://transformer-circuits.pub/2021/framework/index.html)]

Simplification: the source considers "attention-only" transformers, which don't have MLP layers.

![](./res/04_residual.png)


The residual stream can be considered as a communication channel, since it doesn't do any processing itself and all layers communicate through it.

The residual stream has a deeply linear structure.
Every layer performs an arbitrary linear transformation to "read in" information from the residual stream at the start, and performs another arbitrary linear transformation before adding to "write" its output back into the residual stream.
This linear, additive structure of the residual stream has a lot of important implications.
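The additive structure can be illustrated with a toy NumPy sketch. The layers here are hypothetical stand-ins (a linear "read", a nonlinearity, and a linear "write"), not real attention heads, and layer normalization is left out; the only point is that the final residual stream equals the initial embeddings plus the sum of everything the layers wrote.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_head, n_layers = 5, 16, 4, 3


def make_layer(rng):
    """Hypothetical layer: linear read from the stream, some processing, linear write."""
    W_read = rng.normal(size=(d_model, d_head)) * 0.1
    W_write = rng.normal(size=(d_head, d_model)) * 0.1
    return lambda x: np.tanh(x @ W_read) @ W_write


layers = [make_layer(rng) for _ in range(n_layers)]

x0 = rng.normal(size=(n_tokens, d_model))  # token embeddings: initial residual stream
x = x0
writes = []
for layer in layers:
    delta = layer(x)   # the layer reads the current stream ...
    writes.append(delta)
    x = x + delta      # ... and writes back by addition

print(np.allclose(x, x0 + sum(writes)))
```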

![](./res/04_attention_heads.png)

The fundamental action of attention heads is _moving information_.

They read information from the residual stream of one token, and write it to the residual stream of another token. The main observation to take away from this section is that which tokens to move information from is completely separable from what information is "read" to be moved and how it is "written" to the destination.

# 10. Exercise

Using an LLM, implement the Encoder from scratch in PyTorch. Be prepared to answer questions about the code.

# 11. References

- [Lilian Weng - The Transformer Family](https://lilianweng.github.io/posts/2023-01-27-the-transformer-family-v2/)
- [nn.labml.ai - Paper Implementations](https://nn.labml.ai/)
- [Post LayerNorm - Pre LayerNorm](https://arxiv.org/abs/2002.04745)
- [LayerNorm - RMSNorm](https://arxiv.org/abs/1910.07467)
- [Attention modifications](https://arxiv.org/abs/2305.13245)
- [Andrej Karpathy - minGPT](https://github.com/karpathy/minGPT)
- [pytorch - NLP from Scratch](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html)