TODO: add separate sections on:

- RNN
- LSTM
- Attention

# **THE TRANSFORMER**

In this notebook we introduce the Transformer, the architecture that revolutionized NLP. Today it is still the standard architecture for building large language models.

It was introduced in the paper [Attention Is All You Need (2017)](https://arxiv.org/abs/1706.03762), which I suggest as the primary source of information.

# **RESOURCES**:

### **BOOKS**
[Hands-On Large Language Models](https://www.llm-book.com/)

### **ONLINE RESOURCES**
(from chapter 3 of the "Hands-On Large Language Models" book):
[Transformer illustrated explanation](https://jalammar.github.io/illustrated-transformer/)


### **PAPERS**

[Attention (2014)](https://arxiv.org/abs/1409.0473)

[Attention is all you need (2017)](https://arxiv.org/abs/1706.03762)

[BERT (2018)](https://arxiv.org/abs/1810.04805)

One of the main motivations for introducing the Transformer was to overcome the sequential nature of Recurrent Neural Networks (RNNs) and LSTMs, which process input tokens one at a time. This sequential dependency makes training slow and limits parallelization. In contrast, Transformers allow parallel computation across sequence positions, which makes them significantly more efficient, especially on GPUs. GPUs are designed to perform many operations in parallel, which is exactly what Transformers exploit; RNNs are inherently sequential and cannot fully use that parallelism.

RNNs and LSTMs therefore examine the sequence one token at a time. The Transformer, on the other hand, relies entirely on the **attention mechanism** and looks at all tokens at once.

The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

## **TRANSFORMER ARCHITECTURE** 


The Transformer is a neural network with a specific structure that includes a mechanism called self-attention or multi-head attention. Attention can be thought of as a way to build contextual representations of a token's meaning by **attending to** and integrating information from surrounding tokens, helping the model learn how tokens relate to each other over large spans.
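To make the idea of "attending to" surrounding tokens concrete, here is a minimal single-head sketch of scaled dot-product self-attention in NumPy. The weight matrices `W_q`, `W_k`, `W_v` and all sizes are random toy values chosen for illustration; a real model learns these weights and runs several heads in parallel.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over X with shape (n_tokens, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (n_tokens, n_tokens) pairwise scores
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights               # contextual vectors, attention weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 4, 8, 8           # toy sizes (assumption)
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape, weights.sum(axis=-1))
```

Each row of `weights` says how much one token draws on every other token when its new representation is built.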

![Alt text](images/transformer_architecture.png)


*Figure 1: The architecture of a (left-to-right) transformer, showing how each input token gets encoded, passed through a set of stacked transformer blocks, and then to a language model head that predicts the next token*

*NOTE*: "left-to-right" is explained in the next section.


### **LEFT-TO-RIGHT OR AUTOREGRESSIVE MODEL**

We’ll focus for now on left-to-right (sometimes called causal or autoregressive) language modeling, in which we are given a sequence of input tokens and predict output tokens one by one, conditioning on the prior context. Generation is autoregressive: each newly generated token is appended to the context and consumed by the model before the next token is predicted.
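The generation loop can be sketched in a few lines. `toy_next_token` is a hypothetical stand-in for a real language model (it just cycles through a tiny vocabulary); the point is only the loop structure, in which every new token is fed back as context.

```python
def toy_next_token(context):
    """Hypothetical stand-in for a language model: picks the next token from the context."""
    vocab = ["the", "cat", "sat", "<eos>"]
    # Toy rule: choose a token based on how long the context is.
    return vocab[len(context) % len(vocab)]

def generate(prompt, max_new_tokens=5):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        nxt = toy_next_token(tokens)  # condition on everything produced so far
        tokens.append(nxt)            # the new token becomes part of the context
        if nxt == "<eos>":
            break
    return tokens

print(generate(["the"]))  # ['the', 'cat', 'sat', '<eos>']
```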


### **TRANSFORMER BLOCKS**

The picture sketches the transformer architecture. We can see the transformer blocks (the purple ones), which are multilayer networks.
They are called "multilayer networks" because they contain several layers stacked on top of each other, working together.

The components of each block are:
1) **MULTI-HEAD SELF-ATTENTION LAYER**, which is the main innovation of the transformer architecture. This allows the transformer to focus on different parts of the sentence simultaneously.
2) **FEEDFORWARD NETWORK (multilayer perceptron)**, typically two dense layers with a nonlinear activation in between.
3) **LAYER NORMALIZATION** steps, which are applied either before or after the other components.
4) **SKIP CONNECTIONS** (residual connections), which add the input of a sublayer to its output.


The order of the skip connection and the layer normalization is not fixed. In the original 2017 paper (Attention Is All You Need), the skip connection is added first and the normalization is applied afterwards. This is called "Post-LN" (post layer normalization).

In more recent architectures, such as GPT-2, the normalization layer comes first and the skip connection is added afterwards (the "Pre-LN" variant). BERT, by contrast, keeps the original Post-LN ordering.

That is, in the Pre-LN variant:

1. LayerNorm first (before attention/FFN).
2. Then the sublayer computation.
3. Then add the skip connection.

This makes optimization more stable, especially for very deep stacks.
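The two orderings described above (normalization after vs. before the sublayer) differ only in where `layer_norm` sits relative to the skip connection. The sketch below is a simplification: the learned scale and shift of layer normalization are omitted, and `sublayer` is a hypothetical stand-in for attention or the feedforward network.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance
    # (learned scale and shift parameters are omitted in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln(x, sublayer):
    # Original 2017 ordering: sublayer, add the skip connection, then normalize.
    return layer_norm(x + sublayer(x))

def pre_ln(x, sublayer):
    # Pre-LN ordering: normalize, apply the sublayer, then add the skip connection.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8)) * 0.1

def sublayer(h):
    # Hypothetical stand-in for attention or the feedforward network.
    return np.tanh(h @ W)

x = rng.normal(size=(4, 8))
y_post = post_ln(x, sublayer)
y_pre = pre_ln(x, sublayer)
print(y_post.shape, y_pre.shape)
```

In `pre_ln` the input `x` reaches the output untouched through the skip path, which is one common explanation for why deep Pre-LN stacks are easier to optimize.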











---

**TIP**

If any of these four components are unclear (or unfamiliar to you), I suggest revisiting the basics of **Machine Learning** or **Deep Learning**.

**📚 RECOMMENDED RESOURCES**

- *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow* by Aurélien Géron  
  (a really great book: easy and practical)

- *Deep Learning* by Ian Goodfellow, Yoshua Bengio, and Aaron Courville  
  (a really great book, but more mathematics-heavy. Great for understanding the maths in depth. Highly recommended for the part on MLPs and how they work)

---

The transformer blocks are preceded by the encoding layer.




![Alt](images/encoder_decoder_RNN.png)

Using an encoder-decoder structure is not new to the Transformer architecture; what is new is relying only on the attention mechanism. Earlier encoder-decoder models were built on RNNs, which have clear limits: the whole input has to be squeezed into a single context embedding, and long sequences lead to vanishing gradients. The attention mechanism introduced by Bahdanau, Cho, and Bengio ([Attention Paper (2014)](https://arxiv.org/abs/1409.0473)) addressed this. The full power of attention, which drives the abilities of large language models, was first explored in the well-known [Attention is all you need paper (2017)](https://arxiv.org/abs/1706.03762).

Attention allows a model to focus on ("attend to") just the most relevant parts of the input. It selectively determines which words are most important in a given sentence. For instance, the output word “lama’s” is Dutch for “llamas,” which is why the attention between the two is high. Similarly, the words “lama’s” and “I” have lower attention since they are less related.
![Alt](images/attention.png)

By adding these attention mechanisms to the decoder step, the RNN can
generate signals for each input word in the sequence related to the potential
output. Instead of passing only a context embedding to the decoder, the
hidden states of all input words are passed. This process is shown in the
figure below.

![Alt](images/attention_RNN.png)
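A rough sketch of this step: the decoder scores every encoder hidden state against its own current state, turns the scores into weights, and takes a weighted average. Note the assumption: a plain dot product is used as the score for brevity, whereas the 2014 paper uses a small learned (additive) scoring network. All values are random toy data.

```python
import numpy as np

def attend(decoder_state, encoder_states):
    """Weight every encoder hidden state by its relevance to the decoder state."""
    scores = encoder_states @ decoder_state          # one score per input word
    scores = scores - scores.max()                   # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax: weights sum to 1
    context = weights @ encoder_states               # weighted average of hidden states
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 16))  # 5 input words, hidden size 16 (toy values)
decoder_state = rng.normal(size=16)

context, weights = attend(decoder_state, encoder_states)
print(weights.round(3), context.shape)
```

Input words with a high weight contribute most to the context vector used for the next output word.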

### The encoding layer

The encoding layer takes words as input and embeds them through an embedding matrix. As we saw in the foundations, we cannot feed a machine just words; we need numbers. These numbers have to capture not only the specific words, but also their grammatical relationships.
This is the job of the embedding matrix. It embeds both the specific words and the position of each word (positional embeddings).

After the transformer blocks comes an unembedding matrix.
Its task is to map the vector back to words. As we saw, we need numbers (vectors) to compute operations and predict the next word, but at the end we need an actual word. Therefore, we "unembed" the vector into a score for each word in the vocabulary and, through a softmax, choose the most probable word (the predicted word).
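The round trip from words to vectors and back can be sketched as follows. Everything here is a toy assumption: the four-word vocabulary, the random matrices `E` and `P`, learned (rather than sinusoidal) positional embeddings, and reusing the transposed embedding matrix as the unembedding matrix. The transformer blocks in the middle are skipped entirely.

```python
import numpy as np

vocab = ["<pad>", "the", "llama", "sleeps"]   # toy vocabulary (assumption)
d_model, max_len = 8, 16
rng = np.random.default_rng(0)

E = rng.normal(size=(len(vocab), d_model))    # token embedding matrix
P = rng.normal(size=(max_len, d_model))       # positional embeddings, one row per position
U = E.T                                       # unembedding matrix (weight tying is an assumption)

token_ids = np.array([vocab.index(w) for w in ["the", "llama"]])
X = E[token_ids] + P[: len(token_ids)]        # (2, d_model): word identity plus position

h_last = X[-1]                                # stand-in for the output of the last block
logits = h_last @ U                           # one score per vocabulary entry
probs = np.exp(logits - logits.max())
probs /= probs.sum()                          # softmax over the vocabulary
predicted = vocab[int(probs.argmax())]        # most probable word
print(predicted, probs.round(3))
```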


### Encoder-decoder, encoder-only, decoder-only
The original Transformer is an encoder-decoder architecture: a stack of encoder blocks feeds a stack of decoder blocks.
This differs from later architectures such as **BERT**, which is encoder-only, or **GPT**, which is decoder-only.
The vanilla version of BERT can only create contextual embeddings, while GPT is built to generate text token by token.

The encoder-decoder design works well for sequence-to-sequence tasks such as machine translation, but it is a less natural fit for other tasks, such as text classification.

For these kinds of tasks, different architectures were created.
An example is [**Bidirectional Encoder Representations from Transformers (BERT, 2018)**](https://arxiv.org/abs/1810.04805).
BERT is an encoder-only model: it uses only the encoder and removes the decoder entirely. Its task is to build contextual representations of the input, so it focuses on representing language rather than generating it.

## TRANSFORMER BLOCKS ARE STACKED 

Transformer blocks are stacked one on top of the other.

**Why?**
Roughly speaking, the first blocks learn representations of individual words, while later blocks learn more complex representations, such as the grammatical rules and semantic structure of the sentence. A stack might contain 12 to 96 or more blocks.


**Words are represented as vectors of a fixed size**

We have seen that words are represented as vectors. The dimension of these vectors is defined initially and stays consistent through all the transformer blocks. Usually, the choice is a vector of 512, 768, 1024 or more dimensions.
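A quick way to see that the vector size stays constant is to push a toy input through a stack of blocks and record the shape after each one. `toy_block` is a hypothetical stand-in for a full transformer block (a residual update of the same width); the sizes are example values.

```python
import numpy as np

def toy_block(x, W):
    # Hypothetical stand-in for a transformer block: a residual update that keeps the width.
    return x + np.tanh(x @ W)

rng = np.random.default_rng(0)
n_tokens, d_model, n_blocks = 5, 512, 12
x = rng.normal(size=(n_tokens, d_model))

shapes = [x.shape]
for _ in range(n_blocks):
    W = rng.normal(size=(d_model, d_model)) * 0.01
    x = toy_block(x, W)
    shapes.append(x.shape)

print(set(shapes))  # {(5, 512)}
```

Because every block maps `(n_tokens, d_model)` to the same shape, blocks can be stacked to any depth without extra glue layers.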

Summary:

RNNs: sequential by design
- RNNs process inputs token by token: at time step t, you need the hidden state from time step t-1.
- Each step therefore depends on the previous one, so the steps cannot be computed in parallel, even with a GPU.
- GPUs sit mostly idle, waiting for sequential computations to finish.

Transformers: parallelism-friendly
- Transformers compute all token representations at once (self-attention allows looking at all tokens simultaneously).
- No time-step dependencies like in RNNs.
- Enables parallel matrix multiplications over entire sequences, which GPUs handle extremely efficiently.
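The contrast in the summary shows up directly in code: the RNN needs a Python loop in which each step waits for the previous hidden state, while attention handles the whole sequence with a few matrix products. This is a toy sketch with random weights; the attention part skips the query/key/value projections to keep it short.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 6, 4
X = rng.normal(size=(n_tokens, d))
W_x, W_h = rng.normal(size=(d, d)), rng.normal(size=(d, d))

# RNN: step t cannot start until step t-1 has produced its hidden state.
h = np.zeros(d)
rnn_states = []
for x_t in X:
    h = np.tanh(x_t @ W_x + h @ W_h)
    rnn_states.append(h)
rnn_states = np.stack(rnn_states)

# Self-attention: all pairwise token scores come from a single matrix product.
scores = X @ X.T / np.sqrt(d)
scores = scores - scores.max(axis=-1, keepdims=True)
attn_weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
attn_out = attn_weights @ X

print(rnn_states.shape, attn_out.shape)
```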

# Distillation

Transformer models can consume a lot of energy: training a single model may use as much energy as running a household for several years! Moreover, they may not even fit in the RAM of a computer.
One solution is **distillation**. How does distillation work? You train a smaller, distilled model (such as **DistilBERT**) using the predictions of the "parent" model (in this case, BERT) as labels. Surprisingly, the resulting model typically performs better than the same small model trained directly on the data the parent model was trained on. And it is far more portable.
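The core of the idea, training the student on the parent's predictions, can be written as a cross-entropy between the two output distributions. This is a simplified sketch and not the exact DistilBERT training objective; the logits and the `temperature` value are made-up illustration values.

```python
import math

def softmax(logits, temperature=1.0):
    # A higher temperature softens the distribution.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's softened predictions and the student's."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

teacher = [4.0, 1.0, 0.5]        # parent model's logits for one example (made up)
close_student = [3.5, 1.2, 0.4]  # student that roughly agrees with the parent
far_student = [0.5, 1.0, 4.0]    # student that disagrees

print(distillation_loss(close_student, teacher))
print(distillation_loss(far_student, teacher))
```

The loss is lower for the student whose predictions are closer to the parent's, so minimizing it pulls the small model toward the behavior of the large one.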