# TD 5: Attention, Transformers - GPT

By Jill-Jênn Vie

In this TD we will focus on [GPT](https://en.wikipedia.org/wiki/Generative_pre-trained_transformer) (*generative pre-trained transformer*), the decoder-only transformer architecture behind GPT-2, ChatGPT, etc. Transformers are also used beyond text (audio, reinforcement learning with Decision Transformers), including the Vision Transformers that we will see next week.

<img width="70%" src="https://heidloff.net/assets/img/2023/02/transformers.png" />

This first part does not require a GPU. If you want to use one, you can check the CUDA version and the GPU memory usage with `!nvidia-smi`.

To connect via SSH to Polytechnique machines: https://www.enseignement.polytechnique.fr/informatique/INF473V/TD/0/SSH_JUPYTER.html

In [None]:
!pip install huggingface-hub transformers torch matplotlib seaborn

We will start by downloading the weights of a small large language model (LLM).

Qwen2.5-0.5B-Instruct, released on September 25, 2024, has 500M parameters and weighs about 1 GB.
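As a sanity check, the two figures above are consistent if each parameter is stored on 16 bits. This is only a back-of-the-envelope sketch: the 2 bytes per parameter is an assumption, and the helper name `model_size_gb` is made up for the illustration.

```python
def model_size_gb(n_params, bytes_per_param):
    """Rough size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

n_params = 500e6  # order of magnitude quoted above

# Assumption: 16-bit storage, i.e. 2 bytes per parameter
size_16_bits = model_size_gb(n_params, bytes_per_param=2)
# For comparison: 32-bit floats, i.e. 4 bytes per parameter
size_32_bits = model_size_gb(n_params, bytes_per_param=4)

print(f"16 bits: {size_16_bits:.1f} GB, 32 bits: {size_32_bits:.1f} GB")
```

With 32-bit floats the same weights would take twice as much memory.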

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
model

We see that there are 24 layers of attention. We can compute the exact number of parameters:

In [None]:
import numpy as np

n_parameters = 0
for k, v in model.named_parameters():
    print(k, v.shape)
    n_parameters += np.prod(v.shape)
n_parameters

I assume you cannot wait to try it. First we should convert our prompt into tokens. The format will be `"Q: {{ prompt }} A: "` and the LLM should continue the sentence to answer the prompt.

In [None]:
input_text = "Q: Translate into English 'les voitures de la Commission européenne sont vertes' A:"

In [None]:
inputs = tokenizer(input_text, return_tensors="pt")  # Returns a PyTorch tensor
inputs

In [None]:
tokenizer.decode(inputs.input_ids[0])

In [None]:
from transformers import set_seed, TextStreamer
set_seed(42)

streamer = TextStreamer(tokenizer)

model.generate(
    **inputs,
    max_length=100,
    do_sample=True,
    temperature=1.,
    streamer=streamer
)

If we make a forward pass on this model:

In [None]:
model(**inputs)

The logits have shape (batch size) $\times$ (sequence length) $\times$ (vocabulary size).

In [None]:
model(**inputs).logits.shape

Increase the temperature to make the model hallucinate. The logits are divided by $T$ before the softmax: if $T = 1$ nothing changes; if $T$ is high, the probability distribution gets closer to uniform; as $T$ gets closer to 0, the distribution gets sharper and concentrates on the most likely token.
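To get a feeling for the effect of $T$, here is a minimal sketch in pure Python on three made-up logits (the function name `softmax_with_temperature` and the values are only for illustration, they do not come from the model).

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply a softmax."""
    scaled = [x / temperature for x in logits]
    highest = max(scaled)  # subtracted for numerical stability
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # made-up values
for temperature in (0.1, 1.0, 10.0):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

The largest probability shrinks as the temperature grows, while the probabilities still sum to 1.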

## Question 1.

Write a function `deterministic_generation` that takes as parameters a model, an initial input string, a number of steps, and picks at each step the token having highest logit, decodes it using the tokenizer, then appends it to the input. Your function should return the same text as `model.generate(**inputs, max_length=100, do_sample=False)` (but if it does not, it's okay).

Hints: `argmax`, slicing, `print(something, end='')` to print without a newline. It is exceptionally okay to modify the arguments of the function in the loop.
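The hints can be tried on a toy tensor before touching the model. The sketch below uses random numbers in place of real logits (the sizes 5 and 11 are arbitrary) and only shows the indexing, not the full generation loop.

```python
import torch

torch.manual_seed(0)
# Fake logits with shape (batch size, sequence length, vocabulary size)
toy_logits = torch.randn(1, 5, 11)

# Slicing: keep the logits of the last position of the first sequence
last_logits = toy_logits[0, -1]        # shape (11,)
next_id = last_logits.argmax().item()  # index of the highest logit
print("next token id:", next_id)

# Printing pieces of text one after the other, without newlines
for piece in ["Hel", "lo", "!"]:
    print(piece, end="")
print()
```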

In [None]:
def deterministic_generation(model, input_text, n_steps=50):
    # Your code here
    pass  # placeholder so that the cell runs; replace with your implementation

In [None]:
input_text = "Q: Translate into English 'les voitures de la Commission européenne sont vertes' A:"
deterministic_generation(model, input_text)

In [None]:
input_text = "Q: Translate into English 'les voitures de la Commission européenne sont vertes' A:"
inputs = tokenizer(input_text, return_tensors="pt")
model.generate(**inputs, max_length=100, do_sample=False, streamer=streamer)

Here are some suggested prompts. Optionally, you can make your function faster by avoiding repeated calls to the tokenizer for encoding.
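One possible way to avoid re-encoding, sketched on made-up token ids (the ids below are arbitrary, not real Qwen tokens): encode the prompt once, then append each new id to the tensor of ids with `torch.cat`. The model can then be called on the ids directly, with `model(input_ids=input_ids)`.

```python
import torch

# Pretend this is the encoded prompt, shape (batch size, sequence length)
input_ids = torch.tensor([[12, 7, 42]])

# Pretend this is the id picked at the current step, shape (1, 1)
next_id = torch.tensor([[5]])

# Append along the sequence axis instead of re-encoding the whole text
input_ids = torch.cat([input_ids, next_id], dim=1)
print(input_ids)
```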

In [None]:
deterministic_generation(model, "Q: Who is Isaac Newton? A:")

In [None]:
deterministic_generation(model, "Q: What is bigger between 0.9 and 0.11? A:")

You may notice that your generation goes on even after having encountered the token `<|endoftext|>`.
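If you want to stop there, a simple option is to leave the loop as soon as the stop token is produced. Below is a minimal sketch with a toy "model" that counts down. The names `generate_until_stop` and `countdown` and the stop id 0 are hypothetical; with the real tokenizer, the id could be looked up with `tokenizer.convert_tokens_to_ids("<|endoftext|>")`.

```python
STOP_ID = 0  # hypothetical id standing in for <|endoftext|>

def generate_until_stop(next_token_fn, prompt_ids, n_steps=50, stop_id=STOP_ID):
    """Append tokens one by one, stopping early if stop_id is produced."""
    ids = list(prompt_ids)
    for _ in range(n_steps):
        token_id = next_token_fn(ids)
        if token_id == stop_id:
            break  # do not generate past the end of text
        ids.append(token_id)
    return ids

def countdown(ids):
    """Toy replacement for the model: the next id is the last id minus 1."""
    return ids[-1] - 1

print(generate_until_stop(countdown, [3]))  # prints [3, 2, 1]
```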

## Question 2.

Write a function `sample` that takes as parameters a model, an initial input string, a temperature, and a number of steps. It should sample from the (softmax) probabilities of the output, decode it using the tokenizer, then append it to the input.

The $T > 0$ temperature parameter in the softmax is a smoothing parameter:

$$p_i = \textrm{softmax}(\mathbf{x}, T)_i = \frac{\exp(x_i / T)}{\sum_{j = 1}^n \exp(x_j / T)}$$

See what happens when $T \to \infty$ or $T \to 0$. 

Hints: `softmax` takes a tensor and a `dim` parameter that indicates the axis over which to normalize. A tensor containing probabilities has a method `multinomial` to sample from it.
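Here is how these two functions behave on a toy tensor (made-up logits for two "positions" over a vocabulary of three tokens). It is only a sketch of the calls, not the `sample` function itself.

```python
import torch

torch.manual_seed(0)
toy_logits = torch.tensor([[1.0, 2.0, 3.0],
                           [0.0, 0.0, 0.0]])

# Normalize over the last axis: each row becomes a probability distribution
probs = torch.softmax(toy_logits, dim=-1)
print(probs.sum(dim=-1))

# Draw one index per row, according to the probabilities of that row
draws = probs.multinomial(num_samples=1)
print(draws.shape)  # one sampled index per row
```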

In [None]:
outputs = model(**inputs)

In [None]:
from torch import softmax, multinomial, manual_seed
manual_seed(42)

def sample(model, input_text, temperature, n_steps=50):
    # Your code here
    pass  # placeholder so that the cell runs; replace with your implementation

In [None]:
input_text = "Q: Translate into English 'les voitures de la Commission européenne sont vertes' A:"
sample(model, input_text, 1.5)

In [None]:
model.generate(
    **inputs,
    max_length=100,
    do_sample=True,
    temperature=10.,
    streamer=streamer
)

In [None]:
input_text = "Q: Translate into English 'les voitures de la Commission européenne sont vertes' A: The English translation of 'les voitures de la Commission européenne sont vertes' is 'The European Commission cars are green'."
inputs = tokenizer(input_text, return_tensors="pt")

outputs = model(**inputs, output_attentions=True)

Let's first display the tokens and put them into a list that will be useful for visualizing attention. What does `outputs.attentions` contain?

In [None]:
len(outputs.attentions)

In [None]:
outputs.attentions[0].shape

`outputs.attentions` is a tuple of 24 tensors (one per layer), each containing 14 heads of attention weights of size $44 \times 44$, where 44 is the length of the sequence.
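To practice the indexing without running the model, we can build a fake structure with the same layout. This is only a sketch: the values are random, and the name `fake_attentions` is made up.

```python
import torch

torch.manual_seed(0)
n_layers, n_heads, seq_len = 24, 14, 44

# One tensor per layer, each of shape (batch size, heads, seq_len, seq_len)
fake_attentions = tuple(
    torch.softmax(torch.randn(1, n_heads, seq_len, seq_len), dim=-1)
    for _ in range(n_layers)
)

last_layer = fake_attentions[-1]  # shape (1, 14, 44, 44)
head_3 = last_layer[0, 3]         # first sequence, head number 3
print(len(fake_attentions), last_layer.shape, head_3.shape)
```

Each `head_3[i]` is then a row of weights over the 44 positions of the sequence.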

In [None]:
tokens = []
for i, token in enumerate(inputs.input_ids[0].tolist()):
    tokens.append(tokenizer.decode(token))
    print(i, token, tokenizer.decode(token))

## Question 3.

Using [seaborn's heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html) (or `plt.imshow`), plot the attention weights of each head of the last layer (use a for loop and add the head number in the title of each plot).

Hint: `plt.xticks(ticks, labels)` for labeling the plot using tokens, and `plt.tick_params("x", rotation=90)` for rotating the tick labels.

Trap: be careful when choosing the labels on the $y$-axis.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Your code here

We are now going to focus on the masked self-attention layer. You will see how this important and simple tensor operation can be implemented in roughly 5 lines, even with a batch of data, and several heads.

Let's now assume we have sets of vectors: keys, queries and values. In the particular case of decoder-only GPT, they are all linear projections of the same source $X$ (self-attention).

We want to compute the attention mechanism:

$$A(Q, K, V) = \underbrace{\textrm{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right)}_{\textrm{attention weights}}\, V$$

Note that the attention weights are only computed using $Q$ and $K$, not $V$.

We assume that for each token of each sequence in the batch, we have several of these sets of vectors (one per attention head). This gives order-4 tensors of shape (batch size, number of heads, sequence length, embedding size).

In [None]:
import numpy as np
from torch import Tensor
import torch

batch_size = 8
n_heads = 14
seq_len = 10
embed_size = 16
embed_size_values = 12

k = torch.rand((batch_size, n_heads, seq_len, embed_size))
q = torch.rand((batch_size, n_heads, seq_len, embed_size))
v = torch.rand((batch_size, n_heads, seq_len, embed_size_values))

In [None]:
k.shape

## Question 4a.

<img width="300" src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97567e7b-f8b9-4dea-a678-162378609a75_1304x1150.png" />

Implement the self-attention. Your implementation should be vectorized (no for loop, only tensor operations) and work for all batches and heads. Plot the attention weights using simply `plt.imshow`.

Hints: use `transpose` or `view`. It is okay to use [einsum](https://pytorch.org/docs/stable/generated/torch.einsum.html), but it is not needed: the `@` operator (equivalent to [matmul](https://pytorch.org/docs/stable/generated/torch.matmul.html), which also handles batched matrix multiplications) should be enough.
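To see what "more than just matrix multiplications" means, here is a small sketch on random tensors with arbitrary sizes: `@` multiplies the last two axes and treats the leading axes as batch axes. The `einsum` line is an equivalent formulation, shown for comparison.

```python
import torch

torch.manual_seed(0)
a = torch.rand(2, 3, 4, 5)  # 2 x 3 matrices of size 4 x 5
b = torch.rand(2, 3, 6, 5)  # 2 x 3 matrices of size 6 x 5

# Swap the last two axes of b, then multiply matrix by matrix
prod = a @ b.transpose(-2, -1)  # shape (2, 3, 4, 6)

# Same computation written with einsum
same = torch.einsum("bhik,bhjk->bhij", a, b)

print(prod.shape, torch.allclose(prod, same, atol=1e-6))
```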

<img width="70%" src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75a8df1-0a82-4f79-8e68-4fe16587063d_1474x1108.png" />

In [None]:
# Your code here

In [None]:
# Checking your weights
assert attn_weights.shape == (batch_size, n_heads, seq_len, seq_len)
assert torch.all((attn_weights.sum(dim=3) - 1.).abs() < 1e-5)  # Weights should sum to 1 over the last axis, for each batch, head and row

In [None]:
import matplotlib.pyplot as plt

plt.imshow(attn_weights[0, 0])

## Question 4b.

Actually, when we generate token by token, we should not attend to the future: it is not feasible (even though that's what stable diffusion is trying to do). Implement the masked self-attention, which ensures that the attention weights of the $i$th row are nonzero only for columns $j \leq i$ (each token attends to itself and to the tokens before it). Your implementation should be vectorized (no for loop, only tensor operations) and work for all batches and heads. Again, plot the attention weights using simply `plt.imshow`.

Hints: [triu](https://pytorch.org/docs/stable/generated/torch.triu.html) for the upper triangular part (check the `diagonal` parameter) and [masked_fill](https://pytorch.org/docs/stable/generated/torch.Tensor.masked_fill_.html#torch.Tensor.masked_fill_) to fill the entries selected by a boolean mask with a given value. Then renormalize.
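Here is what these two functions do on a single $3 \times 3$ matrix of scores (all zeros, for readability). Note that this sketch uses a common variant: it fills the future positions with $-\infty$ *before* the softmax, which gives the same weights as zeroing them after the softmax and renormalizing. Extending it to batches and heads is left to you.

```python
import torch

scores = torch.zeros(3, 3)  # made-up scores for a sequence of length 3

# True strictly above the diagonal, i.e. where column j > row i (the future)
future = torch.triu(torch.ones(3, 3, dtype=torch.bool), diagonal=1)

# Future positions get -inf, so their weight after the softmax is exactly 0
masked = scores.masked_fill(future, float("-inf"))
weights = torch.softmax(masked, dim=-1)
print(weights)
```

Row $i$ ends up uniform over its first $i + 1$ columns, since all the scores are equal here.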

<img width="50%" src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc51bfe11-c2cf-4ce5-95d4-4f8a57eac997_1026x1148.png" />

<img width="50%" src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1317a05-3542-4158-94bf-085109a5793a_1220x702.png" />

In [None]:
import torch

# Your code here

In [None]:
# You're gonna carry that weight
assert attn_weights.shape == (batch_size, n_heads, seq_len, seq_len)
assert torch.all((attn_weights.sum(dim=3) - 1.).abs() < 1e-5)  # Weights should sum to 1 over the last axis, for each batch, head and row
assert torch.all(torch.triu(attn_weights, diagonal=1).abs() < 1e-6)  # Zeroes strictly above the diagonal

In [None]:
plt.imshow(attn_weights[0, 0])

Finally, `git clone https://github.com/karpathy/nanoGPT` and train a little GPT from scratch on character-level tokens using the GPUs of Polytechnique (should take 6 minutes, follow the README). You can also fine-tune an existing LLM on word-level tokens.

# References

https://magazine.sebastianraschka.com/p/understanding-and-coding-self-attention

## To know more

This (wrong) PyTorch tutorial contains a dataset for translation from [tatoeba.org](http://tatoeba.org/): https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html

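Flash Attention computes exact attention faster than the standard implementation. The computation is still $O(N^2)$ in the sequence length $N$, but it avoids materializing the $N \times N$ attention matrix in GPU memory, so memory usage is linear in $N$.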

https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file

Download llama.cpp or ollama (a friendlier Go wrapper over llama.cpp) to have a GPT implemented in C++ on your laptop. Some multimodal LLMs like Pixtral or Gemma 3 can accept text and images as input; you will see this next week.

<img src="https://huggingface.co/blog/assets/02_how-to-generate/beam_search.png" />

There is a lot of work these days on how to get better answers from an LLM by scaling test-time compute or by reasoning.

https://huggingface.co/blog/how-to-generate

https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute