# Lab 2: GPT from scratch

In this lab, you will dive into the inner workings of the GPT architecture. You will walk through a complete implementation of the architecture in PyTorch, instantiate this implementation with pre-trained weights, and put the resulting model to the test by generating text. At the end of this lab, you will understand the building blocks of the GPT architecture and how they are connected.

*Tasks you can choose for the oral exam are marked with the graduation cap 🎓 emoji.*

In [200]:
from dataclasses import dataclass

import torch
import torch.nn as nn

## Part 1: GPT architecture

GPT-2 was first described by [Radford et al. (2019)](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf). To faithfully implement the model, one needs to also read the earlier paper by [Radford et al. (2018)](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf). Another important source of information is the official implementation, which is available on GitHub ([link](https://github.com/openai/gpt-2)).

The GPT architecture is made up of a stack of Transformer blocks. Each block has two main parts: one handles multi-head self-attention, and the other is a feed-forward network. Before these parts do their work, their input undergoes layer normalisation, and residual connections are added to help the model learn more effectively. The input to the architecture is a sequence of token IDs. These are turned into embeddings and augmented with information about the absolute position of each token in the sequence. The output layer converts the internal representations into logit scores for every token in the vocabulary.

### Model configuration

[Radford et al. (2019)](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) present four increasingly larger GPT models based on the same architecture. Here, we will implement the smallest of these, characterised by the following hyperparameters:

In [201]:
@dataclass
class Config:
    n_vocab: int = 50257
    n_ctx: int = 1024
    n_embd: int = 768
    n_head: int = 12
    n_layer: int = 12

#### 🎈 Task 2.01: Model configuration

Explain the purpose of these hyperparameters. In particular, where does the number 50,257 come from?

#### The number comes from the vocabolary ,50000is the vocab, 256 is the utf, 1 is the end of text token. ,embd are the embeddings layer, context size(ctx)this is the limit upto which we can feed the gpt the input level, including its response tokens. layer is the newral network layer ,,
#### 256 byte-level tokens.50,000 additional tokens learned during the BPE training process. 1 special token (usually an end-of-text token,
#### A context window of 1,024 means the model can handle sequences of up to 1,024 tokens at a time.


### GELU activation function

We start by implementing the feed-forward network. This is a standard two-layer network with a Gaussian Error Linear Unit (GELU) activation function ([Hendrycks and Gimpel, 2016](https://doi.org/10.48550/arXiv.1606.08415)).

The GELU is a smooth version of the rectified linear unit (ReLU) that weights inputs by their value under the cumulative distribution function of the standard Gaussian. This function is commonly denoted by $\Phi$. For example, $\text{GELU}(0{.}5) = 0{.}5 \cdot \Phi(0{.}5) \approx 0{.}5 \cdot 0{.}6915 = 0{.}3457$ because approximately 69.15% of normally distributed data lies to the left of $0{.}5$.

When GPT-2 was released, computing the GELU exactly was expensive, and the official implementation therefore used an approximation originally presented by [Choudhury (2014)](https://dx.doi.org/10.13189/ms.2014.020307). We follow suit here, as we want to create a replica of the original model. However, it is worth mentioning that PyTorch now offers an exact implementation of the GELU so fast that using an approximation is unnecessary.

In [202]:
def gelu(x):
    return 0.5 * x * (1 + torch.tanh((2 / torch.pi) ** 0.5 * (x + 0.044715 * x**3)))

#### 🎓 Task 2.02: Mathematical properties of the GELU

Find the minimal output value of the GELU and the input value for which it yields that output. Use a service such as [WolframAlpha](https://www.wolframalpha.com/) for the necessary derivations. What are the main differences between the GELU and the ReLU?

#### Using wolframalpha, GELU has local minima y~ -0.169971 for x~ -0.751792. In our understanding the GELU can be viewed as a smoother RELU,with the benefit that it is differentiable over all ranges. From the paper we also see that GELU in fact becomes RELU for sigma ->0 and mu=0.
#### Φ(x) is the cumulative distribution function (CDF) of the standard Gaussian distribution.
#### GELU: Commonly used in state-of-the-art models like GPT and BERT due to its smoothness and probabilistic nature. ReLU: Widely used in traditional neural networks for its simplicity and efficiency.

### Feed-forward network

Next, here is the code for the feed-forward network. Note that we follow the official codebase and use the name **multi-layer perceptron (MLP)** rather than “feed-forward network”.

In [203]:
class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, config.n_embd * 4)#config.n_embd * 4 is the input layer here
        self.c_proj = nn.Linear(config.n_embd * 4, config.n_embd)#config.n_embd is the outpt layer here

    def forward(self, x):
        batch_size, seq_len, n_embd = x.shape
        x = self.c_fc(x)#shape: {x,config.n_embd*4}
        x = gelu(x) #Shape remains unchanged, only values change.
        x = self.c_proj(x) #shape: {x,config.n_embd}
        return x

This code defines a Multi-Layer Perceptron (MLP), a small neural network used in models like GPT. It takes an input of shape (batch_size, seq_len, n_embd) (e.g., token embeddings). First, it expands the embedding size by 4 using a linear layer (self.c_fc). Then, it applies the GELU activation function to introduce non-linearity.It projects the embeddings back to the original size using another linear layer (self.c_proj). The output has the same shape as the input but with transformed values, helping the model learn complex patterns. This MLP is a key building block in transformer architectures.

#### 🎓 Task 2.03: Shape annotations

One of the most common errors in deep learning is a mismatch in tensor dimensions. To avoid this, it is good practice to annotate PyTorch code with shapes. For example, suppose you are given the following code:

In [204]:
f = nn.Linear(5, 7)
x = torch.rand(2, 3, 5)
y = f(x)

The annotation of this code with shapes would look as follows:

In [205]:
f = nn.Linear(5, 7)
# not a tensor variable; needs no annotation

x = torch.rand(2, 3, 5)
# shape of x: [2, 3, 5]

y = f(x)
# shape of y: [2, 3, 7]

Annotate the shapes in the `forward()` method of the feed-forward network. Instead of using actual numbers, refer to dimension sizes by symbolic names such as `n_embd`, `batch_size` (number of samples in a batch of input data) and `seq_len` (length of an input sequence). You can introduce additional names and other notation you find useful. Make your annotations as detailed as you need them to explain how the shapes change from one line to the next.

In [206]:
def forward(self, x):
    # Input shape: [batch_size, seq_len, n_embd]
    batch_size, seq_len, n_embd = x.shape

    # Apply the first linear layer (self.c_fc)
    # Input shape: [batch_size, seq_len, n_embd]
    # Output shape: [batch_size, seq_len, n_embd * 4]
    x = self.c_fc(x)

    # Apply the GELU activation function
    # Shape remains unchanged: [batch_size, seq_len, n_embd * 4]
    x = gelu(x)

    # Apply the second linear layer (self.c_proj)
    # Input shape: [batch_size, seq_len, n_embd * 4]
    # Output shape: [batch_size, seq_len, n_embd]
    x = self.c_proj(x)

    # Final output shape: [batch_size, seq_len, n_embd]
    return x

#### This code defines a Multi-Layer Perceptron (MLP) used in transformer models. The input x has shape [batch_size, seq_len, n_embd]. The first linear layer (self.c_fc) expands the embedding dimension to n_embd * 4. The GELU activation is applied, keeping the shape unchanged. The second linear layer (self.c_proj) projects it back to n_embd. The output shape remains [batch_size, seq_len, n_embd]. This MLP helps the model learn complex patterns by transforming input embeddings.

### Causal mask

Our next goal is to implement the core of the GPT architecture: the multi-head attention mechanism.

Recall that the attention mechanism in the Transformer decoder must be restricted to attending only to previously generated tokens. This type of attention is also called **causal attention**. In practice, we implement it through a masking technique that sets the post-softmax attention weights of future tokens to zero. The following utility function implements such a mask:

In [207]:
def make_causal_mask(n):
    return torch.triu(torch.full((n, n), float("-inf")), diagonal=1)

#### 🎈 Task 2.04: Causal mask

Have a close look at the following code and run it to see the result. What are the shapes of `x` and `mask`? Given that the shapes are different, why does the addition operation in the last line not raise an error? What is the shape of the result? How does the addition operation implement masking? (Recall that the attention scores are normalised using the softmax function.)

##### Shape of mask is 5,5 and x is 1,2,3,3. Shape of x remains unchanged after adding the mask. The addition doesn't raise an error due to broadcasting. 

In [208]:
x = torch.rand(1, 2, 3, 3)
x = torch.rand(1,2, 3, 3)
y = torch.rand(1,3,3)
y + x
print(x.shape)
mask = make_causal_mask(5)
print(mask)
print(x)
x = x + mask[:3, :3]
print(x)
print(mask.shape)
print(x.shape)

torch.Size([1, 2, 3, 3])
tensor([[0., -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf],
        [0., 0., 0., 0., -inf],
        [0., 0., 0., 0., 0.]])
tensor([[[[0.2496, 0.5228, 0.2746],
          [0.4343, 0.1225, 0.5783],
          [0.8770, 0.2533, 0.8164]],

         [[0.9336, 0.5080, 0.2660],
          [0.6637, 0.3988, 0.3612],
          [0.0786, 0.1919, 0.7157]]]])
tensor([[[[0.2496,   -inf,   -inf],
          [0.4343, 0.1225,   -inf],
          [0.8770, 0.2533, 0.8164]],

         [[0.9336,   -inf,   -inf],
          [0.6637, 0.3988,   -inf],
          [0.0786, 0.1919, 0.7157]]]])
torch.Size([5, 5])
torch.Size([1, 2, 3, 3])


##### The code creates a causal mask and applies it to a batch of sequences. The mask ensures that each token can only attend to itself and previous tokens, enforcing causality. Broadcasting allows the mask to be applied efficiently across the batch. After masking, the attention scores for future tokens are set to -inf, ensuring they are ignored during softmax normalization.

### Attention mechanism

Here is the code for the multi-head attention mechanism:

In [209]:
class Attention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.n_head = config.n_head
        self.c_attn = nn.Linear(config.n_embd, config.n_embd * 3)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        self.register_buffer("mask", make_causal_mask(config.n_ctx), persistent=False)

    def forward(self, x):
        print( x.shape)
        batch_size, seq_len, n_embd = x.shape
        head_embd = n_embd // self.n_head
        q, k, v = self.c_attn(x).chunk(3, dim=-1)
        print( q.shape)
        print( k.shape)
        print( v.shape)
        q = q.view(batch_size, seq_len, self.n_head, head_embd)
        k = k.view(batch_size, seq_len, self.n_head, head_embd)
        v = v.view(batch_size, seq_len, self.n_head, head_embd)
        print( q.shape)
        print( k.shape)
        print( v.shape)
        q = q.transpose(-2, -3)
        k = k.transpose(-2, -3)
        v = v.transpose(-2, -3)
        print( q.shape)
        print( k.shape)
        print( v.shape)
        x = q @ k.transpose(-1, -2)
        print(x.shape)
        x = x / head_embd**0.5
        print(x.shape)
        x = x + self.mask[:seq_len, :seq_len]
        print(x.shape)
        x = torch.softmax(x, dim=-1)
        print(x.shape)
        x = x @ v
        print(x.shape)
        x = x.transpose(-2, -3).contiguous()
        print(x.shape)        
        x = x.view(batch_size, seq_len, n_embd)
        print(x.shape)        
        x = self.c_proj(x)
        print(x.shape)
        return x

#### This code implements the multi-head attention mechanism with causal masking. It takes an input tensor, computes attention scores, applies the causal mask, and produces an output tensor of the same shape. This mechanism is crucial for enabling the model to focus on relevant parts of the input while ensuring causality in autoregressive tasks like text generation.

#### 🎓 Task 2.05: Multi-head attention

Trace the input `x` through the `forward()` method line by line and annotate the shapes of all tensor variables. Identify all lines that rely on broadcasting.

##### This code implements multi-head attention, a key part of models like GPT. It takes an input (x), splits it into query (Q), key (K), and value (V) vectors, and computes attention scores to determine how much each token should focus on others. A causal mask ensures tokens only attend to past tokens, preventing future information leakage. The scores are normalized, weighted, and combined to produce an output of the same shape as the input. This allows the model to focus on relevant parts of the input while maintaining causality.

### Layer normalisation

As mentioned above, the inputs to both the feed-forward network and the multi-head attention mechanism undergo **layer normalisation**. This normalises the inputs to have zero mean and unit variance across the activations. [Ba et al. (2016)](https://doi.org/10.48550/arXiv.1607.06450) introduce two trainable parameters (called $\gamma$ and $\beta$ in the paper) that allow the network to learn an appropriate scale and shift for the normalised values.

We implement layer normalisation as follows:

In [210]:
class LayerNorm(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.g = nn.Parameter(torch.ones(config.n_embd))
        self.b = nn.Parameter(torch.zeros(config.n_embd))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        variance = x.var(unbiased=False, dim=-1, keepdim=True)
        return self.g * (x - mean) / torch.sqrt(variance + 1e-05) + self.b

#### The LayerNorm class implements layer normalization, which normalizes inputs to have zero mean and unit variance across the activations. It uses two trainable parameters, self.g (scale) and self.b (shift), to allow the network to learn appropriate scaling and shifting of the normalized values.

##### keepdim=True:Ensures the mean and variance retain the same number of dimensions as the input x. Without it, the dimensions would be reduced, causing errors in subsequent operations like broadcasting.

##### 1e-05:A small constant added to the variance to avoid division by zero, ensuring numerical stability. Omitting it could lead to NaN values during training.

#### 🎈 Task 2.06: Layer normalisation

What is the relevance of the `keepdim=True` keyword argument in the `mean()` and `var()` functions? What would happen if we omitted it?

What is the relevance of the constant 1e-05? What could happen if we omitted it?

##### keepdim=True: Retains the dimensionality of the input tensor, ensuring proper broadcasting in the normalization step. Without it, the output would lose a dimension, causing shape mismatches.

##### 1e-05:Prevents division by zero in case the variance is very small. Omitting it could result in numerical instability or NaN values during training.

### Decoder block

We now combine the feed-forward network, the multi-head attention mechanism and the layer normalisation into a decoder block.

In [211]:
class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = LayerNorm(config)
        self.attn = Attention(config)
        self.ln_2 = LayerNorm(config)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x

#### 🎓 Task 2.07: Pre-norm and post-norm architectures

The original Transformer ([Vaswani et al., 2017](https://arxiv.org/abs/1706.03762)) is a “post-norm architecture”, where the normalisation is applied **after** each residual block. In contrast, GPT-2 is a “pre-norm architecture”, where the normalisation is applied **before**. Find the passage in Section&nbsp;2.3 of [Radford et al. (2019)](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) that reports on this modification.

[Xiong et al. (2020)](https://arxiv.org/pdf/2002.04745) compare pre-norm and post-norm architectures empirically. Read the abstract of their paper and summarise their main findings. According to these findings, what are the benefits of the pre-norm architecture?

Pre-norm architecture can be trained without a warmup stage, which in turn means less hyperparameter tuning. Also it converges faster, resulting in lessened training time while achieveing comparable results

### Model

We now have almost all components in place to complete the implementation of the GPT-2 model. The only thing  missing are the position embeddings. These simply associate an embedding vector with every position in the context window. To set them up, we first define another utility function:

In [212]:
def make_positions(n):
    return torch.arange(n, dtype=torch.long)

We then code the complete model as follows:

In [213]:
class Model(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.wte = nn.Embedding(config.n_vocab, config.n_embd)
        self.wpe = nn.Embedding(config.n_ctx, config.n_embd)
        self.h = nn.Sequential(*(Block(config) for _ in range(config.n_layer)))
        self.ln_f = LayerNorm(config)
        self.lm_head = nn.Linear(config.n_embd, config.n_vocab, bias=False)
        self.lm_head.weight = self.wte.weight #2.09
        self.register_buffer("pos", make_positions(config.n_ctx), persistent=False)

    def forward(self, x):
        batch_size, seq_len = x.shape
        wte = self.wte(x)
        wpe = self.wpe(self.pos[:seq_len])
        x = wte + wpe
        x = self.h(x)
        x = self.ln_f(x)
        x = self.lm_head(x)
        return x

#### 🎈 Task 2.08: Buffers

Our implementation registers the vector of positions as a buffer. (Earlier, we also registered the causal mask as a buffer.) Consult the PyTorch documentation to determine the benefits of registering a tensor as a buffer, in contrast to computing it in the `forward()` method.

Buffers are useful for non-trainable parameters, and ensure they stay aligned with the data. This means that e.g. if you don't use a buffer you might have that tensor on your cpu while the trainable data is on the gpu, leading to errors.
Buffer are subclass tensors

#### 🎓 Task 2.09: Number of trainable parameters

The model we have implemented is the smallest one presented by [Radford et al. (2019)](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf). But how many trainable parameters exactly does it have? Interestingly, the number originally reported by the authors is wrong. (What number did they report?)

Your task is to write code to compute the number of parameters yourself. This should only take 1–3 lines of code. What number do you get when you apply this code to a fresh model instance?

[Radford et al. (2019)](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) followed the original Transformers paper ([Vaswani et al., 2017](https://doi.org/10.48550/arXiv.1706.03762)) and shared the trainable weights between the token embedding and the final linear layer. Implement this weight sharing strategy. (Hint: This only requires one line of code.) Then, re-compute the number of trainable parameters for the modified model. What number do you get now? How large is the reduction caused by the weight sharing?

1.they have reported they have told is 117M and the layers is 12. Printing the nbr of params we got 163,037,184. After weight sharing we got 124,439,808 params. The difference is 38,597,376 ~39M


In [214]:
model = Model(Config)

params = sum(p.numel() for p in model.parameters())
print(params)

124439808


## Part 2: Load pre-trained weights

Now that you have a complete implementation of the GPT-2 model in place, you can instantiate it by loading the pre-trained weights released by OpenAI. These weights were originally provided in the TensorFlow format. For this lab, we have re-packaged them as a single file in NumPy’s `.npz` archive format ([link](http://www.ida.liu.se/~TDDE09/gpt-2-pretrained.npz)). Note that the file weighs in at 463&nbsp;MB. We can load it as follows:

In [215]:
import numpy as np

pretrained = np.load("/courses/TDDE09/gpt-2-pretrained.npz")

The contents of the archive are files in the `.npy` format, each containing a single NumPy array. When you print the file names, you will see that you can map them to the attributes of the network modules you have seen above. For example, the file `h0.attn.c_attn.b` corresponds to the biases (`b`) of the `c_attn` linear layer in the attention mechanism (`attn`) of the first transformer block (`h0`). We can verify that the array has the correct shape:

In [216]:
pretrained["h0.attn.c_attn.b"].shape

#for i in pretrained:
  #  print(i)
print("\n")
#print(model.h[0].attn.c_attn.weight)
#print(pretrained["wpe"].shape)
#print(model.wpe.weight.shape)
print(pretrained["h0.mlp.c_fc.b"].shape)
#print(model.h[0].attn.c_attn.weight.shape)
#print(pretrained["h0.attn.c_attn.w"].transpose)
print(model.h[0].mlp.c_proj.bias.shape)
#print(pretrained)
#h0.mlp.c_fc.b
#h0.mlp.c_fc.w
#h0.mlp.c_proj.b
#h0.mlp.c_proj.w




(3072,)
torch.Size([768])


#### 🎓 Task 2.10: Load pre-trained weights

Create a model from the pre-trained weights. To do this, you need to instantiate a fresh model and write the contents of each array from the `npz` archive with the pre-trained weights into the corresponding tensor. To make this a bit easier, here is a utility function that re-initialises a PyTorch tensor `target` with data from a NumPy array `source`:

In [217]:
def reinit(target: torch.Tensor, source: np.ndarray):
    assert source.shape == target.shape
    with torch.no_grad():
        target.copy_(torch.tensor(source, dtype=torch.float32))

You can start from this skeleton code:

In [218]:
def from_pretrained() -> Model:
    model = Model(Config())
    pretrained = np.load("/courses/TDDE09/gpt-2-pretrained.npz")
    # TODO: Implement the reinitialisation of the model's parameters
    reinit(model.wpe.weight, pretrained["wpe"])
    reinit(model.wte.weight, pretrained["wte"])
    index = [0,1,2,3,4,5,6,7,8,10,11]
    for i in index:
        reinit(model.h[i].attn.c_attn.bias,pretrained["h" + str(i) + ".attn.c_attn.b"])
        reinit(model.h[i].attn.c_attn.weight,pretrained["h" + str(i) + ".attn.c_attn.w"].transpose())
        reinit(model.h[i].attn.c_proj.bias,pretrained["h" + str(i) + ".attn.c_proj.b"])
        reinit(model.h[i].attn.c_proj.weight,pretrained["h" + str(i) + ".attn.c_proj.w"].transpose())
        reinit(model.h[i].ln_1.g, pretrained["h" + str(i) + ".ln_1.g"])
        reinit(model.h[i].ln_1.b, pretrained["h" + str(i) + ".ln_1.b"])
        reinit(model.h[i].ln_2.g, pretrained["h" + str(i) + ".ln_2.g"])
        reinit(model.h[i].ln_2.b, pretrained["h" + str(i) + ".ln_2.b"])
        reinit(model.h[i].mlp.c_fc.bias, pretrained["h" + str(i) + ".mlp.c_fc.b"])
        reinit(model.h[i].mlp.c_fc.weight, pretrained["h" + str(i) + ".mlp.c_fc.w"].transpose())
        reinit(model.h[i].mlp.c_proj.bias, pretrained["h" + str(i) + ".mlp.c_proj.b"])
        reinit(model.h[i].mlp.c_proj.weight, pretrained["h" + str(i) + ".mlp.c_proj.w"].transpose())


#h0.mlp.c_fc.b
#h0.mlp.c_fc.w
#h0.mlp.c_proj.b
#h0.mlp.c_proj.w

    return model
model = from_pretrained()

One technical detail to note is that PyTorch stores the weights of linear layers in a transposed form. For example, a linear layer created as `nn.Linear(2, 3)` has a weight matrix of shape [3, 2].

## Part 3: Put the model to use

In the third and final part of this lab, you will use the pre-trained model to generate text and evaluate it on a standard benchmark.

### Sampling-based text generation

The easiest way to generate text with a language model is by using a **greedy approach**. This method works by creating text one token at a time. At each step, the model takes the previously generated text (called the **context**) as input and adds the token with the highest output logit as a new token. The code in the next cell defines a function `generate()` that forms the core of a greedy generator:

In [219]:
def generate(model, context, context_size=1024, n_tokens=20):
    for _ in range(n_tokens):
        context = context[:, -context_size:]
        with torch.no_grad():
            logits = model(context)[:, -1, :]
        next_idx = torch.argmax(logits, dim=-1, keepdim=True)
        context = torch.cat([context, next_idx], dim=-1)
    return context

To use this function with an actual text input, you need a tokeniser to first encode the text into a vector of token IDs, and later decode the generated `context` into new text. The reference implementation of the GPT-2 tokeniser is in the library `tiktoken`. The code in the next cell sets up the tokeniser, loads the pretrained model from Task&nbsp;2.10, and then defines a helper function that handles the encoding and decoding.

In [220]:
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
model = from_pretrained()


def generate_helper(text, context_size=1024, n_tokens=20):
    context = torch.tensor([tokenizer.encode(text)], dtype=torch.long)
    context = generate(model, context, context_size=context_size, n_tokens=n_tokens)
    return tokenizer.decode(context[0].tolist())

You can use this helper function to generate text as follows:

In [221]:
generate_helper("how are   ")

torch.Size([1, 5, 768])
torch.Size([1, 5, 768])
torch.Size([1, 5, 768])
torch.Size([1, 5, 768])
torch.Size([1, 5, 12, 64])
torch.Size([1, 5, 12, 64])
torch.Size([1, 5, 12, 64])
torch.Size([1, 12, 5, 64])
torch.Size([1, 12, 5, 64])
torch.Size([1, 12, 5, 64])
torch.Size([1, 12, 5, 5])
torch.Size([1, 12, 5, 5])
torch.Size([1, 12, 5, 5])
torch.Size([1, 12, 5, 5])
torch.Size([1, 12, 5, 64])
torch.Size([1, 5, 12, 64])
torch.Size([1, 5, 768])
torch.Size([1, 5, 768])
torch.Size([1, 5, 768])
torch.Size([1, 5, 768])
torch.Size([1, 5, 768])
torch.Size([1, 5, 768])
torch.Size([1, 5, 12, 64])
torch.Size([1, 5, 12, 64])
torch.Size([1, 5, 12, 64])
torch.Size([1, 12, 5, 64])
torch.Size([1, 12, 5, 64])
torch.Size([1, 12, 5, 64])
torch.Size([1, 12, 5, 5])
torch.Size([1, 12, 5, 5])
torch.Size([1, 12, 5, 5])
torch.Size([1, 12, 5, 5])
torch.Size([1, 12, 5, 64])
torch.Size([1, 5, 12, 64])
torch.Size([1, 5, 768])
torch.Size([1, 5, 768])
torch.Size([1, 5, 768])
torch.Size([1, 5, 768])
torch.Size([1, 5, 768])


'how are    the the the the the the the the the the the the the the the the the the the the'

**Tip:** If you did not manage to complete Task&nbsp;2.10, you can still work on this task by using a pretrained GPT-2 model from [Hugging Face](https://huggingface.co/openai-community/gpt2). The next code cell shows how you would instantiate this model. Note that you may have to first install the `transformers` library.

In [222]:
#from transformers import GPT2LMHeadModel
#model = GPT2LMHeadModel.from_pretrained("gpt2")
#logits = model(context).logits[:, -1, :]

#### 🎓 Task 2.11: Sampling-based text generation

 The greedy approach to text generation is not very interesting for practical applications because it always chooses the most likely token, leading to predictable and less creative results. Your task is to modify the code for the `generate()` function to use a **sampling-based approach** instead. In this approach, the next token is chosen randomly based on the probabilities assigned by the model (softmax-normalised logits), treating them as a categorical distribution over the token vocabulary. Additionally, your code should include two common techniques to improve sampling:
 
 * **temperature scaling**, which lets the user control the randomness of the sampling
 * **top-$k$ sampling**, which limits the sampling to the top-$k$ most likely tokens, ignoring less probable ones

### Evaluating the pretrained model

If you have experimented with your pretrained GPT-2 model, you will have noticed that its ability to generate useful text is somewhat limited. By today’s standards, GPT-2 is a small model with modest capabilities. However, it can still be helpful for certain tasks, such as text autocompletion, generating filler text, or answering simple questions. To rigourosly evaluate language models, researchers often use standard benchmark datasets. Creating these benchmarks is a discipline of its own, and they tend to become increasingly challenging as models continue to improve.

In the final task of this lab, you will evaluate GPT-2’s performance on a small subset of the [HellaSwag dataset](https://rowanzellers.com/hellaswag/), which was published in the same year as GPT-2 itself (2019). HellaSwag is designed to test a model’s ability to perform commonsense reasoning in challenging contexts. Unlike simpler benchmarks, HellaSwag presents scenarios where the correct text completion depends on semantic relationships between events and on world knowledge. This makes it a good choice for assessing the ability of language models to go beyond surface-level patterns and produce meaningful, context-aware predictions.

#### 🎓 Task 2.12: Evaluating the pretrained model

Read the [HellaSwag website](https://rowanzellers.com/hellaswag/) to get some background on the benchmark. How does a sample from the dataset look like? What is an expected prediction? How does the benchmark allow us to score models? What is the random baseline? What is the human performance reported on the task?

The next cell contains code for evaluating your pretrained model on a small sample from HellaSwag. You will also need a tokenizer. The HellaSwag subset is in the file `hellaswag-mini.jsonl`. Inspect that file to understand the format. Next, read the code and explain how it works. Specifically, how does the code compute the score of individual endings? In the call to `cross_entropy()`, why are the tensors sliced in this specific way?

Finally, what overall score does the pretrained GPT-2 model get on this benchmark? How does that score compare to the random baseline and the human performance?

In [223]:
import json

with open("hellaswag-mini.jsonl") as f:
    n_correct = 0
    n_total = 0
    for line in f:
        sample = json.loads(line)
        prefix = tokenizer.encode(sample["ctx"])
        ending_scores = []
        for i, ending in enumerate(sample["endings"]):
            suffix = tokenizer.encode(" " + ending)
            context = torch.tensor([prefix + suffix], dtype=torch.long)
            with torch.no_grad():
                logits = model(context)
                ending_score = torch.nn.functional.cross_entropy(
                    logits[0, -len(suffix) - 1 : -1], context[0, -len(suffix) :]
                )
            ending_scores.append((ending_score, i))
        predicted = min(ending_scores)[1]
        n_correct += int(predicted == sample["label"])
        n_total += 1
    print(f"Accuracy: {n_correct / n_total:.2%}")

torch.Size([1, 79, 768])
torch.Size([1, 79, 768])
torch.Size([1, 79, 768])
torch.Size([1, 79, 768])
torch.Size([1, 79, 12, 64])
torch.Size([1, 79, 12, 64])
torch.Size([1, 79, 12, 64])
torch.Size([1, 12, 79, 64])
torch.Size([1, 12, 79, 64])
torch.Size([1, 12, 79, 64])
torch.Size([1, 12, 79, 79])
torch.Size([1, 12, 79, 79])
torch.Size([1, 12, 79, 79])
torch.Size([1, 12, 79, 79])
torch.Size([1, 12, 79, 64])
torch.Size([1, 79, 12, 64])
torch.Size([1, 79, 768])
torch.Size([1, 79, 768])
torch.Size([1, 79, 768])
torch.Size([1, 79, 768])
torch.Size([1, 79, 768])
torch.Size([1, 79, 768])
torch.Size([1, 79, 12, 64])
torch.Size([1, 79, 12, 64])
torch.Size([1, 79, 12, 64])
torch.Size([1, 12, 79, 64])
torch.Size([1, 12, 79, 64])
torch.Size([1, 12, 79, 64])
torch.Size([1, 12, 79, 79])
torch.Size([1, 12, 79, 79])
torch.Size([1, 12, 79, 79])
torch.Size([1, 12, 79, 79])
torch.Size([1, 12, 79, 64])
torch.Size([1, 79, 12, 64])
torch.Size([1, 79, 768])
torch.Size([1, 79, 768])
torch.Size([1, 79, 768])
tor

**🥳 Congratulations on finishing lab&nbsp;2!**