# 01 - DISTILGPT2

We will get started by opening and exploring the distilgpt2 model: https://huggingface.co/distilbert/distilgpt2
This model is a distillation of OpenAI's GPT-2, but smaller (6 transformer blocks and 82M parameters instead of the 12 transformer blocks and 124M parameters of GPT-2).

In [1]:
import os

from huggingface_hub import login
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

In [2]:
# log in to Hugging Face; the access token is read from an environment variable
login(token=os.getenv("HUGGINGFACE_HUB_TOKEN"))

## Load the model from Hugging Face

In [3]:
tokenizer = AutoTokenizer.from_pretrained("distilgpt2", cache_dir=r"D:\hf_files")  
model = AutoModelForCausalLM.from_pretrained("distilgpt2", cache_dir=r"D:\hf_files")

## Explore tokenizer
LLMs don't break words into syllables or letters. They have their own way of splitting text, and each individual piece of the split text is a token. Different models use different token systems. A token can be a single word, a piece of a word, a single character or even more than one word. A handy way to explore how text is tokenized is OpenAI's tokenizer web app: https://platform.openai.com/tokenizer . It was created to help people visualize the tokenization output. ![image.png](attachment:cac6edcf-ea18-49dc-8fb8-7a25ee50fa86.png) <br> The token vocabulary is built when the tokenizer is trained, before the model itself is trained. Each token has its own ID. The very first step when using an LLM is to transform text into tokens (encode). The very last step is to convert the tokens of its response back to text (decode).

In [4]:
# tokenize string to see output
messages = "Hello, My friend!" 
tokens = tokenizer.encode(messages)  
print(tokens)           # list of token IDs  
print(len(tokens))      # number of tokens  

[15496, 11, 2011, 1545, 0]
5


In [5]:
# convert the token list back to a string
decoded_text = tokenizer.decode(tokens)  
print(decoded_text) 

Hello, My friend!


In [6]:
# note that "Hello" and "hello" are different tokens
messages = "Hello hello, My friend!" 
tokens = tokenizer.encode(messages)  
print(tokens)           # list of token IDs  
print(len(tokens))      # number of tokens  

[15496, 23748, 11, 2011, 1545, 0]
6


In [7]:
# even though both words start with a lowercase letter, they are different tokens:
# the first is at the start of the string, the second is preceded by whitespace
messages = "hello hello, My friend!" 
tokens = tokenizer.encode(messages)  
print(tokens)           # list of token IDs  
print(len(tokens))      # number of tokens  

[31373, 23748, 11, 2011, 1545, 0]
6


In [8]:
# with a leading space, both occurrences of "hello" are preceded by whitespace and map to the same token
messages = " hello hello, My friend!" 
tokens = tokenizer.encode(messages)  
print(tokens)           # list of token IDs  
print(len(tokens))      # number of tokens  

[23748, 23748, 11, 2011, 1545, 0]
6
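The encode and decode steps above can be imitated with a toy vocabulary. This is only a sketch: the seven token strings and IDs are copied from the outputs above, while `TOY_VOCAB`, `toy_encode` and `toy_decode` are hypothetical names, and the greedy longest-match split is a simplification of the byte-level BPE algorithm that the real GPT-2 tokenizer uses.

```python
# Toy vocabulary: token strings and IDs copied from the outputs above.
TOY_VOCAB = {
    "Hello": 15496,
    "hello": 31373,
    " hello": 23748,
    ",": 11,
    " My": 2011,
    " friend": 1545,
    "!": 0,
}
ID_TO_TOKEN = {token_id: token for token, token_id in TOY_VOCAB.items()}


def toy_encode(text):
    """Greedy longest-match split over TOY_VOCAB (a simplification, not real BPE)."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, so " hello" is matched as one piece.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no toy token matches the text at position {pos}")
    return ids


def toy_decode(ids):
    """Concatenate the token strings; the whitespace lives inside the tokens."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


print(toy_encode("hello hello, My friend!"))
print(toy_decode(toy_encode("Hello, My friend!")))
```

Because the leading space is part of the token string, `"hello"` and `" hello"` are separate vocabulary entries, which is why their IDs differ in the cells above.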


## Explore the model
LLMs are made of several layers with different roles. Let's go through these layers and their roles.

In [9]:
# see model's details
print(model)  

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-5): 6 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
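The shapes in this printout are enough to estimate the parameter count by hand. The sketch below makes two assumptions that the printout does not show: every `Conv1D` layer carries a bias vector, and `lm_head` shares its weight matrix with `wte` (weight tying), so it is not counted twice. The function names are hypothetical helpers.

```python
def conv1d_params(nx, nf):
    """Weights (nx * nf) plus one bias per output feature (bias is an assumption)."""
    return nx * nf + nf


def layer_norm_params(dim):
    """One scale and one shift value per feature (elementwise_affine=True)."""
    return 2 * dim


def gpt2_style_param_count(n_blocks, vocab=50257, n_positions=1024, dim=768):
    embeddings = vocab * dim + n_positions * dim                      # wte + wpe
    attention = conv1d_params(dim, 3 * dim) + conv1d_params(dim, dim)  # c_attn + c_proj
    mlp = conv1d_params(dim, 4 * dim) + conv1d_params(4 * dim, dim)    # c_fc + c_proj
    block = 2 * layer_norm_params(dim) + attention + mlp              # ln_1, ln_2, attn, mlp
    # lm_head is not counted: it is assumed to share its weights with wte.
    return embeddings + n_blocks * block + layer_norm_params(dim)     # + ln_f


print(f"6 blocks:  {gpt2_style_param_count(6):,}")
print(f"12 blocks: {gpt2_style_param_count(12):,}")
```

Under these assumptions the two results round to the 82M and 124M figures quoted at the top of the notebook.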


### wte (Word Token Embedding)
Each token has its own embedding vector. These embedding vectors were learned during training. <br> The details provided for this phase are: <br>`(wte): Embedding(50257, 768)` <br> From these details we can learn: <br> - there are 50257 tokens in the vocabulary (all text is mapped to these tokens)<br> - the embedding vectors have dimension 768 (embedding vectors have a fixed size)<br> Each token has a dense embedding vector of dimension 768. This layer works as a lookup table: the embedding vector of each input token is retrieved by its ID. The output is the embedding matrix, where each row represents one input token.

In [10]:
# encode again, this time returning a tensor as output
tokens = tokenizer(messages,  return_tensors="pt") 

# visualize the embedding matrix
# notice that repeated tokens retrieve the same embedding
token_embeds = model.transformer.wte(tokens["input_ids"])  # (batch, seq_len, 768)  
print(token_embeds)

tensor([[[ 0.0653, -0.1096,  0.1105,  ..., -0.0884,  0.1287,  0.0113],
         [ 0.0653, -0.1096,  0.1105,  ..., -0.0884,  0.1287,  0.0113],
         [ 0.0086, -0.0009,  0.0056,  ...,  0.0484, -0.0737, -0.0636],
         [ 0.0714,  0.1300,  0.1132,  ...,  0.1220, -0.0033,  0.0546],
         [-0.0640, -0.1471,  0.0998,  ..., -0.0914,  0.1967,  0.1006],
         [-0.1445, -0.0455,  0.0042,  ..., -0.1523,  0.0184,  0.0991]]],
       grad_fn=<EmbeddingBackward0>)


### wpe (Word Position Embedding)
The position of a token in the string is important, so this layer adds that information. Each position in the sequence has its own embedding vector.  <br> The details provided for this phase are: <br>`(wpe): Embedding(1024, 768)` <br> From these details we can learn: <br> - the maximum context window is 1024 tokens<br> - the position embedding vectors also have dimension 768 (embedding vectors have a fixed size)<br> This layer also works as a lookup table: the embedding vector of each position is retrieved by its index. The output is the embedding matrix, where each row represents one position.

In [11]:
# see embeddings of positions
position_ids = torch.arange(0, tokens['input_ids'].size(1)).unsqueeze(0)

pos_embeds   = model.transformer.wpe(position_ids)
print(pos_embeds)

tensor([[[-1.8821e-02, -1.9742e-01,  4.0267e-03,  ..., -4.3044e-02,
           2.8267e-02,  5.4490e-02],
         [ 2.3959e-02, -5.3792e-02, -9.4879e-02,  ...,  3.4170e-02,
           1.0172e-02, -1.5573e-04],
         [ 4.2161e-03, -8.4764e-02,  5.4515e-02,  ...,  1.9745e-02,
           1.9325e-02, -2.1424e-02],
         [-2.8337e-04, -7.3803e-02,  1.0553e-01,  ...,  1.0157e-02,
           1.7659e-02, -7.0854e-03],
         [ 7.6374e-03, -2.5090e-02,  1.2696e-01,  ...,  8.4643e-03,
           9.8542e-03, -7.0117e-03],
         [ 9.6023e-03, -3.3885e-02,  1.3123e-01,  ...,  5.8940e-03,
           7.1222e-03, -7.4742e-03]]], grad_fn=<EmbeddingBackward0>)


### wte + wpe
The outputs of wte and wpe are now added element-wise. Both are matrices of the same size: the number of rows is the number of tokens in the input and the number of columns is the embedding vector size. The output is also a matrix with the same dimensions.
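The two lookups and the sum can be sketched with tiny tables in plain Python. The tables `toy_wte` and `toy_wpe` and the function `embed` are hypothetical and use 2-dimensional vectors; the real tables are 50257 x 768 and 1024 x 768.

```python
# Toy lookup tables: 4 token IDs and 3 positions, each with a 2-dimensional vector.
toy_wte = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
toy_wpe = [[0.01, 0.02], [0.03, 0.04], [0.05, 0.06]]


def embed(token_ids):
    """Return one input vector per token: wte[token_id] + wpe[position]."""
    rows = []
    for position, token_id in enumerate(token_ids):
        token_vec = toy_wte[token_id]      # lookup by token ID
        position_vec = toy_wpe[position]   # lookup by position index
        rows.append([t + p for t, p in zip(token_vec, position_vec)])
    return rows


# The same token ID at two positions shares its token vector,
# but the summed rows differ because the position vectors differ.
rows = embed([2, 2, 0])
for row in rows:
    print(row)
```

This mirrors what the notebook shows: the first two rows of the token embedding matrix are identical, while the first two rows of the summed matrix are not.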

In [12]:
input_embeddings = token_embeds + pos_embeds
print(input_embeddings)

tensor([[[ 0.0465, -0.3070,  0.1146,  ..., -0.1315,  0.1570,  0.0658],
         [ 0.0893, -0.1634,  0.0157,  ..., -0.0543,  0.1389,  0.0112],
         [ 0.0128, -0.0856,  0.0601,  ...,  0.0681, -0.0544, -0.0850],
         [ 0.0711,  0.0562,  0.2187,  ...,  0.1322,  0.0143,  0.0475],
         [-0.0564, -0.1722,  0.2268,  ..., -0.0830,  0.2066,  0.0936],
         [-0.1349, -0.0794,  0.1355,  ..., -0.1464,  0.0255,  0.0916]]],
       grad_fn=<AddBackward0>)


### Dropout
Dropout is a regularization technique used to prevent overfitting by randomly setting a fraction of the input elements to zero during training. This forces the model to learn more robust features.
<br> The details provided for this phase are: <br>`(drop): Dropout(p=0.1, inplace=False)` <br> From these details we can learn: <br> - each element is dropped with probability 10%<br> - `inplace=False` means a new tensor is returned instead of overwriting the values of the existing one<br>
So, during training, each element of the input matrix is independently set to 0 with probability 0.1 (about 10% of the elements overall). The remaining values are scaled up by 1/(1-p) = 1/0.9 so that the expected value of each element stays the same. During inference, dropout is disabled.
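The behaviour described above is often called inverted dropout. A minimal sketch in plain Python follows; the function name and the `rng` parameter are hypothetical, and PyTorch's own implementation works on tensors rather than lists.

```python
import random


def dropout(values, p=0.1, training=True, rng=random):
    """Zero each element with probability p and scale the survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(values)  # inference: dropout is the identity
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * keep_scale for v in values]


rng = random.Random(0)          # fixed seed so the run is repeatable
x = [1.0] * 10_000
y = dropout(x, p=0.1, rng=rng)

zeroed = sum(1 for v in y if v == 0.0) / len(y)
mean = sum(y) / len(y)
print(f"fraction zeroed: {zeroed:.3f}, mean after dropout: {mean:.3f}")
```

The fraction of zeros stays close to 0.1 and the mean stays close to 1.0, which is the point of the rescaling.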

### Transformer Block
This part contains the transformer architecture. <br> The details provided for this phase are: <br>`(0-5): 6 x GPT2Block(` <br> This means there are 6 transformer blocks with the same architecture, called GPT2Block and indexed from 0 to 5. The sections below go through the layers of one block. <br> 
#### Normalization Layer 1 (ln_1)
<br> The first layer is a normalization layer and its details are: <br> `(ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)`<br> It means:<br> - The normalization is applied over the 768 elements of each token's vector <br> - A small constant (1e-5) is added to the variance in the denominator to avoid division by zero<br> - `elementwise_affine=True` adds learnable per-feature parameters: after the basic z-score normalization, each element is multiplied by gamma (scale) and shifted by beta (offset). Both gamma and beta are learned during training, which lets the normalization adapt to the data.<br> Normalizing the inputs to the attention layer keeps the activations well-behaved, which is important for the stability of deep networks like GPT.
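The computation behind `LayerNorm` can be written out for a single token vector. This is a plain-Python sketch of the standard formula, gamma * (x - mean) / sqrt(var + eps) + beta; the function name is hypothetical and the 4-element vector stands in for the real 768 features.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one vector to zero mean and unit variance, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # biased variance over the features
    inv_std = 1.0 / math.sqrt(var + eps)            # eps avoids division by zero
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]


x = [0.5, -1.0, 2.0, 0.0]

# With gamma=1 and beta=0 the result is the plain z-score.
normalized = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)

# Learned gamma and beta let the layer pick a different scale and offset.
scaled = layer_norm(x, gamma=[2.0] * 4, beta=[1.0] * 4)

print(normalized)
print(scaled)
```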



In [13]:
# Set model to evaluation mode (disable dropout for consistency)
model.eval()

# Register a forward hook to capture the LayerNorm output from the first block (block 0)
def hook(module, hook_input, output):
    print("LayerNorm input:", hook_input)
    print("\nLayerNorm output:", output)

# Attach the hook to the ln_1 of the first GPT2Block (index 0)
hook_handle = model.transformer.h[0].ln_1.register_forward_hook(hook)

# Forward pass through the model
with torch.no_grad():  # Disable gradient computation for efficiency
    outputs = model(tokens['input_ids'])

# Remove the hook after use
hook_handle.remove()

LayerNorm input: (tensor([[[ 0.0465, -0.3070,  0.1146,  ..., -0.1315,  0.1570,  0.0658],
         [ 0.0893, -0.1634,  0.0157,  ..., -0.0543,  0.1389,  0.0112],
         [ 0.0128, -0.0856,  0.0601,  ...,  0.0681, -0.0544, -0.0850],
         [ 0.0711,  0.0562,  0.2187,  ...,  0.1322,  0.0143,  0.0475],
         [-0.0564, -0.1722,  0.2268,  ..., -0.0830,  0.2066,  0.0936],
         [-0.1349, -0.0794,  0.1355,  ..., -0.1464,  0.0255,  0.0916]]]),)

LayerNorm output: tensor([[[ 0.0344, -0.1351,  0.0305,  ..., -0.0629,  0.0545,  0.0270],
         [ 0.0923, -0.1218, -0.0079,  ..., -0.0457,  0.0874,  0.0023],
         [ 0.0214, -0.0658,  0.0307,  ...,  0.0636, -0.0639, -0.0840],
         [ 0.0919,  0.0721,  0.1640,  ...,  0.1291, -0.0011,  0.0431],
         [-0.0565, -0.1489,  0.1654,  ..., -0.0777,  0.1644,  0.0825],
         [-0.1509, -0.0637,  0.0963,  ..., -0.1427,  0.0067,  0.0842]]])


#### Attention Layer
This layer implements the self-attention mechanism, a core part of the transformer architecture. The layer is responsible for allowing the model to focus on different parts of the input sequence when processing each token.<br>
Let's look at it in detail:<br>
`(attn): GPT2Attention(`<br>
`    (c_attn): Conv1D(nf=2304, nx=768)`<br>
`    (c_proj): Conv1D(nf=768, nx=768)`<br>
`    (attn_dropout): Dropout(p=0.1, inplace=False)`<br>
`    (resid_dropout): Dropout(p=0.1, inplace=False)`<br>
`)`<br>
 - `(c_attn): Conv1D(nf=2304, nx=768)`: Despite its name, Hugging Face's `Conv1D` acts as a linear (fully connected) layer that is applied to each token vector independently. It projects each 768-dimensional token vector to 2304 = 3 × 768 values, which are then split into three 768-dimensional vectors: Query (Q), Key (K) and Value (V). This is the attention component.

In [14]:
def hook2(module, hook_input, output):
    #print("Attention output shape:", output.shape)
    print("Attention output:", output)

# Attach hook to the attn layer of the first block
hook_handle = model.transformer.h[0].attn.register_forward_hook(hook2)

with torch.no_grad():
    model(tokens['input_ids'])

hook_handle.remove()

Attention output: (tensor([[[-0.0570,  0.1594, -0.4522,  ...,  0.0009,  0.0534,  0.0127],
         [-0.4796,  0.6074, -0.7725,  ..., -0.0054,  0.0342, -0.0250],
         [ 0.5455,  0.3538, -0.5408,  ...,  0.0220,  0.0340,  0.0448],
         [-0.0961,  0.2955, -0.3571,  ...,  0.0276,  0.0669,  0.0506],
         [ 0.8775,  0.3109, -0.1867,  ..., -0.0256,  0.0185,  0.0135],
         [ 0.8812, -0.3464,  0.3855,  ...,  0.0118,  0.0187,  0.0869]]]), None)


- `(c_proj): Conv1D(nf=768, nx=768)`: A second linear projection, applied to the result of the attention computation (the attention-weighted sum of the value vectors, with all heads concatenated). It produces a 768-dimensional vector, matching the input dimension of the block. This is the projection component.<br> The (c_proj) layer allows the model to learn a separate transformation for the attention output, which can help adjust the scale, emphasize certain features, or align the output with the input dimension for the residual connection (where the c_proj output is added to the input of the block).



In [15]:
def hook3(module, hook_input, output):
    print("c_proj output shape:", output.shape)
    print("c_proj output:", output)

hook_handle = model.transformer.h[0].attn.c_proj.register_forward_hook(hook3)

with torch.no_grad():
    model(tokens['input_ids'])

hook_handle.remove()

c_proj output shape: torch.Size([1, 6, 768])
c_proj output: tensor([[[-0.0570,  0.1594, -0.4522,  ...,  0.0009,  0.0534,  0.0127],
         [-0.4796,  0.6074, -0.7725,  ..., -0.0054,  0.0342, -0.0250],
         [ 0.5455,  0.3538, -0.5408,  ...,  0.0220,  0.0340,  0.0448],
         [-0.0961,  0.2955, -0.3571,  ...,  0.0276,  0.0669,  0.0506],
         [ 0.8775,  0.3109, -0.1867,  ..., -0.0256,  0.0185,  0.0135],
         [ 0.8812, -0.3464,  0.3855,  ...,  0.0118,  0.0187,  0.0869]]])


- `(attn_dropout): Dropout(p=0.1, inplace=False)`: This dropout is applied to the attention weights before they are used to compute the weighted sum of the value vectors (i.e., after the softmax((QK^T)/√d_k) step but before multiplying by V). It regularizes the attention mechanism by randomly setting 10% of the attention weight values to zero during training. This prevents the model from relying too heavily on any single attention relationship between tokens, encouraging more robust feature learning.<br>




 - `(resid_dropout): Dropout(p=0.1, inplace=False)`: This dropout is applied to the output of c_proj (the projected attention output, shape (batch_size, seq_len, 768)), after the attention computation and projection but before the residual connection is added. It regularizes the final output of the attention block by randomly setting 10% of the elements in the 768-dimensional vectors to zero during training. This helps prevent overfitting by introducing noise and ensuring the model doesn’t depend too much on specific features in the attention output.<br>

This completes the attention module.
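The core of the attention module, softmax((QK^T)/√d_k) multiplied by V, can be sketched for a single head in plain Python. The function names are hypothetical and the sketch leaves out c_attn, c_proj and both dropouts. It also assumes a causal mask (each token only sees itself and earlier tokens), which is how GPT-2 style models work but is not visible in the printout. In the real model the 768 features are split across several heads (12 heads of 64 dimensions in the GPT-2 default configuration, an assumption taken from the model config rather than from this notebook).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d_k)) V, where token i only attends to tokens 0..i."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Causal mask: only positions up to and including i are scored.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        out = [
            sum(w * values[j][d] for j, w in enumerate(weights))
            for d in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy example: 3 tokens, one head, head size 2 (arbitrary values).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = causal_attention(q, k, v)

for row in weights:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, so its output equals its own value vector; later tokens mix the value vectors of everything before them.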

#### Normalization Layer 2 (ln_2)
- `(ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)`: Another normalization layer. The (ln_2) layer is applied after the first residual connection, where the output of the GPT2Attention module (after resid_dropout) is added to the input of the block (the tensor that entered ln_1, before normalization). This normalized output is then fed into the feed-forward network (MLP) within the GPT2Block.<br> Like ln_1, ln_2 performs layer normalization to stabilize and accelerate training by normalizing the input across the 768 feature dimensions for each token. It ensures that the combined output from the attention block and residual connection has consistent statistical properties, which is crucial before passing it to the non-linear feed-forward layers.
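The way the two normalization layers and the two residual connections fit together can be sketched with plain functions. This assumes the pre-LayerNorm layout used by GPT-2, where each sub-layer reads a normalized copy of the data while the residual path carries the un-normalized values. `gpt2_block` and the stand-in functions are hypothetical and operate on a single vector.

```python
def gpt2_block(x, ln_1, attn, ln_2, mlp):
    """Pre-LayerNorm block: x + attn(ln_1(x)), then x + mlp(ln_2(x))."""
    x = [a + b for a, b in zip(x, attn(ln_1(x)))]  # first residual connection
    x = [a + b for a, b in zip(x, mlp(ln_2(x)))]   # second residual connection
    return x


# Hypothetical stand-ins for the real sub-layers.
def identity(v):
    return list(v)


def zeros(v):
    return [0.0] * len(v)


def double(v):
    return [2.0 * a for a in v]


# If both sub-layers output zeros, the block passes its input through unchanged.
print(gpt2_block([1.0, 2.0], identity, zeros, identity, zeros))

# Each sub-layer adds its output on top of what is already in the residual path.
print(gpt2_block([1.0, 2.0], identity, double, identity, double))
```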

In [16]:
def hook4(module, hook_input, output):
    print("ln_2 output shape:", output.shape)
    print("ln_2 output:", output)

# Attach hook to the ln_2 layer of the first block
hook_handle = model.transformer.h[0].ln_2.register_forward_hook(hook4)

with torch.no_grad():
    model(tokens['input_ids'])

hook_handle.remove()

ln_2 output shape: torch.Size([1, 6, 768])
ln_2 output: tensor([[[ 3.8605e-02,  3.3201e-02, -4.4046e-02,  ..., -1.4648e-01,
           2.3323e-01,  7.9626e-02],
         [-1.4356e-04,  1.2792e-01, -1.0325e-01,  ..., -7.9846e-02,
           1.5782e-01, -1.1437e-02],
         [ 9.1684e-02,  9.8509e-02, -6.1494e-02,  ...,  4.9615e-02,
          -1.8802e-02, -3.6460e-02],
         [ 3.4916e-02,  1.1486e-01, -1.0481e-02,  ...,  1.2145e-01,
           7.4712e-02,  7.1423e-02],
         [ 1.2014e-01,  7.8262e-02,  1.7161e-02,  ..., -1.3898e-01,
           1.9977e-01,  7.3096e-02],
         [ 1.1336e-01, -1.3552e-02,  9.4564e-02,  ..., -1.5752e-01,
           4.1171e-02,  1.3407e-01]]])


#### Feed-Forward Network (MLP)
This layer represents the feed-forward network (FFN) within each transformer block, applying a non-linear transformation to the data. Let's check it: <br>
`(mlp): GPT2MLP(`<br>
`      (c_fc): Conv1D(nf=3072, nx=768)`<br>
`      (c_proj): Conv1D(nf=768, nx=3072)`<br>
`      (act): NewGELUActivation()`<br>
`      (dropout): Dropout(p=0.1, inplace=False)`<br>
`)`<br>
<br>
 - `(c_fc): Conv1D(nf=3072, nx=768)`: A linear layer (Hugging Face's `Conv1D` class, which acts as a fully connected layer rather than a sliding convolution) that expands the 768-dimensional input to a 3072-dimensional intermediate representation. The input from ln_2 (shape batch_size, seq_len, 768) is fed into c_fc, where nx=768 is the input size and nf=3072 is the output size.
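For a single token vector, the Hugging Face `Conv1D` layer computes x @ W + b with a weight matrix of shape (nx, nf). The sketch below shows that computation in plain Python with toy sizes; the function name `conv1d` and the numbers are hypothetical.

```python
def conv1d(x, weight, bias):
    """Compute x @ weight + bias for one token vector; weight has shape (nx, nf)."""
    nx, nf = len(weight), len(weight[0])
    if len(x) != nx or len(bias) != nf:
        raise ValueError("shape mismatch")
    return [sum(x[i] * weight[i][j] for i in range(nx)) + bias[j] for j in range(nf)]


# Toy sizes: nx=2 input features expanded to nf=4 (the real c_fc maps 768 to 3072).
weight = [
    [1.0, 0.0, 2.0, -1.0],
    [0.0, 1.0, 0.5, 1.0],
]
bias = [0.0, 0.0, 0.1, 0.0]

expanded = conv1d([3.0, 4.0], weight, bias)
print(expanded)
```

The same weights are applied to every token position, which is why the sequence length (6 in this notebook) does not appear in the layer's shape.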

In [17]:
def hook5(module, hook_input, output):
    print("c_fc output shape:", output.shape)
    print("c_fc output:", output)

# Attach hook to the c_fc layer of the first block's MLP
hook_handle = model.transformer.h[0].mlp.c_fc.register_forward_hook(hook5)

with torch.no_grad():
    model(tokens['input_ids'])

hook_handle.remove()

c_fc output shape: torch.Size([1, 6, 3072])
c_fc output: tensor([[[ 0.5976, -0.3890, -0.7655,  ..., -2.8732, -1.8182,  0.6358],
         [ 0.8803,  0.5737,  0.1588,  ..., -1.5404, -0.9705,  0.1162],
         [ 0.7707, -0.6079, -0.6141,  ..., -1.4422, -0.8109,  0.3963],
         [ 0.5312, -0.1843, -0.6936,  ..., -1.0675, -0.8499,  0.4928],
         [ 0.9267, -0.0587, -0.0083,  ..., -1.5407, -0.3576, -0.7753],
         [ 0.6779, -0.7721,  0.0674,  ..., -2.0618, -0.8126,  0.4597]]])


 - `(act): NewGELUActivation()`: Non-linear activation. The 3072-dimensional output from c_fc is passed element by element through the NewGELUActivation function, a variant of the Gaussian Error Linear Unit (GELU). GELU introduces non-linearity and is defined as: `GELU(x)=x⋅Φ(x)` where Φ(x) is the cumulative distribution function of the standard normal distribution. The "NewGELU" variant (used in GPT-2) replaces Φ(x) with a tanh-based approximation for efficiency.
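Both forms of GELU are short enough to write with the standard `math` module. The sketch below compares the exact definition with the tanh approximation, using three of the c_fc values printed above as inputs; the function names are hypothetical.

```python
import math


def gelu_exact(x):
    """x * Phi(x), with Phi the standard normal CDF written via erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


for x in (0.5976, -0.3890, -2.8732):  # first c_fc values printed above
    print(f"x={x:8.4f}  exact={gelu_exact(x):8.4f}  tanh={gelu_tanh(x):8.4f}")
```

Small negative inputs are shrunk toward zero rather than cut off completely, which is the main difference from ReLU.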

In [18]:
def hook6(module, hook_input, output):
    print("act output shape:", output.shape)
    print("act output:", output)

# Attach hook to the activation of the first block's MLP
hook_handle = model.transformer.h[0].mlp.act.register_forward_hook(hook6)

with torch.no_grad():
    model(tokens['input_ids'])

hook_handle.remove()

act output shape: torch.Size([1, 6, 3072])
act output: tensor([[[ 0.4332, -0.1356, -0.1700,  ..., -0.0054, -0.0628,  0.4689],
         [ 0.7135,  0.4113,  0.0894,  ..., -0.0953, -0.1611,  0.0635],
         [ 0.6007, -0.1652, -0.1656,  ..., -0.1079, -0.1693,  0.2592],
         [ 0.3731, -0.0787, -0.1693,  ..., -0.1527, -0.1681,  0.3395],
         [ 0.7625, -0.0280, -0.0041,  ..., -0.0953, -0.1288, -0.1699],
         [ 0.5091, -0.1700,  0.0355,  ..., -0.0403, -0.1693,  0.3113]]])


 - `(c_proj): Conv1D(nf=768, nx=3072)`: The activated 3072-dimensional vector is then projected back to 768 dimensions using c_proj (nf=768, nx=3072). The weight matrix has 3072 × 768 = 2,359,296 entries (plus 768 bias terms, for a total of 2,360,064 parameters).

In [19]:
def hook7(module, hook_input, output):
    print("c_proj output shape:", output.shape)
    print("c_proj output:", output)

# Attach hook to the c_proj layer of the first block's MLP
# (dropout is inactive in eval mode, so this matches the GPT2MLP output in the next cell)
hook_handle = model.transformer.h[0].mlp.c_proj.register_forward_hook(hook7)

with torch.no_grad():
    model(tokens['input_ids'])

hook_handle.remove()

c_proj output shape: torch.Size([1, 6, 768])
c_proj output: tensor([[[-0.5530, -0.2582, -1.3098,  ..., -1.5001, -0.3564, -1.5608],
         [-1.4469,  0.3642, -1.3667,  ..., -0.5334,  0.1527, -2.2635],
         [-0.6052, -0.2670, -0.3868,  ...,  0.7127,  0.1794,  0.4032],
         [-0.1885,  0.0130,  0.5213,  ...,  0.0481,  0.6167, -0.6835],
         [-0.4523,  0.1644, -0.0204,  ..., -0.6527, -1.7580, -1.2175],
         [ 0.2289, -0.7968,  0.2175,  ..., -1.0243,  0.3720,  0.5569]]])


 - `(dropout): Dropout(p=0.1, inplace=False)`: A dropout layer with p=0.1 is applied to the 768-dimensional output, randomly setting 10% of the elements to zero during training. This regularizes the MLP output.

In [20]:
def hook8(module, hook_input, output):
    print("GPT2MLP output shape:", output.shape)
    print("GPT2MLP output:", output)

# Attach hook to the mlp layer of the first block
hook_handle = model.transformer.h[0].mlp.register_forward_hook(hook8)

with torch.no_grad():
    model(tokens['input_ids'])

hook_handle.remove()

GPT2MLP output shape: torch.Size([1, 6, 768])
GPT2MLP output: tensor([[[-0.5530, -0.2582, -1.3098,  ..., -1.5001, -0.3564, -1.5608],
         [-1.4469,  0.3642, -1.3667,  ..., -0.5334,  0.1527, -2.2635],
         [-0.6052, -0.2670, -0.3868,  ...,  0.7127,  0.1794,  0.4032],
         [-0.1885,  0.0130,  0.5213,  ...,  0.0481,  0.6167, -0.6835],
         [-0.4523,  0.1644, -0.0204,  ..., -0.6527, -1.7580, -1.2175],
         [ 0.2289, -0.7968,  0.2175,  ..., -1.0243,  0.3720,  0.5569]]])


The execution order of the steps is different from the order shown in the model details.<br>
Execution order inside a GPT2Block:<br>
Input → ln_1 → GPT2Attention → Residual Connection → ln_2 → GPT2MLP → Residual Connection → Output. <br>
Execution order inside GPT2MLP:<br>
ln_2 output → c_fc → NewGELUActivation → c_proj → dropout → Residual Addition.<br>
<br>
You can confirm the execution order by inspecting the source code, which has a `forward` method.<br>
```
def forward(self, x):
    hidden_states = self.c_fc(x)         # Project to 3072
    hidden_states = self.act(hidden_states)  # Apply activation
    hidden_states = self.c_proj(hidden_states)  # Project back to 768
    hidden_states = self.dropout(hidden_states)  # Apply dropout
    return hidden_states
```

Or you can inspect a forward pass, like the code below:

In [21]:
# code to read the execution order from a model run

# Dictionary to store the order of layer execution
execution_order = []
hooks_handles = []

# Hook function to record the order
def hook9(module, hook_input, output, layer_name):
    execution_order.append(layer_name)
    print(f"Processed: {layer_name}, Input shape: {hook_input[0].shape}, Output shape: {output.shape}")

# Attach hooks to each layer in the first block's MLP
mlp = model.transformer.h[0].mlp
for name, module in mlp.named_children():
    handle = module.register_forward_hook(lambda m, i, o, n=name: hook9(m, i, o, n))
    hooks_handles.append(handle)

# Run a forward pass to trigger the hooks
with torch.no_grad():
    model(tokens['input_ids'])

# Print the execution order
print("Execution order:", execution_order)

# Remove the hooks from each layer in the first block's MLP
for handle in hooks_handles:
    handle.remove()

Processed: c_fc, Input shape: torch.Size([1, 6, 768]), Output shape: torch.Size([1, 6, 3072])
Processed: act, Input shape: torch.Size([1, 6, 3072]), Output shape: torch.Size([1, 6, 3072])
Processed: c_proj, Input shape: torch.Size([1, 6, 3072]), Output shape: torch.Size([1, 6, 768])
Processed: dropout, Input shape: torch.Size([1, 6, 768]), Output shape: torch.Size([1, 6, 768])
Execution order: ['c_fc', 'act', 'c_proj', 'dropout']


### Final Normalization
The ln_f layer is applied to the output of the last GPT2Block (after the residual connection of the MLP in the sixth block). This normalized output is then passed to the (lm_head) layer for the final language modeling prediction.<br>
This layer normalizes the 768-dimensional embedding vectors across the feature dimension for each token in the sequence, ensuring the output is well-conditioned before the linear transformation to the vocabulary size (50257 in this case). It helps stabilize the final representation and prepares it for the prediction task (e.g., next token prediction).<br>

`(ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)`


In [22]:
# Hook function to capture the output
def hook10(module, hook_input, output):
    print("ln_f output shape:", output.shape)
    print("ln_f output:", output)

# Attach hook to the ln_f layer
hook_handle = model.transformer.ln_f.register_forward_hook(hook10)

# Run a forward pass to trigger the hook
with torch.no_grad():
    outputs = model(tokens['input_ids'])

# Remove the hook
hook_handle.remove()

ln_f output shape: torch.Size([1, 6, 768])
ln_f output: tensor([[[-2.5024e-02,  3.6233e-01, -1.1136e-01,  ..., -1.8130e-01,
           1.1482e-01, -1.8825e-01],
         [-1.7677e-01,  9.3993e-02, -1.9730e-01,  ..., -2.3405e-01,
           5.3724e-01, -2.4905e-01],
         [ 1.9197e-01, -3.7879e-02,  1.3627e-02,  ...,  2.1321e-02,
           1.8326e-01,  3.5414e-03],
         [ 6.2068e-01,  3.8346e-01,  8.5196e-01,  ..., -2.4001e-02,
           3.8335e-01, -4.9142e-01],
         [ 1.4832e-01,  1.2018e-01,  7.5178e-02,  ..., -7.7200e-02,
          -1.6340e-01, -5.0223e-04],
         [ 3.6321e-02, -1.6660e-01, -4.7713e-02,  ..., -2.3693e-02,
          -1.2833e-01,  5.3685e-02]]])


### Language Modeling Head (lm_head)
The lm_head layer is the final layer in the DistilGPT2 model, after the (ln_f) normalization. It takes the normalized output from ln_f (shape [batch_size, seq_len, 768]) and produces the final output of the model, which is a set of logits for language modeling. This layer is applied after the transformer stack to map the hidden states to the vocabulary space.<br>
It transforms the 768-dimensional hidden representation of each token into one score (logit) for each entry of the 50,257-token vocabulary, enabling the model to predict the next token in a sequence (e.g., for autoregressive tasks like text generation).<br>

`(lm_head): Linear(in_features=768, out_features=50257, bias=False)`


In [25]:


# Hook function to capture the output
def hook11(module, hook_input, output):
    print("lm_head output shape:", output.shape)
    print("lm_head output (first few logits):\n", output[:, :, :5]) 

# Attach hook to the lm_head layer
hook_handle = model.lm_head.register_forward_hook(hook11)

# Run a forward pass to trigger the hook
with torch.no_grad():
    outputs = model(tokens['input_ids'])

# Remove the hook
hook_handle.remove()

lm_head output shape: torch.Size([1, 6, 50257])
lm_head output (first few logits):
 tensor([[[-32.7786, -31.2604, -33.4580, -33.3068, -33.5693],
         [-41.9481, -47.0836, -48.3396, -48.4019, -47.3441],
         [-57.8325, -60.0321, -60.2201, -59.9341, -60.7340],
         [-46.5706, -48.6622, -49.4792, -48.8984, -47.7866],
         [-54.2288, -58.9350, -62.3507, -63.2325, -62.3509],
         [-64.1996, -66.1278, -66.1443, -66.4251, -67.9091]]])


The (lm_head) layer produces a tensor of shape (batch_size, seq_len, 50257), where each element is a logit. These logits are unnormalized scores (raw numbers) representing the model’s preference for each of the 50,257 tokens in the vocabulary at each position in the sequence.<br>
The logits are not probabilities. To convert them into a probability distribution, you need to apply a softmax function over the vocabulary dimension (axis 2) for each token position. The softmax normalizes the logits into values between 0 and 1 that sum to 1, interpretable as probabilities.<br>
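The softmax step can be sketched in plain Python. The five logits below are the first values of the last position printed above, so the resulting probabilities cover only those five entries, not the full 50,257-token vocabulary; the function name is a hypothetical helper.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# First five logits of the last position, copied from the output above.
logits = [-64.1996, -66.1278, -66.1443, -66.4251, -67.9091]
probs = softmax(logits)

# Adding the same constant to every logit leaves the probabilities unchanged,
# so large negative logits are not a sign of low confidence by themselves.
shifted = softmax([z + 100.0 for z in logits])

print([round(p, 4) for p in probs])
print(round(sum(probs), 6))
```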

In [28]:
# using output from previous execution
logits = outputs.logits

# Convert logits to probabilities
probabilities = torch.softmax(logits, dim=-1)

# Print probabilities for the last token
last_token_probs = probabilities[0, -1, :]  # Shape: [50257]
print("Probabilities for the last token (first 10):", last_token_probs[:10])

Probabilities for the last token (first 10): tensor([3.4514e-04, 5.0186e-05, 4.9366e-05, 3.7278e-05, 8.4521e-06, 7.9013e-06,
        4.9008e-05, 1.5883e-04, 2.4542e-05, 1.6336e-04])


To determine the next token, you need to identify the token with the highest probability (or sample from the distribution, depending on the strategy). Ordering the probabilities helps you rank the tokens.

In [29]:
# Get the index of the most likely token (greedy decoding)
next_token_id = torch.argmax(last_token_probs).item()
next_token = tokenizer.decode(next_token_id)
print(f"Most likely next token: {next_token}")

# Get top 5 tokens
top_k_values, top_k_indices = torch.topk(last_token_probs, k=5)
top_tokens = [tokenizer.decode(idx.item()) for idx in top_k_indices]
print(f"Top 5 next tokens and probabilities: {list(zip(top_tokens, top_k_values.tolist()))}")

Most likely next token: 

Top 5 next tokens and probabilities: [('\n', 0.40251022577285767), (' I', 0.11191854625940323), (' You', 0.021729281172156334), (' My', 0.018897606059908867), ('<|endoftext|>', 0.016403187066316605)]
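Greedy decoding and sampling can be compared on the five candidates above. This is a sketch only: the probabilities are rounded copies of the printed values, `greedy` and `top_k_sample` are hypothetical helpers, and top-k sampling is one common strategy among several.

```python
import random

# Top-5 candidates and (rounded) probabilities copied from the output above.
candidates = {
    "\n": 0.40251,
    " I": 0.11192,
    " You": 0.02173,
    " My": 0.01890,
    "<|endoftext|>": 0.01640,
}


def greedy(probs):
    """Always pick the most likely token."""
    return max(probs, key=probs.get)


def top_k_sample(probs, k, rng):
    """Keep the k most likely tokens, then sample one in proportion to its probability."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    tokens = [token for token, _ in top]
    weights = [p for _, p in top]
    # random.choices normalizes the weights internally.
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [top_k_sample(candidates, k=3, rng=rng) for _ in range(1000)]

print(repr(greedy(candidates)))
print({repr(token): samples.count(token) for token in set(samples)})
```

Greedy decoding always returns the newline here, while sampling sometimes picks " I" or " You", which is one way to get more varied text.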


In [42]:
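response = ''

# Greedy prediction at every position: entry i is the model's guess for the token after position i
for position_probs in probabilities[0]:
    next_token_id = torch.argmax(position_probs).item()
    next_token = tokenizer.decode(next_token_id)
    #print(f'next token is: {next_token}')
    response = response + next_token

print(response)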

 The, I name,



A few examples:

In [45]:
type(model)

transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel

In [67]:
def generate_recursive(input_ids, model, current_length = 0, max_tokens = 20, end_token_id = 50256):
    # Base cases: reached max_tokens or generated end token
    if current_length >= max_tokens:
        return input_ids
    with torch.no_grad():  # Disable gradient computation
        outputs = model(input_ids)
        logits = outputs.logits[:, -1, :]  # Get logits for the last token
        probabilities = torch.softmax(logits, dim=-1)
        next_token_id = torch.argmax(probabilities, dim=-1).unsqueeze(0)  # Greedy decoding

        # Check if end token is generated
        if next_token_id.item() == end_token_id:
            return torch.cat([input_ids, next_token_id], dim=-1)

        # Recursive call with updated input (pass the limits along so custom values are kept)
        new_input_ids = torch.cat([input_ids, next_token_id], dim=-1)
        return generate_recursive(new_input_ids, model, current_length + 1, max_tokens, end_token_id)
    

In [70]:
def generate_text(messages, model, tokenizer):
    input_ids = tokenizer.encode(messages, return_tensors="pt")
    generated_ids = generate_recursive(input_ids, model)
    generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=False)
    return generated_text
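The recursion above can also be written as a loop, which avoids growing the call stack for long generations. This sketch is an alternative to the notebook's function, not a copy of it: `generate_iterative` and `toy_next_token` are hypothetical, and `next_token_fn` stands in for "run the model and take the argmax of the last position".

```python
def generate_iterative(prompt_ids, next_token_fn, max_new_tokens=20, end_token_id=50256):
    """Append one token at a time until the limit or the end token is reached."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = next_token_fn(ids)   # the full sequence so far is the input
        ids.append(next_id)
        if next_id == end_token_id:
            break
    return ids


# Hypothetical stand-in for the model: count upward, then emit the end token.
def toy_next_token(ids):
    return ids[-1] + 1 if ids[-1] < 3 else 50256


print(generate_iterative([0], toy_next_token))
print(generate_iterative([0], toy_next_token, max_new_tokens=2))
```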
    

In [71]:
message = "Hello, my friend. How you doing?"

print(generate_text(message, model, tokenizer))

Hello, my friend. How you doing?






















In [72]:
message = "My favorite color is"

print(generate_text(message, model, tokenizer))

My favorite color is the red. I love the color of the red. I love the color of the red. I


In [73]:
message = "Cats are "

print(generate_text(message, model, tokenizer))

Cats are iced with a little bit of salt.














In [74]:
# Portuguese prompt, roughly "I speak Portuguese"
message = "Eu falo portugues "

print(generate_text(message, model, tokenizer))

Eu falo portugues ia falo portugues ia falo portugues ia falo portugues


In [75]:
# Portuguese prompt, roughly "Palmeiras does not have"
message = "Palmeiras não tem "

print(generate_text(message, model, tokenizer))

Palmeiras não tem été de la vida de la vida de la vida de la vida de la
