# Coding an LLM architecture
## This notebook is inspired by Sebastian Raschka's book: https://github.com/rasbt/LLMs-from-scratch/tree/main/setup

In [42]:
from importlib.metadata import version
print("torch version:", version("torch"))
print("tiktoken version:", version("tiktoken"))

torch version: 2.8.0+cu126
tiktoken version: 0.11.0


In [43]:
import sys
sys.path.append('/content/')
import supplementary


- Compared to conventional deep learning models, LLMs are larger, mainly due to their vast number of parameters, not the amount of code
- We'll see that many elements are repeated in an LLM's architecture
- In this notebook, we consider embedding and model sizes akin to a small GPT-2 model
- We'll specifically code the architecture of the smallest GPT-2 model (124 million parameters), as outlined in Radford et al.'s [Language Models are Unsupervised Multitask Learners](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) (note that the initial report lists it as 117M parameters, but this was later corrected in the model weight repository)
- The next notebook will show how to load pretrained weights into our implementation, which will be compatible with model sizes of 345, 762, and 1542 million parameters
- Models such as Llama are very similar to this one, since they are built on the same core concepts



- Configuration details for the 124 million parameter GPT-2 model (GPT-2 "small") include:

In [None]:
GPT_CONFIG_124M = {
    "vocab_size": 50257,    # Vocabulary size
    "context_length": 1024, # Context length
    "emb_dim": 768,         # Embedding dimension
    "n_heads": 12,          # Number of attention heads
    "n_layers": 12,         # Number of layers
    "drop_rate": 0.0,       # Dropout rate
    "qkv_bias": False       # Query-Key-Value bias
}
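
- As a rough sanity check on the "124 million" figure, the parameter count can be estimated directly from the configuration. The sketch below is plain arithmetic, not a measurement of the actual model. It assumes that each `TransformerBlock` follows the usual GPT-2 layout (bias-free Q/K/V projections, an output projection with bias, a feed-forward network that is 4x as wide as the embedding, and two layer norms with scale and shift); the real block lives in `supplementary` and may differ.

```python
# Back-of-the-envelope parameter count for the GPT-2 "small" configuration.
# Assumption: the per-block layout below mirrors the usual GPT-2 design;
# it is not read from the TransformerBlock class in `supplementary`.

cfg = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_layers": 12,
}

d = cfg["emb_dim"]

tok_emb = cfg["vocab_size"] * d          # token embedding table
pos_emb = cfg["context_length"] * d      # positional embedding table

qkv = 3 * d * d                          # Q, K, V projections (qkv_bias=False)
attn_out = d * d + d                     # attention output projection + bias
ffn = (d * 4 * d + 4 * d) + (4 * d * d + d)  # two linear layers with biases
norms = 2 * (2 * d)                      # two layer norms (scale + shift)
per_block = qkv + attn_out + ffn + norms

final_norm = 2 * d                       # final layer norm (scale + shift)
out_head = d * cfg["vocab_size"]         # output projection (bias=False)

total = tok_emb + pos_emb + cfg["n_layers"] * per_block + final_norm + out_head
total_tied = total - out_head            # if out_head shared the tok_emb weights

print(f"Parameters per transformer block: {per_block:,}")
print(f"Total parameters:                 {total:,}")
print(f"Total with tied output weights:   {total_tied:,}")
```

- Under these assumptions, the total is about 163 million parameters, because the model defined in this notebook uses a separate output layer. The original GPT-2 reuses the token embedding matrix as the output layer (weight tying), and the tied count of about 124 million matches the model's name.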


# Coding the GPT model


- Note that the transformer block is repeated multiple times; in the case of the smallest 124M GPT-2 model, we repeat it 12 times.
- The corresponding code implementation, where `cfg["n_layers"] = 12`:

In [None]:
import torch
import torch.nn as nn
from supplementary import TransformerBlock, LayerNorm


class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])

        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])

        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds  # Shape: [batch_size, seq_len, emb_dim]
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits
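
- The first step of `forward` adds a `(seq_len, emb_dim)` positional tensor to a `(batch_size, seq_len, emb_dim)` token tensor, which relies on broadcasting. The following minimal sketch traces the shapes; the toy sizes are chosen for illustration only and are not the GPT-2 values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes for illustration only (not the GPT-2 configuration)
vocab_size, context_length, emb_dim = 10, 6, 4

tok_emb = nn.Embedding(vocab_size, emb_dim)
pos_emb = nn.Embedding(context_length, emb_dim)

in_idx = torch.tensor([[1, 5, 2],
                       [7, 0, 3]])           # (batch_size=2, seq_len=3)
batch_size, seq_len = in_idx.shape

tok_embeds = tok_emb(in_idx)                  # (2, 3, 4): one vector per token
pos_embeds = pos_emb(torch.arange(seq_len))   # (3, 4): one vector per position

# Broadcasting adds the same positional vectors to every sequence in the batch
x = tok_embeds + pos_embeds                   # (2, 3, 4)

print("tok_embeds:", tok_embeds.shape)
print("pos_embeds:", pos_embeds.shape)
print("sum:       ", x.shape)

# The positional contribution is identical for both sequences
same_offsets = torch.allclose(
    x[0] - tok_embeds[0], x[1] - tok_embeds[1], atol=1e-6
)
print("Same positional offsets for each sequence:", same_offsets)
```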

- Using the configuration of the 124M parameter model, we can now instantiate this GPT model with random initial weights; first, we tokenize two example sentences into an input batch:

In [None]:
import torch
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

batch = []

txt1 = "SysML is an exciting topic!"
txt2 = "We are enjoying this class activity!"

batch.append(torch.tensor(tokenizer.encode(txt1)))
batch.append(torch.tensor(tokenizer.encode(txt2)))
batch = torch.stack(batch, dim=0)
print(batch)

tensor([[44387,  5805,   318,   281,  7895,  7243,     0],
        [ 1135,   389, 13226,   428,  1398,  3842,     0]])


In [None]:
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)

out = model(batch)
print("Input batch:\n", batch)
print("\nOutput shape:", out.shape)
print(out)

Input batch:
 tensor([[44387,  5805,   318,   281,  7895,  7243,     0],
        [ 1135,   389, 13226,   428,  1398,  3842,     0]])

Output shape: torch.Size([2, 7, 50257])
tensor([[[ 0.1617, -0.1764,  0.3950,  ...,  0.6123, -0.2504,  0.0080],
         [ 0.9303,  0.6303, -0.1192,  ..., -0.7093, -0.7461,  0.6131],
         [ 1.4983, -0.6454,  0.0533,  ...,  0.3796,  0.0106, -0.5612],
         ...,
         [ 1.2678,  0.5919, -0.4522,  ..., -0.8229,  0.1071, -0.5686],
         [-0.5783,  0.2095, -0.1544,  ..., -0.4177,  0.5092, -0.2287],
         [ 0.2756,  0.9296,  0.4836,  ..., -0.0557,  0.0341, -0.6004]],

        [[ 0.8183, -0.0958, -0.3817,  ...,  0.0393,  0.5270, -0.8108],
         [ 0.8268, -0.1798, -0.4543,  ...,  0.0599,  0.7603, -0.0481],
         [ 1.1810, -0.0887, -0.1735,  ..., -0.3513,  0.6742, -0.2838],
         ...,
         [ 0.7055,  0.0464, -0.2255,  ..., -0.7350, -0.6667, -0.4484],
         [-0.0459,  0.4305, -0.0806,  ..., -0.1916,  0.8593, -0.9296],
         [ 0.32

- We will train this model in the next notebook


# Generating text

- LLMs like the GPT model we implemented above generate text one token at a time
- The following `generate_text_simple` function implements greedy decoding, which is a simple and fast method to generate text
- In greedy decoding, at each step, the model chooses the token with the highest probability as its next output (softmax preserves the ordering of the logits, so the highest logit corresponds to the highest probability, and we technically wouldn't even have to compute the softmax function explicitly)
- The figure below depicts how the GPT model, given an input context, generates the next word token
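
- The remark above that softmax can be skipped under greedy decoding can be checked with a few lines of code. This is a minimal sketch in plain Python (no PyTorch needed); the `softmax` and `argmax` helpers and the logit values are made up for illustration.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    # Index of the largest entry
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical logits over a 5-token vocabulary
logits = [1.2, -0.7, 3.4, 0.1, 2.9]
probas = softmax(logits)

print("argmax of logits:       ", argmax(logits))
print("argmax of probabilities:", argmax(probas))
```

- Both calls select the same index, since softmax preserves the ordering of its inputs; the probabilities only matter once we sample from the distribution instead of taking the maximum.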

In [None]:
def generate_text_simple(model, idx, max_new_tokens, context_size):
    # idx is (batch, n_tokens) array of indices in the current context
    for _ in range(max_new_tokens):

        # Crop the current context if it exceeds the supported context size
        # E.g., if the LLM supports only 5 tokens and the context is 10 tokens
        # long, then only the last 5 tokens are used as context
        idx_cond = idx[:, -context_size:]

        # Get the predictions
        with torch.no_grad():
            logits = model(idx_cond)

        # Focus only on the last time step
        # (batch, n_tokens, vocab_size) becomes (batch, vocab_size)
        logits = logits[:, -1, :]

        # Apply softmax to get probabilities
        probas = torch.softmax(logits, dim=-1)  # (batch, vocab_size)

        # Get the idx of the vocab entry with the highest probability value
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  # (batch, 1)

        # Append the selected index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)  # (batch, n_tokens+1)

    return idx

- The `generate_text_simple` function above implements an iterative process that generates one token at a time
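
- The cropping step `idx[:, -context_size:]` is worth seeing in isolation. The short sketch below uses made-up token IDs and a deliberately tiny context size to show that the slice keeps only the most recent tokens and leaves shorter inputs unchanged.

```python
import torch

# Made-up token IDs 0..9 with a batch dimension: shape (1, 10)
idx = torch.arange(10).unsqueeze(0)

context_size = 5  # toy value; the notebook passes cfg["context_length"]

# Longer than the context size: only the last 5 tokens are kept
idx_cond = idx[:, -context_size:]
print(idx_cond)    # tensor([[5, 6, 7, 8, 9]])

# Shorter than the context size: the slice returns the input unchanged
short = idx[:, :3]
short_cond = short[:, -context_size:]
print(short_cond)  # tensor([[0, 1, 2]])
```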



# Exercise: Generate some text

1. Use the `tokenizer.encode` method to prepare some input text
2. Then, convert the resulting token IDs into a PyTorch tensor via `torch.tensor`
3. Add a batch dimension via `.unsqueeze(0)`
4. Use the `generate_text_simple` function to have the GPT generate some text based on your prepared input text
5. The output from step 4 will be token IDs; convert them back into text via the `tokenizer.decode` method

In [None]:
model.eval();  # disable dropout



# Solution

In [None]:
start_context = "Hello, I am"

encoded = tokenizer.encode(start_context)
print("encoded:", encoded)

encoded_tensor = torch.tensor(encoded).unsqueeze(0)
print("encoded_tensor.shape:", encoded_tensor.shape)

encoded: [15496, 11, 314, 716]
encoded_tensor.shape: torch.Size([1, 4])


In [None]:
out = generate_text_simple(
    model=model,
    idx=encoded_tensor,
    max_new_tokens=6,
    context_size=GPT_CONFIG_124M["context_length"]
)

print("Output:", out)
print("Output length:", len(out[0]))

Output: tensor([[15496,    11,   314,   716, 27018, 24086, 47843, 30961, 42348,  7267]])
Output length: 10


- Remove batch dimension and convert back into text:

In [None]:
decoded_text = tokenizer.decode(out.squeeze(0).tolist())
print(decoded_text)

Hello, I am Featureiman Byeswickattribute argue


- Note that the model is untrained, which is why the generated text above is random
- We will train the model in the next notebook