<a href="https://colab.research.google.com/github/hollyemblem/raschka-llm-from-scratch/blob/main/chapter4_implementing_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Coding LLM Architecture

A high-level view of the architecture required for predicting one word (token) at a time.

Text -> Tokenized text -> Embeddings -> One or more Transformer blocks with multi-head attention -> Output layers

### Parameter Size and Changes

In the previous chapters' notebooks, we used small embedding dimensions so that we could calculate components such as the attention mechanism by hand. Now we're scaling up to the smallest GPT-2 model, with 124 million parameters, as described in [the GPT-2 paper](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) (note: the paper originally states 117 million, but this was later corrected).

#### What are parameters?
Parameters are trainable weights that are adjusted and optimised during training to minimise the loss function.

#### How do we calculate the number of parameters?

Imagine we have a neural network with a 2048 x 2048 trainable weight matrix. Every value in this matrix is a parameter, so we have 2048 x 2048 = 4,194,304 parameters.
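
A quick way to check this count in PyTorch is to sum `numel()` over a module's parameters. The sketch below assumes a bias-free `nn.Linear` layer as a stand-in for the 2048 x 2048 weight matrix; the layer choice is illustrative and not taken from the book.

```python
import torch.nn as nn

# Hypothetical layer: a 2048 x 2048 weight matrix with no bias term
layer = nn.Linear(2048, 2048, bias=False)

# Every entry of every trainable tensor counts as one parameter
num_params = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"{num_params:,}")  # 4,194,304
```

The same one-liner works on a whole model, since `parameters()` walks all submodules.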

## GPT Config


In [17]:
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "emb_dim": 768,          # Embedding dimension
    "n_heads": 12,           # Number of attention heads
    "n_layers": 12,          # Number of layers
    "drop_rate": 0.1,        # Dropout rate
    "qkv_bias": False        # Query-Key-Value bias
}
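
As a rough sanity check on the config, the sizes of the two embedding tables and the output projection follow directly from these values. This is a minimal sketch that only multiplies dimensions; the variable names are illustrative, and it covers just these three weight matrices, not the full 124M parameter count.

```python
# Values copied from GPT_CONFIG_124M
cfg = {"vocab_size": 50257, "context_length": 1024, "emb_dim": 768}

tok_emb_params = cfg["vocab_size"] * cfg["emb_dim"]      # one row per token
pos_emb_params = cfg["context_length"] * cfg["emb_dim"]  # one row per position
out_head_params = cfg["emb_dim"] * cfg["vocab_size"]     # bias-free projection to logits

print(f"token embedding:    {tok_emb_params:,}")   # 38,597,376
print(f"position embedding: {pos_emb_params:,}")   # 786,432
print(f"output head:        {out_head_params:,}")  # 38,597,376
```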

In [34]:
### Implementing a placeholder GPT class

import torch
import torch.nn as nn

class DummyGPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        # An embedding layer is essentially a lookup operation: it retrieves
        # rows from its weight matrix by index (token ID or position).
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        # Placeholder transformer blocks, one per layer
        self.trf_blocks = nn.Sequential(
            *[DummyTransformerBlock(cfg) for _ in range(cfg["n_layers"])]
        )
        # Placeholder layer norm
        self.final_norm = DummyLayerNorm(cfg["emb_dim"])
        # Output head: maps each emb_dim (768) vector to vocab_size (50257) logits
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        """Describe the data flow through the model.

        Computes token and positional embeddings for the input indices,
        applies dropout, passes the data through the transformer blocks,
        applies normalization, and produces logits with the linear output layer.
        """
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(
            torch.arange(seq_len, device=in_idx.device)
        )
        x = tok_embeds + pos_embeds
        # Shape is (batch_size, seq_len, emb_dim); the last dimension is 768
        print("Hidden representation shape:", x.shape)
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits


class DummyTransformerBlock(nn.Module):
    """Placeholder transformer block that returns its input unchanged."""

    def __init__(self, cfg):
        super().__init__()

    def forward(self, x):
        # Placeholder forward pass
        return x


class DummyLayerNorm(nn.Module):
    """Placeholder layer norm that returns its input unchanged."""

    def __init__(self, normalized_shape, eps=1e-5):
        # The arguments only mimic the LayerNorm interface; they are unused here
        super().__init__()

    def forward(self, x):
        return x
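
The `tok_embeds + pos_embeds` step in `forward` relies on broadcasting: the positional embeddings have shape `(seq_len, emb_dim)` and are added to every sample of the `(batch, seq_len, emb_dim)` token embeddings. A minimal sketch with small, made-up dimensions (the sizes and token IDs are illustrative only, not the GPT-2 config):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, context_length, emb_dim = 10, 6, 4  # toy sizes

tok_emb = nn.Embedding(vocab_size, emb_dim)
pos_emb = nn.Embedding(context_length, emb_dim)

in_idx = torch.tensor([[1, 2, 3], [4, 5, 6]])        # (batch=2, seq_len=3)
tok_embeds = tok_emb(in_idx)                         # (2, 3, 4)
pos_embeds = pos_emb(torch.arange(in_idx.shape[1]))  # (3, 4)

# The (3, 4) positional tensor is broadcast across the batch dimension
x = tok_embeds + pos_embeds
print(tok_embeds.shape, pos_embeds.shape, x.shape)
```

Every sample in the batch receives the same positional vectors, because position 0 means the same thing in each sequence.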

### Tokenising Text

In [27]:
!pip install tiktoken



In [31]:
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
batch = []
txt1 = "Every effort moves you"
txt2 = "Every day holds a"

batch.append(torch.tensor(tokenizer.encode(txt1)))
batch.append(torch.tensor(tokenizer.encode(txt2)))
batch = torch.stack(batch, dim=0)
print(batch)

tensor([[6109, 3626, 6100,  345],
        [6109, 1110, 6622,  257]])


In [35]:
# Initialise the dummy GPT model and feed it the tokenized batch.
# The output tensor has two rows corresponding to the two text samples.
# Each text sample consists of four tokens; each token position holds a
# 50,257-dimensional vector of logits, one score per vocabulary token.

torch.manual_seed(123)
model = DummyGPTModel(GPT_CONFIG_124M)
logits = model(batch)
print("Output shape:", logits.shape)
print(logits)


Hidden representation shape: torch.Size([2, 4, 768])
Output shape: torch.Size([2, 4, 50257])
tensor([[[-1.2034,  0.3201, -0.7130,  ..., -1.5548, -0.2390, -0.4667],
         [-0.1192,  0.4539, -0.4432,  ...,  0.2392,  1.3469,  1.2430],
         [ 0.5307,  1.6720, -0.4695,  ...,  1.1966,  0.0111,  0.5835],
         [ 0.0139,  1.6755, -0.3388,  ...,  1.1586, -0.0435, -1.0400]],

        [[-1.0908,  0.1798, -0.9484,  ..., -1.6047,  0.2439, -0.4530],
         [-0.7860,  0.5581, -0.0610,  ...,  0.4835, -0.0077,  1.6621],
         [ 0.3567,  1.2698, -0.6398,  ..., -0.0162, -0.1296,  0.3717],
         [-0.2407, -0.7349, -0.5102,  ...,  2.0057, -0.3694,  0.1814]]],
       grad_fn=<UnsafeViewBackward0>)


In [None]:
batch.shape

In [23]:
batch

tensor([[6109, 3626, 6100,  345],
        [6109, 1110, 6622,  257]])

In [24]:
# The tokenizer creates a batch of 2 samples with 4 token IDs each.
# This batch is passed to the embedding layers inside DummyGPTModel.
# For each of the 4 token positions, the model outputs a 50,257-dimensional
# vector of logits, matching the vocabulary size of tiktoken's GPT-2 BPE encoding.
#
# token ids
#    ↓
# embedding lookup
#    ↓
# (batch, seq, 768)   ← internal representation
#    ↓
# linear projection
#    ↓
# (batch, seq, 50257) ← logits

In [25]:
# The token embedding matrix: one row per vocabulary token, one column per embedding dimension
print(model.tok_emb.weight.shape)

torch.Size([50257, 768])


### Thinking about the shape of the inputs and outputs

One of the slightly confusing elements of the book is how this is stated:

> The output tensor has two rows corresponding to the two text samples. Each text sample consists of four tokens; each token is a 50,257-dimensional vector, which matches the size of the tokenizer’s vocabulary.

> The embedding has 50,257 dimensions because each of these dimensions refers to a unique token in the vocabulary. When we implement the postprocessing code, we will convert these 50,257-dimensional vectors back into token IDs, which we can then decode into words.

This can be a bit difficult to follow. We know that we start from 768-dimensional embeddings and a vocabulary of 50,257 tokens, but it's not really clear what happens in between to produce the output logits.

Working through the logic, a linear layer maps each 768-dimensional hidden vector to a 50,257-dimensional vector of token scores.

Internally, the model works in the 768-dimensional semantic space. To 'predict', it then looks at the token scores across all 50,257 vocabulary entries.
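
The step from hidden vectors to token scores can be sketched in isolation. The snippet below uses a random tensor as a stand-in for the model's hidden representation and a freshly initialised, untrained linear layer, so the selected IDs are meaningless; it only illustrates the shapes and the fact that the layer is a matrix multiplication. Taking the `argmax` at the last position is one simple way to pick a token, shown here as an assumption rather than as the book's postprocessing code.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)
emb_dim, vocab_size = 768, 50257

hidden = torch.randn(2, 4, emb_dim)  # stand-in for the (batch, seq, 768) hidden states
out_head = nn.Linear(emb_dim, vocab_size, bias=False)

logits = out_head(hidden)          # (2, 4, 50257): one score per vocabulary token
same = hidden @ out_head.weight.T  # the bias-free layer is just a matrix multiplication

# Highest-scoring token ID at the last position of each sample
next_ids = logits[:, -1, :].argmax(dim=-1)
print(logits.shape, next_ids.shape)
```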

A useful thing to remember for this GPT-2 example is the shape of the token embedding weight matrix:

```
print(model.tok_emb.weight.shape)
```
This is 50,257 rows (one row per possible token) by 768 columns (one column per embedding dimension).
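
Because each row belongs to one token, passing token IDs to the embedding layer amounts to selecting rows of this matrix. A small sketch with toy sizes (the dimensions and IDs are made up for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb = nn.Embedding(num_embeddings=10, embedding_dim=4)  # toy vocabulary of 10 tokens

token_ids = torch.tensor([[2, 5, 7], [0, 5, 9]])  # (batch=2, seq=3)

looked_up = emb(token_ids)       # calling the embedding layer
indexed = emb.weight[token_ids]  # plain row indexing into the weight matrix

print(looked_up.shape)                  # torch.Size([2, 3, 4])
print(torch.equal(looked_up, indexed))  # True
```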

Adapting from Chapter 3:

> "In other words, the embedding layer is essentially a lookup operation that retrieves rows from the embedding layer’s weight matrix via a token ID."

The embedding dimension "represents the embedding size, transforming each token into a 768-dimensional vector".

Therefore, the token embedding layer holds a 50,257 x 768 weight matrix. Looking up the token IDs of a batch yields a (batch, seq, 768) tensor, which passes through the rest of the model and then a final linear layer to produce the 'logits'. In the batch example, the logits have two rows corresponding to the two text samples. Each text sample consists of four tokens, and each token position holds a 50,257-dimensional vector of scores, which matches the size of the tokenizer's vocabulary.


