## GPT Config

In [1]:
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "emb_dim": 768,          # Embedding dimension
    "n_heads": 12,           # Number of attention heads
    "n_layers": 12,          # Number of layers
    "drop_rate": 0.1,        # Dropout rate
    "qkv_bias": False        # Query-Key-Value bias
}
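
A few quantities follow directly from this configuration. The sketch below derives the per-head dimension and the sizes of the two embedding tables. It assumes one token-embedding table and one learned positional-embedding table, each of width `emb_dim`, as in GPT-2; the notebook does not show that layout explicitly.

```python
# Restated here so the example is self-contained.
GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}

cfg = GPT_CONFIG_124M

# emb_dim must divide evenly across the attention heads.
assert cfg["emb_dim"] % cfg["n_heads"] == 0
head_dim = cfg["emb_dim"] // cfg["n_heads"]

# Assumed layout: one row per token ID, one row per position.
tok_emb_params = cfg["vocab_size"] * cfg["emb_dim"]
pos_emb_params = cfg["context_length"] * cfg["emb_dim"]

print("head_dim:", head_dim)                           # 64
print("token embedding params:", tok_emb_params)       # 38597376
print("positional embedding params:", pos_emb_params)  # 786432
```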

In [2]:
import numpy
import torch
import re
import sys
import tiktoken


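def to_batch_tensor(batch):
    """Convert a list of token-ID tensors (or nested lists of IDs) into one tensor."""
    # Wrapped in a function because `batch` is only created in the Tokenization cell below.
    if isinstance(batch, list):
        if isinstance(batch[0], torch.Tensor):
            batch = torch.stack(batch)  # requires equal-length sequences
        else:
            batch = torch.tensor(batch, dtype=torch.long)
    return batch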

## Tokenization

In [None]:
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
batch = []
txt1 = "Every effort moves you"
txt2 = "Every day holds a"

batch.append(torch.tensor(tokenizer.encode(txt1)))
batch.append(torch.tensor(tokenizer.encode(txt2)))

print(batch)

[tensor([6109, 3626, 6100,  345]), tensor([6109, 1110, 6622,  257])]


## Testing DummyGPTModel - test 1

In [None]:
import importlib
import dummy_gpt_model
importlib.reload(dummy_gpt_model)
from dummy_gpt_model import DummyGPTModel

torch.manual_seed(123)
model = DummyGPTModel(GPT_CONFIG_124M)
batch = torch.stack(batch) if isinstance(batch, list) else batch
logits = model(batch)
print("Output shape:", logits.shape)
print(logits)

Output shape: torch.Size([2, 4, 50257])
tensor([[[-0.9289,  0.2748, -0.7557,  ..., -1.6070,  0.2702, -0.5888],
         [-0.4476,  0.1726,  0.5354,  ..., -0.3932,  1.5285,  0.8557],
         [ 0.5680,  1.6053, -0.2155,  ...,  1.1624,  0.1380,  0.7425],
         [ 0.0447,  2.4787, -0.8843,  ...,  1.3219, -0.0864, -0.5856]],

        [[-1.5474, -0.0542, -1.0571,  ..., -1.8061, -0.4494, -0.6747],
         [-0.8422,  0.8243, -0.1098,  ..., -0.1434,  0.2079,  1.2046],
         [ 0.1355,  1.1858, -0.1453,  ...,  0.0869, -0.1590,  0.1552],
         [ 0.1666, -0.8138,  0.2307,  ...,  2.5035, -0.3055, -0.3083]]],
       grad_fn=<UnsafeViewBackward0>)


## Layer Normalization

Layer Normalization is a crucial component in modern transformer architectures like GPT for several key reasons.

In short, **it stabilizes the training process and helps the model learn more efficiently.**

Here's a breakdown of why it's so important:

### 1. Stabilizes Hidden State Values (Activations)

As data passes through many layers of a deep network, activations can grow very large or shrink toward zero. This can lead to:
- **Vanishing Gradients**: Activations and the gradients that flow back through them shrink toward zero, so weight updates become negligible and the model effectively stops learning.
- **Exploding Gradients**: Activations grow very large, and gradients can overflow to `inf` or `NaN`. The resulting updates corrupt the model's weights, and training fails.

Layer Normalization combats this by **rescaling the activations within each layer**, independently for every token of every training example. Before the learnable scale and shift are applied, it gives each token's feature vector a mean of 0 and a standard deviation of 1, keeping values in a controlled, "well-behaved" range.
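
Written out for one feature vector x with d features, the rescaling described above takes the following form. The symbols are standard notation rather than names from this notebook: ε is a small constant for numerical stability, and γ and β are the learnable scale and shift.

```latex
\mu = \frac{1}{d}\sum_{i=1}^{d} x_i, \qquad
\sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2, \qquad
y_i = \gamma_i \, \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_i
```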

### 2. Speeds Up Training

By keeping activations stable, Layer Norm allows you to use **higher learning rates** without the training process becoming unstable. A higher learning rate lets the model take larger update steps and often converge in fewer iterations, which can reduce training time.

### 3. Reduces Sensitivity to Initialization

Without normalization, the initial weights of the model must be chosen very carefully. A bad initialization can cause the network to fail from the start. Layer Norm makes the model much more robust to the choice of initial weights.

### How It Works in Your Code

In your `dummy_gpt_model.py`, Layer Normalization is planned in two key places:

1.  **Inside the Transformer Block**: As the `TODO` comments in `dummy_gpt_model.py` suggest, it's applied *before* the self-attention and feed-forward sub-layers. This is known as the **"Pre-LN"** architecture, which is common in GPT-style models. It ensures the inputs to these sub-layers are well-scaled.

2.  **Before the Final Output**: The model normalizes the output of the last transformer block before it's fed to the final linear layer (`out_head`) to produce logits. This provides one last stabilization step before the final prediction.
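
The Pre-LN ordering can be sketched as follows. This is a hypothetical block, not the contents of `dummy_gpt_model.py` (which is not shown here): the class name `PreLNBlock` is invented for illustration, and a plain `nn.Linear` stands in for self-attention so the example stays short. The only point is where the normalization sits relative to each sub-layer and its residual connection.

```python
import torch
import torch.nn as nn


class PreLNBlock(nn.Module):
    """Hypothetical Pre-LN block; the sub-layers are placeholders."""

    def __init__(self, emb_dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(emb_dim)
        self.norm2 = nn.LayerNorm(emb_dim)
        # Stand-in for self-attention (assumption: any shape-preserving layer works here).
        self.attn = nn.Linear(emb_dim, emb_dim)
        self.ff = nn.Sequential(
            nn.Linear(emb_dim, 4 * emb_dim),
            nn.GELU(),
            nn.Linear(4 * emb_dim, emb_dim),
        )

    def forward(self, x):
        # Pre-LN: normalize first, run the sub-layer, then add the residual.
        x = x + self.attn(self.norm1(x))
        x = x + self.ff(self.norm2(x))
        return x


torch.manual_seed(123)
block = PreLNBlock(emb_dim=768)
x = torch.randn(2, 4, 768)  # (batch, tokens, emb_dim)
out = block(x)
print(out.shape)  # same shape as the input
```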

### LayerNorm vs. BatchNorm

You might have heard of Batch Normalization. The key difference is the dimension they normalize over:
- **BatchNorm**: Normalizes across the **batch dimension**. It computes a mean and variance for each feature using all examples in the batch. This works well for CNNs but can be tricky for variable-length sequences in LMs.
- **LayerNorm**: Normalizes across the **feature dimension** for *each individual token*. Its statistics do not depend on the batch size or sequence length, which is ideal for transformers.
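
The difference in axes is easy to see on a toy example. The sketch below uses plain Python lists with made-up values, not PyTorch tensors, so the arithmetic stays visible. It computes only the means; the variances are taken along the same axes.

```python
from statistics import mean

# Toy activations: 2 examples (rows) x 3 features (columns).
batch = [
    [1.0, 2.0, 3.0],
    [4.0, 6.0, 8.0],
]

# LayerNorm-style statistics: one mean per row (per example or token).
layer_means = [mean(row) for row in batch]

# BatchNorm-style statistics: one mean per column (per feature, across the batch).
batch_means = [mean(col) for col in zip(*batch)]

print("per-row means:", layer_means)     # [2.0, 6.0]
print("per-column means:", batch_means)  # [2.5, 4.0, 5.5]
```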

## Layer Normalization - Samples

In [5]:
import torch.nn as nn

torch.manual_seed(123)
torch.set_printoptions(sci_mode=False)

print("--- set up tensor sample for showing mean and variance")
batch_example = torch.randn(2, 5)
layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())
out = layer(batch_example)
print(out)

print("--- mean and variance")
mean = out.mean(dim=-1, keepdim=True)
var = out.var(dim=-1, keepdim=True)

print("Mean:\n", mean)
print("Variance:\n", var)

print("--- Normalizing layer output")
out_norm = (out - mean) / torch.sqrt(var)
mean = out_norm.mean(dim=-1, keepdim=True)
var = out_norm.var(dim=-1, keepdim=True)
print("Normalized layer output:\n", out_norm)
print("Mean:\n", mean)
print("Variance:\n", var)


--- set up tensor sample for showing mean and variance
tensor([[0.2260, 0.3470, 0.0000, 0.2216, 0.0000, 0.0000],
        [0.2133, 0.2394, 0.0000, 0.5198, 0.3297, 0.0000]],
       grad_fn=<ReluBackward0>)
--- mean and variance
Mean:
 tensor([[0.1324],
        [0.2170]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[0.0231],
        [0.0398]], grad_fn=<VarBackward0>)
--- Normalizing layer output
Normalized layer output:
 tensor([[ 0.6159,  1.4126, -0.8719,  0.5872, -0.8719, -0.8719],
        [-0.0189,  0.1121, -1.0876,  1.5173,  0.5647, -1.0876]],
       grad_fn=<DivBackward0>)
Mean:
 tensor([[    0.0000],
        [    0.0000]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[1.0000],
        [1.0000]], grad_fn=<VarBackward0>)
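
As a hand check, the first row of the printed output can be re-normalized with the standard library alone. The values below are copied from the rounded printout, so results agree with the tensors above only to about three decimals. `statistics.variance` divides by n - 1, which matches the default of `Tensor.var`.

```python
from math import sqrt
from statistics import mean, variance

# First row of the printed layer output (rounded to 4 decimals).
row = [0.2260, 0.3470, 0.0000, 0.2216, 0.0000, 0.0000]

mu = mean(row)
var = variance(row)  # sample variance (divides by n - 1)

# Same formula as the cell above: subtract the mean, divide by the standard deviation.
normalized = [(v - mu) / sqrt(var) for v in row]

print(round(mu, 4), round(var, 4))
print([round(v, 3) for v in normalized])
```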


## LayerNorm - Testing

In [22]:
import torch.nn as nn
import layer_norm

# Needed only if we are changing LayerNorm file.
import importlib
importlib.reload(layer_norm)

torch.manual_seed(123)
torch.set_printoptions(sci_mode=False)

batch_example = torch.randn(2, 5)

print("--- with unbiased = False (default)")
ln = layer_norm.LayerNorm(emb_dim=5)
out_ln = ln(batch_example)

print("Mean:\n", out_ln.mean(dim=-1, keepdim=True))
print("Variance:\n", out_ln.var(dim=-1, keepdim=True))

print("--- with unbiased = True")
ln_bias = layer_norm.LayerNorm(emb_dim=5, unbiased=True)
out_ln_bias = ln_bias(batch_example)
print("Mean:\n", out_ln_bias.mean(dim=-1, keepdim=True))
print("Variance:\n", out_ln_bias.var(dim=-1, keepdim=True))


--- with unbiased = False (default)
Mean:
 tensor([[    -0.0000],
        [     0.0000]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[1.2499],
        [1.2500]], grad_fn=<VarBackward0>)
--- with unbiased = True
Mean:
 tensor([[    -0.0000],
        [     0.0000]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[1.0000],
        [1.0000]], grad_fn=<VarBackward0>)
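
The 1.25 in the first result is expected rather than a bug, assuming `LayerNorm` with `unbiased=False` divides by n when computing the variance. The check itself calls `Tensor.var` with its default, which divides by n - 1, so the measured variance is inflated by n / (n - 1) = 5 / 4. The printed 1.2499 is slightly lower because of the `eps` term. A minimal sketch of that arithmetic in plain Python, with arbitrary example values:

```python
from math import sqrt
from statistics import mean, pvariance, variance

x = [0.5, -1.0, 2.0, 0.0, 1.5]  # arbitrary 5-feature example (hypothetical values)
n = len(x)

mu = mean(x)
biased_var = pvariance(x)  # divides by n, like var(unbiased=False)

# Normalize with the biased variance (eps omitted to keep the arithmetic exact).
normalized = [(v - mu) / sqrt(biased_var) for v in x]

# Measuring with the sample variance (divides by n - 1) inflates the result.
print(variance(normalized))   # close to n / (n - 1) = 1.25
print(pvariance(normalized))  # close to 1.0
```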


In [17]:

# Re-using batch_example from the previous cell
ln = layer_norm.LayerNorm(emb_dim=5)

# Manually compute the normalized output without scale and shift
mean = batch_example.mean(dim=-1, keepdim=True)
var = batch_example.var(dim=-1, keepdim=True, unbiased=False)
norm_x = (batch_example - mean) / torch.sqrt(var + ln.eps)

print("Mean of normalized output (before scale and shift):\n", norm_x.mean(dim=-1, keepdim=True))
print("Variance of normalized output (before scale and shift):\n", norm_x.var(dim=-1, keepdim=True, unbiased=False))

# Note: The variance is calculated with unbiased=False to match PyTorch's LayerNorm implementation for variance calculation.
# Using unbiased=False for the final variance check confirms that the normalization step itself produces a variance of 1.


Mean of normalized output (before scale and shift):
 tensor([[    -0.0000],
        [     0.0000]])
Variance of normalized output (before scale and shift):
 tensor([[1.0000],
        [1.0000]])
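
The `layer_norm.LayerNorm` module used in these cells is not shown. The following is a minimal sketch of what such a class might look like, inferred only from how it is called above (`emb_dim`, `unbiased`, and an `eps` attribute). The parameter names `scale` and `shift` and the value `eps=1e-5` are assumptions. With `unbiased=False` it should agree with PyTorch's built-in `nn.LayerNorm`.

```python
import torch
import torch.nn as nn


class LayerNorm(nn.Module):
    """Hypothetical stand-in for layer_norm.LayerNorm, inferred from its usage."""

    def __init__(self, emb_dim, unbiased=False, eps=1e-5):  # eps value is an assumption
        super().__init__()
        self.eps = eps
        self.unbiased = unbiased
        self.scale = nn.Parameter(torch.ones(emb_dim))   # learnable scale, starts at 1
        self.shift = nn.Parameter(torch.zeros(emb_dim))  # learnable shift, starts at 0

    def forward(self, x):
        # Statistics are taken over the last (feature) dimension.
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=self.unbiased)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift


torch.manual_seed(123)
x = torch.randn(2, 5)
ours = LayerNorm(emb_dim=5)(x)
reference = nn.LayerNorm(5)(x)  # PyTorch's built-in, which uses the biased variance
print(torch.allclose(ours, reference, atol=1e-5))
```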
