# Learn GPT Architecture - StriveLM

An educational walkthrough of building a character-level GPT from scratch.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/muhdaldiansyah/strivelm/blob/main/learn_gpt.ipynb)

---

## What You'll Learn

1. **Token & Positional Embeddings** - How text becomes numbers
2. **Transformer Architecture** - Self-attention mechanism
3. **Multi-Head Attention** - Query, Key, Value (QKV)
4. **Layer Normalization** - Stabilizing training
5. **Residual Connections** - Gradient flow
6. **Feed-Forward Networks** - Processing representations
7. **Training & Generation** - Making your model talk!

## Step 1: Setup Environment

In [None]:
# Clone the repository
!git clone https://github.com/muhdaldiansyah/strivelm.git
%cd strivelm
!ls -la

In [None]:
# Verify PyTorch and GPU
import torch
import torch.nn as nn
import torch.nn.functional as F

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    device = 'cuda'
else:
    device = 'cpu'
print(f"\nUsing device: {device}")

## Step 2: Understanding the Data

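The GPT built here is a **character-level language model**: its tokens are individual characters, so it learns patterns one character at a time. (Production GPT models use subword tokens instead.)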

In [None]:
# Load and inspect the training data
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

print(f"Dataset size: {len(text)} characters")
print(f"\nFirst 300 characters:\n{'-'*60}")
print(text[:300])
print(f"{'-'*60}\n")

# Build vocabulary
chars = sorted(set(text))
vocab_size = len(chars)
print(f"Vocabulary size: {vocab_size} unique characters")
print(f"Characters: {''.join(chars[:50])}{'...' if len(chars) > 50 else ''}")
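Before the model can use this vocabulary, each character has to be mapped to an integer ID and back. A minimal sketch of that mapping follows. The names `stoi`, `itos`, `encode`, and `decode` are illustrative assumptions, not necessarily what the repository uses, and the sample string stands in for `input.txt`.

```python
# Hypothetical stand-in for the contents of input.txt
sample_text = "hello world"

# Same vocabulary construction as above: sorted unique characters
sample_chars = sorted(set(sample_text))

# Lookup tables: character -> ID and ID -> character
stoi = {ch: i for i, ch in enumerate(sample_chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    """Turn a string into a list of integer token IDs."""
    return [stoi[ch] for ch in s]

def decode(ids):
    """Turn a list of token IDs back into a string."""
    return "".join(itos[i] for i in ids)

ids = encode("hello")
print(ids)
print(decode(ids))
```

Encoding followed by decoding should return the original string, which is a quick sanity check for any tokenizer.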

## Step 3: Architecture Deep Dive

Let's explore the GPT architecture component by component.

### 3.1 Token Embedding

**Purpose:** Convert discrete tokens (characters) into continuous vectors.

```
Input: 'H' → Token ID: 12
       ↓
Embedding Layer (vocab_size × n_embd)
       ↓
Output: [0.23, -0.15, 0.89, ...] (128-dim vector)
```

Each character gets its own learned vector representation.

In [None]:
# Example: Token Embedding
from model import GPT

# Instantiate an untrained model so we can inspect its layers
model = GPT(vocab_size=vocab_size, block_size=64, n_layer=2, n_head=2, n_embd=128)
print("Token Embedding Layer:")
print(f"  Shape: {model.tok.weight.shape}")
print(f"  Meaning: {vocab_size} tokens × {128} dimensions")
print(f"\n  Example: Character 'T' → 128-dimensional vector")

### 3.2 Positional Embedding

**Purpose:** Add position information since transformers have no inherent notion of sequence order.

```
Position 0: [0.12, 0.45, -0.23, ...]
Position 1: [0.34, -0.12, 0.56, ...]
Position 2: [0.89, 0.23, -0.67, ...]
...
```

**Formula:** `final_embedding = token_embedding + position_embedding`

In [None]:
# Example: Positional Embedding
print("Positional Embedding Layer:")
print(f"  Shape: {model.pos.weight.shape}")
print(f"  Meaning: {64} positions × {128} dimensions")
print(f"\n  The model can handle sequences up to 64 tokens long")
print(f"  Each position gets a unique learned vector")
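The formula `final_embedding = token_embedding + position_embedding` can be sketched in plain Python with tiny lookup tables. The sizes and the names `tok_table`, `pos_table`, and `embed` are assumptions chosen for readability. The real model uses learned `nn.Embedding` weights with 128 dimensions.

```python
import random

random.seed(0)

# Toy sizes (assumptions for illustration; the notebook uses 128 dims and 64 positions)
toy_vocab_size, toy_block_size, toy_n_embd = 5, 4, 3

def random_table(rows, cols):
    """A rows x cols table of small random numbers, standing in for learned weights."""
    return [[random.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

tok_table = random_table(toy_vocab_size, toy_n_embd)   # one row per token ID
pos_table = random_table(toy_block_size, toy_n_embd)   # one row per position

def embed(token_ids):
    """Look up each token's vector and add the vector for its position."""
    return [
        [t + p for t, p in zip(tok_table[tok], pos_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]

x = embed([2, 2, 0])
# The same token (ID 2) at positions 0 and 1 gets different final vectors
print(x[0])
print(x[1])
```

Because the position vector is added, two occurrences of the same character no longer look identical to the layers that follow.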

### 3.3 Multi-Head Self-Attention (The Core of Transformers)

**Purpose:** Let each token "attend to" (look at) other tokens in the sequence.

#### QKV Mechanism:

```
Input: "Hello"
       ↓
For each token, compute 3 vectors:
  Q (Query):  "What am I looking for?"
  K (Key):    "What do I contain?"
  V (Value):  "What do I output?"
       ↓
Attention(Q,K,V) = softmax(QK^T / √d) × V   (d = dimension of one head)
       ↓
Output: Context-aware representations
```

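#### Example:
When processing "**cat**" in "the cat sat", the model might attend to:
- Previous word "the" (70% attention)
- Itself "cat" (30% attention)

GPT-style models use causal (masked) attention, so "cat" cannot attend to the later word "sat". Words are used here for readability; in this model the tokens are characters.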

In [None]:
# Example: QKV Projection
block = model.blocks[0]  # First transformer block
print("Multi-Head Attention:")
print(f"  Number of heads: {block.h}")
print(f"  Head dimension: {block.hd}")
print(f"  QKV projection shape: {block.qkv.weight.shape}")
print(f"\n  The input (128-dim) is projected to:")
print(f"    Q: 128 dimensions (queries)")
print(f"    K: 128 dimensions (keys)")
print(f"    V: 128 dimensions (values)")
print(f"  Then split into {block.h} heads of {block.hd} dims each")
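To make the attention formula concrete, here is a minimal single-head sketch in plain Python. It assumes the causal mask that GPT-style models apply, so each position scores only itself and earlier positions. The function names and toy inputs are illustrative and do not mirror `model.py`.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d)) V with a causal mask.

    Q, K, V are lists of T vectors, each of length d.
    """
    d = len(Q[0])
    outputs, all_weights = [], []
    for i, q in enumerate(Q):
        # Causal mask: position i only scores positions 0..i
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d) for j in range(i + 1)]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors
        out = [sum(w * V[j][c] for j, w in enumerate(weights)) for c in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy inputs: 3 tokens, 2 dimensions (values chosen arbitrarily)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out, weights = causal_attention(Q, K, V)
for i, w in enumerate(weights):
    print(f"token {i} attends to positions 0..{i} with weights {[round(x, 3) for x in w]}")
```

The first token can only attend to itself, so its output equals its own value vector. Multi-head attention runs several such heads in parallel on slices of the embedding and concatenates the results.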

### 3.4 Layer Normalization

**Purpose:** Normalize activations to stabilize training.

```
LayerNorm(x) = γ × (x - μ) / √(σ² + ε) + β
```

Where:
- μ = mean of the features
- σ² = variance of the features
- ε = small constant for numerical stability
- γ, β = learnable scale and shift parameters

**Why?** Keeps activations at a consistent scale from layer to layer, which makes deeper networks easier to train.

In [None]:
# Example: Layer Normalization
print("Layer Normalization:")
print(f"  Pre-attention LayerNorm: {block.ln1}")
print(f"  Pre-FFN LayerNorm: {block.ln2}")
print(f"\n  Applied BEFORE each sub-layer (Pre-LN architecture)")
print(f"  Normalizes across the embedding dimension (128)")
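The normalization can be worked through by hand on a single feature vector. This is a plain-Python sketch with an assumed function name `layer_norm`. It uses the biased variance and a small `eps`, as `torch.nn.LayerNorm` does.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token's feature vector, then scale by gamma and shift by beta."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # biased variance, as in torch.nn.LayerNorm
    return [g * (v - mean) / math.sqrt(var + eps) + b for v, g, b in zip(x, gamma, beta)]

features = [2.0, 4.0, 6.0, 8.0]
gamma = [1.0] * 4   # initial scale (learnable in a real model)
beta = [0.0] * 4    # initial shift (learnable in a real model)

normed = layer_norm(features, gamma, beta)
print([round(v, 3) for v in normed])
```

With `gamma` at 1 and `beta` at 0, the output has mean 0 and variance very close to 1, regardless of the scale of the input.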

### 3.5 Residual Connections

**Purpose:** Allow gradients to flow directly through the network.

```
x_out = x + SubLayer(LayerNorm(x))
```

**Visualization:**
```
     Input (x)
       |
       ├──→ LayerNorm → Attention → ──┐
       |                              |
       └──────────────────────────────┴→ Add → Output
```

**Why?** Mitigates the vanishing-gradient problem in deep networks: the identity path gives gradients a route that bypasses each sub-layer.

In [None]:
# Visualize the forward pass
print("Transformer Block Forward Pass:")
print("\n1. x = input")
print("2. x = x + Dropout(Attention(LayerNorm(x)))    ← Residual #1")
print("3. x = x + Dropout(FFN(LayerNorm(x)))          ← Residual #2")
print("4. return x")
print("\nThe input can flow directly to the output via '+' operations!")
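The effect on gradients can be illustrated with a deliberately simplified scalar model. This is a toy sketch, not the real network: each "layer" is the weak function `f(x) = 0.01 * tanh(x)`, and the code tracks the derivative of the output with respect to the input through 20 such layers, with and without the `x + f(x)` shortcut.

```python
import math

def gradient_through(depth, x, residual):
    """Derivative of the output w.r.t. the input for a stack of toy layers.

    Each layer computes f(x) = 0.01 * tanh(x), a deliberately weak sub-layer.
    With residual=True the layer computes x + f(x) instead.
    """
    grad = 1.0
    for _ in range(depth):
        f = 0.01 * math.tanh(x)
        df = 0.01 * (1.0 - math.tanh(x) ** 2)  # f'(x)
        if residual:
            x, grad = x + f, grad * (1.0 + df)  # chain rule: d(x + f)/dx = 1 + f'
        else:
            x, grad = f, grad * df              # chain rule: d f/dx = f'
    return grad

plain = gradient_through(depth=20, x=0.5, residual=False)
skip = gradient_through(depth=20, x=0.5, residual=True)
print(f"without residuals: {plain:.3e}")
print(f"with residuals:    {skip:.3e}")
```

Without the shortcut, the gradient is a product of 20 small factors and collapses toward zero. With it, every factor is at least 1, so the signal survives the full depth.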

### 3.6 Feed-Forward Network (FFN)

**Purpose:** Process each position independently with non-linearity.

```
FFN(x) = Linear₂(GELU(Linear₁(x)))
       = W₂ × GELU(W₁ × x + b₁) + b₂
```

**Architecture:**
- Expand: 128 → 512 (4x expansion)
- Activation: GELU (smoother than ReLU)
- Compress: 512 → 128

In [None]:
# Example: Feed-Forward Network
print("Feed-Forward Network:")
print(f"  Architecture: {block.ff}")
print(f"\n  Layer 1: 128 → 512 (4x expansion)")
print(f"  GELU activation (smooth, non-linear)")
print(f"  Layer 2: 512 → 128 (back to original)")
print(f"\n  Applied to each token independently")
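The expand, activate, compress pattern can be sketched in plain Python as follows. The toy sizes (4 → 16 → 4) keep the same 4x ratio as 128 → 512 → 128. The random weights and helper names are assumptions for illustration, and GELU is written in its exact form using the error function.

```python
import math
import random

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, W, b):
    """y = W x + b for a single vector x; W is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def ffn(x, W1, b1, W2, b2):
    """Expand, apply GELU element-wise, then project back."""
    hidden = [gelu(h) for h in linear(x, W1, b1)]
    return linear(hidden, W2, b2)

random.seed(0)
n_in, n_hidden = 4, 16  # toy sizes with the same 4x ratio as 128 -> 512
W1 = [[random.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
b1 = [0.0] * n_hidden
W2 = [[random.gauss(0, 0.1) for _ in range(n_hidden)] for _ in range(n_in)]
b2 = [0.0] * n_in

token_vec = [0.5, -1.0, 0.25, 2.0]
print(ffn(token_vec, W1, b1, W2, b2))
```

The function takes one token's vector at a time, which is what "applied to each token independently" means: no information moves between positions in this sub-layer.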

### 3.7 Dropout

**Purpose:** Regularization to prevent overfitting.

```
During training: Randomly zero each activation with probability p,
                 scale the survivors by 1/(1-p)
During inference: Use all activations, no scaling
                  (this is how PyTorch's nn.Dropout behaves)
```

**Why?** Forces network to learn robust features, not memorize.

In [None]:
# Check dropout rate
from config import Config
cfg = Config()
print(f"Dropout rate: {cfg.dropout}")
print(f"\nDropout = {cfg.dropout} means:")
if cfg.dropout == 0.0:
    print("  No dropout (all activations kept)")
    print("  Reasonable for a small model that is unlikely to overfit in a short run")
else:
    print(f"  {cfg.dropout:.0%} of activations randomly zeroed during training")
    print("  Helps prevent overfitting, especially when training data is limited")
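A minimal sketch of the "inverted" dropout variant that PyTorch's `nn.Dropout` implements: the rescaling happens during training (by `1/(1-p)`), so inference needs no scaling at all. The function below is an illustrative plain-Python version that operates on a list, not the library code.

```python
import random

def dropout(x, p, training, rng=random):
    """Inverted dropout on a list of activations.

    Training: zero each value with probability p, scale survivors by 1/(1-p).
    Inference: return the input unchanged.
    """
    if not training or p == 0.0:
        return list(x)
    keep = 1.0 - p
    return [v / keep if rng.random() >= p else 0.0 for v in x]

rng = random.Random(0)
activations = [1.0] * 10

print(dropout(activations, p=0.2, training=True, rng=rng))   # each value is 0.0 or 1.25
print(dropout(activations, p=0.2, training=False, rng=rng))  # unchanged
```

Scaling the survivors keeps the expected value of each activation the same in training and inference, which is why `model.eval()` can simply switch dropout off.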

## Step 4: Complete Architecture Summary

```
Input Text: "Hello"
     ↓
┌─────────────────────────────────────┐
│ Token Embedding (vocab → n_embd)   │  ← Convert chars to vectors
│        +                            │
│ Positional Embedding (pos → n_embd)│  ← Add position info
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│ Transformer Block 1                 │
│   ├─ LayerNorm                      │
│   ├─ Multi-Head Attention (QKV)    │  ← Tokens look at each other
│   ├─ Residual Connection            │
│   ├─ LayerNorm                      │
│   ├─ Feed-Forward Network (FFN)    │  ← Process representations
│   └─ Residual Connection            │
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│ Transformer Block 2 (same structure)│
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│ Final LayerNorm                     │
└─────────────────────────────────────┘
     ↓
┌─────────────────────────────────────┐
│ Linear Head (n_embd → vocab_size)  │  ← Predict next character
└─────────────────────────────────────┘
     ↓
Output: Probability distribution over vocab
```

In [None]:
# Print full model architecture
print("Complete Model Architecture:")
print("="*60)
print(model)
print("="*60)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"\nTotal parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
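As a cross-check on the printed total, the parameter count can be estimated by hand. The sketch below assumes a common GPT layout: biases on every linear layer inside the blocks, a separate attention output projection, a 4x FFN, a final LayerNorm, and an output head with no bias and no weight tying. These are assumptions about `model.py`, not facts taken from it. If the estimate differs from the printed total, at least one of them does not hold.

```python
def count_gpt_params(vocab_size, block_size, n_layer, n_embd):
    """Parameter count for an ASSUMED GPT layout (see the assumptions in the text)."""
    C = n_embd
    embeddings = vocab_size * C + block_size * C     # token + positional tables
    per_block = (
        2 * C                # LayerNorm before attention (gamma, beta)
        + C * 3 * C + 3 * C  # fused QKV projection (weights + bias)
        + C * C + C          # attention output projection (assumed to exist)
        + 2 * C              # LayerNorm before FFN
        + C * 4 * C + 4 * C  # FFN expand
        + 4 * C * C + C      # FFN compress
    )
    final_ln = 2 * C
    head = C * vocab_size                            # assumed: no bias, no weight tying
    return embeddings + n_layer * per_block + final_ln + head

# vocab_size=65 is a hypothetical value; use len(chars) from Step 2 for the real one
estimate = count_gpt_params(vocab_size=65, block_size=64, n_layer=2, n_embd=128)
print(f"Estimated parameters: {estimate:,}")
```

The number of heads does not appear in the formula: splitting the embedding into heads changes how the QKV output is reshaped, not how many weights it has.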

## Step 5: Train the Model

Now let's train your GPT! The model learns to predict the next character given previous characters.

**Training objective:** Minimize cross-entropy loss
```
Loss = -log(P(correct_next_char | context))
```
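The loss for a single prediction can be computed directly from the model's raw scores (logits). This is a plain-Python sketch with an assumed function name and a hypothetical 65-character vocabulary. In practice the training script would likely rely on `F.cross_entropy`, which averages this quantity over every position in the batch.

```python
import math

def cross_entropy(logits, target):
    """-log softmax(logits)[target], computed stably with the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# Untrained model: equal scores for every character in a hypothetical 65-char vocabulary
uniform_logits = [0.0] * 65
print(cross_entropy(uniform_logits, target=3))   # equals ln(65)

# A model that strongly favors the correct character gets a much lower loss
confident_logits = [0.0] * 65
confident_logits[3] = 10.0
print(cross_entropy(confident_logits, target=3))
```

A uniform guess over `V` characters always costs `ln(V)`, which gives a useful baseline for judging the first few training iterations.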

In [None]:
# Train the model
print("Starting training...\n")
!python train.py

### Understanding Training Output

```
iter 1   | train 3.735 | val 3.826
```

- **iter**: Training iteration
- **train loss**: Lower = better predictions on training data
- **val loss**: Lower = better generalization to unseen data

**Loss starts high (~3.7)** because the untrained model is close to guessing at random: a uniform guess over `vocab_size` characters gives a loss of ln(`vocab_size`).

**Loss decreases** as the model learns character patterns.

## Step 6: Generate Text (Inference)

### How Text Generation Works:

```
1. Start with seed: "T"
2. Model predicts: "o" (90%), "h" (5%), "a" (5%)
3. Sample from distribution (with temperature)
4. Append to sequence: "To"
5. Repeat → "To be or not to be..."
```

### Temperature Parameter:
- **Low (0.3-0.7):** Conservative, repetitive
- **Medium (0.8-1.0):** Balanced
- **High (1.1-2.0):** Creative, random

### Top-K Parameter:
- Only sample from top K most likely tokens
- Prevents choosing very unlikely characters
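Temperature and top-k can be combined in one sampling step, sketched below in plain Python. The function name `sample_next` and the toy logits are assumptions. How `inference.py` implements `--temp` and `--top_k` may differ in detail.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Pick a token ID from logits using temperature scaling and optional top-k filtering."""
    scaled = [z / temperature for z in logits]
    if top_k is not None and top_k < len(scaled):
        # Keep only the k highest-scoring tokens; the rest get zero probability
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [z if z >= cutoff else -math.inf for z in scaled]
    m = max(scaled)
    weights = [math.exp(z - m) for z in scaled]   # exp(-inf) == 0.0
    return rng.choices(range(len(scaled)), weights=weights, k=1)[0]

rng = random.Random(0)
logits = [2.0, 1.0, 0.5, -1.0]   # hypothetical scores for a 4-token vocabulary

for temp in (0.5, 1.0, 1.5):
    picks = [sample_next(logits, temperature=temp, top_k=3, rng=rng) for _ in range(1000)]
    share = picks.count(0) / len(picks)
    print(f"temp={temp}: most likely token chosen {share:.0%} of the time")
```

Dividing by a temperature below 1 widens the gaps between logits, so the top token wins more often. A temperature above 1 narrows them. In this sketch, tokens tied at the cutoff are all kept, so slightly more than `top_k` tokens can survive.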

In [None]:
# Quick test
print("Testing your trained model...\n")
!python inference.py --start "T" --steps 300 --temp 0.9 --top_k 40

### Experiment with Different Settings

In [None]:
# Compare creativity levels
print("\n" + "="*70)
print("CONSERVATIVE (temp=0.5, top_k=10) - Focused, repetitive")
print("="*70)
!python inference.py --start "To" --steps 200 --temp 0.5 --top_k 10

print("\n" + "="*70)
print("BALANCED (temp=0.9, top_k=40) - Natural mix")
print("="*70)
!python inference.py --start "To" --steps 200 --temp 0.9 --top_k 40

print("\n" + "="*70)
print("CREATIVE (temp=1.3, top_k=60) - Diverse, surprising")
print("="*70)
!python inference.py --start "To" --steps 200 --temp 1.3 --top_k 60

## Step 7: Interactive Playground

In [None]:
# Interactive generation
start_text = input("Enter starting text [default: 'W']: ") or "W"
num_tokens = input("Tokens to generate [default: 250]: ") or "250"
temperature = input("Temperature 0.1-2.0 [default: 0.9]: ") or "0.9"

print(f"\nGenerating from '{start_text}'...\n")
!python inference.py --start "{start_text}" --steps {num_tokens} --temp {temperature}

## Step 8: Analyze the Model's Behavior

Let's peek inside the trained model to understand what it learned.

In [None]:
# Load trained model
ckpt = torch.load('checkpoints/out.pt', map_location='cpu')
vocab = list(ckpt['meta']['vocab'])

print("Model learned vocabulary:")
print(f"  Total characters: {len(vocab)}")
print(f"  Characters: {''.join(vocab)}")

# Rebuild the model from the checkpoint metadata and load the trained weights
trained_model = GPT(
    vocab_size=len(vocab),
    block_size=ckpt['meta']['block_size'],
    n_layer=ckpt['meta']['n_layer'],
    n_head=ckpt['meta']['n_head'],
    n_embd=ckpt['meta']['n_embd']
).eval()
trained_model.load_state_dict(ckpt['model'])

print(f"\nModel configuration:")
print(f"  Block size (context length): {ckpt['meta']['block_size']}")
print(f"  Number of layers: {ckpt['meta']['n_layer']}")
print(f"  Number of heads: {ckpt['meta']['n_head']}")
print(f"  Embedding dimension: {ckpt['meta']['n_embd']}")

## Summary: What You Learned

### Architecture Components:
1. ✅ **Token Embedding** - Characters → vectors
2. ✅ **Positional Embedding** - Add position info
3. ✅ **Multi-Head Attention** - QKV mechanism
4. ✅ **Layer Normalization** - Stabilize training
5. ✅ **Residual Connections** - Gradient flow
6. ✅ **Feed-Forward Network** - Non-linear processing
7. ✅ **Dropout** - Regularization

### Training:
- ✅ Cross-entropy loss
- ✅ AdamW optimizer
- ✅ Gradient descent

### Generation:
- ✅ Autoregressive sampling
- ✅ Temperature scaling
- ✅ Top-K filtering

---

## Next Steps

Want to go deeper?

1. **Modify architecture** - Change `config.py` parameters
2. **Visualize attention** - See what the model focuses on
3. **Try different datasets** - Train on other text
4. **Scale up** - Larger models, more data

---

**Resources:**
- [GitHub Repo](https://github.com/muhdaldiansyah/strivelm)
- [Attention Paper](https://arxiv.org/abs/1706.03762) - "Attention Is All You Need"
- [GPT-3 Paper](https://arxiv.org/abs/2005.14165) - "Language Models are Few-Shot Learners"