# EECS 595 HW3: Debug GPT Pretraining

This notebook provides step-by-step verification of your GPT implementation. 

## Instructions for Students:

1. **Implement the TODO sections** in `gpt.py` before running the corresponding cells
2. **Run each cell in order** to verify your implementation step by step
3. **Use `importlib.reload()`** to reload your latest code changes
4. **Check the output** of each test to ensure your implementation is correct

## TODO Requirements by Cell:

- **Cell 2**: Requires TODO 1.15 (setup_tokenizer)
- **Cell 3**: Requires TODO 1.1, 1.2 (GPTEmbedding)
- **Cell 4**: Requires TODO 1.3, 1.4 (MultiHeadAttention with RoPE)
- **Cell 5**: Requires TODO 1.5, 1.6 (SwiGLU activation)
- **Cell 6**: Requires TODO 1.5, 1.6 (FeedForward)
- **Cell 7**: Requires TODO 1.7, 1.8, 1.9 (TransformerBlock)
- **Cell 8**: Requires TODO 1.10, 1.11 (GPTModel)
- **Cell 9**: Requires TODO 1.10, 1.11 (GPTModel for text generation)
- **Cell 10**: Requires TODO 1.12, 1.13 (GPTDataset)
- **Cell 11**: Requires TODO 1.14 (create_dataloader)

## Key Features of This Implementation:
- **RoPE (Rotary Position Embedding)**: Positional information is encoded directly in the attention mechanism
- **No separate position embeddings**: The `GPTEmbedding` layer only handles token embeddings
- **RoPE in attention**: Queries and keys are rotated based on their position using RoPE

Let's start by importing the necessary modules and setting up the environment.


In [None]:
# Cell 1: Imports and Setup
import os
import math
import numpy as np
import random
import logging
import importlib

# PyTorch imports
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import RMSNorm
from torch.amp import autocast, GradScaler

# Data loading imports
from torch.utils.data import Dataset, DataLoader
import json
import glob
import gzip
import bz2

# Tokenization imports
from transformers import AutoTokenizer
import tiktoken

# Progress and timing
from tqdm.auto import tqdm, trange
import time

# Import our GPT implementation
import gpt
importlib.reload(gpt)  # Reload to get latest changes

print("✅ All imports successful!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name()}")


## Cell 2: Tokenizer Setup and Testing

**Required TODOs**: 1.15 (setup_tokenizer)

First, let's set up the tokenizer and verify it works correctly. This is important because the vocabulary size determines the size of our embedding layer.
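
As a rough illustration of why this matters, the sketch below uses made-up numbers: a hypothetical base vocabulary of 50,257 entries with the five special tokens assumed to be appended right after it. Your tokenizer's actual IDs may differ. The point is only that the embedding table must cover the largest token ID, not just the base vocabulary.

```python
# Hypothetical numbers for illustration; your tokenizer's actual IDs may differ.
base_vocab_size = 50257
special_tokens = ["<|user|>", "<|assistant|>", "<|end|>", "<|system|>", "<|pad|>"]

# Assumption: special tokens are appended right after the base vocabulary.
special_ids = {tok: base_vocab_size + i for i, tok in enumerate(special_tokens)}

# An embedding table with N rows can only look up IDs 0..N-1,
# so it needs max_id + 1 rows to cover every token.
required_rows = max(special_ids.values()) + 1

print(special_ids["<|pad|>"])  # 50261
print(required_rows)           # 50262
```

This mirrors the `max_token_id + 1` calculation in the cell below.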


In [None]:
# Cell 2: Tokenizer Setup and Testing
importlib.reload(gpt)  # Reload to get latest changes

# Set up the tokenizer
print("Setting up tokenizer...")
tokenizer = gpt.setup_tokenizer()

# Test tokenization
test_text = "The quick brown fox jumps over the lazy dog."
tokens = tokenizer.encode(test_text)
decoded = tokenizer.decode(tokens)

print(f"\nTest text: '{test_text}'")
print(f"Tokens: {tokens}")
print(f"Decoded: '{decoded}'")
print(f"Vocabulary size: {tokenizer.vocab_size}")

# Find special token IDs
special_tokens = ["<|user|>", "<|assistant|>", "<|end|>", "<|system|>", "<|pad|>"]
print(f"\nSpecial token IDs:")
for token in special_tokens:
    token_id = tokenizer.convert_tokens_to_ids(token)
    print(f"  {token}: {token_id}")

# Calculate actual vocabulary size needed
max_token_id = max(tokenizer.convert_tokens_to_ids(token) for token in special_tokens)
actual_vocab_size = max_token_id + 1
print(f"\nActual vocabulary size needed: {actual_vocab_size}")
print(f"Difference from tokenizer vocab size: {actual_vocab_size - tokenizer.vocab_size}")

print("\n✅ Tokenizer setup complete!")


## Cell 3: Test GPTEmbedding Layer (with RoPE)

**Required TODOs**: 1.1 (GPTEmbedding initialization), 1.2 (GPTEmbedding forward pass)

Now let's test the embedding layer. **Important**: This version only handles token embeddings - no positional embeddings since we use RoPE!


In [None]:
# Cell 3: Test GPTEmbedding Layer
importlib.reload(gpt)  # Reload to get latest changes

# Test parameters
vocab_size = 1000
emb_dim = 8
context_length = 256
batch_size = 2
seq_length = 6

# Create random token IDs
token_ids = torch.randint(0, vocab_size, (batch_size, seq_length))
print(f"Input token IDs shape: {token_ids.shape}")
print(f"Sample token IDs: {token_ids[0]}")

# Initialize and test the embedding layer
print("\nTesting GPTEmbedding layer...")
embedding_layer = gpt.GPTEmbedding(vocab_size, emb_dim, context_length)
output = embedding_layer(token_ids)

# Verify output
print(f"Output shape: {output.shape}")
print(f"Expected shape: {(batch_size, seq_length, emb_dim)}")
print(f"Output sample (first token): {output[0, 0, :5]}")

# Sanity checks
assert output.shape == (batch_size, seq_length, emb_dim), \
    f"Expected output shape {(batch_size, seq_length, emb_dim)}, got {output.shape}"

# Check that embeddings are different for different tokens
if not torch.allclose(output[0, 0], output[0, 1]):
    print("✅ Different tokens produce different embeddings")
else:
    print("⚠️  Warning: Different tokens produce similar embeddings")

print("\n✅ GPTEmbedding layer test passed!")


## Cell 4: Test MultiHeadAttention with RoPE

**Required TODOs**: 1.3 (MultiHeadAttention initialization), 1.4 (MultiHeadAttention forward pass)

This is the most complex part! The attention mechanism now uses RoPE to encode positional information directly in the queries and keys.
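
To make the idea concrete, here is a minimal pure-Python sketch of the rotation, not the implementation expected in `gpt.py`. It assumes the interleaved-pair layout (dimensions 0 and 1 form one pair, 2 and 3 the next) and a base of 10000; the helper names `rope_rotate` and `dot` are hypothetical, and some implementations pair dimensions differently. It shows the property that makes RoPE useful: the query-key score depends only on the relative offset between positions.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Each pair gets its own frequency; later pairs rotate more slowly.
        theta = position * base ** (-i / dim)
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.3, -0.7, 1.0]

# Same relative offset (3) at two different absolute positions.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_b = dot(rope_rotate(q, 13), rope_rotate(k, 10))
# A different relative offset (1).
score_c = dot(rope_rotate(q, 5), rope_rotate(k, 4))

print(math.isclose(score_a, score_b, rel_tol=1e-9, abs_tol=1e-9))  # True
print(math.isclose(score_a, score_c, rel_tol=1e-6))                # False
```

In a real attention layer the same rotation is applied per head to the query and key tensors before computing attention scores; the values are left unrotated.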


In [None]:
# Cell 4: Test MultiHeadAttention with RoPE
importlib.reload(gpt)  # Reload to get latest changes

# Test parameters
torch.manual_seed(123)  # For reproducible results
d_in = 16
d_out = d_in
num_heads = 4
context_length = 32
dropout = 0.0
batch_size = 3
seq_len = 7

# Create random input tensor
x = torch.randn(batch_size, seq_len, d_in)
print(f"Input shape: {x.shape}")

# Initialize MultiHeadAttention with RoPE
print("\nTesting MultiHeadAttention with RoPE...")
mha = gpt.MultiHeadAttention(d_in, context_length, dropout, num_heads, qkv_bias=True)
out = mha(x)

# Verify output
print(f"Output shape: {out.shape}")
print(f"Expected shape: {(batch_size, seq_len, d_out)}")
print(f"Output sample: {out[0, 0, :5]}")

# Sanity checks
assert out.shape == (batch_size, seq_len, d_out), \
    f"Expected output shape {(batch_size, seq_len, d_out)}, got {out.shape}"
assert not torch.isnan(out).any(), "Output contains NaNs!"

# Test that RoPE is working by checking positional sensitivity
# Swap the first two tokens: without positional information, the last position
# would attend over the same set of keys/values and produce the same output
seq1 = torch.randn(1, 3, d_in)
seq2 = seq1[:, [1, 0, 2], :]  # Same tokens, first two positions swapped

out1 = mha(seq1)
out2 = mha(seq2)

# The last position sees the same tokens at different relative positions,
# so its output should differ when RoPE is applied
if not torch.allclose(out1[:, -1], out2[:, -1], atol=1e-6):
    print("✅ RoPE is working: same tokens at different positions produce different outputs")
else:
    print("⚠️  Warning: RoPE might not be working correctly")

print("\n✅ MultiHeadAttention with RoPE test passed!")


## Cell 5: Test SwiGLU Activation Function

**Required TODOs**: 1.5 (FeedForward initialization), 1.6 (FeedForward forward pass)

Let's test the SwiGLU activation function, a gated activation that often performs better than plain ReLU or GELU in transformer feed-forward layers.
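
A minimal pure-Python sketch of the gating idea follows, assuming the common formulation SwiGLU(x) = SiLU(x W_gate) * (x W_up). The weight names and the tiny hand-picked matrices are hypothetical, and `gpt.SwiGLU` may organize its projections differently.

```python
import math

def silu(z):
    """SiLU (Swish): z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(weights, x):
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def swiglu(x, w_gate, w_up):
    """Gate one linear projection with the SiLU of another."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return [g * u for g, u in zip(gate, up)]

# Hypothetical 2x2 weights chosen by hand for readability.
w_gate = [[1.0, 0.0], [0.0, 1.0]]
w_up = [[0.5, 0.5], [1.0, -1.0]]

x1, x2 = [1.0, 2.0], [-0.5, 0.3]
combined = swiglu([a + b for a, b in zip(x1, x2)], w_gate, w_up)
summed = [a + b for a, b in zip(swiglu(x1, w_gate, w_up), swiglu(x2, w_gate, w_up))]

# A linear map would give combined == summed; the gate breaks that.
print(combined != summed)  # True
```

This is the same linearity check the cell below runs against your own implementation.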


In [None]:
# Cell 5: Test SwiGLU Activation Function
importlib.reload(gpt)  # Reload to get latest changes

# Test parameters
dimension = 16
batch_size = 4
seq_len = 8

# Create test input
x = torch.randn(batch_size, seq_len, dimension)
print(f"Input shape: {x.shape}")
print(f"Input sample: {x[0, 0, :5]}")

# Initialize and test SwiGLU
print("\nTesting SwiGLU activation...")
swiglu = gpt.SwiGLU(dimension)
out = swiglu(x)

# Verify output
print(f"Output shape: {out.shape}")
print(f"Expected shape: {(batch_size, seq_len, dimension)}")
print(f"Output sample: {out[0, 0, :5]}")

# Sanity checks
assert out.shape == (batch_size, seq_len, dimension), \
    f"Expected output shape {(batch_size, seq_len, dimension)}, got {out.shape}"
assert not torch.isnan(out).any(), "Output contains NaNs!"

# Test that SwiGLU is non-linear
# Create two different inputs
x1 = torch.randn(1, 1, dimension)
x2 = torch.randn(1, 1, dimension)
out1 = swiglu(x1)
out2 = swiglu(x2)

# Test linearity: SwiGLU(x1 + x2) should NOT equal SwiGLU(x1) + SwiGLU(x2)
combined_input = x1 + x2
combined_output = swiglu(combined_input)
sum_outputs = out1 + out2

if not torch.allclose(combined_output, sum_outputs):
    print("✅ SwiGLU is non-linear (as expected)")
else:
    print("⚠️  Warning: SwiGLU appears to be linear")

print("\n✅ SwiGLU activation test passed!")


## Cell 6: Test FeedForward Layer

**Required TODOs**: 1.5 (FeedForward initialization), 1.6 (FeedForward forward pass)

Now let's test the feed-forward network that uses SwiGLU activation.


In [None]:
# Cell 6: Test FeedForward Layer
importlib.reload(gpt)  # Reload to get latest changes

# Test parameters
emb_dim = 16
batch_size = 10
seq_len = 4

# Create test input
x = torch.randn(batch_size, seq_len, emb_dim)
print(f"Input shape: {x.shape}")
print(f"Input sample: {x[0, 0, :5]}")

# Initialize and test FeedForward
print("\nTesting FeedForward layer...")
ff = gpt.FeedForward(emb_dim)
out = ff(x)

# Verify output
print(f"Output shape: {out.shape}")
print(f"Expected shape: {(batch_size, seq_len, emb_dim)}")
print(f"Output sample: {out[0, 0, :5]}")

# Sanity checks
assert out.shape == (batch_size, seq_len, emb_dim), \
    f"Expected output shape {(batch_size, seq_len, emb_dim)}, got {out.shape}"
assert not torch.isnan(out).any(), "Output contains NaNs!"

# Test that FeedForward transforms the input
if not torch.allclose(x, out):
    print("✅ FeedForward transforms the input (as expected)")
else:
    print("⚠️  Warning: FeedForward doesn't seem to transform the input")

print("\n✅ FeedForward layer test passed!")


## Cell 7: Test TransformerBlock

**Required TODOs**: 1.7 (TransformerBlock initialization), 1.8 (TransformerBlock maybe_dropout), 1.9 (TransformerBlock forward pass)

Now let's test the complete transformer block that combines attention and feed-forward layers.


In [None]:
# Cell 7: Test TransformerBlock
importlib.reload(gpt)  # Reload to get latest changes

# Test configuration
OG_GPT_CONFIG = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False
}

# Create test input
torch.manual_seed(123)
x = torch.rand(2, 4, OG_GPT_CONFIG["emb_dim"])
print(f"Input shape: {x.shape}")

# Initialize and test TransformerBlock
print("\nTesting TransformerBlock...")
block = gpt.TransformerBlock(OG_GPT_CONFIG)
output = block(x)

# Verify output
print(f"Output shape: {output.shape}")
print(f"Expected shape: {(2, 4, OG_GPT_CONFIG['emb_dim'])}")

# Sanity checks
assert output.shape == (2, 4, OG_GPT_CONFIG["emb_dim"]), \
    f"Expected output shape {(2, 4, OG_GPT_CONFIG['emb_dim'])}, got {output.shape}"
assert not torch.isnan(output).any(), "Output contains NaNs!"

# Test that the block transforms the input
if not torch.allclose(x, output):
    print("✅ TransformerBlock transforms the input (as expected)")
else:
    print("⚠️  Warning: TransformerBlock doesn't seem to transform the input")

print("\n✅ TransformerBlock test passed!")


## Cell 8: Test Complete GPTModel

**Required TODOs**: 1.10 (GPTModel initialization), 1.11 (GPTModel forward pass)

Now let's test the complete GPT model!


In [None]:
# Cell 8: Test Complete GPTModel
importlib.reload(gpt)  # Reload to get latest changes

# Calculate vocabulary size from tokenizer
special_tokens = ["<|user|>", "<|assistant|>", "<|end|>", "<|system|>", "<|pad|>"]
max_token_id = max(tokenizer.convert_tokens_to_ids(token) for token in special_tokens)
actual_vocab_size = max_token_id + 1

# Test configuration
CUSTOM_GPT_CONFIG = {
    "vocab_size": actual_vocab_size,
    "context_length": 1024,
    "emb_dim": 512,
    "n_heads": 8,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False
}

print(f"Using vocabulary size: {actual_vocab_size}")

# Test with real tokenized text
sentence = "The quick brown fox jumps over the lazy dog."
token_ids = tokenizer.encode(sentence)
token_ids = torch.tensor(token_ids)
print(f"Input sentence: '{sentence}'")
print(f"Token IDs: {token_ids}")
print(f"Token IDs shape: {token_ids.unsqueeze(0).shape}")

# Initialize and test GPTModel
print("\nTesting GPTModel...")
gpt_model = gpt.GPTModel(CUSTOM_GPT_CONFIG)
output = gpt_model(token_ids.unsqueeze(0))

# Verify output
print(f"Output shape: {output.shape}")
print(f"Expected shape: {(1, len(token_ids), actual_vocab_size)}")

# Sanity checks
assert output.shape == (1, len(token_ids), actual_vocab_size), \
    f"Expected output shape {(1, len(token_ids), actual_vocab_size)}, got {output.shape}"
assert not torch.isnan(output).any(), "Output contains NaNs!"

# Check that logits are reasonable (not all the same)
logits_variance = output.var()
print(f"Logits variance: {logits_variance:.4f}")
if logits_variance > 0.01:
    print("✅ Logits have reasonable variance")
else:
    print("⚠️  Warning: Logits have very low variance")

print("\n✅ GPTModel test passed!")


## Cell 9: Test Text Generation

**Required TODOs**: 1.10 (GPTModel initialization), 1.11 (GPTModel forward pass)

Let's test text generation with our untrained model (it will be random, but we can verify the mechanics work).
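
The mechanics being verified can be sketched with a toy stand-in model. This is an illustration only: `toy_model` and `generate_greedy` are hypothetical names, the sketch always picks the highest-scoring token (greedy decoding), and `gpt.generate_text` may sample or apply other logic instead.

```python
def toy_model(token_ids, vocab_size=5):
    """Stand-in for a language model: returns one row of logits per position.
    It simply favors (token + 1) mod vocab_size at every position."""
    logits = []
    for tok in token_ids:
        row = [0.0] * vocab_size
        row[(tok + 1) % vocab_size] = 1.0
        logits.append(row)
    return logits

def generate_greedy(model, token_ids, max_new_tokens, context_size):
    """Autoregressive loop: predict one token, append it, repeat."""
    token_ids = list(token_ids)
    for _ in range(max_new_tokens):
        context = token_ids[-context_size:]   # crop to the supported context
        last_logits = model(context)[-1]      # logits for the final position
        next_token = max(range(len(last_logits)), key=last_logits.__getitem__)
        token_ids.append(next_token)
    return token_ids

print(generate_greedy(toy_model, [0, 1], max_new_tokens=4, context_size=3))
# [0, 1, 2, 3, 4, 0]
```

The two details worth checking in any implementation are the context crop (never feed more than `context_size` tokens) and the use of only the last position's logits to choose the next token.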


In [None]:
# Cell 9: Test Text Generation
importlib.reload(gpt)  # Reload to get latest changes

# Test text generation
start_context = "The quick brown fox"
print(f"Starting context: '{start_context}'")

# Generate text (will be random since model is untrained)
full_text = gpt.generate_text(
    start_context=start_context,
    tokenizer=tokenizer,
    model=gpt_model,
    max_new_tokens=10,
    context_size=CUSTOM_GPT_CONFIG["context_length"]
)

print(f"Generated text: '{full_text}'")
print("\nNote: The generated text will be random since the model is untrained.")
print("This is expected! After training, the model should generate more coherent text.")

print("\n✅ Text generation test passed!")


## Cell 10: Test Dataset Creation

**Required TODOs**: 1.12 (GPTDataset initialization), 1.13 (GPTDataset __getitem__)

Let's test the dataset creation for training.
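
As a sketch of what `max_length` and `stride` typically control, the following pure-Python example slices a token sequence into overlapping windows with labels shifted by one position. The helper `make_windows` is hypothetical; how `GPTDataset` handles document boundaries, padding, or end tokens is not shown here and may differ.

```python
def make_windows(token_ids, max_length, stride):
    """Slice a token sequence into (input, label) pairs, labels shifted by one."""
    samples = []
    # Stop early enough that every label window is full length.
    for start in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[start : start + max_length]
        labels = token_ids[start + 1 : start + max_length + 1]
        samples.append((inputs, labels))
    return samples

tokens = list(range(12))
for inputs, labels in make_windows(tokens, max_length=4, stride=4):
    print(inputs, labels)
# [0, 1, 2, 3] [1, 2, 3, 4]
# [4, 5, 6, 7] [5, 6, 7, 8]
```

A smaller stride produces more, overlapping samples from the same text: with `stride=2` the same 12 tokens yield four windows instead of two.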


In [None]:
# Cell 10: Test Dataset Creation
importlib.reload(gpt)  # Reload to get latest changes

# Create a small test dataset
test_docs = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is a subset of artificial intelligence.",
    "Natural language processing helps computers understand human language."
]

print("Test documents:")
for i, doc in enumerate(test_docs):
    print(f"  {i+1}. {doc}")

# Test GPTDataset
print("\nTesting GPTDataset...")
dataset = gpt.GPTDataset(test_docs, tokenizer, max_length=10, stride=5)

print(f"Dataset size: {len(dataset)}")

# Get first sample
input_ids, labels = dataset[0]
print(f"First sample input shape: {input_ids.shape}")
print(f"First sample labels shape: {labels.shape}")
print(f"Input IDs: {input_ids}")
print(f"Labels: {labels}")

# Verify causal language modeling setup
print("\nVerifying causal language modeling setup...")
assert input_ids.shape == labels.shape, "Input and label shapes should match"
assert torch.all(input_ids[1:] == labels[:-1]), "Labels should be input shifted by 1"

print("✅ Labels are correctly shifted by one position")

# Test with different stride
print("\nTesting with different stride...")
dataset_stride = gpt.GPTDataset(test_docs, tokenizer, max_length=10, stride=10)
print(f"Dataset with stride=10 size: {len(dataset_stride)}")

print("\n✅ Dataset creation test passed!")


## Cell 11: Test DataLoader Creation

**Required TODOs**: 1.14 (create_dataloader)

Finally, let's test the DataLoader creation for training.
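
The effect of `batch_size` and `drop_last` on the number of batches can be sketched without PyTorch. The helper `batch_indices` below is hypothetical and ignores shuffling; it only mimics how an incomplete final batch is kept or dropped.

```python
def batch_indices(num_samples, batch_size, drop_last):
    """Group sample indices into batches, optionally dropping a short final batch."""
    batches = [
        list(range(i, min(i + batch_size, num_samples)))
        for i in range(0, num_samples, batch_size)
    ]
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches

print(batch_indices(5, batch_size=2, drop_last=True))   # [[0, 1], [2, 3]]
print(batch_indices(5, batch_size=2, drop_last=False))  # [[0, 1], [2, 3], [4]]
```

With `drop_last=True`, every batch has exactly `batch_size` samples, which keeps tensor shapes uniform during training.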


In [None]:
# Cell 11: Test DataLoader Creation
importlib.reload(gpt)  # Reload to get latest changes

# Test DataLoader creation
print("Testing DataLoader creation...")
dataloader = gpt.create_dataloader(
    txt=test_docs,
    batch_size=2,
    max_length=10,
    stride=5,
    shuffle=True,
    drop_last=True,
    num_workers=0
)

print(f"DataLoader created successfully!")
print(f"Number of batches: {len(dataloader)}")

# Test getting a batch
print("\nTesting batch retrieval...")
for i, (batch_input_ids, batch_labels) in enumerate(dataloader):
    print(f"Batch {i+1}:")
    print(f"  Input shape: {batch_input_ids.shape}")
    print(f"  Labels shape: {batch_labels.shape}")
    print(f"  Input IDs: {batch_input_ids[0]}")
    print(f"  Labels: {batch_labels[0]}")

    # Verify batch properties
    assert batch_input_ids.shape == batch_labels.shape, "Batch input and label shapes should match"
    assert batch_input_ids.shape[0] <= 2, "Batch size should be <= 2"

    if i >= 1:  # Only test first 2 batches
        break

print("\n✅ DataLoader creation test passed!")

print("\n" + "="*60)
print("🎉 ALL TESTS PASSED! 🎉")
print("="*60)
print("\nYour GPT implementation is working correctly!")
print("You can now proceed to train your model using pretrain_gpt.py")
print("\nKey points about your implementation:")
print("✅ RoPE encodes positional information in attention")
print("✅ No separate positional embeddings needed")
print("✅ SwiGLU activation for better performance")
print("✅ All components work together correctly")
print("✅ Ready for training!")

