# nanoGPT: Complete Implementation

This notebook is a complete standalone implementation based on Andrej Karpathy's nanoGPT repository.  
**Repository:** https://github.com/karpathy/nanoGPT

This notebook contains:
1. **Complete GPT model architecture** (faithful to the original)
2. **All model methods, including:**
   - `from_pretrained()` - Load pretrained GPT-2 models
   - `crop_block_size()` - Model surgery for a smaller context
   - `estimate_mfu()` - Model FLOPs utilization (MFU)
3. Training loop with AdamW, gradient clipping, and cosine learning-rate decay with warmup
4. Shakespeare dataset preparation with character-level and GPT-2 BPE tokenization
5. Text generation with temperature and top-k sampling
6. Examples of loading and fine-tuning pretrained GPT-2

The implementation is designed to be educational and hackable while remaining efficient.

## 1. Setup and Imports

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F
import math
import inspect
from dataclasses import dataclass
import time
import requests
import pickle
import os

# Set device
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")
print(f"PyTorch version: {torch.__version__}")

## 2. Complete Model Architecture

This is the complete, faithful implementation from nanoGPT's model.py.

In [None]:
class LayerNorm(nn.Module):
    """LayerNorm but with an optional bias. Older PyTorch versions (before 2.1) don't support simply bias=False"""

    def __init__(self, ndim, bias):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None

    def forward(self, input):
        return F.layer_norm(input, self.weight.shape, self.weight, self.bias, 1e-5)
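
To make the normalization concrete, here is a minimal pure-Python sketch of what `F.layer_norm` computes for a single feature vector: subtract the mean, divide by the square root of the variance plus `eps`, then apply the learned scale and the optional bias. The name `layer_norm_1d` is illustrative only and is not part of nanoGPT or PyTorch.

```python
import math

def layer_norm_1d(x, weight, bias=None, eps=1e-5):
    """Normalize one feature vector, then apply scale and optional shift."""
    mean = sum(x) / len(x)
    # Biased variance (divide by N), as used by layer normalization
    var = sum((v - mean) ** 2 for v in x) / len(x)
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    if bias is None:
        # bias=False case: no shift is applied
        bias = [0.0] * len(x)
    return [w * n + b for w, n, b in zip(weight, normed, bias)]

out = layer_norm_1d([1.0, 2.0, 3.0, 4.0], weight=[1.0] * 4)
print([round(v, 4) for v in out])
```

With unit weights and no bias, the output has (approximately) zero mean and unit variance regardless of the input scale.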

In [None]:
class CausalSelfAttention(nn.Module):
    """Multi-head masked self-attention layer"""

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        
        # Key, query, value projections for all heads, but in a batch
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
        # Output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        # Regularization
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.dropout = config.dropout
        
        # Flash attention support (PyTorch >= 2.0)
        self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention')
        if not self.flash:
            print("WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0")
            # Causal mask to ensure attention is only applied to the left in the input sequence
            self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                        .view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size()  # batch size, sequence length, embedding dimensionality (n_embd)

        # Calculate query, key, values for all heads in batch and move head forward to be the batch dim
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)

        # Causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        if self.flash:
            # Efficient attention using Flash Attention CUDA kernels
            y = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=None, 
                                                                  dropout_p=self.dropout if self.training else 0, 
                                                                  is_causal=True)
        else:
            # Manual implementation of attention
            att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
            att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
            att = F.softmax(att, dim=-1)
            att = self.attn_dropout(att)
            y = att @ v  # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        
        y = y.transpose(1, 2).contiguous().view(B, T, C)  # Re-assemble all head outputs side by side

        # Output projection
        y = self.resid_dropout(self.c_proj(y))
        return y
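
The causal mask in the manual attention path can be illustrated without tensors. The sketch below applies the same idea (set scores for future positions to negative infinity, then take a row-wise softmax) to a small list-of-lists score matrix. The helper name `causal_attention_weights` and the example scores are made up for illustration.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over a T x T score matrix with future positions masked out."""
    T = len(scores)
    weights = []
    for t in range(T):
        # Position t may attend only to positions 0..t (the lower triangle)
        masked = [scores[t][s] if s <= t else float("-inf") for s in range(T)]
        row_max = max(masked)
        exps = [math.exp(v - row_max) for v in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0, 1.0, 2.0],
          [1.0, 1.0, 5.0],
          [0.5, 0.5, 0.5]]
att = causal_attention_weights(scores)
for row in att:
    print([round(w, 3) for w in row])
```

Each row sums to 1 and every entry above the diagonal is exactly 0, so the large score `5.0` in row 1 has no effect because it belongs to a future position.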

In [None]:
class MLP(nn.Module):
    """Feed-forward network"""

    def __init__(self, config):
        super().__init__()
        self.c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        self.gelu    = nn.GELU()
        self.c_proj  = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        x = self.dropout(x)
        return x

In [None]:
class Block(nn.Module):
    """Transformer block: communication followed by computation"""

    def __init__(self, config):
        super().__init__()
        self.ln_1 = LayerNorm(config.n_embd, bias=config.bias)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = LayerNorm(config.n_embd, bias=config.bias)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x

In [None]:
@dataclass
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50304  # GPT-2 vocab_size of 50257, padded up to nearest multiple of 64 for efficiency
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    dropout: float = 0.0
    bias: bool = True  # True: bias in Linears and LayerNorms, like GPT-2. False: a bit better and faster
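
The default `vocab_size` of 50304 comes from rounding GPT-2's 50257 tokens up to the next multiple of 64. A small sketch of that arithmetic follows; `pad_to_multiple` is a hypothetical helper name, not something defined in nanoGPT.

```python
def pad_to_multiple(n, multiple):
    """Round n up to the nearest multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple

padded = pad_to_multiple(50257, 64)
print(padded, padded - 50257)
```

The padding adds 47 token ids that never occur in the data; they only enlarge the embedding and output matrices slightly.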

In [None]:
class GPT(nn.Module):
    """GPT Language Model - Complete Implementation"""

    def __init__(self, config):
        super().__init__()
        assert config.vocab_size is not None
        assert config.block_size is not None
        self.config = config

        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            drop = nn.Dropout(config.dropout),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = LayerNorm(config.n_embd, bias=config.bias),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        
        # Weight tying: https://paperswithcode.com/method/weight-tying
        self.transformer.wte.weight = self.lm_head.weight

        # Init all weights
        self.apply(self._init_weights)
        # Apply special scaled init to the residual projections, per GPT-2 paper
        for pn, p in self.named_parameters():
            if pn.endswith('c_proj.weight'):
                torch.nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * config.n_layer))

        # Report number of parameters
        print("number of parameters: %.2fM" % (self.get_num_params()/1e6,))

    def get_num_params(self, non_embedding=True):
        """
        Return the number of parameters in the model.
        For non-embedding count (default), the position embeddings get subtracted.
        The token embeddings would too, except due to the parameter sharing these
        params are actually used as weights in the final layer, so we include them.
        """
        n_params = sum(p.numel() for p in self.parameters())
        if non_embedding:
            n_params -= self.transformer.wpe.weight.numel()
        return n_params

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        device = idx.device
        b, t = idx.size()
        assert t <= self.config.block_size, f"Cannot forward sequence of length {t}, block size is only {self.config.block_size}"
        pos = torch.arange(0, t, dtype=torch.long, device=device)  # shape (t)

        # Forward the GPT model itself
        tok_emb = self.transformer.wte(idx)  # token embeddings of shape (b, t, n_embd)
        pos_emb = self.transformer.wpe(pos)  # position embeddings of shape (t, n_embd)
        x = self.transformer.drop(tok_emb + pos_emb)
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)

        if targets is not None:
            # If we are given some desired targets also calculate the loss
            logits = self.lm_head(x)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
        else:
            # Inference-time mini-optimization: only forward the lm_head on the very last position
            logits = self.lm_head(x[:, [-1], :])  # note: using list [-1] to preserve the time dim
            loss = None

        return logits, loss

    def crop_block_size(self, block_size):
        """
        Model surgery to decrease the block size if necessary.
        e.g. we may load the GPT2 pretrained model checkpoint (block size 1024)
        but want to use a smaller block size for some smaller, simpler model
        """
        assert block_size <= self.config.block_size
        self.config.block_size = block_size
        self.transformer.wpe.weight = nn.Parameter(self.transformer.wpe.weight[:block_size])
        for block in self.transformer.h:
            if hasattr(block.attn, 'bias'):
                block.attn.bias = block.attn.bias[:,:,:block_size,:block_size]

    @classmethod
    def from_pretrained(cls, model_type, override_args=None):
        """
        Load pretrained GPT-2 weights from HuggingFace transformers.
        Available models: 'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'
        """
        assert model_type in {'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}
        override_args = override_args or {}  # default to empty dict
        # only dropout can be overridden see more notes below
        assert all(k == 'dropout' for k in override_args)
        from transformers import GPT2LMHeadModel
        print("loading weights from pretrained gpt: %s" % model_type)

        # n_layer, n_head and n_embd are determined from model_type
        config_args = {
            'gpt2':         dict(n_layer=12, n_head=12, n_embd=768),  # 124M params
            'gpt2-medium':  dict(n_layer=24, n_head=16, n_embd=1024), # 350M params
            'gpt2-large':   dict(n_layer=36, n_head=20, n_embd=1280), # 774M params
            'gpt2-xl':      dict(n_layer=48, n_head=25, n_embd=1600), # 1558M params
        }[model_type]
        print("forcing vocab_size=50257, block_size=1024, bias=True")
        config_args['vocab_size'] = 50257  # always 50257 for GPT model checkpoints
        config_args['block_size'] = 1024  # always 1024 for GPT model checkpoints
        config_args['bias'] = True  # always True for GPT model checkpoints
        # we can override the dropout rate, if desired
        if 'dropout' in override_args:
            print(f"overriding dropout rate to {override_args['dropout']}")
            config_args['dropout'] = override_args['dropout']
        # create a from-scratch initialized minGPT model
        config = GPTConfig(**config_args)
        model = GPT(config)
        sd = model.state_dict()
        sd_keys = sd.keys()
        sd_keys = [k for k in sd_keys if not k.endswith('.attn.bias')]  # discard this mask / buffer, not a param

        # init a huggingface/transformers model
        model_hf = GPT2LMHeadModel.from_pretrained(model_type)
        sd_hf = model_hf.state_dict()

        # copy while ensuring all of the parameters are aligned and match in names and shapes
        sd_keys_hf = sd_hf.keys()
        sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.masked_bias')]  # ignore these, just a buffer
        sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.bias')]  # same, just the mask (buffer)
        transposed = ['attn.c_attn.weight', 'attn.c_proj.weight', 'mlp.c_fc.weight', 'mlp.c_proj.weight']
        # basically the openai checkpoints use a "Conv1D" module, but we only want to use a vanilla Linear
        # this means that we have to transpose these weights when we import them
        assert len(sd_keys_hf) == len(sd_keys), f"mismatched keys: {len(sd_keys_hf)} != {len(sd_keys)}"
        for k in sd_keys_hf:
            if any(k.endswith(w) for w in transposed):
                # special treatment for the Conv1D weights we need to transpose
                assert sd_hf[k].shape[::-1] == sd[k].shape
                with torch.no_grad():
                    sd[k].copy_(sd_hf[k].t())
            else:
                # vanilla copy over the other parameters
                assert sd_hf[k].shape == sd[k].shape
                with torch.no_grad():
                    sd[k].copy_(sd_hf[k])

        return model

    def configure_optimizers(self, weight_decay, learning_rate, betas, device_type):
        # Start with all of the candidate parameters
        param_dict = {pn: p for pn, p in self.named_parameters()}
        # Filter out those that do not require grad
        param_dict = {pn: p for pn, p in param_dict.items() if p.requires_grad}
        # Create optim groups. Any parameter that is at least 2D will be weight decayed, otherwise not.
        # i.e. all weight tensors in matmuls + embeddings decay, all biases and layernorms don't.
        decay_params = [p for n, p in param_dict.items() if p.dim() >= 2]
        nodecay_params = [p for n, p in param_dict.items() if p.dim() < 2]
        optim_groups = [
            {'params': decay_params, 'weight_decay': weight_decay},
            {'params': nodecay_params, 'weight_decay': 0.0}
        ]
        num_decay_params = sum(p.numel() for p in decay_params)
        num_nodecay_params = sum(p.numel() for p in nodecay_params)
        print(f"num decayed parameter tensors: {len(decay_params)}, with {num_decay_params:,} parameters")
        print(f"num non-decayed parameter tensors: {len(nodecay_params)}, with {num_nodecay_params:,} parameters")
        # Create AdamW optimizer and use the fused version if it is available
        fused_available = 'fused' in inspect.signature(torch.optim.AdamW).parameters
        use_fused = fused_available and device_type == 'cuda'
        extra_args = dict(fused=True) if use_fused else dict()
        optimizer = torch.optim.AdamW(optim_groups, lr=learning_rate, betas=betas, **extra_args)
        print(f"using fused AdamW: {use_fused}")

        return optimizer

    def estimate_mfu(self, fwdbwd_per_iter, dt):
        """
        Estimate model flops utilization (MFU) in units of A100 bfloat16 peak FLOPS.
        See PaLM paper Appendix B as ref: https://arxiv.org/abs/2204.02311
        """
        # first estimate the number of flops we do per iteration.
        N = self.get_num_params()
        cfg = self.config
        L, H, Q, T = cfg.n_layer, cfg.n_head, cfg.n_embd//cfg.n_head, cfg.block_size
        flops_per_token = 6*N + 12*L*H*Q*T
        flops_per_fwdbwd = flops_per_token * T
        flops_per_iter = flops_per_fwdbwd * fwdbwd_per_iter
        # express our flops throughput as ratio of A100 bfloat16 peak flops
        flops_achieved = flops_per_iter * (1.0/dt)  # per second
        flops_promised = 312e12  # A100 GPU bfloat16 peak flops is 312 TFLOPS
        mfu = flops_achieved / flops_promised
        return mfu

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """
        Take a conditioning sequence of indices idx (LongTensor of shape (b,t)) and complete
        the sequence max_new_tokens times, feeding the predictions back into the model each time.
        Most likely you'll want to make sure to be in model.eval() mode of operation for this.
        """
        for _ in range(max_new_tokens):
            # If the sequence context is growing too long we must crop it at block_size
            idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:]
            # Forward the model to get the logits for the index in the sequence
            logits, _ = self(idx_cond)
            # Pluck the logits at the final step and scale by desired temperature
            logits = logits[:, -1, :] / temperature
            # Optionally crop the logits to only the top k options
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float('Inf')
            # Apply softmax to convert logits to (normalized) probabilities
            probs = F.softmax(logits, dim=-1)
            # Sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)
            # Append sampled index to the running sequence and continue
            idx = torch.cat((idx, idx_next), dim=1)

        return idx
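
As a sanity check on `get_num_params()`, the parameter count can be reproduced by hand from the layer shapes defined above. The sketch below is plain arithmetic under the assumption that `wte` and `lm_head` share one weight matrix (as in the weight-tying line of `__init__`); `gpt_param_count` is an illustrative helper, not part of nanoGPT.

```python
def gpt_param_count(n_layer, n_embd, vocab_size, block_size, bias=True):
    """Count parameters of the GPT class above, with wte and lm_head tied (counted once)."""
    b = 1 if bias else 0
    wte = vocab_size * n_embd                      # token embedding (shared with lm_head)
    wpe = block_size * n_embd                      # position embedding
    layer_norm = n_embd + b * n_embd               # weight plus optional bias
    attn = (n_embd * 3 * n_embd + b * 3 * n_embd) + (n_embd * n_embd + b * n_embd)
    mlp = (n_embd * 4 * n_embd + b * 4 * n_embd) + (4 * n_embd * n_embd + b * n_embd)
    block = 2 * layer_norm + attn + mlp
    total = wte + wpe + n_layer * block + layer_norm  # last term is ln_f
    return total, total - wpe

total, non_embedding = gpt_param_count(12, 768, 50257, 1024)
print(f"{total:,} total, {non_embedding / 1e6:.2f}M reported")
```

For the `gpt2` configuration this gives 124,439,808 parameters in total, and 123.65M once the position embeddings are subtracted, which is the figure the constructor prints.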

## 3. Download and Prepare Shakespeare Dataset

We'll download the Tiny Shakespeare dataset (a plain-text sample of Shakespeare's plays of about 1.1 million characters, not the complete works) and prepare it for character-level language modeling.

In [None]:
# Download Shakespeare dataset
input_file_path = 'input.txt'

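if not os.path.exists(input_file_path):
    data_url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
    response = requests.get(data_url, timeout=30)
    response.raise_for_status()  # fail loudly instead of caching an error page as the dataset
    with open(input_file_path, 'w', encoding='utf-8') as f:
        f.write(response.text)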

with open(input_file_path, 'r', encoding='utf-8') as f:
    text = f.read()

print(f"Length of dataset in characters: {len(text):,}")
print(f"\nFirst 500 characters:\n{text[:500]}")

In [None]:
# Create character-level vocabulary
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(f"Vocabulary size: {vocab_size}")
print(f"All characters: {''.join(chars)}")

# Create mappings
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    return [stoi[c] for c in s]

def decode(l):
    return ''.join([itos[i] for i in l])

# Test encoding/decoding
print(f"\nExample encoding: {encode('hello')}")
print(f"Example decoding: {decode(encode('hello'))}")

In [None]:
# Prepare train and validation splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

print(f"Train size: {len(train_data):,} characters")
print(f"Val size: {len(val_data):,} characters")
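
Language modeling on this data means predicting each next token from the tokens before it, so the targets are simply the inputs shifted by one position. The toy sketch below shows that relationship on a short string; `make_example` and the `toy_*` names are illustrative and independent of the notebook's variables.

```python
def make_example(token_ids, start, block_size):
    """Return (inputs, targets) where targets are the inputs shifted one step ahead."""
    x = token_ids[start:start + block_size]
    y = token_ids[start + 1:start + block_size + 1]
    return x, y

toy_text = "hello world"
toy_chars = sorted(set(toy_text))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_ids = [toy_stoi[ch] for ch in toy_text]

x, y = make_example(toy_ids, start=0, block_size=4)
for t in range(len(x)):
    context = toy_text[:t + 1]
    print(f"context {context!r} -> target {toy_text[t + 1]!r}")
```

A single window of length `block_size` therefore yields `block_size` training examples, one per prefix.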

## 4. BPE Tokenization (GPT-2 Style)

In addition to character-level tokenization, nanoGPT supports BPE (Byte Pair Encoding) tokenization using tiktoken. This is what GPT-2 uses and is essential for working with pretrained models.

In [None]:
# BPE Tokenization using tiktoken (for GPT-2)
# Install with: pip install tiktoken

try:
    import tiktoken
    
    # Initialize GPT-2 BPE encoder
    enc = tiktoken.get_encoding("gpt2")
    
    def encode_bpe(s):
        """Encode string to BPE token ids"""
        return enc.encode(s, allowed_special={"<|endoftext|>"})
    
    def decode_bpe(l):
        """Decode BPE token ids to string"""
        return enc.decode(l)
    
    print("✓ tiktoken loaded successfully!")
    print(f"\nGPT-2 BPE vocab size: {enc.n_vocab}")
    
    # Test BPE encoding
    test_text = "Hello, world! This is GPT-2 tokenization."
    test_tokens = encode_bpe(test_text)
    print(f"\nExample text: {test_text}")
    print(f"BPE tokens: {test_tokens}")
    print(f"Number of tokens: {len(test_tokens)}")
    print(f"Decoded back: {decode_bpe(test_tokens)}")
    
    # Compare with character-level
    print(f"\nCharacter-level would use {len(test_text)} tokens")
    print(f"BPE uses {len(test_tokens)} tokens (shorter sequences for the same text)")
    
    bpe_available = True
    
except ImportError:
    print("tiktoken not installed. Install with: pip install tiktoken")
    print("You can still use character-level tokenization for Shakespeare.")
    bpe_available = False
    encode_bpe = None
    decode_bpe = None

### Tokenize Shakespeare with BPE

Let's tokenize the Shakespeare dataset using BPE encoding (like the original nanoGPT).

In [None]:
if bpe_available:
    # Tokenize Shakespeare data with BPE
    shakespeare_bpe_tokens = encode_bpe(text)
    print(f"Shakespeare dataset:")
    print(f"  Original characters: {len(text):,}")
    print(f"  BPE tokens: {len(shakespeare_bpe_tokens):,}")
    print(f"  Compression ratio: {len(text) / len(shakespeare_bpe_tokens):.2f}x")
    
    # Create train/val split for BPE tokens
    data_bpe = torch.tensor(shakespeare_bpe_tokens, dtype=torch.long)
    n_bpe = int(0.9 * len(data_bpe))
    train_data_bpe = data_bpe[:n_bpe]
    val_data_bpe = data_bpe[n_bpe:]
    
    print(f"\nBPE tokenized splits:")
    print(f"  Train: {len(train_data_bpe):,} tokens")
    print(f"  Val: {len(val_data_bpe):,} tokens")
else:
    print("Skipping BPE tokenization (tiktoken not available)")
    train_data_bpe = None
    val_data_bpe = None

## 5. Initialize Model (Small for Training)

For training from scratch on Shakespeare, we'll use a smaller model configuration.

In [None]:
# Choose tokenization method
use_bpe = bpe_available and True  # Set to False to force character-level

if use_bpe:
    print("Using BPE tokenization (GPT-2 style)")
    model_vocab_size = 50304  # GPT-2 vocab of 50257, padded to 50304 for efficiency
    train_data_active = train_data_bpe
    val_data_active = val_data_bpe
    encode_fn = encode_bpe
    decode_fn = decode_bpe
else:
    print("Using character-level tokenization")
    model_vocab_size = vocab_size
    train_data_active = train_data
    val_data_active = val_data
    encode_fn = encode
    decode_fn = decode

# Small config for training from scratch on Shakespeare
config_small = GPTConfig(
    block_size=256,      # Context length
    vocab_size=model_vocab_size,
    n_layer=6,
    n_head=6,
    n_embd=384,
    dropout=0.2,
    bias=False           # False is slightly better and faster
)

model = GPT(config_small)
model.to(device)

print(f"\nModel configuration:")
print(f"  Tokenization: {'BPE (GPT-2)' if use_bpe else 'Character-level'}")
print(f"  Vocab size: {config_small.vocab_size:,}")
print(f"  Block size (context length): {config_small.block_size}")
print(f"  Number of layers: {config_small.n_layer}")
print(f"  Number of heads: {config_small.n_head}")
print(f"  Embedding dimension: {config_small.n_embd}")
print(f"  Dropout: {config_small.dropout}")

### Training Configuration

In [None]:
# Training hyperparameters
batch_size = 64
max_iters = 5000
eval_interval = 500
learning_rate = 3e-4
eval_iters = 200
gradient_accumulation_steps = 1
grad_clip = 1.0

# Learning rate decay settings
decay_lr = True
warmup_iters = 100
lr_decay_iters = max_iters
min_lr = 3e-5

def get_batch(split):
    """Generate a small batch of data of inputs x and targets y"""
    data = train_data_active if split == 'train' else val_data_active
    ix = torch.randint(len(data) - config_small.block_size, (batch_size,))
    x = torch.stack([data[i:i+config_small.block_size] for i in ix])
    y = torch.stack([data[i+1:i+config_small.block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    """Estimate the average loss on train and val splits"""
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

def get_lr(it):
    """Learning rate decay scheduler (cosine with warmup)"""
    # 1) Linear warmup for warmup_iters steps
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) If it > lr_decay_iters, return min learning rate
    if it > lr_decay_iters:
        return min_lr
    # 3) In between, use cosine decay down to min learning rate
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    assert 0 <= decay_ratio <= 1
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # coeff ranges 0..1
    return min_lr + coeff * (learning_rate - min_lr)
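
To see the shape of this schedule, the following standalone sketch re-implements the same three phases with the hyperparameters as function arguments and prints the learning rate at a few iterations. The function name `cosine_lr` and its parameter names are chosen for this example; the default values mirror the settings above.

```python
import math

def cosine_lr(it, max_lr=3e-4, min_lr=3e-5, warmup_iters=100, decay_iters=5000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if it < warmup_iters:
        return max_lr * it / warmup_iters
    if it > decay_iters:
        return min_lr
    ratio = (it - warmup_iters) / (decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))  # goes from 1 down to 0
    return min_lr + coeff * (max_lr - min_lr)

for it in (0, 50, 100, 2550, 5000, 6000):
    print(f"iter {it:>4}: lr = {cosine_lr(it):.2e}")
```

Note that with this form of warmup the learning rate at iteration 0 is exactly zero, so the very first optimizer step leaves the weights unchanged.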

## 6. Training Loop

In [None]:
# Initialize optimizer
optimizer = model.configure_optimizers(
    weight_decay=0.1,
    learning_rate=learning_rate,
    betas=(0.9, 0.95),
    device_type=device
)

# Training loop
print("\nStarting training...\n")
X, Y = get_batch('train')  # Fetch the first batch
t0 = time.time()

for iter_num in range(max_iters):
    # Determine and set the learning rate for this iteration
    lr = get_lr(iter_num) if decay_lr else learning_rate
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    # Evaluate the loss on train/val sets
    if iter_num % eval_interval == 0:
        t_eval = time.time()
        losses = estimate_loss()
        print(f"step {iter_num}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
        # Shift the timer forward so evaluation time is not counted as training time
        t0 += time.time() - t_eval

    # Forward backward update, with optional gradient accumulation
    for micro_step in range(gradient_accumulation_steps):
        logits, loss = model(X, Y)
        loss = loss / gradient_accumulation_steps
        X, Y = get_batch('train')
        loss.backward()
    
    # Clip the gradient
    if grad_clip != 0.0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    
    # Step the optimizer
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    
    # Timing and logging (every 100 iterations)
    if iter_num % 100 == 0 and iter_num > 0:
        t1 = time.time()
        dt = (t1 - t0) / 100  # approximate average seconds per iteration since the last log
        t0 = t1
        lossf = loss.item() * gradient_accumulation_steps
        print(f"iter {iter_num}: loss {lossf:.4f}, time {dt*1000:.2f}ms/iter")
        # Estimate MFU if on GPU (fraction of A100 bfloat16 peak throughput)
        if device == 'cuda':
            mfu = model.estimate_mfu(batch_size * gradient_accumulation_steps, dt)
            print(f"  MFU: {mfu*100:.2f}%")

print("\nTraining completed!")
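
The MFU figure printed during training follows the formula in `estimate_mfu()`. The sketch below repeats that arithmetic in plain Python so the individual terms are visible. The parameter count of 30 million and the 0.1 s iteration time are assumed round numbers for illustration, not measurements, and both function names are hypothetical.

```python
def estimate_flops_per_iter(n_params, n_layer, n_head, n_embd, block_size, fwdbwd_per_iter):
    """FLOPs for one optimizer step, using the same approximation as estimate_mfu()."""
    head_size = n_embd // n_head
    # 6*N covers the matmul weights (forward + backward); the second term covers attention
    flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_size * block_size
    flops_per_fwdbwd = flops_per_token * block_size
    return flops_per_fwdbwd * fwdbwd_per_iter

def model_flops_utilization(flops_per_iter, seconds_per_iter, peak_flops=312e12):
    """Achieved FLOPs per second as a fraction of the hardware peak (A100 bfloat16 by default)."""
    return flops_per_iter / seconds_per_iter / peak_flops

flops = estimate_flops_per_iter(
    n_params=30_000_000,  # assumed round figure, roughly the small model's size
    n_layer=6, n_head=6, n_embd=384, block_size=256,
    fwdbwd_per_iter=64,   # batch_size * gradient_accumulation_steps
)
print(f"{flops:.3e} FLOPs per iteration")
print(f"MFU at 0.1 s/iter: {model_flops_utilization(flops, 0.1) * 100:.2f}%")
```

Because the peak is fixed at the A100 figure, the result on other GPUs is best read as a relative throughput measure rather than true hardware utilization.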

## 7. Text Generation from Trained Model

Generate text using the trained model with the chosen tokenization scheme.

In [None]:
model.eval()

# Generate from the model
context = "\n"  # Start with a newline
context_ids = torch.tensor(encode_fn(context), dtype=torch.long, device=device).unsqueeze(0)

print(f"Generated text from trained model (using {('BPE' if use_bpe else 'character-level')} tokenization):\n")
print("=" * 80)

# Generate multiple samples
num_samples = 3
max_new_tokens = 500

for i in range(num_samples):
    generated_ids = model.generate(context_ids, max_new_tokens=max_new_tokens, temperature=0.8, top_k=200)
    generated_text = decode_fn(generated_ids[0].tolist())
    print(f"\nSample {i+1}:")
    print("-" * 80)
    print(generated_text)
    print("-" * 80)
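
The `temperature` and `top_k` arguments used above reshape the next-token distribution before sampling. The pure-Python sketch below mirrors the steps in `generate()` for a single list of logits: divide by the temperature, discard everything below the k-th largest logit, apply softmax, then sample. The function `sample_top_k` and the example logits are illustrative only.

```python
import math
import random

def sample_top_k(logits, temperature=1.0, top_k=None, rng=random):
    """Return (sampled index, probabilities) after temperature scaling and top-k filtering."""
    scaled = [v / temperature for v in logits]
    if top_k is not None:
        k = min(top_k, len(scaled))
        cutoff = sorted(scaled, reverse=True)[k - 1]
        # Anything below the k-th largest logit gets zero probability
        scaled = [v if v >= cutoff else float("-inf") for v in scaled]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, probs

index, probs = sample_top_k([2.0, 1.0, 0.5, -1.0], temperature=0.8, top_k=2,
                            rng=random.Random(0))
print(index, [round(p, 3) for p in probs])
```

Lower temperatures sharpen the distribution toward the highest logit, while a smaller `top_k` removes the long tail of unlikely tokens entirely.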

## 8. Load Pretrained GPT-2 Model

Now let's demonstrate the `from_pretrained()` method to load a pretrained GPT-2 model.

In [None]:
# Load pretrained GPT-2
# Note: This requires the 'transformers' library: pip install transformers
# Uncomment the following lines to load a pretrained model

# model_gpt2 = GPT.from_pretrained('gpt2')
# model_gpt2.to(device)
# model_gpt2.eval()

print("To load a pretrained GPT-2 model, uncomment the lines above.")
print("Available models: 'gpt2' (124M), 'gpt2-medium' (350M), 'gpt2-large' (774M), 'gpt2-xl' (1558M)")

## 9. Generate Text with Pretrained GPT-2

Generate text using the pretrained GPT-2 model with BPE tokenization.

In [None]:
# Example of generating with pretrained GPT-2
if bpe_available:
    print("Loading pretrained GPT-2 model...")
    print("Note: This will download ~500MB on first run\n")
    
    # Uncomment to load and use:
    # model_gpt2 = GPT.from_pretrained('gpt2')
    # model_gpt2.to(device)
    # model_gpt2.eval()
    # 
    # # Generate text
    # prompt = "What is the answer to life, the universe, and everything?"
    # print(f"Prompt: {prompt}\n")
    # 
    # tokens = torch.tensor(encode_bpe(prompt), dtype=torch.long, device=device).unsqueeze(0)
    # 
    # with torch.no_grad():
    #     generated = model_gpt2.generate(tokens, max_new_tokens=100, temperature=0.8, top_k=200)
    #     generated_text = decode_bpe(generated[0].tolist())
    #     print("Generated text:")
    #     print("=" * 80)
    #     print(generated_text)
    #     print("=" * 80)
    
    print("Uncomment the code above to generate text with pretrained GPT-2.")
    print("\nYou can try prompts like:")
    print('  - "What is the answer to life, the universe, and everything?"')
    print('  - "Once upon a time"')
    print('  - "The future of artificial intelligence"')
else:
    print("tiktoken not available. Install with: pip install tiktoken transformers")
    print("Then you can load pretrained GPT-2 models and generate text.")

## 10. Demonstrate crop_block_size()

Show how to reduce the context length of a model.

In [None]:
# Example of cropping block size
# Useful if you load a pretrained model with block_size=1024 but want to use 256

print(f"Current block size: {model.config.block_size}")
print(f"Current position embedding shape: {model.transformer.wpe.weight.shape}")

# If we wanted to crop to 128 tokens (uncomment to try):
# model.crop_block_size(128)
# print(f"\nAfter cropping:")
# print(f"New block size: {model.config.block_size}")
# print(f"New position embedding shape: {model.transformer.wpe.weight.shape}")

print("\nThis is useful when you want to use a smaller context window than the original model.")

## 11. Fine-tuning Pretrained GPT-2 on Shakespeare

Example of how to fine-tune a pretrained GPT-2 model on your custom dataset.

In [None]:
# Fine-tuning outline (printed as instructions; this cell does not run any training)
# It summarizes the complete workflow:
# 1. Load pretrained GPT-2
# 2. Prepare data with BPE tokenization
# 3. Fine-tune with lower learning rate

print("""To fine-tune GPT-2 on Shakespeare:

1. Install dependencies:
   pip install transformers tiktoken

2. Load pretrained model:
   model = GPT.from_pretrained('gpt2')

3. Tokenize your data with tiktoken (BPE):
   import tiktoken
   enc = tiktoken.get_encoding('gpt2')
   tokens = enc.encode(text)

4. Train with smaller learning rate:
   learning_rate = 6e-5  # Lower than from-scratch training
   max_iters = 1000      # Fewer iterations needed

5. The fine-tuned model will generate Shakespeare-like text
   but with the knowledge from GPT-2's pretraining!
""")

## 12. Save and Load Model

In [None]:
# Save model checkpoint (nanoGPT format)
checkpoint = {
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'model_args': {
        'n_layer': config_small.n_layer,
        'n_head': config_small.n_head,
        'n_embd': config_small.n_embd,
        'block_size': config_small.block_size,
        'bias': config_small.bias,
        'vocab_size': config_small.vocab_size,
        'dropout': config_small.dropout,
    },
    'iter_num': max_iters,
    'best_val_loss': estimate_loss()['val'].item(),  # final validation loss; this notebook does not track a running best
    'config': {
        'dataset': 'shakespeare',
        'tokenization': 'bpe' if use_bpe else 'char',
    },
}

torch.save(checkpoint, 'ckpt.pt')
print("Model checkpoint saved to 'ckpt.pt'")

# Save tokenization metadata
if use_bpe:
    # For BPE, we don't need to save encoding (tiktoken handles it)
    print("Using BPE tokenization (tiktoken)")
else:
    # For character-level, save the character mappings
    meta = {
        'vocab_size': vocab_size,
        'itos': itos,
        'stoi': stoi,
    }
    with open('meta.pkl', 'wb') as f:
        pickle.dump(meta, f)
    print("Character mappings saved to 'meta.pkl'")

In [None]:
# Load model checkpoint (compatible with nanoGPT sample.py)
# `device` defaults to the notebook-level device so this also works on CPU-only machines
def load_checkpoint(checkpoint_path='ckpt.pt', device=device):
    """
    Load a model checkpoint in the same way as nanoGPT's sample.py
    Returns: model, encode_fn, decode_fn
    """
    checkpoint = torch.load(checkpoint_path, map_location=device)
    
    # Create model from checkpoint
    gptconf = GPTConfig(**checkpoint['model_args'])
    model = GPT(gptconf)
    
    # Load state dict
    state_dict = checkpoint['model']
    # Handle compiled models (remove '_orig_mod.' prefix)
    unwanted_prefix = '_orig_mod.'
    for k, v in list(state_dict.items()):
        if k.startswith(unwanted_prefix):
            state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
    
    model.load_state_dict(state_dict)
    model.to(device)
    model.eval()
    
    # Determine tokenization scheme
    tokenization = checkpoint.get('config', {}).get('tokenization', 'char')
    
    if tokenization == 'bpe':
        print("Loading BPE tokenization (tiktoken)")
        if not bpe_available:
            raise ImportError("tiktoken required for BPE. Install with: pip install tiktoken")
        encode_fn = encode_bpe
        decode_fn = decode_bpe
    else:
        print("Loading character-level tokenization")
        # Load meta.pkl for character mappings
        with open('meta.pkl', 'rb') as f:
            meta = pickle.load(f)
        stoi_loaded = meta['stoi']
        itos_loaded = meta['itos']
        encode_fn = lambda s: [stoi_loaded[c] for c in s]
        decode_fn = lambda l: ''.join([itos_loaded[i] for i in l])
    
    return model, encode_fn, decode_fn

# Example usage:
# loaded_model, encode_fn, decode_fn = load_checkpoint('ckpt.pt')
# prompt = "ROMEO:"
# tokens = torch.tensor(encode_fn(prompt), dtype=torch.long, device=device).unsqueeze(0)
# with torch.no_grad():
#     generated = loaded_model.generate(tokens, max_new_tokens=200)
#     print(decode_fn(generated[0].tolist()))

print("\nUse load_checkpoint() to reload a saved model with the correct tokenization.")
print("This is compatible with nanoGPT's checkpoint format!")
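
The `_orig_mod.` handling in `load_checkpoint` exists because a model wrapped by `torch.compile` reports its state dict keys with that prefix. The sketch below isolates the renaming step on a plain dictionary, with made-up keys and string placeholders instead of tensors; `strip_prefix` is a hypothetical helper name.

```python
def strip_prefix(state_dict, prefix='_orig_mod.'):
    """Return a copy of state_dict with `prefix` removed from every key that starts with it."""
    return {
        (k[len(prefix):] if k.startswith(prefix) else k): v
        for k, v in state_dict.items()
    }

# Placeholder values stand in for tensors (assumption, for illustration only).
compiled_style = {
    '_orig_mod.transformer.wte.weight': 'tensor-0',
    '_orig_mod.lm_head.weight': 'tensor-1',
    'transformer.wpe.weight': 'tensor-2',
}
cleaned = strip_prefix(compiled_style)
print(sorted(cleaned))
```

Building a new dictionary avoids mutating the mapping while iterating over it, which is the same reason the notebook version iterates over `list(state_dict.items())`.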

## 13. Model Analysis

In [None]:
# Analyze model parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print("Model Statistics:")
print(f"  Total parameters: {total_params:,}")
print(f"  Trainable parameters: {trainable_params:,}")
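# get_num_params() subtracts only the position embeddings; the token embeddings
# stay in the count because they are tied to lm_head
print(f"  Non-embedding parameters: {model.get_num_params():,}")
print("\nLayer breakdown:")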
for name, param in model.named_parameters():
    if param.requires_grad:
        print(f"  {name:50s} {str(list(param.shape)):20s} {param.numel():10,} params")
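
As a cross-check on the numbers printed above, the parameter count can also be derived from the configuration alone. The sketch below assumes the standard GPT-2 layout used in this notebook (tied token embedding and output head, a 4x MLP expansion, optional biases); `count_gpt_params` is a hypothetical helper, not a nanoGPT function.

```python
def count_gpt_params(n_layer, n_embd, vocab_size, block_size, bias=True):
    """Analytic parameter count for a GPT-2 style model with tied wte/lm_head."""
    b = 1 if bias else 0
    wte = vocab_size * n_embd      # token embeddings (shared with lm_head)
    wpe = block_size * n_embd      # position embeddings
    ln = n_embd + b * n_embd       # LayerNorm weight (+ optional bias)
    # Attention: fused q/k/v projection plus output projection.
    attn = (n_embd * 3 * n_embd + b * 3 * n_embd) + (n_embd * n_embd + b * n_embd)
    # MLP: expand to 4 * n_embd, then project back.
    mlp = (n_embd * 4 * n_embd + b * 4 * n_embd) + (4 * n_embd * n_embd + b * n_embd)
    per_block = 2 * ln + attn + mlp
    total = wte + wpe + n_layer * per_block + ln  # trailing ln is the final ln_f
    return total, total - wpe

# GPT-2 small configuration.
total, without_wpe = count_gpt_params(
    n_layer=12, n_embd=768, vocab_size=50257, block_size=1024, bias=True
)
print(f"total: {total:,}  excluding position embeddings: {without_wpe:,}")
```

If the analytic total disagrees with `sum(p.numel() for p in model.parameters())`, the usual suspects are an untied output head or a different `bias` setting.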

In [None]:
# Final evaluation
print("\nFinal evaluation:")
losses = estimate_loss()
print(f"  Train loss: {losses['train']:.4f}")
print(f"  Val loss: {losses['val']:.4f}")

## 14. Summary of All Features

This notebook includes all methods from the original nanoGPT `model.py`, plus tokenization support:

### Core Methods:
- ✅ `__init__()` - Initialize the model
- ✅ `forward()` - Forward pass with loss calculation
- ✅ `generate()` - Autoregressive text generation
- ✅ `configure_optimizers()` - Set up AdamW with weight decay

### Advanced Methods:
- ✅ `from_pretrained()` - Load pretrained GPT-2 weights (gpt2, gpt2-medium, gpt2-large, gpt2-xl)
- ✅ `crop_block_size()` - Shrink the context length (block size) of an existing model, such as a pretrained GPT-2
- ✅ `estimate_mfu()` - Estimate model FLOPs utilization (MFU) relative to A100 bfloat16 peak throughput
- ✅ `get_num_params()` - Count model parameters
- ✅ `_init_weights()` - Weight initialization
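
For reference, the MFU estimate can be sketched as a standalone function. It follows the FLOPs-per-token approximation that nanoGPT's `estimate_mfu()` is based on and compares achieved throughput against a peak of 312 TFLOPS, the A100 bfloat16 figure. The standalone signature and the example timing (`dt=4.0` seconds per iteration) are assumptions for illustration, not measured values.

```python
def estimate_mfu(n_params, n_layer, n_head, n_embd, block_size,
                 fwdbwd_per_iter, dt, peak_flops=312e12):
    """Estimate model FLOPs utilization as a fraction of peak_flops.

    fwdbwd_per_iter: number of sequences pushed through forward+backward per iteration.
    dt: wall-clock seconds per iteration.
    """
    head_dim = n_embd // n_head
    # 6 FLOPs per parameter per token, plus the attention term.
    flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_dim * block_size
    flops_per_fwdbwd = flops_per_token * block_size
    flops_per_iter = flops_per_fwdbwd * fwdbwd_per_iter
    flops_achieved = flops_per_iter / dt
    return flops_achieved / peak_flops

# Hypothetical numbers: a GPT-2 small sized model, 480 sequences per iteration.
mfu = estimate_mfu(n_params=123_653_376, n_layer=12, n_head=12, n_embd=768,
                   block_size=1024, fwdbwd_per_iter=480, dt=4.0)
print(f"MFU: {mfu * 100:.1f}%")
```

On hardware other than an A100, pass that device's peak throughput as `peak_flops`; otherwise the percentage is not meaningful.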

### Tokenization Support:
- ✅ **Character-level tokenization** - For simple datasets like Shakespeare
- ✅ **BPE tokenization (tiktoken)** - GPT-2 style, for pretrained models
- ✅ Automatic tokenization detection when loading checkpoints
- ✅ Compatible with nanoGPT's checkpoint format

### Key Features:
1. **Train from scratch** on character-level Shakespeare
2. **Train with BPE** tokenization (like GPT-2)
3. **Load pretrained GPT-2** and generate text immediately
4. **Fine-tune GPT-2** on custom datasets
5. **Reduce context length** with `crop_block_size()`
6. **Monitor efficiency** with MFU estimation
7. **Save/load models** in nanoGPT-compatible format

### Tokenization Comparison:
```python
# Character-level (simple, but inefficient)
text = "Hello, world!"
char_tokens = [stoi[c] for c in text]  # 13 tokens

# BPE (efficient, used by GPT-2)
bpe_tokens = encode_bpe(text)  # 4 tokens: [15496, 11, 995, 0]
```

### Experiment Ideas:
- Compare character-level vs BPE tokenization on Shakespeare
- Load GPT-2 and generate text on various prompts
- Fine-tune GPT-2 on Shakespeare with BPE tokenization
- Try different model sizes (gpt2 through gpt2-xl)
- Experiment with shorter context lengths using `crop_block_size()`
- Profile training efficiency with `estimate_mfu()`
- Train a small model with BPE to see the difference from character-level

### Dependencies:
```bash
pip install torch numpy
pip install tiktoken          # For BPE tokenization
pip install transformers       # For loading pretrained GPT-2
pip install requests          # For downloading datasets
```

For more information, visit: https://github.com/karpathy/nanoGPT