## Fine-tuning GPT-2 With your data. 

This notebook demonstrates how to fine-tune the smallest GPT-2 model (124M parameters) on your custom text data. Unlike training from scratch, fine-tuning starts with a pre-trained model and adapts it to your needs.

Important: Make sure you run every cell in this workbook by using the "Play" button on the right-hand side of each cell before moving on to the next one.
If you have to restart the program for some reason, you might have to run the cells again.


## Prepare Your Data
Place your text data in a file called input.txt in the same directory as this notebook. The text should be clean and representative of what you want the model to learn.


In [None]:
import os
import sys
import time
import torch
import numpy as np
import importlib.util
from pathlib import Path

# Add the project root to Python path if needed
# sys.path.append(os.path.abspath('..'))

# Check input data
input_file = 'alice.txt'
if not os.path.exists(input_file):
    raise FileNotFoundError(f"Please ensure {input_file} exists in the current directory")

print(f"Fine-tuning with data from: {input_file}")

# Create configuration
config = {
    # I/O
    'out_dir': f'out-{input_file.split(".")[0]}-finetune',
    'eval_interval': 5,
    'eval_iters': 40,
    'always_save_checkpoint': False,
    'init_from': 'gpt2',
    
    # wandb (optional)
    'wandb_log': False,
    'wandb_project': f'{input_file.split(".")[0]}-finetune',
    'wandb_run_name': f'ft-{time.time()}',
    
    # data
    'dataset': input_file.split('.')[0],
    'gradient_accumulation_steps': 32,
    'batch_size': 1,
    'block_size': 1024,
    
    # optimizer
    'learning_rate': 3e-5,
    'max_iters': 20,
    'weight_decay': 1e-2,
    'beta1': 0.9,
    'beta2': 0.95,
    'grad_clip': 1.0,
    
    # learning rate decay
    'decay_lr': False,
    'warmup_iters': 0,
    'lr_decay_iters': 20,  # same as max_iters
    'min_lr': 3e-6,  # 1/10th of learning_rate
    
    # system
    'device': 'cuda' if torch.cuda.is_available() else 'cpu',
    'dtype': 'float16',
    'compile': False,
}

print(f"Configuration ready for fine-tuning")

Length of dataset in characters: 148,043
First 500 characters of your data:
﻿Alice's Adventures in Wonderland

                ALICE'S ADVENTURES IN WONDERLAND

                          Lewis Carroll

               THE MILLENNIUM FULCRUM EDITION 3.0




                            CHAPTER I

                      Down the Rabbit-Hole


  Alice was beginning to get very tired of sitting by her sister
on the bank, and of having nothing to do:  once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is t


## Configure your data

In [19]:
out_dir = 'out-alice-finetune-notebook'
eval_interval = 5
log_interval = 1
eval_iters = 40 # 
always_save_checkpoint = True # Save checkpoint if validation loss improves
init_from = 'gpt2' # 'scratch' or 'resume' or 'gpt2*'
max_iters = 20

# wandb logging (optional)
wandb_log = False
wandb_project = 'alice-finetune-notebook'
wandb_run_name = 'ft-notebook-' + str(time.time()) # Requires import time if not already done

# Data
dataset = input_file.split('.')[0] # e.g., 'alice' from 'alice.txt'
gradient_accumulation_steps = 32
batch_size = 1 # Phsyical batch size
block_size = 1024

# Model (these are for 'gpt2' and will be overridden by from_pretrained, but good for reference)
n_layer = 12
n_head = 12
n_embd = 768
dropout = 0.0 # For fine-tuning, can increase if overfitting, 0.0 for gpt2 pretrain
bias = False # GPT-2 uses bias, but nanoGPT model.py allows False

# AdamW Optimizer
learning_rate = 3e-5  # Main learning rate       # Total number of training iterations.
weight_decay = 1e-2
beta1 = 0.9
beta2 = 0.95
grad_clip = 1.0 # Clip gradients at this value, or 0.0 to disable

# Learning rate decay settings
decay_lr = True # Whether to decay the learning rate
warmup_iters = 0 # How many steps to warm up for, 0 for fine-tuning is often ok
lr_decay_iters = max_iters # Should be ~= max_iters per Chinchilla
min_lr = learning_rate / 10 # Minimum learning rate, should be ~= learning_rate/10 per Chinchilla

# System
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# dtype = 'bfloat16' if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else 'float16' # 'float32', 'bfloat16', or 'float16'
dtype = 'float16' # For broader compatibility, use float16. Change to bfloat16 if your GPU supports it and you prefer.
# Note: float16 requires a GradScaler. bfloat16 typically does not.
compile_model = False # Requires PyTorch 2.0+

# Ensure output directory exists
if not os.path.exists(out_dir):
    os.makedirs(out_dir)

print(f"Configuration loaded:")
print(f"  Output directory: {out_dir}")
print(f"  Dataset: {dataset}")
print(f"  Device: {device}, dtype: {dtype}, compile: {compile_model}")
print(f"  Max iterations: {max_iters}")
print(f"  Learning rate: {learning_rate}")

Configuration loaded:
  Output directory: out-alice-finetune-notebook
  Dataset: alice
  Device: cuda, dtype: float16, compile: False
  Max iterations: 20
  Learning rate: 3e-05


## Tokenize the Data

Instead of direct integer tokenization, we use *TikTok* which is also used by OpenAI to tokenize.

In [20]:
n = len(text)
train_data = text[:int(n*0.9)]
val_data = text[int(n*0.9):]

# Encode with tiktoken GPT-2 BPE (exactly like nanoGPT)
enc = tiktoken.get_encoding("gpt2")
train_ids = enc.encode_ordinary(train_data)
val_ids = enc.encode_ordinary(val_data)

print(f"Train has {len(train_ids):,} tokens")
print(f"Validation has {len(val_ids):,} tokens")

# Get vocabulary size from the tokenizer
vocab_size = enc.n_vocab
print(f"Vocabulary size: {vocab_size:,} tokens")

# Export to binary files (named after your dataset)
dataset_name = input_file.split('.')[0]  # e.g., 'alice' from 'alice.txt'
global_train_file = f'{dataset}_train.bin'
global_val_file = f'{dataset}_val.bin'

train_ids = np.array(train_ids, dtype=np.uint16)
val_ids = np.array(val_ids, dtype=np.uint16)
train_ids.tofile(global_train_file)
val_ids.tofile(global_val_file)

print(f"Created {global_train_file} and {global_val_file} binary files in the current directory.")


Train has 38,141 tokens
Validation has 4,189 tokens
Vocabulary size: 50,257 tokens
Created alice_train.bin and alice_val.bin binary files in the current directory.


## Step 3: Set up batching



In [21]:
def get_batch(split, batch_size_gb=1, block_size_gb=1024): # Renamed params to avoid conflict if config vars are global
    """Load a batch from the binary files (nanoGPT style)"""
    # Use dataset-specific file names (now using global_train_file, global_val_file)
    if split == 'train':
        data = np.memmap(global_train_file, dtype=np.uint16, mode='r')
    else:
        data = np.memmap(global_val_file, dtype=np.uint16, mode='r')
    
    # Generate random starting positions
    ix = torch.randint(len(data) - block_size_gb, (batch_size_gb,))
    
    # Create input and target sequences
    x = torch.stack([torch.from_numpy((data[i:i+block_size_gb]).astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy((data[i+1:i+1+block_size_gb]).astype(np.int64)) for i in ix])
    
    return x, y

x_test, y_test = get_batch('train', batch_size_gb=batch_size, block_size_gb=8) # Use config batch_size

print("Sample input tokens:", x_test[0].tolist())
print("Sample target tokens:", y_test[0].tolist())
print("Decoded input:", enc.decode(x_test[0].tolist()))


Sample input tokens: [10010, 13679, 319, 543, 262, 41857, 367, 1436]
Sample target tokens: [13679, 319, 543, 262, 41857, 367, 1436, 198]
Decoded input:  concert!' on which the wretched Hatter


## Step 4: Configuring device & Loading the model

smol GPT - precursor to what we know as ChatGPT in a sense

In [22]:

# device = 'cuda' if torch.cuda.is_available() else 'cpu' # Now from config
print(f"Using device: {device}")

# Initialize model from pre-trained GPT-2 (using nanoGPT's architecture)
print(f"Loading pre-trained model: {init_from}...")
if init_from == 'scratch':
    # init a new model from scratch
    print("Initializing a new model from scratch")
    gptconf = GPTConfig(block_size=block_size, vocab_size=vocab_size, n_layer=n_layer, n_head=n_head, n_embd=n_embd, dropout=dropout, bias=bias)
    model = GPT(gptconf)
elif init_from.startswith('gpt2'):
    model = GPT.from_pretrained(init_from, dict(dropout=dropout))
else:
    print(f"Resuming training from {init_from}")
    # TODO: Implement resume logic if needed, similar to train.py
    ckpt_path = os.path.join(out_dir, init_from if init_from != 'resume' else 'ckpt.pt')
    checkpoint = torch.load(ckpt_path, map_location=device)
    gptconf = GPTConfig(**checkpoint['model_args'])
    model = GPT(gptconf)
    state_dict = checkpoint['model']
    # fix state dict keys if needed (e.g. DDP)
    unwanted_prefix = '_orig_mod.'
    for k,v in list(state_dict.items()):
        if k.startswith(unwanted_prefix):
            state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
    model.load_state_dict(state_dict)


model = model.to(device)
# model.train() # Will be set in training loop

if compile_model:
    print("Compiling the model... (takes a ~minute)")
    try:
        model = torch.compile(model) # requires PyTorch 2.0
        print("Model compiled successfully.")
    except Exception as e:
        print(f"Model compilation failed: {e}. Proceeding without compilation.")
        compile_model = False # Fallback

print(f"Model has {sum(p.numel() for p in model.parameters())/1e6:.2f}M parameters")

print(f"Model configuration:")
print(f"  - Layers: {model.config.n_layer}")
print(f"  - Heads: {model.config.n_head}")  
print(f"  - Embedding dimension: {model.config.n_embd}")
print(f"  - Block size: {model.config.block_size}")
print(f"  - Vocabulary size: {model.config.vocab_size}")


Using device: cuda
Loading pre-trained model: gpt2...
loading weights from pretrained gpt: gpt2
forcing vocab_size=50257, block_size=1024, bias=True
overriding dropout rate to 0.0
number of parameters: 123.65M
Model has 124.44M parameters
Model configuration:
  - Layers: 12
  - Heads: 12
  - Embedding dimension: 768
  - Block size: 1024
  - Vocabulary size: 50257


## Step 5: Set hyperparameters



In [23]:
# Training hyperparameters (following nanoGPT's fine-tuning approach)
batch_size = 1                      # Small physical batch size for memory efficiency
gradient_accumulation_steps = 32    # Accumulate gradients (effective batch = 32)
block_size = 1024                   # Full GPT-2 context length
learning_rate = 3e-5                # Conservative fine-tuning rate
eval_interval = 5                   # Evaluate every 5 iterations
eval_iters = 40                     # Number of iterations for evaluation

print(f"Effective batch size: {batch_size * gradient_accumulation_steps}")
print(f"Context length (block_size): {block_size} tokens")
print(f"Initial learning rate: {learning_rate}")

# Create optimizer (following nanoGPT's approach)
optimizer = model.configure_optimizers(
    weight_decay=weight_decay, # from config
    learning_rate=learning_rate, # from config
    betas=(beta1, beta2), # from config
    device_type=device # from config
)

# @torch.no_grad() ... estimate_loss function remains the same for now
# ... rest of the cell (estimate_loss and its test call) ...
# Make sure get_batch in estimate_loss uses the correct block_size and batch_size from config
@torch.no_grad()
def estimate_loss():
    """Evaluate the model on both training and validation data"""
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters) # uses config eval_iters
        for k in range(eval_iters):
            # Pass config batch_size and block_size
            X, Y = get_batch(split, batch_size_gb=batch_size, block_size_gb=block_size)
            X, Y = X.to(device), Y.to(device)
            # Autocast for mixed precision if not CPU
            ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
            ctx = torch.amp.autocast(device_type=device, dtype=ptdtype) if device != 'cpu' else nullcontext()
            with ctx:
                logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

# Test evaluation function
print("Testing evaluation function...")
test_losses = estimate_loss()
print(f"Initial training loss: {test_losses['train']:.4f}")
print(f"Initial validation loss: {test_losses['val']:.4f}")


Effective batch size: 32
Context length (block_size): 1024 tokens
Initial learning rate: 3e-05
num decayed parameter tensors: 50, with 124,318,464 parameters
num non-decayed parameter tensors: 98, with 121,344 parameters
using fused AdamW: True
Testing evaluation function...
Initial training loss: 3.0043
Initial validation loss: 3.1948


## Step 6: Learning rate scheduling

In [24]:
def get_lr(it):
    # 1) linear warmup for warmup_iters steps
    if it < warmup_iters: # from config
        return learning_rate * it / warmup_iters # learning_rate, warmup_iters from config
    # 2) if it > lr_decay_iters, return min_lr
    if it > lr_decay_iters: # from config
        return min_lr # from config
    # 3) in between, use cosine decay down to min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    assert 0 <= decay_ratio <= 1
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio)) # coeff ranges 0..1
    return min_lr + coeff * (learning_rate - min_lr)

In [None]:
import time # ensure time is imported

print(f"Starting fine-tuning for {max_iters} iterations...")

# Autocast and GradScaler setup
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
ctx = torch.amp.autocast(device_type=device, dtype=ptdtype) if device != 'cpu' else nullcontext()
# GradScaler for float16
scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16' and device == 'cuda'))

start_time = time.time()
iter_num = 0 # Use iter_num to align with train.py
best_val_loss = float('inf')
local_iter_num = 0 # Number of iterations for this training run

# Track metrics for analysis (can be expanded)
train_losses_log = []
val_losses_log = []
iter_log = []

model.train() # Ensure model is in training mode

# Training loop
for iter_num in range(max_iters): # Loop up to max_iters
    
    # Determine and set the learning rate for this iteration
    lr = get_lr(iter_num) if decay_lr else learning_rate
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    # Evaluate the model on train/val sets and write checkpoints
    if iter_num % eval_interval == 0 and iter_num > 0 : # Use config eval_interval
        losses = estimate_loss()
        elapsed_time = time.time() - start_time
        print(f"step {iter_num}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}, lr {lr:.2e}, time {elapsed_time:.2f}s")
        
        train_losses_log.append(losses['train'].item()) # .item() to get float
        val_losses_log.append(losses['val'].item())
        iter_log.append(iter_num)

        if wandb_log:
            try:
                wandb.log({
                    "iter": iter_num,
                    "train/loss": losses['train'],
                    "val/loss": losses['val'],
                    "lr": lr,
                })
            except Exception as e:
                print(f"Wandb logging failed: {e}")
        
        if losses['val'] < best_val_loss or always_save_checkpoint: # always_save_checkpoint from config
            best_val_loss = losses['val']
            if iter_num > 0:
                checkpoint = {
                    'model': model.state_dict(),
                    'optimizer': optimizer.state_dict(),
                    'model_args': model.config, # Save GPTConfig object directly
                    'iter_num': iter_num,
                    'best_val_loss': best_val_loss,
                    'config': {k: v for k, v in globals().items() if k in ['learning_rate', 'batch_size', 'block_size', 'dataset', 'max_iters', 'dropout', 'grad_clip', 'weight_decay', 'beta1', 'beta2', 'n_layer', 'n_head', 'n_embd']}, # save relevant config
                }
                print(f"saving checkpoint to {out_dir}")
                torch.save(checkpoint, os.path.join(out_dir, 'ckpt.pt'))
                if losses['val'] == best_val_loss and not always_save_checkpoint: # only print if it's a new best and not always saving
                     print(f"  → Saved new best model (val_loss: {best_val_loss:.4f})")


    # Forward backward update, with gradient accumulation...
    optimizer.zero_grad(set_to_none=True)
    for micro_step in range(gradient_accumulation_steps): # gradient_accumulation_steps from config
        X, Y = get_batch('train', batch_size_gb=batch_size, block_size_gb=block_size) # Use config vars
        X, Y = X.to(device), Y.to(device)
        with ctx:
            logits, loss = model(X, Y)
            loss = loss / gradient_accumulation_steps # Scale loss for accumulation
        
        # Backward pass, with gradient scaling if training in fp16
        scaler.scale(loss).backward()

    # Clip gradients
    if grad_clip != 0.0: # grad_clip from config
        scaler.unscale_(optimizer) # unscale before clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    
    # Step the optimizer and scaler if training in fp16
    scaler.step(optimizer)
    scaler.update()
    
    local_iter_num +=1

    if iter_num >= max_iters:
        break

# Final evaluation
elapsed = time.time() - start_time
final_losses = estimate_loss()

print(f"\nFine-tuning completed!")
print(f"Total time: {elapsed:.1f} seconds ({local_iter_num} iterations performed in this run)")
print(f"Final training loss: {final_losses['train']:.4f}")
print(f"Final validation loss: {final_losses['val']:.4f}")
print(f"Best validation loss achieved: {best_val_loss:.4f}")
print(f"Processing speed: {local_iter_num / elapsed:.2f} iterations per second")

# Plotting losses (optional)
import matplotlib.pyplot as plt
if iter_log: # Check if any evaluations were done
    plt.figure(figsize=(10, 4))
    plt.plot(iter_log, train_losses_log, label='Train Loss')
    plt.plot(iter_log, val_losses_log, label='Validation Loss')
    plt.xlabel('Iteration')
    plt.ylabel('Loss')
    plt.legend()
    plt.title('Training and Validation Loss Over Iterations')
    plt.savefig(os.path.join(out_dir, 'loss_plot.png'))
    plt.show()


Starting fine-tuning for 20 iterations...


  scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16' and device == 'cuda'))


step 5: train loss 2.7919, val loss 2.9311, lr 2.60e-05, time 6.06s
saving checkpoint to out-alice-finetune-notebook
step 10: train loss 2.6741, val loss 2.8495, lr 1.65e-05, time 13.67s
saving checkpoint to out-alice-finetune-notebook
step 15: train loss 2.6604, val loss 2.8628, lr 6.95e-06, time 21.09s
saving checkpoint to out-alice-finetune-notebook

Fine-tuning completed!
Total time: 27.6 seconds (20 iterations performed in this run)
Final training loss: 2.5456
Final validation loss: 2.7631
Best validation loss achieved: 2.8628
Processing speed: 0.73 iterations per second


ModuleNotFoundError: No module named 'matplotlib'

## Step 7: Generating

In [None]:
def generate_text(prompt_text, model_to_use, max_new_tokens=500, temperature=1.0, top_k=None):
    # ... (rest of the function is the same, but ensure it uses model_to_use) ...
    # Encode the prompt text into tokens
    tokens = enc.encode_ordinary(prompt_text)
    
    # Convert to tensor and move to the correct device
    context = torch.tensor([tokens], dtype=torch.long, device=device)
    
    # Generate new tokens using nanoGPT's generate method
    # model.eval() # model_to_use should already be in eval mode
    with torch.no_grad():
        generated = model_to_use.generate(context, max_new_tokens=max_new_tokens, 
                                 temperature=temperature, top_k=top_k)
    # ... (rest of the function) ...
    generated_tokens = generated[0].tolist()
    generated_text = enc.decode(generated_tokens)
    return generated_text


# Test generation with different prompts based on your dataset
if 'alice' in dataset_name.lower():
    test_prompts = ["Alice was", "The Queen", "Down the rabbit"]
elif 'shakespeare' in dataset_name.lower():
    test_prompts = ["To be or not", "Romeo and", "All the world"]
else:
    test_prompts = ["Once upon a time", "In the beginning", "The story"]

print("Testing text generation with different prompts:\n")

for prompt in test_prompts:
    print(f"Prompt: '{prompt}'")
    print("-" * 50)
    generated = generate_text(prompt, max_new_tokens=200, temperature=0.8)
    print(generated)
    print("\n" + "="*80 + "\n")



Testing text generation with different prompts:

Prompt: 'Alice was'
--------------------------------------------------
Alice was to cut off his nose, and he did not look at her as she had done an hour ago, and he said to himself: "Augh! how may I be so angry as this little dog? How can I possibly be so angry at a dog?"

"Why, now that I think about it, you're a little bit mad!! I must see you again! You may be ill-tempered, very much so! Or you might as well be gentle and kind to a dog, as very mad: and you'll have to begin from the beginning."

"I must say I have rather great surprise in me than any things I've ever used to do: and I've said it to a great many people that I feel very little respect for dogs, that I think only some of them would ever be very, very fond of Indian dogs, and that, if any of them were to do it, I would be very glad to have them in


Prompt: 'The Queen'
--------------------------------------------------
The Queen of Hearts…"

"Yes, Queen of Hearts…"

Once 