# InstructGPT

## Overview

The history of language models can be understood through one key constraint: **alignment with human intent**.

GPT-3 was powerful but hard to use. It often needed carefully engineered prompts to produce useful outputs, which put it out of reach for most users.

InstructGPT solved this with a simple insight: use human feedback to teach the model what we actually want.

This notebook implements a heavily simplified version of the InstructGPT pipeline to show how reinforcement learning from human feedback (RLHF) works.

## The Problem

Language models learn to predict the next token. That objective doesn't naturally align with being helpful.

GPT-3 would often:
- Ignore instructions
- Generate plausible but false information
- Produce unhelpful responses

The gap between "predicting text" and "being helpful" was the core problem.

## The Solution: RLHF

InstructGPT introduced a 3-step process:

**1. Supervised Fine-Tuning (SFT)**
- Collect examples of good responses
- Fine-tune the model on these examples

**2. Reward Model**  
- Have humans compare different responses
- Train a model to predict human preferences

**3. Reinforcement Learning**
- Optimize the SFT model's responses against the reward model (InstructGPT used PPO)
- Penalize drift away from the SFT model so that higher reward does not come at the cost of coherence
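
The balance in step 3 is commonly implemented as a KL-style penalty that keeps the RL policy close to the SFT model. The following is a minimal, framework-free sketch of that idea, not the paper's implementation: the function name `kl_shaped_reward` and the value of `beta` are assumptions chosen for illustration.

```python
from typing import Sequence


def kl_shaped_reward(
    reward: float,
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    beta: float = 0.1,  # hypothetical penalty coefficient
) -> float:
    """Reward-model score minus a KL-style penalty summed over response tokens.

    policy_logprobs[t] and ref_logprobs[t] are the log-probabilities that the
    RL policy and the frozen SFT (reference) model assign to the t-th sampled
    response token.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    # Single-sample estimate of KL(policy || reference) for this response
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate


# Identical policies incur no penalty
same = kl_shaped_reward(1.0, [-1.0, -2.0], [-1.0, -2.0])
# A policy that favors tokens the SFT model finds unlikely is penalized
drifted = kl_shaped_reward(1.0, [-0.5, -0.5], [-3.0, -4.0])
print(same, drifted)
```

A larger `beta` keeps the policy closer to the SFT model; a smaller one lets it chase reward more aggressively.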

Let's implement each step.

## Setup

In [None]:
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from datasets import Dataset
import numpy as np

# Setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model_name = "gpt2"  # Using small GPT-2 for speed

# Load tokenizer
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

## Step 1: Supervised Fine-Tuning

First, we teach the model to follow instructions using examples.
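
The training loop in this notebook computes the loss on every token, the instruction included. A common refinement is to train only on the response by setting the prompt positions of `labels` to `-100`, the index that PyTorch's cross-entropy loss ignores. The sketch below uses plain lists and made-up token ids; `mask_prompt_labels` is a hypothetical helper, not part of any library.

```python
from typing import List

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def mask_prompt_labels(input_ids: List[int], prompt_len: int) -> List[int]:
    """Copy input_ids into labels, hiding the first prompt_len positions."""
    if not 0 <= prompt_len <= len(input_ids):
        raise ValueError("prompt_len must be between 0 and len(input_ids)")
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


# Made-up ids: 4 prompt tokens followed by 3 response tokens
input_ids = [101, 7, 8, 102, 55, 56, 57]
labels = mask_prompt_labels(input_ids, prompt_len=4)
print(labels)  # [-100, -100, -100, -100, 55, 56, 57]
```

With the notebook's tokenizer, `prompt_len` could be approximated by tokenizing the `"Human: ...\nAssistant:"` prefix on its own, and the resulting list wrapped in a tensor before being passed as `labels`.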

In [None]:
# Load base GPT-2
base_model = GPT2LMHeadModel.from_pretrained(model_name).to(device)

# Create training examples
# In real InstructGPT, humans wrote thousands of these
examples = [
    {
        "instruction": "Explain quantum computing in one sentence",
        "response": "Quantum computing uses quantum physics to process information in ways classical computers cannot."
    },
    {
        "instruction": "What is 2+2?",
        "response": "2+2 equals 4."
    },
    {
        "instruction": "Write a haiku about coding",
        "response": "Bugs hide in the code\nPatience reveals the answer\nSuccess compiles slow"
    }
]

# Format for training
def format_example(ex):
    return {"text": f"Human: {ex['instruction']}\nAssistant: {ex['response']}"}

dataset = Dataset.from_list(examples).map(format_example)
print(f"Training on {len(dataset)} examples")

In [None]:
# Simple training loop
def train_sft(model, dataset, epochs=5):
    model.train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
    
    for epoch in range(epochs):
        total_loss = 0
        for example in dataset:
            # Tokenize
            inputs = tokenizer(example['text'], return_tensors='pt', 
                             max_length=128, truncation=True, padding=True)
            inputs = {k: v.to(device) for k, v in inputs.items()}
            
            # Train
            outputs = model(**inputs, labels=inputs['input_ids'])
            loss = outputs.loss
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
        
        print(f"Epoch {epoch+1}: Loss = {total_loss/len(dataset):.4f}")
    
    return model

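# Train SFT model on a freshly loaded copy: train_sft updates its argument
# in place, so passing base_model would make "base" and "SFT" the same object
sft_model = train_sft(GPT2LMHeadModel.from_pretrained(model_name).to(device), dataset)
print("\nSFT training complete!")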

In [None]:
# Test the SFT model
def generate(model, instruction):
    model.eval()  # disable dropout while generating
    prompt = f"Human: {instruction}\nAssistant:"
    inputs = tokenizer(prompt, return_tensors='pt').to(device)
    
    with torch.no_grad():
        # do_sample=True is needed for temperature to have any effect
        outputs = model.generate(**inputs, max_length=100, do_sample=True,
                               temperature=0.7, pad_token_id=tokenizer.eos_token_id)
    
    # Decode only the newly generated tokens, not the prompt
    new_tokens = outputs[0][inputs['input_ids'].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

# Compare base vs SFT
test_instruction = "What is machine learning?"

print(f"Instruction: {test_instruction}")
print(f"\nBase GPT-2: {generate(base_model, test_instruction)}")
print(f"\nSFT Model: {generate(sft_model, test_instruction)}")

## Step 2: Reward Model

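Now we need to teach a model to recognize good responses. In the real InstructGPT pipeline, human labelers ranked several candidate responses per prompt, and those rankings were turned into pairwise comparisons for training.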
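
A reward model is typically trained on such comparisons with a pairwise ranking loss, `-log(sigmoid(r_preferred - r_rejected))`, which pushes the preferred response's score above the rejected one's. The sketch below is framework-free and purely illustrative; `ranking_to_pairs` and `pairwise_loss` are hypothetical names, and the reward values are made up.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def ranking_to_pairs(ranked: Sequence[str]) -> List[Tuple[str, str]]:
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() preserves input order, so the first item of each pair
    # always ranks higher than the second
    return list(combinations(ranked, 2))


def pairwise_loss(reward_preferred: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_preferred - r_rejected)) in a numerically stable form."""
    diff = reward_preferred - reward_rejected
    # -log(sigmoid(d)) equals log(1 + exp(-d)); branch to avoid overflow
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


pairs = ranking_to_pairs(["A", "B", "C"])
print(pairs)  # [('A', 'B'), ('A', 'C'), ('B', 'C')]

# The loss is small when the preferred response already scores higher,
# and large when the ordering is wrong
good_order = pairwise_loss(2.0, 0.0)
bad_order = pairwise_loss(0.0, 2.0)
print(good_order < bad_order)  # True
```

A ranking of K responses yields K(K-1)/2 pairs, so a handful of rankings already produces a useful number of training comparisons.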

In [None]:
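# Simple reward model
class RewardModel(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.base = base_model
        # Scalar head on the transformer output (GPT-2 small hidden size = 768)
        self.reward_head = nn.Linear(base_model.config.hidden_size, 1)
        
    def forward(self, input_ids, attention_mask):
        outputs = self.base.transformer(input_ids, attention_mask=attention_mask)
        # Use the hidden state of the last non-padding token in each sequence
        # (assumes right padding, the GPT2Tokenizer default)
        last_idx = attention_mask.sum(dim=1) - 1
        batch_idx = torch.arange(input_ids.size(0), device=input_ids.device)
        last_hidden = outputs.last_hidden_state[batch_idx, last_idx]
        reward = self.reward_head(last_hidden)
        return reward

# Create reward model (left untrained in this notebook; shown for its architecture)
reward_model = RewardModel(GPT2LMHeadModel.from_pretrained(model_name)).to(device)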

# Simulate preference data
# In reality, humans would rank responses and the reward model would be
# trained on those rankings; this heuristic stands in for that signal
def score_response(response):
    """Simple heuristic to simulate human preferences"""
    score = 0
    words = response.split()
    if 10 < len(response) < 100:  # Good length
        score += 1
    if response.endswith('.'):  # Complete sentence
        score += 1
    # Guard against empty responses before computing the unique-word ratio
    if words and len(set(words)) / len(words) > 0.7:  # Not repetitive
        score += 1
    return score

print("Reward model created")

## Step 3: Reinforcement Learning

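Finally, we optimize the model using the reward signal. InstructGPT used PPO; the loop here is a much simpler reward-weighted likelihood update (REINFORCE-style), with the `score_response` heuristic standing in for a trained reward model.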
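
For comparison, the clipped surrogate objective at the heart of PPO can be sketched for a single action in a few lines. This is an illustrative, framework-free version under stated assumptions: `ppo_clip_loss` is a hypothetical name, `clip_eps=0.2` is a commonly used default rather than a value taken from InstructGPT, and the advantage is supplied by the caller.

```python
import math


def ppo_clip_loss(
    logp_new: float,
    logp_old: float,
    advantage: float,
    clip_eps: float = 0.2,  # assumed clipping range
) -> float:
    """Negative PPO clipped surrogate for one action (lower is better)."""
    # Probability ratio between the updated policy and the sampling policy
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Take the more pessimistic of the clipped and unclipped objectives
    return -min(ratio * advantage, clipped_ratio * advantage)


# Policy unchanged: the loss is simply the negative advantage
unchanged = ppo_clip_loss(0.0, 0.0, advantage=1.0)
# Probability doubled on a good action: the gain is capped at 1 + clip_eps
capped = ppo_clip_loss(math.log(2.0), 0.0, advantage=1.0)
# Probability doubled on a bad action: the full penalty applies, uncapped
penalized = ppo_clip_loss(math.log(2.0), 0.0, advantage=-1.0)
print(unchanged, capped, penalized)
```

The clipping removes the incentive to move the policy far from the one that generated the samples in a single update, which is the stability property a plain reward-weighted update lacks.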

In [None]:
# Simple RL training (reward-weighted update, far simpler than real PPO)
def train_with_rewards(model, reward_model, instructions, steps=10):
    # reward_model is accepted for illustration but unused here;
    # the score_response heuristic supplies the reward instead
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    
    for step in range(steps):
        instruction = np.random.choice(instructions)
        
        # Generate response
        response = generate(model, instruction)
        
        # Get reward
        reward = score_response(response)
        
        # Update model (simplified - real PPO is more complex)
        prompt = f"Human: {instruction}\nAssistant: {response}"
        inputs = tokenizer(prompt, return_tensors='pt', truncation=True, max_length=128)
        inputs = {k: v.to(device) for k, v in inputs.items()}
        
        outputs = model(**inputs, labels=inputs['input_ids'])
        # outputs.loss is the negative log-likelihood, so minimizing
        # reward * loss raises the likelihood of high-reward responses
        loss = reward * outputs.loss
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        if step % 2 == 0:
            print(f"Step {step}: Reward = {reward}")
    
    return model

# Train with RL
rl_model = GPT2LMHeadModel.from_pretrained(model_name).to(device)
rl_model.load_state_dict(sft_model.state_dict())  # Start from SFT

training_instructions = [
    "Explain gravity",
    "What is Python?",
    "How do computers work?"
]

rl_model = train_with_rewards(rl_model, reward_model, training_instructions)
print("\nRL training complete!")

## Results

In [None]:
# Compare all three models
test_instruction = "How does the internet work?"

print(f"Instruction: {test_instruction}\n")

# Reload base model for fair comparison
base_model = GPT2LMHeadModel.from_pretrained(model_name).to(device)

print("Base GPT-2:")
print(generate(base_model, test_instruction))
print("\n" + "-"*50 + "\n")

print("After SFT:")
print(generate(sft_model, test_instruction))
print("\n" + "-"*50 + "\n")

print("After RLHF:")
print(generate(rl_model, test_instruction))

## Key Insights

**What made InstructGPT successful:**

1. **Quality over quantity**: Tens of thousands of human-labeled prompts, a tiny dataset next to the pretraining corpus, changed model behavior more than additional scale did

2. **Alignment matters**: Teaching models what we want is as important as making them more capable

3. **Iterative improvement**: Each step (SFT → Reward Model → RL) builds on the previous

**The impact:**

- Human labelers preferred outputs from the 1.3B-parameter InstructGPT model over outputs from the 175B-parameter GPT-3
- This approach enabled ChatGPT and the current AI revolution
- RLHF remains a core technique for aligning modern LLMs

The constraint wasn't compute or model size - it was alignment with human intent.

## Summary

We implemented a simplified InstructGPT:

✓ **SFT**: Taught the model to follow instructions  
✓ **Reward Model**: Defined a reward-model architecture and simulated human preferences with a heuristic  
✓ **RL**: Optimized the model against that reward signal as a proxy for human preferences  

This simple process transformed an unpredictable text generator into a helpful assistant.

The real breakthrough wasn't making models bigger - it was making them do what we actually want.