# InstructGPT: Reinforcement Learning from Human Feedback (RLHF)

## The Problem with Large Language Models

[GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) and [GPT-3](https://arxiv.org/abs/2005.14165) demonstrated impressive language modeling capabilities, achieving state-of-the-art results on many NLP benchmarks. However, they had a critical limitation: **they were difficult to use in practical scenarios**.

By default, GPT-3 was a few-shot learner: it needed to be prompted with examples of the desired behavior before it would generate useful responses. This made the models impractical for most users, who couldn't be expected to craft elaborate prompts.
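
To make the contrast concrete, here is a minimal sketch of what few-shot prompting looks like next to a plain instruction. The `Q:`/`A:` template and the helper name `build_few_shot_prompt` are hypothetical choices for illustration, not a format taken from the GPT-3 paper.

```python
def build_few_shot_prompt(examples, query):
    """Join (question, answer) demonstrations, then leave the final answer blank."""
    blocks = [f"Q: {question}\nA: {answer}" for question, answer in examples]
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)


# Made-up demonstrations that show the model the desired behavior
demos = [
    ("Translate 'cat' to French.", "chat"),
    ("Translate 'dog' to French.", "chien"),
]

# What a base model typically needs: demonstrations followed by the real query
few_shot = build_few_shot_prompt(demos, "Translate 'bird' to French.")

# What users would rather type: just the instruction
zero_shot = "Translate 'bird' to French."

print(few_shot)
```

An instruction-tuned model is meant to handle the `zero_shot` form directly, without the demonstrations.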

The models would often:
- Generate plausible but incorrect information
- Fail to follow instructions
- Produce biased or harmful content
- Not align with user intent

### Enter RLHF

**Reinforcement Learning from Human Feedback (RLHF)** was the breakthrough that transformed these powerful but unwieldy models into practical assistants. The key insight: we can use human preferences as a reward signal to fine-tune models to be more helpful, harmless, and honest.

[InstructGPT](https://arxiv.org/abs/2203.02155) (Ouyang et al., 2022) applied RLHF to GPT-3, creating a model that:
- **Follows instructions** better than GPT-3
- **Generates more helpful** responses  
- **Produces less harmful** content
- Is **preferred by human labelers** even with over 100x fewer parameters (1.3B InstructGPT vs. 175B GPT-3)

This was the foundation for ChatGPT and sparked the current wave of AI adoption.

## The InstructGPT Approach: 3 Steps

InstructGPT's RLHF process consists of three key steps:

### Step 1: Supervised Fine-Tuning (SFT)
- Collect high-quality demonstrations of desired behavior
- Human labelers write ideal responses to various prompts
- Fine-tune the pre-trained GPT model on this dataset
- Creates an initial assistant-like model
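
A minimal sketch of how one demonstration can be turned into training targets. It assumes, as one common choice rather than something the steps above specify, that the loss is computed on response tokens only; the token ids are made up and `build_sft_labels` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def build_sft_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; only response positions keep real labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


# Made-up token ids standing in for a tokenized instruction and its demonstration
prompt_ids = [101, 102, 103]
response_ids = [7, 8, 9, 2]

input_ids, labels = build_sft_labels(prompt_ids, response_ids)
print(input_ids)
print(labels)
```

Masking the prompt keeps the gradient focused on what the labeler wrote. Training on the full sequence, as the notebook does later for simplicity, is also a workable choice.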

### Step 2: Reward Model Training
- Generate multiple responses from the SFT model
- Human labelers rank these responses by quality
- Train a reward model to predict human preferences
- The reward model learns what makes a "good" response
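
The ranking step can be sketched in plain Python. A ranking of K responses expands into K(K-1)/2 chosen/rejected pairs, and the reward model is trained so that the chosen response scores higher, using the pairwise loss `-log(sigmoid(r_chosen - r_rejected))` described in the InstructGPT paper. The reward values below are made up, and the helper names are hypothetical.

```python
import math
from itertools import combinations


def pairwise_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    diff = reward_chosen - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def ranking_to_pairs(ranked_responses):
    """Expand a best-first ranking into (chosen, rejected) pairs."""
    return list(combinations(ranked_responses, 2))


ranked = ["best", "middle", "worst"]  # one labeler ranking, best first
pairs = ranking_to_pairs(ranked)

# Made-up scalar outputs of a reward model for each response
rewards = {"best": 1.2, "middle": 0.3, "worst": -0.8}
loss = sum(pairwise_loss(rewards[c], rewards[r]) for c, r in pairs) / len(pairs)

print(pairs)
print(f"mean pairwise loss: {loss:.4f}")
```

The loss is `log(2)` when the two rewards are equal and shrinks toward zero as the chosen response pulls ahead, which is what pushes the model to reproduce the labelers' ordering.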

### Step 3: Reinforcement Learning via PPO
- Use the reward model as the reward function
- Optimize the policy (language model) using Proximal Policy Optimization (PPO)
- Balance between maximizing reward and staying close to the original model
- Iteratively improve the model based on the learned preferences
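
The "staying close to the original model" part is usually implemented as a KL penalty subtracted from the reward. A minimal sketch, assuming per-token log-probabilities of the sampled response are available from both the current policy and the frozen SFT model; the `beta` value and the numbers are illustrative only, and `kl_shaped_reward` is a hypothetical helper.

```python
def kl_shaped_reward(reward, logprobs_policy, logprobs_reference, beta=0.02):
    """reward - beta * sum_t [log pi(y_t) - log pi_ref(y_t)] over the sampled tokens."""
    log_ratio = sum(lp - lr for lp, lr in zip(logprobs_policy, logprobs_reference))
    return reward - beta * log_ratio


# Policy identical to the reference on this sample: no penalty
unchanged = kl_shaped_reward(1.0, [-1.0, -2.0], [-1.0, -2.0])

# Policy now assigns much higher probability to these tokens than the reference did
drifted = kl_shaped_reward(1.0, [-0.5, -0.5], [-1.0, -2.0], beta=0.1)

print(unchanged, drifted)
```

Without this term, the policy can drift toward text that exploits weaknesses in the reward model while no longer resembling fluent language.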

The beauty of this approach is that it scales human judgment: instead of needing humans to write every response, we only need them to compare responses, which is much easier and faster.

## Build: Mini-InstructGPT Implementation

We'll implement a simplified version of InstructGPT using GPT-2 as our base model. While we won't have access to human labelers like OpenAI did, we'll simulate the process using synthetic data to demonstrate the key concepts.

1. **Step 1: SFT** - Fine-tune GPT-2 on instruction-following data
2. **Step 2: Reward Model** - Train a model to predict quality scores
3. **Step 3: PPO Training** - Use RLHF to optimize the model
4. **Evaluation** - Compare responses before and after RLHF

### Setup and Imports

First, let's install the required packages and import what we need.

In [None]:
%pip install transformers datasets torch numpy matplotlib tqdm accelerate

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import (
    GPT2LMHeadModel,
    GPT2Tokenizer,
    GPT2Config,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling
)
from datasets import Dataset, load_dataset
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

### Step 1: Supervised Fine-Tuning (SFT)

First, we'll create a dataset of instruction-response pairs and fine-tune GPT-2 to follow instructions. In the real InstructGPT, this data came from human contractors. We'll use a simplified synthetic dataset.

In [None]:
# Load GPT-2 model and tokenizer
print("Loading GPT-2...")
model_name = "gpt2"  # Using small GPT-2 for demonstration
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Load base model
base_model = GPT2LMHeadModel.from_pretrained(model_name)
base_model = base_model.to(device)

print(f"Model loaded: {sum(p.numel() for p in base_model.parameters()):,} parameters")

# Create instruction-following dataset
# In real InstructGPT, this came from human contractors
instruction_data = [
    {
        "instruction": "Write a haiku about machine learning",
        "response": "Data flows through nets\nPatterns emerge from the void\nMachines learn to think"
    },
    {
        "instruction": "Explain gradient descent in simple terms",
        "response": "Gradient descent is like walking downhill to find the lowest point. You take small steps in the steepest downward direction until you reach the bottom."
    },
    {
        "instruction": "What is the capital of France?",
        "response": "The capital of France is Paris."
    },
    {
        "instruction": "Write a Python function to calculate factorial",
        "response": "def factorial(n):\n    if n <= 1:\n        return 1\n    return n * factorial(n - 1)"
    },
    {
        "instruction": "Summarize the concept of neural networks",
        "response": "Neural networks are computing systems inspired by biological brains. They consist of interconnected nodes (neurons) organized in layers that process information and learn patterns from data."
    }
]

# Format data for training
def format_instruction(example):
    # Format: "### Instruction: {instruction}\n### Response: {response}"
    text = f"### Instruction: {example['instruction']}\n### Response: {example['response']}"
    return {"text": text}

# Create dataset
sft_dataset = Dataset.from_list(instruction_data)
sft_dataset = sft_dataset.map(format_instruction)

print(f"Created SFT dataset with {len(sft_dataset)} examples")
print("\nExample formatted data:")
print(sft_dataset[0]['text'])

In [None]:
# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['text'], padding=True, truncation=True, max_length=128)

tokenized_dataset = sft_dataset.map(tokenize_function, batched=True)
tokenized_dataset.set_format('torch', columns=['input_ids', 'attention_mask'])

# Create data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # GPT-2 is not a masked language model
)

print("Data prepared for training")

In [None]:
# Simple SFT training loop (for demonstration)
# In real InstructGPT, this would be more sophisticated

def train_sft_model(model, dataset, epochs=3):
    model.train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
    
    losses = []
    
    print(f"Training SFT model for {epochs} epochs...")
    
    for epoch in range(epochs):
        epoch_losses = []
        
        for i in range(len(dataset)):
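            # Get tokenized example
            input_ids = dataset[i]['input_ids'].unsqueeze(0).to(device)
            attention_mask = dataset[i]['attention_mask'].unsqueeze(0).to(device)

            # Build labels: padding uses the EOS token, so keep the first pad
            # token as a target (the model learns to stop) and ignore the rest
            labels = input_ids.clone()
            is_pad = attention_mask == 0
            extra_pad = is_pad.clone()
            extra_pad[:, 1:] &= is_pad[:, :-1]
            labels[extra_pad] = -100  # -100 is ignored by the loss

            # Forward pass
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss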
            
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            epoch_losses.append(loss.item())
        
        avg_loss = np.mean(epoch_losses)
        losses.append(avg_loss)
        print(f"Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.4f}")
    
    return model, losses

# Train the SFT model
sft_model = GPT2LMHeadModel.from_pretrained(model_name).to(device)
sft_model, sft_losses = train_sft_model(sft_model, tokenized_dataset, epochs=3)

print("\nSFT training complete!")

In [None]:
# Test the SFT model
def generate_response(model, instruction, max_new_tokens=80):
    prompt = f"### Instruction: {instruction}\n### Response:"
    # Tokenize with an attention mask so generate() does not have to infer it
    inputs = tokenizer(prompt, return_tensors='pt').to(device)
    prompt_length = inputs['input_ids'].shape[1]

    model.eval()  # disable dropout while sampling
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,  # budget for the response only
            num_return_sequences=1,
            temperature=0.7,
            pad_token_id=tokenizer.eos_token_id,
            do_sample=True
        )

    # Decode only the newly generated tokens (everything after the prompt)
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    return response.strip()

# Test with some instructions
test_instructions = [
    "What is deep learning?",
    "Write a short poem about AI"
]

print("Testing SFT model responses:\n")
for instruction in test_instructions:
    print(f"Instruction: {instruction}")
    response = generate_response(sft_model, instruction)
    print(f"Response: {response}\n")

### Step 2: Reward Model Training

Next, we need to train a reward model that can predict which responses humans would prefer. In real InstructGPT, this used comparison data from human labelers. We'll simulate this with a simple scoring function.

In [None]:
# Create a simple reward model
class RewardModel(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model
        self.config = base_model.config
        
        # Add a reward head on top of GPT-2
        self.reward_head = nn.Linear(self.config.n_embd, 1)
        
    def forward(self, input_ids, attention_mask=None):
        # Get hidden states from base model
        outputs = self.base_model.transformer(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True
        )
        
        # Get the last hidden state
        hidden_states = outputs.last_hidden_state
        
        # Pool by taking the mean of all tokens
        if attention_mask is not None:
            mask = attention_mask.unsqueeze(-1).expand(hidden_states.size()).float()
            pooled = torch.sum(hidden_states * mask, dim=1) / torch.sum(mask, dim=1)
        else:
            pooled = hidden_states.mean(dim=1)
        
        # Get reward score
        reward = self.reward_head(pooled)
        return reward

# Initialize reward model
reward_model = RewardModel(GPT2LMHeadModel.from_pretrained(model_name)).to(device)
print("Reward model initialized")

In [None]:
# Simulate preference data
# In real InstructGPT, humans would compare pairs of responses
# We'll use simple heuristics for demonstration

def score_response(response):
    """Simple scoring function to simulate human preferences"""
    score = 0.0
    
    # Prefer shorter responses (conciseness)
    if len(response) < 200:
        score += 0.3
    
    # Prefer responses that are complete sentences
    if response.strip().endswith(('.', '!', '?')):
        score += 0.2
    
    # Penalize very short responses
    if len(response) < 20:
        score -= 0.5
    
    # Penalize repetitive text
    words = response.lower().split()
    if len(words) > 0:
        unique_ratio = len(set(words)) / len(words)
        score += unique_ratio * 0.5
    
    return score

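# Generate comparison data
print("Generating preference data...")
preference_data = []

for instruction in ["Explain what AI is", "Write a function to sort a list"]:
    # Generate multiple responses
    responses = []
    for _ in range(3):
        response = generate_response(sft_model, instruction)
        score = score_response(response)
        responses.append((response, score))

    # Sort by score
    responses.sort(key=lambda x: x[1], reverse=True)

    # Create preference pairs, skipping ties (a tie carries no preference signal)
    for i in range(len(responses)-1):
        if responses[i][1] > responses[i+1][1]:
            preference_data.append({
                "instruction": instruction,
                "chosen": responses[i][0],
                "rejected": responses[i+1][0]
            })

print(f"Created {len(preference_data)} preference pairs")

# Train the reward model on the pairs with a pairwise ranking loss:
# loss = -log(sigmoid(r_chosen - r_rejected))
def encode_for_reward(instruction, response):
    text = f"### Instruction: {instruction}\n### Response: {response}"
    enc = tokenizer(text, return_tensors='pt', truncation=True, max_length=128)
    return {k: v.to(device) for k, v in enc.items()}

def train_reward_model(reward_model, preference_data, epochs=3):
    optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)
    reward_model.train()
    for epoch in range(epochs):
        epoch_losses = []
        for pair in preference_data:
            r_chosen = reward_model(**encode_for_reward(pair["instruction"], pair["chosen"]))
            r_rejected = reward_model(**encode_for_reward(pair["instruction"], pair["rejected"]))
            loss = -F.logsigmoid(r_chosen - r_rejected).mean()

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_losses.append(loss.item())
        if epoch_losses:
            print(f"Epoch {epoch+1}/{epochs}, Reward Loss: {np.mean(epoch_losses):.4f}")
    reward_model.eval()
    return reward_model

reward_model = train_reward_model(reward_model, preference_data)
print("Reward model training complete!")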

### Step 3: PPO Training (Simplified)

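Finally, we fine-tune our model against the reward model. InstructGPT uses Proximal Policy Optimization (PPO) for this step; the loop in this notebook is a heavily simplified policy-gradient stand-in, since full PPO adds a value function, advantage estimates, and a clipped objective.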
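
For reference, the piece that gives PPO its name is the clipped surrogate objective. A minimal single-sample sketch in plain Python follows; the function name and the `clip_eps=0.2` default are illustrative choices, not values taken from the InstructGPT setup.

```python
import math


def clipped_surrogate(logprob_new, logprob_old, advantage, clip_eps=0.2):
    """PPO objective for one action: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    ratio = math.exp(logprob_new - logprob_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# The policy raised the action's probability by 50%, and the action was good:
# the objective is capped at (1 + eps) * A, so there is no incentive to push further
capped = clipped_surrogate(math.log(1.5), 0.0, advantage=1.0)

# The same probability increase on a bad action is not clipped: the full penalty applies
penalized = clipped_surrogate(math.log(1.5), 0.0, advantage=-1.0)

print(capped, penalized)
```

The objective is maximized, so a training loop would minimize its negative, averaged over all sampled tokens.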

In [None]:
# Simplified PPO training
# Real PPO implementation would include:
# - Value function estimation
# - Advantage estimation
# - Clipped surrogate objective
# - KL divergence penalty

def compute_rewards(model, reward_model, instruction, response):
    """Compute reward for a response using the reward model"""
    text = f"### Instruction: {instruction}\n### Response: {response}"
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=128)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    with torch.no_grad():
        reward = reward_model(**inputs)
    
    return reward.item()

# Create PPO model (copy of SFT model)
ppo_model = GPT2LMHeadModel.from_pretrained(model_name).to(device)
ppo_model.load_state_dict(sft_model.state_dict())

print("PPO model initialized from SFT model")

# Simple PPO-style training loop
def train_ppo_simple(model, reward_model, instructions, epochs=2):
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    
    print(f"Training with simplified PPO for {epochs} epochs...")
    
    for epoch in range(epochs):
        total_reward = 0
        
        for instruction in instructions:
            # Generate response
            response = generate_response(model, instruction)
            
            # Compute reward
            reward = compute_rewards(model, reward_model, instruction, response)
            total_reward += reward
            
            # Simple policy gradient (REINFORCE-style) update, very simplified!
            # In real PPO, this would be much more sophisticated
            prompt = f"### Instruction: {instruction}\n### Response: {response}"
            inputs = tokenizer(prompt, return_tensors='pt', truncation=True, max_length=128)
            inputs = {k: v.to(device) for k, v in inputs.items()}

            # outputs.loss is the mean negative log-likelihood (NLL) of the text.
            # Minimizing reward * NLL raises the likelihood of positively rewarded
            # responses and lowers it for negatively rewarded ones.
            outputs = model(**inputs, labels=inputs['input_ids'])
            loss = reward * outputs.loss
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        
        avg_reward = total_reward / len(instructions)
        print(f"Epoch {epoch+1}/{epochs}, Average Reward: {avg_reward:.4f}")
    
    return model

# Train with PPO
training_instructions = [
    "Explain machine learning",
    "What is Python?",
    "How do neural networks work?"
]

ppo_model = train_ppo_simple(ppo_model, reward_model, training_instructions, epochs=2)
print("\nPPO training complete!")

### Evaluation: Comparing Models

Let's compare the responses from our base model, SFT model, and PPO model to see how each stage changes the outputs. With only five SFT examples and a heuristic reward, expect the results to be noisy.
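
One way to summarize such a comparison is a head-to-head win rate, in the spirit of the preference comparisons used to evaluate InstructGPT. The sketch below uses made-up per-prompt scores as a stand-in for human judgments, and `win_rate` is a hypothetical helper.

```python
def win_rate(scores_a, scores_b):
    """Fraction of prompts where model A beats model B; ties count as half a win."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equally sized, non-empty score lists")
    wins = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a, b in zip(scores_a, scores_b)
    )
    return wins / len(scores_a)


# Made-up per-prompt scores for two models on the same three prompts
tuned_scores = [0.9, 0.4, 0.7]
base_scores = [0.5, 0.6, 0.7]

rate = win_rate(tuned_scores, base_scores)
print(f"win rate vs. base: {rate:.2f}")
```

A win rate of 0.5 means the two models are indistinguishable under the chosen scoring, so it is a useful sanity baseline for a toy setup like this one.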

In [None]:
# Compare responses from different models
test_instructions = [
    "Explain quantum computing in simple terms",
    "Write a Python function to reverse a string",
    "What are the benefits of exercise?"
]

print("=== MODEL COMPARISON ===")
print("Comparing Base GPT-2, SFT Model, and PPO Model\n")

for instruction in test_instructions:
    print(f"\nInstruction: {instruction}")
    print("-" * 80)
    
    # Base model response
    base_response = generate_response(base_model, instruction)
    print(f"Base GPT-2: {base_response}")
    
    # SFT model response
    sft_response = generate_response(sft_model, instruction)
    print(f"\nSFT Model: {sft_response}")
    
    # PPO model response
    ppo_response = generate_response(ppo_model, instruction)
    print(f"\nPPO Model: {ppo_response}")
    
    # Score responses
    base_score = score_response(base_response)
    sft_score = score_response(sft_response)
    ppo_score = score_response(ppo_response)
    
    print(f"\nScores - Base: {base_score:.2f}, SFT: {sft_score:.2f}, PPO: {ppo_score:.2f}")
    print("=" * 80)

In [None]:
# Visualize training progress
plt.figure(figsize=(10, 4))

# SFT Loss
plt.subplot(1, 2, 1)
plt.plot(sft_losses, 'b-', linewidth=2)
plt.title('SFT Training Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.grid(True, alpha=0.3)

# Model comparison scores: average heuristic score over the test instructions
# (responses are re-sampled here, so values vary from run to run)
plt.subplot(1, 2, 2)
models = ['Base GPT-2', 'SFT Model', 'PPO Model']
scores = [
    np.mean([score_response(generate_response(m, inst)) for inst in test_instructions])
    for m in (base_model, sft_model, ppo_model)
]
colors = ['gray', 'blue', 'green']
plt.bar(models, scores, color=colors, alpha=0.7)
plt.title('Model Performance Comparison')
plt.ylabel('Average Heuristic Score')
plt.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

for name, score in zip(models, scores):
    print(f"{name}: average heuristic score {score:.2f}")

## Key Takeaways

### What We Implemented

1. **Supervised Fine-Tuning (SFT)**: We fine-tuned GPT-2 on instruction-following examples, teaching it the basic format of responding to instructions.

2. **Reward Model**: We created a model that predicts response quality based on simulated human preferences.

3. **PPO Training**: We used a heavily simplified, REINFORCE-style policy-gradient update as a stand-in for PPO to optimize the model based on the reward signal.

### Real InstructGPT Differences

The real InstructGPT implementation is much more sophisticated:

- **Human Labelers**: OpenAI hired a team of about 40 contractors to write demonstrations and rank model outputs
- **Scale**: They used much larger models (1.3B to 175B parameters) and datasets
- **PPO Complexity**: Full PPO includes a learned value function, advantage estimation, and a clipped objective, and the RLHF setup adds a KL penalty that keeps the policy close to the SFT model
- **Iterative Process**: Multiple rounds of data collection and training
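
To give a flavor of the advantage estimation mentioned above, here is a minimal sketch of Generalized Advantage Estimation (GAE), a common choice in PPO implementations. Whether InstructGPT used exactly these settings is not stated here; the `gamma` and `lam` defaults and the inputs are illustrative.

```python
def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one episode.

    `values` holds one more entry than `rewards`: the value estimate for each
    step plus a final bootstrap value (0.0 when the episode has ended).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at step t
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# A three-token response where the reward arrives only at the last token
rewards = [0.0, 0.0, 1.0]
values = [0.0, 0.0, 0.0, 0.0]  # an untrained value function predicting zero

print(compute_gae(rewards, values, gamma=1.0, lam=1.0))
print(compute_gae(rewards, values, gamma=1.0, lam=0.0))
```

With `lam=1.0` every token receives credit for the final reward, while `lam=0.0` credits only the step where the reward arrived. Values in between trade variance against bias.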

### Why RLHF Matters

1. **Alignment**: RLHF aligns model behavior with human preferences
2. **Efficiency**: Smaller RLHF models can be preferred by human raters over much larger base models
3. **Scalability**: Human feedback can be collected and incorporated continuously
4. **Safety**: Helps reduce harmful or unwanted outputs

### The Impact

InstructGPT's success with RLHF led directly to ChatGPT, which has transformed how millions interact with AI. The technique showed that the key to useful AI assistants isn't just scale, but alignment with human intent through feedback.

RLHF continues to be a critical component in modern LLMs, with ongoing research into more efficient and effective ways to incorporate human feedback into model training.

## Conclusion

We've implemented a simplified version of InstructGPT's RLHF pipeline:

- ✅ Fine-tuned a base model to follow instructions (SFT)
- ✅ Created a reward model to predict human preferences
- ✅ Used reinforcement learning to optimize for higher rewards
- ✅ Compared base, SFT, and PPO outputs using a heuristic instruction-following score

While our implementation is simplified, it captures the essence of how InstructGPT works. The real breakthrough was showing that with the right training approach, even smaller models can be incredibly useful assistants.

This RLHF approach has become the foundation for most modern AI assistants, proving that alignment with human values is just as important as raw capability.

🎉 **Congratulations!** You've successfully implemented a mini version of InstructGPT and understand the key concepts behind RLHF!