# Notebook 02: RLHF — Reinforcement Learning from Human Feedback

## Learning Objectives
- Understand the 3-stage RLHF pipeline
- Implement a simple reward model from scratch
- Implement a simplified PPO training loop
- Understand reward hacking, KL penalty, and advantage estimation

---

## The RLHF Pipeline

RLHF proceeds in three stages:

1. **Supervised fine-tuning (SFT):** fine-tune a pretrained model on demonstrations to obtain the policy π_SFT.
2. **Reward model:** train a model on human preference comparisons so that it outputs a scalar score.
3. **PPO:** optimize the policy against the reward model, with a KL penalty that keeps it close to π_SFT.



### Why the KL penalty?

Without a KL penalty, the policy can exploit flaws in the reward model, a failure known as **reward hacking** (high reward, low quality). For example, it may:
- Generate very long outputs (if reward correlates with length)
- Generate gibberish that tricks the reward model

The KL penalty keeps the policy close to π_SFT, which plays the role of the reference policy π_ref in the PPO objective.
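
To make the penalty concrete, one common formulation subtracts β times a per-sample KL estimate from the reward-model score, where the estimate is the summed difference between the policy's and the reference model's token log-probabilities. The sketch below is a dependency-free illustration under that assumption. The function name `kl_penalized_reward`, the value β = 0.1, and all the numbers are hypothetical and do not come from this notebook's `src` package.

```python
def kl_penalized_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Return (reward-model score minus beta * KL estimate, KL estimate).

    logp_policy / logp_ref are per-token log-probabilities of the same sampled
    completion under the current policy and the frozen reference (SFT) model.
    """
    if len(logp_policy) != len(logp_ref):
        raise ValueError("log-prob sequences must have the same length")
    # Single-sample Monte Carlo estimate of KL(pi_theta || pi_ref):
    # sum over tokens of log pi_theta(token) - log pi_ref(token).
    kl_estimate = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return rm_score - beta * kl_estimate, kl_estimate


# Hypothetical numbers. The second policy has drifted: it scores higher with
# the reward model but assigns its sample far more probability than the
# reference model does.
close_total, close_kl = kl_penalized_reward(0.80, [-1.0, -1.2, -0.9], [-1.1, -1.2, -1.0])
drift_total, drift_kl = kl_penalized_reward(0.95, [-0.1, -0.2, -0.1], [-3.0, -2.5, -4.0])
print(f"close to reference: KL~{close_kl:.2f}, penalized reward={close_total:.3f}")
print(f"drifted policy:     KL~{drift_kl:.2f}, penalized reward={drift_total:.3f}")
```

With these illustrative values, the drifted policy earns the higher raw score but the lower penalized reward, which is the behaviour the penalty is meant to produce.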

### PPO Objective

$$\mathcal{L}_{\text{PPO}} = \mathbb{E}\left[\min\left(r_t A_t,\ \text{clip}(r_t, 1{-}\epsilon, 1{+}\epsilon) A_t\right)\right] - \beta \cdot \text{KL}(\pi_\theta \| \pi_{\text{ref}})$$

where $r_t = \pi_\theta(a_t|s_t) / \pi_{\text{old}}(a_t|s_t)$
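
The clipped term can be sketched per sample in plain Python. This is an illustration under stated assumptions, not the trainer used in this notebook: the names `ppo_clipped_term` and `ppo_objective`, the defaults ε = 0.2 and β = 0.1, and the assumption that a KL estimate is supplied by the caller are all hypothetical. As written, the objective is a quantity to maximize, so an optimizer that minimizes would use its negative.

```python
import math


def ppo_clipped_term(logp_new, logp_old, advantage, epsilon=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(logp_new - logp_old)  # r_t = pi_theta / pi_old
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_objective(logp_new, logp_old, advantages, kl_estimate, beta=0.1, epsilon=0.2):
    """Batch mean of the clipped term minus beta * KL (a quantity to maximize)."""
    terms = [
        ppo_clipped_term(n, o, a, epsilon)
        for n, o, a in zip(logp_new, logp_old, advantages)
    ]
    return sum(terms) / len(terms) - beta * kl_estimate


# Ratio 1.5 with a positive advantage: the gain is capped at (1 + eps) * A.
print(ppo_clipped_term(math.log(1.5), 0.0, advantage=1.0))
# Ratio 1.5 with a negative advantage: the min keeps the unclipped, worse value.
print(ppo_clipped_term(math.log(1.5), 0.0, advantage=-1.0))
```

Taking the minimum makes the bound pessimistic: moving the ratio further in a direction that already helps earns no extra credit once it leaves the clip range, while a harmful move is never softened.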

In [None]:
# !pip install torch transformers tqdm matplotlib

In [None]:
import sys
sys.path.insert(0, "..")
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
print(f"PyTorch: {torch.__version__}")
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")

## Step 1: Build a Simple Reward Model

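The reward model takes text and outputs a scalar score.
We use the **Bradley-Terry** model, where $R(x, y)$ is a scalar preference score for completion $y$ given prompt $x$.

Training loss: $\mathcal{L}_{\text{RM}} = -\log \sigma(R(x, y_w) - R(x, y_l))$, where $y_w$ is the chosen (preferred) completion and $y_l$ is the rejected one.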
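
For a single preference pair, the loss depends only on the margin between the chosen and rejected scores. The sketch below computes it on plain floats in a numerically stable way. The name `bradley_terry_loss_scalar` is hypothetical and is separate from the `bradley_terry_loss` imported from `src.training.reward_model`, whose implementation is not shown here.

```python
import math


def bradley_terry_loss_scalar(reward_chosen, reward_rejected):
    """-log sigmoid(R_w - R_l), computed stably as softplus(-(R_w - R_l))."""
    z = -(reward_chosen - reward_rejected)
    # softplus(z) = log(1 + exp(z)) = max(z, 0) + log1p(exp(-|z|)),
    # which avoids overflow when |z| is large.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# The loss shrinks as the chosen completion outscores the rejected one.
for r_w, r_l in [(0.0, 2.0), (0.0, 0.0), (2.0, 0.0)]:
    print(f"R_w={r_w:+.1f}, R_l={r_l:+.1f} -> loss={bradley_terry_loss_scalar(r_w, r_l):.4f}")
```

At zero margin the loss equals log 2, which corresponds to the model treating both completions as equally likely to be preferred.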

In [None]:
from src.training.reward_model import RewardModel, bradley_terry_loss
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# GPT-2 style tokenizers define no pad token, so reuse EOS for padding
tokenizer.pad_token = tokenizer.eos_token

# For reward model we need hidden states — use AutoModel (no LM head)
# In practice: fine-tune an already SFT-trained model as the backbone
print("Reward model uses: transformer backbone + linear scalar head")
print("Training loss: L = -log σ(R(y_chosen) - R(y_rejected))")

## Step 2: Preference Data

For the reward model, we need (prompt, chosen, rejected) triples.
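
As a rough picture of such a triple, the sketch below stores each one as a dictionary and checks it before use. The `chosen` and `rejected` keys match those used in this notebook's code, while the `prompt` key, the example texts, and the helper `validate_pairs` are hypothetical. The actual output format of `create_synthetic_preferences` may differ.

```python
REQUIRED_KEYS = ("prompt", "chosen", "rejected")

# One hand-written, purely illustrative triple.
example_pairs = [
    {
        "prompt": "Explain what a KL penalty does in RLHF.",
        "chosen": "It discourages the policy from drifting far from the SFT model.",
        "rejected": "KL KL KL penalty penalty.",
    },
]


def validate_pairs(pairs):
    """Return the pairs unchanged after checking keys and that chosen != rejected."""
    for i, pair in enumerate(pairs):
        missing = [key for key in REQUIRED_KEYS if key not in pair]
        if missing:
            raise ValueError(f"pair {i} is missing keys: {missing}")
        if pair["chosen"] == pair["rejected"]:
            # Identical texts carry no preference signal for the reward model.
            raise ValueError(f"pair {i} has identical chosen and rejected texts")
    return pairs


print(f"Validated {len(validate_pairs(example_pairs))} pair(s)")
```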

In [None]:
from src.data.preference_data import create_synthetic_preferences

pairs = create_synthetic_preferences()
print(f"Preference pairs: {len(pairs)}")
print("\nExample:")
p = pairs[0]
print(f"  Chosen:   {p['chosen'][:80]}...")
print(f"  Rejected: {p['rejected'][:80]}...")

## Step 3: Demonstrate Advantage Estimation

GAE (Generalized Advantage Estimation) computes $A_t = \sum_k (\gamma\lambda)^k \delta_{t+k}$, where $\delta_t = R_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD residual.

For text generation (single step), this simplifies to $A_t = R_t - V(s_t)$.
We approximate $V(s_t) \approx \text{mean}(R)$ for simplicity.
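
Both forms can be sketched in plain Python. `gae_advantages` implements the general backward recursion, and `mean_baseline_advantages` implements the single-step shortcut with a mean baseline. Both names are hypothetical and independent of `compute_advantages` from `src.training.rlhf_trainer`. Dividing by the sample standard deviation is an assumption about that helper, suggested by the "≈ 1" standard deviation it reports.

```python
import math


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE for one trajectory; values has len(rewards) + 1 entries (last = bootstrap)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual: delta_t = R_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_t = delta_t + gamma * lambda * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def mean_baseline_advantages(rewards, eps=1e-8):
    """Single-step case: A = R - mean(R), divided by the sample std (needs >= 2 rewards)."""
    n = len(rewards)
    mean = sum(rewards) / n
    centered = [r - mean for r in rewards]
    std = math.sqrt(sum(c * c for c in centered) / (n - 1))
    return [c / (std + eps) for c in centered]


# Single step with a terminal state (bootstrap value 0): A = R - V(s).
print(gae_advantages([0.7], [0.5, 0.0]))
print([round(a, 3) for a in mean_baseline_advantages([0.2, 0.8, 0.1, 0.9])])
```

The backward loop avoids recomputing the discounted sum for every step, since each advantage reuses the one computed for the following step.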

In [None]:
from src.training.rlhf_trainer import compute_advantages

# Simulated rewards for 8 completions
rewards = [0.2, 0.8, 0.1, 0.9, 0.3, 0.7, 0.4, 0.6]
advantages = compute_advantages(rewards)

print("Rewards:    ", [round(r, 2) for r in rewards])
print("Advantages: ", [round(a.item(), 3) for a in advantages])
print(f"Mean: {advantages.mean():.6f} (≈ 0)")
print(f"Std:  {advantages.std():.6f} (≈ 1)")

## Step 4: Visualize Reward Hacking

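The curves below are synthetic: they are generated from hand-picked formulas plus noise, not measured from a real training run. They illustrate the pattern typically associated with training with and without a KL penalty.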

In [None]:
import numpy as np

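# Simulate training with and without KL penalty (synthetic curves, not real runs)
np.random.seed(0)  # fixed seed so the plot is reproducible
steps = list(range(1, 101))

# Without KL: reward keeps rising while quality (real accuracy) peaks early, then declines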
reward_no_kl  = [0.3 + 0.6*(1 - np.exp(-s/20)) + np.random.randn()*0.05 for s in steps]
quality_no_kl = [0.4 + 0.2*(1 - np.exp(-s/20)) - 0.003*s + np.random.randn()*0.03 for s in steps]

# With KL: both increase together
reward_kl     = [0.3 + 0.4*(1 - np.exp(-s/30)) + np.random.randn()*0.05 for s in steps]
quality_kl    = [0.4 + 0.35*(1 - np.exp(-s/30)) + np.random.randn()*0.03 for s in steps]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.plot(steps, reward_no_kl, label="Reward model score", color="red")
ax1.plot(steps, quality_no_kl, label="Real quality", color="blue", linestyle="--")
ax1.set_title("Without KL Penalty — Reward Hacking!", fontsize=12)
ax1.set_xlabel("Training Step"); ax1.set_ylabel("Score")
ax1.legend(); ax1.grid(alpha=0.3)

ax2.plot(steps, reward_kl, label="Reward model score", color="red")
ax2.plot(steps, quality_kl, label="Real quality", color="blue", linestyle="--")
ax2.set_title("With KL Penalty — Stable Training", fontsize=12)
ax2.set_xlabel("Training Step"); ax2.set_ylabel("Score")
ax2.legend(); ax2.grid(alpha=0.3)

plt.tight_layout()
plt.show()

## Summary

**RLHF in 3 stages:**
1. **SFT:** Teach basic instruction following
2. **Reward Model:** Encode human preferences as a scalar signal
3. **PPO:** Optimize for reward while a KL penalty keeps the policy close to the SFT model

**Key takeaways:**
- The KL penalty limits reward hacking by penalizing drift away from the SFT model
- The simplified advantage estimate used here centers rewards on a mean baseline and normalizes them
- PPO clipping prevents large, destabilizing policy updates

---

## Exercises

1. **Reward model:** Implement a neural reward model using a distilgpt2 backbone
2. **KL ablation:** Vary β from 0 to 1 and plot the effect on generation diversity
3. **Reward hacking demo:** Design a reward function that can be easily hacked
4. **Advantage variance:** Compare vanilla returns vs GAE advantages
5. **Extension:** Implement full PPO with a critic network