# Notebook 04: GRPO — Group Relative Policy Optimization

## Learning Objectives
- Understand GRPO as a PPO alternative that does not need a learned value function
- Implement group sampling and advantage normalization
- Understand the connection to DeepSeek-R1
- Compare GRPO, PPO, and DPO

## GRPO Algorithm

For each training step:
1. Sample $G$ completions $\{y_1, \ldots, y_G\}$ from $\pi_{\text{old}}$ for prompt $x$
2. Score: $r_i = \text{Reward}(x, y_i)$ (binary for math: 1 if correct, 0 if wrong)
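3. **Group normalize:** $\hat{A}_i = \frac{r_i - \mu_r}{\sigma_r + \delta}$, where $\mu_r$ and $\sigma_r$ are the mean and standard deviation of the group's rewards and $\delta$ is a small constant for numerical stability (written $\delta$ to keep it distinct from the clipping range $\epsilon$ in step 4)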
4. Update with clipped surrogate:

$$\mathcal{L}_{\text{GRPO}} = -\frac{1}{G}\sum_{i=1}^G \min\!\left(\rho_i \hat{A}_i,\ \text{clip}(\rho_i, 1{-}\epsilon, 1{+}\epsilon)\hat{A}_i\right) + \beta \cdot \text{KL}(\pi_\theta \| \pi_{\text{ref}})$$

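where $\rho_i = \pi_\theta(y_i|x) / \pi_{\text{old}}(y_i|x)$ is the importance ratio, $\epsilon$ in the clip term is the clipping range, and $\beta$ weights the KL penalty against the frozen reference policy $\pi_{\text{ref}}$.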

**Key insight:** The group mean $\mu_r$ acts as the baseline, so group normalization replaces the learned value function (critic) that PPO needs.
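The four steps above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the code in `src.training.grpo_trainer`: the names `sketch_group_advantages` and `sketch_grpo_loss` are hypothetical, the default `beta=0.04` and `clip_eps=0.2` are placeholder values, and the KL term uses the non-negative estimator $r - \log r - 1$ with $r = \pi_{\text{ref}}/\pi_\theta$, which is one possible choice among several.

```python
import torch

def sketch_group_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Hypothetical helper: standardize rewards within a single group.
    return (rewards - rewards.mean()) / (rewards.std(unbiased=False) + eps)

def sketch_grpo_loss(logp_new, logp_old, logp_ref, advantages,
                     clip_eps: float = 0.2, beta: float = 0.04) -> torch.Tensor:
    # Sequence-level log-probabilities, one entry per completion in the group.
    ratio = torch.exp(logp_new - logp_old)                      # rho_i
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    surrogate = torch.min(unclipped, clipped).mean()
    # Assumed KL estimator: r - log r - 1 with r = pi_ref / pi_theta (always >= 0).
    log_r = logp_ref - logp_new
    kl = (torch.exp(log_r) - log_r - 1).mean()
    return -surrogate + beta * kl

# Toy group of 8 completions with binary rewards and made-up log-probabilities.
rewards = torch.tensor([0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0])
adv = sketch_group_advantages(rewards)
logp_old = torch.full((8,), -12.0)
logp_ref = logp_old.clone()
logp_new = logp_old.clone().requires_grad_(True)

loss = sketch_grpo_loss(logp_new, logp_old, logp_ref, adv)
loss.backward()
print('loss:', loss.item())
print('grad:', logp_new.grad.tolist())
```

The gradient is negative for correct completions and positive for wrong ones, so a gradient-descent step raises the log-probability of the completions that beat the group mean.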

In [None]:
# !pip install torch matplotlib numpy

In [None]:
import sys
sys.path.insert(0, '..')
import torch
import numpy as np
import matplotlib.pyplot as plt
print('Ready!')

## Step 1: Group Advantage Normalization

In [None]:
from src.training.grpo_trainer import compute_group_advantages, grpo_loss
print('GRPO functions imported')

# Example: 8 completions, binary rewards (correct/wrong)
rewards = torch.tensor([0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0])
advantages = compute_group_advantages(rewards)

print('Group rewards:    ', rewards.tolist())
print('Group advantages: ', [round(a, 3) for a in advantages.tolist()])
print(f'Mean: {advantages.mean():.6f} (should be 0)')
print(f'Std:  {advantages.std():.6f} (should be ~1)')
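To check the numbers by hand: with 3 correct completions out of 8, the group mean is $3/8 = 0.375$. The sketch below recomputes the advantages with plain tensor operations under both standard deviation conventions. It does not assume which convention `compute_group_advantages` uses; because `torch.Tensor.std()` applies Bessel's correction by default, the printed `Std` above is exactly 1 only if the helper normalizes by the sample standard deviation. The last lines show the degenerate case of a group in which every completion receives the same reward.

```python
import torch

rewards = torch.tensor([0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0])
mu = rewards.mean()

for unbiased in (False, True):
    # unbiased=False: population std (divide by G); True: sample std (divide by G-1)
    sigma = rewards.std(unbiased=unbiased)
    adv = (rewards - mu) / (sigma + 1e-8)
    print(f'unbiased={unbiased}: sigma={sigma.item():.4f}, '
          f'adv(correct)={adv[1].item():.3f}, adv(wrong)={adv[0].item():.3f}, '
          f'adv.std()={adv.std().item():.3f}')

# Degenerate group: identical rewards give zero advantages, hence no learning signal.
same = torch.ones(8)
adv_same = (same - same.mean()) / (same.std(unbiased=False) + 1e-8)
print('all-correct group advantages:', adv_same.tolist())
```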

## Step 2: Visualize Group Normalization Effect

In [None]:
np.random.seed(42)
group_sizes = [2, 4, 8, 16, 32]
accuracies = [0.2, 0.3, 0.5, 0.4, 0.6]  # Fraction of correct in each group

fig, axes = plt.subplots(1, len(group_sizes), figsize=(16, 3))
for ax, G, acc in zip(axes, group_sizes, accuracies):
    # int() truncates, so the realized accuracy can fall below the target
    # (e.g. G=2, acc=0.2 gives 0 correct, a zero-variance group)
    n_correct = int(G * acc)
    rewards = torch.tensor([1.0]*n_correct + [0.0]*(G-n_correct))
    advs = compute_group_advantages(rewards)
    colors = ['green' if r == 1 else 'red' for r in rewards.tolist()]
    ax.bar(range(G), advs.tolist(), color=colors)
    # Report the realized count rather than the target fraction
    ax.set_title(f'G={G}, correct={n_correct}/{G}')
    ax.set_xlabel('Completion')
    ax.axhline(0, color='black', linewidth=0.8)
plt.suptitle('Group Advantages for Different Group Sizes', fontsize=12)
plt.tight_layout()
plt.show()

## Step 3: Compare GRPO vs PPO vs DPO

| | PPO | DPO | GRPO |
|--|-----|-----|------|
| **Type** | On-policy RL | Offline preference | On-policy RL |
| **Value fn** | Yes (critic) | No | No (group norm) |
| **Data format** | Reward signal | Preference pairs | Reward signal |
| **Memory** | High (2 models + critic) | Medium (2 models) | Medium (2 models) |
| **Stability** | Moderate | High | High |
| **Used in** | InstructGPT | Llama, Mistral | DeepSeek-R1 |

In [None]:
# Illustrative synthetic curve (not real training output): mean group reward rising as the model improves
np.random.seed(0)
steps = list(range(1, 81))
mean_rewards = [0.3 + 0.5*(1-np.exp(-s/25)) + np.random.randn()*0.04 for s in steps]

plt.figure(figsize=(10, 4))
plt.plot(steps, mean_rewards, 'b-', linewidth=2, label='Mean group reward')
plt.axhline(0.3, color='gray', linestyle='--', label='Initial accuracy')
plt.fill_between(steps, [r-0.05 for r in mean_rewards], [r+0.05 for r in mean_rewards], alpha=0.2)
plt.xlabel('Training Step'); plt.ylabel('Mean Reward')
plt.title('GRPO Training: Mean Group Reward Over Time')
plt.legend(); plt.grid(alpha=0.3)
plt.show()

## DeepSeek-R1 Connection

DeepSeek-R1 uses GRPO with:
- **Group sampling:** multiple completions per problem (the examples in this notebook use $G=8$)
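- **Rule-based reward:** 1 if the final answer is correct (verified programmatically), 0 otherwise; the paper also adds a format reward
- **Long chains:** Only the final answer is rewarded, so the model is free to learn long reasoning chains whenever they improve accuracy
- **"Aha moment":** DeepSeek-R1-Zero, the variant trained with RL and no SFT warmup, learns on its own to reflect on and revise its reasoning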
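A programmatically verified binary reward can be as simple as extracting the final answer and comparing it with a reference. The sketch below assumes completions mark their final answer with `\boxed{...}` and that exact string match is good enough. Both are illustrative assumptions, the function names are hypothetical, and a real verifier would normalize mathematically equivalent answers.

```python
import re

def extract_final_answer(completion: str):
    # Return the content of the last \boxed{...} (no nested braces), or None.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None

def binary_reward(completion: str, reference: str) -> float:
    # 1.0 only if a final answer was found and matches the reference exactly.
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == reference.strip() else 0.0

group = [
    r"2 + 2 = 4, so the answer is \boxed{4}",
    r"I think it is \boxed{5}",
    "The answer is 4",  # correct value, but not in the expected format
]
print([binary_reward(y, "4") for y in group])
```

Because the reasoning before the boxed answer never enters the reward, the policy is free to use as many intermediate steps as it finds helpful.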

---

## Exercises

1. **Group size ablation:** Try G=2, 4, 8, 16. How does advantage variance change?
2. **Reward sparsity:** Binary vs continuous rewards. Which is more stable?
3. **Temperature effect:** How does sampling temperature affect group diversity?
4. **Clipping effect:** Try epsilon=0.1, 0.2, 0.5. What's the effect on training stability?
5. **Extension:** Implement GRPO for a simple bandit problem (no LLM needed)
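
A possible starting point for exercise 5, offered as a sketch rather than a reference solution: a three-armed Bernoulli bandit with a softmax policy over logits. The arm probabilities, learning rate, group size, and number of inner steps are arbitrary choices, and the KL penalty is omitted for brevity. Each "completion" is a single arm pull, so the log-probability of an action plays the role of $\log \pi_\theta(y_i|x)$.

```python
import torch

torch.manual_seed(0)
true_probs = torch.tensor([0.2, 0.5, 0.8])    # hypothetical success rate per arm
logits = torch.zeros(3, requires_grad=True)   # policy parameters
opt = torch.optim.SGD([logits], lr=0.5)
G, clip_eps, inner_steps = 16, 0.2, 2

for step in range(200):
    # Sample a group from the current ("old") policy and score it.
    with torch.no_grad():
        old_logp_all = torch.log_softmax(logits, dim=0)
        actions = torch.multinomial(old_logp_all.exp(), G, replacement=True)
        rewards = torch.bernoulli(true_probs[actions])
        adv = (rewards - rewards.mean()) / (rewards.std(unbiased=False) + 1e-8)
        old_logp = old_logp_all[actions]

    # A few clipped-surrogate updates on the same group.
    for _ in range(inner_steps):
        logp = torch.log_softmax(logits, dim=0)[actions]
        ratio = torch.exp(logp - old_logp)
        surrogate = torch.min(
            ratio * adv,
            torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv,
        )
        loss = -surrogate.mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

final_probs = torch.softmax(logits.detach(), dim=0)
print('Final arm probabilities:', [round(p, 3) for p in final_probs.tolist()])
```

From here, exercises 1 and 4 amount to sweeping `G` and `clip_eps` and comparing how quickly probability mass moves to the best arm.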