# Notebook 06: Reward Modeling

## Learning Objectives
- Build a reward model from scratch using the Bradley-Terry preference loss
- Understand the difference between process reward models (PRM) and outcome reward models (ORM)
- Train on preference data
- Evaluate reward model quality

## Theory

A reward model $R_\phi(x, y) \in \mathbb{R}$ maps a prompt $x$ and a response $y$ to a scalar score of response quality, where $\phi$ denotes the model parameters.

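**Bradley-Terry loss:** for a preference pair in which $y_w$ is the chosen response and $y_l$ the rejected one,
$$\mathcal{L}_{\text{RM}} = -\log \sigma(R_\phi(x, y_w) - R_\phi(x, y_l))$$
where $\sigma$ is the logistic sigmoid. The loss is typically averaged over a batch of pairs.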
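
The formula above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the implementation in `src.training.reward_model`. The name `bt_loss_sketch` is hypothetical, and averaging over the batch is an assumption.

```python
import torch
import torch.nn.functional as F

def bt_loss_sketch(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Mean Bradley-Terry loss over a batch of preference pairs (hypothetical helper)."""
    # logsigmoid is numerically more stable than log(sigmoid(.)) for large gaps
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Equal rewards give a loss of log(2).
tied = bt_loss_sketch(torch.tensor([1.0]), torch.tensor([1.0]))
# A mis-ranked pair (chosen scored below rejected) is penalised more than a tie.
misranked = bt_loss_sketch(torch.tensor([0.0]), torch.tensor([2.0]))
print(f"tied: {tied.item():.4f}, mis-ranked: {misranked.item():.4f}")
```

Only the difference between the two rewards enters the loss, so adding a constant to every reward leaves it unchanged.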

Architecture: a transformer backbone followed by a linear head that projects a hidden state (typically that of the final token) to a single scalar.
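
To make the backbone-plus-head idea concrete, here is a small self-contained sketch. It does not load a pretrained model: a tiny `nn.TransformerEncoder` without positional embeddings stands in for the backbone. The class name `TinyRewardModel`, all sizes, and the choice to pool the last non-padded token are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    """Hypothetical stand-in: small Transformer encoder + linear scalar head."""

    def __init__(self, vocab_size: int = 1000, d_model: int = 64,
                 n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=128, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(
            layer, num_layers=n_layers, enable_nested_tensor=False
        )
        self.head = nn.Linear(d_model, 1)  # scalar reward head

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # True marks padding positions that attention should ignore
        pad_mask = attention_mask == 0
        hidden = self.backbone(self.embed(input_ids), src_key_padding_mask=pad_mask)
        # Read the reward off the last non-padded token (assumes right padding)
        last = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last]
        return self.head(pooled).squeeze(-1)  # shape: (batch,)

torch.manual_seed(0)
model = TinyRewardModel().eval()
input_ids = torch.randint(0, 1000, (2, 8))
attention_mask = torch.ones(2, 8, dtype=torch.long)
attention_mask[1, 5:] = 0  # second sequence ends with 3 padding tokens
with torch.no_grad():
    rewards = model(input_ids, attention_mask)
print(rewards.shape)
```

In practice the encoder would be replaced by a pretrained language model, with the same kind of linear head applied to its final hidden states.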

**PRM vs ORM:**
- ORM (outcome reward model): a single reward at the end of the sequence
- PRM (process reward model): a reward for each reasoning step (Lightman et al., 2023)

In [None]:
# !pip install torch transformers tqdm matplotlib

In [None]:
import sys
sys.path.insert(0, '..')
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
print('Ready!')

## Step 1: Understand the Bradley-Terry Loss

In [None]:
from src.training.reward_model import bradley_terry_loss

# Synthetic demo
reward_chosen   = torch.tensor([2.0, 1.5, 2.5, 1.0])
reward_rejected = torch.tensor([0.5, 0.3, 1.0, 0.8])
loss = bradley_terry_loss(reward_chosen, reward_rejected)
print(f'Reward chosen:   {reward_chosen.tolist()}')
print(f'Reward rejected: {reward_rejected.tolist()}')
print(f'BT Loss: {loss.item():.4f}')
print(f'Accuracy (chosen > rejected): {(reward_chosen > reward_rejected).float().mean():.2%}')

## Step 2: Process Reward Model (Step-Level Signals)

In [None]:
from src.credit_assignment.process_reward import StepRewardModel, assign_step_rewards

solution = ('Step 1: The store starts with 45 apples.\n'
            'Step 2: It sells 18 apples. So 45 - 18 = 27.\n'
            'Step 3: The remaining count is 27.\nThe answer is: 27')

prm = StepRewardModel(use_heuristic=True)
scored = prm.score_solution(solution, ground_truth=27.0)
print('PRM Step-Level Scores:')
for step, reward in scored:
    print(f'  [{reward:.2f}] {step[:60]}')

# Note: decay is not passed below, so this label assumes the function's default is 0.9
print('\nInterpolated rewards (outcome=1.0, decay=0.9):')
for step, reward in assign_step_rewards(solution, outcome_reward=1.0):
    print(f'  [{reward:.3f}] {step[:60]}')
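
One plausible way to interpolate an outcome reward across steps is geometric decay backward from the final step. The sketch below assumes that step t of T receives `outcome_reward * decay**(T - t)`. This is an assumption for illustration and may differ from what `assign_step_rewards` actually does. The helper name `discounted_step_rewards` is hypothetical.

```python
def discounted_step_rewards(solution: str, outcome_reward: float, decay: float = 0.9):
    """Hypothetical helper: step t of T gets outcome_reward * decay**(T - t)."""
    # Treat every non-empty line as one step (an assumption about the format)
    steps = [line.strip() for line in solution.splitlines() if line.strip()]
    n = len(steps)
    return [(step, outcome_reward * decay ** (n - 1 - i)) for i, step in enumerate(steps)]

demo = "Step 1: 45 apples.\nStep 2: 45 - 18 = 27.\nThe answer is: 27"
for step, reward in discounted_step_rewards(demo, outcome_reward=1.0):
    print(f"  [{reward:.3f}] {step}")
```

Under this scheme the final step receives the full outcome reward, and with `decay=1.0` every step receives it equally.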

## Step 3: Visualize PRM vs ORM

In [None]:
import numpy as np
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

steps = [1, 2, 3, 4, 5]
# ORM: a single signal at the final step
orm_signal = [0, 0, 0, 0, 1]
# PRM: a signal at every step (illustrative values)
prm_signal = [0.59, 0.66, 0.73, 0.81, 1.0]
ax1.bar(steps, orm_signal, color='steelblue', alpha=0.7, label='ORM')
ax1.set_title('ORM: Single Terminal Reward')
ax1.set_xlabel('Step'); ax1.set_ylabel('Reward'); ax1.legend()
ax2.bar(steps, prm_signal, color='green', alpha=0.7, label='PRM')
ax2.set_title('PRM: Per-Step Reward')
ax2.set_xlabel('Step'); ax2.set_ylabel('Reward'); ax2.legend()
plt.suptitle('ORM vs PRM Signal Density', fontsize=13)
plt.tight_layout(); plt.show()

---

## Exercises

1. **Bradley-Terry demo:** Show that the BT loss approaches 0 as the reward gap $R_\phi(x, y_w) - R_\phi(x, y_l)$ grows.
2. **PRM interpolation:** Try decay=0.5, 0.8, 0.9, and 1.0. How does each value affect the step signals?
3. **Wrong solution:** Score a wrong solution with the PRM. At which step do its scores first diverge from those of the correct solution?
4. **Extension:** Implement a neural PRM backbone using distilgpt2.
5. **Calibration:** How can you tell whether a reward model is well calibrated?