# Math PPO Colab Walkthrough - raju

This notebook mirrors the math RLHF pipeline and runs comfortably on Google Colab. Run the cells in order to clone the repository, install dependencies, and launch PPO fine-tuning with the math reward model. **Run the NumPy pin/restart cell once before anything else.**



In [None]:
# ---- Run once before continuing ----
# Pins NumPy 2.1.3, then forces a runtime restart (required for dtype compatibility).
%pip install --upgrade --force-reinstall --no-cache-dir "numpy==2.1.3"

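import os

# Print without clearing the output so the message stays visible after the restart.
print("Restarting runtime to load NumPy 2.1.3 ...")
os.kill(os.getpid(), 9)  # hard-kills the kernel; Colab restarts it automatically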



In [None]:
# Install commands are provided in Section 3 after cloning the repository.



## 1. Mount Google Drive (optional)

If your SFT and reward checkpoints live on Drive (as in `RHRL_PPO.ipynb`), mount it first. Skip this step if the checkpoints are accessible locally.



In [None]:
from google.colab import drive
drive.mount("/content/drive")



## 2. Clone the repository

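If you forked the project, replace the URL in the `git clone` cell with your fork (e.g. `https://github.com/<username>/ppo_from_scratch.git`). The first cell in this section is optional: it downloads the SFT policy checkpoint into Drive, so it needs the mount from Section 1.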



In [None]:
%%bash
# Optional: download SFT policy checkpoint directly into Drive
# (%%bash must be the first line of the cell, so this comment sits below it)
set -e
python -m pip install -q gdown
SFT_DRIVE_DIR="/content/drive/MyDrive/rl/unsloth_sft_model"
mkdir -p "${SFT_DRIVE_DIR}"
if [ -z "$(ls -A "${SFT_DRIVE_DIR}")" ]; then
  echo "Downloading SFT checkpoint into ${SFT_DRIVE_DIR}"
  gdown --folder https://drive.google.com/drive/folders/1EmaHJQ47OQ2waG-efGZppsiRHbcpHS-H -O "${SFT_DRIVE_DIR}"
else
  echo "SFT checkpoint already present at ${SFT_DRIVE_DIR}"
fi
mkdir -p /content/models
ln -sfn "${SFT_DRIVE_DIR}" /content/models/unsloth_sft_model
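
Before moving on, it can help to confirm that the checkpoint directory actually contains a loadable model. The sketch below assumes a Hugging Face-style checkpoint, where `from_pretrained` expects at least a `config.json`. The helper name `checkpoint_looks_valid` and the required-file list are illustrative, not part of the repository.

```python
from pathlib import Path


def checkpoint_looks_valid(checkpoint_dir, required_files=("config.json",)):
    """Return (ok, missing) for a Hugging Face-style checkpoint directory.

    Assumption: `config.json` is the minimum needed; extend the tuple
    to match the layout of your own checkpoint.
    """
    root = Path(checkpoint_dir)
    if not root.is_dir():  # also False for a dangling symlink
        return False, list(required_files)
    missing = [name for name in required_files if not (root / name).is_file()]
    return not missing, missing


ok, missing = checkpoint_looks_valid("/content/models/unsloth_sft_model")
print("checkpoint ok:", ok, "| missing:", missing)
```

If `ok` is `False`, re-run the download cell or check that Drive is mounted before continuing.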



In [None]:
!git clone https://github.com/nagaraju-chitluru/ppo_from_scratch.git
%cd ppo_from_scratch



## 3. Install math extras

Install the math PPO dependencies bundled with the repository. Restart the runtime if Colab prompts you.



In [None]:
# Install repository in editable mode without dependencies, then manually install math extras
%pip install -q --force-reinstall numpy==2.1.3
%pip install -q --no-deps -e .
%pip install -q \
    "transformers==4.44.2" \
    "trl==0.9.6" \
    "accelerate==0.33.0" \
    "datasets==2.19.1" \
    sentencepiece \
    "sympy==1.12" \
    "peft==0.14.0" \
    gdown \
    bitsandbytes



### 3.1 Verify NumPy version and restart runtime once

Run the next cell to confirm that NumPy ≥ 2.0 is active. If it prints the restart reminder, go to `Runtime → Restart runtime`, then rerun the install cell above **once** before continuing.


In [None]:
import numpy as np
from packaging.version import Version

print("NumPy version:", np.__version__)
if Version(np.__version__) < Version("2.0.0"):
    print("⚠️ Detected NumPy < 2.0. Please restart the runtime (Runtime → Restart runtime) and rerun the install cell above.")
else:
    print("✅ NumPy 2.x is active. You can proceed.")


## 4. Configure checkpoint paths

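Update `configs/math_default.yaml` with your SFT policy and reward model directories. By default it points to the shared Drive paths used in `RHRL_PPO.ipynb`. The cell below prints the current file so you can check the paths before training.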



In [None]:
import yaml
from pathlib import Path

config_path = Path("configs/math_default.yaml")
print(config_path.read_text())
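
To change paths without hand-editing the file, one option is to load the YAML, set values by dotted key, and write it back. This is a minimal sketch: the key `policy.model_path` is a placeholder (check the printed config for the real names), and `set_by_dotted_key` and `override_yaml` are hypothetical helpers, not repository functions. Note that `yaml.safe_dump` drops any comments in the file.

```python
from pathlib import Path


def set_by_dotted_key(config, dotted_key, value):
    """Set config["a"]["b"] = value for dotted_key "a.b", creating missing levels."""
    node = config
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return config


def override_yaml(path, overrides):
    """Apply {dotted_key: value} overrides to a YAML file in place (requires PyYAML)."""
    import yaml  # imported lazily so the dict helper works without PyYAML

    path = Path(path)
    config = yaml.safe_load(path.read_text()) or {}
    for dotted_key, value in overrides.items():
        set_by_dotted_key(config, dotted_key, value)
    path.write_text(yaml.safe_dump(config, sort_keys=False))
    return config


# Placeholder key: replace it with a name that appears in configs/math_default.yaml.
example = set_by_dotted_key({}, "policy.model_path", "/content/models/unsloth_sft_model")
print(example)
```

In the notebook, a call would look like `override_yaml(config_path, {"<section>.<key>": "<your path>"})`, using the key names from the printed config.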



## 5. Train reward model (full dataset)

Run this after updating `configs/reward_default.yaml`. With `sample_limit: null`, it will consume the entire `kira/math-dpo` split and write the LoRA adapter to `reward.training.output_dir`.



In [None]:
!python trainer/reward_train.py --config configs/reward_default.yaml


## 6. Run math PPO training

Create a Drive symlink (optional, keeps old paths working) and execute the PPO loop.



In [None]:
!mkdir -p /content/drive/MyDrive/final_project/cs5446_project
!ln -sfn /content/models/unsloth_sft_model /content/drive/MyDrive/final_project/cs5446_project/unsloth_sft_model
!python trainer/math_train.py --config configs/math_default.yaml



In [None]:
from pathlib import Path
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

sft_dir = Path("/content/drive/MyDrive/rl/unsloth_sft_model")
sft_model = AutoModelForCausalLM.from_pretrained(str(sft_dir))
sft_tokenizer = AutoTokenizer.from_pretrained(str(sft_dir), use_fast=False)

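sft_tokenizer.padding_side = "left"
if sft_tokenizer.pad_token is None:
    # Batched padding needs a pad token; fall back to EOS if the checkpoint defines none.
    sft_tokenizer.pad_token = sft_tokenizer.eos_token
sft_model.eval()

def generate_sft(prompts, max_new_tokens=64):
    inputs = sft_tokenizer(prompts, return_tensors="pt", padding=True).to(sft_model.device)
    input_length = inputs["input_ids"].shape[1]
    with torch.inference_mode():  # no gradients needed for greedy decoding
        outputs = sft_model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            eos_token_id=sft_tokenizer.eos_token_id,
            pad_token_id=sft_tokenizer.pad_token_id,
        )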
    for prompt, output in zip(prompts, outputs):
        completion_ids = output[input_length:]
        completion = sft_tokenizer.decode(completion_ids, skip_special_tokens=True).strip()
        print("Prompt:", prompt)
        print("Response (SFT):", completion or "<empty>")
        print("-" * 40)

prompts = ["Solve x^2 - 5x + 6 = 0. Provide the roots explicitly."]
generate_sft(prompts)



In [None]:
from pathlib import Path
import torch
from trl import AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer

policy_dir = Path("/content/drive/MyDrive/rl/ppo_policy")
policy_model = AutoModelForCausalLMWithValueHead.from_pretrained(str(policy_dir))
tokenizer = AutoTokenizer.from_pretrained(str(policy_dir))
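tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    # Batched padding needs a pad token; fall back to EOS if the checkpoint defines none.
    tokenizer.pad_token = tokenizer.eos_token
policy_model.eval()

def generate(prompts, max_new_tokens=64):
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(policy_model.pretrained_model.device)
    input_length = inputs["input_ids"].shape[1]
    with torch.inference_mode():  # no gradients needed for greedy decoding
        outputs = policy_model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
        )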
    for prompt, output in zip(prompts, outputs):
        completion_ids = output[input_length:]
        completion = tokenizer.decode(completion_ids, skip_special_tokens=True).strip()
        print("Prompt:", prompt)
        print("Response (PPO):", completion or "<empty>")
        print("-" * 40)

prompts = ["Solve x^2 - 5x + 6 = 0. Provide the roots explicitly."]
generate(prompts)



## 7. Evaluate on math DPO preference data

(Optional) Compare SFT and PPO policies on a held-out slice of `kira/math-dpo`. Adjust `EVAL_SPLIT` and `SAMPLE_LIMIT` as needed before running the cell.
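
The evaluation cell in this section scores a completion as correct only when the set of `\boxed{...}` answers it contains matches the set in the `chosen` reference, after lowercasing and removing whitespace. The standalone sketch below mirrors that rule on hand-written strings (the example completions are made up for illustration). It also shows why brace depth is tracked: a pattern that stops at the first `}` would cut `\boxed{\frac{1}{2}}` short.

```python
import re


def boxed_answers(text):
    """Collect the normalized contents of every \\boxed{...}, honoring nested braces."""
    marker = "\\boxed{"
    found = set()
    start = text.find(marker)
    while start != -1:
        i = start + len(marker)
        depth = 1
        while i < len(text) and depth > 0:
            depth += {"{": 1, "}": -1}.get(text[i], 0)
            i += 1
        if depth == 0:  # unterminated boxes are skipped
            content = text[start + len(marker) : i - 1]
            found.add(re.sub(r"\s+", "", content.lower()))
        start = text.find(marker, i)
    return found - {""}


reference = "The roots are \\boxed{2} and \\boxed{3}."
print(boxed_answers("So x = \\boxed{3} or x = \\boxed{ 2 }.") == boxed_answers(reference))  # True
print(boxed_answers("The value is \\boxed{\\frac{1}{2}}."))  # {'\\frac{1}{2}'}
print(boxed_answers("Truncated \\boxed{42"))  # set()
```

Because the comparison is on sets, the order of the boxed answers does not matter, but a completion that boxes only one of two roots is counted as wrong.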



In [None]:
import re
import pandas as pd
from datasets import load_dataset

EVAL_SPLIT = "test"            # change to "train" if the dataset lacks a test split
SAMPLE_LIMIT = 200              # set to None to evaluate the entire split
MAX_NEW_TOKENS = 256

ppo_model = policy_model
ppo_tokenizer = tokenizer


def extract_boxed_answers(text: str) -> set[str]:
    answers = []
    start = text.find("\\boxed{")
    while start != -1:
        i = start + len("\\boxed{")
        depth = 1
        while i < len(text) and depth > 0:
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
            i += 1
        if depth == 0:
            answers.append(text[start + len("\\boxed{") : i - 1])
        start = text.find("\\boxed{", i)
    return {a.strip() for a in answers if a.strip()}


def normalize(answers: set[str]) -> set[str]:
    return {re.sub(r"\s+", "", a.lower()) for a in answers} if answers else set()


def generate_completion(model, tokenizer, prompt: str, max_new_tokens: int, is_value_model: bool = False) -> str:
    base_model = model.pretrained_model if is_value_model else model
    tokenizer.padding_side = "left"
    inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(base_model.device)
    input_len = inputs["input_ids"].shape[1]
    with torch.inference_mode():
        outputs = base_model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
        )
    completion_ids = outputs[0][input_len:]
    return tokenizer.decode(completion_ids, skip_special_tokens=True).strip()


def evaluate_model(model, tokenizer, label: str, is_value_model: bool = False):
    dataset = load_dataset("kira/math-dpo", split=EVAL_SPLIT)
    if SAMPLE_LIMIT is not None:
        dataset = dataset.select(range(min(len(dataset), SAMPLE_LIMIT)))

    rows = []
    correct = 0
    missing = 0

    for example in dataset:
        prompt = example["prompt"].strip()
        reference = example["chosen"].strip()

        completion = generate_completion(model, tokenizer, prompt, MAX_NEW_TOKENS, is_value_model=is_value_model)
        pred = normalize(extract_boxed_answers(completion))
        expected = normalize(extract_boxed_answers(reference))

        if not pred:
            missing += 1

        # A prediction counts only if it has at least one boxed answer and matches the reference set.
        is_correct = bool(pred) and pred == expected
        correct += is_correct

        rows.append(
            {
                "prompt": prompt,
                "reference": reference,
                "completion": completion,
                "pred_boxed": sorted(pred),
                "ref_boxed": sorted(expected),
                "is_correct": is_correct,
            }
        )

    accuracy = correct / len(rows) if rows else 0.0
    print(f"{label} accuracy: {accuracy:.3f} ({correct}/{len(rows)}) — missing boxed answers: {missing}")
    return pd.DataFrame(rows)


print("Evaluating SFT policy...")
sft_results = evaluate_model(sft_model, sft_tokenizer, label="SFT")
print("Evaluating PPO policy...")
ppo_results = evaluate_model(ppo_model, ppo_tokenizer, label="PPO", is_value_model=True)

sft_results.head()
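
Headline accuracy hides whether the two policies fail on the same prompts. Since both evaluations iterate over the same examples in the same order, the `is_correct` columns can be compared row by row. The helper below is a small sketch (`compare_correctness` is a hypothetical name, not a repository function) that works on any two equal-length sequences of booleans.

```python
from collections import Counter


def compare_correctness(sft_flags, ppo_flags):
    """Count per-example agreement between two aligned sequences of is_correct flags."""
    sft_flags, ppo_flags = list(sft_flags), list(ppo_flags)
    if len(sft_flags) != len(ppo_flags):
        raise ValueError("Both runs must cover the same examples in the same order.")
    labels = {
        (True, True): "both_correct",
        (True, False): "only_sft",
        (False, True): "only_ppo",
        (False, False): "neither",
    }
    counts = Counter(labels[(bool(s), bool(p))] for s, p in zip(sft_flags, ppo_flags))
    return {name: counts.get(name, 0) for name in labels.values()}


# Toy flags for illustration; in the notebook, pass
# sft_results["is_correct"] and ppo_results["is_correct"] instead.
print(compare_correctness([True, False, False, True], [True, True, False, False]))
# {'both_correct': 1, 'only_sft': 1, 'only_ppo': 1, 'neither': 1}
```

A large `only_sft` count alongside a higher PPO accuracy would suggest that PPO traded some previously solved prompts for new ones, which is worth inspecting in the per-row DataFrames.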



## 8. Inspect artifacts

Training outputs (policy checkpoints, reward traces, evaluation summaries) are written to the directory specified in `training.save_dir`. Adjust batch sizes, `target_kl`, and reward weights in `math_default.yaml` to run longer experiments after verifying the pipeline end to end.
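
How `target_kl` is consumed depends on the trainer in the repository, which this notebook does not show. One common convention, used by TRL's `AdaptiveKLController`, nudges the KL penalty coefficient up when the measured KL between the policy and the reference model exceeds the target, and down when it falls below. The sketch below reimplements that update rule in plain Python as an illustration. The class name and all numeric values are assumptions, not settings from `math_default.yaml`.

```python
class AdaptiveKLSketch:
    """Illustrative adaptive KL coefficient; not the repository's implementation."""

    def __init__(self, init_kl_coef, target_kl, horizon):
        self.kl_coef = init_kl_coef
        self.target_kl = target_kl
        self.horizon = horizon  # larger horizon means slower adaptation

    def update(self, observed_kl, n_steps):
        # Relative error versus the target, clipped to +/-20% per update.
        error = max(-0.2, min(0.2, observed_kl / self.target_kl - 1.0))
        self.kl_coef *= 1.0 + error * n_steps / self.horizon
        return self.kl_coef


controller = AdaptiveKLSketch(init_kl_coef=0.2, target_kl=6.0, horizon=10_000)
print(controller.update(observed_kl=12.0, n_steps=256))  # KL above target: coefficient rises slightly
print(controller.update(observed_kl=3.0, n_steps=256))   # KL below target: coefficient falls slightly
```

Under this convention a lower `target_kl` keeps the PPO policy closer to the SFT model, at the cost of slower reward improvement. Check `trainer/math_train.py` to see which convention the repository actually follows.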

