# Math PPO Colab Walkthrough

This notebook mirrors the math RLHF pipeline and is set up to run on Google Colab. Run the cells in order to mount Drive (optional), clone the repository, install dependencies, and launch PPO fine-tuning with the math reward model.



In [None]:
# Install commands are provided in Section 3 after cloning the repository.



## 1. Mount Google Drive (optional)

If your SFT and reward checkpoints live on Drive (as in `RHRL_PPO.ipynb`), mount it first. Skip this step if the checkpoints are accessible locally.



In [None]:
from google.colab import drive
drive.mount("/content/drive")



## 2. Clone the repository

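The first cell below is optional: it downloads the SFT policy checkpoint into Drive (this requires the mount from Section 1) and links it to `/content/models/unsloth_sft_model`. The second cell clones the repository. If you forked the project, replace the clone URL with your fork (e.g. `https://github.com/<username>/ppo_from_scratch.git`).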



In [None]:
%%bash
# Optional: download SFT policy checkpoint directly into Drive
set -e
python -m pip install -q gdown
SFT_DRIVE_DIR="/content/drive/MyDrive/rl/unsloth_sft_model"
mkdir -p "${SFT_DRIVE_DIR}"
if [ -z "$(ls -A "${SFT_DRIVE_DIR}")" ]; then
  echo "Downloading SFT checkpoint into ${SFT_DRIVE_DIR}"
  gdown --folder https://drive.google.com/drive/folders/1EmaHJQ47OQ2waG-efGZppsiRHbcpHS-H -O "${SFT_DRIVE_DIR}"
else
  echo "SFT checkpoint already present at ${SFT_DRIVE_DIR}"
fi
mkdir -p /content/models
ln -sfn "${SFT_DRIVE_DIR}" /content/models/unsloth_sft_model
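
The emptiness test in the bash cell can also be run from Python, for instance right before a checkpoint is loaded later in the notebook. The following is a minimal sketch that assumes only the Drive path used above; `checkpoint_ready` is a hypothetical helper and not part of the repository.

```python
from pathlib import Path


def checkpoint_ready(directory: str) -> bool:
    """Return True if `directory` exists and contains at least one entry.

    Mirrors the `ls -A` emptiness test in the bash cell. It does not
    validate that the files form a loadable checkpoint.
    """
    path = Path(directory)
    return path.is_dir() and any(path.iterdir())


# Path taken from the download cell above.
SFT_DRIVE_DIR = "/content/drive/MyDrive/rl/unsloth_sft_model"
print("SFT checkpoint present:", checkpoint_ready(SFT_DRIVE_DIR))
```

A `False` result usually means that Drive is not mounted or that the download has not run yet.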



In [None]:
!git clone https://github.com/nagaraju-chitluru/ppo_from_scratch.git
%cd ppo_from_scratch



## 3. Install math extras

Install the math PPO dependencies bundled with the repository. Restart the runtime if Colab prompts you.



In [None]:
# Install repository in editable mode without dependencies, then manually install math extras
%pip install -q --force-reinstall numpy==2.1.3
%pip install -q --no-deps -e .
%pip install -q \
    "transformers==4.44.2" \
    "trl==0.9.6" \
    "accelerate==0.33.0" \
    "datasets==2.19.1" \
    sentencepiece \
    "sympy==1.12" \
    "peft==0.14.0" \
    gdown \
    bitsandbytes



### 3.1 Verify NumPy version and restart runtime once

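Run the next cell to confirm that NumPy ≥ 2.0 is active. If it prints the restart reminder, go to `Runtime → Restart runtime`. A restart resets the working directory to `/content`, so first run `%cd /content/ppo_from_scratch` (assuming the repository was cloned into `/content`), then rerun the install cell above **once** before continuing.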


In [None]:
import numpy as np
from packaging.version import Version

print("NumPy version:", np.__version__)
if Version(np.__version__) < Version("2.0.0"):
    print("⚠️ Detected NumPy < 2.0. Please restart the runtime (Runtime → Restart runtime) and rerun the install cell above.")
else:
    print("✅ NumPy 2.x is active. You can proceed.")
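
The same idea extends to the other pinned packages. The sketch below compares installed versions against the pins from the install cell using the standard-library `importlib.metadata`; `installed_version` and `find_mismatches` are hypothetical helper names introduced here for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install cell above.
PINS = {
    "numpy": "2.1.3",
    "transformers": "4.44.2",
    "trl": "0.9.6",
    "accelerate": "0.33.0",
    "datasets": "2.19.1",
    "sympy": "1.12",
    "peft": "0.14.0",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Map each package whose installed version differs from its pin to (expected, found)."""
    mismatches = {}
    for package, expected in pins.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


for package, (expected, found) in find_mismatches(PINS).items():
    print(f"{package}: expected {expected}, found {found or 'not installed'}")
```

This reads the metadata of what is installed on disk, so it complements the `np.__version__` check rather than replacing it: only the latter reflects the module that is actually loaded in the running kernel.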


## 4. Configure checkpoint paths

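Update `configs/math_default.yaml` with your SFT policy and reward model directories. By default it points to the shared Drive paths used in `RHRL_PPO.ipynb`. The cell below prints the current file so you can check the paths before training.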



In [None]:
import yaml
from pathlib import Path

config_path = Path("configs/math_default.yaml")
print(config_path.read_text())
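
If you prefer to change a path programmatically instead of editing the file by hand, one option is to update the parsed config by dotted key. This is a sketch under assumptions: `set_by_dotted_key` is a hypothetical helper, the dictionary is a stand-in for the parsed YAML, and only the `training.save_dir` key is taken from this notebook. The real file's other keys may differ.

```python
import copy


def set_by_dotted_key(config, dotted_key, value):
    """Return a deep copy of `config` with the nested key set to `value`.

    Intermediate mappings are created if they are missing.
    """
    updated = copy.deepcopy(config)
    node = updated
    *parents, leaf = dotted_key.split(".")
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value
    return updated


# Illustrative stand-in for the parsed YAML; the value is an example path.
example = {"training": {"save_dir": "outputs/old"}}
example = set_by_dotted_key(example, "training.save_dir", "/content/drive/MyDrive/rl/ppo_policy")
print(example)
```

With the real file, `yaml.safe_load(config_path.read_text())` produces the dictionary and `config_path.write_text(yaml.safe_dump(updated, sort_keys=False))` writes it back. Note that a PyYAML round trip drops any comments in the file.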



## 5. Run math PPO training

Create a Drive symlink (optional, keeps old paths working) and execute the PPO loop.



In [None]:
!mkdir -p /content/drive/MyDrive/final_project/cs5446_project
!ln -sfn /content/models/unsloth_sft_model /content/drive/MyDrive/final_project/cs5446_project/unsloth_sft_model
!python trainer/math_train.py --config configs/math_default.yaml



In [None]:
from pathlib import Path
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

sft_dir = Path("/content/drive/MyDrive/rl/unsloth_sft_model")
sft_model = AutoModelForCausalLM.from_pretrained(str(sft_dir))
sft_tokenizer = AutoTokenizer.from_pretrained(str(sft_dir), use_fast=False)

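sft_tokenizer.padding_side = "left"
# Batched padding needs a pad token; fall back to EOS if the checkpoint defines none.
if sft_tokenizer.pad_token is None:
    sft_tokenizer.pad_token = sft_tokenizer.eos_token

def generate_sft(prompts, max_new_tokens=64):
    inputs = sft_tokenizer(prompts, return_tensors="pt", padding=True).to(sft_model.device)
    # With left padding, every completion starts at the same offset.
    input_length = inputs["input_ids"].shape[1]
    outputs = sft_model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        eos_token_id=sft_tokenizer.eos_token_id,
        pad_token_id=sft_tokenizer.pad_token_id,
    )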
    for prompt, output in zip(prompts, outputs):
        completion_ids = output[input_length:]
        completion = sft_tokenizer.decode(completion_ids, skip_special_tokens=True).strip()
        print("Prompt:", prompt)
        print("Response (SFT):", completion or "<empty>")
        print("-" * 40)

prompts = ["Solve x^2 - 5x + 6 = 0. Provide the roots explicitly."]
generate_sft(prompts)



In [None]:
from pathlib import Path
import torch
from trl import AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer

policy_dir = Path("/content/drive/MyDrive/rl/ppo_policy")
policy_model = AutoModelForCausalLMWithValueHead.from_pretrained(str(policy_dir))
tokenizer = AutoTokenizer.from_pretrained(str(policy_dir))
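tokenizer.padding_side = "left"
# Batched padding needs a pad token; fall back to EOS if the checkpoint defines none.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def generate(prompts, max_new_tokens=64):
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(policy_model.pretrained_model.device)
    # With left padding, every completion starts at the same offset.
    input_length = inputs["input_ids"].shape[1]
    outputs = policy_model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )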
    for prompt, output in zip(prompts, outputs):
        completion_ids = output[input_length:]
        completion = tokenizer.decode(completion_ids, skip_special_tokens=True).strip()
        print("Prompt:", prompt)
        print("Response (PPO):", completion or "<empty>")
        print("-" * 40)

prompts = ["Solve x^2 - 5x + 6 = 0. Provide the roots explicitly."]
generate(prompts)
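
When comparing the SFT and PPO responses by eye, it helps to have the reference answer for the example prompt. The sketch below computes the roots of the quadratic with the standard formula and runs a crude text check on a completion. It is a manual sanity check only, not the repository's reward model; `quadratic_roots` and `mentions_all_roots` are hypothetical helpers, and the formula assumes `a` is nonzero.

```python
import math
import re


def quadratic_roots(a, b, c):
    """Return the real roots of a*x^2 + b*x + c = 0 in ascending order (a != 0)."""
    discriminant = b * b - 4 * a * c
    if discriminant < 0:
        return []
    sqrt_disc = math.sqrt(discriminant)
    return sorted({(-b - sqrt_disc) / (2 * a), (-b + sqrt_disc) / (2 * a)})


def mentions_all_roots(text, roots, tol=1e-6):
    """Check that every root appears among the numbers written in `text`."""
    numbers = [float(match) for match in re.findall(r"-?\d+(?:\.\d+)?", text)]
    return all(any(abs(n - root) <= tol for n in numbers) for root in roots)


# Coefficients of the example prompt: x^2 - 5x + 6 = 0.
roots = quadratic_roots(1, -5, 6)
print("Reference roots:", roots)
print(mentions_all_roots("The roots are x = 2 and x = 3.", roots))
```

The text check counts any number in the completion, including coefficients restated from the prompt, so treat a positive result as a hint and read the response before drawing conclusions.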



## 6. Inspect artifacts

Training outputs (policy checkpoints, reward traces, evaluation summaries) are written to the directory specified in `training.save_dir`. Once the pipeline runs end to end, adjust batch sizes, `target_kl`, and reward weights in `configs/math_default.yaml` for longer experiments.
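
To see what a run produced, you can list the files under the save directory. This is a minimal sketch: `list_artifacts` is a hypothetical helper, and the path in the loop is an assumption borrowed from the PPO loading cell. Substitute the actual value of `training.save_dir` from your config.

```python
from pathlib import Path


def list_artifacts(save_dir):
    """Return (relative path, size in bytes) for every file under `save_dir`, sorted by path."""
    root = Path(save_dir)
    if not root.is_dir():
        return []
    files = (p for p in root.rglob("*") if p.is_file())
    return sorted((str(p.relative_to(root)), p.stat().st_size) for p in files)


# Assumed location; replace with the value of `training.save_dir` from your config.
for name, size in list_artifacts("/content/drive/MyDrive/rl/ppo_policy"):
    print(f"{size:>12,d}  {name}")
```

An empty listing means the directory does not exist yet or training has not written anything to it.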

