# Math PPO Colab Walkthrough

This notebook mirrors the math RLHF pipeline and is set up to run on Google Colab. Run the cells in order: they clone the repository, install dependencies, and launch PPO fine-tuning with the math reward model.



## 1. Mount Google Drive (optional)

If your SFT and reward checkpoints live on Drive (as in `RHRL_PPO.ipynb`), mount it first. Skip this step if the checkpoints are accessible locally.



In [None]:
from google.colab import drive
drive.mount("/content/drive")
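
Once Drive is mounted, it can help to confirm that the checkpoint directories are visible before going further. The sketch below is one way to do that. The two paths are hypothetical placeholders, not locations taken from this repository, so replace them with your own.

```python
from pathlib import Path


def missing_dirs(paths):
    """Return the entries of `paths` that are not existing directories."""
    return [str(p) for p in map(Path, paths) if not p.is_dir()]


# Hypothetical locations; replace with wherever your checkpoints actually live.
checkpoint_dirs = [
    "/content/drive/MyDrive/sft_checkpoint",
    "/content/drive/MyDrive/reward_checkpoint",
]

for path in missing_dirs(checkpoint_dirs):
    print(f"Not found: {path}")
```

If nothing is printed, both directories exist and are reachable from the runtime.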



## 2. Clone the repository

If you forked the project, replace the URL below with your fork (e.g. `https://github.com/<username>/ppo_from_scratch.git`).



In [None]:
!git clone https://github.com/nagaraju-chitluru/ppo_from_scratch.git
%cd ppo_from_scratch
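
Re-running the clone cell after the directory already exists makes `git clone` report an error. If you expect to re-run cells often, a guarded clone along the following lines avoids that. `clone_if_missing` is an illustrative helper, not part of the repository.

```python
import subprocess
from pathlib import Path

REPO_URL = "https://github.com/nagaraju-chitluru/ppo_from_scratch.git"


def clone_if_missing(url, dest, run=subprocess.run):
    """Clone `url` into `dest` unless `dest` already exists.

    Returns True if a clone was attempted, False if it was skipped.
    `run` is injectable so the helper can be exercised without network access.
    """
    if Path(dest).exists():
        return False
    run(["git", "clone", url, str(dest)], check=True)
    return True
```

In the notebook you would call `clone_if_missing(REPO_URL, "ppo_from_scratch")` and then run `%cd ppo_from_scratch` as before.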



## 3. Install math extras

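Install the repository in editable mode without its declared dependencies, then install pinned versions of the libraries the math PPO pipeline needs. If Colab prompts you to restart the runtime, do so and then re-run `%cd ppo_from_scratch`, since a restart resets the working directory.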



In [None]:
# Install repository in editable mode without dependencies, then manually install math extras
%pip install -q --no-deps -e .
%pip install -q \
    "numpy==1.26.4" \
    "transformers==4.44.2" \
    "trl==0.9.6" \
    "accelerate==0.33.0" \
    "datasets==2.19.1" \
    "sentencepiece" \
    "sympy==1.12" \
    "peft==0.14.0"

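Because Colab ships its own versions of several of these libraries, it is worth checking that the pins actually took effect, especially after a runtime restart. The sketch below compares the installed versions against the pins from the install cell using the standard library's `importlib.metadata`. `sentencepiece` is omitted because the install cell does not pin it.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install cell.
EXPECTED = {
    "numpy": "1.26.4",
    "transformers": "4.44.2",
    "trl": "0.9.6",
    "accelerate": "0.33.0",
    "datasets": "2.19.1",
    "sympy": "1.12",
    "peft": "0.14.0",
}


def installed_version(name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    mismatches = {}
    for name, want in expected.items():
        found = lookup(name)
        if found != want:
            mismatches[name] = (want, found)
    return mismatches


for name, (want, found) in find_mismatches(EXPECTED).items():
    print(f"{name}: expected {want}, found {found}")
```

No output means every pinned package is installed at the expected version.
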


## 4. Configure checkpoint paths

The cell below prints `configs/math_default.yaml` so you can inspect it. Edit the file so that it points at your SFT policy and reward model directories. By default it points to the shared Drive paths used in `RHRL_PPO.ipynb`.



In [None]:
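import yaml
from pathlib import Path

# Print the config so you can see which paths need editing.
config_path = Path("configs/math_default.yaml")
config_text = config_path.read_text()
print(config_text)

# Parse it as well, so a malformed file fails here rather than mid-training.
config = yaml.safe_load(config_text)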
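
If you would rather update the paths from the notebook than edit the file by hand, something like the following could work. It is a sketch under assumptions: the key names `model.sft_path` and `model.reward_path` are hypothetical, so check the printed config for the real ones. Note also that a load-and-dump round trip through PyYAML drops any comments in the file.

```python
from pathlib import Path


def set_by_dotted_key(config, dotted_key, value):
    """Set config["a"]["b"] = value for dotted_key "a.b", creating nested dicts as needed."""
    node = config
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return config


def update_yaml(path, overrides):
    """Load a YAML file, apply dotted-key overrides, and write it back."""
    import yaml  # PyYAML; imported here so the helper above stays dependency-free

    path = Path(path)
    config = yaml.safe_load(path.read_text()) or {}
    for dotted_key, value in overrides.items():
        set_by_dotted_key(config, dotted_key, value)
    path.write_text(yaml.safe_dump(config, sort_keys=False))
    return config


# Hypothetical key names and paths; replace both with the ones your config uses.
# update_yaml("configs/math_default.yaml", {
#     "model.sft_path": "/content/drive/MyDrive/sft_checkpoint",
#     "model.reward_path": "/content/drive/MyDrive/reward_checkpoint",
# })
```

The call is left commented out so that running the cell does not modify the file until you have filled in real keys and paths.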



## 5. Run math PPO training

This executes the TRL-based PPO loop that samples math prompts, scores responses with the reward model, and updates the policy.



In [None]:
!python trainer/math_train.py --config configs/math_default.yaml
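
To make the "score, then update" step more concrete, here is a small sketch of how per-token rewards are commonly shaped in RLHF-style PPO: every generated token is penalized in proportion to the log-probability gap between the policy and a frozen reference model, and the reward model's scalar score is added on the final token. This illustrates the general technique and is not code from this repository. The names `shaped_rewards` and `kl_coef` are assumptions, and the repository's loop may differ in its details.

```python
def shaped_rewards(policy_logprobs, ref_logprobs, score, kl_coef):
    """Per-token rewards: a KL penalty on every token plus the score on the last one.

    policy_logprobs / ref_logprobs: log-probabilities of the sampled tokens under
    the current policy and the frozen reference model (same length, non-empty).
    score: scalar from the reward model for the whole response.
    kl_coef: weight of the KL penalty.
    """
    if len(policy_logprobs) != len(ref_logprobs) or not policy_logprobs:
        raise ValueError("need two non-empty sequences of equal length")
    rewards = [
        -kl_coef * (lp - ref) for lp, ref in zip(policy_logprobs, ref_logprobs)
    ]
    rewards[-1] += score
    return rewards


# Toy numbers, purely illustrative.
print(shaped_rewards([-1.0, -0.5, -2.0], [-1.0, -1.0, -1.5], score=1.0, kl_coef=0.1))
```

Tokens where the policy assigns more probability than the reference get a negative reward, which discourages drifting far from the SFT model. The KL penalty coefficient controls how strong that pull is.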



## 6. Inspect artifacts

Training outputs (policy checkpoints, reward traces, evaluation summaries) are written to the directory specified in `training.save_dir`. Once the pipeline runs end to end, adjust batch sizes, `target_kl`, and reward weights in `configs/math_default.yaml` to run longer experiments.
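
A quick way to see what a run produced is to list every file under the save directory with its size. The helper below is a sketch and `list_artifacts` is an illustrative name. The commented usage assumes that `training.save_dir` corresponds to a nested `training` → `save_dir` entry in the YAML, which you should confirm against your config.

```python
from pathlib import Path


def list_artifacts(save_dir):
    """Return (relative_path, size_in_bytes) for every file under save_dir, sorted by path."""
    root = Path(save_dir)
    files = [p for p in root.rglob("*") if p.is_file()]
    return sorted((str(p.relative_to(root)), p.stat().st_size) for p in files)


# Example usage in the notebook (assumes the nested key layout described above):
# import yaml
# config = yaml.safe_load(Path("configs/math_default.yaml").read_text())
# for name, size in list_artifacts(config["training"]["save_dir"]):
#     print(f"{size:>12,d}  {name}")
```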

