# Qwen3-1.7B GSM8K PPO Training (VERL)

This notebook replicates the reported improvement on GSM8K:

| Model | Accuracy |
|---|---|
| Qwen3-1.7B base | ~69.2% |
| Qwen3-1.7B + PPO | ~82.7% |

**GPU recommendation:** A100 40GB (Colab Pro) or T4 16GB (free tier, with reduced batch sizes)

**Runtime:** ~6–12 hours on A100 for 500 steps. Use Colab Pro with background execution.

## 0. Check GPU

In [None]:
!nvidia-smi
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    gpu = torch.cuda.get_device_properties(0)
    print(f"GPU: {gpu.name}  VRAM: {gpu.total_memory / 1e9:.1f} GB")

## 1. Clone Project Repo

In [None]:
import os

REPO_URL = "https://github.com/YOUR_USERNAME/YOUR_REPO_NAME.git"  # <-- UPDATE THIS
REPO_DIR = "/content/LLM_RL"

if not os.path.exists(REPO_DIR):
    !git clone {REPO_URL} {REPO_DIR}
else:
    !git -C {REPO_DIR} pull

os.chdir(REPO_DIR)
!ls

## 2. Install Dependencies

In [None]:
# Install VERL
!git clone https://github.com/verl-project/verl /content/verl
%cd /content/verl
!pip install -e . -q
%cd {REPO_DIR}

In [None]:
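# Install remaining dependencies
# Quote the version specifier so the shell does not treat ">" as a redirect.
!pip install -q \
    vllm \
    datasets \
    sympy \
    regex \
    pandas \
    pyarrow \
    "transformers>=4.44.0" \
    accelerate
# flash-attn is installed separately so --no-build-isolation applies only to it.
!pip install -q flash-attn --no-build-isolation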

print("Dependencies installed.")

## 3. HuggingFace Login (required to download Qwen3)

In [None]:
from huggingface_hub import login
# Get your token from https://huggingface.co/settings/tokens
# Use Colab Secrets (key icon on left sidebar) → add HF_TOKEN
try:
    from google.colab import userdata
    hf_token = userdata.get('HF_TOKEN')
    login(token=hf_token)
    print("Logged in via Colab secret.")
except Exception:
    login()  # interactive prompt

## 4. Prepare GSM8K Dataset

In [None]:
os.chdir(REPO_DIR)
!python data/prepare_gsm8k.py --output_dir data/gsm8k
!ls data/gsm8k/

## 5. Test Reward Function

In [None]:
!python rewards/gsm8k_reward.py
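
For orientation, a GSM8K reward usually extracts the final number from the model's response and compares it with the reference answer, which in GSM8K follows a `####` marker. The block below is a minimal sketch of that idea, not the contents of `rewards/gsm8k_reward.py`. The `compute_score(data_source, solution_str, ground_truth, extra_info=None)` signature is an assumption about what the repo's function looks like, and `extract_final_number` is a hypothetical helper.

```python
import re

# Matches integers and decimals, allowing thousands separators such as "1,234".
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text):
    """Return the last number found in `text` as a float, or None if there is none."""
    # Prefer whatever follows the GSM8K "####" marker when it is present.
    if "####" in text:
        text = text.rsplit("####", 1)[1]
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Binary reward: 1.0 when the final numbers agree, otherwise 0.0."""
    predicted = extract_final_number(solution_str)
    expected = extract_final_number(str(ground_truth))
    if predicted is None or expected is None:
        return 0.0
    return 1.0 if abs(predicted - expected) < 1e-6 else 0.0
```

A binary reward like this is simple to verify by hand, which helps when checking that the training signal is sane before a multi-hour run.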

## 6. (Optional) Evaluate Base Model Before Training

This gives you the ~69% baseline to compare against after PPO.

In [None]:
# Evaluate base model (takes ~30-45 min on T4, ~15 min on A100)
# You can skip this and come back after training.
RUN_BASE_EVAL = False  # Set to True to run

if RUN_BASE_EVAL:
    !python evaluation/eval_gsm8k.py \
        --model_path Qwen/Qwen3-1.7B \
        --split test \
        --max_new_tokens 512

## 7. Detect GPU and Configure Training

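- **A100 (40GB)**: full config with `train_batch_size=128` and `max_response_length=1024`
- **T4 (16GB)**: reduced config with `train_batch_size=32` and `max_response_length=512`, plus gradient checkpointing and parameter offload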

In [None]:
import torch
import yaml

gpu_mem_gb = torch.cuda.get_device_properties(0).total_memory / 1e9 if torch.cuda.is_available() else 0
print(f"GPU memory: {gpu_mem_gb:.1f} GB")

IS_A100 = gpu_mem_gb >= 38
print(f"Using {'A100' if IS_A100 else 'T4/low-VRAM'} config")
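
The threshold check can also be kept as a small pure function, which makes it easy to test without a GPU. This is a sketch: `select_profile` and the `PROFILES` table are hypothetical names, and the values simply mirror the two configurations described in this section.

```python
# Headline settings for each profile, copied from the two configs in this section.
PROFILES = {
    "a100": {"train_batch_size": 128, "max_response_length": 1024},
    "low_vram": {"train_batch_size": 32, "max_response_length": 512},
}


def select_profile(total_memory_bytes, a100_threshold_gb=38.0):
    """Pick a profile name from the VRAM size reported by the driver."""
    memory_gb = total_memory_bytes / 1e9  # decimal gigabytes, as in the cell above
    return "a100" if memory_gb >= a100_threshold_gb else "low_vram"


# A 40 GiB card reports roughly 42.9 decimal GB, so it clears the 38 GB threshold.
print(select_profile(40 * 1024**3), PROFILES[select_profile(40 * 1024**3)])
# A 16 GiB card falls back to the reduced profile.
print(select_profile(16 * 1024**3), PROFILES[select_profile(16 * 1024**3)])
```

Under this rule a 24 GB card also selects the reduced profile, which is the conservative choice.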

In [None]:
# Write a Colab-specific override config
import os

os.makedirs("configs", exist_ok=True)

if IS_A100:
    colab_overrides = """
data:
  train_batch_size: 128
  max_response_length: 1024

actor_rollout_ref:
  actor:
    ppo_mini_batch_size: 64
    ppo_micro_batch_size_per_gpu: 8
  rollout:
    response_length: 1024
    gpu_memory_utilization: 0.45
  ref:
    fsdp_config:
      param_offload: True

critic:
  ppo_micro_batch_size_per_gpu: 8

trainer:
  n_gpus_per_node: 1
  total_epochs: 15
  save_freq: 50
  test_freq: 25
"""
else:
    # T4 16GB — reduce everything
    colab_overrides = """
data:
  train_batch_size: 32
  max_prompt_length: 384
  max_response_length: 512

actor_rollout_ref:
  model:
    enable_gradient_checkpointing: True
    fsdp_config:
      param_offload: True
  actor:
    ppo_mini_batch_size: 16
    ppo_micro_batch_size_per_gpu: 2
  rollout:
    response_length: 512
    gpu_memory_utilization: 0.35
  ref:
    fsdp_config:
      param_offload: True

critic:
  ppo_micro_batch_size_per_gpu: 2
  model:
    enable_gradient_checkpointing: True
    fsdp_config:
      param_offload: True

trainer:
  n_gpus_per_node: 1
  total_epochs: 10
  save_freq: 50
  test_freq: 50
"""

with open("configs/colab_overrides.yaml", "w") as f:
    f.write(colab_overrides)
print("Config written.")
print(colab_overrides)
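
Unless `qwen3_gsm8k_ppo.yaml` references `colab_overrides.yaml`, Hydra will not read the file on its own. One way to apply the same values is to flatten the nested mapping into dotted `key=value` overrides and append them to the training command. The sketch below assumes every key already exists in the base config (a key that does not exist would need a `+` prefix); `flatten_overrides` is a hypothetical helper.

```python
def flatten_overrides(tree, prefix=""):
    """Turn a nested dict into a list of Hydra-style 'a.b.c=value' strings."""
    overrides = []
    for key, value in tree.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into nested sections, extending the dotted path.
            overrides.extend(flatten_overrides(value, path))
        else:
            overrides.append(f"{path}={value}")
    return overrides


# A small excerpt of the low-VRAM settings, written as a plain dict.
example = {
    "data": {"train_batch_size": 32, "max_response_length": 512},
    "actor_rollout_ref": {"ref": {"fsdp_config": {"param_offload": True}}},
}
print(flatten_overrides(example))
```

In the notebook, the dict can be obtained with `yaml.safe_load(colab_overrides)`, and the resulting strings can be added to the end of the `verl.trainer.main_ppo` command.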

## 8. PPO Training

> **Important:** Enable Colab **Background execution** (Runtime → Change runtime type) before starting the run, then use Runtime → Run all or run just this section.
> Background execution keeps the session alive if the browser disconnects during the long training run.

In [None]:
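import os, shlex, sys

os.chdir(REPO_DIR)
# Prepend the repo so VERL can import rewards/ without dropping any existing PYTHONPATH
os.environ["PYTHONPATH"] = os.pathsep.join(
    p for p in (REPO_DIR, os.environ.get("PYTHONPATH")) if p
)

# Preview of the core command; the next cell runs it with the data paths added.
cmd = [
    sys.executable, "-m", "verl.trainer.main_ppo",
    "--config-path", f"{REPO_DIR}/configs",
    "--config-name", "qwen3_gsm8k_ppo",
    # Point VERL at the reward function. These keys are assumed to be defined in
    # qwen3_gsm8k_ppo.yaml; upstream VERL documents custom_reward_function.path/.name.
    f"reward_model.reward_fn_path={REPO_DIR}/rewards/gsm8k_reward.py",
    "reward_model.reward_fn_name=compute_score",
    # Single-GPU Colab runtime (no "+" prefix, matching the run cell below)
    "trainer.n_gpus_per_node=1",
    f"trainer.default_local_dir={REPO_DIR}/checkpoints",
]

print("Training command:")
print(shlex.join(cmd))
print("\nRun the next cell to start training (this will take several hours).")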

In [None]:
# Run training — output streams to cell
# On Colab Pro, enable background execution to survive disconnects.
!python -m verl.trainer.main_ppo \
    --config-path {REPO_DIR}/configs \
    --config-name qwen3_gsm8k_ppo \
    trainer.n_gpus_per_node=1 \
    data.train_files={REPO_DIR}/data/gsm8k/train.parquet \
    data.val_files={REPO_DIR}/data/gsm8k/test.parquet \
    reward_model.reward_fn_path={REPO_DIR}/rewards/gsm8k_reward.py \
    reward_model.reward_fn_name=compute_score \
    trainer.default_local_dir={REPO_DIR}/checkpoints \
    trainer.experiment_name=qwen3_gsm8k_ppo_colab 2>&1 | tee {REPO_DIR}/training.log

## 9. Evaluate Trained Checkpoint

In [None]:
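import glob, re

# Find the latest checkpoint. Sort by the numeric step, since a plain string
# sort would place global_step_50 after global_step_450.
def _step(path):
    match = re.search(r"global_step_(\d+)$", path)
    return int(match.group(1)) if match else -1

# Search recursively so the lookup works whether or not checkpoints are nested
# under an experiment-name directory.
checkpoints = sorted(
    glob.glob(f"{REPO_DIR}/checkpoints/**/global_step_*", recursive=True), key=_step
)
if checkpoints:
    latest_ckpt = checkpoints[-1]
    print(f"Latest checkpoint: {latest_ckpt}")
else:
    print("No checkpoints found yet.")
    latest_ckpt = "Qwen/Qwen3-1.7B"  # fallback to base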

In [None]:
os.chdir(REPO_DIR)
!python evaluation/eval_gsm8k.py \
    --model_path {latest_ckpt} \
    --split test \
    --max_new_tokens 512

## 10. Save Results to Google Drive (optional)

In [None]:
SAVE_TO_DRIVE = False  # Set True to mount Drive and copy results

if SAVE_TO_DRIVE:
    from google.colab import drive
    drive.mount('/content/drive')
    DRIVE_DIR = "/content/drive/MyDrive/LLM_RL_results"
    !mkdir -p {DRIVE_DIR}
    !cp -r {REPO_DIR}/checkpoints {DRIVE_DIR}/
    !cp -r {REPO_DIR}/evaluation/results {DRIVE_DIR}/
    !cp {REPO_DIR}/training.log {DRIVE_DIR}/
    print(f"Saved to {DRIVE_DIR}")

## 11. Print Accuracy Report

In [None]:
import json, glob

result_files = glob.glob(f"{REPO_DIR}/evaluation/results/*.json")
for rf in sorted(result_files):
    with open(rf) as f:
        data = json.load(f)
    r = data["report"]
    print(f"{r['model_path']}  →  {r['accuracy']*100:.2f}%  ({r['correct']}/{r['total']})")
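
To compare the base and PPO runs directly, the gain can be reported in percentage points. The sketch below assumes each report carries the `correct` and `total` fields read by the loop above; `accuracy_gain_points` is a hypothetical helper, and the two example reports hold placeholder values for illustration only, not measured results.

```python
def accuracy_gain_points(base_report, tuned_report):
    """Accuracy difference (tuned minus base) in percentage points."""
    base_acc = base_report["correct"] / base_report["total"]
    tuned_acc = tuned_report["correct"] / tuned_report["total"]
    return (tuned_acc - base_acc) * 100


# Placeholder reports, not real evaluation output.
base = {"model_path": "base-model", "correct": 50, "total": 100}
tuned = {"model_path": "ppo-checkpoint", "correct": 75, "total": 100}

gain = accuracy_gain_points(base, tuned)
print(f"{tuned['model_path']} vs {base['model_path']}: {gain:+.1f} points")
```

Computing the gain from `correct` and `total` rather than from a rounded `accuracy` field avoids compounding rounding error when the two runs are close.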