A minimal, self-contained implementation of Reinforcement Learning from Verifiable Rewards (RLVR) on Qwen3-4B-Instruct for grade-school math reasoning, powered by the Tinker training API.
| Model | Accuracy (GSM8K, 250 examples) |
|---|---|
| Qwen3-4B-Instruct (base) | 89.2% (223/250) |
| + RLVR fine-tune (this repo) | 90.0% (225/250) |
| Delta | +0.8% |
Best validation checkpoint reached 91.7% at RL iteration 8.
- Warm-start SFT (40 steps) — teaches the model the `Final Answer: <number>` output format using 768 GSM8K training examples
- GRPO-style RL (10 iterations) — samples 8 completions per question, computes exact-match rewards, normalizes advantages by standard deviation, and updates via PPO clipping
- Final eval — compares the best checkpoint against the frozen base model on 250 held-out test examples
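The final-eval step reduces to exact-match accuracy over the held-out set. A minimal sketch (the function name is illustrative, not taken from the repo):

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predicted answers exactly equal to the gold answers.

    Both lists are assumed to hold normalized numeric strings.
    """
    assert len(predictions) == len(references)
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)
```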
```
pip install tinker datasets transformers tqdm pandas torch
```

You will need a Tinker API key. The script prompts for it securely via `getpass` — it is never written to disk.
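A minimal sketch of the secure key prompt, assuming the key may also be supplied via a `TINKER_API_KEY` environment variable (the helper name is hypothetical):

```python
import getpass
import os

def get_api_key() -> str:
    """Read the Tinker API key from the environment, else prompt securely.

    The key is held only in memory; nothing is written to disk.
    """
    key = os.environ.get("TINKER_API_KEY")
    if not key:
        key = getpass.getpass("Enter your TINKER_API_KEY: ")
    return key
```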
Notebook (recommended):
Open Tinker_RFT.ipynb in Google Colab and run all cells. The notebook auto-prompts for your TINKER_API_KEY.
Script:
```
python Tinker_RFT.py
```

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-4B-Instruct-2507 |
| LoRA rank | 32 |
| SFT steps | 40 |
| RL iterations | 10 |
| Questions/iter | 12 |
| Group size (samples/question) | 8 |
| RL learning rate | 1e-5 |
| Max new tokens (RL) | 512 |
| PPO clip range | [0.9, 1.1] |
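The clip range in the table corresponds to the standard PPO clipped surrogate objective. A per-token sketch in plain Python (illustrative, not the repo's actual training code):

```python
import math

def ppo_clip_loss(logprob, old_logprob, advantage, clip_low=0.9, clip_high=1.1):
    """PPO clipped surrogate loss for one token (illustrative sketch).

    The probability ratio is clamped to [clip_low, clip_high], matching the
    [0.9, 1.1] range above, and the pessimistic (min) objective is used.
    """
    ratio = math.exp(logprob - old_logprob)              # pi_new / pi_old
    clipped = min(max(ratio, clip_low), clip_high)       # clamp the ratio
    return -min(ratio * advantage, clipped * advantage)  # negated for descent
```

With a positive advantage the objective stops improving once the ratio exceeds 1.1, which bounds how far a single update can move the policy away from the sampling policy.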
- `+1.0` for exact numeric match
- `+0.10` for clean `Final Answer: <number>` on the last line
- `-0.20` for missing the required format
- Penalties for prompt-leaking artifacts (`Question:`, repeated `Final Answer:`, etc.)
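A sketch of how such a reward function might look (the regex, constants, and penalty list are simplified relative to the repo's actual scorer):

```python
import re

def score_completion(completion: str, gold: str) -> float:
    """Score one sampled completion against the gold numeric answer (illustrative)."""
    reward = 0.0
    lines = completion.strip().splitlines()
    last_line = lines[-1] if lines else ""
    m = re.match(r"^Final Answer:\s*(-?[\d,]+(?:\.\d+)?)\s*$", last_line)
    if m:
        reward += 0.10                      # clean format bonus
        pred = m.group(1).replace(",", "")
        if pred == gold:
            reward += 1.0                   # exact numeric match
    else:
        reward -= 0.20                      # missing required format
    if "Question:" in completion:           # prompt-leak penalty (value assumed)
        reward -= 0.20
    return reward
```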
Advantages are standardized per group: `adv = (r - mean) / std`.
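In Python, the per-group standardization is a few lines; a degenerate group where every sample earns the same reward yields zero advantages, since the standard deviation would otherwise be zero:

```python
def group_advantages(rewards):
    """Standardize one group's rewards: adv = (r - mean) / std."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0.0:
        return [0.0] * n  # identical rewards carry no learning signal
    return [(r - mean) / std for r in rewards]
```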
| File | Description |
|---|---|
| `Tinker_RFT.ipynb` | Colab notebook (recommended entry point) |
| `Tinker_RFT.py` | Equivalent standalone Python script |
| `results.png` | Training log and final eval output |
