# Reward-Tuned, Reality-Checked: Lessons from GRPO Fine-Tuning
## 🔧 Step 1 – Environment Setup

This notebook is optimized for Google Colab and aims to support **GRPO fine-tuning** with **LLaMA 3.1 8B** using **Unsloth** and **vLLM**.

Key setup notes:

- ✅ Enables 4-bit quantization for memory efficiency (via `bitsandbytes`)
- ✅ Installs `Unsloth`, `trl`, and `peft` to support GRPO-style fine-tuning
- ✅ Uses `vLLM` for efficient inference and execution (patched for compatibility)
- ✅ Pulls in `datasets`, `sentencepiece`, and `huggingface_hub` to support tokenizer and dataset access

This setup ensures:
- 🔋 Lower memory usage (thanks to quantization)
- ⚙️ Compatibility with GRPO training patterns
- 📚 Access to the Open R1 Math dataset via the Hugging Face Hub

> Note: Dependencies are installed in a Colab-safe way and include patches to resolve known incompatibilities between `vLLM`, `transformers`, and `xformers`.
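To make the patch step concrete, here is a minimal sketch of what the regular expression in the install cell does to a requirements file. The sample contents below are hypothetical and stand in for the file downloaded from the vLLM repository; only the pattern itself is taken from the notebook.

```python
import re

# Hypothetical sample; the real file is fetched from the vLLM repository.
sample = (
    b"numpy<2.0.0\n"
    b"requests>=2.26.0\n"
    b"transformers>=4.48.2\n"
    b"tqdm\n"
    b"xformers==0.0.28\n"
)

# Same pattern as the install cell: drop any line that names one of the
# three packages followed by at least one more character.
pattern = rb"(transformers|numpy|xformers)[^\n]{1,}\n"
filtered = re.sub(pattern, b"", sample)

print(filtered.decode())
```

Lines that name `transformers`, `numpy`, or `xformers` with a version specifier are dropped, so the following `pip install -r` call does not request those packages directly. A bare line such as `numpy` with nothing after the name would not match, because the pattern requires at least one trailing character.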

### READ THIS:
If Colab prompts you to restart the session after the install cell finishes, restart it, then continue with the next code cell. **You do not need to rerun the environment setup.**

In [1]:
# 🛠️ Detect Colab and install optimized dependencies
import os
if "COLAB_" in "".join(os.environ.keys()):
    # Clean install for Colab with 4-bit and GRPO support
    !pip install --no-deps unsloth vllm==0.7.3
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft "trl==0.15.2" triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer

    # Patch vLLM dependency conflicts
    import sys, re, requests
    modules = list(sys.modules.keys())
    for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None
    f = requests.get("https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt").content
    with open("vllm_requirements.txt", "wb") as file:
        file.write(re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", f))
    !pip install -r vllm_requirements.txt


Collecting unsloth
  Downloading unsloth-2025.4.7-py3-none-any.whl.metadata (46 kB)
Collecting vllm==0.7.3
  Downloading vllm-0.7.3-cp38-abi3-manylinux1_x86_64.whl.metadata (25 kB)
Downloading vllm-0.7.3-cp38-abi3-manylinux1_x86_64.whl (264.6 MB)
Downloading unsloth-2025.4.7-py3-none-any.whl (218 kB)
Installing collected packages: vllm, unsloth
Successfully installed unsloth-2025.4.7 vllm-0.7.3
Collecting bitsandbytes
  Downloading bitsandbytes-0.45.5-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting xformers==0.0.29.post3
  Downloading xformers-0.0.29.post3-cp311-cp311-manylinux_2_28_x86_64.whl.meta

### 🔍 Check GPU Availability

Before continuing, make sure you have access to a suitable GPU (e.g., A100 or L4 GPUs).


In [1]:
# ✅ Check if GPU is available
import torch

if torch.cuda.is_available():
    print("✅ GPU is available!")
    print(f"Using: {torch.cuda.get_device_name(0)}")
else:
    print("🚫 GPU not found! Go to Runtime → Change runtime type → GPU")


✅ GPU is available!
Using: NVIDIA A100-SXM4-40GB


## 🧠 Step 2 – Load LLaMA 3.1 8B with Unsloth + LoRA

We'll now load the **Meta LLaMA 3.1 8B Instruct** model using **Unsloth's `FastLanguageModel`**, which simplifies integration with **LoRA** fine-tuning and supports 4-bit quantization out of the box.

This step includes:

- 🔐 **Authentication**: Provide your Hugging Face token for gated model access.
- 🧩 **Model Initialization**:
  - Loads the base model in 4-bit precision to save memory
  - Prepares it for **GRPO-style fine-tuning** with LoRA adapters
  - Applies target modules relevant for causal language modeling

Key configuration options:
- `max_seq_length = 2048` — maximum input size for GRPO samples
- `lora_rank = 64` — controls the size of trainable LoRA adapter matrices
- `use_gradient_checkpointing = "unsloth"` — enables memory-efficient training
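As a rough check on what `lora_rank = 64` implies, the sketch below counts the LoRA parameters added across the seven target modules passed to `get_peft_model` in the next cell. Each adapted linear layer gains two matrices with `rank * (in_features + out_features)` parameters in total. The layer dimensions are assumptions taken from the published Llama 3 8B configuration, not from this notebook, and the helper name is hypothetical.

```python
# Architecture values are assumptions from the published Llama 3 8B config.
HIDDEN = 4096          # model hidden size
INTERMEDIATE = 14336   # MLP intermediate size
KV_DIM = 1024          # 8 key/value heads x 128 head dimension
NUM_LAYERS = 32
LORA_RANK = 64

# (in_features, out_features) for each adapted projection
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank, shapes, num_layers):
    """Count LoRA parameters: rank * (in + out) per module, per layer."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


total = lora_param_count(LORA_RANK, TARGET_SHAPES, NUM_LAYERS)
print(f"{total:,}")
```

Under these assumptions the count comes to 167,772,160, which agrees with the trainable-parameter figure the trainer reports in its banner during Step 6.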


In [2]:
# Set the HF_TOKEN environment variable (used to authorize access to the model from Hugging Face).

from getpass import getpass
import os

os.environ["HF_TOKEN"] = getpass("🔑 Enter your Hugging Face token: ")



🔑 Enter your Hugging Face token: ··········


In [3]:
from unsloth import FastLanguageModel
import torch

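# Note: this repo ID is the Llama 3 8B Instruct checkpoint; Llama 3.1 is published under a separate repo ID.
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"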
max_seq_length = 2048
lora_rank = 64

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = max_seq_length,
    load_in_4bit = True,  # Load weights in 4-bit precision to save memory
    token = os.environ["HF_TOKEN"]
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth",
    random_state = 42,
)


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 05-07 03:48:30 __init__.py:207] Automatically detected platform cuda.
==((====))==  Unsloth 2025.4.7: Fast Llama patching. Transformers: 4.51.3. vLLM: 0.7.3.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2025.4.7 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


## 🧮 Step 3 – Load and Preprocess the Open R1 Math Dataset

We'll fine-tune the model on the **OpenR1-Math-Raw** dataset. Each entry contains a challenging math problem and its detailed solution — ideal for training models to reason step-by-step.

To prepare the data:

1. **Load** the raw dataset directly from Hugging Face 🤗 Datasets Hub.
2. **Reformat** each example into an instruction-following format:
   ```
   ### Instruction:
   <problem>

   ### Response:
   <solution>
   ```
3. **Tokenize** the formatted text using Unsloth's tokenizer, keeping sequences within a 2048-token limit.

This format aligns with the expectations of instruction-tuned models and supplies the prompts used for GRPO training in the following steps.


In [4]:
from datasets import load_dataset

# Load the OpenR1-Math-Raw dataset
dataset = load_dataset("open-r1/OpenR1-Math-Raw", split="train")

# Preview a sample
print(dataset[0])


README.md:   0%|          | 0.00/3.29k [00:00<?, ?B/s]

Resolving data files:   0%|          | 0/39 [00:00<?, ?it/s]

Downloading data:   0%|          | 0/39 [00:00<?, ?files/s]

Generating train split:   0%|          | 0/516499 [00:00<?, ? examples/s]

Loading dataset shards:   0%|          | 0/37 [00:00<?, ?it/s]

{'problem': '\nProblem 1. Find all prime numbers $p$ for which there exist positive integers $x, y$ and $z$ such that the number\n\n$$\nx^{p}+y^{p}+z^{p}-x-y-z\n$$\n\nis a product of exactly three distinct prime numbers.\n', 'solution': "\nSolution. Let $A=x^{p}+y^{p}+z^{p}-x-y-z$. For $p=2$, we take $x=y=4$ and $z=3$. Then $A=30=2 \\cdot 3 \\cdot 5$. For $p=3$ we can take $x=3$ and $y=2$ and $z=1$. Then again $A=30=2 \\cdot 3 \\cdot 5$. For $p=5$ we can take $x=2$ and $y=1$ and $z=1$. Again $A=30=2 \\cdot 3 \\cdot 5$.\n\nAssume now that $p \\geqslant 7$. Working modulo 2 and modulo 3 we see that $A$ is divisible by both 2 and 3. Moreover, by Fermat's Little Theorem, we have\n\n$$\nx^{p}+y^{p}+z^{p}-x-y-z \\equiv x+y+z-x-y-z=0 \\bmod p \\text {. }\n$$\n\nTherefore, by the given condition, we have to solve the equation\n\n$$\nx^{p}+y^{p}+z^{p}-x-y-z=6 p\n$$\n\nIf one of the numbers $x, y$ and $z$ is bigger than or equal to 2 , let's say $x \\geqslant 2$, then\n\n$$\n6 p \\geqslant x^{p}

In [5]:
# Format each item as a full instruction-completion string
def format_prompt(example):
    return {
        "prompt": f"### Instruction:\n{example['problem']}\n\n### Response:\n{example['solution']}"
    }


formatted_dataset = dataset.map(format_prompt)

# Tokenize using Unsloth's tokenizer
def tokenize_prompt(example):
    return tokenizer(
        example["prompt"],
        truncation=True,
        max_length=2048,
    )

tokenized_dataset = formatted_dataset.map(tokenize_prompt, remove_columns=dataset.column_names)

print("✅ Dataset formatted and tokenized.")


Map:   0%|          | 0/516499 [00:00<?, ? examples/s]

Map:   0%|          | 0/516499 [00:00<?, ? examples/s]

✅ Dataset formatted and tokenized.
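`GRPOTrainer` samples its own completions for each prompt, so one alternative worth considering is to keep the solution out of the prompt and carry it in a separate column instead. The sketch below assumes that extra dataset columns reach the reward functions as keyword arguments under their column names, which is how `trl` describes custom reward functions. The helper name `to_grpo_example` is hypothetical, and the column name `references` is chosen to match the keyword that the reward functions in Step 4 accept.

```python
def to_grpo_example(example):
    """Build a prompt that ends where the model should start writing,
    and keep the ground-truth solution in a separate `references` field."""
    return {
        "prompt": f"### Instruction:\n{example['problem']}\n\n### Response:\n",
        "references": example["solution"],
    }


# Toy record standing in for one dataset row (hypothetical values).
sample = {"problem": "What is 2 + 3?", "solution": "5"}
row = to_grpo_example(sample)

print(row["prompt"])
print(row["references"])
```

With 🤗 Datasets this could be applied as `dataset.map(to_grpo_example, remove_columns=dataset.column_names)`. Pre-tokenizing is likely unnecessary in that setup, since the trainer tokenizes the `prompt` column itself.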


## 🧠 Step 4 – Define Reward Functions

With our dataset tokenized, it's time to implement the **Group Relative Policy Optimization (GRPO)** loop.

We'll define simple **reward functions** that evaluate how well a generated solution resembles the ground-truth answer. For this example, we'll reward completions that:

- Contain the correct answer (+1)
- Consist only of digits and math symbols (+0.5)
- Include both reasoning and a final answer, detected by the keywords "because" and "answer" (+0.5)

These functions are called by the `GRPOTrainer` during training.


In [17]:
# 🎯 Define Reward Functions for GRPO Fine-Tuning
import re

# Matches completions made up only of digits, whitespace, and basic math symbols
NUMERIC_RE = re.compile(r'^[\d\s\+\-\*/\(\)=\.]+$')

def reward_correct_answer(prompts, completions, references=None, **kwargs):
    """+1 if the stripped completion appears verbatim inside the reference, else 0.

    `references` is only populated when the training dataset has a column
    with that name; otherwise every completion scores 0.
    """
    if references is None:
        return [0.0] * len(completions)
    return [1.0 if pred.strip() in ref else 0.0
            for pred, ref in zip(completions, references)]

def reward_numeric_only(prompts, completions, references=None, **kwargs):
    """+0.5 if the whole completion contains only digits or math symbols"""
    return [0.5 if NUMERIC_RE.match(pred.strip()) else 0.0
            for pred in completions]

def reward_has_reasoning_and_answer(prompts, completions, references=None, **kwargs):
    """+0.5 if the response contains both "because" and "answer" (case-insensitive)"""
    return [0.5 if ("because" in pred.lower() and "answer" in pred.lower()) else 0.0
            for pred in completions]
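A quick way to see how the correctness reward behaves is to call it on toy inputs. The sketch below repeats the notebook's substring check and, for comparison, adds a hypothetical alternative (`reward_final_number`, not part of the original run) that compares the last number in the completion with the last number in the reference. The sample strings are made up, and the number-matching heuristic is deliberately crude.

```python
import re


def reward_correct_answer(prompts, completions, references=None, **kwargs):
    """Same logic as the notebook: the completion must be a substring of the reference."""
    if references is None:
        return [0.0] * len(completions)
    return [1.0 if pred.strip() in ref else 0.0
            for pred, ref in zip(completions, references)]


# Hypothetical alternative: compare the last number found in each text.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def last_number(text):
    found = _NUMBER.findall(text)
    return found[-1] if found else None


def reward_final_number(prompts, completions, references=None, **kwargs):
    if references is None:
        return [0.0] * len(completions)
    scores = []
    for pred, ref in zip(completions, references):
        guess = last_number(pred)
        scores.append(1.0 if guess is not None and guess == last_number(ref) else 0.0)
    return scores


prompts = ["What is 2 + 3?"] * 2
completions = ["5", "Because 2 + 3 = 5, the answer is 5"]
references = ["The sum is 5", "The sum is 5"]

no_refs = reward_correct_answer(prompts, completions)
substring = reward_correct_answer(prompts, completions, references=references)
final_num = reward_final_number(prompts, completions, references=references)

print(no_refs, substring, final_num)
```

Without a `references` argument every completion scores zero. With references, the substring check rewards only the bare `"5"`, because a longer worked answer is not contained in the reference text, while the number-based variant rewards both.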


## Step 5 – Instantiate GRPOTrainer

We are using trl's `GRPOTrainer`, which natively supports multiple reward functions for guided fine-tuning.

Each reward function receives:
- A batch of completions (LLM outputs) as a list of strings
- Optionally, a list of reference strings to compare against

The trainer sums the rewards from all functions for each completion and uses the total during training to steer model behavior.
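To see why the spread of rewards matters, here is a simplified sketch of how GRPO-style training can turn per-function rewards into a learning signal: rewards are summed per completion, then normalized against the other completions sampled for the same prompt. The function names and the `eps` value are illustrative assumptions; `trl` performs this on tensors and may differ in detail.

```python
from statistics import mean, stdev


def combine_rewards(per_function_rewards):
    """Sum the scores each reward function gave to the same completion."""
    return [sum(vals) for vals in zip(*per_function_rewards)]


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for a single prompt."""
    mu = mean(rewards)
    sigma = stdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt, scored by two reward functions.
correct = [1.0, 0.0, 0.0, 0.0]
reasoning = [0.5, 0.5, 0.0, 0.0]

totals = combine_rewards([correct, reasoning])
mixed = group_advantages(totals)

# A group where no completion earned any reward.
flat = group_advantages([0.0, 0.0, 0.0, 0.0])

print(totals)
print(mixed)
print(flat)
```

When every completion in a group receives the same reward, all advantages are zero and that group contributes no reward-driven signal, which is worth keeping in mind when reading a log where `reward` and `reward_std` are both 0.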

> Note: Training was limited to 100 steps, which took 1:14:15. Running all 129,124 steps would have taken an estimated 112 days.


In [20]:
from trl import GRPOTrainer, GRPOConfig

# ⚙️ GRPO Training Configuration
grpo_args = GRPOConfig(
    output_dir="llama3-math-grpo",
    max_steps=100,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    save_strategy="no",
    logging_steps=5,
    learning_rate=2e-4,
    bf16=True,
    optim="adamw_8bit",
    seed=42,
)

# 🏋️ GRPOTrainer (uses reward functions directly)
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        reward_correct_answer,
    ],
    args=grpo_args,
    train_dataset=tokenized_dataset,
)


Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 4 to the `num_generations` of 8


## Step 6 – Fine-Tuning with GRPO

We now launch the fine-tuning process using the `GRPOTrainer` from 🤗 `trl`, powered by Unsloth’s efficient training stack.

Key configurations:
- Gradient checkpointing and 8-bit optimizers for memory efficiency
- Packing disabled due to known issues with Hugging Face’s implementation

This step aims to produce a reasoning-optimized LLaMA 3.1 8B model trained on mathematical proof problems.

### Important Factors Impacting Tuning
##### 🔸 Dataset: 516,499 samples
##### 🔸 Steps: 129,124
##### 🔸 Batch Size: 8 per device × 4 gradient accumulation steps × 1 GPU = 32
##### 🔸 Trainable parameters: 167M (2.1% of 8B)
##### 🔸 GPU: A100
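The step count follows from the other numbers in this list. The sketch below assumes that the batch size counts sampled completions rather than prompts, and that each prompt is sampled `num_generations = 8` times (the value named in the Unsloth message in Step 5). The function name is hypothetical.

```python
def total_optimizer_steps(num_prompts, num_generations, per_device_batch,
                          grad_accum, num_gpus=1):
    """Estimate optimizer steps for one epoch when batches hold completions."""
    completions_per_step = per_device_batch * grad_accum * num_gpus
    prompts_per_step = completions_per_step // num_generations
    return num_prompts // prompts_per_step


steps = total_optimizer_steps(
    num_prompts=516_499,
    num_generations=8,
    per_device_batch=8,
    grad_accum=4,
)
print(f"{steps:,}")
```

With a total batch of 32 completions and 8 completions per prompt, each step consumes 4 prompts, so 516,499 prompts give 129,124 full steps. That agreement with the reported figure supports the assumption but does not prove it.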

In [21]:
trainer.train()


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 516,499 | Num Epochs = 1 | Total steps = 100
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 2 x 1) = 16
 "-____-"     Trainable parameters = 167,772,160/8,000,000,000 (2.10% trained)


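| Step | Training Loss | Reward | Reward Std | Completion Length | KL | Correct Answer |
|------|---------------|--------|------------|-------------------|----|----------------|
| 5    | 0.0           | 0.0    | 0.0        | 236.2875          | 0.00049  | 0.0 |
| 10   | 0.0001        | 0.0    | 0.0        | 219.7625          | 0.001985 | 0.0 |
| 15   | 0.0004        | 0.0    | 0.0        | 209.9625          | 0.009289 | 0.0 |
| 20   | 0.0002        | 0.0    | 0.0        | 254.1             | 0.006082 | 0.0 |
| 25   | 0.0005        | 0.0    | 0.0        | 253.3125          | 0.012818 | 0.0 |
| 30   | 0.0005        | 0.0    | 0.0        | 256.0             | 0.01319  | 0.0 |
| 35   | 0.0004        | 0.0    | 0.0        | 256.0             | 0.009606 | 0.0 |
| 40   | 0.0005        | 0.0    | 0.0        | 256.0             | 0.012513 | 0.0 |
| 45   | 0.0187        | 0.0    | 0.0        | 184.4             | 0.467371 | 0.0 |
| 50   | 0.001         | 0.0    | 0.0        | 256.0             | 0.024761 | 0.0 |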


TrainOutput(global_step=100, training_loss=0.0014238211774500087, metrics={'train_runtime': 4500.2421, 'train_samples_per_second': 0.356, 'train_steps_per_second': 0.022, 'total_flos': 0.0, 'train_loss': 0.0014238211774500087})

## 🧪 Original GRPO Training Run (Console Log Snapshot)

We ran the GRPO fine-tuning using the full OpenR1-Math dataset and all three custom reward functions. Below is the captured console log from the first few steps:

```
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 516,499 | Num Epochs = 1 | Total steps = 129,124
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 4 x 1) = 32
 "-____-"     Trainable parameters = 167,772,160/8,000,000,000 (2.10% trained)
Unsloth: Will smartly offload gradients to save VRAM!
 [ 2/129124 : < :, Epoch 0.00/1]
```

### 📊 Reward Breakdown (Early Steps)

| Step | Training Loss | Reward   | Reward Std | Completion Length | KL        | Correct Answer | Numeric Only | Reasoning + Answer |
|------|----------------|----------|-------------|--------------------|-----------|----------------|----------------|---------------------|
| 5    | 0.000000       | 0.021875 | 0.048294     | 229.006250          | 0.000187  | 0.000000       | 0.000000       | **0.021875**        |
| 10   | 0.000000       | 0.018750 | 0.046928     | 230.325000          | 0.000268  | 0.000000       | 0.000000       | **0.018750**        |

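**💡 Insight:** At this early stage, rewards came only from `reward_has_reasoning_and_answer`. A mean of about 0.02 from a reward worth 0.5 means roughly 4% of sampled completions contained both keywords. Both `reward_correct_answer` and `reward_numeric_only` returned zero. Two likely explanations follow from the code: `reward_correct_answer` returns 0 whenever no `references` argument reaches it, and `reward_numeric_only` only fires when the entire completion consists of digits and math symbols, which is unlikely for completions averaging around 230 tokens.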
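The numeric-only reward is strict about what it accepts, which a few toy strings make visible. This sketch reuses the regular expression from `reward_numeric_only`; the wrapper name and the sample completions are hypothetical.

```python
import re

# Same pattern as reward_numeric_only in Step 4.
NUMERIC_RE = re.compile(r'^[\d\s\+\-\*/\(\)=\.]+$')


def numeric_only_reward(completion):
    """Return 0.5 only if the whole completion is digits and math symbols."""
    return 0.5 if NUMERIC_RE.match(completion.strip()) else 0.0


samples = ["42", "2 + 3 = 5", "The answer is 42", "x = 7"]
scores = {s: numeric_only_reward(s) for s in samples}

for text, score in scores.items():
    print(f"{text!r}: {score}")
```

Any letter anywhere in the completion, including a variable name, drops the score to zero, so a model that writes out its reasoning in words cannot earn this reward and the reasoning reward on the same completion.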

### Estimated Processing Times
##### **Full Training** _(A100 GPU; 3 rewards; 129,124 steps; LoRA rank 64):_ 112 days (~3.7 months)
##### **Sample Training** _(A100 GPU; 1 reward; 129,124 steps; LoRA rank 64):_ 59 days (~2 months)