# Unsloth: Reinforcement Learning with GRPO (Reasoning)

This notebook demonstrates how to use **GRPO (Group Relative Policy Optimization)** with Unsloth.

GRPO is well suited to reasoning tasks. For each prompt, the model samples a group of completions, a reward function scores each one, and the policy is updated according to how each completion's reward compares with the rest of its group.
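To make "relative quality" concrete, here is a minimal sketch of the group-relative advantage that GRPO builds on: each reward is normalized against the mean and standard deviation of its own group. The function name `group_relative_advantages`, the `eps` value, and the use of the population standard deviation are assumptions for illustration; library implementations may differ in these details.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, four sampled completions, two of them rewarded as correct
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))

# If every completion gets the same reward, all advantages are zero
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
```

The second call shows why a reward function that never fires gives no learning signal: with identical rewards in a group, every advantage is zero, which is consistent with a training log whose `reward` and `reward_std` are both 0.0.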

In [3]:
%%capture
!pip install unsloth
!pip install --no-deps "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

In [4]:
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel) # Patch GRPO into Unsloth

from unsloth import is_bfloat16_supported
import torch

max_seq_length = 2048
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit", # Using a slightly larger model for reasoning
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.11.4: Fast Llama patching. Transformers: 4.57.2.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



### LoRA Configuration

In [5]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)

Unsloth 2025.11.4 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
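As a sanity check on the LoRA settings above, the number of trainable parameters can be worked out by hand: each adapted weight matrix of shape `(out, in)` gains two low-rank factors with `r * (in + out)` parameters in total. The sketch below assumes the Llama 3.2 3B dimensions (hidden size 3072, MLP size 8192, key/value projection width 1024, 28 layers); these values are not stated in this notebook, so treat them as assumptions.

```python
# Assumed Llama 3.2 3B dimensions (not stated in this notebook)
hidden, intermediate, kv_dim, layers, r = 3072, 8192, 1024, 28, 16

# (in_features, out_features) for each projection listed in target_modules
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

# LoRA adds A (r x in) and B (out x r) for every adapted matrix
per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
total = per_layer * layers
print(f"{total:,}")  # 24,313,856
```

Under these assumptions the result matches the `Trainable parameters = 24,313,856` figure that the training banner reports later in this notebook.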


### Data Preparation
We will use the **GSM8K** dataset, which consists of grade school math problems. Each reference answer contains the worked solution followed by the final number after `####`. GRPO samples several solutions per problem and rewards the correct ones.

In [11]:
from datasets import load_dataset
dataset = load_dataset("gsm8k", "main", split = "train")
dataset = dataset.rename_column("question", "prompt")  # GRPOTrainer reads prompts from the "prompt" column

# Simple reward function: check whether the completion contains the ground-truth final answer.
# In a real scenario, this would be more complex.
def reward_func(prompts, completions, **kwargs):
    rewards = []
    for completion, ground_truth in zip(completions, kwargs["answer"]):
        # GSM8K answers hold the full worked solution and end with "#### <number>",
        # so compare against that final number only (still a basic substring check).
        final_answer = ground_truth.split("####")[-1].strip()
        if final_answer in completion:
            rewards.append(1.0)
        else:
            rewards.append(0.0)
    return rewards
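The comments in the reward function note that a real setup would parse the number. One possible way to do that is sketched below: take the last number in the completion and compare it numerically with the value after `####` in the GSM8K answer. The helper names `extract_gsm8k_answer`, `extract_last_number`, and `numeric_reward_func` are hypothetical, and the sketch assumes completions arrive as plain strings, which is the case when the prompts are plain text rather than chat messages.

```python
import re

def extract_gsm8k_answer(answer_text):
    """Return the number after '####' in a GSM8K reference answer."""
    return answer_text.split("####")[-1].strip().replace(",", "")

def extract_last_number(text):
    """Return the last number found in a completion, or None if there is none."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return matches[-1].replace(",", "") if matches else None

def numeric_reward_func(prompts, completions, **kwargs):
    rewards = []
    for completion, ground_truth in zip(completions, kwargs["answer"]):
        predicted = extract_last_number(completion)
        expected = extract_gsm8k_answer(ground_truth)
        try:
            # Compare as floats so "72" and "72.0" count as the same answer
            correct = predicted is not None and float(predicted) == float(expected)
        except ValueError:
            correct = False
        rewards.append(1.0 if correct else 0.0)
    return rewards
```

Unlike a substring check, this does not reward a completion ending in 172 when the expected answer is 72. It still assumes that the last number in the completion is the model's final answer, which a stricter output format would make more reliable.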

In [15]:
from trl import GRPOTrainer, GRPOConfig

trainer = GRPOTrainer(
    model = model,
    reward_funcs = reward_func,
    train_dataset = dataset,
    args = GRPOConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
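        warmup_steps = 5, # Longer than max_steps below, so this short run never leaves warmup
        max_steps = 2, # Smoke test only; increase for real training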
        learning_rate = 5e-6,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Prevent wandb logging error
    ),
)

In [16]:
trainer.train()

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 7,473 | Num Epochs = 1 | Total steps = 2
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 24,313,856 of 3,237,063,680 (0.75% trained)


| Step | Training Loss | reward | reward_std | completions / mean_length | completions / min_length | completions / max_length | completions / clipped_ratio | completions / mean_terminated_length | completions / min_terminated_length | completions / max_terminated_length | sampling / sampling_logp_difference / mean | sampling / sampling_logp_difference / max | sampling / importance_sampling_ratio / min | sampling / importance_sampling_ratio / mean | sampling / importance_sampling_ratio / max | kl | rewards / reward_func / mean | rewards / reward_func / std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 0.0 | 0.0 | 199.375 | 134.0 | 256.0 | 0.25 | 180.5 | 134.0 | 244.0 | 0 | 0 | 0 | 0 | 0 | 0.000162 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 256.0 | 256.0 | 256.0 | 1.0 | 0.0 | 0.0 | 0.0 | No Log | No Log | No Log | No Log | No Log | 0.000128 | 0.0 | 0.0 |


TrainOutput(global_step=2, training_loss=1.4510725776517575e-07, metrics={'train_runtime': 41.6453, 'train_samples_per_second': 0.384, 'train_steps_per_second': 0.048, 'total_flos': 0.0, 'train_loss': 1.4510725776517575e-07})