# Unsloth: Reinforcement Learning with GRPO (Reasoning)

This notebook demonstrates how to use **GRPO (Group Relative Policy Optimization)** with Unsloth.

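GRPO is particularly effective for reasoning tasks. For each prompt, the model samples a group of completions, a reward function scores each one, and the policy is updated according to how each completion's reward compares with the rest of its group. No separate value model is needed.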
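To make the "relative quality" idea concrete, here is a minimal sketch of how rewards within one group can be turned into advantages. The helper name `group_relative_advantages`, the `eps` value, and the use of the population standard deviation are illustrative assumptions, not the exact code TRL or Unsloth runs internally.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions sampled for the same prompt.

    Hypothetical helper for illustration: subtract the group mean and divide by
    the group standard deviation (eps avoids division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled completions for one prompt; the first and last were correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # correct completions get positive values, incorrect ones negative
```

Note that when every completion in a group earns the same reward, all advantages are zero, so that group contributes no learning signal. This is one reason a reward function that almost never fires makes training stall.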

In [None]:
%%capture
!pip install unsloth
!pip install --no-deps "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

In [None]:
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel) # Patch GRPO into Unsloth

from unsloth import is_bfloat16_supported
import torch

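max_seq_length = 2048 # Maximum context length (prompt + completion) in tokens
dtype = None          # None = auto-detect (float16 or bfloat16 depending on the GPU)
load_in_4bit = True   # Load 4-bit quantized weights to reduce memory usage

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit", # 3B instruct model, pre-quantized to 4-bit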
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

### LoRA Configuration

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)

### Data Preparation
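We will use the **GSM8K** dataset, which consists of grade school math word problems. Each record has a `question` and an `answer` field; the `answer` contains a worked solution whose last line has the form `#### <number>`. During training, GRPO samples several solutions per question and rewards those that reach the correct final number.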

In [None]:
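import re
from datasets import load_dataset

dataset = load_dataset("gsm8k", "main", split = "train")

# GSM8K answers end with "#### <number>"; keep only that final number as the label
def extract_final_answer(answer_text):
    return answer_text.split("####")[-1].strip().replace(",", "")

# GRPOTrainer reads prompts from a "prompt" column; the remaining columns
# (here "answer") are passed to the reward function as keyword arguments
dataset = dataset.map(lambda x: {
    "prompt": x["question"],
    "answer": extract_final_answer(x["answer"]),
})

# Simple reward function: 1.0 if the last number in the completion matches the ground truth
# In a real scenario, this would be more complex
def reward_func(prompts, completions, **kwargs):
    rewards = []
    for completion, ground_truth in zip(completions, kwargs["answer"]):
        # Strip thousands separators, then take the last number the model wrote
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
        rewards.append(1.0 if numbers and numbers[-1] == ground_truth else 0.0)
    return rewards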
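As the comment in the cell above notes, a practical reward is usually more involved. One possible refinement is a second reward that gives partial credit for well-formed output. The sketch below assumes completions arrive as plain strings and that the prompt asks the model to finish with a `#### <number>` line, mirroring the GSM8K labels. The name `format_reward_func`, the pattern, and the 0.5 value are hypothetical choices for illustration.

```python
import re

# Hypothetical convention: the completion should end with "#### <number>"
FINAL_ANSWER_PATTERN = re.compile(r"####\s*-?[\d,]+(?:\.\d+)?\s*$")

def format_reward_func(prompts, completions, **kwargs):
    """Return 0.5 for completions that end with a '#### <number>' line, else 0.0."""
    return [
        0.5 if FINAL_ANSWER_PATTERN.search(completion.strip()) else 0.0
        for completion in completions
    ]

sample_completions = [
    "Natalia sold 48 + 24 = 72 clips.\n#### 72",
    "The answer is 72.",
]
print(format_reward_func(prompts = None, completions = sample_completions))
# [0.5, 0.0]
```

`GRPOTrainer` accepts a list for `reward_funcs`, so a format reward like this could be supplied alongside the correctness reward; whether that helps on this task is something to verify empirically.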

In [None]:
from trl import GRPOConfig, GRPOTrainer

# GRPOTrainer expects a GRPOConfig (a TrainingArguments subclass) rather than plain
# TrainingArguments. The effective batch size must be divisible by num_generations
# (the group size); adjust one of them if TRL reports a divisibility error.
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = reward_func,
    train_dataset = dataset,
    args = GRPOConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 5e-6,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

In [None]:
trainer.train()