# GRPO EssentialAI/rnj-1-instruct with QLoRA using TRL

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/grpo_rnj_1_instruct.ipynb)

![trl banner](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/trl_banner_dark.png)


With [**Transformer Reinforcement Learning (TRL)**](https://github.com/huggingface/trl), you can fine-tune cutting-edge large language models. TRL supports **QLoRA**, a quantized, parameter-efficient fine-tuning technique, so we can use Colab to fine-tune models like [EssentialAI/rnj-1-instruct](https://huggingface.co/collections/EssentialAI/rnj-1).


- [TRL GitHub Repository](https://github.com/huggingface/trl): star us to support the project!  
- [Official TRL Examples](https://huggingface.co/docs/trl/example_overview)  
- [Community Tutorials](https://huggingface.co/docs/trl/community_tutorials)

In this notebook, we'll add reasoning capabilities to the model, teaching it to generate reasoning traces (`<think></think>`) before giving us the final answer (`<answer></answer>`).

## Install dependencies

We'll install **TRL** with the **PEFT** extra, which ensures all main dependencies such as **Transformers** and **PEFT** (a package for parameter-efficient fine-tuning, e.g., LoRA/QLoRA) are included. Additionally, we'll install **trackio** to log and monitor our experiments, and **bitsandbytes** to enable quantization of LLMs, reducing memory consumption for both inference and training.

In [None]:
!pip install -Uq "trl[peft]" bitsandbytes trackio math_verify

### Log in to Hugging Face

Log in to your **Hugging Face** account to save your fine-tuned model, track your experiment results directly on the Hub, or access gated models. You can find your **access token** on your [account settings page](https://huggingface.co/settings/tokens).

In [None]:
from huggingface_hub import notebook_login

notebook_login()

## Load dataset


We'll load the [**AI-MO/NuminaMath-TIR**](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) dataset from the Hugging Face Hub using the `datasets` library.

This dataset contains math problems along with solutions written in a step-by-step reasoning format tailored for LLMs. Training on it is meant to improve the model's mathematical reasoning.

> We only use a subset for educational purposes. In a real scenario, we'd use the complete dataset.

In [None]:
from datasets import load_dataset

dataset_id = 'AI-MO/NuminaMath-TIR'
train_dataset = load_dataset(dataset_id, split='train[:5%]')

In addition to the existing columns, we add a custom system prompt that tells the model how we'd like it to format its generations.

This system prompt is an adapted version of the original one extracted from **DeepSeek R1**. For additional background, see [this previous recipe](https://huggingface.co/learn/cookbook/fine_tuning_llm_grpo_trl). We extend the prompt with **examples** and a **more explicit, verbose formulation** to make the desired behavior easier for the model to learn. Depending on your goals, you may further enrich the prompt to simplify learning, or intentionally shorten and harden it to encourage more robust and generalizable behavior.

We convert the dataset samples into conversation samples, including the system prompt and problem description per sample, since this is how the GRPO trainer expects them.

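Prompts must also be left-padded (`padding_side="left"`) so that the completions generated during training are concatenated directly after each prompt. `GRPOTrainer` loads the tokenizer with this setting when no `processing_class` is passed, as is the case in this notebook. This alignment is essential for GRPO to compute token-level probabilities correctly for the group of completions sampled for each prompt.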
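To make the padding point concrete, here is a minimal pure-Python sketch of left versus right padding. No tokenizer is involved: the token ids and `PAD_ID` are made up for illustration, and `pad_batch` is a hypothetical helper, not a TRL function.

```python
PAD_ID = 0  # hypothetical padding token id, for illustration only

def pad_batch(sequences, side="left", pad_id=PAD_ID):
    """Pad variable-length lists of token ids to a common length."""
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        filler = [pad_id] * (width - len(seq))
        padded.append(filler + seq if side == "left" else seq + filler)
    return padded

prompts = [[11, 12, 13], [21, 22]]  # made-up token ids for two prompts
left = pad_batch(prompts, side="left")
right = pad_batch(prompts, side="right")

print(left)   # [[11, 12, 13], [0, 21, 22]]
print(right)  # [[11, 12, 13], [21, 22, 0]]
```

With left padding, every row ends in a real prompt token, so completion tokens appended at the end follow the prompt directly. With right padding, the shorter prompt would have padding tokens sitting between it and its completion.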

In [None]:
SYSTEM_PROMPT = """A conversation between User and Assistant. The user asks a question, and the Assistant solves it.
The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.
The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags.
Use exactly one <think>...</think> block followed by exactly one <answer>...</answer> block.

Examples:

User: What is 2 + 2?
Assistant:
<think>
I will add 2 and 2 together.
</think>
<answer>4</answer>

User: What is 3 × 5?
Assistant:
<think>
I will multiply 3 by 5.
</think>
<answer>15</answer>

User: Find the GCD of 12 and 18.
Assistant:
<think>
I will list the divisors of 12 and 18 and find the greatest one they have in common.
</think>
<answer>6</answer>
"""

def make_conversation(example):
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["problem"]},
        ],
    }

train_dataset = train_dataset.map(make_conversation)

Let's review one example to understand the internal structure:

In [None]:
print(train_dataset[0])
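As a rough guide to what the printed sample should contain, the sketch below applies the same transformation to a single hypothetical row. The row contents and the shortened `DEMO_SYSTEM_PROMPT` are stand-ins for illustration, not data from the dataset.

```python
DEMO_SYSTEM_PROMPT = "Answer inside <think> and <answer> tags."  # shortened stand-in

def make_conversation_demo(example, system_prompt=DEMO_SYSTEM_PROMPT):
    """Same shape as make_conversation above, with an injectable system prompt."""
    return {
        "prompt": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": example["problem"]},
        ],
    }

row = {"problem": "What is 7 + 5?", "solution": "12"}  # hypothetical row
# Dataset.map merges the returned dict into the existing columns.
mapped = {**row, **make_conversation_demo(row)}

print(sorted(mapped))                          # ['problem', 'prompt', 'solution']
print([m["role"] for m in mapped["prompt"]])   # ['system', 'user']
```

The new `prompt` column holds a two-message conversation, while the original columns remain until they are removed explicitly.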

And remove the columns that are not needed for training:

In [None]:
train_dataset = train_dataset.remove_columns(['messages', 'problem'])
print(train_dataset)

## Load model and configure LoRA/QLoRA

This notebook can be used with two fine-tuning methods. By default, it is set up for **QLoRA**, which includes quantization using `BitsAndBytesConfig`. If you prefer to use standard **LoRA** without quantization, simply comment out the `BitsAndBytesConfig` configuration.

In [None]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_name = "EssentialAI/rnj-1-instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="float32",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16
    ),
)
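To get a feel for why 4-bit quantization matters on a Colab GPU, the following back-of-the-envelope sketch estimates the memory needed for the weights alone at different precisions. `NUM_PARAMS` is a hypothetical parameter count chosen for illustration (check the model card for the real figure), and the estimate ignores quantization constants, activations, adapter weights, and optimizer state.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

NUM_PARAMS = 8e9  # hypothetical parameter count; check the model card

for label, bits in [("float32", 32), ("float16", 16), ("nf4 (4-bit)", 4)]:
    print(f"{label:>12}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions, 4-bit storage needs one eighth of the float32 footprint, which is what makes QLoRA training feasible on smaller GPUs.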

The following cell defines LoRA (or QLoRA if needed). When training with LoRA/QLoRA, we use a **base model** (the one selected above) and, instead of modifying its original weights, we fine-tune a **LoRA adapter**, a lightweight layer that enables efficient and memory-friendly training. The **`target_modules`** specify which parts of the model (e.g., attention or projection layers) will be adapted by LoRA during fine-tuning.

In [None]:
from peft import LoraConfig

# You may need to update `target_modules` depending on the architecture of your chosen model.
# For example, different LLMs might have different attention/projection layer names.
peft_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj",],
)
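To illustrate why the adapter is lightweight, this sketch counts the trainable parameters LoRA adds next to a single frozen weight matrix. It uses the `r` and `lora_alpha` values from the config above; the layer size of 4096 is a hypothetical value for illustration, not the model's actual dimension.

```python
def lora_trainable_params(d_in, d_out, r):
    """LoRA adds A (r x d_in) and B (d_out x r) next to a frozen d_out x d_in weight."""
    return r * (d_in + d_out)

r, lora_alpha = 32, 32   # values from the LoraConfig above
d_in = d_out = 4096      # hypothetical layer size, for illustration only

full = d_in * d_out
lora = lora_trainable_params(d_in, d_out, r)
scaling = lora_alpha / r  # standard LoRA scaling applied to the update B @ A

print(f"full matrix: {full:,} params")
print(f"LoRA adapter: {lora:,} params ({lora / full:.2%} of the full matrix)")
print(f"scaling factor: {scaling}")
```

With `lora_alpha` equal to `r`, the adapter update is applied with a scaling factor of 1.0. Raising `r` increases capacity and memory use together.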


## Train model

We'll configure **GRPO** using `GRPOConfig`, keeping the parameters minimal so the training fits on a Colab instance. You can adjust these settings depending on the resources available. For full details on all available parameters, check the [TRL GRPOConfig documentation](https://huggingface.co/docs/trl/grpo_trainer#trl.GRPOConfig).

First, we need to define the reward functions that the training algorithm will use to improve the model. In this case, we'll include just one reward function.
We'll use a format reward that rewards the model when the output includes `<think>` and `<answer>` tags. This is a simplification of the pipeline for educational purposes; in a real scenario, you'd need at least one more reward function to check the correctness of the model's answer. The function is adapted from [the open-r1 rewards module](https://github.com/huggingface/open-r1/blob/main/src/open_r1/rewards.py).

> 💡 **Note**:  
> You can further refine this reward by making it more granular. For example, assigning partial rewards when `<think>` and `<answer>` appear independently, or when they are present but incorrectly ordered. This can make the learning signal denser and speed up early training. However, overly simplifying the reward may reduce robustness, even if it helps the model converge faster. In practice, there is a trade-off between ease of learning and the generalization quality of the final model.

In [None]:
import re

def format_reward(completions, **kwargs):
    """Reward function that checks if the reasoning process is enclosed within <think> and </think> tags, while the final answer is enclosed within <answer> and </answer> tags."""
    pattern = r"<think>.*?</think>.*?<answer>.*?</answer>"

    matches = []
    for item in completions:
        if isinstance(item, list):
            text = item[0]['content']
        else:
            text = item
        match = re.match(pattern, text, re.DOTALL | re.MULTILINE)
        matches.append(match)

    return [1.0 if match else 0.0 for match in matches]
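The note above mentions more granular rewards. One possible variant is sketched below: it gives partial credit for each well-formed block and a bonus when they appear in the right order. The function name `granular_format_reward` and the 0.25/0.25/0.5 weights are arbitrary choices for illustration, not values from TRL or open-r1.

```python
import re

def granular_format_reward(completions, **kwargs):
    """Partial-credit variant: 0.25 per well-formed block, 0.5 more for correct order."""
    think_re = re.compile(r"<think>.*?</think>", re.DOTALL)
    answer_re = re.compile(r"<answer>.*?</answer>", re.DOTALL)

    rewards = []
    for item in completions:
        # Conversational completions arrive as a list of message dicts.
        text = item[0]["content"] if isinstance(item, list) else item
        think = think_re.search(text)
        answer = answer_re.search(text)

        score = 0.0
        if think:
            score += 0.25
        if answer:
            score += 0.25
        if think and answer and think.end() <= answer.start():
            score += 0.5
        rewards.append(score)
    return rewards

samples = [
    "<think>2 + 2 is 4.</think>\n<answer>4</answer>",  # both blocks, right order
    "<answer>4</answer><think>2 + 2 is 4.</think>",    # both blocks, wrong order
    "<think>2 + 2 is 4.</think> The answer is 4.",      # only the think block
    "The answer is 4.",                                 # no tags
]
print(granular_format_reward(samples))  # [1.0, 0.5, 0.25, 0.0]
```

Unlike the binary reward, this version distinguishes "almost right" from "completely wrong", which gives the model a denser signal early in training at the cost of the robustness trade-off described in the note.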

After defining the reward function(s), we can define the `GRPOConfig`. You can adapt the values in the config depending on your training setting and even fit the training in more constrained setups like free Colab (T4).

In [None]:
from trl import GRPOConfig

output_dir = "EssentialAI-rnj-1-instruct-trl-grpo"

# Configure training arguments using GRPOConfig
training_args = GRPOConfig(
    learning_rate=2e-5,                                   # Learning rate used during training
    num_train_epochs=1,                                   # Number of full dataset passes. For testing, use `max_steps` instead
    #max_steps=100,

    # Parameters that control the data preprocessing
    per_device_train_batch_size=8,
    max_completion_length=256, # default: 256             # Max completion length produced during training
    num_generations=8, # default: 8                       # Number of generations produced during training for comparison

    # Parameters related to reporting and saving
    output_dir=output_dir,                                # Where to save model checkpoints and logs
    logging_steps=10,                                     # Log training metrics every N steps
    report_to="trackio",                                  # Experiment tracking tool
    trackio_space_id=output_dir,                          # HF Space that will host your trackio dashboard

    # Hub integration
    push_to_hub=True,                                     # Push the resulting model to the Hub
    log_completions=True,                                 # Log completions during training
)
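The `num_generations` setting matters because GRPO scores each completion relative to the other completions sampled for the same prompt. The sketch below shows that group-relative normalization in plain Python. It is a simplified illustration: the rewards are made up, the `eps` value is an assumption, and the population standard deviation is used here, whereas real implementations may differ in such details.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one prompt's completions to zero mean, unit scale."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical format rewards for num_generations=8 completions of one prompt.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])

# If every completion gets the same reward, the group carries no learning signal.
print(group_relative_advantages([1.0] * 8))
```

Completions that beat the group average receive a positive advantage and the rest a negative one. A group in which all completions score identically yields all-zero advantages, which is one reason a binary reward can stall when the model always succeeds or always fails.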

Next, we configure the GRPO trainer, passing the previously defined `training_args`. We don't use an evaluation dataset, to keep memory usage low, but you can add one.

In [None]:
from trl import GRPOTrainer

trainer = GRPOTrainer(
    model=model,
    reward_funcs=[format_reward],
    args=training_args,
    train_dataset=train_dataset,
    peft_config=peft_config,
)
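As noted earlier, a real pipeline would also need a correctness reward. The sketch below shows one possible shape for it, under two assumptions: that the trainer forwards the dataset's `solution` column to reward functions as a keyword argument, and that each reference solution marks its final answer with `\boxed{...}`. The names `accuracy_reward` and `normalize` are hypothetical, and the string comparison is deliberately crude; a math-aware checker would be more robust.

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")  # does not handle nested braces

def normalize(text):
    """Crude normalization: drop whitespace and a trailing period."""
    return "".join(text.split()).rstrip(".")

def accuracy_reward(completions, solution, **kwargs):
    """1.0 when the <answer> content equals the last boxed value of the reference."""
    rewards = []
    for item, reference in zip(completions, solution):
        text = item[0]["content"] if isinstance(item, list) else item
        predicted = ANSWER_RE.search(text)
        boxed = BOXED_RE.findall(reference)
        if predicted is None or not boxed:
            rewards.append(0.0)
            continue
        is_correct = normalize(predicted.group(1)) == normalize(boxed[-1])
        rewards.append(1.0 if is_correct else 0.0)
    return rewards

completions = [
    "<think>6 * 7 = 42</think><answer> 42 </answer>",
    "<think>6 * 7 = 48</think><answer>48</answer>",
]
solution = ["The product is \\(\\boxed{42}\\).", "The product is \\(\\boxed{42}\\)."]
print(accuracy_reward(completions, solution))  # [1.0, 0.0]
```

A function like this could be passed alongside the format reward, for example `reward_funcs=[format_reward, accuracy_reward]`, so the model is rewarded for both structure and correctness.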

Show memory stats before training

In [None]:
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

And train!

In [None]:
trainer_stats = trainer.train()

Show memory stats after training

In [None]:
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)

print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

## Save the fine-tuned model

In this step, we save the fine-tuned model both **locally** and to the **Hugging Face Hub** using the credentials from your account.

In [None]:
trainer.save_model(output_dir)
trainer.push_to_hub(dataset_name=dataset_id)

## Load the fine-tuned model and run inference

Now, let's test our fine-tuned model by loading the **LoRA/QLoRA adapter** and performing **inference**. We'll start by loading the **base model**, then attach the adapter to it, creating the final fine-tuned model ready for evaluation.

In [None]:
output_dir = 'sergiopaniego/EssentialAI-rnj-1-instruct-trl-grpo'
model_name = "EssentialAI/rnj-1-instruct"

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = model_name
adapter_model = f"{output_dir}" # Replace with your HF username or organization

model = AutoModelForCausalLM.from_pretrained(base_model, dtype="float32", device_map="auto")
model = PeftModel.from_pretrained(model, adapter_model)

tokenizer = AutoTokenizer.from_pretrained(base_model)

In [None]:
train_dataset[0]

In [None]:
from datasets import load_dataset

dataset_id = 'AI-MO/NuminaMath-TIR'
train_dataset = load_dataset(dataset_id, split='train[:5%]')

problem = train_dataset[0]['problem']

messages = [
    {
        "role": "system", "content": [
            {"type": "text", "text": SYSTEM_PROMPT}
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": problem},
        ],
    },
]

In [None]:
messages

In [None]:
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=False,
).to(model.device)

# --- Generate Prediction --- #
print("Generating prediction...")
output_ids = model.generate(
    input_ids,
    max_new_tokens=50,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    temperature=0.2,
    top_p=0.95
)

response = tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
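To inspect the result programmatically, the response can be split into its reasoning and answer parts. The helper below is a minimal sketch; `split_reasoning` is a hypothetical name, and the demo strings are made up rather than actual model output.

```python
import re

def split_reasoning(response):
    """Return (reasoning, answer); either is None when its tag pair is missing."""
    think = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )

demo = "<think>\nI will add 2 and 2 together.\n</think>\n<answer>4</answer>"
print(split_reasoning(demo))
print(split_reasoning("<think>The generation was cut off before"))
```

If both values come back as `None`, the generation may simply have been truncated: `max_new_tokens=50` leaves little room for a reasoning trace plus an answer, so a larger value may be needed.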