In [None]:
!pip install -qU trl peft math_verify transformers datasets accelerate

In [None]:
from huggingface_hub import notebook_login

notebook_login()

# Post-Training an LLM for Reasoning with GRPO in TRL

The **DeepSeekMath** paper demonstrates how to post-train a Large Language Model (LLM) using **Group Relative Policy Optimization (GRPO)**.

GRPO is particularly effective for **scaling test-time compute for extended reasoning**, making it well suited to complex tasks such as mathematical problem-solving.

GRPO is a **reinforcement learning (RL) post-training technique** that was integrated into the training pipeline for **DeepSeek-R1**. Unlike earlier techniques that relied on search-heuristic methods, GRPO exclusively employs RL for post-training, enhancing the model's capacity to handle complex and nuanced tasks.
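
To make the "group relative" part concrete, here is a minimal sketch of the advantage computation described for outcome rewards in the DeepSeekMath paper: each completion's reward is normalized against the other completions sampled for the same prompt. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration, not TRL's exact implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against the group sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt: two earn the reward, two do not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)

# If every completion gets the same reward, no completion is preferred.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

Completions that score above the group mean get a positive advantage and are reinforced; those below the mean get a negative one. No separate value model is needed, since the group itself serves as the baseline.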

## Load Dataset

We will use the **AI-MO/NuminaMath-TIR** dataset, which is a **reasoning-focused dataset** that contains mathematical problems, their solutions, and detailed reasoning steps that explain how to transition from the problem statement to the final solution.

In [None]:
from datasets import load_dataset

dataset_id = 'AI-MO/NuminaMath-TIR'
train_dataset, test_dataset = load_dataset(
    dataset_id,
    split=['train[:5%]', 'test[:5%]']
)

In [4]:
train_dataset

Dataset({
    features: ['problem', 'solution', 'messages'],
    num_rows: 3622
})

In [7]:
train_dataset[0]

{'problem': 'What is the coefficient of $x^2y^6$ in the expansion of $\\left(\\frac{3}{5}x-\\frac{y}{2}\\right)^8$?  Express your answer as a common fraction.',
 'solution': "To determine the coefficient of \\(x^2y^6\\) in the expansion of \\(\\left(\\frac{3}{5}x - \\frac{y}{2}\\right)^8\\), we can use the binomial theorem.\n\nThe binomial theorem states:\n\\[\n(a + b)^n = \\sum_{k=0}^{n} \\binom{n}{k} a^{n-k} b^k\n\\]\n\nIn this case, \\(a = \\frac{3}{5}x\\), \\(b = -\\frac{y}{2}\\), and \\(n = 8\\).\n\nWe are interested in the term that contains \\(x^2y^6\\). In the general term of the binomial expansion:\n\\[\n\\binom{8}{k} \\left(\\frac{3}{5}x\\right)^{8-k} \\left(-\\frac{y}{2}\\right)^k\n\\]\n\nTo get \\(x^2\\), we need \\(8 - k = 2\\), thus \\(k = 6\\).\n\nSubstituting \\(k = 6\\) into the expression:\n\\[\n\\binom{8}{6} \\left(\\frac{3}{5}x\\right)^{8-6} \\left(-\\frac{y}{2}\\right)^6 = \\binom{8}{6} \\left(\\frac{3}{5}x\\right)^2 \\left(-\\frac{y}{2}\\right)^6\n\\]\n\nNow, we w

In the DeepSeek-R1 training procedure, a specific system prompt was used to produce a conversational format that includes reasoning steps. We will apply a similar approach, where the model is guided to first think through the problem and then present its answer.

In [8]:
SYSTEM_PROMPT = (
    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. "
    "The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. "
    "The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., "
    "<think> reasoning process here </think><answer> answer here </answer>"
)

We will also modify our dataset to follow this conversational format, prompting the LLM to generate both the reasoning steps and the final answer.

In [10]:
def make_conversation(example):
    return {
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': example['problem']}
        ]
    }


train_dataset = train_dataset.map(make_conversation)
test_dataset = test_dataset.map(make_conversation)


In [12]:
train_dataset

Dataset({
    features: ['problem', 'solution', 'messages', 'prompt'],
    num_rows: 3622
})

In [11]:
train_dataset[0]

{'problem': 'What is the coefficient of $x^2y^6$ in the expansion of $\\left(\\frac{3}{5}x-\\frac{y}{2}\\right)^8$?  Express your answer as a common fraction.',
 'solution': "To determine the coefficient of \\(x^2y^6\\) in the expansion of \\(\\left(\\frac{3}{5}x - \\frac{y}{2}\\right)^8\\), we can use the binomial theorem.\n\nThe binomial theorem states:\n\\[\n(a + b)^n = \\sum_{k=0}^{n} \\binom{n}{k} a^{n-k} b^k\n\\]\n\nIn this case, \\(a = \\frac{3}{5}x\\), \\(b = -\\frac{y}{2}\\), and \\(n = 8\\).\n\nWe are interested in the term that contains \\(x^2y^6\\). In the general term of the binomial expansion:\n\\[\n\\binom{8}{k} \\left(\\frac{3}{5}x\\right)^{8-k} \\left(-\\frac{y}{2}\\right)^k\n\\]\n\nTo get \\(x^2\\), we need \\(8 - k = 2\\), thus \\(k = 6\\).\n\nSubstituting \\(k = 6\\) into the expression:\n\\[\n\\binom{8}{6} \\left(\\frac{3}{5}x\\right)^{8-6} \\left(-\\frac{y}{2}\\right)^6 = \\binom{8}{6} \\left(\\frac{3}{5}x\\right)^2 \\left(-\\frac{y}{2}\\right)^6\n\\]\n\nNow, we w
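
The printed record is cut off before the new `prompt` column appears, so a toy record helps show its shape. The sketch below uses a shortened placeholder system prompt and a made-up problem (both are assumptions for illustration), and mimics how `Dataset.map` merges the returned dict into each record.

```python
# Placeholder system prompt, shortened for illustration (not the notebook's full text).
DEMO_SYSTEM_PROMPT = "Think inside <think> </think>, then answer inside <answer> </answer>."


def make_conversation_demo(example):
    """Build a system + user message list from a record's problem text."""
    return {
        'prompt': [
            {'role': 'system', 'content': DEMO_SYSTEM_PROMPT},
            {'role': 'user', 'content': example['problem']},
        ]
    }


# Hypothetical record with the same keys we rely on from the dataset.
toy_example = {'problem': 'What is 2 + 2?', 'solution': 'The answer is $4$.'}

# Dataset.map adds the returned keys to the record; a dict merge mimics that here.
toy_record = {**toy_example, **make_conversation_demo(toy_example)}

for message in toy_record['prompt']:
    print(f"{message['role']}: {message['content']}")
```

Each prompt is a list of chat messages rather than a single string, which is the conversational format the trainer receives.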

We only need the custom `prompt` column, plus `solution` to verify the generated answer, so we drop the other columns.

In [13]:
train_dataset = train_dataset.remove_columns(['messages', 'problem'])
train_dataset

Dataset({
    features: ['solution', 'prompt'],
    num_rows: 3622
})

## Post-training the base model using GRPO

### Loading the baseline model

We will load `Qwen/Qwen2-0.5B-Instruct` as the baseline model. For better results, we should consider a larger LLM.

In [None]:
import torch
from transformers import AutoModelForCausalLM

model_id = 'Qwen/Qwen2-0.5B-Instruct'
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype='auto',
    device_map='auto'
)

### Configuring LoRA

LoRA lets us fine-tune the model by training only a small set of added low-rank parameters while the original weights stay frozen, which makes training faster and less resource-intensive.

In [15]:
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    task_type='CAUSAL_LM',
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=['q_proj', 'v_proj']
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

trainable params: 540,672 || all params: 494,573,440 || trainable%: 0.1093
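
The trainable-parameter count above can be reproduced by hand. The sketch below assumes the published Qwen2-0.5B dimensions (hidden size 896, 24 layers, 2 key/value heads of dimension 64); check `model.config` before relying on them. For each targeted linear layer, LoRA adds two matrices of shapes `r x in_features` and `out_features x r`.

```python
# Assumed Qwen2-0.5B dimensions; verify against model.config.
hidden_size = 896
num_layers = 24
num_kv_heads = 2
head_dim = 64
r = 8  # LoRA rank from the LoraConfig above


def lora_params(in_features, out_features, rank):
    """Parameters added by one LoRA adapter: A (rank x in) plus B (out x rank)."""
    return rank * (in_features + out_features)


q_proj = lora_params(hidden_size, hidden_size, r)              # 896 -> 896
v_proj = lora_params(hidden_size, num_kv_heads * head_dim, r)  # 896 -> 128
trainable = num_layers * (q_proj + v_proj)

print(f"{trainable:,}")                          # 540,672
print(f"{100 * trainable / 494_573_440:.4f}%")   # 0.1093%
```

Because `v_proj` only projects to the two key/value heads, its adapter is smaller than the one on `q_proj`. Adding more `target_modules` or raising `r` increases the count proportionally.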


### Loading reward functions

For the reward component of the system, we can use either pretrained reward models or reward functions defined directly in code.

For training, DeepSeek-R1 used an accuracy-based reward that evaluates whether the response is correct, alongside a format-based reward that ensures the model places its reasoning process between `<think> </think>` tags.

We can simply define and implement these reward functions as generic Python functions. In this case, we will utilize these reward functions:
1. **Format Enforcement** ensures that the generation follows a specific format using `<think> </think> <answer> </answer>` tags for reasoning.

In [17]:
import re

def format_reward(completions, **kwargs):
    """Reward function that checks if the completion has a specific format"""
    # re.DOTALL lets `.` span newlines, so multi-line reasoning can still match
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    completion_contents = [completion[0]['content'] for completion in completions]
    matches = [re.match(pattern, content, re.DOTALL) for content in completion_contents]
    rewards_list = [1.0 if match else 0.0 for match in matches]
    return rewards_list
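
To see how this kind of pattern treats different completions, the sketch below runs it on three hand-written strings, with and without `re.DOTALL`. The sample strings are made up for illustration. Without the flag, `.` does not match a newline, so reasoning that spans several lines fails the check.

```python
import re

PATTERN = r"^<think>.*?</think>\s*<answer>.*?</answer>$"

# Hypothetical completions used only to exercise the pattern.
samples = {
    'single_line': "<think>2 + 2 = 4</think><answer>4</answer>",
    'missing_think': "<answer>4</answer>",
    'multi_line': "<think>2 + 2\n= 4</think>\n<answer>4</answer>",
}

results = {}
for name, text in samples.items():
    default = bool(re.match(PATTERN, text))
    dotall = bool(re.match(PATTERN, text, re.DOTALL))
    results[name] = (default, dotall)
    print(f"{name:14} default={default} dotall={dotall}")
```

Only the multi-line sample changes between the two modes, which is why the flag matters once the model starts writing longer reasoning.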

2. **Solution Accuracy** verifies whether the solution to the problem is correct.

In [18]:
from math_verify import LatexExtractionConfig, parse, verify

def accuracy_reward(completions, **kwargs):
    """Reward function that checks if the completion is the same as the ground truth"""
    solutions = kwargs['solution']
    completion_contents = [completion[0]['content'] for completion in completions]
    rewards = []

    for content, solution in zip(completion_contents, solutions):
        gold_parsed = parse(
            solution,
            extraction_mode='first_match',
            extraction_config=[LatexExtractionConfig()]
        )
        answer_parsed = parse(
            content,
            extraction_mode='first_match',
            extraction_config=[LatexExtractionConfig()]
        )

        if len(gold_parsed) != 0:
            try:
                rewards.append(float(verify(answer_parsed, gold_parsed)))
            except Exception:
                rewards.append(0.0)
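        else:
            # The gold solution could not be parsed, so nothing can be verified;
            # every completion gets the same reward and accuracy does not favor any of them
            rewards.append(1.0)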

    return rewards
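
Both functions return one score per completion, so their outputs can be combined into a single reward. The sketch below assumes the scores are added with optional per-function weights, which is our understanding of how the trainer combines multiple reward functions. The helper `combine_rewards` and the example scores are hypothetical.

```python
def combine_rewards(per_function_rewards, weights=None):
    """Weighted sum across reward functions, giving one total per completion."""
    if weights is None:
        weights = [1.0] * len(per_function_rewards)
    return [
        sum(w * r for w, r in zip(weights, column))
        for column in zip(*per_function_rewards)
    ]


# Hypothetical scores for four completions of the same prompt.
format_scores = [1.0, 1.0, 0.0, 0.0]
accuracy_scores = [1.0, 0.0, 1.0, 0.0]

totals = combine_rewards([format_scores, accuracy_scores])
print(totals)  # [2.0, 1.0, 1.0, 0.0]
```

Under this scheme a completion that is correct but badly formatted scores the same as one that is well formatted but wrong. Adjusting the weights is one way to express which property matters more.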

### Configuring GRPO training parameters

In [19]:
from trl import GRPOConfig

# Configure training arguments using GRPOConfig
training_args = GRPOConfig(
    output_dir='Qwen2-0.5B-GRPO-test',
    learning_rate=1e-5,
    remove_unused_columns=False, # to access the solution column in accuracy_reward
    gradient_accumulation_steps=16,
    num_train_epochs=1,
    bf16=True,
    # Parameters that control the data preprocessing
    max_completion_length=64, # default 256
    num_generations=4, # default 8
    max_prompt_length=128, # default 512
    # Parameters related to reporting and saving
    report_to=['tensorboard'],
    logging_steps=10,
    push_to_hub=False,
    save_strategy='steps',
    save_steps=10,
)
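
As a rough sanity check on these length settings, the sketch below works out the upper bound on tokens handled per prompt group. It is plain arithmetic on the values above. Treating the prompt as repeated once per generation is an assumption about how the group is batched.

```python
max_prompt_length = 128
max_completion_length = 64
num_generations = 4

# Upper bounds; real prompts and completions may be shorter.
tokens_per_sequence = max_prompt_length + max_completion_length
tokens_per_group = num_generations * tokens_per_sequence
generated_per_group = num_generations * max_completion_length

print(tokens_per_sequence)   # 192
print(tokens_per_group)      # 768
print(generated_per_group)   # 256
```

A 64-token completion budget keeps this demo fast but leaves little room for reasoning plus an answer. Raising `max_completion_length` is the first setting to revisit for a longer run.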

### Training the model

Now we pass the two reward functions we previously defined to the trainer.

In [20]:
from trl import GRPOTrainer

trainer = GRPOTrainer(
    model=model,
    reward_funcs=[format_reward, accuracy_reward],
    args=training_args,
    train_dataset=train_dataset
)


No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [None]:
trainer.train()



In [None]:
# save the results
trainer.save_model(training_args.output_dir)
trainer.push_to_hub(dataset_name=dataset_id)

## Check the model performance

In [None]:
from transformers import AutoTokenizer

model_id = 'sergiopaniego/Qwen2-0.5B-GRPO'
trained_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype='auto',
    device_map='auto'
)
trained_tokenizer = AutoTokenizer.from_pretrained(model_id)

In [None]:
test_dataset['prompt'][0]

We will create a function to interact with the model. In addition to generating the answer, we will measure the inference duration and count the number of generated tokens. This will give us insights into how much the model has reasoned during generation.

In [None]:
import time

def generate_with_reasoning(prompt):
    # build the prompt from the dataset
    prompt = " ".join(entry['content'] for entry in prompt)

    # tokenize and move to the same device as the model
    inputs = trained_tokenizer(prompt, return_tensors='pt').to(trained_model.device)

    # generate text without gradients
    start_time = time.time()
    with torch.no_grad():
        output_ids = trained_model.generate(**inputs, max_length=500)
    end_time = time.time()

    # decode and extract model response
    generated_text = trained_tokenizer.decode(output_ids[0], skip_special_tokens=True)

    # get inference time
    inference_duration = end_time - start_time

    # get number of generated tokens
    num_input_tokens = inputs['input_ids'].shape[1]
    num_generated_tokens = output_ids.shape[1] - num_input_tokens

    return generated_text, inference_duration, num_generated_tokens

In [None]:
prompt = test_dataset['prompt'][0]
generated_text, inference_duration, num_generated_tokens = generate_with_reasoning(prompt)
print(generated_text)
print(f"Inference time: {inference_duration:.2f} seconds")
print(f"Generated tokens: {num_generated_tokens}")

To inspect only the model's response, strip the prompt text from the start of the generated output:

In [None]:
prompt_text = " ".join(entry['content'] for entry in prompt)
response_text = generated_text[len(prompt_text) :].strip()
print(response_text)
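
Once the response is isolated, the reasoning and the final answer can be separated by their tags. This is a minimal sketch with a hypothetical helper, `split_reasoning`, run on a made-up response. It returns `None` for a part whose tags are missing, which a small model may well omit.

```python
import re


def split_reasoning(response_text):
    """Return (reasoning, answer); a part is None when its tag pair is missing."""
    think = re.search(r"<think>(.*?)</think>", response_text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", response_text, re.DOTALL)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )


# Made-up response used only to exercise the helper.
sample = "<think>\n2 + 2 = 4\n</think>\n<answer>4</answer>"
reasoning, answer = split_reasoning(sample)
print(f"reasoning: {reasoning}")
print(f"answer: {answer}")
```

Passing `response_text` from the cell above to `split_reasoning` shows whether the trained model actually follows the requested format.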