<center>
<h1>Improving LLMs with RLHF (DPO & GRPO)</h1>
<br>
</center>




# 🤖 Post-Training: Improving LLMs with RLHF (DPO & GRPO)

In this notebook, we’ll show how to improve a language model using **two post-training techniques**: **DPO** (Direct Preference Optimization) in Part I and **GRPO** in Part II.


In [2]:
!git clone https://github.com/BounharAbdelaziz/RLHF.git

fatal: destination path 'RLHF' already exists and is not an empty directory.


In [3]:
!pip install -q -r RLHF/requirements.txt

# Part I: DPO

# 🧠 Fine-tuning Qwen2.5-0.5B-Instruct on French Data

In this part, we’ll walk through how to **fine-tune the Qwen2.5-0.5B-Instruct** model on **French-language data**, using **off-policy DPO (Direct Preference Optimization)**.

---

## 🧩 Key Concepts

- **Model**: [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)  
- **Objective**: Adapt the model for French understanding and instruction-following  
- **Method**: Off-policy **DPO** for alignment-based fine-tuning  

---

## ⚙️ System Requirements

For training, we use a GPU instance rented on Runpod.


## 🧮 Memory Optimization

We’ll use **LoRA** (Low-Rank Adaptation) combined with **4-bit quantization** to reduce GPU memory usage while maintaining performance.
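To see why LoRA is cheap, here is a rough parameter count. It is a sketch under assumptions: the layer shapes below (hidden size 896, MLP size 4864, 128-dimensional key/value projections, 24 layers) are taken from the published Qwen2.5-0.5B configuration rather than from this notebook, and the helper name `lora_param_count` is ours. Each adapted linear layer of shape `in × out` gains two low-rank matrices holding `r * (in + out)` parameters in total.

```python
# Assumed Qwen2.5-0.5B projection shapes as (in_features, out_features).
QWEN_0_5B_SHAPES = {
    "q_proj": (896, 896),
    "k_proj": (896, 128),
    "v_proj": (896, 128),
    "o_proj": (896, 896),
    "gate_proj": (896, 4864),
    "up_proj": (896, 4864),
    "down_proj": (4864, 896),
}
NUM_LAYERS = 24  # assumed number of transformer blocks


def lora_param_count(rank: int) -> int:
    """Trainable LoRA parameters when every projection above gets an adapter."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in QWEN_0_5B_SHAPES.values())
    return per_layer * NUM_LAYERS


print(lora_param_count(32))  # rank used in the DPO part
print(lora_param_count(64))  # rank used in the GRPO part
```

With these assumed shapes, rank 32 gives 17,596,416 trainable parameters and rank 64 gives 35,192,832, which agrees with what `print_trainable_parameters()` reports in the training cells of this notebook.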



In [5]:
# Imports
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
)
from trl import (
    DPOTrainer,
    DPOConfig,
)
from peft import (
    LoraConfig,
    get_peft_model,
    prepare_model_for_kbit_training,
)
import torch
import os
import wandb

## 📊 Tracking with Weights & Biases (W&B)

We’ll use **Weights & Biases (W&B)** to log training metrics and system stats so you can compare runs, debug faster, and share results. Before running the next cell, **create a free account** at [https://wandb.ai](https://wandb.ai) and make a new **Project** (e.g., `RLHF`). The code cell that follows initializes W&B; on first use you’ll need to paste the **API key** from your W&B profile. During training, W&B automatically tracks the loss, learning rate, gradient norm, and GPU utilization, and you can also log custom metrics (e.g., validation perplexity, evaluation scores) and configuration details (dataset, hyperparameters, LoRA/quantization settings). Each run appears on your project dashboard with charts, tables, and run metadata, making it easy to **compare experiments**, **resume runs**, and **share dashboards** with your teammates.


In [None]:
# We will use wandb.ai for logging the experiments - Set your API key here
WANDB_API_KEY = "" 

# Automatically login using the API key
os.environ["WANDB_API_KEY"] = WANDB_API_KEY
os.environ["WANDB_PROJECT"] = "RLHF"
wandb.login()

# Training dataset
DATASET_PATH = "AIffl/french_orca_dpo_pairs" # french version of "Intel/orca_dpo_pairs"

# We limit to 2k samples for speed
LIMIT = 2_000

# SFT Model we will finetune
MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"
# Seed for reproducibility
SEED = 1998

MAX_PROMPT_LEN = 1024
MAX_LENGTH = MAX_PROMPT_LEN + 512

RUN_NAME = "DPO-french-orca-" + MODEL_NAME.split('/')[-1]

[34m[1mwandb[0m: Currently logged in as: [33myebarimoghit[0m ([33myebarimoghit-centralesup-lec[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


## Load the SFT Model and Tokenizer

We’ll stick with 4-bit quantization via bitsandbytes for this lab. You already used it last week, so nothing is new: same setup (load the model with 4-bit weights) and same goal (reduce VRAM) with minimal impact on quality for our use case. This keeps runs feasible on a single GPU.

We will need the tokenizer in the data preparation step to apply the chat template.
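As a back-of-envelope illustration of the VRAM saving, the sketch below compares the storage needed for the weights alone at 16 bits and at 4 bits per parameter. The base parameter count is derived from the PEFT log printed later in this notebook ("all params" minus LoRA params); the helper `weight_memory_gib` is ours. The estimate is optimistic: it ignores quantization constants, layers that bitsandbytes leaves unquantized (such as embeddings), activations, and optimizer state.

```python
# "all params" minus trainable LoRA params, both taken from the PEFT log.
BASE_PARAMS = 511_629_184 - 17_596_416


def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Memory needed to store the weights only, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


fp16_gib = weight_memory_gib(BASE_PARAMS, 16)
nf4_gib = weight_memory_gib(BASE_PARAMS, 4)
print(f"fp16 weights : {fp16_gib:.2f} GiB")
print(f"4-bit weights: {nf4_gib:.2f} GiB")
```

For a model this small the absolute saving is modest; the same arithmetic matters much more for the 3B variant that the GRPO part lists as an alternative.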

In [7]:
# Quantization configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_quant_type = "nf4",
    bnb_4bit_compute_dtype = torch.bfloat16,
    bnb_4bit_use_double_quant = True,

)

# Load the model to finetune
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True,
)
# to avoid warning
model.config.use_cache = False
# Prepare model for k-bit training
model = prepare_model_for_kbit_training(model)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)

# Set padding token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

`torch_dtype` is deprecated! Use `dtype` instead!


## Data Preparation

### 💬 Chat templates & converting data to `messages`

Modern instruction-tuned models (including **Qwen2.5-0.5B-Instruct**) expect inputs in a **chat format** and rely on a **tokenizer chat template** to turn structured messages into the exact token sequence the model was trained on. In practice, you should **not** hand-craft special tokens; instead, pass a list of `{role, content}` messages to the tokenizer and let `apply_chat_template(...)` do the right thing.

#### Why a chat template?
- Ensures your prompts match the **pretraining/finetuning format** (system/user/assistant turns, BOS/EOS, separators).
- Minimizes prompt drift across libraries and models.
- Makes it easy to add **system instructions** (e.g., “You are a helpful assistant that answers in French.”).
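For intuition, the sketch below mimics the layout that the chat template produces for this model, based only on the rendered prompts printed later in this notebook. The function `render_chatml` is hypothetical and deliberately simplified; it is not part of `transformers`, and real code should keep calling `tokenizer.apply_chat_template(...)`, which also handles details such as default system prompts.

```python
def render_chatml(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Illustrative only: join chat turns in the <|im_start|>/<|im_end|> layout."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "Tu es un assistant utile. Réponds en français."},
    {"role": "user", "content": "Bonjour !"},
]
print(render_chatml(messages))
```

The point of the illustration is that role markers and turn separators are model-specific strings, which is exactly why they are better left to the tokenizer's own template.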

#### Message structure
Each example becomes an ordered list of chat turns:
```python
messages = [
  {"role": "system", "content": "Tu es un assistant utile. Réponds en français."},
  {"role": "user", "content": "Explique la différence entre LoRA et le fine-tuning complet."},
  {"role": "assistant", "content": "LoRA adapte un petit sous-espace de poids, alors que..."}
]
```


<h4><font color='blue'>
<hr style="border:10px solid blue">
<b>Task 3:</b><br>
Create the user message from the question field of the dataset.
<hr style="border:10px solid blue">
</font></h4>

In [None]:
def preprocess_for_dpo(example):
    # Format system message if present
    messages = []
    if example.get('system') and len(example['system'].strip()) > 0:
        messages.append({"role": "system", "content": example['system']})

    user_message = {"role":"user","content":example["question"]} 
    messages.append(user_message)

    # Create prompt with generation prompt for DPO
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # The chosen and rejected should be the assistant responses only
    chosen = example['chosen']
    rejected = example['rejected']

    return {
        "prompt": prompt,
        "chosen": chosen,
        "rejected": rejected,
    }

# Download the training dataset
dataset = load_dataset(DATASET_PATH, split="train")
# Shuffle with a fixed seed for reproducibility, then keep the first LIMIT samples
dataset = dataset.shuffle(seed=SEED).select(range(LIMIT))

# Save columns
original_columns = dataset.column_names

# Apply the preprocessing function
dpo_dataset = dataset.map(
    preprocess_for_dpo,
    remove_columns=original_columns,
)

# Filter out examples that are too long
def filter_length(example):
    prompt_length = len(tokenizer.encode(example['prompt']))
    chosen_length = len(tokenizer.encode(example['chosen']))
    rejected_length = len(tokenizer.encode(example['rejected']))

    return (prompt_length + max(chosen_length, rejected_length)) < MAX_LENGTH

dpo_dataset = dpo_dataset.filter(filter_length)

print(f"Dataset size after filtering: {len(dpo_dataset)}")

Dataset size after filtering: 1954


## Model Training

We’ll use the `trl` library. Concretely, we’ll instantiate a **policy model** (trainable) and a **reference model** (frozen) and optimize with the DPO objective so the policy prefers **chosen** over **rejected** responses for the same prompt.
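To make the objective concrete: for one preference pair, the DPO loss is $-\log \sigma\big(\beta\,[(\log \pi(y_w|x) - \log \pi_{\text{ref}}(y_w|x)) - (\log \pi(y_l|x) - \log \pi_{\text{ref}}(y_l|x))]\big)$, where $y_w$ is the chosen and $y_l$ the rejected response. The plain-Python sketch below evaluates it for made-up sequence log-probabilities; the function name `dpo_loss` and all numbers are illustrative and not taken from `trl`.

```python
import math


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more the policy favors each response than the reference does.
    chosen_logratio = pi_chosen - ref_chosen
    rejected_logratio = pi_rejected - ref_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)); the guard avoids overflow for very negative logits.
    return math.log1p(math.exp(-logits)) if logits > -30 else -logits


# Policy identical to the reference (made-up log-probs): the loss is log(2).
print(round(dpo_loss(-12.0, -15.0, -12.0, -15.0), 4))

# Policy has shifted probability toward the chosen response: the loss drops.
print(round(dpo_loss(-10.0, -20.0, -12.0, -15.0), 4))
```

The first call prints `0.6931`. This is consistent with the first logged training steps in this notebook: freshly initialized LoRA adapters leave the model unchanged, so the policy starts out equal to the reference.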

### What we’ll use
- **TRL**: `DPOConfig`, `DPOTrainer`
- **PEFT**: LoRA adapters on top of the base **Qwen2.5-0.5B-Instruct**
- **Quantization**: 4-bit (QLoRA-style) to fit on small GPUs
- **Logging**: W&B for metrics, configs, and artifacts

### Expected dataset columns
- `prompt` (or `messages`): the shared context (system+user turns)
- `chosen`: assistant reply preferred by annotators
- `rejected`: less-preferred reply
> If you’re keeping everything in chat format, you can instead pass lists of `{role, content}` messages and let the trainer apply the tokenizer’s chat template. In this notebook, the `prompt` column is already a templated string.

In [9]:
# LoRA configuration - targeting the correct modules for Qwen2.5
peft_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout = 0.1,
    bias = "none",
    task_type = "CAUSAL_LM",
    target_modules = ["gate_proj","up_proj","down_proj","q_proj","k_proj","v_proj","o_proj"]
)

# Apply LoRA to the model
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

# Training configuration
training_args = DPOConfig(
    beta=0.1,  # DPO temperature parameter
    learning_rate=5e-6,
    max_prompt_length=MAX_PROMPT_LEN,
    max_length=MAX_LENGTH,
    per_device_train_batch_size=1,  # Reduced for memory
    gradient_accumulation_steps=4,  # Increased to maintain effective batch size of 4 (1*4)
    num_train_epochs=1,
    max_grad_norm=1.0,
    logging_steps=1,
    save_steps=100,
    lr_scheduler_type="cosine",
    optim="paged_adamw_8bit",  # More memory efficient
    warmup_ratio=0.03, # 3% of the steps will be just a warmup
    save_strategy="steps",
    output_dir="./dpo_model",
    report_to="wandb",
    run_name=RUN_NAME,
    remove_unused_columns=False,
    dataloader_pin_memory=False,
    fp16=True,  # Enable mixed precision
)

# Initialize the trainer - Note: no ref_model needed when using peft_config
trainer = DPOTrainer(
    model=model,
    args=training_args,
    peft_config=peft_config,  # This automatically handles reference model
    processing_class=tokenizer,
    train_dataset=dpo_dataset,
)

# Print a sample to verify preprocessing
print("Sample from dataset:")
print(f"Prompt: {dpo_dataset[0]['prompt']}")
print(f"Chosen: {dpo_dataset[0]['chosen']}")
print(f"Rejected: {dpo_dataset[0]['rejected']}")

# Train
trainer.train()

trainable params: 17,596,416 || all params: 511,629,184 || trainable%: 3.4393


The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.


Sample from dataset:
Prompt: <|im_start|>system
Vous êtes un assistant utile, qui fournit toujours des explications. Pensez comme si vous répondiez à un enfant de cinq ans.<|im_end|>
<|im_start|>user
Prémisse:
"pour les drogues, ils appliquent la peine de mort pour ça, euh, ils sont juste très très durs et je suppose que c'est peut-être comme ça à Tokyo ou au Japon aussi, ils sont juste très durs avec les criminels" Basée sur cette prémisse, c'est l'hypothèse "Pour les drogues, ils appliquent la peine de mort, mais ils sont très très durs envers les criminels et c'est peut-être la même chose à Tokyo ou au Japon." vrai?
Choisissez votre réponse parmi :
[1]. Oui;
[2]. il n'est pas possible de le savoir ;
[3]. Non;<|im_end|>
<|im_start|>assistant

Chosen: [1]. Oui
Rejected: Ohh ! Laissez-moi y réfléchir comme un assistant super sympa et intelligent ! 🤔

Donc, vous dites que certains endroits, comme Tokyo ou le Japon, sont très durs envers les gens qui font de 

[34m[1mwandb[0m: Detected [huggingface_hub.inference] in use.
[34m[1mwandb[0m: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
[34m[1mwandb[0m: For more information, check out the docs at: https://weave-docs.wandb.ai/


Step,Training Loss
1,0.6931
2,0.6931
3,0.6931
4,0.6931
5,0.6917
6,0.695
7,0.7115
8,0.6792
9,0.6881
10,0.6719




TrainOutput(global_step=489, training_loss=0.17462673559463646, metrics={'train_runtime': 944.2641, 'train_samples_per_second': 2.069, 'train_steps_per_second': 0.518, 'total_flos': 0.0, 'train_loss': 0.17462673559463646, 'epoch': 1.0})
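The step count in this output can be checked by hand. The sketch below assumes a single GPU and that a final partial accumulation window still counts as one optimizer step; the variable names are ours.

```python
import math

num_examples = 1954              # dataset size after filtering
per_device_batch_size = 1
gradient_accumulation_steps = 4
num_epochs = 1

# Number of examples consumed per optimizer step.
effective_batch = per_device_batch_size * gradient_accumulation_steps
steps = math.ceil(num_examples / effective_batch) * num_epochs
print(effective_batch, steps)
```

Under these assumptions the result is 489 optimizer steps, which matches `global_step=489` in the `TrainOutput` above.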

In [10]:
# merge LoRA adapters with the base model
save_path = "dpo_model/final_merged_dpo_model"

model = model.merge_and_unload()
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)



('dpo_model/final_merged_dpo_model/tokenizer_config.json',
 'dpo_model/final_merged_dpo_model/special_tokens_map.json',
 'dpo_model/final_merged_dpo_model/chat_template.jinja',
 'dpo_model/final_merged_dpo_model/vocab.json',
 'dpo_model/final_merged_dpo_model/merges.txt',
 'dpo_model/final_merged_dpo_model/added_tokens.json',
 'dpo_model/final_merged_dpo_model/tokenizer.json')

# Part II: GRPO


The following diagram illustrates the **GRPO (Group Relative Policy Optimization)** process, an alternative to DPO that optimizes the model on its own sampled generations, scored by reward functions, using reinforcement-style updates:

![GRPO Training Overview](https://1drv.ms/i/c/ae69638675180117/IQQ-KizPdUxCRZU9qDGcpX1AAeettH1uhsJqqM1WjXiYR6s?width=705&height=66)

After exploring DPO, we now move on to **GRPO**, a reinforcement-learning-style approach that learns from rewards computed on the model’s own generations.  
While **DPO** adjusts the model using an *analytic loss* derived from fixed preference pairs, **GRPO** takes a more dynamic route: it samples several completions per prompt, scores them with **reward functions**, and applies **policy-gradient** updates.

In essence:
- GRPO learns from **reward signals** (rule-based checks, as in this notebook, or a learned reward model) using *on-policy* updates.  
- It keeps the clipped objective of **PPO** (Proximal Policy Optimization) but drops the value model: each completion is compared with the other completions sampled for the same prompt.  
- This allows the model to improve on *generation quality* criteria that can be checked programmatically but are hard to express as a static loss on fixed pairs.
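The "group relative" part of GRPO can be sketched in a few lines: the rewards of all completions sampled for one prompt are normalized against each other, and the result serves as the advantage in the policy-gradient update. This is a minimal illustration; the function name, the population standard deviation, and the `eps` value are our assumptions, and the trainer's exact normalization may differ.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two completions for one prompt: one earns the full reward (3.0), one earns 0.0.
print(group_advantages([3.0, 0.0]))

# Both completions earn the same reward: no learning signal for this prompt.
print(group_advantages([1.0, 1.0]))
```

When every completion in a group gets the same reward, all advantages are zero. With only two generations per prompt, as configured later in this notebook, such ties are likely common, which would be consistent with the many `0.0` loss values in the GRPO training log.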


In [13]:
import torch
import re
from datasets import load_dataset, Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
)
from peft import (
    LoraConfig,
    get_peft_model,
    TaskType,
    prepare_model_for_kbit_training,
)
from trl import (
    GRPOConfig,
    GRPOTrainer,
)
import os
import wandb

# Training dataset
DATASET_PATH = "openai/gsm8k"

# We limit to 200 samples for speed
LIMIT = 200

# SFT Model we will finetune
MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"
# MODEL_NAME = "Qwen/Qwen2.5-3B-Instruct"
# Seed for reproducibility
SEED = 1998

USE_LORA = True

# The 0.5B model is trained without quantization; larger models use 4-bit
if MODEL_NAME == "Qwen/Qwen2.5-0.5B-Instruct":
    USE_QUANT = False
else:
    USE_QUANT = True

lora_alpha = 128
lora_r = 64
lora_dropout = 0.1

if not USE_LORA:
    MAX_PROMPT_LEN = 512
    MAX_LENGTH = MAX_PROMPT_LEN + 512
else:
    MAX_PROMPT_LEN = 150
    MAX_LENGTH = MAX_PROMPT_LEN + 150

RUN_NAME = "GRPO-GSM8K-limit-" + str(LIMIT) + "-" + MODEL_NAME.split('/')[-1]

## Load the SFT Model and Tokenizer

In [14]:
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Quantization configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load the model to finetune
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    quantization_config=quantization_config if USE_QUANT else None,
    device_map="auto",
    trust_remote_code=True,
)

# Prepare model for k-bit training
if USE_QUANT:
  model = prepare_model_for_kbit_training(model)
model.config.use_cache = False

# Add padding token if not exists
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

if USE_LORA:

  # Configure LoRA
  lora_config = LoraConfig(
      r=lora_r,  # Rank of adaptation
      lora_alpha=lora_alpha,  # LoRA scaling parameter
      target_modules=[
          "q_proj",
          "k_proj",
          "v_proj",
          "o_proj",
          "gate_proj",
          "up_proj",
          "down_proj",
      ],  # Target modules for Qwen2.5 architecture
      lora_dropout=lora_dropout,  # LoRA dropout
      bias="none",  # Bias type
      task_type=TaskType.CAUSAL_LM,  # Task type
  )

  # Apply LoRA to the model
  model = get_peft_model(model, lora_config)

  # Print trainable parameters
  model.print_trainable_parameters()

trainable params: 35,192,832 || all params: 529,225,600 || trainable%: 6.6499


## Data Preparation

For GRPO training, we’ll use the **GSM8K** dataset, a benchmark of grade-school math word problems.  
Each problem includes a **question** and a worked solution that ends with the **final answer** after a `####` marker. Our goal is to teach the model to reason briefly (in English here; the prompts could equally be written in French) and then **output the final numeric answer** enclosed between `<answer>` and `</answer>` tags.

This format makes automatic evaluation simple: we can extract the number between the tags and compare it directly to the reference.

With GRPO, the model should learn two main things:
- How to follow the instruction and produce the expected output format.
- How to solve more of the math problems correctly.
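As a small illustration of that comparison, the sketch below extracts the reference from a GSM8K-style solution and the prediction from a tagged completion. Both strings are hand-written examples rather than dataset records, and the helper names `gold_answer` and `predicted_answer` are ours.

```python
import re
from typing import Optional


def gold_answer(solution: str) -> Optional[str]:
    """Final answer of a GSM8K-style solution: the text after '####'."""
    parts = solution.split("####")
    return parts[1].strip() if len(parts) > 1 else None


def predicted_answer(completion: str) -> Optional[str]:
    """Integer inside <answer>...</answer>, if present."""
    match = re.search(r"<answer>(\d+)</answer>", completion)
    return match.group(1) if match else None


# Hand-written strings in the expected formats.
solution = "48 / 2 = 24 clips in May.\n48 + 24 = 72 clips in total.\n#### 72"
completion = "Half of 48 is 24, and 48 + 24 = 72.\n<answer>72</answer>"

print(gold_answer(solution) == predicted_answer(completion))
```

Both sides are compared as strings, so a prediction only counts when it is written exactly like the reference (for example, without thousands separators or units).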

---

### 🧠 Why structure as chat messages?

Just like with DPO, the **Qwen2.5-0.5B-Instruct** model expects inputs in a *chat-style message format*.  
We’ll use:
- A **system message** to define the task and output style.
- A one-shot **user/assistant example** (`What is 2+2?`) showing short reasoning followed by the final numeric answer wrapped in tags.
- A **user message** with the math problem to solve.

In [15]:
# Load GSM8K dataset
if LIMIT:
    dataset = load_dataset(DATASET_PATH, "main", split=f"train[:{LIMIT}]")  # Small subset for demo
else:
    dataset = load_dataset(DATASET_PATH, "main", split="train")

R1_STYLE_SYSTEM_PROMPT = """A conversation between User and Assistant. The user asks a question, and the Assistant solves it.
The assistant first thinks shortly about the reasoning process in the mind and then provides the user with the answer in a new line between <answer> and </answer>."""

TASK_SPECIFIC_INSTRUCTIONS = "The answer must be a single integer."

def preprocess_dataset(dataset, chunk_size=1000) -> Dataset:

    def extract_hash_answer(text: str) -> str | None:
        try:
            return text.split("####")[1].strip()
        except IndexError:
            return None

    def process_batch(batch):
        prompts = []
        for question in batch["question"]:
            messages = [
                {"role": "system", "content": R1_STYLE_SYSTEM_PROMPT + "\n" + TASK_SPECIFIC_INSTRUCTIONS},
                # One-shot example showing the expected output format
                {"role": "user", "content": "What is 2+2?"},
                {"role": "assistant", "content": "To calculate 2+2, we simply add the numbers together: 2 + 2 = 4.\n<answer>4</answer>"},
                # The actual GSM8K question
                {"role": "user", "content": question},
            ]
            # Render the chat as a string ending with the assistant generation prompt
            prompt = tokenizer.apply_chat_template(
                messages,
                tokenize=False,
                add_generation_prompt=True
            )
            prompts.append(prompt)

        return {
            'prompt': prompts,
            'answer': [extract_hash_answer(a) for a in batch['answer']]
        }

    return dataset.map(process_batch, batched=True, batch_size=chunk_size)
train_dataset = preprocess_dataset(dataset, chunk_size=500)

In [16]:
train_dataset[0]

{'question': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?',
 'answer': '72',
 'prompt': '<|im_start|>system\nA conversation between User and Assistant. The user asks a question, and the Assistant solves it.\nThe assistant first thinks shortly about the reasoning process in the mind and then provides the user with the answer in a new line between <answer> and </answer>.\nThe answer must be a single integer.<|im_end|>\n<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\nTo calculate 2+2, we simply add the numbers together: 2 + 2 = 4.\n<answer>4</answer><|im_end|>\n<|im_start|>user\nNatalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?<|im_end|>\n<|im_start|>assistant\n'}

## Reward Function Design

We’ll use **two simple rewards** during GRPO rollouts:

1. **Format reward**: checks that the **last non-empty line** is exactly in the form  
   `<answer>NUMBER</answer>`  
   - Score: **1** if correct format, **0** otherwise.

2. **Correctness reward**: checks whether the extracted number matches the gold answer.  
   - Score: **2** if correct, **0** otherwise.

Total reward per sample ∈ {0, 1, 2, 3}.
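To see how the four totals arise, the sketch below scores a few hand-written completions. The helpers `format_reward` and `correctness_reward` are simplified single-string versions of the reward functions defined in the next cell and reuse their regular expressions; the sample completions are made up. Note that the format pattern also requires at least one line of text before the answer line.

```python
import re

FORMAT_PATTERN = r"^(?:[^\r\n]*\r?\n)+<answer>\d+</answer>\r?\n?$"


def format_reward(completion: str) -> float:
    """1.0 if some text is followed by a final <answer>N</answer> line."""
    return 1.0 if re.match(FORMAT_PATTERN, completion) else 0.0


def correctness_reward(completion: str, gold: str) -> float:
    """2.0 if the last non-empty line carries the gold integer in tags."""
    lines = [line for line in completion.strip().split("\n") if line.strip()]
    match = re.search(r"<answer>(\d+)</answer>", lines[-1]) if lines else None
    return 2.0 if match and match.group(1) == gold else 0.0


samples = [
    "48 + 24 = 72.\n<answer>72</answer>",  # reasoning + correct answer
    "<answer>72</answer>",                  # correct answer, no reasoning line
    "48 + 48 = 96.\n<answer>96</answer>",  # right format, wrong answer
    "The answer is 72.",                    # no tags at all
]
totals = [format_reward(s) + correctness_reward(s, "72") for s in samples]
print(totals)  # [3.0, 2.0, 1.0, 0.0]
```

The second sample shows why a total of 2 is possible: the answer is correct, but the completion contains no reasoning line before the tags, so the format reward is withheld.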



In [17]:
def extract_xml_answer(text: str) -> str | None:
    """Return the integer inside <answer>...</answer> on the last non-empty line, else None."""
    lines = [line for line in text.strip().split('\n') if line.strip()]
    if lines:
        match = re.search(r'<answer>(\d+)</answer>', lines[-1])
        if match:
            return match.group(1)
    return None

def format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has the correct format."""
    pattern = r"^(?:[^\r\n]*\r?\n)+<answer>\d+</answer>\r?\n?$"
    responses = [completion for completion in completions]
    matches = [bool(re.match(pattern, r)) for r in responses]
    return [1.0 if match else 0.0 for match in matches]

def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    """Reward function that checks if the answer is correct."""
    responses = [completion for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]

## Model Training

In [18]:
# GRPO Configuration
grpo_config = GRPOConfig(
    output_dir="./grpo_model",
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    num_train_epochs=3,
    max_prompt_length=MAX_PROMPT_LEN,
    max_completion_length=MAX_LENGTH,
    num_generations=2, # The effective train batch size must be evenly divisible by the number of generations per prompt
    beta=0,
    epsilon=0.28,
    temperature=1,
    logging_steps=1,
    save_steps=25,
    save_total_limit=3,
    # load_best_model_at_end=True,
    # metric_for_best_model="reward",
    # greater_is_better=True,
    report_to="wandb",
    run_name=RUN_NAME,
    lr_scheduler_type="cosine",
    optim="paged_adamw_8bit" ,  # More memory efficient
    warmup_ratio=0.03, # 3% of the steps will be just a warmup
    remove_unused_columns=False,
    dataloader_pin_memory=False,
    fp16=True,  # Enable mixed precision
)

# Initialize trainer
trainer = GRPOTrainer(
    model=model,
    reward_funcs=[format_reward_func, correctness_reward_func],
    args=grpo_config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)

# Training
print("Starting GRPO training...")
trainer.train()

The model is already on multiple devices. Skipping the move to device specified in `args`.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.


Starting GRPO training...


Step,Training Loss
1,0.0
2,0.0073
3,0.2317
4,0.0
5,0.0
6,0.0
7,0.0
8,-0.0381
9,0.0
10,0.0


TrainOutput(global_step=150, training_loss=0.004809098082829884, metrics={'train_runtime': 529.0525, 'train_samples_per_second': 1.134, 'train_steps_per_second': 0.284, 'total_flos': 0.0, 'train_loss': 0.004809098082829884})

In [19]:
# merge LoRA adapters with the base model
save_path = "grpo_model/final_merged_grpo_model"

model = model.merge_and_unload()
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

('grpo_model/final_merged_grpo_model/tokenizer_config.json',
 'grpo_model/final_merged_grpo_model/special_tokens_map.json',
 'grpo_model/final_merged_grpo_model/chat_template.jinja',
 'grpo_model/final_merged_grpo_model/vocab.json',
 'grpo_model/final_merged_grpo_model/merges.txt',
 'grpo_model/final_merged_grpo_model/added_tokens.json',
 'grpo_model/final_merged_grpo_model/tokenizer.json')

## 📈 Model Evaluation

Once our GRPO-trained model is ready, we need to **evaluate its performance** to verify that it has learned to produce correctly formatted and accurate answers.

To limit compute, we evaluate on the first 200 samples of the GSM8K test split only.


In [21]:
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from tqdm.notebook import tqdm
import numpy as np
from typing import List, Dict
import json
from datetime import datetime
import logging


def extract_hash_answer(text: str) -> str | None:
    try:
        return text.split("####")[1].strip()
    except IndexError:
        return None

def evaluate_model(
    model_path: str,
    batch_size: int = 1,
    num_samples: int = None,
    save_results: bool = True,
) -> Dict:
    print("Initializing evaluation...")

    with tqdm(total=2, desc="Loading model components") as pbar:
        llm = AutoModelForCausalLM.from_pretrained(
            model_path,
            torch_dtype=torch.float16,
            device_map="cuda:0",
            trust_remote_code=True,
        )
        pbar.update(1)

        tokenizer = AutoTokenizer.from_pretrained(
            model_path,
            model_max_length=768,
        )
        pbar.update(1)

    # Load test dataset
    print("Loading dataset...")
    dataset = load_dataset('openai/gsm8k', 'main', split='test')
    if num_samples:
        dataset = dataset.select(range(num_samples))
    total_samples = len(dataset)
    print(f"Loaded {total_samples} samples")

    results = []
    correct = 0
    total = 0

    # Create progress bar
    progress_bar = tqdm(
        total=total_samples,
        desc="Processing samples",
        unit="examples",
        dynamic_ncols=True,
    )

    progress_bar.set_postfix({
        'acc': '0.00%',
        'correct': '0',
    })

    # Process in batches
    for i in range(0, total_samples, batch_size):
        batch_data = dataset[i:i + batch_size]
        current_batch_size = len(batch_data['question'])

        # Prepare prompts using same format as training
        prompts = [
            [
                {'role': 'system', 'content': R1_STYLE_SYSTEM_PROMPT + "\n" + TASK_SPECIFIC_INSTRUCTIONS},
                {'role': 'user', 'content': "What is 2+2?"},
                {'role': 'assistant', 'content': "To calculate 2+2, we simply add the numbers together: 2 + 2 = 4.\n<answer>4</answer>"},
                {'role': 'user', 'content': q.strip()}
            ] for q in batch_data['question']
        ]

        # Convert to chat format
        formatted_prompts = [
            tokenizer.apply_chat_template(
                p,
                tokenize=True,
                return_tensors='pt',
                add_generation_prompt=True
            )
            for p in prompts
        ]

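        # Generate responses
        outputs = []
        for prompt in formatted_prompts:
            input_ids = prompt.to('cuda:0')
            output = llm.generate(
                input_ids,
                # Single unpadded sequence, so every position is attended to
                attention_mask=torch.ones_like(input_ids),
                max_new_tokens=512,
                temperature=1.0,
            )
            # Keep only the newly generated tokens (drop the prompt)
            outputs.append(output[:, input_ids.shape[-1]:])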


        # Process responses
        for j, output in enumerate(outputs):

            response = tokenizer.decode(output[0], skip_special_tokens=True)

            # Extract answers
            generated_answer = extract_xml_answer(response)
            true_answer = extract_hash_answer(batch_data['answer'][j])

            # Store result
            result = {
                'question': batch_data['question'][j],
                'true_answer': true_answer,
                'generated_answer': generated_answer,
                'full_response': response,
                'correct': generated_answer == true_answer
            }
            results.append(result)

            # Update metrics
            if generated_answer == true_answer:
                correct += 1
            total += 1

        # Update progress
        progress_bar.update(current_batch_size)
        progress_bar.set_postfix({
            'acc': f'{(correct/total)*100:.2f}%',
            'correct': f'{correct}/{total}',
        })

    progress_bar.close()

    # Calculate metrics
    accuracy = correct / total if total > 0 else 0
    metrics = {
        'accuracy': accuracy,
        'correct': correct,
        'total': total,
        'model_path': model_path,
        'timestamp': datetime.now().isoformat()
    }

    # Save results
    if save_results:
        save_path = f"gsm8k_eval_results_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
        with open(save_path, 'w') as f:
            json.dump({
                'metrics': metrics,
                'results': results
            }, f, indent=2)
        print(f"\nResults saved to {save_path}")

    return metrics

print("Starting GSM8K evaluation...")


Starting GSM8K evaluation...


### Evaluation after GRPO



In [22]:
evaluate_model("grpo_model/final_merged_grpo_model",num_samples = 200)

Initializing evaluation...



Loading dataset...
Loaded 200 samples



The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.



Results saved to gsm8k_eval_results_20251018_180542.json


{'accuracy': 0.265,
 'correct': 53,
 'total': 200,
 'model_path': 'grpo_model/final_merged_grpo_model',
 'timestamp': '2025-10-18T18:05:42.438047'}

### Evaluation before GRPO

In [24]:
evaluate_model(MODEL_NAME,num_samples = 200)

Initializing evaluation...



Loading dataset...
Loaded 200 samples




Results saved to gsm8k_eval_results_20251018_181844.json


{'accuracy': 0.055,
 'correct': 11,
 'total': 200,
 'model_path': 'Qwen/Qwen2.5-0.5B-Instruct',
 'timestamp': '2025-10-18T18:18:44.115134'}
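To put the two evaluations side by side, the sketch below recomputes the accuracies from the counts reported above and attaches a simple binomial standard error, since only 200 test samples were used. The helper `accuracy_with_se` is ours, and the standard error is a rough indication of sampling noise, not a full significance test.

```python
import math

before = {"correct": 11, "total": 200}  # base Qwen2.5-0.5B-Instruct
after = {"correct": 53, "total": 200}   # after GRPO


def accuracy_with_se(result: dict) -> tuple[float, float]:
    """Accuracy and its binomial standard error."""
    p = result["correct"] / result["total"]
    return p, math.sqrt(p * (1 - p) / result["total"])


acc_before, se_before = accuracy_with_se(before)
acc_after, se_after = accuracy_with_se(after)
print(f"before: {acc_before:.3f} +/- {se_before:.3f}")
print(f"after : {acc_after:.3f} +/- {se_after:.3f}")
print(f"gain  : {acc_after - acc_before:+.3f}")
```

On this subset, accuracy rises from 5.5% to 26.5%, a gain of 21 percentage points, which is large relative to the estimated standard errors.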