# Model Evaluation: Testing GRPO-Trained Mathematical Reasoning Models

## Overview

This notebook evaluates language models trained with **Group Relative Policy Optimization (GRPO)** in the tutorial notebook `1. grpo_training_nemo_rl.ipynb`. Once GRPO training has finished, use it to assess how well your model has learned to solve mathematical reasoning problems.

## Prerequisites

Before running this evaluation:

1. **Completed Training**: You should have successfully run the GRPO training from the first notebook. **MAKE SURE THE TRAINING NOTEBOOK IS STOPPED BEFORE RUNNING THIS EVALUATION NOTEBOOK**. Running the two notebooks at the same time **will exhaust the available GPU resources**.
2. **Model Checkpoints**: Training should have generated model checkpoints in `results/grpo/step_X/`, where `X` is the training step number
3. **System Requirements**: Sufficient GPU memory for model inference during evaluation

---


## Convert Checkpoint to Hugging Face Format
We first convert the training checkpoint to Hugging Face format, which is the format the evaluation step loads.
Replace `step_X` in the paths below with the step directory of the checkpoint you want to evaluate (for example, `step_60`).


In [1]:
# Convert checkpoint to Hugging Face format (replace step_X with the step directory of the checkpoint you want to evaluate)


!cd /root/verb-workspace/NeMo-RL && uv run python examples/convert_dcp_to_hf.py \
    --config results/grpo/step_X/config.yaml \
    --dcp-ckpt-path results/grpo/step_X/policy/weights/ \
    --hf-ckpt-path results/grpo/step_X/hf

Saved HF checkpoint to: results/grpo_kl_0_max_seq_len_2048/step_60/hf
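If you are unsure which step numbers are available, the checkpoint directories can be listed programmatically. The following is a minimal sketch, assuming checkpoints sit in subdirectories named `step_<number>` as described in the prerequisites. The helper name `list_checkpoint_steps` is hypothetical and not part of NeMo-RL, and the results path is taken from the commands in this notebook, so adjust it if your run wrote elsewhere.

```python
import re
from pathlib import Path


def list_checkpoint_steps(results_dir):
    """Return sorted step numbers for subdirectories named step_<number>."""
    root = Path(results_dir).expanduser()
    if not root.is_dir():
        return []
    steps = []
    for child in root.iterdir():
        match = re.fullmatch(r"step_(\d+)", child.name)
        # Only count real directories whose name matches the pattern exactly.
        if child.is_dir() and match:
            steps.append(int(match.group(1)))
    return sorted(steps)


if __name__ == "__main__":
    # Assumed location, based on the paths used elsewhere in this notebook.
    steps = list_checkpoint_steps("/root/verb-workspace/NeMo-RL/results/grpo")
    print(f"Available steps: {steps}")
    if steps:
        print(f"Latest checkpoint: step_{steps[-1]}")
```

Sorting numerically rather than by name matters here, because a plain string sort would place `step_10` before `step_2`.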


## Modify NeMo-RL Evaluation Code to Store Results

The `eval.py` that ships with NeMo-RL does not store the evaluation responses; it only prints the final metric. We therefore replace it with a modified `eval.py` that saves the results so we can inspect them later.

In [19]:
# Replace NeMo-RL's eval.py with modified version that saves the outputs
!cp eval.py /root/verb-workspace/NeMo-RL/nemo_rl/evals/eval.py
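The modified `eval.py` itself is not reproduced in this notebook. As a rough illustration only, the sketch below shows one way per-sample results could be collected and written to a parquet file with the `prompt`, `response`, and `reward` columns that the analysis later in this notebook reads. The function names `build_result_rows` and `save_result_rows` are hypothetical; they are not NeMo-RL APIs and may differ from what the modified file actually does.

```python
def build_result_rows(prompts, responses, rewards):
    """Pair each prompt with its response and reward, one row per sample."""
    if not (len(prompts) == len(responses) == len(rewards)):
        raise ValueError("prompts, responses, and rewards must have equal length")
    return [
        {"prompt": p, "response": r, "reward": float(s)}
        for p, r, s in zip(prompts, responses, rewards)
    ]


def save_result_rows(rows, save_path):
    """Write the rows to a parquet file (requires pandas and pyarrow)."""
    import pandas as pd  # imported lazily so build_result_rows has no dependencies

    pd.DataFrame(rows).to_parquet(save_path, index=False)
```

Keeping one row per generated sample means the saved file can later be filtered or joined on the prompt text without re-running generation.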

## Run Eval for Model Trained with RL
Now let's test our trained model on MATH-500:

In [22]:
# Run evaluation for model trained with RL (replace step_X with the step directory of the checkpoint you want to evaluate)

!cd /root/verb-workspace/NeMo-RL && uv run python examples/run_eval.py \
    generation.model_name=$PWD/results/grpo/step_X/hf \
    data.dataset_name=HuggingFaceH4/MATH-500 \
    data.dataset_key=test \
    eval.save_path=result_RL.parquet

Loaded configuration from: /root/verb-workspace/NeMo-RL/examples/configs/eval.yaml
Overrides: {'generation': {'model_name': '/root/verb-workspace/NeMo-RL/results/grpo_kl_0_max_seq_len_2048/step_60/hf'}, 'data': {'dataset_name': 'HuggingFaceH4/MATH-500', 'dataset_key': 'test'}, 'eval': {'save_path': 'result_RL.parquet'}}
Applied CLI overrides
Final config:
{'cluster': {'gpus_per_node': 1, 'num_nodes': 1},
 'data': {'dataset_key': 'test',
          'dataset_name': 'HuggingFaceH4/MATH-500',
          'max_input_seq_length': 2048,
          'problem_key': 'problem',
          'prompt_file': None,
          'solution_key': 'answer',
          'system_prompt_file': None},
 'env': {'math': {'num_workers': 8}},
 'eval': {'metric': 'pass@1',
          'num_tests_per_prompt': 1,
          'save_path': 'result_RL.parquet',
          'seed': 42},
 'generation': {'backend': 'vllm',
                'max_new_tokens': 2048,
                'model_name': '/root/verb-workspace/NeMo-RL/results/grpo_kl_0_

## Run Eval for Base Model
We then evaluate the base model on MATH-500 for comparison:

In [23]:
# Run evaluation for pre-RL base model
!cd /root/verb-workspace/NeMo-RL && uv run python examples/run_eval.py \
    generation.model_name=Qwen/Qwen2.5-1.5B \
    data.dataset_name=HuggingFaceH4/MATH-500 \
    data.dataset_key=test \
    eval.save_path=result_Base.parquet

Loaded configuration from: /root/verb-workspace/NeMo-RL/examples/configs/eval.yaml
Overrides: {'generation': {'model_name': 'Qwen/Qwen2.5-1.5B'}, 'data': {'dataset_name': 'HuggingFaceH4/MATH-500', 'dataset_key': 'test'}, 'eval': {'save_path': 'result_Base.parquet'}}
Applied CLI overrides
Final config:
{'cluster': {'gpus_per_node': 1, 'num_nodes': 1},
 'data': {'dataset_key': 'test',
          'dataset_name': 'HuggingFaceH4/MATH-500',
          'max_input_seq_length': 2048,
          'problem_key': 'problem',
          'prompt_file': None,
          'solution_key': 'answer',
          'system_prompt_file': None},
 'env': {'math': {'num_workers': 8}},
 'eval': {'metric': 'pass@1',
          'num_tests_per_prompt': 1,
          'save_path': 'result_Base.parquet',
          'seed': 42},
 'generation': {'backend': 'vllm',
                'max_new_tokens': 2048,
                'model_name': 'Qwen/Qwen2.5-1.5B',
                'num_prompts_per_step': -1,
                'stop_strings': None

## Results
Your RL-trained model should score significantly higher than the base model, indicating that reinforcement learning fine-tuning has substantially improved the model's mathematical reasoning. In the sample run recorded in this notebook, accuracy on MATH-500 was 53.6% for the RL-trained model versus 5.0% for the base model.

### Commonly Used Metrics Explained

When evaluating reinforcement learning-trained language models, several key metrics are commonly used:

- **pass@1**: The percentage of problems solved correctly on the first attempt. This measures how often the model generates a correct solution immediately without multiple tries.

- **pass@k**: The percentage of problems for which at least one of k generated attempts is correct. For example, pass@10 generates 10 solutions per problem and checks whether any of them is correct, so it captures the model's ability to find the right answer when given multiple chances. It is usually computed by drawing n samples per problem (with n much larger than k), counting the number c that are correct, and plugging n, c, and k into the unbiased estimator below:

![pass@k unbiased estimator](passatk.png)

This gives a lower-variance estimate than drawing only k samples per problem and checking whether any of them is correct. A reference implementation is available in the [Hugging Face `evaluate` code_eval metric](https://github.com/huggingface/evaluate/blob/main/metrics/code_eval/code_eval.py#L198).

- **maj@k (majority@k)**: The percentage of problems solved correctly when taking the majority vote among k generated solutions. This approach assumes that if the model generates multiple solutions, the most frequently occurring answer is likely to be correct.

- **avg@n**: The average score across n evaluation runs or, equivalently, the average fraction of correct solutions among n attempts per problem. This provides a more stable estimate of model performance by reducing the variance of a single run.

These metrics assess different aspects of model performance: pass@1 measures immediate accuracy, pass@k measures the model's potential when given multiple attempts, maj@k leverages consensus among multiple generations, and avg@n provides statistical reliability.
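The pass@k and maj@k definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the code NeMo-RL uses: the function names are hypothetical, `n`, `c`, and `k` are the per-problem sample count, correct count, and attempt budget, and answers are assumed to be already extracted and normalized so that equal answers compare equal. The per-problem pass@k estimate is `1 - C(n - c, k) / C(n, k)`, where `C` is the binomial coefficient, and the reported metric is its mean over all problems.

```python
from collections import Counter
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def majority_answer(answers):
    """Most frequent answer; tie-breaking follows Counter.most_common."""
    return Counter(answers).most_common(1)[0][0]


def maj_at_k(answers, reference):
    """1.0 if the majority-vote answer equals the reference, else 0.0."""
    return float(majority_answer(answers) == reference)


if __name__ == "__main__":
    print(pass_at_k(n=10, c=5, k=1))          # 0.5
    print(pass_at_k(n=4, c=1, k=2))           # 0.5
    print(maj_at_k(["4", "4", "2"], "4"))     # 1.0
```

With `k = 1` the estimator reduces to `c / n`, the plain fraction of correct samples, which is why pass@1 with a single sample per problem is simply the accuracy.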

We are evaluating the model with the `pass@1` metric in this example.


Now let's take a closer look at the model outputs:

In [None]:
!pip install pandas pyarrow


In [None]:
import pandas as pd
import os

# Print current working directory
print(f"Current working directory: {os.getcwd()}")

# Load both parquet files
rl_path = "~/verb-workspace/NeMo-RL/result_RL.parquet"
base_path = "~/verb-workspace/NeMo-RL/result_Base.parquet"

print(f"\nLoading files:")
print(f"- RL results: {rl_path}")
print(f"- Base results: {base_path}")

try:
    df_rl = pd.read_parquet(rl_path)
    df_base = pd.read_parquet(base_path)
    print("✓ Successfully loaded both files")
except Exception as e:
    print(f"Error loading files: {e}")
    # Fall back to reading with pyarrow directly. Expand "~" explicitly
    # rather than relying on pyarrow to resolve the home directory.
    import pyarrow.parquet as pq
    df_rl = pq.read_table(os.path.expanduser(rl_path)).to_pandas()
    df_base = pq.read_table(os.path.expanduser(base_path)).to_pandas()

# Basic statistics
print(f"\nDataset sizes:")
print(f"- RL model: {len(df_rl)} samples")
print(f"- Base model: {len(df_base)} samples")

rl_accuracy = (df_rl['reward'] == 1).mean()
base_accuracy = (df_base['reward'] == 1).mean()
print(f"\nAccuracy:")
print(f"- RL model: {rl_accuracy:.1%}")
print(f"- Base model: {base_accuracy:.1%}")
print(f"- Difference: {(rl_accuracy - base_accuracy):.1%}")

# Find questions that appear in both datasets
# We'll match by prompt content
rl_prompts = set(df_rl['prompt'].values)
base_prompts = set(df_base['prompt'].values)
common_prompts = rl_prompts.intersection(base_prompts)

print(f"\nCommon questions: {len(common_prompts)} out of {len(rl_prompts)} RL prompts and {len(base_prompts)} Base prompts")

if len(common_prompts) == 0:
    print("❌ No common questions found between datasets")
    # Show a few prompts from each to help debug
    print("\nFirst 3 RL prompts:")
    for i, prompt in enumerate(list(df_rl['prompt'].head(3))):
        print(f"{i+1}: {prompt[:100]}...")
    print("\nFirst 3 Base prompts:")
    for i, prompt in enumerate(list(df_base['prompt'].head(3))):
        print(f"{i+1}: {prompt[:100]}...")
else:
    print("✓ Found common questions!")
    
    # Find questions where RL got correct and Base got wrong
    rl_correct_wrong_samples = []
    for prompt in common_prompts:
        rl_sample = df_rl[df_rl['prompt'] == prompt].iloc[0]
        base_sample = df_base[df_base['prompt'] == prompt].iloc[0]
        
        # Check if RL got correct (reward=1) and Base got wrong (reward=0)
        if rl_sample['reward'] == 1 and base_sample['reward'] == 0:
            rl_correct_wrong_samples.append((prompt, rl_sample, base_sample))
    
    print(f"Found {len(rl_correct_wrong_samples)} questions where RL got correct and Base got wrong")
    
    if len(rl_correct_wrong_samples) == 0:
        print("❌ No samples found where RL performed better than Base")
        # Fall back to any common question
        sample_prompt = list(common_prompts)[0]
        rl_sample = df_rl[df_rl['prompt'] == sample_prompt].iloc[0]
        base_sample = df_base[df_base['prompt'] == sample_prompt].iloc[0]
        print("Showing first available sample instead:")
    else:
        # Pick the first sample where RL performed better
        sample_prompt, rl_sample, base_sample = rl_correct_wrong_samples[0]
        print("✓ Showing sample where RL performed better than Base!")
    
    print("\n" + "="*80)
    print("COMPARISON EXAMPLE")
    print("="*80)
    
    # Show the question
    question = sample_prompt
    if len(question) > 300:
        question = question[:300] + "..."
    print(f"QUESTION:\n{question}")
    
    print(f"\n{'-'*40}")
    print("RL MODEL RESPONSE:")
    print(f"{'-'*40}")
    rl_response = rl_sample['response']
    if len(rl_response) > 400:
        rl_response = rl_response[:400] + "..."
    print(f"{rl_response}")
    print(f"✅ CORRECT" if rl_sample['reward'] == 1 else "❌ WRONG")
    
    print(f"\n{'-'*40}")
    print("BASE MODEL RESPONSE:")
    print(f"{'-'*40}")
    base_response = base_sample['response']
    if len(base_response) > 400:
        base_response = base_response[:400] + "..."
    print(f"{base_response}")
    print(f"✅ CORRECT" if base_sample['reward'] == 1 else "❌ WRONG")
    
    print(f"\n{'-'*40}")
    print("COMPARISON SUMMARY:")
    print(f"{'-'*40}")
    print(f"RL Model: {'✅ Correct' if rl_sample['reward'] == 1 else '❌ Wrong'}")
    print(f"Base Model: {'✅ Correct' if base_sample['reward'] == 1 else '❌ Wrong'}")
    
    if rl_sample['reward'] == base_sample['reward']:
        print("🔄 Both models performed the same")
    elif rl_sample['reward'] == 1:
        print("🎯 RL model performed better!")
    else:
        print("📉 Base model performed better")


Current working directory: /root/verb-workspace/mair-hub/rl-tutorial/kdd_labs/rl_lab

Loading files:
- RL results: ~/verb-workspace/NeMo-RL/result_RL.parquet
- Base results: ~/verb-workspace/NeMo-RL/result_Base.parquet
✓ Successfully loaded both files

Dataset sizes:
- RL model: 500 samples
- Base model: 500 samples

Accuracy:
- RL model: 53.6%
- Base model: 5.0%
- Difference: 48.6%

Common questions: 500 out of 500 RL prompts and 500 Base prompts
✓ Found common questions!

COMPARISON EXAMPLE
QUESTION:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
A polynomial with integer coefficients is of the form
\[2x^4 + a_3 x^3 + a_2 x^2 + a_1 x + 1 = 0.\]Find the number of different possible rational roots of this polynomial.<|im_end|>
<|im_start|>assistant


----------------------------------------
RL MODEL RESPONSE:
----------------------------------------
To determine the number of different possible rational roots of the polynomial \(2x^4 + a_3 x^3 + a_2 x^2 + a_

## Analysis of Results

In this evaluation, the RL-trained model produces coherent chain-of-thought reasoning and follows the instructions, while the base model struggles to produce well-formatted responses or to follow the requested format at all. This shows how reinforcement learning can teach a model not just what to output but how to work through problems systematically, which here translates into a large gain in measured accuracy on MATH-500.
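A single comparison example shows one case only. To see how the two models differ across the whole benchmark, the per-prompt outcomes can be tallied into four groups. The sketch below is a minimal illustration under stated assumptions: the function name `outcome_breakdown` is hypothetical, each prompt is assumed to appear once per model, and a reward of 1 is taken to mean a correct answer, as in the comparison code above.

```python
from collections import Counter


def outcome_breakdown(rl_rewards, base_rewards):
    """Count per-prompt outcomes for prompts present in both mappings.

    Each argument maps prompt text to a reward, where 1 means correct.
    """
    counts = Counter()
    for prompt in rl_rewards.keys() & base_rewards.keys():
        rl_ok = rl_rewards[prompt] == 1
        base_ok = base_rewards[prompt] == 1
        if rl_ok and base_ok:
            counts["both_correct"] += 1
        elif rl_ok:
            counts["rl_only"] += 1
        elif base_ok:
            counts["base_only"] += 1
        else:
            counts["both_wrong"] += 1
    return dict(counts)
```

With the DataFrames loaded earlier, the two mappings could be built with `dict(zip(df_rl['prompt'], df_rl['reward']))` and the same expression for `df_base`. A nonzero `base_only` count is worth inspecting, since it marks problems the base model solved that the RL-trained model did not.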


