# Model Evaluation: Testing GRPO-Trained Mathematical Reasoning Models

## Overview

This notebook evaluates language models trained with **Group Relative Policy Optimization (GRPO)** in the tutorial `1. grpo_training_nemo_rl.ipynb`. Once GRPO training has finished, use it to measure how well your model solves mathematical reasoning problems.

## Prerequisites

Before running this evaluation:

1. **Completed Training**: You should have successfully run the GRPO training from the first notebook
2. **Model Checkpoints**: Training should have generated model checkpoints in `results/grpo/step_X/`
3. **System Requirements**: Sufficient GPU memory for model inference during evaluation
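
To see which checkpoints are available before choosing one, a small helper can scan the results directory. This is a minimal sketch, assuming checkpoints follow the `results/grpo/step_X/` layout described above; the function name `list_checkpoint_steps` is hypothetical and not part of NeMo-RL.

```python
import re
from pathlib import Path


def list_checkpoint_steps(results_dir):
    """Return the sorted step numbers of all `step_<N>` subdirectories."""
    root = Path(results_dir)
    if not root.is_dir():
        return []
    steps = []
    for child in root.iterdir():
        # Only directories named exactly step_<digits> count as checkpoints.
        match = re.fullmatch(r"step_(\d+)", child.name)
        if child.is_dir() and match:
            steps.append(int(match.group(1)))
    return sorted(steps)


if __name__ == "__main__":
    # Assumed path, relative to the NeMo-RL checkout; adjust it to your setup.
    print("Available checkpoint steps:", list_checkpoint_steps("results/grpo"))
```

The last entry of the returned list is the most recent checkpoint, which is usually a reasonable default to evaluate.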

---


## Convert Checkpoint to Hugging Face Format
Before evaluating, convert the training checkpoint to Hugging Face format.
The command below uses `step_130` as an example; replace it with the step number of the checkpoint you want to evaluate.


In [5]:
# Convert the checkpoint to Hugging Face format (replace step_130 with the step number of the checkpoint you want to evaluate)

!cd /root/verb-workspace/NeMo-RL && uv run python examples/convert_dcp_to_hf.py \
    --config results/grpo/step_130/config.yaml \
    --dcp-ckpt-path results/grpo/step_130/policy/weights/ \
    --hf-ckpt-path results/grpo/hf

Saved HF checkpoint to: results/grpo/hf
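
Before starting the evaluation, it can help to confirm that the conversion produced a loadable directory. The sketch below assumes the converter writes the usual Hugging Face layout (a `config.json` plus `*.safetensors` or `*.bin` weight files); that layout is an assumption here, not something this notebook's output confirms, and `check_hf_checkpoint` is a hypothetical helper.

```python
from pathlib import Path


def check_hf_checkpoint(hf_dir):
    """Return a list of problems found in a Hugging Face style checkpoint directory."""
    root = Path(hf_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    # Assumption: the model configuration is stored as config.json.
    if not (root / "config.json").is_file():
        problems.append("missing config.json")
    # Assumption: weights are stored as .safetensors or .bin files.
    weight_files = list(root.glob("*.safetensors")) + list(root.glob("*.bin"))
    if not weight_files:
        problems.append("no *.safetensors or *.bin weight files found")
    return problems


if __name__ == "__main__":
    # Path taken from the --hf-ckpt-path argument used above.
    issues = check_hf_checkpoint("results/grpo/hf")
    print("Checkpoint looks complete" if not issues else issues)
```

An empty list means the expected files are present; it does not guarantee that the weights themselves are valid.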


## Run Eval for Model Trained with RL
Now evaluate the RL-trained model on the MATH-500 test split:

In [None]:
# Run evaluation for model trained with RL
!cd /root/verb-workspace/NeMo-RL && uv run python examples/run_eval.py \
    generation.model_name=$PWD/results/grpo/hf \
    data.dataset_name=HuggingFaceH4/MATH-500 \
    data.dataset_key=test

Loaded configuration from: /root/verb-workspace/NeMo-RL/examples/configs/eval.yaml
Overrides: {'generation': {'model_name': '/root/verb-workspace/NeMo-RL/results/grpo/hf'}, 'data': {'dataset_name': 'HuggingFaceH4/MATH-500', 'dataset_key': 'test'}}
Applied CLI overrides
Final config:
{'cluster': {'gpus_per_node': 1, 'num_nodes': 1},
 'data': {'dataset_key': 'test',
          'dataset_name': 'HuggingFaceH4/MATH-500',
          'max_input_seq_length': 2048,
          'problem_key': 'problem',
          'prompt_file': None,
          'solution_key': 'answer',
          'system_prompt_file': None},
 'env': {'math': {'num_workers': 8}},
 'eval': {'metric': 'pass@1', 'num_tests_per_prompt': 1, 'seed': 42},
 'generation': {'backend': 'vllm',
                'max_new_tokens': 2048,
                'model_name': '/root/verb-workspace/NeMo-RL/results/grpo/hf',
                'num_prompts_per_step': -1,
                'stop_strings': None,
                'stop_token_ids': None,
                

## Run Eval for Base Model
For comparison, run the same evaluation on the base model, `Qwen/Qwen2.5-1.5B`:

In [12]:
# Run evaluation for pre-RL base model
!cd /root/verb-workspace/NeMo-RL && uv run python examples/run_eval.py \
    generation.model_name=Qwen/Qwen2.5-1.5B \
    data.dataset_name=HuggingFaceH4/MATH-500 \
    data.dataset_key=test

Loaded configuration from: /root/verb-workspace/NeMo-RL/examples/configs/eval.yaml
Overrides: {'generation': {'model_name': 'Qwen/Qwen2.5-1.5B'}, 'data': {'dataset_name': 'HuggingFaceH4/MATH-500', 'dataset_key': 'test'}}
Applied CLI overrides
Final config:
{'cluster': {'gpus_per_node': 1, 'num_nodes': 1},
 'data': {'dataset_key': 'test',
          'dataset_name': 'HuggingFaceH4/MATH-500',
          'max_input_seq_length': 2048,
          'problem_key': 'problem',
          'prompt_file': None,
          'solution_key': 'answer',
          'system_prompt_file': None},
 'env': {'math': {'num_workers': 8}},
 'eval': {'metric': 'pass@1', 'num_tests_per_prompt': 1, 'seed': 42},
 'generation': {'backend': 'vllm',
                'max_new_tokens': 2048,
                'model_name': 'Qwen/Qwen2.5-1.5B',
                'num_prompts_per_step': -1,
                'stop_strings': None,
                'stop_token_ids': None,
                'temperature': 0.0,
                'top_k': -1,
     

## Results
The RL-trained model should achieve a noticeably higher pass@1 score on MATH-500 than the base model, which indicates that GRPO fine-tuning improved the model's mathematical reasoning. Both runs use the same evaluation settings (`pass@1`, one sample per prompt, seed 42), so the two scores are directly comparable.
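
With `num_tests_per_prompt` set to 1, pass@1 is the fraction of problems whose single generated answer is graded correct. The sketch below shows that calculation and one way to express the gap between the two runs. The function names are hypothetical and the correctness flags are toy values for illustration only, not results from this notebook.

```python
def pass_at_1(correct_flags):
    """Fraction of problems answered correctly, with one sample per problem."""
    flags = list(correct_flags)
    if not flags:
        raise ValueError("need at least one graded problem")
    return sum(bool(flag) for flag in flags) / len(flags)


def compare_scores(base_score, rl_score):
    """Return the (absolute, relative) improvement of rl_score over base_score."""
    absolute = rl_score - base_score
    relative = absolute / base_score if base_score > 0 else float("inf")
    return absolute, relative


if __name__ == "__main__":
    # Toy per-problem grades (True = correct); replace with your own results.
    base_flags = [True, False, False, True]
    rl_flags = [True, True, False, True]

    base = pass_at_1(base_flags)
    rl = pass_at_1(rl_flags)
    absolute, relative = compare_scores(base, rl)
    print(f"base={base:.3f} rl={rl:.3f} absolute=+{absolute:.3f} relative=+{relative:.1%}")
```

Reporting both the absolute and the relative change is useful because a small absolute gain can still be a large relative gain when the base score is low.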