# Hands-On Lab: Group Relative Policy Optimization (GRPO) with NeMo RL

## Tutorial Overview

Welcome to this hands-on lab where you'll learn to train a language model using **Group Relative Policy Optimization (GRPO)**, a cutting-edge reinforcement learning technique for improving language model performance on mathematical reasoning tasks.

### What You'll Learn

- **Reinforcement Learning Fundamentals**: Understanding how RL applies to language models
- **GRPO Algorithm**: How Group Relative Policy Optimization works
- **NeMo RL Framework**: NVIDIA's scalable toolkit for RL training
- **Practical Implementation**: Running GRPO training on a real model

### Prerequisites

- Basic understanding of machine learning concepts
- Familiarity with Python and Jupyter notebooks
- No prior reinforcement learning experience required!

### Lab Environment

- **Model**: Qwen/Qwen2.5-1.5B (1.5 billion parameter language model)
- **Task**: Mathematical reasoning improvement
- **Hardware**: Single GPU setup
- **Framework**: NeMo RL toolkit

---


## 📚 Lab Steps Overview

### Setup
- Environment Setup
- NeMo RL Framework Setup
### Training
- Run GRPO Training
- Understanding the Training Output
- TensorBoard Monitoring
- Deep Dive: What's Happening Under the Hood?
### Evaluation
- Running Evaluation and analyzing the results

### Next Steps & Resources

---


## 🛠️ Step 1: Environment Setup

We'll start by setting up our development environment. This includes installing necessary tools and preparing our system for the GRPO training.

### Installing UV Package Manager

**UV** is a fast Python package manager that will help us manage dependencies efficiently. It's particularly useful for machine learning projects with complex dependency trees.


In [2]:
# Install UV package manager for efficient dependency management
!pip install uv

Let's verify that UV is installed correctly and explore its capabilities:


In [3]:
# Verify UV installation and see available commands
!uv

An extremely fast Python package manager.

Usage: uv [OPTIONS] <COMMAND>

Commands:
  run      Run a command or script
  init     Create a new project
  add      Add dependencies to the project
  remove   Remove dependencies from the project
  version  Read or update the project's version
  sync     Update the project's environment
  lock     Update the project's lockfile
  export   Export the project's lockfile to an alternate format
  tree     Display the project's dependency tree
  tool     Run and install commands provided by Python packages
  python   Manage Python versions and installations
  pip      Manage Python packages with a pip-compatible interface
  venv     Create a virtual environment
  build    Build Python packages into source distri

### Installing System Dependencies

Next, we'll install essential system tools that NeMo RL requires for optimal performance:


In [4]:
# Update system package list to ensure we have the latest versions
!apt-get update

Hit:1 https://deb.nodesource.com/node_18.x nodistro InRelease
Hit:2 http://archive.ubuntu.com/ubuntu jammy InRelease                         
Hit:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:4 https://download.docker.com/linux/ubuntu jammy InRelease                 
Hit:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease                 
Hit:6 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:7 http://security.ubuntu.com/ubuntu jammy-security InRelease
Reading package lists... Done


In [5]:
# Install essential development tools:
# - jq: JSON processor for configuration handling
# - curl/wget: For downloading resources
# - git: Version control for cloning repositories
# - rsync: Efficient file transfer
# - less/vim: Text viewing and editing
!apt-get install -y --no-install-recommends jq curl git rsync wget less vim

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
curl is already the newest version (7.81.0-1ubuntu1.20).
git is already the newest version (1:2.34.1-1ubuntu1.15).
less is already the newest version (590-1ubuntu0.22.04.3).
vim is already the newest version (2:8.2.3995-1ubuntu2.24).
wget is already the newest version (1.21.2-2ubuntu1.1).
The following additional packages will be installed:
  libjq1 libonig5 libpopt0
Suggested packages:
  python3-braceexpand
The following NEW packages will be installed:
  jq libjq1 libonig5 libpopt0 rsync
0 upgraded, 5 newly installed, 0 to remove and 62 not upgraded.
Need to get 822 kB of archives.
After this operation, 2024 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/main amd64 libpopt0 amd64 1.18-3build1 [28.2 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 rsync amd64 3.2.7-0ubuntu0.22.04.4 [437 kB]
Get:3 http://archive.ubuntu.com/ubuntu jam

---

## 🚀 Step 2: Setting Up NeMo RL

Now we'll download and set up the NeMo RL framework. **NeMo RL** is NVIDIA's comprehensive toolkit for reinforcement learning with language models.

### Key Features of NeMo RL:
- **Scalable**: Supports single GPU to multi-node training
- **Efficient**: Optimized for NVIDIA hardware
- **Comprehensive**: Includes multiple RL algorithms (GRPO, DPO, SFT)
- **Production-Ready**: Used in NVIDIA's research and products

### Cloning the Repository


In [7]:
# Clone the official NeMo RL repository from NVIDIA
# This contains all the tools and examples we need for GRPO training
!cd /root/verb-workspace/ && git clone https://github.com/NVIDIA/NeMo-RL.git && cd /root/verb-workspace/NeMo-RL && git checkout v0.2.1

Cloning into 'NeMo-RL'...
remote: Enumerating objects: 20508, done.
remote: Counting objects: 100% (2228/2228), done.
remote: Compressing objects: 100% (914/914), done.
remote: Total 20508 (delta 1914), reused 1317 (delta 1313), pack-reused 18280 (from 3)
Receiving objects: 100% (20508/20508), 14.50 MiB | 34.94 MiB/s, done.
Resolving deltas: 100% (15348/15348), done.
Note: switching to 'v0.2.1'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 81d421f4 Cherry pick `docs: deepscaler guide on sidebar (

### Setting Up the Python Environment

We'll create an isolated Python environment to avoid conflicts with system packages:


In [8]:
# Create a virtual environment in the NeMo-RL directory
# This ensures clean dependency management
!cd /root/verb-workspace/NeMo-RL && uv venv

Using CPython 3.12.11
Creating virtual environment at: .venv
Activate with: source .venv/bin/activate


---

## 🎯 Step 3: Running GRPO Training

Now comes the exciting part! We'll run our GRPO training experiment. Let's break down what's happening:

### Training Configuration

Our training uses the following key parameters:

- **Model**: `Qwen/Qwen2.5-1.5B` - A 1.5B-parameter multilingual base model
- **Dataset**: `OpenMathInstruct-2` - A high-quality mathematical reasoning dataset
- **Task**: Mathematical problem solving with step-by-step reasoning
- **Monitoring**: TensorBoard for real-time training visualization

### What Happens During Training?

Each training step consists of four main stages:

![training](training.png)

1. **Data Preparation**: A batch of prompts (math problems), e.g. 32, is prepared for rollout.
2. **Rollout**: We generate multiple responses (e.g. 16) for each prompt in the batch using a fast inference engine such as vLLM. These responses are the rollout results.
3. **Reward and Advantage Calculation**: Each response is checked against the answer in the dataset to get its reward (typically 1 for a correct response and 0 for a wrong one). The advantage of each response is then computed by subtracting the average reward of its group (all responses to the same prompt) from that response's reward.
4. **Policy Update**: The loss is computed from the advantages, and the policy is updated to favor high-reward responses. Large changes are discouraged by clipping the update and by a KL penalty that measures how far the model has diverged from a reference model.

These steps are then repeated for the specified number of training iterations.
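
The advantage calculation in step 3 can be sketched in a few lines of plain Python. This is a minimal illustration of the mean-subtraction described above, not NeMo RL's actual code; the function name `group_relative_advantages` is hypothetical, and some GRPO implementations additionally divide by the group's standard deviation, which is omitted here.

```python
from statistics import mean


def group_relative_advantages(rewards_per_prompt):
    """Subtract each group's mean reward from every reward in that group.

    rewards_per_prompt: one inner list per prompt, holding the rewards of
    all responses sampled for that prompt.
    """
    advantages = []
    for group in rewards_per_prompt:
        baseline = mean(group)  # average reward of this prompt's responses
        advantages.append([reward - baseline for reward in group])
    return advantages


# Two prompts with four sampled responses each (1.0 = correct, 0.0 = wrong).
rewards = [
    [1.0, 0.0, 0.0, 1.0],  # half correct: baseline is 0.5
    [0.0, 0.0, 0.0, 0.0],  # all wrong: baseline is 0.0
]
advantages = group_relative_advantages(rewards)
print(advantages)  # [[0.5, -0.5, -0.5, 0.5], [0.0, 0.0, 0.0, 0.0]]
```

Note that a group in which every response receives the same reward yields all-zero advantages, so that prompt contributes no learning signal in this step.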


### Now let's start training!

In [None]:
# 🚀 MAIN TRAINING COMMAND
# This is where the magic happens! We're running GRPO training with the following key settings:
#
# Key Components Explained:
# 1. NeMo-RL directory: Contains all training scripts and configurations
# 2. run_grpo_math.py: The main GRPO training script for mathematical reasoning
# 3. logger.tensorboard_enabled=True: Enables real-time training visualization
#
# What you'll see during training:
# - Configuration loading and validation
# - Data pipeline setup (OpenMathInstruct-2 dataset)
# - Model initialization (Qwen/Qwen2.5-1.5B)
# - Training progress with metrics
# - TensorBoard logging for visualization

!cd /root/verb-workspace/NeMo-RL && uv run python examples/run_grpo_math.py grpo.num_prompts_per_step=16 \
grpo.num_generations_per_prompt=8 policy.train_global_batch_size=128 logger.tensorboard_enabled=True checkpointing.keep_top_k=5

         If the cache and target directories are on different filesystems, hardlinking may not be supported.
Installed 119 packages in 993ms
Loaded configuration from: /root/verb-workspace/NeMo-RL/examples/configs/grpo_math_1B.yaml
Overrides: ['logger.tensorboard_enabled=True']
Applied CLI overrides
Final config:
{'checkpointing': {'checkpoint_dir': 'results/grpo',
                   'enabled': True,
                   'higher_is_better': True,
                   'keep_top_k': 3,
                   'metric_name': 'val_reward',
                   'save_period': 10},
 'cluster': {'gpus_per_node': 1, 'num_nodes': 1},
 'data': {'dataset_name': 'OpenMathInstruct-2',
          'max_input_seq_length': 512,
          'prompt_file': 'examples/prompts/cot.txt',
          'system_prompt_file': None},
 'env': {'math': {'num_workers': 8}},
 'grpo': {'max_num_steps': 1000000,
          'max_rollout_turns': 1,
          'max_val_samples': 2

## 📊 Understanding the Training Output

Let's break down what you just saw during the training process:

### 1. 🔧 Configuration Phase
- **Config Loading**: The system loaded the GRPO configuration from `grpo_math_1B.yaml`
- **Parameter Overview**: You saw the complete training configuration, including model settings, data parameters, and optimization details. Below is an overview of the key parameters in the training configuration.

#### 🔄 Key GRPO Algorithm Parameters
Core settings that control the GRPO reinforcement learning process:

| **Parameter** | **Value Used** | **Description** |
|---------------|----------------|-----------------|
| `num_prompts_per_step` | 16 | Number of unique prompts processed in each training step. Higher values provide more diverse training signals but require more memory. |
| `num_generations_per_prompt` | 8 | How many response candidates to generate for each prompt. Used for GRPO's group comparison mechanism - minimum 2 required. |
| `max_rollout_turns` | 1 | Maximum conversation turns. Math problems typically require only 1 turn (single answer), but multi-turn reasoning can use higher values. |
| `val_period` | 10 | Frequency of validation runs (every N steps). Balance between monitoring progress and training efficiency. |
| `max_val_samples` | 256 | Maximum validation samples to evaluate. Controls validation time while ensuring representative performance measurement. |

#### 📊 Data & Environment Parameters
Configuration for datasets and computational resources:

| **Parameter** | **Value Used** | **Description** |
|---------------|----------------|-----------------|
| `dataset_name` | "OpenMathInstruct-2" | Training dataset. High-quality mathematical reasoning problems with step-by-step solutions. |
| `num_workers` | 8 | Parallel workers for data processing and reward computation. Speeds up training pipeline. |
| `gpus_per_node` | 1 | Number of GPUs allocated per compute node. Our setup uses single GPU training. |

#### 🤖 Model Policy Parameters
Configuration for the language model being trained:

| **Parameter** | **Value Used** | **Description** |
|---------------|----------------|-----------------|
| `model_name` | "Qwen/Qwen2.5-1.5B" | Base model to train. Qwen models are strong in mathematical reasoning and multilingual capabilities. |
| `train_global_batch_size` | 128 | Total number of sequences processed across all devices. |
| `train_micro_batch_size` | 4 | Sequences processed per GPU micro-batch. Smaller values reduce memory usage but may slow training. |
| `max_total_sequence_length` | 512 | Maximum token length for input + output. Balance between accommodating long reasoning chains and memory constraints. |
| `precision` | "bfloat16" | Numerical precision for computations. bfloat16 saves memory while maintaining training stability. |
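
The batch-related values in the tables above are linked by simple arithmetic. The sketch below works through the numbers used in this lab. It assumes the common convention that a global batch is split into micro-batches whose gradients are accumulated before one optimizer update; that convention is an assumption here and is not taken from the NeMo RL source. The variable `num_gpus` and the derived names are illustrative only.

```python
# Values from the parameter tables in this lab.
num_prompts_per_step = 16
num_generations_per_prompt = 8
train_global_batch_size = 128
train_micro_batch_size = 4
num_gpus = 1  # gpus_per_node * num_nodes in this single-GPU setup

# Total sequences produced by one rollout.
sequences_per_step = num_prompts_per_step * num_generations_per_prompt

# Assumed convention: one optimizer update per global batch of sequences.
updates_per_step = sequences_per_step // train_global_batch_size

# Assumed convention: micro-batches accumulated per optimizer update.
micro_batches_per_update = train_global_batch_size // (train_micro_batch_size * num_gpus)

print(sequences_per_step)        # 128
print(updates_per_step)          # 1
print(micro_batches_per_update)  # 32
```

Under these assumptions, each rollout of 128 sequences fills exactly one global batch, which is processed as 32 micro-batches of 4 sequences on the single GPU.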

#### 🎯 Generation Parameters
Settings that control how the model generates responses during training:

| **Parameter** | **Value Used** | **Description** |
|---------------|----------------|-----------------|
| `backend` | "vllm" | Generation engine. vLLM provides optimized inference with advanced batching and memory management. |
| `temperature` | 1.0 | Sampling randomness during generation. Higher values increase diversity; lower values make generation more deterministic. |
| `max_new_tokens` | 512 | Maximum tokens to generate per response. Should accommodate full solution explanations. |
| `top_p` | 1.0 | Nucleus sampling parameter. Restricts sampling to the smallest set of most likely tokens whose cumulative probability reaches p. A value of 1.0 keeps the full vocabulary. |
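
To make `temperature` and `top_p` concrete, here is a small, self-contained sketch of temperature scaling followed by nucleus (top-p) filtering over a toy four-token vocabulary. It illustrates the general technique only; it is not vLLM's implementation, and all function names are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=1.0):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize over the kept tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample one token id after temperature scaling and top-p filtering."""
    candidates = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    ids = list(candidates)
    weights = [candidates[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax_with_temperature(logits, temperature=1.0)
print(sorted(top_p_filter(probs, top_p=1.0)))  # [0, 1, 2, 3]: nothing is filtered
print(sorted(top_p_filter(probs, top_p=0.8)))  # [0, 1]: only the two most likely tokens remain
```

With the lab's settings (`temperature=1.0`, `top_p=1.0`), sampling is done from the model's unmodified distribution, which keeps the rollouts diverse enough for the group comparison to be informative.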


#### ⚖️ Loss Function Parameters
Settings that control how the model learns from rewards and prevents training instability:

| **Parameter** | **Value Used** | **Description** |
|---------------|----------------|-----------------|
| `reference_policy_kl_penalty` | 0 | Weight of the KL divergence penalty that keeps the model from deviating too far from the reference policy. A value of 0 disables the penalty. |
| `ratio_clip_min/max` | 0.2 | PPO-style clipping bounds for policy updates: the policy probability ratio is clipped to the range [1 - 0.2, 1 + 0.2]. Prevents destructively large updates that could harm training. |
| `token_level_loss` | true | Apply loss at individual token level rather than sequence level. Provides finer-grained learning signals. |
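
The clipping and token-level settings above can be illustrated with a short sketch of a PPO-style clipped surrogate loss. This is a simplified illustration under stated assumptions, not NeMo RL's loss function: the KL penalty term is left out, and the names `clipped_policy_loss`, `clip_eps`, `token_level_mean`, and `sequence_level_mean` are hypothetical.

```python
import math


def clipped_policy_loss(new_logprobs, old_logprobs, advantage, clip_eps=0.2):
    """Per-token clipped surrogate loss for a single response.

    The response's advantage is shared by all of its tokens.
    """
    losses = []
    for new_lp, old_lp in zip(new_logprobs, old_logprobs):
        ratio = math.exp(new_lp - old_lp)  # pi_new(token) / pi_old(token)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the more pessimistic of the clipped and unclipped objectives.
        losses.append(-min(ratio * advantage, clipped * advantage))
    return losses


def token_level_mean(per_response_losses):
    """Average over all tokens in the batch: long responses weigh more."""
    tokens = [loss for response in per_response_losses for loss in response]
    return sum(tokens) / len(tokens)


def sequence_level_mean(per_response_losses):
    """Average each response first, then average across responses."""
    per_sequence = [sum(r) / len(r) for r in per_response_losses]
    return sum(per_sequence) / len(per_sequence)


# A token whose probability rose by 1.5x, with a positive advantage of 0.5:
# the gain is capped at a ratio of 1.2, so the loss is -(1.2 * 0.5) = -0.6.
print(clipped_policy_loss([math.log(1.5)], [0.0], advantage=0.5))

# Token-level vs sequence-level averaging over a 1-token and a 3-token response.
batch = [[1.0], [0.0, 0.0, 0.0]]
print(token_level_mean(batch))     # 0.25
print(sequence_level_mean(batch))  # 0.5
```

The last two lines show why the `token_level_loss` choice matters: with token-level averaging every token counts equally, so longer responses contribute more to the update than they do with sequence-level averaging.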



### 2. 🏗️ Setup Phase
- **Data Loading**: Loaded 950,000 training samples and 50,000 validation samples from [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2)
- **Cluster Initialization**: Set up Ray distributed computing framework
- **Model Loading**: Initialized the Qwen/Qwen2.5-1.5B model with vLLM backend

Let's take a look at one sample from the dataset to better understand it:

![dataset](dataset.png)

- ```problem```: The mathematical question or problem statement that needs to be solved. This is the input prompt given to the AI model. Examples include word problems, algebra questions, geometry problems, etc.
- ```generated_solution```: The step-by-step solution and reasoning process generated by an AI model. This shows the detailed work and explanation of how to arrive at the answer, including mathematical steps and logical reasoning.
- ```expected_answer```: The correct numerical or text answer to the problem. This serves as the ground truth for evaluating whether the AI's solution is correct.
- ```problem_source```: The origin or source dataset where this problem was collected from. This helps track the diversity and provenance of the training data (e.g., "augmented_gsm8k", "augmented_math", etc.).

For RL training, we are only using the ```problem``` column and the ```expected_answer``` column to check the model's answers.
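
As a rough picture of how `expected_answer` becomes a reward, the sketch below extracts a final answer from a response and compares it with the ground truth. It assumes the prompt asks the model to put its final answer inside `\boxed{...}`, which is an assumption here, and it uses a plain string comparison. A real math verifier also has to recognize equivalent forms (for example `1/2` and `0.5`), which this sketch does not attempt. The function names are hypothetical.

```python
import re


def extract_boxed_answer(response):
    r"""Return the content of the last \boxed{...} in the response, or None.

    Simplification: nested braces inside the box are not supported.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None


def correctness_reward(response, expected_answer):
    """Return 1.0 if the extracted answer matches the expected one exactly, else 0.0."""
    predicted = extract_boxed_answer(response)
    if predicted is None:
        return 0.0  # no final answer found
    return 1.0 if predicted == expected_answer.strip() else 0.0


print(correctness_reward(r"6 * 7 = 42, so the answer is \boxed{42}.", "42"))  # 1.0
print(correctness_reward(r"The answer is \boxed{41}.", "42"))                 # 0.0
print(correctness_reward("I think the answer is 42.", "42"))                  # 0.0
```

The third call shows a side effect of this kind of reward: a response with the right number but the wrong format earns nothing, so the model is also pushed toward the expected answer format.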

### 3. 🎯 Training Phase

```
▶ Preparing batch... 
▶ Generating responses for batch of size 512...
...
Processed prompts:  15%|█▍        | 75/512 [00:03<00:08, 52.26it/s, est. speed input: 2090.58 toks/s, output: 3021.35 toks/s]
...
```

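**Rollout Stage: Batch Processing with Progress Monitoring**: The sample log above shows responses being generated for a batch of 32 prompts with 16 generations each (512 sequences in total), together with real-time speed metrics (prompts per second and input/output tokens per second). With the overrides used in this lab's training command (16 prompts with 8 generations each), the reported batch size is 128 instead.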


```
▶ Processing rewards...
▶ Computing advantages...
▶ Preparing for logprob inference...
▶ Computing logprobs...
▶ Preparing for training...
▶ Training policy...
Logged data to logs/exp_001/train_data_step58.jsonl
```
**Reward and Training Stage**: Calculated rewards based on mathematical correctness, computed advantages using the reward signals, then performed policy updates. The model learns to increase the likelihood of generating correct mathematical solutions while maintaining stability through clipped policy updates.


## 📈 Monitoring Training with TensorBoard

Since we enabled TensorBoard logging, you can monitor your training progress in real-time!

### Accessing TensorBoard

TensorBoard logs are saved in the `logs/exp_001/tensorboard` directory. Go to the `2.install_start_tensorboard` notebook to start the monitoring.

### Key Metrics to Watch

Below are the results from one of our experiments:

#### Training Metrics:

![Tensorboard Training Logs](tensorboard_training.png)

1. **Reward Trends**: Monitor the `train/reward` metric to see how well your model is learning to solve mathematical problems. A healthy training run should show a steady upward trend. This indicates the model is generating increasingly correct and well-reasoned mathematical solutions. Sudden drops or plateaus may indicate learning instability or reward hacking.

2. **Truncation Rate**: Monitor `train/truncation_rate` to track how frequently model responses are cut off due to exceeding the maximum sequence length. In the example shown, the truncation rate is too high, which typically indicates that the maximum sequence length is set too low for the complexity of mathematical reasoning required. High truncation rates can lead to incomplete solutions and reduced training effectiveness, as the model cannot fully express its reasoning process.

3. **Generation Tokens Per Sample**: Track `mean_gen_tokens_per_sample` to see the mean number of tokens generated for the sample rollouts. This indicates the model's response length during training, which is useful for identifying issues related to reasoning steps. Typically a successful RL run should have this metric growing, indicating that the model is reasoning more in its CoT. In our example, due to a short max total sequence length, this metric is instead going down.

4. **KL Divergence**: Track `train/kl_penalty` to see how much the current policy differs from the reference policy. Small values indicate the model is learning while staying close to its original behavior, which is crucial for stable GRPO training.

5. **Exploration Metrics**: Watch `train/approx_entropy` to monitor how diverse the model's outputs are. Values that are too low indicate the model is becoming too deterministic; values that are too high suggest excessive randomness.
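
To build intuition for the truncation rate and the mean generated tokens per sample, the following sketch computes both from a list of response lengths. It treats any response that reaches the generation limit as truncated; how NeMo RL defines truncation internally is not shown in this lab, so that rule and the function name `rollout_length_stats` are assumptions.

```python
def rollout_length_stats(response_token_counts, max_new_tokens):
    """Compute simple length statistics for one batch of rollouts."""
    # Assumed rule: a response that hits the limit was cut off.
    truncated = sum(1 for n in response_token_counts if n >= max_new_tokens)
    return {
        "truncation_rate": truncated / len(response_token_counts),
        "mean_gen_tokens_per_sample": sum(response_token_counts) / len(response_token_counts),
    }


# Four responses, two of which hit the 512-token limit.
stats = rollout_length_stats([120, 512, 340, 512], max_new_tokens=512)
print(stats)  # {'truncation_rate': 0.5, 'mean_gen_tokens_per_sample': 371.0}
```

When many responses hit the limit, the mean length saturates near the limit and truncated solutions tend to score zero, which is why a high truncation rate usually calls for a larger maximum sequence length.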



#### Validation Metrics

![Tensorboard Validation Logs](tensorboard_validation.png)

6. **Validation Accuracy**: Track `validation/accuracy` to measure your model's performance on unseen mathematical problems. You should observe a gradual improvement as training progresses. This metric directly reflects the model's ability to solve new math problems correctly. If validation accuracy stops improving while training reward continues to rise, this may indicate overfitting.

7. **Validation Average Response Length**: Monitor `validation/avg_length` to ensure your model maintains appropriate response lengths. Typically the average response length gradually increases as training progresses, indicating that the model is learning to reason with more steps. Extremely long responses might indicate verbosity issues, while very short responses could suggest the model isn't showing enough reasoning steps.

#### Performance Metrics (Timing)

![Tensorboard Timing logs](tensorboard_timing.png)

8. **Timing of the Steps**: Watch timing metrics like `timing/train/generation`, `timing/train/reward_calculation`, and `timing/train/policy_training` to identify performance bottlenecks. Monitoring these helps optimize training efficiency and resource utilization.

---


## 🎯 Analyzing Your Results and Running Evaluation

After running the training, let's examine what was accomplished:

### Training Artifacts Created

- **Checkpoints**: Model weights saved in `results/grpo/`
- **Logs**: Training metrics in `logs/exp_001/`
- **TensorBoard**: Visualization data for analysis

### Expected Improvements

With continued training, the model should show:
- **Better Mathematical Reasoning**: More accurate problem-solving
- **Improved Step-by-Step Logic**: Clearer solution explanations
- **Higher Success Rates**: Correct answers on more problems

### Run Evaluation using Saved Model Checkpoint
Go to the `3.run_evaluation` notebook to evaluate the model performance.

---


## 🚀 Next Steps and Extensions

### Explore More Examples and Guides (After the Lab)

After this lab session, you are encouraged to explore additional resources and examples to deepen your understanding of reinforcement learning:


#### NVIDIA NeMo RL Official Guides
Explore the comprehensive **NVIDIA NeMo RL documentation and guides** at [https://github.com/NVIDIA-NeMo/RL/tree/main/docs/guides](https://github.com/NVIDIA-NeMo/RL/tree/main/docs/guides) which includes:

**Key Guides and Examples to Explore:**
- **[GRPO In-Depth Walkthrough](https://github.com/NVIDIA-NeMo/RL/blob/main/docs/guides/grpo.md)** - Deep dive into Group Relative Policy Optimization
- **[GRPO on DeepScaler Example](https://github.com/NVIDIA-NeMo/RL/blob/main/docs/guides/grpo-deepscaler.md)** - Example of context-length extension training using `DeepSeek-R1-Distill-Qwen-1.5B` on the `DeepScaleR` dataset
- **[Supervised Fine-Tuning (SFT) Guide](https://github.com/NVIDIA-NeMo/RL/blob/main/docs/guides/sft.md)** - Learn how to perform supervised fine-tuning with NeMo RL
- **[Direct Preference Optimization (DPO) Guide](https://github.com/NVIDIA-NeMo/RL/blob/main/docs/guides/dpo.md)** - Understand preference-based training methods


**Additional Resources:**
- **[Evaluation Guide](https://github.com/NVIDIA-NeMo/RL/blob/main/docs/guides/eval.md)** - Learn how to evaluate your trained models
- **[Model Quirks Documentation](https://github.com/NVIDIA-NeMo/RL/blob/main/docs/model-quirks.md)** - Important considerations for different model types
- **[Add New Models Guide](https://github.com/NVIDIA-NeMo/RL/blob/main/docs/adding-new-models.md)** - Instructions for integrating custom models
- **[Profile GPU with Nsys](https://github.com/NVIDIA-NeMo/RL/blob/main/docs/nsys-profiling.md)** - Instructions for profiling GPU utilization during training using NVIDIA Nsight Systems

These resources provide comprehensive guidance for advanced RL techniques, model customization, and scaling to production environments beyond what we covered in this lab.


## 📝 Lab Summary

Congratulations! You've successfully completed the GRPO hands-on lab. Here's what you accomplished:

### ✅ Key Achievements

- **Environment Setup**: Installed and configured NeMo RL toolkit
- **GRPO Training**: Ran reinforcement learning training on a language model
- **Mathematical Reasoning**: Improved model performance on math problems
- **Monitoring**: Used TensorBoard for training visualization
- **Understanding**: Learned core RL concepts and GRPO algorithm

### 🧠 Concepts Learned

- **Reinforcement Learning**: How RL applies to language models
- **GRPO Algorithm**: Group-based policy optimization technique
- **Mathematical Reasoning**: Training models for step-by-step problem solving
- **Distributed Training**: Using Ray for scalable machine learning
- **Model Optimization**: Efficient training on GPU hardware

### 🔧 Technical Skills

- **NeMo RL Framework**: Practical experience with NVIDIA's RL toolkit
- **Configuration Management**: Understanding training parameter tuning
- **Performance Monitoring**: Using metrics to track training progress
- **Troubleshooting**: Handling common training issues

### 🌟 Impact

You've gained hands-on experience with cutting-edge AI techniques that are:
- **Advancing AI Capabilities**: Improving reasoning in language models
- **Industry Relevant**: Used in production AI systems
- **Research-Grade**: Based on latest academic developments
- **Scalable**: Applicable from research to production

---

Thank you for participating in this hands-on GRPO lab! 🎉
