# VERL-based Reasoning Reward Model Training Guide

## 📖 Overview

This document provides a comprehensive guide for training Reasoning Reward Models using the VERL framework. Through this tutorial, you will learn how to configure the environment, prepare data, design reward functions, and execute the training pipeline.

## 🏗️ System Architecture

### Core Components

The VERL reasoning reward model training system consists of three core components:

#### 1. **Training Dataset** - Inherits from `BaseTrainDataset`
   - Supports multiple data formats and evaluation criteria
   - Provides a flexible conversation template system
   - Integrates custom reward functions

#### 2. **Prompt Template** - Based on `BasePromptTemplate`
   - Defines structured output format
   - Supports extensible scoring criteria
   - Adapts to various task types

#### 3. **Reward Function** - Customizable reward computation module
   - Supports multiple reward calculation methods (pointwise, pairwise, etc.)
   - Provides flexible evaluation metric configuration
   - Reports real-time accuracy and reward score statistics

## 🔧 Environment Configuration

### Runtime Requirements

Create a `runtime_env.yaml` configuration file:

```yaml
# runtime_env.yaml
excludes: ["/.git/"]
env_vars:
  TORCH_NCCL_AVOID_RECORD_STREAMS: "1"
  PYTORCH_CUDA_ALLOC_CONF: "expandable_segments: False"
  WANDB_API_KEY: "your_wandb_api_key"
  WANDB_BASE_URL: "your_wandb_base_url"
  HYDRA_FULL_ERROR: "1"
```

### Dependency Installation

Ensure the following essential dependencies are installed:
- `verl==0.4.0` (core framework)


## 🚀 Quick Start

### Step 1: Prepare Training Data

Training data should conform to the `DataSample` format specification. For detailed data loading and preprocessing steps, please refer to the data loading section.

### Step 2: Start Ray Distributed Cluster

#### Master Node Startup:
```bash
ray start --head --node-ip-address $MASTER_ADDR --num-gpus 8
```

#### Worker Nodes Startup:
```bash
ray start --address=$MASTER_ADDR:6379 --num-gpus 8
```

### Step 3: Execute Training Pipeline

```bash
# Navigate to training directory
cd rm_gallery/gallery/train/<your_method>

# Start the training script (the pointwise example below names it run_pointwise.sh)
bash run_train.sh
```

### Data Format Description

- **Input Data Format**: All input data must conform to the `DataSample` format

## 🧩 Core Component Details

### Custom Training Dataset

Here is a minimal example of a custom training dataset. `YourRewardModule` and `YourTemplate` are placeholders for your own classes:

```python
class CustomTrainDataset(BaseTrainDataset):
    def __init__(self, *args, **kwargs):
        # Initialize reward module
        self.reward_module = YourRewardModule(
            name="custom_reward",
            template=YourTemplate,
            examples=self._get_examples(),
            llm=None,
        )
        super().__init__(*args, **kwargs)
    
    def _build_messages(self, example):
        # Build formatted messages
        result = self.reward_module.format(sample=example)
        return [{"role": "user", "content": result}]
```

> **Important Note: Reasoning Model Configuration**
> 
> When training reasoning reward models, pay attention to the following configuration:
> - For reasoning models (e.g., Qwen3):
>   - `apply_chat_template` with `enable_thinking=True`
>   - `format` with `enable_thinking=False`
> - For non-reasoning models:
>   - `apply_chat_template` with `enable_thinking=False`
>   - `format` with `enable_thinking=True`
>
> ```python
> # Reasoning model configuration example
> self.tokenizer.apply_chat_template(
>     messages, add_generation_prompt=True, tokenize=False, enable_thinking=True
> )
> 
> result = self.reward_module.format(sample=example, enable_thinking=False)
> ```

### Reward Function Design

The reward function scores each predicted rating against the ground-truth label and returns the signal used for training:

```python
def calculate_reward(predicted_score, true_score):
    """
    Custom reward function
    
    Args:
        predicted_score (float): Model predicted score
        true_score (float): Ground truth score
        
    Returns:
        float: Calculated reward value
    """
    if true_score is None:
        return 0.0
    
    # Reward calculation based on absolute error
    abs_error = abs(predicted_score - true_score)
    max_error = 4  # Adjust based on scoring range (e.g., 0-4 scale)
    
    # Linear decay reward function
    reward = 1.0 - (abs_error / max_error)
    return max(0.0, reward)  # Ensure non-negative reward
```
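
To make the linear decay concrete, the sketch below tabulates the reward for every possible prediction on a 0-4 scale with the ground truth fixed at 4. It restates the function above with `max_error` exposed as a parameter (an illustrative change, not part of the original signature) so that the block runs on its own.

```python
def linear_reward(predicted_score, true_score, max_error=4):
    """Linear-decay reward, mirroring calculate_reward above."""
    if true_score is None:
        return 0.0
    abs_error = abs(predicted_score - true_score)
    # Clamp at zero so errors beyond max_error never yield a negative reward.
    return max(0.0, 1.0 - abs_error / max_error)


# Ground truth fixed at 4; predictions sweep the whole 0-4 range.
rewards = {pred: linear_reward(pred, 4) for pred in range(5)}
for pred, reward in rewards.items():
    print(f"predicted={pred} true=4 reward={reward:.2f}")
```

Each point of error costs 0.25 reward, so a prediction that is off by the full range earns nothing.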

### Prompt Template System

The template system defines the structured format for model input and output:

```python
from pydantic import Field  # BasePromptTemplate itself comes from RM-Gallery


class YourTemplate(BasePromptTemplate):
    score: int = Field(description="Scoring result (0-4 scale)")
    
    @classmethod
    def format(cls, desc, examples, query, answer, **kwargs):
        """
        Format prompt template
        
        Args:
            desc (str): Task description
            examples (str): Example data
            query (str): User query
            answer (str): Model response
        
        Returns:
            str: Formatted prompt text
        """
        return f"""# Task Description
{desc}

# Reference Examples
{examples}

# User Query
{query}

# Model Response
{answer}

# Output Requirements
{cls.schema(**kwargs)}
        """
```
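
The reward function needs a numeric `predicted_score`, which has to be recovered from the model's text output. The sketch below assumes the response wraps the score in `<score>...</score>` tags. Both that tag format and the `parse_score` helper are assumptions made for illustration, so adapt the pattern to whatever `cls.schema()` actually asks the model to emit.

```python
import re
from typing import Optional

# Assumed output format: "... <score>3</score> ..." (hypothetical; check your template's schema).
SCORE_PATTERN = re.compile(r"<score>\s*(-?\d+)\s*</score>")


def parse_score(response: str, low: int = 0, high: int = 4) -> Optional[int]:
    """Return the last in-range integer score found in the response, or None."""
    matches = SCORE_PATTERN.findall(response)
    if not matches:
        return None
    score = int(matches[-1])  # the last tag is treated as the final answer
    return score if low <= score <= high else None


print(parse_score("<think>Clear and complete.</think>\n<score>3</score>"))
print(parse_score("no tags here"))
print(parse_score("<score>9</score>"))
```

Returning `None` for malformed or out-of-range output lets the caller assign a fixed low reward instead of raising an exception in the middle of training.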


## 💡 Implementation Example: Pointwise Training

### Project Structure

```
rm_gallery/gallery/train/pointwise/
├── run_pointwise.sh    # Training launch script
├── dataset.py          # Dataset implementation
├── reward_fn.py        # Reward function implementation
├── template.py         # Prompt template
└── runtime_env.yaml    # Ray runtime environment configuration
```

### Experiment Configuration

#### Dataset Selection
- **Data Source**: `nvidia/helpsteer2` dataset
- **Evaluation Dimension**: `helpfulness` annotation information
- **Data Scale**: ~8K samples for training set, ~8K samples for validation set
- **Scoring Range**: 0-4 scale (0=worst, 4=best)

#### Model Configuration
- **Base Model**: `Qwen3-8B`

#### Training Parameters
For the full training parameter configuration, see `rm_gallery/gallery/train/pointwise/run_pointwise.sh`.

### Pointwise Training Dataset

```python
class PointwiseTrainDataset(BaseTrainDataset):
    """
    Pointwise training dataset implementation
    
    This dataset is specifically designed for pointwise scoring tasks,
    evaluating the quality of each sample independently
    """
    def __init__(self, *args, **kwargs):
        self.reward_module = PointwiseReward(
            name="pointwise_reward",
            template=PointwiseTemplate,
            examples=self._get_examples(),
            llm=None,
        )
        super().__init__(*args, **kwargs)
```

### Pointwise Reward Function

```python
import math

def pointwise_reward(predicted_score, true_score):
    """
    Pointwise scoring reward function
    
    Uses exponential decay mechanism to penalize prediction errors
    
    Args:
        predicted_score (float): Model predicted score
        true_score (float): Ground truth score
        
    Returns:
        float: Reward in (0, 1]; 0.0 is returned only when true_score is None
    """
    if true_score is None:
        return 0.0
    
    # Calculate absolute error
    abs_error = abs(predicted_score - true_score)
    max_error = 4  # Scoring range 0-4
    
    # Exponential decay reward calculation
    k = 2.0  # Decay coefficient, controls penalty strength
    error_ratio = abs_error / max_error
    reward = math.exp(-k * error_ratio)
    
    return float(reward)
```
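
For comparison with the linear variant shown earlier, the following sketch evaluates both decay shapes at each possible absolute error on the 0-4 scale. The helper names are illustrative; the constants (`k = 2.0`, `max_error = 4`) are the ones used above.

```python
import math

MAX_ERROR = 4  # scoring range 0-4
K = 2.0        # decay coefficient used in pointwise_reward


def linear_decay(abs_error):
    """Reward falls by a constant amount per point of error."""
    return max(0.0, 1.0 - abs_error / MAX_ERROR)


def exponential_decay(abs_error):
    """Reward falls fastest for the first points of error."""
    return math.exp(-K * abs_error / MAX_ERROR)


for err in range(MAX_ERROR + 1):
    print(
        f"error={err} "
        f"linear={linear_decay(err):.3f} "
        f"exponential={exponential_decay(err):.3f}"
    )
```

With these constants the exponential reward never reaches zero: the maximum error of 4 still earns exp(-2), roughly 0.135, whereas the linear reward drops to 0. A one-point miss, on the other hand, costs more under the exponential form (about 0.607 versus 0.75), so it penalizes near misses more strongly.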

### Training Results Analysis

#### Training Curves

We evaluate the model's learning effectiveness through training and validation curves:

**Training Reward Curve**:
![Training Reward Curve](../images/data/pointwise_train.jpg)

**Validation Reward Curve**:
![Validation Reward Curve](../images/data/pointwise_val.jpg)

## 🔗 Related Resources

### Framework Documentation
- **VERL Framework**: https://github.com/volcengine/verl - Core training framework
- **Ray Distributed**: https://docs.ray.io/ - Distributed computing platform
- **vLLM Inference**: https://docs.vllm.ai/ - High-performance inference engine

### Dataset Resources
- **HelpSteer2**: https://huggingface.co/datasets/nvidia/helpsteer2 - Human preference dataset
- **UltraFeedback**: https://huggingface.co/datasets/openbmb/UltraFeedback - Multi-dimensional feedback data
