# Data Loading Module

## Overview

The data loading module provides a unified interface for loading data from multiple sources and converting it to a standardized format. The module lives in the `rm_gallery/core/data/load/` directory.

## Core Architecture

### Design Patterns
- **Strategy Pattern**: Supports different data loading strategies
  - `FileDataLoadStrategy`: Local file loading
  - `HuggingFaceDataLoadStrategy`: HuggingFace dataset loading

- **Registry Pattern**: Dynamic registration and management of data converters
  - `DataConverterRegistry`: Central registry of converters
  - Supports runtime registration of new data format converters

- **Template Method Pattern**: Unified data conversion interface
  - `DataConverter`: Abstract converter base class
  - Various concrete converters implement specific format conversion logic
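
As a rough illustration of how the three patterns above fit together, the following sketch uses only the standard library. Every name in it (`BaseConverter`, `ConverterRegistry`, `QAConverter`, `load_from_memory`, `load_records`) is hypothetical and chosen for this example; it is not the RM-Gallery implementation.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Type


class BaseConverter(ABC):
    """Template method: `convert` fixes the workflow, subclasses supply one step."""

    def convert(self, record: dict) -> dict:
        sample = self.to_sample(record)             # format-specific step
        sample["converter"] = type(self).__name__   # shared post-processing
        return sample

    @abstractmethod
    def to_sample(self, record: dict) -> dict:
        ...


class ConverterRegistry:
    """Registry: maps a format name to a converter class at runtime."""

    _converters: Dict[str, Type[BaseConverter]] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[Type[BaseConverter]], Type[BaseConverter]]:
        def decorator(converter_cls: Type[BaseConverter]) -> Type[BaseConverter]:
            cls._converters[name] = converter_cls
            return converter_cls
        return decorator

    @classmethod
    def get(cls, name: str) -> BaseConverter:
        return cls._converters[name]()


@ConverterRegistry.register("qa")
class QAConverter(BaseConverter):
    def to_sample(self, record: dict) -> dict:
        return {"input": record["question"], "output": record["answer"]}


# Strategy: interchangeable callables that each know how to fetch raw records.
def load_from_memory(config: dict) -> List[dict]:
    return list(config["records"])


LOAD_STRATEGIES = {"memory": load_from_memory}


def load_records(strategy_type: str, data_source: str, config: dict) -> List[dict]:
    raw = LOAD_STRATEGIES[strategy_type](config)      # pick the loading strategy
    converter = ConverterRegistry.get(data_source)    # look up the converter
    return [converter.convert(record) for record in raw]


samples = load_records("memory", "qa", {"records": [{"question": "2+2?", "answer": "4"}]})
print(samples)
```

The loading strategy and the converter are selected independently by name, which is why a new data source or a new format can be added without touching the other.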

## Supported Data Sources

### Local Files
- **Supported Formats**: JSON (`.json`), JSONL (`.jsonl`), Parquet (`.parquet`)
- **Core Features**: 
  - Automatic file type detection
  - Batch file loading
  - Recursive directory scanning
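
The features listed above can be pictured with a small stand-alone sketch: choose a reader from the file extension, scan directories recursively, and stop once a limit is reached. The helper names (`find_data_files`, `read_records`, `load_path`) are assumptions made for this example and do not reflect how `FileDataLoadStrategy` is actually written. The Parquet branch assumes `pandas` is installed and is not exercised by the demo.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator, List, Optional

SUPPORTED_SUFFIXES = {".json", ".jsonl", ".parquet"}


def find_data_files(path: str) -> List[Path]:
    """Return supported files under `path`, scanning directories recursively."""
    root = Path(path)
    if root.is_file():
        return [root] if root.suffix.lower() in SUPPORTED_SUFFIXES else []
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )


def read_records(file_path: Path) -> Iterator[dict]:
    """Pick a reader from the file extension and yield one dict per record."""
    suffix = file_path.suffix.lower()
    if suffix == ".jsonl":
        with file_path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)
    elif suffix == ".json":
        with file_path.open(encoding="utf-8") as f:
            data = json.load(f)
        # A JSON file may hold a single object or a list of objects.
        yield from (data if isinstance(data, list) else [data])
    elif suffix == ".parquet":
        import pandas as pd  # optional dependency, imported only when needed
        yield from pd.read_parquet(file_path).to_dict(orient="records")
    else:
        raise ValueError(f"Unsupported file type: {suffix}")


def load_path(path: str, limit: Optional[int] = None) -> List[dict]:
    """Load records from a file or directory, stopping at `limit` if given."""
    records: List[dict] = []
    for file_path in find_data_files(path):
        for record in read_records(file_path):
            if limit is not None and len(records) >= limit:
                return records
            records.append(record)
    return records


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "a.json").write_text(json.dumps([{"id": 1}, {"id": 2}]), encoding="utf-8")
    (root / "nested" / "b.jsonl").write_text(
        '{"id": 3}\n{"id": 4}\n{"id": 5}\n', encoding="utf-8"
    )
    (root / "notes.txt").write_text("ignored", encoding="utf-8")

    all_records = load_path(tmp)
    limited = load_path(tmp, limit=3)

print(len(all_records), len(limited))
```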

### Hugging Face Datasets
- **Data Source**: Hugging Face Hub public datasets
- **Core Features**: 
  - Streaming data loading
  - Flexible configuration options
  - Support for dataset sharding
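
Streaming matters most in combination with a record limit: records are produced lazily, so only the requested number is ever materialized. The sketch below shows the idea without any network access. `fake_stream` is a hypothetical stand-in for a streamed dataset, and `take` is a helper named for this example only.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def fake_stream(total: int) -> Iterator[dict]:
    """Hypothetical stand-in for a streamed dataset: yields records lazily."""
    for i in range(total):
        yield {"id": i, "prompt": f"question {i}"}


def take(stream: Iterable[dict], limit: int) -> List[dict]:
    """Materialize at most `limit` records without exhausting the stream."""
    return list(islice(stream, limit))


stream = fake_stream(1_000_000)
first_three = take(stream, 3)
print([record["id"] for record in first_three])

# The stream resumes where `take` stopped; nothing past the limit was consumed.
next_record = next(stream)
```

With a non-streaming load the whole split is typically fetched first and truncated afterwards, so the streaming mode is the cheaper choice when only a small sample of a large dataset is needed.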

## Built-in Data Converters

### ChatMessageConverter (`chat_message`)
Handles data in chat conversation format:
```python
{
    "messages": [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hello! How can I help you?"}
    ]
}
```

### GenericConverter (`*`)
Generic converter that automatically recognizes common fields:
```python
{
    "prompt": "User input",      # Alternative field names: question, input, text, instruction
    "response": "Model reply"    # Alternative field names: answer, output, completion
}
```
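
One way to picture this field recognition is a lookup over candidate keys. The sketch below is a stand-alone illustration, not the `GenericConverter` source: the helper names and the priority order of the candidate fields are assumptions.

```python
from typing import Optional, Sequence, Tuple

# Candidate keys taken from the field lists above; the ordering is an assumption.
PROMPT_FIELDS = ("prompt", "question", "input", "text", "instruction")
RESPONSE_FIELDS = ("response", "answer", "output", "completion")


def first_present(record: dict, candidates: Sequence[str]) -> Optional[str]:
    """Return the first non-empty value found under any candidate key."""
    for key in candidates:
        value = record.get(key)
        if value:  # skips missing keys and empty strings
            return str(value)
    return None


def extract_pair(record: dict) -> Tuple[Optional[str], Optional[str]]:
    """Resolve the prompt and response regardless of which alias is used."""
    return first_present(record, PROMPT_FIELDS), first_present(record, RESPONSE_FIELDS)


pair = extract_pair({"question": "What is 2 + 2?", "answer": "4"})
prompt_only = extract_pair({"text": "hi"})
print(pair, prompt_only)
```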

### Supported Benchmark Datasets

Converters for the following benchmark datasets are built in (located in `rm_gallery/gallery/data/load/`):

- **rewardbench**
- **rewardbench2** 
- **helpsteer2**
- **prmbench**
- **rmbbenchmark_bestofn**
- **rmbbenchmark_pairwise**

Each dataset has a corresponding dedicated converter that can correctly handle its specific data format and field structure.

## Quick Start

### Local File Loading


```python
# Option 1: instantiate the DataLoad class directly
from rm_gallery.core.data.load.base import DataLoad
import rm_gallery.core.data     # Core strategy registration
import rm_gallery.gallery.data  # Extended strategy registration

# Configure local file loading parameters
config = {
    "path": "./data/reward-bench-2/data/test-00000-of-00001.parquet",
    "limit": 1000,  # Limit the number of data items to load
}

# Create data loader
loader = DataLoad(
    name="rewardbench2",           # Dataset name
    load_strategy_type="local",    # Use local file loading strategy
    data_source="rewardbench2",    # Specify data source format converter
    config=config                  # Pass configuration parameters
)

# Execute data loading
dataset = loader.run()

# Output dataset size
print(f"Successfully loaded {len(dataset)} data items")
```

Output:

```
Successfully loaded 1000 data items
```


```python
# Option 2: use the factory functions
from rm_gallery.core.data.load.base import create_load_module
from rm_gallery.core.data.build import create_build_module
import rm_gallery.core.data     # Core strategy registration
import rm_gallery.gallery.data  # Extended strategy registration

config = {
    "path": "./data/reward-bench-2/data/test-00000-of-00001.parquet",
    "limit": 1000,  # Limit the number of data items to load
}

# Create loading module
load_module = create_load_module(
    name="rewardbench2",
    load_strategy_type="local",
    data_source="rewardbench2",
    config=config
)

# Create complete pipeline
pipeline = create_build_module(
    name="load_pipeline",
    load_module=load_module
)

# Run pipeline
result = pipeline.run()
print(f"Successfully loaded {len(result)} data items")
```


### Hugging Face Dataset Loading


```python
# Option 1: instantiate the DataLoad class directly
from rm_gallery.core.data.load.base import DataLoad
import rm_gallery.core.data     # Core strategy registration
import rm_gallery.gallery.data  # Extended strategy registration

# Configure Hugging Face dataset loading parameters
config = {
    "huggingface_split": "test",  # Dataset split (train/test/validation)
    "limit": 1000,                # Limit the number of data items to load
    "streaming": False            # Whether to use streaming loading
}

# Create data loader
loader = DataLoad(
    name="allenai/reward-bench-2",     # Hugging Face dataset path
    load_strategy_type="huggingface",  # Use Hugging Face loading strategy
    data_source="rewardbench2",        # Specify data source format converter
    config=config                      # Pass configuration parameters
)

# Execute data loading
dataset = loader.run()

# Output dataset size
print(f"Successfully loaded {len(dataset)} data items from Hugging Face")
```


```python
# Option 2: use the factory functions
from rm_gallery.core.data.load.base import create_load_module
from rm_gallery.core.data.build import create_build_module
import rm_gallery.core.data     # Core strategy registration
import rm_gallery.gallery.data  # Extended strategy registration

config = {
    "huggingface_split": "test",  # Dataset split (train/test/validation)
    "limit": 1000,                # Limit the number of data items to load
    "streaming": False            # Whether to use streaming loading
}

# Create loading module
load_module = create_load_module(
    name="allenai/reward-bench-2",
    load_strategy_type="huggingface",
    data_source="rewardbench2",
    config=config
)

# Create complete pipeline
pipeline = create_build_module(
    name="load_pipeline",
    load_module=load_module
)

# Run pipeline
result = pipeline.run()
print(f"Successfully loaded {len(result)} data items")
```


### Data Export

The module also includes export support: a loaded dataset can be written as JSONL, Parquet, or JSON, and optionally split into training and test sets.


```python
from rm_gallery.core.data.load.base import create_load_module
from rm_gallery.core.data.build import create_build_module
from rm_gallery.core.data.export import create_export_module
import rm_gallery.core.data     # Core strategy registration
import rm_gallery.gallery.data  # Extended strategy registration

config = {
    "path": "./data/reward-bench-2/data/test-00000-of-00001.parquet",
    "limit": 1000,  # Limit the number of data items to load
}

# Create loading module
load_module = create_load_module(
    name="rewardbench2",
    load_strategy_type="local",
    data_source="rewardbench2",
    config=config
)

# Create export module
export_module = create_export_module(
    name="rewardbench2",
    config={
        "output_dir": "./exports",
        "formats": ["jsonl"],
        "split_ratio": {"train": 0.8, "test": 0.2}
    }
)

# Create complete pipeline
pipeline = create_build_module(
    name="load_pipeline",
    load_module=load_module,
    export_module=export_module
)

# Run pipeline
result = pipeline.run()
print(f"Successfully loaded {len(result)} data items")
```
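
To make the effect of a `split_ratio` such as `{"train": 0.8, "test": 0.2}` concrete, here is a stand-alone sketch of a train/test split followed by a JSONL export. It is an illustration under assumptions, not the export module's code: the helper names, the seeded shuffle, and the `demo_<split>.jsonl` file naming are all invented for this example.

```python
import json
import random
import tempfile
from pathlib import Path
from typing import Dict, List


def split_records(records: List[dict], split_ratio: Dict[str, float], seed: int = 42) -> Dict[str, List[dict]]:
    """Shuffle deterministically, then cut the list at the train ratio."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * split_ratio["train"])
    return {"train": shuffled[:cut], "test": shuffled[cut:]}


def write_jsonl(records: List[dict], path: Path) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


records = [{"id": i} for i in range(10)]
splits = split_records(records, {"train": 0.8, "test": 0.2})

written = {}
with tempfile.TemporaryDirectory() as tmp:
    for split_name, rows in splits.items():
        out_path = Path(tmp) / f"demo_{split_name}.jsonl"
        write_jsonl(rows, out_path)
        # Count the lines that actually landed on disk.
        written[split_name] = len(out_path.read_text(encoding="utf-8").splitlines())

print(written)
```

Fixing the shuffle seed keeps the split reproducible across runs, which is useful when the exported test set is reused for later comparisons.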


## Data Output Format

### BaseDataSet Structure
All loaded data is encapsulated as a `BaseDataSet` object:
```python
BaseDataSet(
    name="dataset_name",           # Dataset name
    metadata={                     # Metadata information
        "source": "data_source",
        "strategy_type": "local|huggingface",
        "config": {...}
    },
    datas=[DataSample(...), ...]   # List of standardized data samples
)
```

### DataSample Structure
Each data sample is uniformly converted to `DataSample` format:
```python
DataSample(
    unique_id="md5_hash_id",        # Unique identifier for the data
    input=[                         # Input message list
        ChatMessage(role="user", content="...")
    ],  
    output=[                        # Output data list
        DataOutput(answer=Step(...))
    ],      
    source="data_source_name",      # Data source name
    task_category="chat|qa|instruction_following|general",  # Task category
    metadata={                      # Detailed metadata
        "raw_data": {...},          # Raw data
        "load_strategy": "ConverterName",  # Converter used
        "source_file_path": "...",  # Source file path (local files)
        "dataset_name": "...",      # Dataset name (HF datasets)
        "load_type": "local|huggingface"   # Loading method
    }
)
```
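
The structure above can be mimicked with plain dataclasses to see how a raw record maps onto it. This is a simplified sketch: `SketchSample` and `make_unique_id` are hypothetical, and hashing a canonical JSON dump of the raw record is only one plausible way to derive an MD5-based `unique_id`. The source does not state what the real converters hash.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


def make_unique_id(raw: dict) -> str:
    """Hash a canonical JSON dump so that key order does not change the ID."""
    canonical = json.dumps(raw, sort_keys=True, ensure_ascii=False)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


@dataclass
class SketchSample:
    """Simplified stand-in for DataSample, using plain dicts for messages."""
    unique_id: str
    input: List[Dict[str, str]]
    output: List[Dict[str, Any]]
    source: str
    metadata: Dict[str, Any] = field(default_factory=dict)


raw = {"prompt": "Hi", "response": "Hello!"}
sample = SketchSample(
    unique_id=make_unique_id(raw),
    input=[{"role": "user", "content": raw["prompt"]}],
    output=[{"answer": {"role": "assistant", "content": raw["response"]}}],
    source="demo",
    metadata={"raw_data": raw, "load_type": "local"},
)

# The same record with its keys in a different order yields the same ID.
same_id = make_unique_id({"response": "Hello!", "prompt": "Hi"})
print(sample.unique_id == same_id)
```

A content-derived ID of this kind makes deduplication straightforward, since identical records loaded from different files collapse to the same identifier.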

## Custom Data Converters

To support a new data format, create a custom converter in two steps:

### Step 1: Implement Converter Class
Create a converter file in the `rm_gallery/gallery/data/load/` directory:

```python
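from rm_gallery.core.data.load.base import DataConverter, DataConverterRegistry
from rm_gallery.core.data.schema import DataSample  # Import path assumed; adjust if your version differs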

@DataConverterRegistry.register("custom_format")
class CustomConverter(DataConverter):
    """Custom data format converter"""
    
    def convert_to_data_sample(self, data_dict, source_info):
        """
        Convert raw data to DataSample format
        
        Args:
            data_dict: Raw data dictionary
            source_info: Data source information
        
        Returns:
            DataSample: Standardized data sample
        """
        # Implement specific conversion logic
        return DataSample(...)
```

### Step 2: Register Converter
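Import the converter in `rm_gallery/gallery/data/__init__.py`. The `@DataConverterRegistry.register` decorator runs when the module is imported, so this import is what completes the registration: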

```python
from rm_gallery.gallery.data.load.custom_format import CustomConverter
```
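
The reason an import is enough can be shown with a tiny stand-alone registry. The names `REGISTRY` and `register` are hypothetical, and `exec` is used here only to simulate what importing the converter module does: run its top-level code, including the decorator.

```python
REGISTRY = {}


def register(name):
    """Decorator factory that records the decorated class under `name`."""
    def decorator(cls):
        REGISTRY[name] = cls
        return cls
    return decorator


registered_before = "custom_format" in REGISTRY

# Stand-in for the contents of a converter module.
module_source = '''
@register("custom_format")
class CustomConverter:
    pass
'''

# Executing the module body is what an `import` would do on first load.
exec(module_source, {"register": register, "__name__": "custom_format"})

registered_after = "custom_format" in REGISTRY
print(registered_before, registered_after)
```

If the converter module is never imported, its decorator never runs and the format name cannot be resolved, which is why the quick start examples import `rm_gallery.gallery.data` before creating a loader.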
