# Data Process Module

## Overview

The Data Process Module provides a unified interface for processing datasets. It follows an **Operator Pipeline** design: you build a processing workflow by combining operators, which are then executed in sequence.

## Architecture Design

### Core Components

#### 1. **DataProcess** - Data Processing Engine
   - Inherits from `BaseDataModule`, providing standardized data processing interfaces
   - Manages and orchestrates the execution order of operator sequences
   - Supports both batch data processing and real-time data stream processing

#### 2. **BaseOperator** - Abstract Base Class for Operators
   - Defines standard interface specifications for operators
   - Supports generic types for type safety
   - Provides extensible data processing abstract methods

#### 3. **OperatorFactory** - Operator Factory
   - Implements unified registration and dynamic creation mechanisms for operators
   - Seamlessly integrates with the data-juicer ecosystem operators
   - Supports configuration-based operator instantiation
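
The relationship between the three components can be sketched in plain Python. This is a minimal, self-contained illustration of the registry-plus-pipeline pattern described above, not RM-Gallery's actual implementation: `SketchOperator`, `SketchFactory`, `MinLengthFilter`, and `run_pipeline` are hypothetical names, and only the method name `process_dataset` and the `name`/`config` keys are taken from this document.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Generic, List, Optional, Type, TypeVar

T = TypeVar("T")


class SketchOperator(ABC, Generic[T]):
    """Hypothetical stand-in for BaseOperator (not the RM-Gallery class)."""

    def __init__(self, name: str, config: Optional[Dict[str, Any]] = None) -> None:
        self.name = name
        self.config = config or {}

    @abstractmethod
    def process_dataset(self, items: List[T]) -> List[T]:
        """Return the processed list of items."""


class SketchFactory:
    """Hypothetical stand-in for OperatorFactory: a name -> class registry."""

    _registry: Dict[str, Type[SketchOperator]] = {}

    @classmethod
    def register(cls, name: str):
        # Used as a class decorator; stores the class under the given name.
        def decorator(op_cls):
            cls._registry[name] = op_cls
            return op_cls

        return decorator

    @classmethod
    def create_operator(cls, spec: Dict[str, Any]) -> SketchOperator:
        # Look up the registered class and instantiate it from the spec.
        op_cls = cls._registry[spec["name"]]
        return op_cls(name=spec["name"], config=spec.get("config"))


@SketchFactory.register("min_length_filter")
class MinLengthFilter(SketchOperator[str]):
    def process_dataset(self, items: List[str]) -> List[str]:
        min_length = self.config.get("min_length", 0)
        return [text for text in items if len(text) >= min_length]


def run_pipeline(operators: List[SketchOperator], items: List[Any]) -> List[Any]:
    # Each operator receives the output of the previous one.
    for operator in operators:
        items = operator.process_dataset(items)
    return items


op = SketchFactory.create_operator(
    {"name": "min_length_filter", "config": {"min_length": 3}}
)
kept = run_pipeline([op], ["a", "abcd", "xyz"])
```

Because registration happens when the decorated class is defined, importing the module that contains an operator is enough to make it available to the factory.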

## Core Features

### 1. Pipeline-based Data Processing
- **Chain Operations**: Supports seamless serial execution of multiple operators
- **Metadata Preservation**: Carries the original dataset's metadata through to the processed result
- **Full Tracking**: Provides detailed processing logs, performance statistics, and data flow tracking
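
To make the three properties above concrete, the following sketch chains two steps, copies the metadata onto the result, and records per-step counts and timings. It is an illustration under assumptions, not the module's code: `SketchDataset`, its `metadata` field, and `run_with_tracking` are hypothetical; only the `datas` attribute name comes from the examples in this document.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class SketchDataset:
    """Hypothetical dataset container; `metadata` is an assumed field name."""

    datas: List[Any]
    metadata: Dict[str, Any] = field(default_factory=dict)


# A step is a (name, function) pair; the function maps a list to a list.
Step = Tuple[str, Callable[[List[Any]], List[Any]]]


def run_with_tracking(
    dataset: SketchDataset, steps: List[Step]
) -> Tuple[SketchDataset, List[Dict[str, Any]]]:
    items = list(dataset.datas)  # work on a copy, leave the input untouched
    stats: List[Dict[str, Any]] = []
    for name, fn in steps:
        start = time.perf_counter()
        before = len(items)
        items = fn(items)
        stats.append(
            {
                "operator": name,
                "before": before,
                "after": len(items),
                "seconds": time.perf_counter() - start,
            }
        )
    # The metadata is copied onto the result unchanged.
    result = SketchDataset(datas=items, metadata=dict(dataset.metadata))
    return result, stats


steps: List[Step] = [
    ("drop_empty", lambda xs: [x for x in xs if x]),
    ("drop_long", lambda xs: [x for x in xs if len(x) <= 5]),
]
source = SketchDataset(
    datas=["hi", "", "hello", "toolongtext"], metadata={"source": "demo"}
)
result, stats = run_with_tracking(source, steps)

for entry in stats:
    print(f"{entry['operator']}: {entry['before']} -> {entry['after']}")
```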

### 2. Rich Operator Ecosystem
- **Built-in Operators**:
  - `TextLengthFilter` - Filters samples by text length
  - `ConversationTurnFilter` - Filter for conversation turn count
- **External Integration**:
  - Full support for data-juicer operator library
  - Support for custom operator extensions

### 3. Configuration-driven Design
- **Declarative Configuration**: Flexibly define data processing flows through configuration files
- **Parameterized Control**: All operator parameters can be adjusted through configuration files
- **Dynamic Adjustment**: Supports runtime dynamic modification of processing parameters
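
As a rough sketch of what a declarative pipeline definition might look like, the example below parses a JSON document into a list of operator specs and overrides one parameter at runtime. The `type`/`name`/`config` layout mirrors the operator configs used in the Quick Start of this document; the JSON file format and the helpers `load_operator_configs` and `override` are assumptions made for illustration, since this document does not specify how configuration files are stored or loaded.

```python
import json
from typing import Any, Dict, List

# Assumed file content; the document does not prescribe a file format.
PIPELINE_JSON = """
[
  {"type": "filter", "name": "conversation_turn_filter",
   "config": {"min_turns": 1, "max_turns": 8}},
  {"type": "filter", "name": "text_length_filter",
   "config": {"min_length": 100, "max_length": 2000}}
]
"""

REQUIRED_KEYS = {"type", "name", "config"}


def load_operator_configs(text: str) -> List[Dict[str, Any]]:
    """Parse the JSON text and check that every spec has the expected keys."""
    specs = json.loads(text)
    for index, spec in enumerate(specs):
        missing = REQUIRED_KEYS - spec.keys()
        if missing:
            raise ValueError(f"operator #{index} is missing keys: {sorted(missing)}")
    return specs


def override(
    specs: List[Dict[str, Any]], name: str, **params: Any
) -> List[Dict[str, Any]]:
    """Return a copy of specs with params merged into the named operator's config."""
    updated = []
    for spec in specs:
        if spec["name"] == name:
            spec = {**spec, "config": {**spec["config"], **params}}
        updated.append(spec)
    return updated


operator_configs = load_operator_configs(PIPELINE_JSON)
# Adjust one parameter without editing the configuration text.
tuned = override(operator_configs, "text_length_filter", min_length=50)
```

The resulting list has the same shape as the `operator_configs` list that the Quick Start passes to `OperatorFactory.create_operator`.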



## Quick Start

### Method 1: Direct Operator Creation


```python
from rm_gallery.core.data.process.process import create_process_module
from rm_gallery.core.data.process.ops.filter.text_length_filter import TextLengthFilter
from rm_gallery.core.data.process.ops.filter.conversation_turn_filter import ConversationTurnFilter
from rm_gallery.core.data.load.base import DataLoad
import rm_gallery.core.data     # Core strategy registration
import rm_gallery.gallery.data  # Extension strategy registration

# Configure local file loading parameters
config = {
    "path": "./data/reward-bench-2/data/test-00000-of-00001.parquet",
    "limit": 1000,  # Limit the number of data entries to load
}

# Create data loader
loader = DataLoad(
    name="rewardbench2",           # Dataset name
    load_strategy_type="local",    # Use local file loading strategy
    data_source="rewardbench2",    # Specify data source format converter
    config=config                  # Pass configuration parameters
)

# Execute data loading
dataset = loader.run()

# Create operators
text_filter = TextLengthFilter(
    name="text_length_filter",
    config={"min_length": 50, "max_length": 2000}
)

turn_filter = ConversationTurnFilter(
    name="conversation_turn_filter",
    config={"min_turns": 1, "max_turns": 10}
)

# Create data processing module
processor = create_process_module(
    name="data_processor",
    operators=[text_filter, turn_filter]
)

# Process data
result = processor.run(dataset)
print(f"Before processing: {len(dataset.datas)} data entries")
print(f"After processing: {len(result.datas)} data entries")
```


### Method 2: Configuration-based Batch Processing

Defining operators through configuration makes workflows easier to change and is especially suitable for complex, multi-step processing scenarios.


```python
# Create operators through configuration
from rm_gallery.core.data.process.process import create_process_module
from rm_gallery.core.data.load.base import DataLoad
from rm_gallery.core.data.process.ops.base import OperatorFactory
import rm_gallery.core.data     # Core strategy registration
import rm_gallery.gallery.data  # Extension strategy registration

# Configure local file loading parameters
config = {
    "path": "./data/reward-bench-2/data/test-00000-of-00001.parquet",
    "limit": 1000,  # Limit the number of data entries to load
}

# Create data loader
loader = DataLoad(
    name="rewardbench2",           # Dataset name
    load_strategy_type="local",    # Use local file loading strategy
    data_source="rewardbench2",    # Specify data source format converter
    config=config                  # Pass configuration parameters
)

# Execute data loading
dataset = loader.run()

# Configure multiple operators
operator_configs = [
    {
        "type": "filter",
        "name": "conversation_turn_filter",
        "config": {"min_turns": 1, "max_turns": 8}
    },
    {
        "type": "filter",
        "name": "text_length_filter",
        "config": {"min_length": 100, "max_length": 2000}
    },
    {
        "type": "data_juicer",
        "name": "character_repetition_filter",
        "config": {
            "rep_len": 10,
            "min_ratio": 0.0,
            "max_ratio": 0.5
        }
    }
]

# Batch create operators (op_config avoids reusing the loader's `config` name)
operators = [OperatorFactory.create_operator(op_config) for op_config in operator_configs]

# Create processor
processor = create_process_module(
    name="batch_processor",
    operators=operators
)

result = processor.run(dataset)
print(f"Before processing: {len(dataset.datas)} data entries")
print(f"After processing: {len(result.datas)} data entries")
```


## Advanced Features

### Custom Operator Development

When the built-in operators do not meet your requirements, you can create a custom operator in two steps:

#### Step 1: Implement Operator Class

Create the custom operator in the `rm_gallery/gallery/data/process/` directory (the import in Step 2 assumes the file is named `custom_filter.py`):

```python
from rm_gallery.core.data.process.ops.base import BaseOperator, OperatorFactory

@OperatorFactory.register("custom_filter")
class CustomFilter(BaseOperator):
    """Custom data filter example"""
    
    def process_dataset(self, items):
        """
        Core method for processing datasets
        
        Args:
            items: List of input data items
            
        Returns:
            List of filtered data items
        """
        filtered_items = []
        for item in items:
            if self._custom_condition(item):
                filtered_items.append(item)
        return filtered_items
    
    def _custom_condition(self, item):
        """
        Custom filtering condition
        
        Args:
            item: Single data item
            
        Returns:
            bool: Whether to keep this data item
        """
        # Implement your filtering logic here
        return True
```
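
The filtering condition itself is ordinary Python. As a hedged illustration of what `_custom_condition` and `process_dataset` might contain, the standalone sketch below keeps items that have non-empty text and contain none of a few banned phrases. The item layout (a dict with a `text` key) and the `BANNED_PHRASES` list are assumptions made for this example; real RM-Gallery data items may be structured differently, so adapt the field access accordingly.

```python
from typing import Any, Dict, List

# Hypothetical phrases to reject; replace with your own criteria.
BANNED_PHRASES = ("lorem ipsum", "click here")


def custom_condition(item: Dict[str, Any]) -> bool:
    """Keep items with non-empty text that contains no banned phrase."""
    text = str(item.get("text", "")).lower()
    if not text.strip():
        return False
    return not any(phrase in text for phrase in BANNED_PHRASES)


def process_dataset(items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    # Same shape as the operator method above: list in, filtered list out.
    return [item for item in items if custom_condition(item)]


samples = [
    {"text": "How do I sort a list?"},
    {"text": "Click here to win"},
    {"text": "   "},
]
kept = process_dataset(samples)
```

Keeping the condition in its own function, as the operator template does, makes it easy to unit-test without loading a dataset.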

#### Step 2: Register Operator

Import the operator in `rm_gallery/gallery/data/__init__.py` to complete registration:

```python
from rm_gallery.gallery.data.process.custom_filter import CustomFilter
```

### Data-Juicer Operator Integration

RM-Gallery integrates with the data-juicer operator library. To use a data-juicer operator, set `type` to `data_juicer` and create it through the same factory:

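```python
from rm_gallery.core.data.process.ops.base import OperatorFactory

# Configuration example using data-juicer operators.
# Note: this data-juicer operator takes `min_len`/`max_len`, whereas the
# built-in TextLengthFilter takes `min_length`/`max_length`.
config = {
    "type": "data_juicer",
    "name": "text_length_filter",
    "config": {
        "min_len": 10,
        "max_len": 20
    }
}

operator = OperatorFactory.create_operator(config)
```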

## Supported Operators

### RM-Gallery Built-in Operators

| Operator Name | Functionality | Configuration Parameters |
|---------------|---------------|-------------------------|
| `TextLengthFilter` | Filter data samples based on text length | `min_length`, `max_length` |
| `ConversationTurnFilter` | Filter samples based on conversation turn count | `min_turns`, `max_turns` |

### Data-Juicer Integrated Operators

| Operator Name | Functionality | Status |
|---------------|---------------|--------|
| `text_length_filter` | Text length filtering | ✅ Tested |
| `character_repetition_filter` | Character repetition filtering | ✅ Tested |
| `word_repetition_filter` | Word repetition filtering | 🔄 Testing |

> **Tip**: New operators are added and tested continuously, so this list will grow over time.
