# Principle Generator Tutorial (Helpfulness)

This notebook demonstrates how to use the Principle Generator to create **Helpfulness** evaluation principles.  
It covers the full workflow: data loading → model configuration → principle generation → evaluation → result analysis.


In [None]:
# Import standard libraries
import sys
import os
from concurrent.futures import ThreadPoolExecutor
from typing import List

# Add project root directory to Python path
sys.path.append("..")

# Set API credentials (fill in your own key and endpoint)
os.environ["OPENAI_API_KEY"] = ""
os.environ["BASE_URL"] = ""

# Import local modules
from rm_gallery.core.data.schema import DataSample
from rm_gallery.core.model.openai_llm import OpenaiLLM
from rm_gallery.core.reward.principle.base import PrincipleGenerator
from rm_gallery.core.utils.file import read_jsonl

# Initialize logger
from loguru import logger
logger.add("principle_generator.log", rotation="1 day")



## 1. Load Data

This example uses the pairwise Helpfulness "Summarization" data from RMBBench as input.

In [None]:
try:
    # Data path (modify according to your actual path)
    train_path = "/mnt3/huangsen.huang/codes/RM-Gallery/data/RMBbench_Train/pairwise/Helpfulness/Summarization.jsonl"
    test_path = "/mnt3/huangsen.huang/codes/RM-Gallery/data/RMBbench_Test/pairwise/Helpfulness/Summarization.jsonl"
    
    # Read JSONL format data and convert to DataSample objects
    train_samples = [DataSample(**sample) for sample in read_jsonl(train_path)]
    test_samples = [DataSample(**sample) for sample in read_jsonl(test_path)]
    
    logger.info(f"Successfully loaded {len(train_samples)} training samples and {len(test_samples)} test samples")
except Exception as e:
    logger.error(f"Data loading failed: {str(e)}")
    raise
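
For reference, JSONL stores one JSON object per line. The sketch below is a hypothetical stand-in (`read_jsonl_sketch`) written with the standard library only. It is not the implementation of `read_jsonl` from `rm_gallery`, whose behavior may differ, and the record fields are made up for illustration.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def read_jsonl_sketch(path: str) -> List[Dict[str, Any]]:
    """Hypothetical helper: parse one JSON object per non-empty line."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # Skip blank lines
                records.append(json.loads(line))
    return records


# Write a tiny two-record file and read it back (field names are illustrative only)
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    tmp.write('{"unique_id": "a", "text": "first"}\n')
    tmp.write("\n")
    tmp.write('{"unique_id": "b", "text": "second"}\n')
    tmp_path = tmp.name

demo_records = read_jsonl_sketch(tmp_path)
os.remove(tmp_path)
print(len(demo_records))
```

Each parsed dictionary could then be passed to a constructor such as `DataSample(**record)`, as in the cell above, provided its keys match the schema.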


## 2. Configure Generator Parameters

- Using Qwen3 as the language model
- Setting generation and clustering parameters

In [None]:
try:
    # Initialize language model
    llm = OpenaiLLM(
        model="qwen3-235b-a22b",  # Model name
        enable_thinking=True      # Enable reasoning mode
    )
    
    SCENARIO = "Summarization: The text is compressed into a short form, retaining the main information, which is divided into extraction (directly selected from the original text) and production (rewriting the information)."

    # Create principle generator
    generator = PrincipleGenerator( # or IterPrincipleGenerator
        llm=llm,
        scenario=SCENARIO,  # Scenario description
        generate_number=5,   # Generate 5 candidate principles per sample
        cluster_number=3     # Cluster to 3 representative principles
    )
    
    logger.info("Successfully initialized PrincipleGenerator")
except Exception as e:
    logger.error(f"Generator configuration failed: {str(e)}")
    raise


## 3. Execute Batch Generation

In [None]:


try:
    # Execute batch generation; the context manager shuts the pool down afterwards
    with ThreadPoolExecutor(max_workers=12) as thread_pool:
        principles = generator.run_batch(
            train_samples[:10],  # Process the first 10 samples as an example
            thread_pool=thread_pool
        )
    
    logger.info(f"Successfully generated {len(principles)} principles")
except Exception as e:
    logger.error(f"Principle generation failed: {str(e)}")
    raise
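
As a rough picture of batch processing with a thread pool, the sketch below maps a per-sample function over a list with `ThreadPoolExecutor`. The names `generate_for_sample` and `run_batch_sketch` are hypothetical. This only illustrates the general pattern and is an assumption, not a description of how `PrincipleGenerator.run_batch` is implemented.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def generate_for_sample(sample: str) -> str:
    """Hypothetical stand-in for one LLM call on one sample."""
    return f"principle for {sample}"


def run_batch_sketch(
    samples: List[str],
    fn: Callable[[str], str],
    thread_pool: ThreadPoolExecutor,
) -> List[str]:
    # Executor.map returns results in input order, even if tasks finish out of order
    return list(thread_pool.map(fn, samples))


with ThreadPoolExecutor(max_workers=4) as pool:
    results = run_batch_sketch(["s1", "s2", "s3"], generate_for_sample, pool)

print(results)
```

Threads are a reasonable fit here because LLM API calls are I/O-bound: most of the time is spent waiting on the network rather than computing.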


## 4. Evaluation with Generated Principles

In [None]:
from rm_gallery.gallery.rm.alignment.base import BaseHelpfulnessListwiseReward

try:
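    # Format the generated principles as "key: value" strings and keep the first 3.
    # A separate variable keeps `principles` intact so the cell can be re-run.
    principle_list = [f"{k}: {v}" for k, v in principles.items()][:3]
    reward = BaseHelpfulnessListwiseReward(
        name="test_helpfulness_listwise_reward",
        llm=llm,
        principles=principle_list,
        scenario=SCENARIO
    )
    # Score the first 20 test samples with the new reward
    evaluation_samples = reward.evaluate_batch(samples=test_samples[:20])
    logger.info("Successfully evaluated test samples")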
except Exception as e:
    logger.error(f"Reward evaluation failed: {str(e)}")
    raise

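## 5. Results Analysis

Compute the accuracy on the evaluated test samples: a sample counts as correct when its chosen response receives a positive total reward score.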

In [None]:
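# Accuracy: fraction of samples whose "chosen" output receives a positive total score
def calc_acc(samples: List[DataSample]) -> float:
    labels = []
    for sample in samples:
        labels.append(0)  # Assume incorrect until a chosen output scores above 0
        for output in sample.output:
            if output.answer.label["preference"] == "chosen":
                # Sum the scores across all reward details for this output
                score = sum(r.score for r in output.answer.reward.details)
                if score > 0:
                    labels[-1] = 1
    # Guard against an empty sample list
    return sum(labels) / len(labels) if labels else 0.0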

logger.info(f"Accuracy: {calc_acc(evaluation_samples)}")
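
To make the metric concrete, the sketch below applies the same rule to mock objects built with `types.SimpleNamespace`. The mock structure only mirrors the attributes accessed in `calc_acc` above (`output`, `answer.label`, `answer.reward.details`, `score`). It is not the real `DataSample` schema, and `make_output` and `calc_acc_sketch` are hypothetical names.

```python
from types import SimpleNamespace
from typing import List


def make_output(preference: str, scores: List[float]) -> SimpleNamespace:
    """Build a mock output with a preference label and per-detail scores."""
    details = [SimpleNamespace(score=s) for s in scores]
    answer = SimpleNamespace(
        label={"preference": preference},
        reward=SimpleNamespace(details=details),
    )
    return SimpleNamespace(answer=answer)


def calc_acc_sketch(samples) -> float:
    correct = 0
    for sample in samples:
        # A sample is correct if any "chosen" output has a positive total score
        if any(
            o.answer.label["preference"] == "chosen"
            and sum(r.score for r in o.answer.reward.details) > 0
            for o in sample.output
        ):
            correct += 1
    return correct / len(samples) if samples else 0.0


mock_samples = [
    # Chosen response scored 1.0: counted as correct
    SimpleNamespace(output=[make_output("chosen", [1.0]), make_output("rejected", [0.0])]),
    # Chosen response scored 0.0: counted as incorrect
    SimpleNamespace(output=[make_output("chosen", [0.0]), make_output("rejected", [1.0])]),
]
mock_acc = calc_acc_sketch(mock_samples)
print(mock_acc)  # 0.5
```

Note that the score of the rejected response does not enter the calculation: only the total score of the chosen response matters.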

## 6. Built-in Scenario Results

In each scenario, we use `qwen3-235b-a22b` to generate principles based on 10% of the samples, and evaluate on the remaining samples using `qwen3-8b`, with accuracy as the metric.
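
The split described above could look roughly like the following. This is a sketch under assumptions: the source does not say how the 10% subset was selected, so the shuffled split, the fixed seed, and the name `split_samples` are illustrative choices only.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_samples(
    samples: Sequence[T], ratio: float = 0.1, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Hypothetical helper: shuffle, then split into (generation, evaluation) sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # Fixed seed keeps the split reproducible
    cut = max(1, int(len(shuffled) * ratio))  # Keep at least one generation sample
    return shuffled[:cut], shuffled[cut:]


generation_set, evaluation_set = split_samples(list(range(100)))
print(len(generation_set), len(evaluation_set))
```

Keeping the two sets disjoint means the principles are never evaluated on the samples they were generated from.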


### 6.1 RewardBench2

| Scenario   | Base       | Generated   |
|------------|------------|-------------|
| Precise IF | 0.5653     | **0.6097**  |
| Factuality | 0.7030     | **0.7663**  |
| Math       | 0.8866     | **0.8927**  |
| Safety     | 0.7946     | **0.9467**  |
| Focus      | 0.9022     | **0.9404**  |

### 6.2 RMBBench

Note: we only evaluate on the best-of-n data.

| Scenario        | Base       | Generated   |
|-----------------|------------|------------|
| Chat            | 0.6810     | **0.7603** |
| Brainstorming   | 0.8129     | **0.8187** |
| Classification  | **0.7200** | 0.6697     |
| Closed QA       | **0.7213** | 0.6915     |
| Open QA         | 0.6828     | **0.6937** |
| Generation      | **0.7289** | 0.7205     |
| Summarization   | 0.6333     | **0.6921** |
| Translation     | **0.7336** | 0.6930     |
| Rewrite         | **0.6743** | 0.5371     |
| Reasoning       | **0.7080** | 0.6986     |
| Role Playing    | 0.6164     | **0.6169** |
| Code            | **0.8348** | 0.8251     |
