# AutoPrinciple Tutorial

## What is AutoPrinciple?

AutoPrinciple is an LLM-based system that automatically generates task-specific evaluation criteria (principles) for reward modeling. It uses a large language model (such as Qwen3) to extract high-quality assessment rules (e.g., "Is the generated content faithful to the source?" or "Is the answer factually accurate?") from a small number of examples, replacing traditional manual rule engineering. The system supports multiple task types (text summarization, mathematical reasoning, code generation, etc.) and adapts the generated rules to each scenario.

## Why AutoPrinciple?
Traditional manual rule engineering faces three critical limitations:

- Poor Scalability: Manually designing rules for every task-scenario combination (e.g., 10 tasks × 5 scenarios = 50 rule sets) requires excessive human effort.

- Subjective Bias: Human-defined rules often reflect individual cognitive biases (e.g., cultural differences in defining "safe content").

- Limited Adaptability: Static rules struggle to adapt to evolving model capabilities (e.g., new error patterns in upgraded models).


AutoPrinciple's advantages:

- Efficient Generation: Generates candidate rules in bulk via LLM (e.g., 5 samples × 5 candidates = 25 rules)

- Dynamic Optimization: Uses clustering to extract core representative rules (e.g., compress 25 to 3 rules)

- Cross-Domain Transfer: Applies the same framework across different task types (e.g., "syntax correctness" for code → "semantic fidelity" for translation)



## How AutoPrinciple Works

The system operates through a streamlined three-step workflow, in which the third step is optional:

- Candidate Principle Extraction from In-Distribution Data: Generate diverse candidate principles using task-specific in-distribution (ID) data.

- High-Quality Principle Compression: Distill the candidate principles into a compact, representative set by applying semantic clustering to group similar candidates.

- Iterative Optimization (Optional): Refine principles through evaluation feedback loops.
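
The compression step can be illustrated with a toy example. The sketch below is not AutoPrinciple's implementation: it assumes a bag-of-words cosine similarity as a crude stand-in for semantic similarity, and the names `bag_of_words`, `cosine`, `compress`, and the `threshold` parameter are hypothetical. It only shows the idea of grouping near-duplicate candidates and keeping one representative per group.

```python
import math
import re
from collections import Counter
from typing import List


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for a semantic embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def compress(candidates: List[str], cluster_number: int, threshold: float = 0.75) -> List[str]:
    """Greedily group similar candidates, then keep the first member
    of each of the `cluster_number` largest groups."""
    clusters: List[List[str]] = []
    for cand in candidates:
        vec = bag_of_words(cand)
        for cluster in clusters:
            # Compare against the first member of each existing group
            if cosine(vec, bag_of_words(cluster[0])) >= threshold:
                cluster.append(cand)
                break
        else:
            clusters.append([cand])  # no similar group found: start a new one
    clusters.sort(key=len, reverse=True)  # stable sort: ties keep first-seen order
    return [cluster[0] for cluster in clusters[:cluster_number]]


candidates = [
    "The summary is faithful to the source text",
    "The summary stays faithful to the source",
    "The summary is concise",
    "The summary is concise and short",
    "The summary is faithful to the source document",
    "Formatting follows the requested style",
]
print(compress(candidates, cluster_number=2))
```

Here six candidates collapse into three groups, and the two largest groups each contribute one representative. A real system would likely replace the word-count similarity with embeddings or an LLM-based grouping.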



## How to Use AutoPrinciple
This section demonstrates how to use `PrincipleGenerator` to create **Helpfulness** evaluation principles.

It covers the full workflow: data loading → model configuration → principle generation → evaluation → result analysis.


In [None]:
# Import standard libraries
import sys
import os
from concurrent.futures import ThreadPoolExecutor
from typing import List

# Add project root directory to Python path
sys.path.append("..")

# Set environment variables (fill in your own API key and endpoint URL)
os.environ["OPENAI_API_KEY"] = ""
os.environ["BASE_URL"] = ""

# Import local modules
from rm_gallery.core.data.schema import DataSample
from rm_gallery.core.model.openai_llm import OpenaiLLM
from rm_gallery.core.reward.principle.generator import PrincipleGenerator
from rm_gallery.core.utils.file import read_jsonl

# Initialize logger
from loguru import logger
logger.add("principle_generator.log", rotation="1 day")


### Load Data
This example uses the Helpfulness "Summarization" pairwise data from RMBBench as input examples.

In [None]:
try:
    # Data path (modify according to your actual path)
    train_path = "/mnt3/huangsen.huang/codes/RM-Gallery/data/RMBbench_Train/pairwise/Helpfulness/Summarization.jsonl"
    test_path = "/mnt3/huangsen.huang/codes/RM-Gallery/data/RMBbench_Test/pairwise/Helpfulness/Summarization.jsonl"
    
    # Read JSONL format data and convert to DataSample objects
    train_samples = [DataSample(**sample) for sample in read_jsonl(train_path)]
    test_samples = [DataSample(**sample) for sample in read_jsonl(test_path)]
    
    logger.info(f"Successfully loaded {len(train_samples)} training samples and {len(test_samples)} test samples")
except Exception as e:
    logger.error(f"Data loading failed: {str(e)}")
    raise
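
For readers unfamiliar with the JSONL format, the sketch below shows what line-by-line loading looks like using only the standard library. It is not the `read_jsonl` implementation from `rm_gallery`: `iter_jsonl` is a hypothetical helper, and the toy record fields are placeholders rather than the `DataSample` schema.

```python
import json
import os
import tempfile
from typing import Any, Dict, Iterator


def iter_jsonl(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Write two toy records so the example runs without the real dataset.
# The field names are placeholders, not the DataSample schema.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    tmp.write('{"id": 1, "text": "first"}\n\n{"id": 2, "text": "second"}\n')
    toy_path = tmp.name

records = list(iter_jsonl(toy_path))
os.remove(toy_path)  # clean up the temporary file
print(len(records), records[0]["text"])
```

Each line is an independent JSON object, so a malformed line raises `json.JSONDecodeError` at that line without any ambiguity about record boundaries.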


### Configure Generator Parameters

- Using Qwen3 as the language model

- Setting generation and clustering parameters

In [None]:
try:
    # Initialize language model
    llm = OpenaiLLM(
        model="qwen3-235b-a22b",  # Model name
        enable_thinking=True      # Enable reasoning mode
    )
    
    SCENARIO = "Summarization: The text is compressed into a short form, retaining the main information, which is divided into extraction (directly selected from the original text) and production (rewriting the information)."

    # Create principle generator
    generator = PrincipleGenerator( # or IterPrincipleGenerator
        llm=llm,
        scenario=SCENARIO,  # Scenario description
        generate_number=5,   # Generate 5 candidate principles per sample
        cluster_number=3     # Cluster to 3 representative principles
    )
    
    logger.info("Successfully initialized PrincipleGenerator")
except Exception as e:
    logger.error(f"Generator configuration failed: {str(e)}")
    raise


### Execute Batch Generation

In [None]:


try:
    # Execute batch generation
    principles = generator.run_batch(
        train_samples[:10],  # Process first 10 samples as example
        thread_pool=ThreadPoolExecutor(max_workers=12)
    )
    
    logger.info(f"Successfully generated {len(principles)} principles")
except Exception as e:
    logger.error(f"Principle generation failed: {str(e)}")
    raise


### Evaluation with Generated Principles

In [None]:
from rm_gallery.gallery.rm.alignment.base import BaseHelpfulnessListwiseReward

try:
    principles = [f"{k}: {v}" for k, v in principles.items()][:3]
    reward = BaseHelpfulnessListwiseReward(
        name="test_helpfulness_listwise_reward",
        llm=llm,
        principles=principles,
        scenario=SCENARIO
    )
    evaluation_samples = reward.evaluate_batch(samples=test_samples[:20])
    logger.info("Successfully evaluated test samples")
except Exception as e:
    logger.error(f"Reward evaluation failed: {str(e)}")
    raise

### Evaluation Results Analysis
Analyze the accuracy rate of test samples

In [None]:
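# Accuracy: fraction of samples whose "chosen" output received a positive total reward
def calc_acc(samples: List[DataSample]) -> float:
    labels = []
    for sample in samples:
        labels.append(0)  # assume incorrect until the chosen output scores > 0
        for output in sample.output:
            if output.answer.label["preference"] == "chosen":
                # Sum the scores across all reward details for this output
                score = sum(r.score for r in output.answer.reward.details)
                if score > 0:
                    labels[-1] = 1
    # Guard against an empty sample list to avoid ZeroDivisionError
    return sum(labels) / len(labels) if labels else 0.0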

logger.info(f"Accuracy: {calc_acc(evaluation_samples)}")
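
To make the accuracy definition concrete, the following sketch applies the same logic to two toy samples. It assumes nothing about `DataSample` beyond the attributes read in the cell above; the `toy_output` helper and the `SimpleNamespace` stand-ins are hypothetical and exist only for illustration.

```python
from types import SimpleNamespace as NS
from typing import Any, List


def calc_acc(samples: List[Any]) -> float:
    """Same logic as the cell above, typed loosely so toy objects work."""
    labels = []
    for sample in samples:
        labels.append(0)
        for output in sample.output:
            if output.answer.label["preference"] == "chosen":
                score = sum(r.score for r in output.answer.reward.details)
                if score > 0:
                    labels[-1] = 1
    return sum(labels) / len(labels)


def toy_output(preference: str, scores: List[float]) -> NS:
    """Build a stand-in exposing only the attributes calc_acc reads."""
    details = [NS(score=s) for s in scores]
    return NS(answer=NS(label={"preference": preference}, reward=NS(details=details)))


toy_samples = [
    # Chosen output received a positive reward: counted as correct
    NS(output=[toy_output("chosen", [1.0]), toy_output("rejected", [0.0])]),
    # Chosen output received zero reward: counted as incorrect
    NS(output=[toy_output("chosen", [0.0]), toy_output("rejected", [1.0])]),
]
print(calc_acc(toy_samples))  # 0.5
```

Only the score of the "chosen" output matters in this metric; the score assigned to the "rejected" output is never read.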

## Built-in Scenario Results
This section presents our experimental results on built-in scenarios using generated principles.


### Setting

The experimental setup compares two approaches across multiple scenarios:

- Base Configuration: Uses built-in reward templates from the system (e.g., BaseHelpfulnessListwiseReward) without any additional principles, relying solely on the predefined reward criteria inherent to the framework.

- Generated Configuration: Extends the base approach by integrating principles generated automatically by AutoPrinciple. Principles are extracted from task-specific data samples using an LLM (e.g., Qwen3), and clustering compresses 25 candidate principles into 3 representative rules for efficiency.

#### Evaluation Protocol

Models: Both configurations use qwen3-8b for evaluation, while principles are generated using qwen3-235b-a22b.

Data: 10% of training samples are used to generate principles, and the remaining samples are evaluated.

Metric: Accuracy, defined as the proportion of correctly preferred outputs based on reward scores, measured over 5 independent runs.
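
The data split in this protocol might be sketched as follows. The text does not say how the 10% is selected, so the seeded random shuffle here is an assumption, and `split_for_principles` is a hypothetical helper rather than part of the library.

```python
import random
from typing import List, Sequence, Tuple


def split_for_principles(
    samples: Sequence, fraction: float = 0.1, seed: int = 0
) -> Tuple[List, List]:
    """Shuffle a copy and split it into (generation set, evaluation set).

    Assumption: the split is random; the original protocol may differ.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = max(1, int(len(shuffled) * fraction))  # keep at least one sample
    return shuffled[:cut], shuffled[cut:]


samples = list(range(200))  # stand-in for DataSample objects
gen_set, eval_set = split_for_principles(samples)
print(len(gen_set), len(eval_set))  # 20 180
```

Keeping the two sets disjoint matters here: principles extracted from a sample should not be evaluated on that same sample.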



### RewardBench2
| Scenario   | Base       | Generated    |
|------------|------------|-------------|
| Precise IF | 0.5653     | **0.6097**  |
| Factuality | 0.7030     | **0.7663**  |
| Math       | 0.8866     | **0.8927**  |
| Safety     | 0.7946     | **0.9467**  |
| Focus      | 0.9022     | **0.9404**  |

### RMBBench
Note: We evaluate only on the best-of-n data.

| Scenario        | Base       | Generated   |
|-----------------|------------|------------|
| Chat            | 0.6810     | **0.7603** |
| Brainstorming   | 0.8129     | **0.8187** |
| Classification  | **0.7200** | 0.6697     |
| Closed QA       | **0.7213** | 0.6915     |
| Open QA         | 0.6828     | **0.6937** |
| Generation      | **0.7289** | 0.7205     |
| Summarization   | 0.6333     | **0.6921** |
| Translation     | **0.7336** | 0.6930     |
| Rewrite         | **0.6743** | 0.5371     |
| Reasoning       | **0.7080** | 0.6986     |
| Role Playing    | 0.6164     | **0.6169** |
| Code            | **0.8348** | 0.8251     |
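
The RMBBench results are mixed, so a quick tally can help when reading the table. The sketch below copies the values from the table above and counts the scenarios in which the Generated configuration scores higher; the variable names are illustrative only.

```python
from typing import Dict, Tuple

# (Base, Generated) accuracies copied from the RMBBench table above
rmb: Dict[str, Tuple[float, float]] = {
    "Chat": (0.6810, 0.7603),
    "Brainstorming": (0.8129, 0.8187),
    "Classification": (0.7200, 0.6697),
    "Closed QA": (0.7213, 0.6915),
    "Open QA": (0.6828, 0.6937),
    "Generation": (0.7289, 0.7205),
    "Summarization": (0.6333, 0.6921),
    "Translation": (0.7336, 0.6930),
    "Rewrite": (0.6743, 0.5371),
    "Reasoning": (0.7080, 0.6986),
    "Role Playing": (0.6164, 0.6169),
    "Code": (0.8348, 0.8251),
}

# Scenarios where generated principles improve on the base configuration
wins = [name for name, (base, gen) in rmb.items() if gen > base]

# Scenarios with the largest improvement and the largest regression
largest_gain = max(rmb, key=lambda k: rmb[k][1] - rmb[k][0])
largest_drop = min(rmb, key=lambda k: rmb[k][1] - rmb[k][0])

print(f"Generated wins {len(wins)} of {len(rmb)} scenarios: {wins}")
print(f"Largest gain: {largest_gain}; largest drop: {largest_drop}")
```

On these numbers, generated principles help in 5 of the 12 RMBBench scenarios, with the largest gain on Chat and the largest drop on Rewrite.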
