# AgentReward

A process-level, rubric-based reward engine for Code Agent trajectories: it evaluates not only what an agent got right, but how it got there.

PyPI · Python 3.10+ · License: MIT · MCP

Quick Start · Three-Layer Architecture · Rubric System · MCP Server · Data Pipeline Ecosystem



AgentReward computes a multi-dimensional rubric reward for every step of an Agent trajectory, combining a rule layer, a model layer, and human calibration. Use it to build preference pairs for RLHF/DPO training data.

## Core Capabilities

Agent trajectory (N steps) → per-step evaluation → process score + outcome score → preference pairs → RLHF/DPO training
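
That loop is a few lines of code end to end. A minimal sketch, assuming trajectory dicts in the Quick Start format shown below and a hypothetical `candidate_trajectories` list whose dicts each carry an `id` field:

```python
from agentreward import RewardEngine
from agentreward.preferences import build_preferences

engine = RewardEngine()

# Score each candidate trajectory for the same task and record its reward.
scored = []
for traj in candidate_trajectories:  # hypothetical: a list of trajectory dicts
    result = engine.score(traj)
    scored.append({
        "id": traj["id"],            # hypothetical field, used to name the pair
        "reward": result.total_score,
        "step_count": len(traj["steps"]),
    })

# Rank the scored candidates into (chosen, rejected) preference pairs.
pairs = build_preferences({"task_001": scored}, min_margin=0.1)
```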

### Problems Solved

| Pain point | Traditional approach | AgentReward |
| --- | --- | --- |
| Evaluation granularity | Final-result pass/fail only | Multi-dimensional scores for every step |
| Reward signal | Sparse (0/1) | Dense (0.0–1.0 per step) |
| Interpretability | Black-box score | Broken down by rubric, with rationale |
| Preference construction | Manual annotation | Generated automatically from reward rankings |
| Reliability | Pure LLM judging is unstable | Rule-based fallback + model augmentation + human calibration |

## Installation

```bash
pip install knowlyr-reward
```

Optional extras:

```bash
pip install knowlyr-reward[llm]    # LLM-as-Judge (Anthropic + OpenAI)
pip install knowlyr-reward[stats]  # Statistical calibration (numpy + scipy)
pip install knowlyr-reward[mcp]    # MCP server
pip install knowlyr-reward[all]    # Everything
```

## Quick Start

### Python API

```python
from agentreward import RewardEngine

# Prepare the trajectory data
trajectory = {
    "task": "Fix the assertion error in test_login.py",
    "steps": [
        {"tool": "Read", "params": {"file_path": "/src/test_login.py"}, "output": "..."},
        {"tool": "Grep", "params": {"pattern": "assert"}, "output": "line 42: assert x == y"},
        {"tool": "Edit", "params": {"file_path": "/src/test_login.py",
                                    "old_string": "assert x == y",
                                    "new_string": "assert x == expected_y"}},
    ],
    "outcome": {"success": True, "tests_passed": 10, "tests_total": 10},
}

# Compute the reward
engine = RewardEngine()
result = engine.score(trajectory)

print(f"Total score: {result.total_score:.4f}")
print(f"Outcome score: {result.outcome_score:.4f}")
print(f"Process score: {result.process_score:.4f}")

for sr in result.step_rewards:
    print(f"  Step {sr.step_id}: {sr.total_score:.4f} {sr.rubric_scores}")
```
Output:

```
Total score: 0.8720
Outcome score: 1.0000
Process score: 0.7440
  Step 1: 0.8500 {'goal_progress': 0.8, 'tool_choice': 0.9, 'param_correctness': 0.9, 'info_utilization': 0.7, 'non_redundancy': 1.0}
  Step 2: 0.7200 {'goal_progress': 0.6, 'tool_choice': 0.8, 'param_correctness': 0.8, 'info_utilization': 0.6, 'non_redundancy': 0.9}
  Step 3: 0.9100 {'goal_progress': 0.9, 'tool_choice': 1.0, 'param_correctness': 0.9, 'info_utilization': 0.9, 'non_redundancy': 1.0}
```
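
Note how the numbers compose: in this example 0.8720 = (1.0000 + 0.7440) / 2, i.e. the total score is the equal-weight mean of the outcome score and the process score.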

### CLI

```bash
# Score a single trajectory
knowlyr-reward score trajectory.json

# Compare multiple trajectories
knowlyr-reward compare traj_a.json traj_b.json traj_c.json

# Build preference pairs
knowlyr-reward preferences trajectories_by_task.json -o pairs.json
```
Output:

```
Scoring trajectory: trajectory.json
  Steps: 5
  Model: claude-sonnet-4-20250514
  Progress: 5/5
✓ Scoring complete
  Total score: 0.8720
  Process score: 0.7440
  Outcome score: 1.0000
  Elapsed: 3.2s
```

## Three-Layer Architecture

```mermaid
graph TD
    subgraph L1["Layer 1 · Rule layer (weight 0.6)"]
        direction TB
        R1["Rule-based"]
        R1a["Redundancy detection · Backtrack detection<br/>Efficiency scoring · Info utilization"]
        R1b["✅ Deterministic, fast, no API needed"]
    end

    subgraph L2["Layer 2 · Model layer (weight 0.4)"]
        direction TB
        R2["LLM-as-Judge"]
        R2a["Goal-progress judging · Tool-choice judging<br/>Parameter-correctness judging · Prompt templates"]
        R2b["🧠 Semantic understanding, flexible, needs an LLM API"]
    end

    subgraph L3["Layer 3 · Human calibration"]
        direction TB
        R3["Human Calibration"]
        R3a["Pearson/Spearman · Agreement rate<br/>Weight tuning · MAE analysis"]
        R3b["👤 Reliability guarantee, needs human labels"]
    end

    L1 --> Merge["🎯 Weighted fusion"]
    L2 --> Merge
    Merge --> L3

    style L1 fill:#2da44e,color:#fff,stroke:#2da44e
    style L2 fill:#0969da,color:#fff,stroke:#0969da
    style L3 fill:#8250df,color:#fff,stroke:#8250df
    style Merge fill:#bf8700,color:#fff,stroke:#bf8700
```

**Why three layers?**

- Rule layer: fast, deterministic, zero cost; covers the quantifiable dimensions (redundancy, backtracking, efficiency)
- Model layer: understands semantics; covers dimensions like "goal progress" that require comprehension
- Human layer: calibrates the output of the first two layers so that scores agree with human judgment
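
A minimal sketch of what the weighted fusion amounts to; the exact formula inside `RewardEngine` is an assumption here, with the 0.6/0.4 defaults taken from `RewardConfig`:

```python
rule_weight, model_weight = 0.6, 0.4  # defaults from RewardConfig

def fuse(rule_score: float, model_score: float) -> float:
    """Blend one step's rule-layer and model-layer scores (each in [0, 1])."""
    return rule_weight * rule_score + model_weight * model_score

# Illustrative layer scores for a single step (hypothetical values):
print(f"{fuse(rule_score=0.95, model_score=0.80):.2f}")  # 0.89
```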

## Rubric System

Every step of every trajectory is scored along 5 dimensions:

| Rubric | Name | Weight | Evaluator | Description |
| --- | --- | --- | --- | --- |
| `goal_progress` | Goal progress | 0.30 | model | Does this step advance the task goal? |
| `tool_choice` | Tool choice | 0.20 | model | Is the chosen tool appropriate? |
| `param_correctness` | Parameter correctness | 0.20 | model | Are the tool-call parameters correct? |
| `info_utilization` | Info utilization | 0.15 | rule | Does this step use previously gathered information? |
| `non_redundancy` | Non-redundancy | 0.15 | rule | Is this step free of redundant operations? |
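
Conceptually, a step's total is the weighted sum of its rubric scores. A short worked example with illustrative per-rubric scores, assuming a plain weighted sum (which is what the weights above suggest):

```python
# Default weights from the table above.
weights = {"goal_progress": 0.30, "tool_choice": 0.20, "param_correctness": 0.20,
           "info_utilization": 0.15, "non_redundancy": 0.15}

# Hypothetical per-rubric scores for one step.
scores = {"goal_progress": 0.6, "tool_choice": 1.0, "param_correctness": 0.8,
          "info_utilization": 0.5, "non_redundancy": 1.0}

step_score = sum(w * scores[r] for r, w in weights.items())
print(f"{step_score:.4f}")  # 0.7650
```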

### Custom Rubrics

```python
from agentreward.rubrics import Rubric, RubricSet

custom_rubrics = RubricSet(rubrics=[
    Rubric(id="safety", name="Safety", description="Is the operation safe?",
           weight=0.4, evaluator="rule"),
    Rubric(id="creativity", name="Creativity", description="Is the approach creative?",
           weight=0.6, evaluator="model"),
])
```
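
The default rubric weights sum to 1.0, which keeps weighted step scores inside [0, 1]; it seems worth checking the same for a custom set (an assumption: the library is not documented to normalize weights for you):

```python
# Assumption: weights should sum to 1.0 so weighted step scores stay in [0, 1].
total_weight = sum(r.weight for r in custom_rubrics.rubrics)
assert abs(total_weight - 1.0) < 1e-9, f"weights sum to {total_weight}, not 1.0"
```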

## Calibration Methodology

Calibration workflow:

1. Collect human labels: have human experts score 50–100 trajectories
2. Compute correlations: Pearson r (linear), Spearman rho (rank), agreement rate
3. Tune weights: adjust rule_weight / model_weight based on the correlation results
4. Iterate: repeat until Spearman rho > 0.8
```python
from agentreward.calibration import calibrate

result = calibrate(
    reward_scores=[0.8, 0.6, 0.9, 0.3, 0.7],
    human_scores=[0.85, 0.55, 0.95, 0.25, 0.65],
)

print(f"Pearson r: {result.pearson_r:.4f}")
print(f"Spearman rho: {result.spearman_rho:.4f}")
print(f"Agreement rate: {result.agreement_rate:.4f}")
```

### Calibration Metric Reference

| Metric | Acceptable | Good | Excellent |
| --- | --- | --- | --- |
| Pearson r | > 0.5 | > 0.7 | > 0.85 |
| Spearman rho | > 0.5 | > 0.7 | > 0.85 |
| Agreement rate | > 0.6 | > 0.75 | > 0.9 |
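
If you want the raw statistics outside of `calibrate`, the same quantities can be computed with scipy directly (pulled in by the `[stats]` extra). The agreement-rate definition below, pairwise ranking agreement, is an assumption about what the library reports:

```python
from itertools import combinations
from scipy.stats import pearsonr, spearmanr

reward = [0.8, 0.6, 0.9, 0.3, 0.7]
human = [0.85, 0.55, 0.95, 0.25, 0.65]

r, _ = pearsonr(reward, human)     # linear correlation
rho, _ = spearmanr(reward, human)  # rank correlation

# Assumed definition: the fraction of pairs (i, j) ranked the same way by both scorers.
idx_pairs = list(combinations(range(len(reward)), 2))
agree = sum((reward[i] - reward[j]) * (human[i] - human[j]) > 0 for i, j in idx_pairs)

print(f"Pearson r={r:.4f}  Spearman rho={rho:.4f}  agreement={agree / len(idx_pairs):.4f}")
```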

## Preference Pair Construction

For RLHF / DPO training:

```python
from agentreward.preferences import build_preferences

# Trajectories grouped by task (reward scores already attached)
trajectories_by_task = {
    "task_001": [
        {"id": "traj_a", "reward": 0.9, "step_count": 5},
        {"id": "traj_b", "reward": 0.3, "step_count": 12},
        {"id": "traj_c", "reward": 0.7, "step_count": 8},
    ],
}

pairs = build_preferences(trajectories_by_task, min_margin=0.1)
for p in pairs:
    print(f"{p.chosen_trajectory_id} > {p.rejected_trajectory_id} (margin={p.margin():.3f})")
```
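
Conceptually the pairing is simple: within each task, any two trajectories whose reward gap is at least `min_margin` yield one (chosen, rejected) pair. A sketch of that logic, as an assumption about what `build_preferences` does internally:

```python
from itertools import combinations

def naive_pairs(trajs: list[dict], min_margin: float = 0.1) -> list[tuple[str, str]]:
    """Return (chosen_id, rejected_id) for every pair whose reward gap >= min_margin."""
    pairs = []
    for a, b in combinations(trajs, 2):
        hi, lo = (a, b) if a["reward"] >= b["reward"] else (b, a)
        if hi["reward"] - lo["reward"] >= min_margin:
            pairs.append((hi["id"], lo["id"]))
    return pairs

# With the task_001 data above: [('traj_a', 'traj_b'), ('traj_a', 'traj_c'), ('traj_c', 'traj_b')]
print(naive_pairs(trajectories_by_task["task_001"]))
```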

## MCP Server / Claude Integration

Use AgentReward directly from Claude Desktop / Claude Code.

### Config

Add to `~/Library/Application Support/Claude/claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "knowlyr-reward": {
      "command": "uv",
      "args": ["--directory", "/path/to/agent-reward", "run", "python", "-m", "agentreward.mcp_server"]
    }
  }
}
```

### Tools

| Tool | Function |
| --- | --- |
| `score_trajectory` | Compute a process-level reward for a single trajectory |
| `build_preferences` | Build preference pairs from multiple trajectories |
| `calibrate_reward` | Calibrate automatic rewards against human labels |
| `list_rubrics` | List the available rubrics |

### Usage Example

```
User: Please score the Agent trajectory in ./trajectories/task_001.json

Claude: [calls score_trajectory]
        Scoring trajectory (5 steps)...

        ✓ Scoring complete:
        - Total score: 0.8720
        - Process score: 0.7440
        - Outcome score: 1.0000
        - Step 1: 0.85 | Step 2: 0.72 | Step 3: 0.91
```

## Data Pipeline Ecosystem

AgentReward is the reward component of the Data Pipeline ecosystem:

```mermaid
graph LR
    Radar["🔍 Radar<br/>Intel discovery"] --> Recipe["📋 Recipe<br/>Reverse analysis"]
    Recipe --> Synth["🔄 Synth<br/>Data synthesis"]
    Recipe --> Label["🏷️ Label<br/>Data labeling"]
    Synth --> Check["✅ Check<br/>Data QA"]
    Label --> Check
    Check --> Hub["🎯 Hub<br/>Orchestration"]
    Hub --> Sandbox["📦 Sandbox<br/>Execution sandbox"]
    Sandbox --> Recorder["📹 Recorder<br/>Trajectory recording"]
    Recorder --> Reward["⭐ Reward<br/>Process scoring"]
    style Reward fill:#0969da,color:#fff,stroke:#0969da
```

### Ecosystem Projects

| Role | Project | Description | Repo |
| --- | --- | --- | --- |
| Intel | AI Dataset Radar | Dataset competitive intelligence, trend analysis | GitHub |
| Analysis | DataRecipe | Reverse analysis, schema extraction, cost estimation | GitHub |
| Production | DataSynth | Batch LLM synthesis, seed-data expansion | GitHub |
| Production | DataLabel | Lightweight labeling tool, multi-annotator merging | GitHub |
| QA | DataCheck | Rule validation, duplicate detection, distribution analysis | GitHub |
| Agent | AgentSandbox | Docker execution sandbox, trajectory replay | GitHub |
| Agent | AgentRecorder | Standardized trajectory recording, multi-framework adapters | GitHub |
| Agent | AgentReward | Process-level reward, multi-dimensional rubric evaluation | You are here |
| Orchestration | TrajectoryHub | Pipeline orchestration, dataset export | GitHub |

### End-to-end Flow

```bash
# 1. Radar: discover high-quality datasets
knowlyr-radar scan --domain code-agent

# 2. DataRecipe: analyze a dataset, generate schema and samples
knowlyr-datarecipe deep-analyze tencent/CL-bench -o ./output

# 3. DataSynth: batch-synthesize from seed data
knowlyr-datasynth generate ./output/tencent_CL-bench/ -n 1000

# 4. DataLabel: human-label / calibrate seed data
knowlyr-datalabel generate ./output/tencent_CL-bench/

# 5. DataCheck: quality checks
knowlyr-datacheck validate ./output/tencent_CL-bench/

# 6. Recorder: record Agent execution trajectories
knowlyr-recorder record --task task_001.json

# 7. Hub: manage trajectory data
knowlyr-hub import ./trajectories/

# 8. Sandbox: safe replay verification
knowlyr-sandbox replay trajectory_001.json

# 9. AgentReward: compute process-level rewards + build preference pairs
knowlyr-reward score trajectory_001.json
knowlyr-reward preferences trajectories_by_task.json -o pairs.json
```

### Full MCP Config

```json
{
  "mcpServers": {
    "knowlyr-radar": {
      "command": "uv",
      "args": ["--directory", "/path/to/ai-dataset-radar", "run", "knowlyr-radar-mcp"]
    },
    "knowlyr-datarecipe": {
      "command": "uv",
      "args": ["--directory", "/path/to/data-recipe", "run", "knowlyr-datarecipe-mcp"]
    },
    "knowlyr-datasynth": {
      "command": "uv",
      "args": ["--directory", "/path/to/data-synth", "run", "python", "-m", "datasynth.mcp_server"]
    },
    "knowlyr-datalabel": {
      "command": "uv",
      "args": ["--directory", "/path/to/data-label", "run", "python", "-m", "datalabel.mcp_server"]
    },
    "knowlyr-datacheck": {
      "command": "uv",
      "args": ["--directory", "/path/to/data-check", "run", "python", "-m", "datacheck.mcp_server"]
    },
    "knowlyr-hub": {
      "command": "uv",
      "args": ["--directory", "/path/to/agent-trajectory-hub", "run", "python", "-m", "trajhub.mcp_server"]
    },
    "knowlyr-sandbox": {
      "command": "uv",
      "args": ["--directory", "/path/to/agent-sandbox", "run", "python", "-m", "sandbox.mcp_server"]
    },
    "knowlyr-recorder": {
      "command": "uv",
      "args": ["--directory", "/path/to/agent-recorder", "run", "python", "-m", "recorder.mcp_server"]
    },
    "knowlyr-reward": {
      "command": "uv",
      "args": ["--directory", "/path/to/agent-reward", "run", "python", "-m", "agentreward.mcp_server"]
    }
  }
}
```

## Command Reference

| Command | Function |
| --- | --- |
| `knowlyr-reward score <file>` | Score a single trajectory |
| `knowlyr-reward compare <files...>` | Compare multiple trajectories |
| `knowlyr-reward preferences <file>` | Build preference pairs |
| `knowlyr-reward calibrate <file>` | Calibrate against human labels |
| `knowlyr-reward rubrics` | List rubrics |

## API Usage

```python
from agentreward import RewardEngine
from agentreward.config import RewardConfig

# Configuration
config = RewardConfig(
    rule_weight=0.6,       # rule-layer weight
    model_weight=0.4,      # model-layer weight
    rubric_set="default",  # rubric set
    model_name="claude-sonnet-4-20250514",
    provider="anthropic",
    temperature=0.1,
)

# Score
engine = RewardEngine(config)
result = engine.score(trajectory)

print(f"Total score: {result.total_score:.4f}")
print(f"Process score: {result.process_score:.4f}")
```

## Core Classes

| Class | Description |
| --- | --- |
| `RewardEngine` | Core engine; combines the rule layer and the model layer |
| `StepReward` | Single-step reward result |
| `TrajectoryReward` | Whole-trajectory reward result |
| `Rubric` | A single evaluation dimension |
| `RubricSet` | A set of evaluation dimensions |
| `PreferencePair` | A preference pair |
| `RewardConfig` | Engine configuration |
| `CalibrationResult` | Calibration result |

## Project Architecture

```
src/agentreward/
├── reward.py        # Core engine (RewardEngine)
├── rubrics.py       # Rubric definitions (5 default dimensions)
├── rules.py         # Rule layer (redundancy / backtracking / efficiency / info utilization)
├── judge.py         # Model layer (LLM-as-Judge)
├── preferences.py   # Preference pair construction
├── calibration.py   # Human calibration
├── config.py        # Configuration
├── cli.py           # CLI
└── mcp_server.py    # MCP Server (4 tools)
```

## License

MIT

