
v0.25.0


@github-actions github-actions released this 02 May 16:58
· 470 commits to main since this release

⚠ BREAKING changes

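omk consolidates the parallel judge model / executor fields that were scattered across the codebase (judgeModel: string + judgeExecutor + judgeRuntime + judgeModels: string[] + judgeRuntimes: Record<>) into a single field, judgeModels: JudgeRuntimeEntry[]. N=1 is the degenerate single-judge case; N≥2 is an ensemble. In addition, eval.yaml gains the 9 experiment-design fields that already had CLI flags but were missing from the schema (v0.2).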

CLI

--judge-model and --judge-executor are removed with no alias; passing either one fails with error: Unknown option and exit code 2. Use --judge-models executor:model[,executor:model,...] instead:

Before                                           After
--judge-model haiku --judge-executor claude      --judge-models claude:haiku
(no multi-judge entry point on the main path)    --judge-models claude:opus,openai:gpt-4o

bench run / bench gate / bench evolve / bench debias-validate / bench failures all use the same flag. The last three remain single-judge use cases and exit with code 2 when the list has length ≥ 2.
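
As an illustration of the executor:model[,executor:model,...] syntax, here is a minimal parsing sketch. It is not omk's actual parser: JudgeEntry, parseJudgeModels, and requireSingleJudge are hypothetical names, and splitting on the first colon (so that a model name may itself contain a colon) is an assumption.

```typescript
// Hypothetical shape; the real JudgeRuntimeEntry may carry more fields (e.g. runtime).
interface JudgeEntry {
  executor: string;
  model: string;
}

// Parse "executor:model[,executor:model,...]" into structured entries.
function parseJudgeModels(spec: string): JudgeEntry[] {
  return spec.split(",").map((raw) => {
    const item = raw.trim();
    const sep = item.indexOf(":");
    // Require a non-empty executor and a non-empty model around the first colon.
    if (sep <= 0 || sep === item.length - 1) {
      throw new Error(`Invalid judge spec "${item}", expected executor:model`);
    }
    return { executor: item.slice(0, sep), model: item.slice(sep + 1) };
  });
}

// Single-judge subcommands (evolve / debias-validate / failures) reject N >= 2.
function requireSingleJudge(entries: JudgeEntry[]): JudgeEntry {
  const only = entries[0];
  if (entries.length !== 1 || only === undefined) {
    throw new Error(`Expected exactly one judge, got ${entries.length}`);
  }
  return only;
}
```

A real CLI would map both errors to exit code 2, as described above; the sketch only throws.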

Report schema (cross-version COMPARABILITY)

ReportMeta / BatchEvaluationMeta:

// before
{ judgeModel: string | null, judgeRuntime?: ExecutorRuntimeFingerprint | null,
  judgeModels?: string[],   // stringified ensemble list
  judgeRuntimes?: Record<string, ExecutorRuntimeFingerprint> }

// after
{ judgeModels: JudgeRuntimeEntry[],  // runtime is attached directly to each entry
  noJudge?: boolean }                // explicitly expresses "no judge" semantics

VariantSummary.judgeModels changes from string[] to JudgeConfig[] (no longer stringified); EvaluationRequest drops the single judgeModel + judgeExecutor fields.

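Reports from 0.24 and earlier are not comparable across versions: the renderer shows "—" where the new fields are missing, and bench diff marks the schema as not comparable. No read-compatibility shim is provided (the BREAKING-COMPARABILITY policy for the 0.x phase).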
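
Downstream tooling that loads saved reports may want to tell the two shapes apart before comparing them. The following is a rough sketch, not part of omk: hasStructuredJudgeModels is a hypothetical helper, and it assumes each JudgeRuntimeEntry exposes string executor and model fields.

```typescript
// Returns true when a report meta object uses the structured judgeModels shape
// (an array of { executor, model } entries) and carries no legacy judgeModel key.
function hasStructuredJudgeModels(meta: unknown): boolean {
  if (typeof meta !== "object" || meta === null) return false;
  const record = meta as Record<string, unknown>;
  // The legacy shape always carried judgeModel: string | null.
  if ("judgeModel" in record) return false;
  const judgeModels = record.judgeModels;
  if (!Array.isArray(judgeModels)) return false;
  return judgeModels.every((entry) => {
    if (typeof entry !== "object" || entry === null) return false;
    const e = entry as Record<string, unknown>;
    return typeof e.executor === "string" && typeof e.model === "string";
  });
}
```

An empty judgeModels array passes the check, which matches a report written with noJudge set.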

Programmatic API

For downstream consumers that directly import { grade, executeTasks, runEvaluation, evolveSkill }:

// before
grade({ executor, judgeModel, judgeModels, judgeExecutors, ... })
executeTasks({ executor, judgeExecutor, judgeModel, ... })

// after
grade({ judgeModels: [{executor, model}], judgeExecutors: { [executor]: fn }, ... })
executeTasks({ executor, judgeModels, judgeExecutors, ... })
// `judgeExecutors` is lazy: sync-only samples can pass `{}`; entries are validated only when an LLM is actually used
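
The lazy validation described in the last comment can be pictured roughly as below. This is a sketch under assumptions, not omk's implementation: the executor function signature, the Sample shape, and the names resolveJudgeExecutor and gradeSample are all hypothetical.

```typescript
type JudgeEntry = { executor: string; model: string };
// Hypothetical signature; the real executor function type is not shown in these notes.
type JudgeExecutorFn = (prompt: string) => string;
type JudgeExecutors = { [executor: string]: JudgeExecutorFn | undefined };

// Look up the executor for one entry, failing only at the point of use.
function resolveJudgeExecutor(entry: JudgeEntry, judgeExecutors: JudgeExecutors): JudgeExecutorFn {
  const fn = judgeExecutors[entry.executor];
  if (fn === undefined) {
    throw new Error(`No judge executor registered for "${entry.executor}"`);
  }
  return fn;
}

// Hypothetical sample: either checked synchronously or sent to the LLM judges.
interface Sample {
  output: string;
  syncCheck?: (output: string) => boolean;
}

function gradeSample(sample: Sample, judgeModels: JudgeEntry[], judgeExecutors: JudgeExecutors): string[] {
  if (sample.syncCheck) {
    // Sync-only path: judgeExecutors is never touched, so `{}` is acceptable.
    return [sample.syncCheck(sample.output) ? "pass" : "fail"];
  }
  // LLM path: each entry is validated here, when it is actually needed.
  return judgeModels.map((entry) => resolveJudgeExecutor(entry, judgeExecutors)(sample.output));
}
```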

✨ eval.yaml v0.2

Nine fields that already had CLI flags but were missing from the schema are now part of EvalConfig:

judgeModels:
  - { executor: claude, model: haiku }
  - { executor: openai, model: gpt-4o }
repeat: 3
judgeRepeat: 3
bootstrap: true
bootstrapSamples: 1000
goldDir: ./gold
lengthDebias: true
strictBaseline: true
noJudge: false

Precedence: CLI flag > eval.yaml > hard-coded default. bench run supports every field; bench gate shares the base fields (variants / executor / model / judgeModels / noJudge / strictBaseline / etc.) and does not read the v0.2-specific fields (repeat / bootstrap / etc.).
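
The precedence rule can be sketched as a three-layer lookup. This is an illustrative sketch only: EvalOptions covers a subset of the fields, and the values in DEFAULTS are placeholders, not omk's real defaults.

```typescript
// Subset of the eval.yaml v0.2 fields, for illustration.
interface EvalOptions {
  repeat: number;
  bootstrap: boolean;
  bootstrapSamples: number;
  noJudge: boolean;
}

// Placeholder defaults (assumed values, not taken from omk).
const DEFAULTS: EvalOptions = {
  repeat: 1,
  bootstrap: false,
  bootstrapSamples: 1000,
  noJudge: false,
};

// CLI flag > eval.yaml > hard-coded default. `??` only falls through on
// undefined/null, so an explicit `false` or `0` from a higher layer wins.
function resolveOptions(cli: Partial<EvalOptions>, yaml: Partial<EvalOptions>): EvalOptions {
  return {
    repeat: cli.repeat ?? yaml.repeat ?? DEFAULTS.repeat,
    bootstrap: cli.bootstrap ?? yaml.bootstrap ?? DEFAULTS.bootstrap,
    bootstrapSamples: cli.bootstrapSamples ?? yaml.bootstrapSamples ?? DEFAULTS.bootstrapSamples,
    noJudge: cli.noJudge ?? yaml.noJudge ?? DEFAULTS.noJudge,
  };
}
```

Using `??` rather than `||` matters for boolean fields: a CLI value of false for noJudge should override a true in eval.yaml instead of being treated as "unset".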

🎨 Display text changes

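In HTML / CLI reports, a single judge is now shown in the full executor:model form: 评委: haiku becomes 评委: claude:haiku (评委 is the "judge" label). This removes the stringified vs. structured ambiguity when comparing reports. Scripts that grep or parse the output will see this change.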


What's Changed

  • chore(deps): point dependabot at develop by @lizhiyao in #55
  • docs(readme): clarify the usage entry points for Claude Code and Codex by @lizhiyao in #56
  • feat: add omk doctor, a pre-evaluation health check by @lizhiyao in #57
  • docs(changelog): backfill the PR #57 placeholder by @lizhiyao in #58
  • chore: workflow slimming (drop CHANGELOG + CI matrix [22,24] + yarn cache + concurrency + dependabot grouping) by @lizhiyao in #59
  • chore: workflow follow-up (CI [22,24] + drop postbuild + run the three yarn ci steps in parallel, ~9s→5.7s) by @lizhiyao in #60
  • chore: strip Phase / v0.X point-in-time tags from test names and code comments (39 occurrences) by @lizhiyao in #61
  • feat(judge)!: unify judgeModels end to end + eval.yaml v0.2 (schema/CLI/subcommands consolidated in one pass) by @lizhiyao in #62

Full Changelog: v0.24.0...v0.25.0