v0.25.0
⚠ BREAKING changes
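omk consolidates the scattered, parallel judge model / executor fields (judgeModel: string + judgeExecutor + judgeRuntime + judgeModels: string[] + judgeRuntimes: Record<>) into a single field, judgeModels: JudgeRuntimeEntry[]. N=1 is the degenerate single-judge case; N≥2 is an ensemble. In addition, eval.yaml (v0.2) gains the 9 experiment-design fields that already existed as CLI flags but were missing from the schema.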
CLI
The old --judge-model / --judge-executor flags now fail immediately with error: Unknown option and exit code 2; there are no aliases. Use --judge-models executor:model[,executor:model,...] instead:
| Old | New |
|---|---|
| --judge-model haiku --judge-executor claude | --judge-models claude:haiku |
| (no main-path entry point for multiple judges) | --judge-models claude:opus,openai:gpt-4o |
bench run / bench gate / bench evolve / bench debias-validate / bench failures all use the same flag. The last three remain single-judge commands and exit with code 2 when given two or more entries.
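To make the `executor:model[,executor:model,...]` format concrete, here is a minimal parsing sketch. It is not omk's actual parser: `JudgeEntry`, `parseJudgeModels`, and `requireSingleJudge` are hypothetical names, and the entry shape assumes only the `executor` and `model` fields that appear in these notes.

```typescript
// Hypothetical entry shape: only `executor` and `model` are taken from the notes.
interface JudgeEntry {
  executor: string;
  model: string;
}

// Parse a comma-separated list of executor:model pairs.
// Splits each item on the FIRST colon, so a model name may itself contain colons.
function parseJudgeModels(spec: string): JudgeEntry[] {
  return spec.split(",").map((raw) => {
    const item = raw.trim();
    const sep = item.indexOf(":");
    if (sep <= 0 || sep === item.length - 1) {
      throw new Error(`Expected executor:model, got "${item}"`);
    }
    return { executor: item.slice(0, sep), model: item.slice(sep + 1) };
  });
}

// Guard for single-judge commands: reject anything but exactly one entry.
// (The real CLI exits with code 2; throwing here is a stand-in for that.)
function requireSingleJudge(entries: JudgeEntry[], command: string): JudgeEntry {
  const [only] = entries;
  if (entries.length !== 1 || only === undefined) {
    throw new Error(`${command} accepts exactly one judge, got ${entries.length}`);
  }
  return only;
}
```

Under these assumptions, `parseJudgeModels("claude:opus,openai:gpt-4o")` yields a two-entry ensemble, and passing that result to `requireSingleJudge` throws.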
Report schema (cross-version COMPARABILITY)
ReportMeta / BatchEvaluationMeta:

    // before
    { judgeModel: string | null, judgeRuntime?: ExecutorRuntimeFingerprint | null,
      judgeModels?: string[],   // stringified ensemble list
      judgeRuntimes?: Record<string, ExecutorRuntimeFingerprint> }
    // after
    { judgeModels: JudgeRuntimeEntry[],   // runtime is attached directly to each entry
      noJudge?: boolean }                 // expresses "no judge" explicitly

VariantSummary.judgeModels changes from string[] to JudgeConfig[] (no longer stringified); EvaluationRequest drops the single-value judgeModel and judgeExecutor fields.
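Reports from 0.24 and earlier are not comparable across versions: the renderer shows "—" where the new fields are missing, and bench diff flags the schema as not comparable. No read-compatibility shim is provided (the BREAKING-COMPARABILITY policy for the 0.x phase).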
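Downstream tools that load saved reports may want a similar check of their own. The sketch below shows one way to tell the structured shape from the legacy one. It is illustrative only: `hasStructuredJudges` and `checkComparable` are hypothetical helpers, not omk exports, and they inspect only the `judgeModels` field described above.

```typescript
// Hypothetical entry shape: only `executor` and `model` are assumed.
interface JudgeEntry {
  executor: string;
  model: string;
}

type Comparability =
  | { comparable: true }
  | { comparable: false; reason: string };

// True when `meta.judgeModels` is an array of { executor, model } objects.
// Legacy reports (judgeModels missing, or a string[]) fail this check.
function hasStructuredJudges(meta: unknown): boolean {
  if (typeof meta !== "object" || meta === null) return false;
  const judges = (meta as { judgeModels?: unknown }).judgeModels;
  return (
    Array.isArray(judges) &&
    judges.every(
      (j) =>
        typeof j === "object" &&
        j !== null &&
        typeof (j as JudgeEntry).executor === "string" &&
        typeof (j as JudgeEntry).model === "string",
    )
  );
}

// Two reports are treated as comparable only if both use the structured shape.
function checkComparable(a: unknown, b: unknown): Comparability {
  if (!hasStructuredJudges(a) || !hasStructuredJudges(b)) {
    return { comparable: false, reason: "legacy judge schema (0.24 or earlier)" };
  }
  return { comparable: true };
}
```

Rejecting rather than converting legacy reports mirrors the no-shim policy stated above.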
Programmatic API
For downstream consumers that directly import { grade, executeTasks, runEvaluation, evolveSkill }:

    // before
    grade({ executor, judgeModel, judgeModels, judgeExecutors, ... })
    executeTasks({ executor, judgeExecutor, judgeModel, ... })
    // after
    grade({ judgeModels: [{executor, model}], judgeExecutors: { [executor]: fn }, ... })
    executeTasks({ executor, judgeModels, judgeExecutors, ... })
    // `judgeExecutors` is lazy: sync-only samples can pass `{}`; entries are validated only when an LLM is actually used

✨ eval.yaml v0.2
Nine fields that already existed as CLI flags but were missing from the schema are now part of EvalConfig:

    judgeModels:
      - { executor: claude, model: haiku }
      - { executor: openai, model: gpt-4o }
    repeat: 3
    judgeRepeat: 3
    bootstrap: true
    bootstrapSamples: 1000
    goldDir: ./gold
    lengthDebias: true
    strictBaseline: true
    noJudge: false

Precedence: CLI flag > eval.yaml > hard-coded default. bench run supports every field; bench gate shares the base fields (variants / executor / model / judgeModels / noJudge / strictBaseline / etc.) but does not read the v0.2-specific fields (repeat / bootstrap / etc.).
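The precedence rule can be sketched as a three-layer merge. This is an assumption-laden illustration rather than omk's code: `ExperimentOptions`, `resolveOptions`, and the values in `PLACEHOLDER_DEFAULTS` are hypothetical, and only a subset of the fields is shown.

```typescript
// Subset of the eval.yaml v0.2 fields, for illustration.
interface ExperimentOptions {
  repeat: number;
  bootstrap: boolean;
  bootstrapSamples: number;
  noJudge: boolean;
}

// Placeholder values only; these are NOT omk's real defaults.
const PLACEHOLDER_DEFAULTS: ExperimentOptions = {
  repeat: 1,
  bootstrap: false,
  bootstrapSamples: 1000,
  noJudge: false,
};

// CLI flag > eval.yaml > hard-coded default.
// `??` (not `||`) is used so an explicit `false` or `0` from a higher layer
// still overrides a truthy value from a lower one.
function resolveOptions(
  cli: Partial<ExperimentOptions>,
  yaml: Partial<ExperimentOptions>,
  defaults: ExperimentOptions = PLACEHOLDER_DEFAULTS,
): ExperimentOptions {
  return {
    repeat: cli.repeat ?? yaml.repeat ?? defaults.repeat,
    bootstrap: cli.bootstrap ?? yaml.bootstrap ?? defaults.bootstrap,
    bootstrapSamples:
      cli.bootstrapSamples ?? yaml.bootstrapSamples ?? defaults.bootstrapSamples,
    noJudge: cli.noJudge ?? yaml.noJudge ?? defaults.noJudge,
  };
}
```

For example, with `repeat: 3` in eval.yaml and a CLI value of 5, the resolved `repeat` is 5, while a `bootstrap: true` set only in eval.yaml is preserved.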
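🎨 Display text changes
In HTML / CLI reports, a single judge is now shown as 评委: claude:haiku (the full executor:model form) instead of 评委: haiku. Cross-report comparison no longer has a stringified vs structured ambiguity. Scripts that grep or parse the output will see the change and may need updating.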
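For scripts that scrape this label, a tolerant parser might accept both forms during the transition. The sketch below is hypothetical: `parseJudgeLabel` is not part of omk, and it assumes the label appears on one line as `评委:` followed by a whitespace-free value, as in the two examples above.

```typescript
interface JudgeLabel {
  executor: string | null; // null for the pre-0.25 bare-model form
  model: string;
}

// Extract the judge from a report line such as "评委: claude:haiku".
// Accepts an ASCII or full-width colon after the label (an assumption).
function parseJudgeLabel(line: string): JudgeLabel | null {
  const match = /评委[::]\s*(\S+)/.exec(line);
  const value = match?.[1];
  if (value === undefined) return null;
  const sep = value.indexOf(":");
  if (sep === -1) {
    return { executor: null, model: value };
  }
  return { executor: value.slice(0, sep), model: value.slice(sep + 1) };
}
```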
What's Changed
- chore(deps): point dependabot at develop by @lizhiyao in #55
- docs(readme): clarify the entry points for Claude Code and Codex by @lizhiyao in #56
- feat: add omk doctor, a pre-evaluation health check by @lizhiyao in #57
- docs(changelog): backfill the PR #57 placeholder by @lizhiyao in #58
- chore: streamline workflows: drop CHANGELOG + CI matrix [22,24] + yarn cache + concurrency + dependabot grouping by @lizhiyao in #59
- chore: workflow follow-up: CI [22,24] + remove postbuild + run the three yarn ci steps in parallel (~9s→5.7s) by @lizhiyao in #60
- chore: remove Phase / v0.X point-in-time tags from test names and code comments (39 occurrences) by @lizhiyao in #61
- feat(judge)!: unify judgeModels end to end + eval.yaml v0.2; schema / CLI / subcommands consolidated in one pass by @lizhiyao in #62
Full Changelog: v0.24.0...v0.25.0