v0.20.0 — Statistical rigor + Verdict + Sample/Failure diagnostics
[0.20.0] - 2026-04-25
Major release — statistical rigor as a first-class concern, plus a verdict / diagnostics / RAG / budget surface that turns omk from "evaluation runner" into "evaluation reasoning system."
Added: Statistical rigor four-piece (the industry's only full stack)
- Bootstrap CI (`--bootstrap` / `--bootstrap-samples`): distribution-free confidence intervals for variant means, plus a CI on each pairwise difference. A t-test assumes roughly normal, interval-scaled data, which ordinal LLM scores violate; the bootstrap makes no such assumption and stays usable at small N (< 30) and on skewed data. A difference CI that does not cross 0 is significant.
- Human gold dataset workflow with Krippendorff α: `omk bench gold {init,validate,compare}` and `omk bench run --gold-dir`. Brings in external annotation as an anchor; omk warns when the gold annotator and the judge are the same model. Supports α with ordinal weights, weighted κ, and Pearson correlation, plus a bootstrap CI on α itself.
- Length-controlled judge prompt (default ON, hash `v3-cot-length`): research consistently shows that LLM judges over-weight verbosity. omk's prompt now explicitly states "length is not a quality signal"; older reports hash-mismatch by design. Audit empirically via `omk bench debias-validate length`.
- Saturation curves (`omk bench saturation`, requires `--repeat` ≥ 5): answers "do I have enough samples?". Three convergence methods (slope / bootstrap-ci-width / plateau-height); a CI shrink rate below the threshold across 3 windows counts as saturated.
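The pairwise-difference CI described in the first bullet can be illustrated with a percentile bootstrap. The following is a minimal sketch, not omk's implementation: the function names, the seeded PRNG, the default of 2000 resamples, and the choice to resample each variant independently are all assumptions made for the example.

```typescript
// Percentile bootstrap CI for the difference in mean score between two variants.
// All names are illustrative; omk's internals are not shown in this changelog.

// Small seeded PRNG (mulberry32) so the example is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function mean(xs: number[]): number {
  return xs.reduce((s, x) => s + x, 0) / xs.length;
}

// Draw xs.length values from xs with replacement.
function resample(xs: number[], rand: () => number): number[] {
  return xs.map(() => xs[Math.floor(rand() * xs.length)] as number);
}

function bootstrapDiffCI(
  control: number[],
  treatment: number[],
  samples = 2000, // assumed default, not taken from omk
  confidence = 0.95,
  seed = 42,
): { low: number; high: number; significant: boolean } {
  const rand = mulberry32(seed);
  const diffs: number[] = [];
  for (let i = 0; i < samples; i++) {
    diffs.push(mean(resample(treatment, rand)) - mean(resample(control, rand)));
  }
  diffs.sort((a, b) => a - b);
  const tail = (1 - confidence) / 2;
  const low = diffs[Math.floor(tail * (samples - 1))] as number;
  const high = diffs[Math.ceil((1 - tail) * (samples - 1))] as number;
  // Significant when the interval excludes 0.
  return { low, high, significant: low > 0 || high < 0 };
}

// Ordinal 1-5 judge scores, N = 8 per variant.
const control = [2, 3, 3, 2, 3, 2, 3, 3];
const treatment = [4, 5, 4, 4, 5, 4, 5, 4];
const ci = bootstrapDiffCI(control, treatment);
// Every treatment score exceeds every control score, so the CI excludes 0.
console.log(ci);
```

Nothing in the procedure relies on the scores being normally distributed, which is the property the bullet above appeals to for ordinal judge scores.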
Added — Verdict and analysis surface
- `omk bench verdict <reportId>`: six-tier one-line verdict aggregating bootstrap CI / three-layer ci-gate / saturation / human α. Levels: PROGRESS / CAUTIOUS / REGRESS / NOISE / UNDERPOWERED / SOLO. Exit codes route shell `&&` chains.
- HTML report top-of-page verdict pill sharing rules with the CLI.
- `omk bench diagnose <reportId>`: 8 sample-quality issue kinds (`flat_scores`, `all_pass`, `all_fail`, `near_duplicate`, `ambiguous_rubric`, `cost_outlier`, `latency_outlier`, `error_prone`) plus a 0-100 healthScore. CI-friendly exit code.
- `omk bench failures <reportId>`: single-LLM-call clustering of failure cases into ≤ N clusters, with a root cause and suggested fix per cluster.
- `omk bench diff <reportId>` (single-arg): within-report sample-level drilldown sorted by |Δ|; `--regressions-only` / `--top N` filters. The two-arg (cross-report) form is preserved.
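The single-argument `diff` drilldown amounts to computing a per-sample delta, optionally keeping only regressions, and sorting by absolute change. Here is a rough sketch under stated assumptions: the `SampleScores` and `DiffOptions` shapes are hypothetical, and Δ is assumed to mean treatment score minus control score; omk's real report schema is not shown in this changelog.

```typescript
// Hypothetical shapes; not omk's actual report schema.
interface SampleScores {
  sampleId: string;
  control: number; // score under the control variant
  treatment: number; // score under the treatment variant
}

interface DiffRow {
  sampleId: string;
  delta: number; // treatment - control (assumed definition of Δ)
}

interface DiffOptions {
  regressionsOnly?: boolean; // mirrors --regressions-only
  topN?: number; // mirrors --top N
}

function sampleDiff(rows: SampleScores[], opts: DiffOptions = {}): DiffRow[] {
  let diffs: DiffRow[] = rows.map((r) => ({
    sampleId: r.sampleId,
    delta: r.treatment - r.control,
  }));
  if (opts.regressionsOnly) {
    diffs = diffs.filter((d) => d.delta < 0);
  }
  // Largest absolute change first.
  diffs.sort((a, b) => Math.abs(b.delta) - Math.abs(a.delta));
  return opts.topN === undefined ? diffs : diffs.slice(0, opts.topN);
}

const scored: SampleScores[] = [
  { sampleId: "s1", control: 4, treatment: 5 },
  { sampleId: "s2", control: 5, treatment: 2 },
  { sampleId: "s3", control: 3, treatment: 3 },
  { sampleId: "s4", control: 4, treatment: 2 },
];

const allDiffs = sampleDiff(scored);
const worstRegression = sampleDiff(scored, { regressionsOnly: true, topN: 1 });
console.log(allDiffs.map((d) => d.sampleId)); // [ 's2', 's4', 's1', 's3' ]
console.log(worstRegression); // [ { sampleId: 's2', delta: -3 } ]
```

Sorting by |Δ| rather than signed Δ puts large improvements and large regressions at the top together, so the regressions-only filter is what narrows the view to breakage.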
Added — RAG metrics (auto length-debias)
- `faithfulness` / `answer_relevancy` / `context_recall` assertion types: single-call LLM judge with the same length-debias instruction as the main rubric. `reference` falls back to `sample.context` or `sample.prompt` as appropriate.
- `examples/rag-eval/` complete demo (3 samples covering grounded answer / concise summary / refusal).
- `docs/rag-metrics-spec.md`: prompt forms, comparison with RAGAS / DeepEval, known limitations.
Added — Hard budget caps
- `--budget-usd` / `--budget-per-sample-usd` / `--budget-per-sample-ms` CLI flags.
- `eval.yaml` `budget: { totalUSD?, perSampleUSD?, perSampleMs? }` schema.
- `report.meta.budgetExhausted = true` flag when `totalUSD` trips an abort; the partial report is persisted.
- Concept boundary documented: budget = workflow-level hard cap (abort); `cost_max` / `latency_max` assertions = per-sample scoring rules (continue).
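The abort-versus-continue boundary can be made concrete with a small runner loop. This is only a sketch: the field names `totalUSD`, `perSampleUSD`, `perSampleMs`, and `budgetExhausted` come from the notes above, while the `SampleResult` and `RunOutcome` types, the `runWithBudget` function, and the rule that a run stops once cumulative spend exceeds the cap are assumptions. Only the total cap is enforced here.

```typescript
// Field names totalUSD / perSampleUSD / perSampleMs / budgetExhausted are from the
// changelog; the types and the runner loop are hypothetical illustrations.
interface Budget {
  totalUSD?: number;
  perSampleUSD?: number; // not enforced in this sketch
  perSampleMs?: number; // not enforced in this sketch
}

interface SampleResult {
  id: string;
  costUSD: number;
  latencyMs: number;
  passed: boolean;
}

interface RunOutcome {
  results: SampleResult[];
  meta: { budgetExhausted: boolean; totalCostUSD: number };
}

function runWithBudget(
  samples: SampleResult[], // pre-computed results stand in for real model calls
  budget: Budget,
  costMax?: number, // per-sample assertion threshold, in the spirit of cost_max
): RunOutcome {
  const results: SampleResult[] = [];
  let spent = 0;
  let exhausted = false;
  for (const s of samples) {
    spent += s.costUSD; // cost is only known after the call completes
    // Assertion: an expensive sample fails its check, but the run continues.
    const withinCost = costMax === undefined || s.costUSD <= costMax;
    results.push({ ...s, passed: s.passed && withinCost });
    // Hard cap: once cumulative spend exceeds the budget, stop the whole run.
    if (budget.totalUSD !== undefined && spent > budget.totalUSD) {
      exhausted = true;
      break;
    }
  }
  // Results gathered so far are returned, mirroring the persisted partial report.
  return { results, meta: { budgetExhausted: exhausted, totalCostUSD: spent } };
}

const outcome = runWithBudget(
  [
    { id: "a", costUSD: 0.5, latencyMs: 900, passed: true },
    { id: "b", costUSD: 0.25, latencyMs: 700, passed: true },
    { id: "c", costUSD: 0.25, latencyMs: 800, passed: true },
    { id: "d", costUSD: 0.25, latencyMs: 750, passed: true },
  ],
  { totalUSD: 0.9 },
  0.3,
);
console.log(outcome.meta); // { budgetExhausted: true, totalCostUSD: 1 }
console.log(outcome.results.map((r) => `${r.id}:${r.passed}`)); // [ 'a:false', 'b:true', 'c:true' ]
```

Sample `a` breaches the per-sample threshold and is scored as failed while the run carries on; sample `d` never runs because the total cap tripped after `c`.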
Added — Assertion improvements
- Universal `not: true` modifier: works on ANY assertion type (legacy `not_contains` / `not_equals` etc. preserved as aliases).
- `assert-set` combinator with `mode: 'any' | 'all'`, nestable.
- Deterministic similarity assertions (`rouge_n_min` / `levenshtein_max` / `bleu_min`): self-implemented, zero npm dependencies, with CJK + Latin tokenization.
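How a universal `not` modifier and a nestable `assert-set` compose can be sketched with a small recursive evaluator. The names `not`, `assert-set`, `mode`, and `levenshtein_max` are taken from the notes above; the `contains` type, the field names `value` / `reference` / `max` / `assertions`, and the code-point-level edit distance are assumptions for illustration and may not match omk's schema or tokenization.

```typescript
// Hypothetical assertion model; field names are illustrative only.
type Assertion =
  | { type: "contains"; value: string; not?: boolean }
  | { type: "levenshtein_max"; reference: string; max: number; not?: boolean }
  | { type: "assert-set"; mode: "any" | "all"; assertions: Assertion[]; not?: boolean };

// Dynamic-programming edit distance over Unicode code points,
// so a CJK character counts as one unit just like a Latin letter.
function levenshtein(a: string, b: string): number {
  const s = Array.from(a);
  const t = Array.from(b);
  let prev: number[] = Array.from({ length: t.length + 1 }, (_, j) => j);
  for (let i = 1; i <= s.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      curr.push(
        Math.min(
          (prev[j] as number) + 1, // deletion
          (curr[j - 1] as number) + 1, // insertion
          (prev[j - 1] as number) + cost, // substitution
        ),
      );
    }
    prev = curr;
  }
  return prev[t.length] as number;
}

// Evaluate an assertion without applying its `not` modifier.
function evaluateRaw(a: Assertion, output: string): boolean {
  if (a.type === "contains") return output.includes(a.value);
  if (a.type === "levenshtein_max") return levenshtein(output, a.reference) <= a.max;
  const results = a.assertions.map((child) => evaluate(child, output));
  return a.mode === "all" ? results.every(Boolean) : results.some(Boolean);
}

// The `not` modifier inverts any assertion, including a whole set.
function evaluate(a: Assertion, output: string): boolean {
  const ok = evaluateRaw(a, output);
  return a.not ? !ok : ok;
}

const check: Assertion = {
  type: "assert-set",
  mode: "all",
  assertions: [
    { type: "contains", value: "refund" },
    { type: "contains", value: "guarantee", not: true },
    {
      type: "assert-set",
      mode: "any",
      assertions: [
        { type: "levenshtein_max", reference: "Refund issued.", max: 3 },
        { type: "contains", value: "退款" },
      ],
    },
  ],
};

console.log(evaluate(check, "refund issued")); // true
console.log(evaluate(check, "We guarantee a refund")); // false
```

Applying `not` in one place, after the type-specific check, is what lets it work on every assertion type without per-type negated variants.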
Added — Production polish
- `omk bench verdict` and `omk bench diagnose` exit-code semantics designed for CI/CD chains.
- HTML report verdict pill / pairwise CI / human-gold / saturation curve sections are all fully localized in Chinese and English (i18n complete).
- `examples/rag-eval/` and `examples/gold-dataset/` zero-config demos.
Changed
- SKILL.md updated: `--variants` (removed since v0.16) → `--control` / `--treatment`; `gen-samples` no longer takes a path; dead `references/commands.md` link replaced with a README pointer.
- README zh + en synchronized to the v0.20 surface (4 new CLI sections, 5 new feature rows, 3 new RAG assertion rows, budget vs `cost_max` concept boundary).
- Tagline rewritten to surface statistical rigor first ("LLM evaluation framework with built-in statistical rigor...").
- npm `keywords` expanded from 9 → 20 with long-tail SEO terms (bootstrap-ci / krippendorff-alpha / rag-evaluation / llm-judge / evaluation-as-code etc.).
Removed
- Phase 3b position-aware judge debias permanently dropped — omk does per-(sample × variant) independent scoring rather than pairwise comparison, so classic position bias is not present in this architecture.
Tests
- 503 → 673 tests passing (+170 covering Bootstrap / α / Saturation / Verdict / RAG / Budget / Diagnose / Failure clustering)