v0.20.0 — Statistical rigor + Verdict + Sample/Failure diagnostics

@lizhiyao lizhiyao released this 25 Apr 16:03
· 643 commits to main since this release
6bdb225

[0.20.0] - 2026-04-25

Major release — statistical rigor as a first-class concern, plus a verdict / diagnostics / RAG / budget surface that turns omk from "evaluation runner" into "evaluation reasoning system."

Added — Statistical rigor, four components (the only full-stack offering in the industry)

  • Bootstrap CI (--bootstrap / --bootstrap-samples) — distribution-free confidence intervals for variant means + pairwise diff CI. t-test breaks on ordinal LLM scores; bootstrap stays valid at small N (< 30) and on skewed data. CI not crossing 0 = significant.
  • Human gold dataset workflow with Krippendorff α (omk bench gold {init,validate,compare} and omk bench run --gold-dir). Brings in external annotation as an anchor; omk warns when the gold annotator and the judge are the same model. Supports α with ordinal weights, weighted κ, Pearson, plus bootstrap CI on α itself.
  • Length-controlled judge prompt (default ON, hash v3-cot-length) — research consistently shows LLM judges over-weight verbosity. omk's prompt now explicitly states "length is not a quality signal"; older reports hash-mismatch by design. Audit empirically via omk bench debias-validate length.
  • Saturation curves (omk bench saturation, requires --repeat ≥ 5) — answers "do I have enough samples?". Three convergence methods (slope / bootstrap-ci-width / plateau-height); CI shrink rate < threshold across 3 windows = saturated.
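
As an illustration of the bootstrap idea above, here is a minimal sketch of a percentile bootstrap CI for the difference between two variant means. It is not omk's implementation: the function names, the default of 2000 resamples, and the percentile method are all assumptions chosen for the example.

```python
import random
from statistics import mean


def bootstrap_diff_ci(control, treatment, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(treatment) - mean(control).

    Hypothetical helper for illustration; omk's actual method may differ.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample each variant with replacement, keeping its original size.
        c = rng.choices(control, k=len(control))
        t = rng.choices(treatment, k=len(treatment))
        diffs.append(mean(t) - mean(c))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def is_significant(ci):
    """A CI that does not cross 0 is read as a significant difference."""
    lo, hi = ci
    return lo > 0 or hi < 0


# Ordinal judge scores (1-5) for two variants at small N.
ci = bootstrap_diff_ci([1, 2, 2, 1, 2], [4, 5, 5, 4, 5])
print(ci, is_significant(ci))
```

No normality assumption is needed here: the interval comes from the empirical distribution of resampled differences, which is why the approach can still be used on skewed or ordinal scores.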

Added — Verdict and analysis surface

  • omk bench verdict <reportId> — six-tier one-line verdict aggregating bootstrap CI / three-layer ci-gate / saturation / human α. Levels: PROGRESS / CAUTIOUS / REGRESS / NOISE / UNDERPOWERED / SOLO. Exit code routes for shell && chains.
  • HTML report top-of-page verdict pill sharing rules with the CLI.
  • omk bench diagnose <reportId> — 8 sample-quality issue kinds (flat_scores, all_pass, all_fail, near_duplicate, ambiguous_rubric, cost_outlier, latency_outlier, error_prone) + 0-100 healthScore. CI-friendly exit code.
  • omk bench failures <reportId> — single-LLM-call clustering of failure cases into ≤ N clusters with per-cluster root cause + suggested fix.
  • omk bench diff <reportId> (single-arg) — within-report sample-level drilldown sorted by |Δ|; --regressions-only / --top N filters. Two-arg form (cross-report) preserved.
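
The single-arg drilldown ordering can be pictured with a short sketch. This is an illustration only: the `drilldown` function, the dict-of-deltas input, and the convention that a negative Δ counts as a regression are assumptions, not omk's actual data model.

```python
def drilldown(deltas, regressions_only=False, top=None):
    """Sort per-sample score deltas by |delta|, largest first.

    deltas: mapping of sample id -> (treatment score - control score).
    Assumption for this sketch: a negative delta is a regression.
    """
    rows = [
        (sample_id, delta)
        for sample_id, delta in deltas.items()
        if not regressions_only or delta < 0
    ]
    rows.sort(key=lambda row: abs(row[1]), reverse=True)
    return rows if top is None else rows[:top]


deltas = {"s1": 0.5, "s2": -2.0, "s3": -0.1, "s4": 1.5}
print(drilldown(deltas))                                # all samples, largest |delta| first
print(drilldown(deltas, regressions_only=True, top=1))  # worst regression only
```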

Added — RAG metrics (auto length-debias)

  • faithfulness / answer_relevancy / context_recall assertion types — single-call LLM judge with the same length-debias instruction as the main rubric. reference falls back to sample.context or sample.prompt as appropriate.
  • examples/rag-eval/ complete demo (3 samples covering grounded answer / concise summary / refusal).
  • docs/rag-metrics-spec.md — prompt forms, comparison with RAGAS / DeepEval, known limitations.

Added — Hard budget caps

  • --budget-usd / --budget-per-sample-usd / --budget-per-sample-ms CLI flags.
  • eval.yaml budget: { totalUSD?, perSampleUSD?, perSampleMs? } schema.
  • report.meta.budgetExhausted = true flag when totalUSD trips abort; partial report persisted.
  • Concept boundary documented: budget = workflow-level hard cap (abort); cost_max / latency_max assertions = per-sample scoring rules (continue).
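
The abort-versus-continue distinction can be sketched as follows. This is a simplified model under stated assumptions, not omk's runner: `run_with_budget`, `run_sample`, and the `costOk` field are hypothetical, and checking the cap before each sample (rather than mid-sample) is a choice made for the example. Only the `budgetExhausted` flag name comes from the release notes.

```python
def run_with_budget(samples, run_sample, total_usd=None, cost_max=None):
    """Run samples until the total budget trips; score cost per sample.

    total_usd: workflow-level hard cap. Tripping it aborts the run.
    cost_max:  per-sample scoring rule. A violation is recorded, and
               the run continues.
    """
    report = {"results": [], "meta": {"budgetExhausted": False}}
    spent = 0.0
    for sample in samples:
        if total_usd is not None and spent >= total_usd:
            # Hard cap: stop here and keep the partial report.
            report["meta"]["budgetExhausted"] = True
            break
        cost = run_sample(sample)  # hypothetical callable returning cost in USD
        spent += cost
        report["results"].append({
            "sample": sample,
            "costUSD": cost,
            # Scoring rule: marks the sample, never aborts the run.
            "costOk": cost_max is None or cost <= cost_max,
        })
    return report


costs = {"a": 0.4, "b": 0.2, "c": 0.6, "d": 0.1}
report = run_with_budget(list(costs), costs.get, total_usd=1.0, cost_max=0.5)
print(report["meta"]["budgetExhausted"], len(report["results"]))
```

In this run, sample "c" fails the per-sample cost rule but is still recorded, while sample "d" never executes because the total cap has already been reached.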

Added — Assertion improvements

  • Universal not: true modifier — works on ANY assertion type (legacy not_contains / not_equals etc. preserved as aliases).
  • assert-set combinator with mode: 'any' | 'all', nestable.
  • Deterministic similarity assertions: rouge_n_min / levenshtein_max / bleu_min — self-implemented, zero npm dep, supports CJK + Latin tokenization.
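
To make the modifier and combinator semantics concrete, here is a small evaluator sketch. It is not omk's code or schema: the field names (`assertions`, `value`, `reference`), the `contains` type, and the character-level Levenshtein distance are assumptions for illustration; omk's own tokenization for CJK and Latin text may differ.

```python
def levenshtein(a, b):
    """Character-level edit distance (insert / delete / substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on match)
            ))
        prev = curr
    return prev[-1]


def evaluate(assertion, output):
    """Evaluate one assertion, honoring a universal `not` modifier."""
    kind = assertion["type"]
    if kind == "assert-set":
        # Nestable combinator: children may themselves be assert-sets.
        results = [evaluate(child, output) for child in assertion["assertions"]]
        passed = any(results) if assertion["mode"] == "any" else all(results)
    elif kind == "contains":
        passed = assertion["value"] in output
    elif kind == "levenshtein_max":
        passed = levenshtein(output, assertion["reference"]) <= assertion["value"]
    else:
        raise ValueError(f"unknown assertion type: {kind}")
    # `not: true` inverts the result of any assertion type.
    return not passed if assertion.get("not") else passed


rule = {
    "type": "assert-set",
    "mode": "all",
    "assertions": [
        {"type": "contains", "value": "error", "not": True},
        {"type": "assert-set", "mode": "any", "assertions": [
            {"type": "contains", "value": "Paris"},
            {"type": "levenshtein_max", "reference": "Paris, France", "value": 3},
        ]},
    ],
}
print(evaluate(rule, "The capital is Paris."))
```

Applying the inversion once, at the end of `evaluate`, is what lets a single `not` flag work on every assertion type, including a whole assert-set.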

Added — Production polish

  • omk bench verdict and omk bench diagnose exit-code semantics designed for CI/CD chains
  • HTML report verdict pill / pairwise CI / human-gold / saturation curve sections all have complete Chinese/English i18n
  • examples/rag-eval/ and examples/gold-dataset/ zero-config demos

Changed

  • SKILL.md updated: --variants (removed since v0.16) → --control / --treatment; gen-samples no longer takes a path; dead references/commands.md link replaced with README pointer.
  • README zh + en synchronized to v0.20 surface (4 new CLI sections, 5 new feature rows, 3 new RAG assertion rows, budget vs cost_max concept boundary).
  • Tagline rewritten to surface statistical rigor first ("LLM evaluation framework with built-in statistical rigor...").
  • npm keywords expanded from 9 → 20 with long-tail SEO terms (bootstrap-ci / krippendorff-alpha / rag-evaluation / llm-judge / evaluation-as-code etc.).

Removed

  • Phase 3b position-aware judge debias permanently dropped — omk does per-(sample × variant) independent scoring rather than pairwise comparison, so classic position bias is not present in this architecture.

Tests

  • 503 → 673 tests passing (+170 covering Bootstrap / α / Saturation / Verdict / RAG / Budget / Diagnose / Failure clustering)