prism-verify 1.4.0
Runtime verifier-as-a-product — now measured, not asserted.
A full dogfood swarm (health A/B/C + the F-01 "validate-on-own-data" feature pass). Tests 472 → 641.
Highlights
- Lock-1 family-A/B fixed — the family-different proof had been silently measuring nothing; it now produces a real paired McNemar delta + CI. First real run is an honest null on a small/confounded corpus (see
eval/RESULTS.md). - Validate-on-own-data — a real-bug corpus (vendored MIT QuixBugs, contamination-honest) + a CodeJudgeBench loader/harness (
prism eval --benchmark codejudgebench). - Observability — request-id correlation threaded HTTP→engine→providers, structured decision/breaker logging,
/healthzreports circuit-breaker state. - Eval reproducibility — corpus content-hash + resolved-model + temperature pinned in the report; honest contaminated-vs-fresh accuracy delta.
- Fixed (CRITICAL) — the npm launcher was shipping a 6-week-stale binary; it now self-syncs the binary pin from
package.json. - Maintenance: Node-24 action bumps, SHA-pinned publish actions, dependency upper-bounds,
CONTRIBUTING.md.
Full detail: CHANGELOG.md. Measured calibration: eval/RESULTS.md.