Skip to content

prism-verify 1.4.0

Choose a tag to compare

@mcp-tool-shop mcp-tool-shop released this 14 Jun 11:42

Runtime verifier-as-a-product — now measured, not asserted.

A full dogfood swarm (health A/B/C + the F-01 "validate-on-own-data" feature pass). Tests 472 → 641.

Highlights

  • Lock-1 family-A/B fixed — the family-different proof had been silently measuring nothing; it now produces a real paired McNemar delta + CI. First real run is an honest null on a small/confounded corpus (see eval/RESULTS.md).
  • Validate-on-own-data — a real-bug corpus (vendored MIT QuixBugs, contamination-honest) + a CodeJudgeBench loader/harness (prism eval --benchmark codejudgebench).
  • Observability — request-id correlation threaded HTTP→engine→providers, structured decision/breaker logging, /healthz reports circuit-breaker state.
  • Eval reproducibility — corpus content-hash + resolved-model + temperature pinned in the report; honest contaminated-vs-fresh accuracy delta.
  • Fixed (CRITICAL) — the npm launcher was shipping a 6-week-stale binary; it now self-syncs the binary pin from package.json.
  • Maintenance: Node-24 action bumps, SHA-pinned publish actions, dependency upper-bounds, CONTRIBUTING.md.

Full detail: CHANGELOG.md. Measured calibration: eval/RESULTS.md.