prism-verify 1.5.0 — valid family-different A/B

mcp-tool-shop released this 14 Jun 14:28
0b73f64

The validity-clean rebuild of prism's flagship Lock-1 measurement ("family-different verification helps").

Added

  • Family-different A/B v2: the within-judge round-robin. The corpus is family-provenanced and execution-labeled: each model generates the code and hidden tests label it, with no LLM grader. The within-judge estimator cancels the verifier's capability inside the contrast, so self_preference(V) = false_accept_rate(V on own-family bugs) − …(other-family bugs) measures family self-preference, not a size artifact. Statistics: mid-p McNemar + cluster (problem) bootstrap + BH-FDR; tri-state collapse; discrimination-floor gate. The first valid pilot is published in eval/RESULTS.md: an interpretable ceiling-effect null (the v1.4.0 result was a confound).
  • Configurable verifier registry (F-14). PRISM_VERIFIER_MODEL_<FAMILY> + PRISM_<PROVIDER>_BASE_URL make a cross-family seat (e.g. Ollama-Cloud gpt-oss:120b-cloud) first-class env config. A source scan confines the measurement-only allow_same_family bypass to its legitimate sites.
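
For readers unfamiliar with the statistics named in the first item, here is a minimal Python sketch of the within-judge contrast, a mid-p McNemar test, and Benjamini-Hochberg FDR control. It is illustrative only and is not taken from the prism codebase: the function names, the input shapes, and the way discordant pairs are formed (the same verifier judging an own-family bug and an other-family bug on the same problem) are assumptions. The mid-p formula is one common formulation (twice the lower binomial tail, minus the point probability of the observed count).

```python
from math import comb


def false_accept_rate(accepted):
    """Share of known-buggy samples the verifier accepted (True = accepted)."""
    return sum(accepted) / len(accepted)


def self_preference(own_family_accepts, other_family_accepts):
    """Within-judge contrast: both rates come from the SAME verifier,
    so its overall capability appears in both terms and cancels."""
    return (false_accept_rate(own_family_accepts)
            - false_accept_rate(other_family_accepts))


def mcnemar_midp(b, c):
    """Two-sided mid-p McNemar test on discordant pair counts b and c.

    Hypothetical pairing: b = pairs where only the own-family bug was
    accepted, c = pairs where only the other-family bug was accepted.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence either way
    k = min(b, c)
    # Under H0 the smaller count is Binomial(n, 0.5).
    cdf = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    point = comb(n, k) / 2 ** n
    return min(1.0, 2 * cdf - point)


def bh_fdr(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up: return a reject flag per hypothesis."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            max_rank = rank  # largest rank passing its threshold
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_rank:
            reject[i] = True
    return reject


if __name__ == "__main__":
    own = [True, True, False, False]      # made-up verdicts, not real data
    other = [True, False, False, False]
    print(self_preference(own, other))
    print(mcnemar_midp(1, 5))
    print(bh_fdr([0.01, 0.02, 0.03, 0.5]))
```

A positive self_preference value would mean the verifier lets through more bugs from its own family than from other families. The cluster (problem) bootstrap, tri-state collapse, and discrimination-floor gate are not covered by this sketch.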

Fixed

  • Dated-Anthropic-model default → durable alias claude-haiku-4-5.
  • Stale runtime version: prism.__version__ had been pinned at 1.0.0 since v1.0 (surfaced by prism --version and the HTTP version).

Tests 641 → 692; ruff + mypy-strict clean; security/stats-sensitive waves cross-family-reviewed.