prism-verify 1.5.0 — valid family-different A/B
The validity-clean rebuild of prism's flagship Lock-1 measurement ("family-different verification helps").
Added
- Family-different A/B v2: the within-judge round-robin. A family-provenanced, execution-labeled corpus (each model generates the code; hidden tests label it, with no LLM grader), and a within-judge estimator where the verifier's capability cancels inside the contrast, so
  `self_preference(V) = false_accept_rate(V on own-family bugs) − …(other-family bugs)`
  is family self-preference, not a size artifact. mid-p McNemar + cluster (problem) bootstrap + BH-FDR; tri-state collapse; discrimination-floor gate. First valid pilot published in `eval/RESULTS.md` (an interpretable ceiling-effect null; the v1.4.0 result was a confound).
- Configurable verifier registry (F-14). `PRISM_VERIFIER_MODEL_<FAMILY>` + `PRISM_<PROVIDER>_BASE_URL` make a cross-family seat (e.g. Ollama-Cloud `gpt-oss:120b-cloud`) first-class env config. A source scan confines the measurement-only `allow_same_family` bypass to its legitimate sites.
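The within-judge contrast and two of the statistical steps named above can be illustrated with a minimal sketch. This is not prism's implementation: the function names, the boolean "accepted" encoding, and the assumption that McNemar's test runs on discordant own-family/other-family pairs are all hypothetical, and the cluster bootstrap is omitted.

```python
from math import comb


def false_accept_rate(accepted):
    """Fraction of known-buggy samples the verifier accepted (True = accepted)."""
    return sum(accepted) / len(accepted)


def self_preference(own_family_accepted, other_family_accepted):
    """Within-judge contrast: the same verifier scores both groups, so its
    overall capability appears in both rates and cancels in the difference."""
    return false_accept_rate(own_family_accepted) - false_accept_rate(other_family_accepted)


def mcnemar_midp(b, c):
    """Two-sided mid-p McNemar test on discordant pair counts b and c.
    Assumption: b and c count pairs where only one side was falsely accepted."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n  # P(X <= k), X ~ Bin(n, 0.5)
    point = comb(n, k) / 2**n                            # P(X = k)
    return min(1.0, 2 * tail - point)


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure; returns one reject flag per p-value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff = rank  # largest rank that passes its threshold
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject
```

In a round-robin, each verifier would yield one contrast and one p-value, and the BH step would then be applied across verifiers.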
Fixed
- Dated-Anthropic-model default → durable alias `claude-haiku-4-5`.
- Stale runtime version: `prism.__version__` was pinned at 1.0.0 since v1.0 (surfaced by `prism --version` + the HTTP version).
Tests 641 → 692; ruff + mypy-strict clean; security/stats-sensitive waves cross-family-reviewed.