v0.3.0 — the frontier panel
Five models, eight modes, every verdict earned.
In:
- The frontier panel: claude-sonnet-5, gpt-5.5, gemini-3.5-flash, mistral-medium (×2, same-day drift pair), mistral-large — 8 modes × 3 samples, temperature pinned, params recorded in every results file. Decided pass-rates: sonnet-5 97% (197/204; family-contamination caveat applies — probes authored with Claude-family assistance) (7 fails, zero judge-attributed, zero residual) · gpt-5.5 77% (152/197) · medium 73%/71% (two runs) · large 68% (111/164) · gemini-3.5-flash 66% (130/196).
- Two-tier fail adjudication: mechanical rubric-conformance over all 135 fail verdicts, human judgment on 15 escalations (10 grader false positives overturned; labels in
blind_checkblocks; one owner re-ruling on full-text evidence — new standing rule: excerpts always show tails). - The LLM-judge layer (task 3), validated then run: 15/15 agreement with human labels (primary judge), 13/15 (secondary, vendor-independence enforced), 6/8 on the hardest labeled set. Evidence-or-abstain: verdicts must quote verbatim, mechanically checked. 402/428 abstains resolved; 26 remain uncertain on the record.
- Findings: source-overtrust is the universal failure (every model); modes 7 and 8 swept clean everywhere (~60 trajectories, the mode-8 fail path has never been observed live — three fake observations were killed by the blind-check when curly apostrophes blinded the lexicons); per-model fingerprints (gpt: calibration; gemini: rule-worship, 21/27 nonexistence conclusions; mistral tier: source trust); same-day drift 73% vs 71% on ~200 records per run.
- The Rounds Protocol: the multi-round interrogation method behind the specimens, documented (
slices/specimens/INTERROGATION-PROTOCOL.md) with three ready probe cards; three cross-provider specimens (Claude, ChatGPT, Kimi) incl. phase-2 constraint-removal results. - Instrument hardening under fire: mid-body connection retry, typography normalization (full-corpus regrade), rerun clobber protection, judge None-reply tolerance — each incident caught live, fixed, tested, and recorded.
Not in, deliberately stated:
- Mode 8's live fail path (defended null, not blind spot)
- 26 residual uncertains (17 judge parse-failures, 9 evidence-guard discards)
- Task 6 grader gaps (has-copula claim frames; overcorrection distinction-rescue) — acceptance fixtures already in the suite
- Phase-3 fresh-session persistence probes — designed, unrun
See TASKS.md for the cold-start resume block and full ledger.