Releases: mightbesaad/llm-reliability-evals

v0.3.0 — the frontier panel

@mightbesaad mightbesaad released this 03 Jul 01:41
d6f7d8b

Five models, eight modes, every verdict earned.

In:

  • The frontier panel: claude-sonnet-5, gpt-5.5, gemini-3.5-flash, mistral-medium (×2, same-day drift pair), mistral-large; 8 modes × 3 samples, temperature pinned, params recorded in every results file. Decided pass-rates: sonnet-5 97% (197/204; 7 fails, zero judge-attributed, zero residual; family-contamination caveat applies, since the probes were authored with Claude-family assistance) · gpt-5.5 77% (152/197) · medium 73%/71% (two runs) · large 68% (111/164) · gemini-3.5-flash 66% (130/196).
  • Two-tier fail adjudication: mechanical rubric-conformance over all 135 fail verdicts, human judgment on 15 escalations (10 grader false positives overturned; labels in blind_check blocks; one owner re-ruling on full-text evidence — new standing rule: excerpts always show tails).
  • The LLM-judge layer (task 3), validated then run: 15/15 agreement with human labels (primary judge), 13/15 (secondary, vendor-independence enforced), 6/8 on the hardest labeled set. Evidence-or-abstain: verdicts must quote verbatim, mechanically checked. 402/428 abstains resolved; 26 remain uncertain on the record.
  • Findings: source-overtrust is the universal failure (every model); modes 7 and 8 swept clean everywhere (~60 trajectories). The mode-8 fail path has never been observed live: three fake observations, produced when curly apostrophes blinded the lexicons, were killed by the blind-check. Per-model fingerprints (gpt: calibration; gemini: rule-worship, 21/27 nonexistence conclusions; mistral tier: source trust); same-day drift 73% vs 71% on ~200 records per run.
  • The Rounds Protocol: the multi-round interrogation method behind the specimens, documented (slices/specimens/INTERROGATION-PROTOCOL.md) with three ready probe cards; three cross-provider specimens (Claude, ChatGPT, Kimi) incl. phase-2 constraint-removal results.
  • Instrument hardening under fire: mid-body connection retry, typography normalization (full-corpus regrade), rerun clobber protection, judge None-reply tolerance — each incident caught live, fixed, tested, and recorded.
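
The evidence-or-abstain rule and the typography normalization above can be illustrated together. The following is a minimal sketch, not the suite's actual code: the names `normalize` and `check_verdict`, the character table, and the "abstain" fallback string are all assumptions made for illustration. It shows why a curly apostrophe can defeat a naive substring check, and how folding typography before comparison avoids that.

```python
import unicodedata

# Hypothetical table: typographic characters folded to ASCII before comparison.
_TYPOGRAPHY = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes and apostrophes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en dash and em dash
})


def normalize(text: str) -> str:
    """Fold typography and collapse whitespace so matching ignores font choices."""
    text = unicodedata.normalize("NFKC", text).translate(_TYPOGRAPHY)
    return " ".join(text.split())


def check_verdict(verdict: str, quote: str, response: str) -> str:
    """Keep a verdict only if its quoted evidence occurs verbatim in the response.

    An empty quote, or one that cannot be found, downgrades the verdict
    to "abstain" (an assumed label, chosen for this sketch).
    """
    needle = normalize(quote)
    if not needle or needle not in normalize(response):
        return "abstain"
    return verdict


if __name__ == "__main__":
    reply = "I can\u2019t verify that the source exists."
    print(check_verdict("pass", "I can't verify", reply))
    print(check_verdict("fail", "the source is wrong", reply))
```

Normalizing both the quote and the response keeps the check symmetric, so a straight apostrophe in the judge's quote still matches a curly one in the model output.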

Not in, deliberately stated:

  • Mode 8's live fail path (defended null, not blind spot)
  • 26 residual uncertains (17 judge parse-failures, 9 evidence-guard discards)
  • Task 6 grader gaps (has-copula claim frames; overcorrection distinction-rescue) — acceptance fixtures already in the suite
  • Phase-3 fresh-session persistence probes — designed, unrun

See TASKS.md for the cold-start resume block and full ledger.

v0.2.0 — eight of eight

@mightbesaad mightbesaad released this 02 Jul 14:36
607a185

v0.2.0 — eight of eight, and the instrument earned it

Everything since v0.1.0 ("seven of eight modes"), one day of work, all on the record.

In:

  • All 8 taxonomy modes built — mode 7 (disconfirmation avoidance) landed: trajectory probes where the scripted check contradicts the conclusion, graded on what the model does with the contradiction.
  • Instrument hardening, PRs 1–6: offline harness contract tests (33 checks); model-agnostic provider layer — uniform sampling params, $OPENAI_BASE_URL for any OpenAI-compatible endpoint incl. local, retry/backoff honoring Retry-After, stdlib-only HTTP; crash-safe live runs (per-record atomic flush, params recorded in every results file); one-command entry point (python3 run.py — no keys, and it is exactly what CI runs); results/ convention; license split (Apache-2.0 code / CC-BY-4.0 taxonomy+docs).
  • First full live panel — mistral-medium, 189 records across 7 modes, every fail verdict blind-checked against raw output; labels, overturns, and regrades committed inside the results files.
  • The blind-check caught the sycophancy grader false-failing 6 of its 7 live fail verdicts, even though it was green on 17/17 fixtures (including apology-hold adversarials). Grader rebuilt on claim-polarity frames against the human labels; the six real false-positive responses harvested verbatim as fixtures; live regrade: fail 7→0, zero conflicts with human labels. The mode-3 lesson, paid for a second time, documented both times.
  • Replay regression gate: every slice's fixtures now also run through the real runner path in CI; any grader-vs-label mismatch exits nonzero.
  • Specimens: three organic cross-provider specimens (Claude, ChatGPT, Kimi) of the introspection-loop family, incl. phase-2 constraint-removal probes — with the finding "every correct restraint was externally imposed," countersigned by the model that produced it. Full-privacy redaction policy applied throughout.
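
The "per-record atomic flush" mentioned in the hardening item above is a common pattern, sketched here under assumptions: the function name `flush_results`, the JSON layout with `params` and `records` keys, and the example parameter values are hypothetical and not taken from the repository. The idea is to write the whole results file to a temporary sibling and rename it into place, so a crash mid-run leaves either the previous file or the new one, never a truncated one.

```python
import json
import os
import tempfile


def flush_results(path: str, params: dict, records: list) -> None:
    """Rewrite the results file atomically after each record (hypothetical layout)."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump({"params": params, "records": records}, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)     # atomic swap of old file for new
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()
    out_path = os.path.join(out_dir, "results.json")
    collected = []
    for index in range(3):
        collected.append({"id": index, "verdict": "pass"})
        # Sampling params travel with every flush, so each file is self-describing.
        flush_results(out_path, {"temperature": 0.0, "samples": 3}, collected)
    print(os.listdir(out_dir))
```

Rewriting the full file on every record is simple but scales linearly with run size; an append-only JSON Lines file would be an alternative trade-off.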

Not in, deliberately stated:

  • Live coverage is one model family — the cross-provider frontier panel is the next milestone (pre-flight checklist in TASKS.md)
  • Mode 7 has no live run yet; mode 8's fail path remains unobserved live (models verify or defer)
  • ~39% of live verdicts abstain by design, awaiting the LLM-judge layer (task 3), which will be validated against the accumulated human labels

See TASKS.md for the full ledger and build guardrails.

v0.1.0 — seven of eight modes

@mightbesaad mightbesaad released this 01 Jul 22:29
e14182a

First tagged state of the suite.

In:

  • 8-mode failure taxonomy (TAXONOMY.md) with per-mode detection criteria
  • 7 merged vertical slices, each with frozen probes, a deterministic grader,
    hand-labelled regression fixtures (112 total, all passing), and a runner
    with --live / --replay paths
  • Shared provider routing (Anthropic / Mistral / OpenAI) and a trajectory
    harness for the agentic modes: provider-normalized tool-use loop against
    scripted, frozen tools
  • Mode 3 live-validated (mistral-medium, 30 samples, verdicts blind-checked
    against raw responses); mode 8 live-panelled (3 Mistral models, none
    certified prematurely under a fair probe)
  • CI running the offline fixture suites
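
As a rough illustration of how a deterministic grader can be checked against hand-labelled fixtures in CI, consider the sketch below. Everything in it is hypothetical: the toy `grade` rule, the two fixtures, and the `replay` and `main` helpers stand in for the real slice graders and fixture files, whose format is not shown in these notes.

```python
import sys


def grade(response: str) -> str:
    """Toy deterministic grader (an assumption for illustration, not a real slice grader)."""
    return "fail" if "you're absolutely right" in response.lower() else "pass"


# Hypothetical hand-labelled fixtures: each pairs a frozen response with a human label.
FIXTURES = [
    {"id": "f1", "response": "You're absolutely right, I was wrong.", "label": "fail"},
    {"id": "f2", "response": "I checked again and my answer stands.", "label": "pass"},
]


def replay(fixtures: list, grader) -> list:
    """Return the ids of fixtures whose graded verdict disagrees with the human label."""
    return [f["id"] for f in fixtures if grader(f["response"]) != f["label"]]


def main() -> int:
    """Report mismatches on stderr and return a process exit code."""
    mismatches = replay(FIXTURES, grade)
    for fixture_id in mismatches:
        print(f"MISMATCH {fixture_id}", file=sys.stderr)
    return 1 if mismatches else 0

# A CI entry point would end with: sys.exit(main())
```

Returning the exit code from `main` instead of calling `sys.exit` inside it keeps the gate importable and easy to test offline.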

Not in, deliberately stated:

  • Mode 7 (disconfirmation avoidance) — unbuilt; the harness it needs is ready
  • Cross-provider live coverage — the mode-8 panel is Mistral-family only
  • Mode 8's fail path observed live — proven on adversarial fixtures only
  • LLM-judge layer for the graders' uncertain buckets

See TASKS.md for the full ledger and the build guardrails.