v0.1.0 — seven of eight modes
First tagged state of the suite.
In:
- 8-mode failure taxonomy (TAXONOMY.md) with per-mode detection criteria
- 7 merged vertical slices, each with frozen probes, a deterministic grader,
  hand-labelled regression fixtures (112 total, all passing), and a runner
  with --live/--replay paths
- Shared provider routing (Anthropic / Mistral / OpenAI) and a trajectory
  harness for the agentic modes: provider-normalized tool-use loop against
  scripted, frozen tools
- Mode 3 live-validated (mistral-medium, 30 samples, verdicts blind-checked
  against raw responses); mode 8 live-panelled (3 Mistral models, none
  certified prematurely under a fair probe)
- CI running the offline fixture suites
Not in, deliberately stated:
- Mode 7 (disconfirmation avoidance) — unbuilt; the harness it needs is ready
- Cross-provider live coverage — the mode-8 panel is Mistral-family only
- Mode 8's fail path observed live — proven on adversarial fixtures only
- LLM-judge layer for the graders' uncertain buckets
See TASKS.md for the full ledger and the build guardrails.