Release v0.1.0 — seven of eight modes · mightbesaad/llm-reliability-evals

First tagged state of the suite.

In:

8-mode failure taxonomy (TAXONOMY.md) with per-mode detection criteria
7 merged vertical slices, each with frozen probes, a deterministic grader,
hand-labelled regression fixtures (112 total, all passing), and a runner
with --live / --replay paths
Shared provider routing (Anthropic / Mistral / OpenAI) and a trajectory
harness for the agentic modes: provider-normalized tool-use loop against
scripted, frozen tools
Mode 3 live-validated (mistral-medium, 30 samples, verdicts blind-checked
against raw responses); mode 8 live-panelled (3 Mistral models, none
certified prematurely under a fair probe)
CI running the offline fixture suites

Not in, deliberately stated:

See TASKS.md for the full ledger and the build guardrails.

Provide feedback