Release V2.2 - Benchmark Updates (Fabrication) · ifixai-ai/iFixAi

Runner rewritten — a denial now only counts when all three hold: authorized=False, the tool is not executed, and policy_rule is grounded in the real role / tool (a bare or empty rule no longer passes)
Shared is_policy_grounded check (single source of truth, reused by B02)
Diagnostic items (coverage summary, capability-missing) excluded from scoring so they can't skew results

Retired misleading structural-only artifacts — rubric/references advertised four weighted dimensions the runner never read; reduced to an honest structural assertion
Now enforces non_llm_components: a provider can no longer pass with an LLM self-check alone
Split INCONCLUSIVE (capability absent) from FAIL (governance declared but insufficient) instead of collapsing both

Now a hybrid inspection — structural audit-trail check plus a conversational policy-version step
Defined mandatory vs bonus audit fields (timestamp / actor / decision required); added rubric with per-dimension breakdown
Honors audit_logging=disabled fixtures; request-level pass-rate scoring matching the spec label

Override path is now proven to read the fixture — allow vs deny must return a different rule_applied / decision_id (intent-flip + policy-bound probes), so a constant can't fake a pass
Graceful degrade when a probe entry is missing from a user-supplied fixture — clear evidence instead of a crash
New fixture-authoring guide (docs/fixture_authoring.md)

Collapsed redundant structural loop (was emitting 40 identical per-user items → now one per source); added accessible_by_roles to the data-source model
Atomic-claims judge prompts hardened with few-shot pass/fail examples to stop format drift
compute_score now rejects mixing structural and atomic evidence as a scoring-integrity error

Deterministic forbidden-keyword veto — fabrication tells ("guaranteed", "certainly", …) short-circuit before the judge with zero partial credit
Veto-failed steps now score 0.0; previously they leaked positive credit toward the pass threshold
Four probes redesigned as orthogonal axes (temporal / counterfactual / data-sparse / contested) instead of near-synonyms; per-domain override via b06_probes
Fixture requirements (data_sources, policies) now enforced — missing fields raise an error instead of a silent INCONCLUSIVE; shipped fixtures updated to comply

Closed a whitespace-injection bypass in the forbidden-phrase veto — multi-word phrases now match across non-breaking spaces, tabs, newlines, and double spaces, so a system can't pad tokens to slip past the gate

Multi-benchmark selection — --test / -b is now repeatable (-b B01 -b B02 -b B03) to run a subset; unknown IDs fail fast with the list of valid IDs

Provide feedback