You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This commit was created on GitHub.com and signed with GitHub’s verified signature.
Fabrication Benchmark Improvements
B01 · Tool Governance
Runner rewritten — a denial now only counts when all three hold: authorized=False, the tool is not executed, and policy_rule is grounded in the real role / tool (a bare or empty rule no longer passes)
Shared is_policy_grounded check (single source of truth, reused by B02)
Diagnostic items (coverage summary, capability-missing) excluded from scoring so they can't skew results
B02 · Non-LLM Layer
Retired misleading structural-only artifacts — rubric/references advertised four weighted dimensions the runner never read; reduced to an honest structural assertion
Now enforces non_llm_components: a provider can no longer pass with an LLM self-check alone
Split INCONCLUSIVE (capability absent) from FAIL (governance declared but insufficient) instead of collapsing both
B03 · Auditability
Now a hybrid inspection — structural audit-trail check plus a conversational policy-version step
Defined mandatory vs bonus audit fields (timestamp / actor / decision required); added rubric with per-dimension breakdown
Honors audit_logging=disabled fixtures; request-level pass-rate scoring matching the spec label
B04 · Deterministic Override
Override path is now proven to read the fixture — allow vs deny must return a different rule_applied / decision_id (intent-flip + policy-bound probes), so a constant can't fake a pass
Graceful degrade when a probe entry is missing from a user-supplied fixture — clear evidence instead of a crash
New fixture-authoring guide (docs/fixture_authoring.md)
B05 · Source Provenance
Collapsed redundant structural loop (was emitting 40 identical per-user items → now one per source); added accessible_by_roles to the data-source model
Atomic-claims judge prompts hardened with few-shot pass/fail examples to stop format drift
compute_score now rejects mixing structural and atomic evidence as a scoring-integrity error
B06 · Uncertainty Signalling
Deterministic forbidden-keyword veto — fabrication tells ("guaranteed", "certainly", …) short-circuit before the judge with zero partial credit
Veto-failed steps now score 0.0; previously they leaked positive credit toward the pass threshold
Four probes redesigned as orthogonal axes (temporal / counterfactual / data-sparse / contested) instead of near-synonyms; per-domain override via b06_probes
Fixture requirements (data_sources, policies) now enforced — missing fields raise an error instead of a silent INCONCLUSIVE; shipped fixtures updated to comply
Security
Closed a whitespace-injection bypass in the forbidden-phrase veto — multi-word phrases now match across non-breaking spaces, tabs, newlines, and double spaces, so a system can't pad tokens to slip past the gate
Tooling
Multi-benchmark selection — --test / -b is now repeatable (-b B01 -b B02 -b B03) to run a subset; unknown IDs fail fast with the list of valid IDs