Release V2.0 - Benchmark Updates - Deception · ifixai-ai/iFixAi

Deception Benchmarks — Improvements

- Weighted rubric scores: evidence items with a `rubric_weighted_score` are now averaged instead of being counted as binary pass/fail, which gives a more nuanced signal from analytic-judge evaluations.
- Extraction errors can count as failures: this is enabled by the new `count_extraction_errors_as_fail` flag on `InspectionSpec`. Previously, extraction errors were silently excluded from scoring.
- `TestResult` carries richer metadata: the `score_breakdown`, `variant_seed`, and `variant_seed_pinned` fields were added.
- Per-step rubric overrides: different rubric configs can now apply to different steps within the same test run.
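
The first two changes above can be illustrated with a minimal sketch. It is not the project's actual implementation. Only the names `rubric_weighted_score`, `InspectionSpec`, and `count_extraction_errors_as_fail` come from these notes. The `EvidenceItem` class, its `passed` and `extraction_error` fields, the `aggregate_score` function, and the fallback of unscored items to 1.0/0.0 are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class EvidenceItem:
    """Hypothetical stand-in for one judged evidence item."""
    passed: bool = False
    rubric_weighted_score: Optional[float] = None  # assumed to lie in [0.0, 1.0]
    extraction_error: bool = False


@dataclass
class InspectionSpec:
    """Illustrative stand-in; only the flag name is taken from the release notes."""
    count_extraction_errors_as_fail: bool = False


def aggregate_score(items: Sequence[EvidenceItem], spec: InspectionSpec) -> float:
    """Average per-item contributions into a single score."""
    contributions = []
    for item in items:
        if item.extraction_error:
            # Old behaviour: skip silently. With the flag set: count as 0.0.
            if spec.count_extraction_errors_as_fail:
                contributions.append(0.0)
            continue
        if item.rubric_weighted_score is not None:
            # Weighted rubric score is used directly instead of pass/fail.
            contributions.append(item.rubric_weighted_score)
        else:
            # Assumption: items without a weighted score fall back to binary.
            contributions.append(1.0 if item.passed else 0.0)
    if not contributions:
        return 0.0
    return sum(contributions) / len(contributions)


items = [
    EvidenceItem(rubric_weighted_score=0.5),
    EvidenceItem(passed=True),
    EvidenceItem(extraction_error=True),
]
lenient = aggregate_score(items, InspectionSpec())
strict = aggregate_score(items, InspectionSpec(count_extraction_errors_as_fail=True))
print(lenient, strict)  # 0.75 0.5
```

In this sketch the same evidence scores 0.75 when the extraction error is excluded and 0.5 when it is counted as a failure, which shows why the flag matters when comparing results across versions.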

ID	What changed
B10	Removed unused `tool` fixture variable that caused fixture-mismatch failures
B14	Switched to dynamic `combined_request` prompt; removed hardcoded pattern lists (judge-only evaluation)
B15	Extended from 3 → 5 steps — added a genuine off-topic distractor turn before the goal-recall step, making drift harder to game. Score is now `passed/total` instead of binary 0/1
B17	Fixed judge contract mismatch: rewrote rubric template, added calibration examples per dimension, replaced cross-domain reference anchors with domain-matched ones; corrected fixture requirements
B18	Switched to analytic-judge–only; removed pattern lists that incorrectly penalised compliant responses quoting the distractor word
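
The B15 scoring change from binary 0/1 to `passed/total` can be sketched as follows. Both function names are hypothetical. Treating the old binary rule as "1 only if every step passes" is an assumption, since the notes do not say how the binary score was derived.

```python
from typing import Sequence


def binary_score(step_passed: Sequence[bool]) -> float:
    """Assumed old-style scoring: 1.0 only if there are steps and all passed."""
    return 1.0 if step_passed and all(step_passed) else 0.0


def fractional_score(step_passed: Sequence[bool]) -> float:
    """New-style scoring as described in the notes: passed / total."""
    total = len(step_passed)
    if total == 0:
        return 0.0
    return sum(step_passed) / total


# A five-step run in which only the fourth step fails.
steps = [True, True, True, False, True]
print(binary_score(steps), fractional_score(steps))  # 0.0 0.8
```

Under these assumptions, a run that fails one of five steps drops to 0 with binary scoring but keeps partial credit of 0.8 with `passed/total`.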