V2.0 - Benchmark Updates - Deception

stefyi-4355 released this 25 May 10:36
· 13 commits to main since this release
3b0627c

Deception Benchmarks — Improvements

Scoring

  • Weighted rubric scores — evidence items with a rubric_weighted_score are
    now averaged instead of counted as binary pass/fail, giving a more nuanced
    signal from analytic-judge evaluations.
  • Extraction errors can count as failures — new count_extraction_errors_as_fail
    flag on InspectionSpec. Previously extraction errors were silently excluded
    from scoring.
  • TestResult carries richer metadata: score_breakdown, variant_seed,
    and variant_seed_pinned fields added.
  • Per-step rubric overrides — different rubric configs can now apply to
    different steps within the same test run.
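
The first two scoring changes can be illustrated with a minimal sketch. It is not the project's implementation: `EvidenceItem`, its `passed` and `extraction_error` fields, and `aggregate_score` are hypothetical names introduced here, and the sketch assumes that `rubric_weighted_score` lies in [0, 1] and that items without one fall back to a binary 1.0/0.0. Only `rubric_weighted_score` and `count_extraction_errors_as_fail` come from the notes above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EvidenceItem:
    # Hypothetical stand-in for an evidence record; not the project's real type.
    passed: bool
    rubric_weighted_score: Optional[float] = None  # assumed to lie in [0, 1]
    extraction_error: bool = False


def aggregate_score(
    items: List[EvidenceItem],
    count_extraction_errors_as_fail: bool = False,
) -> Optional[float]:
    """Average per-item scores; return None when nothing is scorable."""
    values: List[float] = []
    for item in items:
        if item.extraction_error:
            # With the flag set, an extraction error contributes a 0.0;
            # otherwise the item is left out of the average entirely.
            if count_extraction_errors_as_fail:
                values.append(0.0)
            continue
        if item.rubric_weighted_score is not None:
            # Weighted rubric score takes precedence over binary pass/fail.
            values.append(item.rubric_weighted_score)
        else:
            # Assumed fallback for items without a weighted score.
            values.append(1.0 if item.passed else 0.0)
    if not values:
        return None
    return sum(values) / len(values)


items = [
    EvidenceItem(passed=True, rubric_weighted_score=0.5),
    EvidenceItem(passed=True),
    EvidenceItem(passed=False, extraction_error=True),
]
print(aggregate_score(items))                                        # 0.75
print(aggregate_score(items, count_extraction_errors_as_fail=True))  # 0.5
```

In this sketch, enabling the flag lowers the score from 0.75 to 0.5 because the extraction error is no longer dropped from the denominator.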

Inspection fixes

| ID  | What changed |
| --- | --- |
| B10 | Removed unused tool fixture variable that caused fixture-mismatch failures |
| B14 | Switched to dynamic combined_request prompt; removed hardcoded pattern lists (judge-only evaluation) |
| B15 | Extended from 3 to 5 steps: added a genuine off-topic distractor turn before the goal-recall step, making drift harder to game. Score is now passed/total instead of binary 0/1 |
| B17 | Fixed judge contract mismatch: rewrote rubric template, added calibration examples per dimension, replaced cross-domain reference anchors with domain-matched ones; corrected fixture requirements |
| B18 | Switched to analytic-judge only; removed pattern lists that incorrectly penalised compliant responses quoting the distractor word |
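
As a rough illustration of the B15 change to passed/total scoring, the sketch below computes a fractional score over per-step outcomes. `step_score` is a hypothetical helper introduced here, and representing each step as a boolean is an assumption rather than the benchmark's actual data model.

```python
from typing import Sequence


def step_score(step_results: Sequence[bool]) -> float:
    """Return the fraction of steps that passed (passed/total)."""
    if not step_results:
        raise ValueError("at least one step result is required")
    # True counts as 1 and False as 0, so sum() gives the number of passes.
    return sum(step_results) / len(step_results)


# Hypothetical five-step run in which one step fails.
print(step_score([True, True, False, True, True]))  # 0.8
```

A run with one failing step out of five keeps partial credit under this scheme instead of collapsing to a single 0/1 value.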

Infrastructure

  • typing_extensions>=4.6 added as a dependency
  • CI pipeline fixed