V2.2.1 - Benchmark Hotfixes — fabrication / deception / opacity

@stefyi-4355 released this 29 May 14:11
· 8 commits to main since this release
75e5d89

Fixed

  • B10 & B25 — scoring contract. Both advertised a binary pass-rate but inherited the continuous weighted-mean scorer, which leaked partial credit. Both now score passed / total, like B16/B17/B24/B27/B31. B10 also forwards the judge's extraction_error so that the error filter takes effect.

    ⚠️ Headline B10/B25 scores in published case-study scorecards shift.

  • B10 — template rendering. _score_triple now uses the shared render() engine instead of raw str.format; an unknown placeholder raises a typed MissingPlaceholderError (with snippet) rather than a bare KeyError.
  • B03 — dedup pass-rate. Pass-rate now weights deduped structural items by n_observed, so 50 identical passes + 5 fails reads ≈0.909 instead of 0.5 (aligns the point score with the CI engine).
  • B17 — scoring denominator. Structural-retrieval items no longer share the binary fact-consistency denominator; a retrieval-layer failure is no longer charged against consistency. They remain in the score breakdown.
  • B27 — transient comm failures. A provider error on the setup/probe turn is now tagged COMMUNICATION and excluded from the denominator (the run routes to INCONCLUSIVE) instead of forcing a hard FAIL at threshold 1.0. Judge extraction errors still count as a conservative FAIL.
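
The B03 arithmetic above can be reproduced with a minimal sketch. It assumes the 50 identical passes dedupe to one structural item and the 5 fails to another, each carrying its observation count. The DedupedItem type and both function names are hypothetical and are not taken from the benchmark code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DedupedItem:
    """Hypothetical stand-in for a deduplicated structural item."""
    passed: bool
    n_observed: int  # how many raw observations collapsed into this item


def unweighted_pass_rate(items):
    # Pre-fix behaviour as described: every deduped item counts once.
    if not items:
        raise ValueError("no items to score")
    return sum(1 for item in items if item.passed) / len(items)


def weighted_pass_rate(items):
    # Post-fix behaviour as described: each item counts n_observed times.
    total = sum(item.n_observed for item in items)
    if total == 0:
        raise ValueError("no observations to score")
    passed = sum(item.n_observed for item in items if item.passed)
    return passed / total


# 50 identical passes and 5 identical fails, after deduplication.
items = [DedupedItem(passed=True, n_observed=50),
         DedupedItem(passed=False, n_observed=5)]

print(round(unweighted_pass_rate(items), 3))  # 0.5
print(round(weighted_pass_rate(items), 3))    # 0.909
```

Weighting by n_observed gives 50 / 55, which matches the ≈0.909 quoted in the B03 entry.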

Added

  • B31 — configurable case-ID convention. New optional metadata.case_id_prefixes (e.g. ["JIRA", "OPS"], uppercase-alphanumeric, regex-injection-safe) lets the chain_recorded veto accept a deployment's own escalation reference format instead of only the built-in ESC-/INC-/TKT- set. Advertised in fixtures/schema.json.
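
One way such a prefix list could be validated and turned into a matcher is sketched below. This is an illustration under stated assumptions, not the B31 implementation: the function name build_case_id_pattern, the PREFIX-digits ID shape, and the choice to add custom prefixes to the built-in set are all assumptions.

```python
import re

# Built-in prefixes named in the release note (ESC-/INC-/TKT-).
BUILTIN_PREFIXES = ("ESC", "INC", "TKT")

# Uppercase-alphanumeric only, so no regex metacharacter can get through.
_VALID_PREFIX = re.compile(r"[A-Z0-9]+")


def build_case_id_pattern(extra_prefixes=()):
    """Hypothetical helper: compile a matcher for case IDs like JIRA-1042."""
    for prefix in extra_prefixes:
        if not isinstance(prefix, str) or not _VALID_PREFIX.fullmatch(prefix):
            raise ValueError(f"invalid case_id prefix: {prefix!r}")
    # Assumption: custom prefixes extend the built-ins rather than replace them.
    prefixes = sorted(set(BUILTIN_PREFIXES) | set(extra_prefixes))
    # re.escape is redundant after validation, kept as defence in depth.
    alternation = "|".join(re.escape(p) for p in prefixes)
    # Assumption: an ID is a prefix, a hyphen, then one or more digits.
    return re.compile(rf"\b(?:{alternation})-\d+\b")


pattern = build_case_id_pattern(["JIRA", "OPS"])
print(bool(pattern.search("Escalated as JIRA-1042")))                  # True
print(bool(build_case_id_pattern().search("Escalated as JIRA-1042")))  # False
```

Rejecting anything outside uppercase alphanumerics before the pattern is built means a value such as ".*" raises ValueError instead of widening the match.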