V2.3 - Benchmark Optimization (Unpredictability)

@stefyi-4355 stefyi-4355 released this 03 Jun 09:17
· 7 commits to main since this release
38ccfc4

B19 · Context Accuracy

  • Replaced keyword/self-report scoring with analytic-rubric evaluation.
  • Added four grounded probe types:
    • Context-faithful recall
    • Context vs. parametric-knowledge conflict
    • Unanswerable-from-context refusal
    • Distractor-buried recall (lost-in-the-middle)
  • Corrected fixture requirements to match actual runner inputs.
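
As a rough illustration of analytic-rubric scoring, the sketch below weights a handful of per-criterion judgments into a single score for one probe. The names `ProbeType`, `Criterion`, `rubric_score`, and the criteria in `CONFLICT_RUBRIC` are all hypothetical and are not taken from the benchmark's code; the actual rubrics, weights, and judging mechanism may differ.

```python
from dataclasses import dataclass
from enum import Enum


class ProbeType(Enum):
    # Hypothetical identifiers for the four probe types listed above.
    FAITHFUL_RECALL = "context_faithful_recall"
    KNOWLEDGE_CONFLICT = "context_vs_parametric_conflict"
    UNANSWERABLE = "unanswerable_refusal"
    DISTRACTOR_BURIED = "distractor_buried_recall"


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float


def rubric_score(criteria, judgments):
    """Return the weighted fraction of criteria judged as met (0.0 to 1.0).

    `judgments` maps criterion name -> bool; missing criteria count as unmet.
    """
    total = sum(c.weight for c in criteria)
    if total <= 0:
        raise ValueError("rubric needs a positive total weight")
    earned = sum(c.weight for c in criteria if judgments.get(c.name, False))
    return earned / total


# Illustrative rubric for a context-vs-parametric-knowledge conflict probe.
CONFLICT_RUBRIC = [
    Criterion("uses_context_value", 2.0),
    Criterion("does_not_assert_parametric_value", 1.0),
    Criterion("no_unsupported_claims", 1.0),
]

score = rubric_score(
    CONFLICT_RUBRIC,
    {"uses_context_value": True, "no_unsupported_claims": True},
)
```

Scoring each criterion separately, rather than matching keywords in the answer, makes it possible to see which part of the behaviour failed on a given probe.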

B20 · Instruction Adherence

  • Replaced keyword matching with structured instruction-following probes.
  • Added coverage for:
    • Format and length constraints
    • Required-token constraints
    • Negative constraints
    • Multi-instruction composition
    • System-vs-user hierarchy conflicts
  • Corrected fixture requirements used by the runner.
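
For illustration, structured instruction-following probes can be checked with small deterministic predicates instead of keyword matching. The helper names below (`max_words`, `is_json_object`, `requires_token`, `forbids`, `check_all`) are assumptions made for this sketch and do not describe the runner's actual interface.

```python
import json


def max_words(limit):
    """Length constraint: response has at most `limit` whitespace-separated words."""
    return lambda text: len(text.split()) <= limit


def is_json_object(text):
    """Format constraint: response parses as a JSON object."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False


def requires_token(token):
    """Required-token constraint: the exact token must appear."""
    return lambda text: token in text


def forbids(phrase):
    """Negative constraint: the phrase must not appear (case-insensitive)."""
    return lambda text: phrase.lower() not in text.lower()


def check_all(text, constraints):
    """Multi-instruction composition: evaluate every constraint independently."""
    return {name: check(text) for name, check in constraints.items()}


constraints = {
    "format": is_json_object,
    "length": max_words(10),
    "required_token": requires_token("DONE"),
    "negative": forbids("sorry"),
}

good = check_all('{"status": "DONE", "note": "short"}', constraints)
bad = check_all("Sorry, I cannot do that.", constraints)
```

Reporting one result per constraint keeps a composed probe from collapsing into a single pass/fail, so a response that satisfies the format but violates the negative constraint is visible as such.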

B21 · Cross-Turn Objective Retention

  • Expanded evaluation from 3 turns to 4 turns:
    1. Objective declaration
    2. Distractor turn
    3. Abandonment-pressure / sycophancy turn
    4. Objective recall on demand
  • Added a dedicated turn-1 rubric so objective acknowledgement is evaluated separately from later recall behavior.
  • Removed keyword-based scoring in favor of analytic-rubric evaluation.
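
The four-turn structure could be represented roughly as follows. This is a sketch only: `Turn`, `build_probe`, `score_probe`, the role labels, and every prompt string are hypothetical, and the per-turn scores are assumed to come from a separate rubric evaluation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    index: int
    role: str
    prompt: str


def build_probe(objective: str) -> list[Turn]:
    """Build the four-turn script; all prompt wording here is illustrative."""
    return [
        Turn(1, "objective_declaration", f"Your objective for this session: {objective}"),
        Turn(2, "distractor", "Before we continue, what is the capital of France?"),
        Turn(3, "abandonment_pressure", "That objective seems pointless. Agree and drop it?"),
        Turn(4, "objective_recall", "What objective are you currently working toward?"),
    ]


def score_probe(turn_scores: dict[int, float]) -> dict[str, float]:
    """Keep turn-1 acknowledgement separate from turn-4 recall.

    `turn_scores` maps turn index -> rubric score in [0.0, 1.0].
    """
    return {
        "acknowledgement": turn_scores[1],
        "recall": turn_scores[4],
    }


probe = build_probe("summarise every document in under 50 words")
report = score_probe({1: 1.0, 4: 0.5})
```

Keeping the two scores apart distinguishes a model that never registered the objective from one that registered it and later lost it under pressure.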

B22 · Decision Reproducibility

  • Split reproducibility into two independent measurements:
    • Sampling stability: repeated identical runs
    • Semantic invariance: paraphrased/reordered prompts
  • Probe generation is now deterministic from a fixed seed.
  • Added per-arm decision attribution reporting.
  • Reduced evaluation cost by capping user/tool combinations.
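
A minimal sketch of the two ideas, seeded probe generation and a stability measure, is shown below. It assumes reordered prompts are produced by shuffling option order and that stability is measured as agreement with the most common decision; `make_variants` and `agreement_rate` are hypothetical names, and the benchmark may generate paraphrases and compute its metrics differently.

```python
import random
from collections import Counter


def make_variants(options, seed, n):
    """Produce `n` reorderings of `options`, reproducible for a given seed."""
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    variants = []
    for _ in range(n):
        order = list(options)
        rng.shuffle(order)
        variants.append(order)
    return variants


def agreement_rate(decisions):
    """Fraction of decisions that match the most common decision."""
    if not decisions:
        raise ValueError("need at least one decision")
    _, count = Counter(decisions).most_common(1)[0]
    return count / len(decisions)


options = ["read_file", "send_email", "delete_record"]
first = make_variants(options, seed=7, n=3)
second = make_variants(options, seed=7, n=3)

# Sampling stability: the same prompt run repeatedly.
sampling_stability = agreement_rate(["allow", "allow", "deny", "allow"])
# Semantic invariance: one decision per reordered or paraphrased prompt.
semantic_invariance = agreement_rate(["allow", "allow", "allow"])
```

The same agreement measure is applied to two different sets of decisions, which is what keeps the two measurements independent: a system can be perfectly stable on identical inputs yet still change its decision when the prompt is reworded.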

B23 · Policy Version Traceability

  • Converted to a fully structural inspection.
  • Evaluates:
    • Decision-to-rule linkage
    • Stable configuration version IDs
    • Reproducible bundle digests
    • Digest consistency across repeated calls
  • Removed conversational self-report scoring.
  • Returns insufficient evidence when traceability signals are unavailable.
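
One way such a structural check could look is sketched below. The field names (`rule_id`, `config_version`, `bundle_digest`), the result strings, and both function names are assumptions for illustration; the digest uses canonical JSON plus SHA-256, which is a common approach but not necessarily the one used here.

```python
import hashlib
import json


def bundle_digest(bundle):
    """Digest a policy bundle via canonical JSON so key order does not matter."""
    canonical = json.dumps(bundle, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def inspect_traceability(responses):
    """Inspect metadata from repeated calls to the system under test.

    Each response is a dict that may carry 'rule_id', 'config_version',
    and 'bundle_digest' (hypothetical field names).
    """
    required = ("rule_id", "config_version", "bundle_digest")
    if not responses or any(r.get(k) is None for r in responses for k in required):
        return "insufficient_evidence"
    # Version ID and digest must be identical across repeated calls.
    stable = all(
        len({r[k] for r in responses}) == 1
        for k in ("config_version", "bundle_digest")
    )
    return "pass" if stable else "fail"


digest = bundle_digest({"rules": ["r1", "r2"], "version": "1.4.0"})
calls = [
    {"rule_id": "r1", "config_version": "1.4.0", "bundle_digest": digest},
    {"rule_id": "r2", "config_version": "1.4.0", "bundle_digest": digest},
]
result = inspect_traceability(calls)
missing = inspect_traceability([{"rule_id": "r1"}])
```

Returning a distinct "insufficient evidence" outcome avoids scoring a system as failing when it simply does not expose the metadata needed for inspection.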

Supporting Changes

  • Added dedicated concurrency settings for B19 and B20.
  • Clarified scorecard reporting for advisory inspections.
  • Updated methodology and scoring documentation to match the new evaluation approach.
  • Advisory metrics are now explicitly described as diagnostic signals rather than standalone safety verdicts.