V2.3 - Benchmark Optimization (Unpredictability)

stefyi-4355 released this 03 Jun 09:17


38ccfc4

B19 · Context Accuracy

Replaced keyword/self-report scoring with analytic-rubric evaluation.
Added four grounded probe types:
- Context-faithful recall
- Context vs. parametric-knowledge conflict
- Unanswerable-from-context refusal
- Distractor-buried recall (lost-in-the-middle)
Corrected fixture requirements to match actual runner inputs.
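As an illustration of the distractor-buried recall probe, here is a minimal sketch of how such a context could be assembled. The function name, the filler passages, and the example fact are all hypothetical and not taken from the benchmark's code; the only idea shown is placing the target fact in the middle of unrelated passages.

```python
from typing import List


def build_buried_context(fact: str, distractors: List[str]) -> str:
    """Place the target fact in the middle of the distractor passages."""
    middle = len(distractors) // 2
    passages = distractors[:middle] + [fact] + distractors[middle:]
    return "\n".join(passages)


# Hypothetical fixture data, for illustration only.
distractors = [f"Filler passage {i}." for i in range(4)]
fact = "The access code is 7319."
context = build_buried_context(fact, distractors)
```

A probe would then ask a question answerable only from the buried fact and grade the answer against the rubric.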

B20 · Instruction Adherence

Replaced keyword matching with structured instruction-following probes.
Added coverage for:
- Format and length constraints
- Required-token constraints
- Negative constraints
- Multi-instruction composition
- System-vs-user hierarchy conflicts
Corrected fixture requirements used by the runner.
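One way to picture a structured probe is as a set of independent, machine-checkable constraints applied to a single response. The sketch below is an assumption about the general shape, not the runner's actual implementation; `Constraint`, `max_words`, `requires_token`, `forbids_word`, and `evaluate` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Constraint:
    name: str
    check: Callable[[str], bool]


def max_words(n: int) -> Constraint:
    # Length constraint: the response has at most n whitespace-separated words.
    return Constraint(f"max_words<={n}", lambda text: len(text.split()) <= n)


def requires_token(token: str) -> Constraint:
    # Required-token constraint: the exact token must appear in the response.
    return Constraint(f"requires:{token}", lambda text: token in text)


def forbids_word(word: str) -> Constraint:
    # Negative constraint: the word must not appear (case-insensitive, whole word).
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return Constraint(f"forbids:{word}", lambda text: pattern.search(text) is None)


def evaluate(response: str, constraints: List[Constraint]) -> Dict[str, bool]:
    """Check every constraint separately so failures can be attributed."""
    return {c.name: c.check(response) for c in constraints}


# Multi-instruction composition: all three constraints apply to one response.
constraints = [max_words(8), requires_token("DONE"), forbids_word("sorry")]
result = evaluate("Task finished. DONE", constraints)
```

Reporting each constraint separately, rather than a single pass/fail flag, makes it visible which instruction in a composed prompt was dropped.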

B21 · Cross-Turn Objective Retention

Expanded evaluation from 3 turns to 4 turns:
1. Objective declaration
2. Distractor turn
3. Abandonment-pressure / sycophancy turn
4. Objective recall on demand
Added a dedicated turn-1 rubric so that objective acknowledgement is evaluated separately from later recall behavior.
Removed keyword-based scoring in favor of analytic-rubric evaluation.
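The four-turn script can be sketched as plain data. This is an illustrative layout only: the `Turn` type and the rubric names are hypothetical, and treating turns 2 and 3 as unscored is an assumption of this sketch rather than something the release states.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Turn:
    number: int
    purpose: str
    rubric: Optional[str]  # hypothetical rubric name; None = not scored in this sketch


SCRIPT: List[Turn] = [
    Turn(1, "objective declaration", "turn1_acknowledgement"),
    Turn(2, "distractor", None),
    Turn(3, "abandonment pressure / sycophancy", None),
    Turn(4, "objective recall on demand", "recall"),
]


def scored_turns(script: List[Turn]) -> Dict[int, str]:
    """Map each scored turn to its rubric so results are reported separately."""
    return {t.number: t.rubric for t in script if t.rubric is not None}


rubrics = scored_turns(SCRIPT)
```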

B22 · Decision Reproducibility

Split reproducibility into two independent measurements:
- Sampling stability: repeated identical runs
- Semantic invariance: paraphrased/reordered prompts
Probe generation is now deterministic from a fixed seed.
Added per-arm decision attribution reporting.
Reduced evaluation cost by capping user/tool combinations.
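The two measurements can be illustrated with a small sketch. It assumes each run yields a discrete decision label and that agreement with the most common label is an acceptable summary; the benchmark's actual metric is not specified here, and `agreement_rate` and `make_probe_variants` are hypothetical names.

```python
import random
from collections import Counter
from typing import List


def agreement_rate(decisions: List[str]) -> float:
    """Fraction of decisions that match the most common decision."""
    if not decisions:
        raise ValueError("need at least one decision")
    top_count = Counter(decisions).most_common(1)[0][1]
    return top_count / len(decisions)


def make_probe_variants(clauses: List[str], n_variants: int, seed: int) -> List[str]:
    """Build reordered prompt variants; the same seed always gives the same list."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    variants = []
    for _ in range(n_variants):
        shuffled = clauses[:]
        rng.shuffle(shuffled)
        variants.append(" ".join(shuffled))
    return variants


# Sampling stability: the same prompt run several times (hypothetical labels).
sampling_stability = agreement_rate(["allow", "allow", "deny", "allow"])

# Semantic invariance: one decision per reordered variant of the same request.
clauses = ["User is an admin.", "Tool is read-only.", "Request is urgent."]
variants = make_probe_variants(clauses, n_variants=3, seed=7)
```

Keeping the two numbers separate distinguishes a model that is noisy under resampling from one that is stable per prompt but sensitive to wording or ordering.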

B23 · Policy Version Traceability

Converted to a fully structural inspection.
Evaluates:
- Decision-to-rule linkage
- Stable configuration version IDs
- Reproducible bundle digests
- Digest consistency across repeated calls
Removed conversational self-report scoring.
Returns an insufficient-evidence result when traceability signals are unavailable.
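A structural check of this kind might look like the sketch below. It assumes each response exposes `rule_id`, `config_version`, and `bundle_digest` fields and that a digest is a SHA-256 over canonical JSON; these field names, the result labels, and the hashing scheme are assumptions for illustration, not the inspected system's actual interface.

```python
import hashlib
import json
from typing import Any, Dict, List


def bundle_digest(bundle: Dict[str, Any]) -> str:
    """Hash a canonical JSON form so key order does not change the digest."""
    canonical = json.dumps(bundle, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def inspect_traceability(responses: List[Dict[str, Any]]) -> str:
    """Inspect repeated identical calls for stable, complete traceability fields."""
    required = ("rule_id", "config_version", "bundle_digest")
    if not responses or any(r.get(k) is None for r in responses for k in required):
        return "insufficient_evidence"  # signals missing: do not guess a verdict
    digests = {r["bundle_digest"] for r in responses}
    versions = {r["config_version"] for r in responses}
    return "pass" if len(digests) == 1 and len(versions) == 1 else "fail"


# Hypothetical policy bundle and two repeated calls reporting the same digest.
digest = bundle_digest({"rules": ["r1", "r2"], "version": "2.3"})
calls = [
    {"rule_id": "r1", "config_version": "2.3", "bundle_digest": digest},
    {"rule_id": "r1", "config_version": "2.3", "bundle_digest": digest},
]
verdict = inspect_traceability(calls)
```

Returning a distinct insufficient-evidence result keeps a system that exposes no traceability fields from being scored as if it had failed a consistency check.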

Supporting Changes

Added dedicated concurrency settings for B19 and B20.
Clarified scorecard reporting for advisory inspections.
Updated methodology and scoring documentation to match the new evaluation approach.
Advisory metrics are now explicitly described as diagnostic signals rather than standalone safety verdicts.
