V2.3 - Benchmark Optimization (Unpredictability)
B19 · Context Accuracy
Replaced keyword/self-report scoring with analytic-rubric evaluation.
Added four grounded probe types:
Context-faithful recall
Context vs. parametric-knowledge conflict
Unanswerable-from-context refusal
Distractor-buried recall (lost-in-the-middle)
Corrected fixture requirements to match actual runner inputs.
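As a rough illustration of what analytic-rubric scoring can look like, the sketch below scores each criterion separately and combines the results by weight. The criterion names, the weights, and the `rubric_score` helper are hypothetical and are not taken from the benchmark's actual rubric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: int


# Hypothetical criteria and weights, chosen only for illustration.
RUBRIC = [
    Criterion("answer_grounded_in_context", 5),
    Criterion("prefers_context_over_prior_knowledge", 3),
    Criterion("refuses_when_unanswerable", 2),
]


def rubric_score(judgements: dict[str, bool], rubric=RUBRIC) -> float:
    """Return the weighted fraction of rubric criteria that were met.

    A criterion missing from `judgements` counts as not met.
    """
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if judgements.get(c.name, False))
    return earned / total
```

Scoring each criterion on its own makes it visible which behavior failed, which a single keyword match or self-reported score cannot show.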
B20 · Instruction Adherence
Replaced keyword matching with structured instruction-following probes.
Added coverage for:
Format and length constraints
Required-token constraints
Negative constraints
Multi-instruction composition
System-vs-user hierarchy conflicts
Corrected fixture requirements used by the runner.
B21 · Cross-Turn Objective Retention
Expanded evaluation from 3 turns to 4 turns:
Objective declaration
Distractor turn
Abandonment-pressure / sycophancy turn
Objective recall on demand
Added a dedicated turn-1 rubric so objective acknowledgement is evaluated separately from later recall behavior.
Removed keyword-based scoring in favor of analytic-rubric evaluation.
B22 · Decision Reproducibility
Split reproducibility into two independent measurements:
Sampling stability: repeated identical runs
Semantic invariance: paraphrased/reordered prompts
Probe generation is now deterministic from a fixed seed.
Added per-arm decision attribution reporting.
Reduced evaluation cost by capping user/tool combinations.
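The sketch below illustrates two of the ideas above: seeded, capped probe generation and a simple agreement measure that could be applied to either arm. The function names, the shuffle-then-truncate capping strategy, and the agreement metric are assumptions made for illustration, not the benchmark's actual implementation.

```python
import random
from collections import Counter


def generate_probes(seed: int, users: list[str], tools: list[str], cap: int) -> list[tuple[str, str]]:
    """Return at most `cap` (user, tool) pairs, chosen reproducibly from `seed`."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    combos = [(u, t) for u in users for t in tools]
    rng.shuffle(combos)
    return combos[:cap]


def agreement_rate(decisions: list[str]) -> float:
    """Return the fraction of decisions that match the most common decision."""
    if not decisions:
        raise ValueError("at least one decision is required")
    top_count = Counter(decisions).most_common(1)[0][1]
    return top_count / len(decisions)
```

Under these assumptions, sampling stability would be `agreement_rate` over decisions from repeated identical runs, and semantic invariance would be the same measure over decisions from paraphrased or reordered variants of one probe. Reporting the two numbers separately keeps sampling noise from being mistaken for prompt sensitivity.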
B23 · Policy Version Traceability
Converted to a fully structural inspection.
Evaluates:
Decision-to-rule linkage
Stable configuration version IDs
Reproducible bundle digests
Digest consistency across repeated calls
Removed conversational self-report scoring.
Returns an insufficient-evidence result when traceability signals are unavailable.
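One common way to obtain a reproducible digest is to hash a canonical serialization of the bundle. The sketch below assumes the policy bundle is JSON-serializable; `bundle_digest`, `digests_consistent`, and the `fetch_bundle` callable are hypothetical names, and the inspected system may compute its digests differently.

```python
import hashlib
import json
from typing import Callable


def bundle_digest(bundle: dict) -> str:
    """Return a SHA-256 hex digest of a canonical JSON encoding of `bundle`.

    Sorting keys and fixing separators makes the digest independent of
    key order and whitespace.
    """
    canonical = json.dumps(bundle, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def digests_consistent(fetch_bundle: Callable[[], dict], calls: int = 3) -> bool:
    """Return True if repeated fetches of the bundle all hash to the same digest."""
    digests = {bundle_digest(fetch_bundle()) for _ in range(calls)}
    return len(digests) == 1
```

A check like `digests_consistent` only has meaning when the system exposes a bundle at all; when it does not, the insufficient-evidence outcome described above applies instead of a pass or fail.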
Supporting Changes
Added dedicated concurrency settings for B19 and B20.
Clarified scorecard reporting for advisory inspections.
Updated methodology and scoring documentation to match the new evaluation approach.
Advisory metrics are now explicitly described as diagnostic signals rather than standalone safety verdicts.