Stability & Scoring Improvements

stefyi-4355 released this 15 May 16:35


6274501

v1.2.0

Judge & Scoring

Atomic claims ground-truth oracle + B20 partial-compliance fix
Rubric anchoring — references.yaml plumbed into judge prompt as [GOOD]/[BAD] anchors
Ensemble veto improved; judge prompt scope contamination resolved
Judge parser hardened — ERROR separated from INCONCLUSIVE
Cross-hook consistency validator wired in, violations surfaced on scorecard
Dead decision classifier + regex scoring stubs removed
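
To illustrate why separating ERROR from INCONCLUSIVE matters, here is a minimal sketch of a verdict parser. It is not the project's actual parser: the `VERDICT: <label>` output convention, the `Verdict` enum, and `parse_verdict` are all hypothetical names chosen for this example. The idea is that INCONCLUSIVE is a deliberate judge answer, while ERROR means the judge output could not be trusted at all (empty, malformed, or self-contradictory).

```python
import re
from enum import Enum


class Verdict(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    INCONCLUSIVE = "INCONCLUSIVE"  # judge answered, but declined to decide
    ERROR = "ERROR"                # judge output was unusable


# Assumed convention (hypothetical): the judge emits a line "VERDICT: <label>".
_VERDICT_RE = re.compile(
    r"^\s*VERDICT:\s*(PASS|FAIL|INCONCLUSIVE)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def parse_verdict(raw):
    """Map raw judge text to a Verdict without conflating failure modes."""
    if raw is None or not raw.strip():
        return Verdict.ERROR  # empty response: infrastructure problem, not a judgment
    labels = {m.upper() for m in _VERDICT_RE.findall(raw)}
    if len(labels) != 1:
        return Verdict.ERROR  # no verdict line, or contradictory verdict lines
    return Verdict(labels.pop())
```

With this split, a scorecard can retry or exclude ERROR results while still counting INCONCLUSIVE as a real, reportable outcome.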

Adversarial Robustness

Per-run nonce injected into SUT system prompt; defeats replay caches
Randomized adversarial seed defaults prevent payload memorization
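
The two changes above can be sketched roughly as follows. This is an assumption-laden illustration, not the release's code: `build_system_prompt`, `resolve_seed`, and the `[run-nonce: ...]` tag format are hypothetical. A fresh nonce makes every system prompt unique, so a cache keyed on the prompt cannot replay an earlier answer, and an unset seed defaults to a random value, so adversarial payloads differ between runs.

```python
import secrets


def build_system_prompt(base_prompt, nonce=None):
    """Append a per-run nonce so identical prompts never repeat across runs."""
    nonce = nonce or secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    # Hypothetical tag format; the real placement and wording may differ.
    return f"{base_prompt}\n\n[run-nonce: {nonce}]", nonce


def resolve_seed(seed=None):
    """Honor an explicit seed for reproducibility; otherwise randomize."""
    if seed is not None:
        return seed
    return secrets.randbits(32)
```

Logging the returned nonce and resolved seed alongside each run keeps randomized runs reproducible after the fact.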

Performance

Benchmark speed optimization
B05 parallelized, B09 concurrency
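
As a rough picture of what parallelizing a benchmark can look like, the sketch below fans independent cases out over a thread pool. It assumes the cases are independent and I/O-bound (for example, waiting on model API calls); `run_cases` and `run_one` are hypothetical names, and the release does not state how B05 or B09 are actually implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def run_cases(cases, run_one, max_workers=4):
    """Run independent benchmark cases concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of `cases`, not completion order.
        return list(pool.map(run_one, cases))
```

Because results come back in input order, downstream scoring does not need to change when concurrency is switched on.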

Behavior Fixes

B20 behavior correction

Docs & Case Studies

New scorecard: OpenClaw on Llama-4-Scout (grade F, 19.5%; both mandatory minimums fail)
openclaw.yaml renamed to openclaw_moderate.yaml; new openclaw_consolidated.yaml added (32-benchmark battery)
Cluster averages block dropped from hermes scorecard

Tooling

Benchmark docs CLI improved
Chat history functionality added
