Releases
v1.2.0
Stability & Scoring Improvements
Judge & Scoring
Atomic claims ground-truth oracle + B20 partial-compliance fix
Rubric anchoring — references.yaml plumbed into judge prompt as [GOOD]/[BAD] anchors
Ensemble veto improved, judge prompt scope contamination resolved
Judge parser hardened — ERROR separated from INCONCLUSIVE
Cross-hook consistency validator wired in, violations surfaced on scorecard
Dead decision classifier + regex scoring stubs removed
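As an illustration of the ERROR/INCONCLUSIVE split in the judge parser, here is a minimal sketch. It is not the project's actual parser: the `Verdict` enum, the `parse_judge_output` function, and the `VERDICT: <label>` line format are all assumptions made for this example. The idea shown is that INCONCLUSIVE is returned only when the judge explicitly says so, while empty, unparseable, or self-contradictory output is reported as ERROR.

```python
import re
from enum import Enum


class Verdict(Enum):
    # Hypothetical verdict labels; the real scorer may use different ones.
    PASS = "PASS"
    FAIL = "FAIL"
    INCONCLUSIVE = "INCONCLUSIVE"  # the judge answered, but could not decide
    ERROR = "ERROR"                # the judge output could not be parsed


# Assumed output format: one line of the form "VERDICT: <label>".
_VERDICT_RE = re.compile(
    r"^\s*VERDICT:\s*(PASS|FAIL|INCONCLUSIVE)\s*$",
    re.MULTILINE | re.IGNORECASE,
)


def parse_judge_output(raw):
    """Map raw judge text to a Verdict, keeping parse failures distinct."""
    if raw is None or not raw.strip():
        return Verdict.ERROR
    # Collect the distinct labels found; anything other than exactly one
    # distinct label (none, or conflicting ones) is a parse error.
    labels = {m.upper() for m in _VERDICT_RE.findall(raw)}
    if len(labels) != 1:
        return Verdict.ERROR
    return Verdict[labels.pop()]
```

Keeping the two outcomes apart means a scorecard can count infrastructure failures separately from genuinely undecided cases, instead of folding both into one bucket.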
Adversarial Robustness
Per-run nonce injected into SUT system prompt; defeats replay caches
Randomized adversarial seed defaults prevent payload memorization
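The two robustness changes above can be sketched roughly as follows. This is a hedged illustration only: `build_system_prompt`, `resolve_seed`, and the `[run-nonce: ...]` tag format are hypothetical names chosen for this example, not the project's API. The sketch assumes a fresh random token is appended to the system prompt on every run (so an identical cached request cannot be replayed) and that an unspecified adversarial seed falls back to a random value rather than a fixed constant.

```python
import secrets


def build_system_prompt(base_prompt, nonce=None):
    """Return (prompt, nonce) with a per-run nonce appended to the prompt.

    The tag format is an assumption made for this sketch.
    """
    if nonce is None:
        nonce = secrets.token_hex(8)  # 16 hex characters, fresh on each call
    return f"{base_prompt}\n\n[run-nonce: {nonce}]", nonce


def resolve_seed(seed=None):
    """Use the caller's seed if given; otherwise draw a random 32-bit seed."""
    return secrets.randbits(32) if seed is None else seed
```

Passing an explicit `nonce` or `seed` keeps a run reproducible when that is needed, while the defaults vary from run to run.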
Performance
Benchmark speed optimization
B05 parallelized, B09 concurrency
Behavior Fixes
Docs & Case Studies
New scorecard: OpenClaw on Llama-4-Scout (F 19.5%, both mandatory minimums fail)
openclaw.yaml → openclaw_moderate.yaml; new openclaw_consolidated.yaml (32-benchmark battery)
Cluster averages block dropped from hermes scorecard
Tooling
Benchmark docs CLI improved
Chat history functionality added