Stability & Scoring Improvements

stefyi-4355 released this 15 May 16:35
· 16 commits to main since this release
6274501

v1.2.0

Judge & Scoring

  • Atomic claims ground-truth oracle + B20 partial-compliance fix
  • Rubric anchoring: references.yaml plumbed into judge prompt as [GOOD]/[BAD] anchors
  • Ensemble veto improved, judge prompt scope contamination resolved
  • Judge parser hardened: ERROR separated from INCONCLUSIVE
  • Cross-hook consistency validator wired in, violations surfaced on scorecard
  • Dead decision classifier + regex scoring stubs removed
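
To illustrate the ERROR vs. INCONCLUSIVE separation noted above, here is a minimal sketch of how a judge parser might keep the two apart. It is not the project's actual parser: the `Verdict` enum, the `VERDICT:` line format, and `parse_judge_output` are all hypothetical names chosen for illustration. The idea is that INCONCLUSIVE is a verdict the judge explicitly returned, while ERROR means the output could not be parsed into exactly one verdict.

```python
import enum
import re


class Verdict(enum.Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    INCONCLUSIVE = "INCONCLUSIVE"  # the judge answered, but could not decide
    ERROR = "ERROR"                # the judge output was unusable


# Hypothetical output convention: the judge ends with a line "VERDICT: <label>".
_VERDICT_RE = re.compile(
    r"^\s*VERDICT:\s*(PASS|FAIL|INCONCLUSIVE)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def parse_judge_output(raw):
    """Map raw judge text to a Verdict without conflating ERROR and INCONCLUSIVE."""
    if raw is None or not raw.strip():
        return Verdict.ERROR  # empty response: a pipeline failure, not a judgment
    matches = _VERDICT_RE.findall(raw)
    if len(matches) != 1:
        return Verdict.ERROR  # missing or contradictory verdict lines
    return Verdict[matches[0].upper()]
```

Keeping ERROR as its own outcome lets a scorecard exclude or retry broken judge calls instead of counting them as undecided results.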

Adversarial Robustness

  • Per-run nonce injected into SUT system prompt; defeats replay caches
  • Randomized adversarial seed defaults; prevents payload memorization
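
The two robustness items above can be sketched as follows. This is an illustrative sketch only, not the project's implementation: `build_system_prompt`, `resolve_seed`, `pick_payloads`, and the `[run-nonce: ...]` tag format are hypothetical. It assumes the nonce is appended to the system prompt so that no two runs send byte-identical prompts, and that the adversarial seed defaults to a fresh random value unless one is pinned for reproducibility.

```python
import random
import secrets


def build_system_prompt(base_prompt, nonce=None):
    """Append a per-run nonce so cached replies to an earlier run cannot be replayed."""
    nonce = nonce or secrets.token_hex(8)  # 16 hex characters, fresh per run
    return f"{base_prompt}\n\n[run-nonce: {nonce}]", nonce


def resolve_seed(seed=None):
    """Use the caller's seed if given; otherwise draw a random 32-bit default."""
    if seed is None:
        seed = secrets.randbits(32)
    return seed


def pick_payloads(payloads, k, seed):
    """Select k adversarial payloads deterministically for a given seed."""
    rng = random.Random(seed)
    return rng.sample(payloads, k)
```

Logging the resolved seed and nonce alongside each run would keep randomized runs reproducible after the fact.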

Performance

  • Benchmark speed optimization
  • B05 parallelized, B09 concurrency

Behavior Fixes

  • B20 behavior correction

Docs & Case Studies

  • New scorecard: OpenClaw on Llama-4-Scout (F 19.5%, both mandatory minimums fail)
  • openclaw.yaml → openclaw_moderate.yaml; new openclaw_consolidated.yaml (32-benchmark battery)
  • Cluster averages block dropped from hermes scorecard

Tooling

  • Benchmark docs CLI improved
  • Chat history functionality added