v2.1.0
V2.1 - Benchmark Updates (Opacity)
Opacity Benchmark Improvements
B24 · Risk Scoring
Rewrote runner with richer rubric and reference cases
Hotfixed an edge-case scoring regression (the fix is included in a later commit)
B25 · Regulatory Readiness
Added dedicated classifier.py for audit trail field detection
Improved rubric coverage; runner now handles more structural variants
B26 · Rate Limiting
Major runner rewrite: each tool is now tested on declaration, enforcement, communication, and documentation as separate dimensions
Added failure-bucket taxonomy (pass_typed / transient_failure / unexpected_error) for cleaner signal
Added a structural rapid-fire probe (opt-in via soak_probes=True)
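
The failure-bucket taxonomy above could be sketched roughly as follows. This is a minimal illustration, not the runner's actual code: the `Bucket` enum, the `RateLimitError` class, the `classify` helper, and the mapping from exception types to buckets are all assumptions made for the example. Only the three bucket names come from the notes.

```python
from enum import Enum


class Bucket(Enum):
    # Bucket names are taken from the release notes.
    PASS_TYPED = "pass_typed"
    TRANSIENT_FAILURE = "transient_failure"
    UNEXPECTED_ERROR = "unexpected_error"


class RateLimitError(Exception):
    """Hypothetical typed error raised by a tool that enforces its limit."""


def classify(exc):
    """Map an exception seen during a probe to a failure bucket (assumed mapping)."""
    # A typed rate-limit error is treated as the outcome the probe wants to see.
    if isinstance(exc, RateLimitError):
        return Bucket.PASS_TYPED
    # Timeouts and dropped connections are treated as noise, not as tool behaviour.
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return Bucket.TRANSIENT_FAILURE
    # Everything else lands in the catch-all bucket.
    return Bucket.UNEXPECTED_ERROR
```

Keeping transient failures in their own bucket means a flaky network does not count against a tool that otherwise enforces its limits.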
B27 · Session Integrity
Improved secret-leak detection with multi-pattern structural pre-judge gate
Now catches full-secret, prefix, and hash-fragment disclosure shapes
match_kind is now surfaced in evidence details
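
A structural pre-judge gate for the three disclosure shapes might look something like the sketch below. It is illustrative only: the function name `find_match_kind`, the returned labels, the 8-character minimum, and the choice of SHA-256 are assumptions, not details taken from the benchmark.

```python
import hashlib

# Assumed minimum length for a prefix or hash fragment to count as a leak.
MIN_FRAGMENT = 8


def find_match_kind(secret, output):
    """Return a label for how `secret` appears in `output`, or None if it does not.

    The labels are hypothetical stand-ins for the match_kind values.
    """
    # Shape 1: the whole secret appears verbatim.
    if secret in output:
        return "full_secret"
    # Shape 2: only the leading characters of the secret appear.
    if len(secret) >= MIN_FRAGMENT and secret[:MIN_FRAGMENT] in output:
        return "prefix"
    # Shape 3: the start of the secret's hex digest appears (SHA-256 assumed).
    digest = hashlib.sha256(secret.encode("utf-8")).hexdigest()
    if digest[:MIN_FRAGMENT] in output:
        return "hash_fragment"
    return None
```

Running cheap string checks like these before an LLM judge lets obvious leaks be flagged deterministically, with the label recorded as evidence.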
B29 · Prompt Sensitivity
Analytic judge now covers all three phrasing categories (tool access, destructive domain, privilege escalation)
Fixed false-positive veto: adverbs such as "actually" no longer short-circuit the judge
Provider errors now typed correctly; per-group reversal signals visible in evidence
B31 · Escalation Correctness
Fixed incorrect fixture field mapping (was silently falling back to generic prompt)
Added runtime enforcement of escalation_triggers / expected_escalation_channels: empty fields now raise RuleLoadError instead of passing silently
Expanded rubric; fixture examples updated across all domains
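
The runtime enforcement described above can be sketched as a small validation step. The field names and the `RuleLoadError` name come from the notes; its base class, the `validate_fixture` helper, and the dict-shaped fixture are assumptions for illustration.

```python
class RuleLoadError(ValueError):
    """Stand-in for the real exception; the ValueError base class is an assumption."""


REQUIRED_FIELDS = ("escalation_triggers", "expected_escalation_channels")


def validate_fixture(fixture):
    """Raise RuleLoadError if a required escalation field is missing or empty."""
    for field in REQUIRED_FIELDS:
        # A missing key, None, an empty list, and an empty string all count as empty.
        if not fixture.get(field):
            raise RuleLoadError(f"fixture field {field!r} is missing or empty")
    return fixture
```

Failing at load time makes a misconfigured fixture visible immediately, rather than letting the run proceed on fields that were never populated.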
B32 · Off-Topic Detection
Full runner rewrite: now scores 4 dimensions (detection, scope enforcement, on-topic allowance, communication)
Added on_topic_prompts.yaml keyed by domain (≥5 prompts per domain); falls back to tool descriptions
Deterministic sampling via b32_seed; silent randomisation removed
Non-applicable fixtures now emit INCONCLUSIVE and are excluded from the OPACITY aggregate
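
Seeded sampling and the exclusion of INCONCLUSIVE results from an aggregate can be sketched as below. The helper names, the result-dict shape, and the use of a plain mean are assumptions; the `seed` argument stands in for the b32_seed setting mentioned in the notes.

```python
import random


def sample_prompts(prompts, k, seed):
    """Pick k prompts reproducibly; the same seed always yields the same sample."""
    # A private Random instance avoids touching the global random state.
    rng = random.Random(seed)
    return rng.sample(prompts, k)


def aggregate(results):
    """Average the scores of all results except those marked INCONCLUSIVE.

    Returns None when nothing scorable remains (an assumed convention).
    """
    scored = [r["score"] for r in results if r["status"] != "INCONCLUSIVE"]
    if not scored:
        return None
    return sum(scored) / len(scored)
```

Dropping INCONCLUSIVE entries before averaging keeps non-applicable fixtures from pulling the aggregate up or down.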