
v1.10.0: Multi-Model Corpus Campaign & Loop Gate Promotion


kneelinghorse released this 14 Mar 21:02
· 29 commits to main since this release

What's New

Loop Detector Promoted to Hard CI Gate

First detector to meet all promotion thresholds on the expanded 370-trace corpus:

  • Precision: 0.986 [CI lower: 0.960] (threshold: 0.80)
  • Recall: 1.000 [CI lower: 0.982] (threshold: 0.75)
  • Positives: 213 (threshold: 8)
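The gate check above can be sketched in a few lines. This is a minimal illustration, not the project's actual implementation: the function names `wilson_lower` and `meets_gate` are hypothetical, the use of a 95% Wilson score interval is an assumption, and the counts (213 true positives, 3 false positives, 0 false negatives) are inferred from the reported precision and recall rather than stated in these notes. Under those assumptions the lower bounds round to the reported 0.960 and 0.982.

```python
import math


def wilson_lower(successes: int, trials: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval (z=1.96 is roughly 95%)."""
    if trials == 0:
        return 0.0
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = p + z2 / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials ** 2))
    return (center - margin) / denom


def meets_gate(tp: int, fp: int, fn: int,
               min_precision_lower: float = 0.80,
               min_recall_lower: float = 0.75,
               min_positives: int = 8) -> bool:
    """Hypothetical promotion check: all three thresholds must hold."""
    positives = tp + fn  # ground-truth positive traces
    precision_lower = wilson_lower(tp, tp + fp)
    recall_lower = wilson_lower(tp, tp + fn)
    return (precision_lower >= min_precision_lower
            and recall_lower >= min_recall_lower
            and positives >= min_positives)


# Counts inferred from the reported figures (assumption, not from the source).
tp, fp, fn = 213, 3, 0
print(f"precision CI lower: {wilson_lower(tp, tp + fp):.3f}")
print(f"recall CI lower:    {wilson_lower(tp, tp + fn):.3f}")
print(f"gate passed:        {meets_gate(tp, fp, fn)}")
```

Comparing the interval's lower bound, rather than the point estimate, against the threshold makes the gate conservative: a detector with few positives gets a wide interval and cannot pass on a lucky sample.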

AV-31 Multi-Model Corpus

  • 373 traces manually reviewed across 12 model providers: DeepSeek, Claude Sonnet, Claude Haiku, GPT-4o, GPT-4o-mini, Gemini Flash, Llama 70B, Llama 8B, Mistral Large, Mixtral, Qwen 72B, Command-R+
  • Both research and build workflows covered
  • 20.9% reclassification rate from manual review (59 thrash false positives from segmentation artifacts)

CI Improvements

  • CI backtest now validates against 370 traces (up from 81)
  • Per-detector gate promotion enforced automatically when thresholds are met

Gap Analysis (Remaining Soft Gates)

  • Confabulation: 1 percentage point short on the precision CI lower bound (P_lower=0.790 vs. threshold 0.800); likely qualifies with 2 more true-positive traces
  • Stuck: Low recall due to loop co-occurrence suppression in engine design
  • Thrash: Needs segmentation artifact filtering before promotion
  • Runaway cost: Insufficient positive samples

Install

pip install agent-vitals==1.10.0

Full changelog: CHANGELOG.md