v1.10.0: Multi-Model Corpus Campaign & Loop Gate Promotion
What's New
Loop Detector Promoted to Hard CI Gate
First detector to meet all promotion thresholds on the expanded 370-trace corpus:
- Precision: 0.986 [CI lower: 0.960] (threshold: 0.80)
- Recall: 1.000 [CI lower: 0.982] (threshold: 0.75)
- Positives: 213 (threshold: 8)
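The gate check above can be sketched as follows. This is a minimal illustration, not the agent-vitals implementation: the function and class names are hypothetical, and the release notes do not state how the CI is computed. The reported lower bounds happen to be consistent with a 95% Wilson score interval, so the sketch assumes that method. The count of 3 false positives is inferred from 213 / 216 ≈ 0.986, and the sketch assumes each threshold is compared against the CI lower bound (as the confabulation entry under Gap Analysis suggests for precision).

```python
import math
from dataclasses import dataclass

Z_95 = 1.96  # assumed: two-sided 95% confidence level


def wilson_lower(successes: int, trials: int, z: float = Z_95) -> float:
    """Lower bound of the Wilson score interval for a binomial proportion."""
    if trials == 0:
        return 0.0
    p = successes / trials
    denom = 1 + z * z / trials
    center = p + z * z / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (center - margin) / denom


@dataclass(frozen=True)
class GateThresholds:
    # Threshold values taken from the release notes; field names are hypothetical.
    min_precision_lower: float = 0.80
    min_recall_lower: float = 0.75
    min_positives: int = 8


def meets_gate(tp: int, fp: int, fn: int,
               thresholds: GateThresholds = GateThresholds()) -> bool:
    """Return True when a detector clears all three promotion thresholds."""
    positives = tp + fn  # assumed: ground-truth positive traces
    precision_lower = wilson_lower(tp, tp + fp)
    recall_lower = wilson_lower(tp, tp + fn)
    return (
        precision_lower >= thresholds.min_precision_lower
        and recall_lower >= thresholds.min_recall_lower
        and positives >= thresholds.min_positives
    )


# Loop detector counts: 213 true positives, 3 false positives (inferred), 0 misses.
print(round(wilson_lower(213, 216), 3))  # precision lower bound
print(round(wilson_lower(213, 213), 3))  # recall lower bound
print(meets_gate(tp=213, fp=3, fn=0))
```

Under these assumptions the two bounds round to the 0.960 and 0.982 values reported above, and the loop detector clears the gate.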
AV-31 Multi-Model Corpus
- 373 traces manually reviewed across 12 model providers: DeepSeek, Claude Sonnet, Claude Haiku, GPT-4o, GPT-4o-mini, Gemini Flash, Llama 70B, Llama 8B, Mistral Large, Mixtral, Qwen 72B, Command-R+
- Both research and build workflows covered
- Manual review reclassified 20.9% of traces (59 thrash FPs traced to segmentation artifacts)
CI Improvements
- CI backtest now validates against 370 traces (up from 81)
- Per-detector gate promotion enforced automatically when thresholds met
Gap Analysis (Remaining Soft Gates)
- Confabulation: precision lower bound is 1 percentage point short (P_lower=0.790 vs. the 0.800 threshold); likely qualifies with 2 more TP traces
- Stuck: Low recall due to loop co-occurrence suppression in engine design
- Thrash: Needs segmentation artifact filtering before promotion
- Runaway cost: Insufficient positive samples
Install
pip install agent-vitals==1.10.0
Full changelog: CHANGELOG.md