v0.71.7 — Eval live runners
What's New
Eval live runners — six probe surfaces that previously emitted heuristic / neutral stubs now load a real model and run live (closes #161, #162, #208, #211, #212, #165). A new shared soup_cli/utils/live_eval.py provides the model-loading primitives (generator / multi-generator closures, masked cross-entropy eval-loss, a short-LoRA probe, held-out logit agreement); every heavy import (torch / transformers / peft / lm_eval) stays lazy.
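To make "masked cross-entropy eval-loss" concrete, here is a minimal pure-Python sketch of the idea: average the negative log-likelihood over only those token positions whose label is not masked. It assumes the common Hugging Face convention of marking ignored positions with -100; the function name and signature are hypothetical and are not taken from soup_cli/utils/live_eval.py, which presumably works on torch tensors.

```python
import math
from typing import Sequence

# Assumption: masked positions carry the label -100 (Hugging Face convention).
IGNORE_INDEX = -100


def masked_cross_entropy(
    logits: Sequence[Sequence[float]], labels: Sequence[int]
) -> float:
    """Mean negative log-likelihood over unmasked token positions only."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # prompt / padding tokens do not contribute to the loss
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[label]
        count += 1
    if count == 0:
        raise ValueError("every position is masked; loss is undefined")
    return total / count


# Two scored positions with uniform logits over 4 classes, one masked position.
example_loss = masked_cross_entropy(
    [[0.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
    [1, IGNORE_INDEX, 2],
)
```

With uniform logits the loss equals ln(4), and the masked middle position has no effect on the result.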
- soup advise --probe-model <id>: LIVE ROI probe. Zero/few-shot token-F1 baselines, a short LoRA probe (relative held-out-loss improvement + real wall-clock), and base-model proximity (held-out logit agreement) are folded into the dataset profile. Without --probe-model, --probe stays the offline heuristic.
- soup tunability --live: real per-candidate LoRA probe. Loads each repo_id, trains --probe-steps on a held-out-excluded slice, and reports the held-out-loss drop.
- soup eval capability --live --model <id>: invokes lm-eval-harness per resolved task (or --tasks) with --limit / --device, isolating per-task failures and surfacing a no-metric result as an explicit error.
- soup eval behavior --base-model <id> [--adapter <path>]: generates pre/post responses on the bundled behaviour battery and scores the live diff.
- soup diagnose --base-model <id> [--adapter <path>] [--dataset <jsonl>] [--tokenizer <id>]: runs all six failure-mode probes (forgetting / refusal / format / mode_collapse / memorization / contamination) live; falls back to neutral OK or --evidence JSON when no model is supplied.
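The notes do not say how the token-F1 baseline is computed. As an illustration only, the sketch below assumes the common SQuAD-style definition: lowercase, split on whitespace, and take the harmonic mean of precision and recall over the multiset token overlap. The name token_f1 is hypothetical and not part of soup-cli.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token F1 between a model output and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a perfect match; exactly one empty scores zero.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: each token counts at most as often as in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1("the cat sat", "the cat")  # precision 2/3, recall 1
```

A zero-shot or few-shot baseline would then be the mean of this score over held-out examples, with the few-shot variant prepending demonstrations to the prompt.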
Validated end-to-end on real SmolLM2-135M (RTX 3050 4 GB): lora_probe loss 3.063→2.933, proximity 0.679, capability arc_easy acc_norm=1.0 via real lm-eval, diagnose all-six-probes OK, behaviour live diff, tunability delta 0.072.
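For a sense of scale, the reported lora_probe losses can be turned into a relative improvement. The formula below, (before - after) / before, is an assumption about what "relative held-out-loss improvement" means; the notes do not define it, and the helper name is hypothetical.

```python
def relative_improvement(loss_before: float, loss_after: float) -> float:
    """Fractional drop in held-out loss; positive means the probe helped."""
    if loss_before <= 0:
        raise ValueError("loss_before must be positive")
    return (loss_before - loss_after) / loss_before


# Figures from the SmolLM2-135M validation run quoted above.
gain = relative_improvement(3.063, 2.933)
print(round(gain, 4))  # 0.0424, i.e. roughly a 4.2% lower held-out loss
```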
Install / Upgrade
pip install --upgrade soup-cli
# live eval runners need the training and eval extras:
pip install --upgrade 'soup-cli[train,eval]'
Security
- The two new JSONL dataset readers (diagnose.live._load_dataset_rows, tunability._load_jsonl_rows) open with O_NOFOLLOW after the cwd-containment check, closing the check→open TOCTOU window (matches the v0.65 / v0.67 reader policy).
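The pattern described here can be sketched roughly as follows. This is not the soup-cli implementation: read_jsonl_rows and its base_dir parameter are hypothetical, and the sketch assumes a POSIX platform where os.O_NOFOLLOW exists (elsewhere the flag degrades to a no-op).

```python
import json
import os
from pathlib import Path
from typing import List, Optional


def read_jsonl_rows(path: str, base_dir: Optional[str] = None) -> List[dict]:
    """Read JSONL rows from a file that must live under base_dir (default: cwd)."""
    base = Path(base_dir if base_dir is not None else os.getcwd()).resolve()
    target = base / path  # an absolute `path` replaces `base` here

    # Step 1: containment check on the fully resolved location.
    if not target.resolve().is_relative_to(base):  # Python 3.9+
        raise ValueError(f"{path!r} resolves outside {base}")

    # Step 2: open the unresolved path with O_NOFOLLOW. If the final path
    # component is (or has been swapped for) a symlink after the check,
    # the open fails instead of following it. Note that O_NOFOLLOW only
    # guards the last component, not intermediate directories.
    flags = os.O_RDONLY | getattr(os, "O_NOFOLLOW", 0)
    fd = os.open(target, flags)
    with os.fdopen(fd, "r", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Opening by file descriptor means the rows are read from the object that was actually opened, so a later rename or swap of the path cannot redirect the read.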
Known Limitations
- Live runners are CPU-runnable but designed for small models on modest hardware; large bases still need adequate VRAM. --device cpu is honoured throughout.
- soup eval capability --live depends on lm-evaluation-harness (pip install 'soup-cli[eval]'); unregistered tasks in a --suite are reported per-task rather than aborting the run.
- soup diagnose contamination probe stays neutral unless a benchmark corpus is supplied; the format probe is skipped (neutral) when the dataset does not look JSON-shaped.
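The per-task reporting behaviour of the capability runner can be pictured with a small sketch. Everything here is hypothetical: run_tasks_isolated and the run_one callback stand in for whatever soup-cli does around lm-evaluation-harness, and no real harness API is called.

```python
from typing import Callable, Dict, Iterable, Optional


def run_tasks_isolated(
    tasks: Iterable[str],
    run_one: Callable[[str], Optional[float]],
) -> Dict[str, dict]:
    """Run each task independently so one failure cannot abort the rest."""
    results: Dict[str, dict] = {}
    for task in tasks:
        try:
            metric = run_one(task)
        except Exception as exc:  # e.g. an unregistered task name
            results[task] = {"status": "error", "error": f"{type(exc).__name__}: {exc}"}
            continue
        if metric is None:
            # A run that produced no metric is reported as an explicit error,
            # never silently treated as a pass.
            results[task] = {"status": "error", "error": "no metric returned"}
        else:
            results[task] = {"status": "ok", "metric": metric}
    return results


def fake_runner(task: str) -> Optional[float]:
    """Stand-in for a real evaluation call; values are illustrative only."""
    if task == "unknown_task":
        raise KeyError(task)
    return None if task == "empty_task" else 0.5


report = run_tasks_isolated(["arc_easy", "unknown_task", "empty_task"], fake_runner)
```

Here report contains one entry per task: the first is marked ok, while the unregistered and no-metric tasks each carry their own error message.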
Full test suite: 12,771 tests (+68 in tests/test_v0717.py).