v0.71.7 — Eval live runners

@MakazhanAlpamys MakazhanAlpamys released this 03 Jun 06:53

What's New

Eval live runners — six probe surfaces that previously emitted heuristic / neutral stubs now load a real model and run live (closes #161, #162, #208, #211, #212, #165). A new shared soup_cli/utils/live_eval.py provides the model-loading primitives (generator / multi-generator closures, masked cross-entropy eval-loss, a short-LoRA probe, held-out logit agreement); every heavy import (torch / transformers / peft / lm_eval) stays lazy.
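For readers unfamiliar with a masked cross-entropy eval-loss, here is a minimal pure-Python sketch of the idea. It is not the code in soup_cli/utils/live_eval.py, which presumably works on torch tensors; the function name and the list-of-lists input are illustrative assumptions, and the -100 ignore index is the Hugging Face labelling convention.

```python
import math

IGNORE_INDEX = -100  # Hugging Face convention: label positions excluded from the loss


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not masked.

    logits: one list of vocabulary scores per token position.
    labels: one target token id per position, or ignore_index to skip it.
    """
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # e.g. prompt or padding tokens do not contribute
        peak = max(row)
        # Numerically stable log-sum-exp over the vocabulary axis.
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[label]
        count += 1
    if count == 0:
        raise ValueError("every position is masked")
    return total / count
```

Masking matters because averaging over prompt or padding positions would dilute the loss on the tokens the probe is actually meant to score.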

  • soup advise --probe-model <id> — LIVE ROI probe: zero/few-shot token-F1 baselines, a short LoRA probe (relative held-out-loss improvement + real wall-clock), and base-model proximity (held-out logit agreement) folded into the dataset profile. Without --probe-model, --probe stays the offline heuristic.
  • soup tunability --live — real per-candidate LoRA probe: loads each repo_id, trains --probe-steps on a held-out-excluded slice, reports the held-out-loss drop.
  • soup eval capability --live --model <id> — invokes lm-eval-harness per resolved task (or --tasks) with --limit / --device, isolating per-task failures and surfacing a no-metric result as an explicit error.
  • soup eval behavior --base-model <id> [--adapter <path>] — generates pre/post responses on the bundled behaviour battery and scores the live diff.
  • soup diagnose --base-model <id> [--adapter <path>] [--dataset <jsonl>] [--tokenizer <id>] — runs all six failure-mode probes (forgetting / refusal / format / mode_collapse / memorization / contamination) live; falls back to neutral OK or --evidence JSON when no model is supplied.
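The token-F1 baseline mentioned for soup advise can be illustrated with the common SQuAD-style definition. This is a sketch under that assumption: the release does not say how soup-cli tokenises or normalises text, so the whitespace split and lower-casing here are placeholders, and token_f1 is a hypothetical name.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall (SQuAD-style)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Two empty strings agree perfectly; one empty side scores zero.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(round(token_f1("the cat sat", "the cat"), 3))  # prints 0.8
```

Scoring the same held-out prompts with zero-shot and few-shot generations then gives two baselines to compare against the LoRA probe.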

Validated end-to-end on real SmolLM2-135M (RTX 3050 4 GB): lora_probe loss 3.063→2.933, proximity 0.679, capability arc_easy acc_norm=1.0 via real lm-eval, diagnose all-six-probes OK, behaviour live diff, tunability delta 0.072.
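The release does not spell out how the relative held-out-loss improvement is computed. Assuming the usual (before - after) / before definition, the lora_probe figures above work out as follows; relative_improvement is a hypothetical helper, not a soup-cli function.

```python
def relative_improvement(loss_before, loss_after):
    """Fractional drop in held-out loss; positive means the probe helped."""
    if loss_before <= 0:
        raise ValueError("loss_before must be positive")
    return (loss_before - loss_after) / loss_before


# lora_probe validation figures quoted above: 3.063 -> 2.933
rel = relative_improvement(3.063, 2.933)
print(f"{rel:.1%}")  # prints 4.2%
```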

Install / Upgrade

pip install --upgrade soup-cli
# live eval runners need the training and eval extras:
pip install --upgrade 'soup-cli[train,eval]'

Security

  • The two new JSONL dataset readers (diagnose.live._load_dataset_rows, tunability._load_jsonl_rows) open with O_NOFOLLOW after the cwd-containment check, closing the check→open TOCTOU window (matches the v0.65 / v0.67 reader policy).
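The check-then-open pattern described above can be sketched roughly as below. This is an illustration, not the body of the soup-cli readers: open_contained and its root parameter are hypothetical, and the sketch assumes Python 3.9+ for Path.is_relative_to. os.O_NOFOLLOW is absent on some platforms (notably Windows), so the flag is looked up defensively.

```python
import os
from pathlib import Path


def open_contained(path, root=None):
    """Open a text file for reading only if it resolves inside root (default: cwd)."""
    root_dir = Path(root if root is not None else os.getcwd()).resolve()
    resolved = Path(path).resolve()
    # Containment check on the fully resolved path.
    if not resolved.is_relative_to(root_dir):
        raise ValueError(f"{path!r} is outside {str(root_dir)!r}")
    # O_NOFOLLOW makes the open fail if the final path component has been
    # swapped for a symlink between the check above and this call.
    flags = os.O_RDONLY | getattr(os, "O_NOFOLLOW", 0)
    fd = os.open(resolved, flags)
    return os.fdopen(fd, "r", encoding="utf-8")
```

Note that O_NOFOLLOW only guards the last path component; symlinks in intermediate directories are not covered by this flag alone.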

Known Limitations

  • Live runners are CPU-runnable but designed for small models on modest hardware; large bases still need adequate VRAM. --device cpu is honoured throughout.
  • soup eval capability --live depends on lm-evaluation-harness (pip install 'soup-cli[eval]'); unregistered tasks in a --suite are reported per-task rather than aborting the run.
  • soup diagnose contamination probe stays neutral unless a benchmark corpus is supplied; the format probe is skipped (neutral) when the dataset does not look JSON-shaped.
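Per-task failure isolation, as described for soup eval capability --live, might look something like the sketch below. The run_one callable stands in for whatever invokes lm-eval-harness on a single task; it, run_tasks, and the result dictionary shape are all assumptions made for illustration.

```python
def run_tasks(tasks, run_one):
    """Run each task independently so one failure does not abort the rest.

    run_one: hypothetical callable taking a task name and returning a
    dict of metrics (possibly empty) or raising on failure.
    """
    results = {}
    for task in tasks:
        try:
            metrics = run_one(task)
        except Exception as exc:  # isolate the failure to this task
            results[task] = {"status": "error", "error": f"{type(exc).__name__}: {exc}"}
            continue
        if not metrics:
            # An empty result is reported as an explicit error, not a silent pass.
            results[task] = {"status": "error", "error": "no metric returned"}
        else:
            results[task] = {"status": "ok", "metrics": metrics}
    return results
```

With this shape, an unregistered task shows up as one error entry while the remaining tasks still report their metrics.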

Full test suite: 12,771 tests (+68 in tests/test_v0717.py).