v0.71.8 — Probes & SAE
What's New — v0.71.8 "Probes & SAE"
The activation-probe surfaces ship real weights, live SAE downloads, and an end-to-end capture → diff pipeline (validated on SmolLM2-135M, RTX 3050 4 GB).
- soup probe truth / soup probe harm (#217): TruthfulQA-style honesty and HarmBench-style misuse activation probes (6 bundled bases each, 5% / 20% verdict bands). --weights loads a real calibrated probe; without it the bundled deterministic fallback is used.
- soup probe sleeper --weights <w.npz|.npy|.safetensors> (#215): load a real calibrated sleeper-probe direction instead of the synthetic fallback. compute_contrast_probe(positive, negative) derives one from contrast-pair activations. Weights are cwd-contained, O_NOFOLLOW-opened, allow_pickle=False, size-capped.
- soup probe sae-diff <repo> --auto-download (#216): fetch an allowlisted SAE from the HF Hub into ~/.soup/sae-cache/ (validated against HF_HUB_ALLOWLIST before any network call) via a new SSRF-hardened hubs.snapshot_download.
- soup probe interference --measure <eval.jsonl> --base-model <m> --adapter a=path ... (#218): auto-measure the N×N adapter-interference matrix by actually loading the base + each LoRA adapter (PEFT multi-adapter; add_weighted_adapter(combination_type="cat") for co-loaded pairs). Exit 2 on a MAJOR worst-pair.
- soup train --capture-activations <layer> --capture-prompts <jsonl> (#219): a post-training hook writes an SAE-diff-ready per-token activation snapshot to <output>/activations/activations.json. The model.layers.N path resolves whether or not a LoRA adapter is loaded.
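As a rough illustration of what deriving a probe direction from contrast-pair activations can look like, here is a minimal difference-of-means sketch. These notes do not describe how compute_contrast_probe works internally, so the technique (unit-normalised difference of class means), the function names contrast_probe_sketch and probe_score, and the plain-list input format are all assumptions made for illustration, not soup's implementation.

```python
import math
from typing import List, Sequence


def contrast_probe_sketch(
    positive: Sequence[Sequence[float]],
    negative: Sequence[Sequence[float]],
) -> List[float]:
    """Return a unit-norm direction: mean(positive) - mean(negative).

    Hypothetical helper; a common way to build a linear probe direction,
    not necessarily what soup's compute_contrast_probe does.
    """
    if not positive or not negative:
        raise ValueError("both activation sets must be non-empty")
    dim = len(positive[0])
    if any(len(row) != dim for row in list(positive) + list(negative)):
        raise ValueError("all activation vectors must share one dimension")

    def mean(rows: Sequence[Sequence[float]]) -> List[float]:
        return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]

    mu_pos, mu_neg = mean(positive), mean(negative)
    diff = [p - n for p, n in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("class means are identical; no direction to extract")
    return [d / norm for d in diff]


def probe_score(direction: Sequence[float], activation: Sequence[float]) -> float:
    """Project one activation vector onto the probe direction."""
    if len(direction) != len(activation):
        raise ValueError("probe and activation dimensions differ")
    return sum(d * a for d, a in zip(direction, activation))
```

The dimension check in probe_score mirrors the constraint noted under Known Limitations: a probe can only score activations of the same width as the model it was calibrated on.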
Install / Upgrade
pip install --upgrade soup-cli
Security
Probe / SAE / capture file I/O is cwd-contained + O_NOFOLLOW (closes the TOCTOU symlink-swap window) + size-capped; SAE weight loads use allow_pickle=False (no pickle code-exec). --auto-download validates the allowlist before any network call and rejects a glob result that resolves outside the snapshot dir (symlink-escape guard).
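The three file-I/O guards named above (cwd containment, O_NOFOLLOW, size cap) can be sketched with the standard library alone. This is an illustrative approximation, not soup's code: the function name read_contained_file, the MAX_BYTES value, and the root parameter (added so the containment directory can be overridden) are hypothetical.

```python
import os
from pathlib import Path
from typing import Optional

MAX_BYTES = 64 * 1024 * 1024  # hypothetical cap; the real limit is not stated


def read_contained_file(
    path: str, max_bytes: int = MAX_BYTES, root: Optional[str] = None
) -> bytes:
    """Read a file only if it sits under `root` (default: cwd), is not a
    symlink at its final component, and is no larger than `max_bytes`."""
    root_dir = Path(root if root is not None else os.getcwd()).resolve()
    candidate = Path(path)
    if not candidate.is_absolute():
        candidate = root_dir / candidate

    # Resolve the parent directory (handles ".." and symlinked directories)
    # but keep the final name unresolved so O_NOFOLLOW can inspect it.
    parent = candidate.parent.resolve()
    if parent != root_dir and root_dir not in parent.parents:
        raise PermissionError(f"{path!r} is outside {root_dir}")
    target = parent / candidate.name

    # O_NOFOLLOW makes open() fail if the final component is a symlink, so a
    # link swapped in after the containment check cannot be followed.
    # The flag is POSIX-only; on platforms without it this falls back to 0.
    flags = os.O_RDONLY | getattr(os, "O_NOFOLLOW", 0)
    fd = os.open(target, flags)
    with os.fdopen(fd, "rb") as handle:
        # Check the size on the already-open descriptor, not on the path.
        if os.fstat(handle.fileno()).st_size > max_bytes:
            raise ValueError(f"{path!r} exceeds the {max_bytes}-byte cap")
        data = handle.read(max_bytes + 1)
    if len(data) > max_bytes:
        raise ValueError(f"{path!r} exceeds the {max_bytes}-byte cap")
    return data
```

Checking the size via fstat on the open descriptor, rather than stat on the path, keeps the check and the read tied to the same file object.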
Known Limitations
- #215 is partial: the operator-supplied (--weights), contrast-pair (compute_contrast_probe), and deterministic-synthetic paths all ship and run live, but the 6 large-base Anthropic-style calibrated probe vectors are upstream-gated (no public artifact exists). The bundled truth/harm/sleeper specs use the synthetic seed unless you supply real weights. #215 stays open as upstream-gated.
- Bundled probe bases are 4096-dim (Llama-3-8B family); running a bundled probe on a smaller model requires --weights with a matching-dim probe.
- --auto-download is gated to the 8-entry HF_HUB_ALLOWLIST. The happy-path download is covered by mocked unit tests only (real Gemma-Scope SAEs are multi-GB, beyond the 4 GB hardware budget); the allowlist-rejection path was smoke-tested offline.
- interference --measure loads every adapter into one PEFT model (_MAX_ADAPTERS=16, _MAX_EVAL_PAIRS=64).
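The ordering that --auto-download relies on (allowlist check first, network call second, then a symlink-escape check on whatever the snapshot contains) might be sketched as follows. Everything here is hypothetical: EXAMPLE_ALLOWLIST holds a placeholder entry rather than the real HF_HUB_ALLOWLIST contents, and the download callable stands in for the actual Hub client so the sketch runs without network access.

```python
from pathlib import Path
from typing import Callable, FrozenSet, List

# Placeholder entry; the real 8-entry HF_HUB_ALLOWLIST is not listed in these notes.
EXAMPLE_ALLOWLIST: FrozenSet[str] = frozenset({"example-org/example-sae"})


def check_repo_allowed(repo_id: str, allowlist: FrozenSet[str]) -> str:
    """Reject any repo id that is not an exact allowlist member."""
    if repo_id not in allowlist:
        raise PermissionError(f"{repo_id!r} is not on the allowlist")
    return repo_id


def find_weights_in_snapshot(snapshot_dir: str, pattern: str = "*.safetensors") -> List[Path]:
    """Glob for weight files and refuse any match that resolves outside the snapshot."""
    root = Path(snapshot_dir).resolve()
    safe: List[Path] = []
    for match in sorted(root.glob(pattern)):
        resolved = match.resolve()  # follows symlinks
        if root not in resolved.parents:
            raise PermissionError(f"{match.name!r} escapes the snapshot directory")
        safe.append(resolved)
    return safe


def fetch_sae(
    repo_id: str,
    download: Callable[[str], str],
    allowlist: FrozenSet[str] = EXAMPLE_ALLOWLIST,
) -> List[Path]:
    """Validate first, download second, then vet the downloaded files."""
    check_repo_allowed(repo_id, allowlist)  # runs before any network call
    snapshot_dir = download(repo_id)        # injected; returns a local directory
    return find_weights_in_snapshot(snapshot_dir)
```

Passing the downloader in as a parameter is also what makes the rejection path testable offline: a stub that records its calls can confirm that a disallowed repo never reaches the network layer.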
Full changelog: CHANGELOG.md