v0.71.8 — Probes & SAE

@MakazhanAlpamys MakazhanAlpamys released this 03 Jun 10:30
· 43 commits to main since this release

What's New — v0.71.8 "Probes & SAE"

The activation-probe surfaces ship real weights, live SAE downloads, and an end-to-end capture → diff pipeline (validated on SmolLM2-135M, RTX 3050 4 GB).

  • soup probe truth / soup probe harm (#217) — TruthfulQA-style honesty and HarmBench-style misuse activation probes (6 bundled bases each, 5% / 20% verdict bands). --weights loads a real calibrated probe; without it the bundled deterministic fallback is used.
  • soup probe sleeper --weights <w.npz|.npy|.safetensors> (#215) — load a real calibrated sleeper-probe direction instead of the synthetic fallback. compute_contrast_probe(positive, negative) derives one from contrast-pair activations. Weights are cwd-contained, O_NOFOLLOW-opened, allow_pickle=False, size-capped.
  • soup probe sae-diff <repo> --auto-download (#216) — fetch an allowlisted SAE from the HF Hub into ~/.soup/sae-cache/ (validated against HF_HUB_ALLOWLIST before any network call) via a new SSRF-hardened hubs.snapshot_download.
  • soup probe interference --measure <eval.jsonl> --base-model <m> --adapter a=path ... (#218) — auto-measure the N×N adapter-interference matrix by actually loading the base + each LoRA adapter (PEFT multi-adapter; add_weighted_adapter(combination_type="cat") for co-loaded pairs). Exit 2 on a MAJOR worst-pair.
  • soup train --capture-activations <layer> --capture-prompts <jsonl> (#219) — a post-training hook writes an SAE-diff-ready per-token activation snapshot to <output>/activations/activations.json. The model.layers.N path resolves whether or not a LoRA adapter is loaded.
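One common way to derive a probe direction from contrast-pair activations is a unit-normalized difference of class means. The sketch below illustrates that technique only. It is an assumption that compute_contrast_probe works along these lines, and the names contrast_probe_direction and project are hypothetical, not part of soup-cli. Plain Python lists keep it dependency-free.

```python
import math


def contrast_probe_direction(positive, negative):
    """Return the unit-norm difference of class means for two activation sets.

    Hypothetical illustration of a contrast-pair probe; not soup-cli's code.
    """
    if not positive or not negative:
        raise ValueError("both activation sets must be non-empty")
    dim = len(positive[0])
    if any(len(vec) != dim for vec in positive + negative):
        raise ValueError("all activation vectors must share one dimension")

    def mean(rows):
        # Column-wise mean over a list of equal-length vectors.
        return [sum(col) / len(rows) for col in zip(*rows)]

    diff = [p - n for p, n in zip(mean(positive), mean(negative))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; no direction to extract")
    return [x / norm for x in diff]


def project(activation, direction):
    """Score one activation by its dot product with the probe direction."""
    return sum(a * d for a, d in zip(activation, direction))


# Toy 2-dim activations: the classes differ only along the first axis.
positive = [[1.0, 0.0], [3.0, 0.0]]
negative = [[-1.0, 0.0], [-3.0, 0.0]]
direction = contrast_probe_direction(positive, negative)
score = project([2.0, 5.0], direction)
```

A direction built this way has the same dimension as the activations it came from, which is why a probe for a 4096-dim base cannot be applied to a smaller model.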

Install / Upgrade

pip install --upgrade soup-cli

Security

Probe, SAE, and capture file I/O is cwd-contained, opened with O_NOFOLLOW (which closes the TOCTOU symlink-swap window), and size-capped. SAE weight loads use allow_pickle=False, so a weights file cannot trigger pickle code execution. --auto-download validates the allowlist before any network call and rejects a glob result that resolves outside the snapshot dir (symlink-escape guard).
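As a rough illustration of how containment, O_NOFOLLOW, and a size cap can combine, here is a minimal standard-library sketch. The function name read_contained and the 64 MiB cap are assumptions made for the example, not soup-cli's actual helper or limit.

```python
import os

MAX_BYTES = 64 * 1024 * 1024  # hypothetical cap; the real limit is not stated


def read_contained(path, base_dir=None, max_bytes=MAX_BYTES):
    """Read a file only if it sits under base_dir, is not a symlink, and is small."""
    base = os.path.realpath(base_dir or os.getcwd())
    candidate = os.path.join(base, path)  # an absolute `path` overrides base
    parent = os.path.realpath(os.path.dirname(candidate))
    # Containment: the resolved parent directory must be base or below it.
    if os.path.commonpath([base, parent]) != base:
        raise PermissionError(f"{path!r} escapes {base!r}")
    target = os.path.join(parent, os.path.basename(candidate))
    # O_NOFOLLOW makes open() fail if the final component is a symlink, so a
    # link swapped in after the check cannot redirect the read. The flag is
    # missing on some platforms (e.g. Windows), hence the getattr fallback.
    flags = os.O_RDONLY | getattr(os, "O_NOFOLLOW", 0)
    fd = os.open(target, flags)
    with os.fdopen(fd, "rb") as fh:
        # Check the size on the already-open descriptor, not on the path.
        if os.fstat(fh.fileno()).st_size > max_bytes:
            raise ValueError(f"{path!r} exceeds {max_bytes} bytes")
        data = fh.read(max_bytes + 1)
    if len(data) > max_bytes:
        raise ValueError(f"{path!r} grew past {max_bytes} bytes while reading")
    return data
```

Checking the size through fstat on the open descriptor, rather than stat on the path, means the check and the read refer to the same file.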

Known Limitations

  • #215 is partial — the operator-supplied (--weights), contrast-pair (compute_contrast_probe), and deterministic-synthetic paths all ship and run live, but the 6 large-base Anthropic-style calibrated probe vectors are upstream-gated (no public artifact exists). The bundled truth/harm/sleeper specs use the synthetic seed unless you supply real weights. #215 stays open as upstream-gated.
  • Bundled probe bases are 4096-dim (Llama-3-8B family); running a bundled probe on a smaller model requires --weights with a matching-dim probe.
  • --auto-download is gated to the 8-entry HF_HUB_ALLOWLIST. The happy-path download is covered by mocked unit tests, because real Gemma-Scope SAEs are multi-GB and exceed the 4 GB hardware budget; the allowlist-rejection path was smoke-tested offline.
  • interference --measure loads every adapter into one PEFT model (_MAX_ADAPTERS=16, _MAX_EVAL_PAIRS=64).
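The allowlist gate and symlink-escape guard mentioned for --auto-download can be pictured roughly as follows. This is a sketch under stated assumptions: the allowlist entries are placeholders (the real HF_HUB_ALLOWLIST contents are not listed here), and fetch_sae, contained_matches, and the injected download callable are hypothetical names standing in for the real downloader.

```python
from pathlib import Path

# Placeholder entries; the real 8-entry allowlist is not reproduced here.
ALLOWLIST = frozenset({"example-org/sae-small", "example-org/sae-large"})


def fetch_sae(repo_id, cache_dir, download, allowlist=ALLOWLIST):
    """Reject non-allowlisted repos before the downloader is ever invoked."""
    if repo_id not in allowlist:
        raise ValueError(f"{repo_id!r} is not on the SAE allowlist")
    # Only now touch the network, via the injected downloader callable.
    return Path(download(repo_id, cache_dir)).resolve()


def contained_matches(snapshot_dir, pattern):
    """Glob inside a snapshot dir, refusing results that resolve outside it."""
    root = Path(snapshot_dir).resolve()
    safe = []
    for match in sorted(root.glob(pattern)):
        resolved = match.resolve()
        # A symlink inside the snapshot could point elsewhere on disk.
        if root not in resolved.parents:
            raise ValueError(f"{match} resolves outside {root}")
        safe.append(resolved)
    return safe
```

Passing the downloader in as a parameter is what makes the happy path testable with a mock: a test can assert that a rejected repo id never reaches the callable.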

Full changelog: CHANGELOG.md