Skip to content

v0.71.6 — Synth data + build runner go live

Choose a tag to compare

@MakazhanAlpamys MakazhanAlpamys released this 02 Jun 17:19
· 47 commits to main since this release

Three deferred v0.69.0 stubs go live, IRT grows, and a real soup data augment crash is fixed.

What's New

  • soup build materialises (#231) — the dbt-for-SFT DAG no longer just dry-runs. soup build manifest.yaml --output-dir out/ runs five built-in transforms (identity / drop_empty / lowercase / strip / dedup_exact), rebuilds table / view models from scratch, and re-transforms only the changed rows for incremental models — tracked in a SQLite state store keyed on the row hash and the model's transform+config fingerprint, so a transform change re-runs everything. Custom transforms ride in via the Python API's transforms= map. Outputs are written atomically; --output-dir is symlink-checked before any directory is created.
  • soup data gen-magpie generates (#232) — feed an aligned model its chat-template prefix (chatml / llama3 / gemma / mistral auto-detected) and harvest the self-generated instruction + response via raw completion. Live for --provider ollama (/api/generate) and vllm (/v1/completions), both loopback-only; anthropic is rejected (no raw-completion endpoint). Optional --quality-filter drops low-quality rows; exact-duplicate instructions are de-duplicated.
  • 2PL / 3PL eval-cost models (#213) — soup eval irt-subset --model 2pl|3pl adds per-item discrimination (and a 3PL guessing floor) on top of the existing 1PL Rasch fit, via joint coordinate-ascent MLE.
  • Tokenizer-aware memorization probe (#167) — score_memorization(..., tokenizer=...) / split_prefix(..., tokenizer=...) split on real token-id boundaries and measure echo-overlap over sub-word tokens, catching BPE-level memorization whitespace tokenisation misses (library surface; soup diagnose --tokenizer CLI lands with the live probe runner).

Fixed

  • soup data augment --provider ollama|vllm no longer crashes (#75) — the command imported a non-existent OllamaProvider and raised ImportError on every non-OpenAI provider. It now routes through the shared, SSRF-hardened provider factory; --model / --base-url are honoured, the output path is containment- and symlink-checked, and the write is atomic.

Security

  • validate_ollama_url / validate_vllm_url reject the 0.0.0.0 bind-any wildcard (loopback set is now localhost / 127.0.0.1 / ::1), matching the newer validate_hub_endpoint / validate_webhook_url validators — reachable now that Magpie threads a user-supplied --base-url through these providers.

Install / Upgrade

pip install --upgrade soup-cli

Known Limitations

  • soup build custom transforms are Python-API only — the 5 built-ins ship via the CLI; a manifest cannot yet name an importable dotted-path transform.
  • gen-magpie is ollama / vllm onlyanthropic is rejected (no raw-completion endpoint); OpenAI-compatible servers work via --provider vllm --base-url.
  • 3PL guessing is an empirical low-ability floor, not a jointly-estimated parameter (joint 3PL MLE is unstable without priors on small eval sets).
  • Tokenizer-aware memorization is library-only this release — the soup diagnose --tokenizer CLI flag lands with the live probe runner (#165).
  • gen-magpie quality filter is the v0.47 keyword/heuristic baseline (Llama-Guard / FineWeb-Edu classifiers remain a [data-pro] deferral).

Full notes: CHANGELOG.md