Backpropagate v1.1.0
A minor release that takes the project from "polished v1" to "real v1" via a 10-wave dogfood swarm. It covers a bug + security pass, a proactive health pass, UX humanization, a full UI redesign (Gradio → Reflex), and 5 P0 features.
Added
- Reflex web UI: the optional `[ui]` extra now installs Reflex (Radix UI) instead of Gradio. Pure-Python implementation, WebSocket-driven live state, refined Ocean Mist palette, full dark + light mode, WCAG 2.4.7 focus indicators, 30 SVG icons, heartbeat / sparkline / event-log / structured-error / recovery-banner patterns
- Hugging Face Hub push: `backprop push <local> --repo <owner/name>` + `backprop export --push-to-hub <repo>` for one-shot export+push. Adapter-only by default; `--include-base` for the full merged model. Token resolution from `HF_TOKEN` / `HUGGING_FACE_HUB_TOKEN` / HF CLI cache. `model_card.md` is mirrored to the repo's `README.md` so HF picks it up as the model card
- Resume from checkpoint: `backprop resume <run_id>` (and `backprop train --resume <run_id>` / `backprop multi-run --resume`) reconstructs a crashed or interrupted run from RunHistoryManager + the atomic checkpoint manifest. A 5-run multi-run that crashes at run 4 is now recoverable
- Run history: `RunHistoryManager` is now actually wired into Trainer + MultiRunTrainer. New `backprop list-runs` (with `--json`, `--status`, `--limit` filters + aligned columns) and `backprop show-run <run_id>` (partial-prefix matching) subcommands surface the history
- Model card generation: every export emits a `model_card.md` following the HF model-card schema, with full provenance (run_id, base model, dataset hash, seed, training duration, ASCII loss sparkline, Ship Gate trust signals). Opt out via `--no-model-card`
- Experiment tracking auto-wired: the `[monitoring]` extra (W&B, TensorBoard) now actually integrates. `report_to` defaults to `"auto"` (detect what's installed); the run shows up with name `backprop-<run_id_short>` for cross-system correlation
- Atomic checkpoint writes: Trainer.save / SLAOMerger.save / export_lora / export_gguf all write to `<path>.partial` then rename to final. Disk-full mid-write no longer leaves corrupt artifacts
- OOM auto-recovery: `Trainer(oom_recovery=True)` (default-on) halves batch_size + doubles gradient_accumulation_steps on `torch.cuda.OutOfMemoryError`, preserving effective batch. Aborts after 3 consecutive failures at batch=1
- HF Hub transient retry: every `from_pretrained` / `load_dataset` / `snapshot_download` retries on 5xx / 429 / connection errors with exponential backoff. 401 / 403 / 404 surface in < 1s with cause-classified hints
- GPU pause-on-overheat: `Trainer(pause_on_overheat=True)` now actually pauses training (the wiring was a no-op in v1.0)
- Unsloth fallback: `Trainer(unsloth_fallback=True)` (default-on) falls back to AutoModelForCausalLM + peft on Unsloth failures
- run_id correlation: every training run mints a UUID4 that flows through every log line + checkpoint manifest + SLAO merge record
- Stable error codes: `BackpropagateError.code` is now an explicit Ship Gate registry-prefixed identifier on every subclass. 28-entry `ERROR_CODES` catalog visible via `backprop info --error-codes`. `cause_category` enum on ModelLoadError surfaces cause-specific remediation hints
- CLI exit codes: proper 0 / 1 user-error / 2 runtime-error / 3 partial-success / 130 SIGINT contract
- Stage C humanization: structured errors with actionable hints, progress feedback on long ops, bare `backprop` prints help, `backprop info --json` for support attachments, friendly first-run messages
- CI hardening: every third-party GitHub Action SHA-pinned. PyPI publish via OIDC trusted publishing (Sigstore provenance). Docker image digest-pinned + HEALTHCHECK. Multi-OS test matrix (Linux + Windows + macOS + Python 3.13). pip-audit + Trivy + Bandit + Semgrep + TruffleHog all gate on findings
- Documentation: new handbook pages `error-codes.md`, `troubleshooting.md`, `env-vars.md`, `cli-reference.md`. README Troubleshooting + Reporting bugs + Web UI subsections. `examples/quickstart.jsonl` so the "3 lines" Quick Start runs on a clean install
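
To make the OOM auto-recovery entry above concrete, here is a minimal sketch of the halve-and-double loop. It is not the Trainer's actual code: `FakeOOM` stands in for `torch.cuda.OutOfMemoryError` so the example runs without a GPU, and `run_with_oom_recovery`, `make_step`, and the way "3 consecutive failures at batch=1" is counted are assumptions made for illustration.

```python
class FakeOOM(RuntimeError):
    """Stand-in for torch.cuda.OutOfMemoryError (assumption, for a GPU-free demo)."""


def run_with_oom_recovery(step, batch_size, grad_accum, max_failures_at_one=3):
    """Call step(batch_size, grad_accum) until it succeeds, shrinking on OOM.

    Hypothetical helper: returns the (batch_size, grad_accum) pair that worked.
    """
    failures_at_one = 0
    while True:
        try:
            step(batch_size, grad_accum)
            return batch_size, grad_accum
        except FakeOOM:
            if batch_size == 1:
                # Nothing left to shrink: count the failure and give up
                # once the limit is reached.
                failures_at_one += 1
                if failures_at_one >= max_failures_at_one:
                    raise
                continue
            # Halve the per-step batch and double accumulation so that
            # batch_size * grad_accum stays constant (exact for even sizes).
            batch_size //= 2
            grad_accum *= 2


def make_step(max_fitting_batch):
    """Build a fake training step that 'runs out of memory' above a threshold."""
    def step(batch_size, grad_accum):
        if batch_size > max_fitting_batch:
            raise FakeOOM(f"batch_size={batch_size} does not fit")
    return step


# Starting at 8 x 1 with room for only 2 samples per step: 8x1 -> 4x2 -> 2x4.
print(run_with_oom_recovery(make_step(2), batch_size=8, grad_accum=1))  # (2, 4)
```

Note that the effective batch is preserved exactly only when the batch size is even at each halving; with an odd size, integer division rounds down and the product shrinks.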
Changed
- Default model: `Trainer()` (and `backprop train` / `multi-run` CLI defaults) now use `Qwen/Qwen2.5-7B-Instruct` instead of `unsloth/Qwen2.5-7B-Instruct-bnb-4bit`. The non-quantized form works without bitsandbytes; users who want the bnb-4bit speedup install `[unsloth]` and pass `--model unsloth/...` explicitly
- safe_path stricter: absolute path + `..` segment + no `allowed_base` argument now raises `PathTraversalError` instead of warn-only-and-pass-through
- Multi-run validation-overlap fix: `_get_data_chunk` and `_get_replay_samples` now hard-cap at the train/validation boundary. Silent contamination is impossible; `ConfigurationError` surfaces a clear "reduce samples or increase dataset" hint
- Random state isolation: multi-run replay sampling uses a local `random.Random(seed)` instead of mutating the global Python RNG
- SLAO NaN/inf detection: `SLAOMerger.merge` raises `SLAO_MERGE_DIVERGED` with run_index + run_id + offending layer on non-finite weights
- Rate limiter Address handling: `_extract_client_ip` now correctly reads `.host` from Starlette's `Address` namedtuple (was including `:port`, giving every TCP connection its own bucket)
- UI output dir denylist: `BACKPROPAGATE_UI__OUTPUT_DIR` is validated against a denylist (`/etc`, `~/.ssh`, etc.) on first use
- `--share` + `--auth` gating: `backprop ui --share` now requires `--auth user:pass` (or explicit env-var opt-out with 5-second grace period + loud warning)
- Scorecard re-audited: B (Error Handling) row 3/7* → 5/7. Total 23/31 → 25/31
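
The random state isolation entry above is easiest to see in a small example. The sketch below is illustrative only: `sample_replay` is a hypothetical helper, not the project's `_get_replay_samples`. It shows that drawing from a local `random.Random(seed)` is reproducible and leaves the global generator's sequence untouched.

```python
import random


def sample_replay(indices, k, seed):
    """Pick k replay indices with a private generator (hypothetical helper)."""
    rng = random.Random(seed)  # local generator; the global RNG state is not touched
    return rng.sample(list(indices), k)


# The global stream is identical whether or not sample_replay runs in between.
random.seed(123)
baseline = random.random()

random.seed(123)
picked = sample_replay(range(100), k=5, seed=42)
after_sampling = random.random()

print(baseline == after_sampling)                           # True
print(picked == sample_replay(range(100), k=5, seed=42))    # True
```

Calling `random.seed(seed)` followed by `random.sample(...)` would give the same reproducibility, but it would also reset the sequence seen by any other code that relies on the global generator.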
Removed
- Gradio web UI: moved to `backpropagate/ui_gradio_legacy.py` with a DEPRECATED docstring. Preserved for v1.1 reference; will be removed in v1.2. `backpropagate.launch` / `create_backpropagate_theme` / `get_theme_info` / `get_css` now raise `ImportError` with the migration message
Tests
1654 → 1766 (+112): regression tests for every Stage A/B contract that landed and every P0 feature that shipped. Coverage threshold holds at 50%.