Backpropagate v1.1.0

github-actions released this 21 May 10:24
· 134 commits to main since this release
403acb1

A minor release that takes the project from "polished v1" to "real v1" via a 10-wave dogfood swarm: a bug and security pass, a proactive health pass, UX humanization, a full UI redesign (Gradio → Reflex), and 5 P0 features.

Added

  • Reflex web UI — the optional [ui] extra now installs Reflex (Radix UI) instead of Gradio. Pure-Python implementation, WebSocket-driven live state, refined Ocean Mist palette, full dark + light mode, WCAG 2.4.7 focus indicators, 30 SVG icons, heartbeat / sparkline / event-log / structured-error / recovery-banner patterns
  • Hugging Face Hub push: backprop push <local> --repo <owner/name>, plus backprop export --push-to-hub <repo> for one-shot export+push. Adapter-only by default; --include-base for the full merged model. Token resolution from HF_TOKEN / HUGGING_FACE_HUB_TOKEN / HF CLI cache. model_card.md is mirrored to the repo's README.md so HF picks it up as the model card
  • Resume from checkpoint: backprop resume <run_id> (and backprop train --resume <run_id> / backprop multi-run --resume) reconstructs a crashed or interrupted run from RunHistoryManager + the atomic checkpoint manifest. A 5-run multi-run that crashes at run 4 is now recoverable
  • Run history: RunHistoryManager is now actually wired into Trainer + MultiRunTrainer. New backprop list-runs (with --json, --status, --limit filters + aligned columns) and backprop show-run <run_id> (partial-prefix matching) subcommands surface the history
  • Model card generation — every export emits a model_card.md following the HF model-card schema, with full provenance (run_id, base model, dataset hash, seed, training duration, ASCII loss sparkline, Ship Gate trust signals). Opt out via --no-model-card
  • Experiment tracking auto-wired: the [monitoring] extra (W&B, TensorBoard) now actually integrates. report_to defaults to "auto" (detect what's installed); the run shows up with name backprop-<run_id_short> for cross-system correlation
  • Atomic checkpoint writes — Trainer.save / SLAOMerger.save / export_lora / export_gguf all write to <path>.partial then rename to final. Disk-full mid-write no longer leaves corrupt artifacts
  • OOM auto-recovery: Trainer(oom_recovery=True) (default-on) halves batch_size + doubles gradient_accumulation_steps on torch.cuda.OutOfMemoryError, preserving effective batch. Aborts after 3 consecutive failures at batch=1
  • HF Hub transient retry — every from_pretrained / load_dataset / snapshot_download retries on 5xx / 429 / connection errors with exponential backoff. 401 / 403 / 404 surface in < 1s with cause-classified hints
  • GPU pause-on-overheat: Trainer(pause_on_overheat=True) now actually pauses training (the wiring was a no-op in v1.0)
  • Unsloth fallback: Trainer(unsloth_fallback=True) (default-on) falls back to AutoModelForCausalLM + peft on Unsloth failures
  • run_id correlation — every training run mints a UUID4 that flows through every log line + checkpoint manifest + SLAO merge record
  • Stable error codes: BackpropagateError.code is now an explicit Ship Gate registry-prefixed identifier on every subclass. 28-entry ERROR_CODES catalog visible via backprop info --error-codes. cause_category enum on ModelLoadError surfaces cause-specific remediation hints
  • CLI exit codes — proper 0 / 1 user-error / 2 runtime-error / 3 partial-success / 130 SIGINT contract
  • Stage C humanization — structured errors with actionable hints, progress feedback on long ops, bare backprop prints help, backprop info --json for support attachments, friendly first-run messages
  • CI hardening — every third-party GitHub Action SHA-pinned. PyPI publish via OIDC trusted publishing (Sigstore provenance). Docker image digest-pinned + HEALTHCHECK. Multi-OS test matrix (Linux + Windows + macOS + Python 3.13). pip-audit + Trivy + Bandit + Semgrep + TruffleHog all gate on findings
  • Documentation — new handbook pages: error-codes.md, troubleshooting.md, env-vars.md, cli-reference.md. README Troubleshooting + Reporting bugs + Web UI subsections. examples/quickstart.jsonl so the "3 lines" Quick Start runs on a clean install
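The atomic checkpoint writes listed above follow a common write-then-rename pattern. The sketch below is a minimal illustration of that pattern, not the project's actual code: the helper name atomic_write_bytes and the file name adapter.bin are hypothetical, and it assumes the partial file and the final path live on the same filesystem, which is what makes os.replace atomic.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(path: Path, data: bytes) -> None:
    """Write data to `<path>.partial`, then rename it over `path`.

    Hypothetical helper for illustration; Backpropagate's own
    implementation may differ.
    """
    partial = path.with_name(path.name + ".partial")
    try:
        with open(partial, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())  # flush bytes to disk before the rename
        # os.replace swaps the file in one step on the same filesystem,
        # so readers see either the old file or the complete new one.
        os.replace(partial, path)
    except BaseException:
        # On failure (for example, disk full) drop the partial file and
        # leave any previous artifact at `path` untouched.
        partial.unlink(missing_ok=True)
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "adapter.bin"  # hypothetical artifact name
        atomic_write_bytes(target, b"checkpoint-1")
        atomic_write_bytes(target, b"checkpoint-2")
        print(target.read_bytes())
```

Because the final path is only ever touched by the rename, an interrupted write can leave at most a stray .partial file, never a truncated artifact.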

Changed

  • Default model: Trainer() (and backprop train / multi-run CLI defaults) now use Qwen/Qwen2.5-7B-Instruct instead of unsloth/Qwen2.5-7B-Instruct-bnb-4bit. The non-quantized form works without bitsandbytes; users who want the bnb-4bit speedup install [unsloth] and pass --model unsloth/... explicitly
  • safe_path stricter — absolute path + .. segment + no allowed_base argument now raises PathTraversalError instead of warn-only-and-pass-through
  • Multi-run validation-overlap fix: _get_data_chunk and _get_replay_samples now hard-cap at the train/validation boundary. Silent contamination is impossible; ConfigurationError surfaces a clear "reduce samples or increase dataset" hint
  • Random state isolation — multi-run replay sampling uses a local random.Random(seed) instead of mutating the global Python RNG
  • SLAO NaN/inf detection: SLAOMerger.merge raises SLAO_MERGE_DIVERGED with run_index + run_id + offending layer on non-finite weights
  • Rate limiter Address handling: _extract_client_ip now correctly reads .host from Starlette's Address namedtuple (was including :port, giving every TCP connection its own bucket)
  • UI output dir denylist: BACKPROPAGATE_UI__OUTPUT_DIR is validated against a denylist (/etc, ~/.ssh, etc.) on first use
  • --share + --auth gating: backprop ui --share now requires --auth user:pass (or explicit env-var opt-out with 5-second grace period + loud warning)
  • Scorecard re-audited — B (Error Handling) row 3/7* → 5/7. Total 23/31 → 25/31
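The random state isolation change above can be illustrated with a short sketch. The function name sample_replay and its parameters are hypothetical and not taken from the codebase; the point is only that a local random.Random(seed) gives reproducible sampling without disturbing the global Python RNG that other code may rely on.

```python
import random
from typing import List, Sequence


def sample_replay(indices: Sequence[int], k: int, seed: int) -> List[int]:
    """Pick k replay indices with a private, seeded generator.

    Hypothetical stand-in for the replay sampler; the real signature
    in Backpropagate may differ.
    """
    rng = random.Random(seed)  # local generator, global state untouched
    return rng.sample(list(indices), k)


if __name__ == "__main__":
    random.seed(123)
    before = random.getstate()
    picked = sample_replay(range(100), 5, seed=7)
    # The global RNG state is unchanged by the call above.
    print(random.getstate() == before)
    # The same seed yields the same sample on every call.
    print(picked == sample_replay(range(100), 5, seed=7))
```

Calling random.seed(seed) followed by random.sample(...) would produce equally reproducible samples, but it would also reset the global generator for every other caller in the process, which is the side effect this change avoids.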

Removed

  • Gradio web UI — moved to backpropagate/ui_gradio_legacy.py with a DEPRECATED docstring. Preserved for v1.1 reference; will be removed in v1.2. backpropagate.launch / create_backpropagate_theme / get_theme_info / get_css now raise ImportError with the migration message

Tests

1654 → 1766 (+112): regression tests for every Stage A/B contract that landed and every P0 feature that shipped. Coverage threshold holds at 50%.