Backpropagate v1.5.0

Released by github-actions on 31 May at 10:03 (commit 692f50d) · 25 commits to main since this release

Added

  • ORPO — reference-free preference tuning. --method orpo / Trainer(method="orpo") trains on {prompt, chosen, rejected} (or {chosen, rejected}) preference data via TRL's ORPOTrainer — single-stage, no reference model, so it fits the same single-GPU LoRA envelope as SFT. mode="lora" only in v1.5; the default learning rate auto-lowers to 8e-6; preference formats are auto-detected alongside the SFT formats. Ships a cross-version shim so it trains on transformers 4.x and 5.x (trl 0.24's ORPOTrainer writes model.warnings_issued, which transformers 5.x removed).
  • backprop data report <jsonl> — dataset-quality report. Exact + near-duplicate clusters (pure-stdlib MinHash/LSH), format distribution, token-length histogram, length outliers, empty / no-assistant turns, and train/test contamination (--against). Advisory by default; --fail-on-dups / --fail-on-contamination / --max-outlier-rate / --strict turn signals into hard gates (exit 65). --json for CI.
  • backprop eval <run-id> — post-train evaluation. Held-out loss + perplexity + N sample generations against a trained adapter; --vs diffs two runs; --gate-against rejects a held-out-loss regression (the seam the eval-gated merge consumes).
  • FP8 compute path (experimental). --fp8 routes base-weight GEMMs through torchao float8 on Blackwell/Hopper (sm_90+) — base in float8 (~1.4× throughput, ~60% less base memory), LoRA adapter + optimizer stay bf16, result still mergeable. New [fp8] extra. Gated to mode="lora" + method="sft"; degrades to bf16 with a warning on unsupported hardware (never a crash).
  • Multi-strategy, eval-gated merge framework for multi-run SLAO. --merge-strategy {qiao_mahdavi (default), linear, ties, dare}; a drift gate (--drift-gate) decides merge-vs-branch via LoRA-B cosine similarity; an eval gate (--eval-gate) rejects a merge that regresses held-out loss and restores the pre-merge accumulator. Default (qiao_mahdavi, gates off) is byte-identical to v1.4.
  • Adapter-native export. export --format ollama-adapter registers a LoRA adapter on a base via an Ollama FROM+ADAPTER Modelfile (safetensors, no merge); backprop ollama shelf <base> lists the adapters registered on a base.
  • rsLoRA. --use-rslora / use_rslora=True enables rank-stabilized LoRA (α/√r scaling) — zero inference cost, mergeable, benefit grows with rank.
  • Reasoning-trace SFT. --reasoning-trace keeps the <think>…</think> chain-of-thought in the training target (plain text — still mergeable/exportable, no embedding resize), filters empty / over-long / unbalanced traces, and raises the default max_seq_length to 8192. backprop data report now surfaces a <think> rate + a trace-length histogram.
  • MLX / Apple-Silicon backend (experimental). --backend {auto,cuda,mlx} + a standalone [mlx] extra route LoRA SFT through mlx_lm on Apple Silicon; CUDA stays canonical, auto picks by hardware. Built + unit-tested; pending dogfood verification on real Apple Silicon (mlx-lm is Apple-only and cannot run on the CUDA dev rig).
  • New error codes: RUNTIME_FP8_UNSUPPORTED, DEP_MLX_UNAVAILABLE, INPUT_EVAL_RUN_NOT_FOUND, INPUT_EVAL_HELDOUT_UNRESOLVED, RUNTIME_EVAL_FAILED, RUNTIME_EVAL_GATE_REGRESSED, INPUT_DATASET_REPORT_THRESHOLD.
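
The rsLoRA entry above says the benefit grows with rank. A minimal sketch of why, assuming the usual LoRA convention that the adapter update is multiplied by α/r, while rank-stabilized LoRA multiplies it by α/√r. The function names and the α value are illustrative and are not part of Backpropagate's API.

```python
import math


def lora_scale(alpha: float, rank: int) -> float:
    """Classic LoRA scaling factor: alpha / r."""
    return alpha / rank


def rslora_scale(alpha: float, rank: int) -> float:
    """Rank-stabilized LoRA scaling factor: alpha / sqrt(r)."""
    return alpha / math.sqrt(rank)


if __name__ == "__main__":
    alpha = 16.0  # illustrative value, not a Backpropagate default
    for rank in (8, 16, 64, 256):
        classic = lora_scale(alpha, rank)
        stabilized = rslora_scale(alpha, rank)
        # The ratio stabilized / classic equals sqrt(rank), so the gap
        # between the two conventions widens as the rank increases.
        print(f"r={rank:>3}  alpha/r={classic:.4f}  alpha/sqrt(r)={stabilized:.4f}")
```

With classic scaling the factor shrinks linearly in r, so a high-rank adapter's update is damped heavily; the √r form shrinks more slowly. Both are a fixed scalar that folds into the merged weights, which is consistent with the "zero inference cost, mergeable" note.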

Changed

  • README capability boundary reframed: "single-stage SFT + reference-free preference tuning (ORPO; SimPO/KTO planned); no online RL (PPO/GRPO/RLVR) — for those use TRL or LLaMA-Factory."
  • Reflex Web UI: recovery_banner / error_callout / status_pill / runs / run-detail render-compile fixes, plus a new CI dry-run compile-smoke gate (app._compile(dry_run=True)) so a render-time Var error can never ship green again.

Fixed

  • Composed re-audit remediation (v1.5 feature seams). backprop train --help and multi-run --help no longer crash: an unescaped % in the --fp8 help text raised an argparse TypeError, and non-cp1252 glyphs in three multi-run flags raised a UnicodeEncodeError on Windows consoles. A new test renders format_help() for every subparser. SFT on a preference dataset now trains on the chosen response instead of silently producing a 0-row dataset (and backprop eval works on such runs). The model-card reproduce command now reflects --method / --fp8 / --use-rslora / --reasoning-trace / --backend, so it actually reproduces the run. --fp8 on unsupported hardware no longer strips the operator's default 4-bit quantization (an OOM risk). orpo_beta is validated to be > 0. reasoning_trace is an honest no-op under ORPO and active on the MLX rail. The eval gate restores the accumulator on a mid-gate exception. FP8-trained merges are cast to bf16 before save (so the GGUF disk pre-check holds). The direction of the _count_tokens_approx CJK caveat was corrected.
  • mode="full" now performs genuine full fine-tuning instead of silently running QLoRA. v1.4.0 shipped mode="full" advertising full fine-tuning (every weight updated) for ≤3B models on a 16GB GPU, but BOTH model loaders (_load_with_transformers, _load_with_unsloth) applied 4-bit quantization + a LoRA adapter unconditionally — neither had a self.mode branch. _build_sft_config(mode="full") correctly dropped the adapter and switched to paged 8-bit AdamW + gradient checkpointing, but it was handed a model that was already 4-bit + get_peft_model'd, so TRL trained the LoRA adapter on a frozen 4-bit base. Net effect: mode="full" was QLoRA — the opposite of its documented contract — and the saved artifact was a LoRA adapter, not full weights. The loaders are now mode-aware: mode="full" loads full-precision weights (bf16 on Ampere+, fp16 on pre-Ampere, fp32 on CPU) with no BitsAndBytesConfig and no get_peft_model / prepare_model_for_kbit_training (transformers path), and full_finetuning=True + load_in_4bit=False with no adapter (Unsloth path); the plain full model goes straight to SFTTrainer. mode="lora" (the default) is byte-identical to before. This also fixes a crash where mode="full"'s 4-bit load + device_map="auto" had no CUDA guard (bitsandbytes 4-bit is CUDA-only) — full-precision loads now omit device_map on CPU runners. Verified end-to-end on an RTX 5090: a mode="full" train produces a 100%-trainable full bf16 model and saves model.safetensors (full weights), no adapter_config.json. Regression tests (tests/test_wave6b_features.py::TestModeFullLoadsGenuineFullModel) assert both loaders produce a non-PEFT, non-4-bit model and FAIL on the pre-fix loaders.
  • mode="full" parameter ceiling raised 3B → 4B so the marketed "3B" presets actually work. The v1.4.0 ceiling was _FULL_FT_PARAM_CEILING_BILLIONS = 3.0, but the load-time gate checks the model's authoritative num_parameters(), and every marketed "3B" model is really 3.08–3.24B (SmolLM3-3B 3.08B, Qwen2.5-3B 3.09B, Llama-3.2-3B 3.21B). As a result, Trainer(model="smollm3-3b", mode="full").load_model() raised RUNTIME_FULL_FT_MODEL_TOO_LARGE at load, and mode="full" was unusable for its headline target models (only ≤~2B passed). The ceiling is now 4.0B: the genuine ~3B presets fit a 16GB card, and the 3.8–4B class (Phi-4-mini-3.8B, Qwen-3.5-4B) is also admitted. Models in that class need a 24GB+ card, because weights + gradients alone approach 16GB of VRAM; this is documented in the README envelope table. To match, the RUNTIME_FULL_FT_MODEL_TOO_LARGE recovery hint no longer misdirects operators to presets above the ceiling, and a broken handbook/full-finetuning.md link in the hint was corrected to handbook/full-fine-tuning.md. The envelope was confirmed empirically on an RTX 5090 capped to 16GB: a SmolLM3-3B (3.08B) mode="full" step peaks at ~13.6GB of VRAM (100% of parameters trainable, not a PEFT adapter), comfortably inside the 16GB consumer-card budget.
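
The ceiling arithmetic in the last entry can be checked by hand. The sketch below uses the parameter counts quoted above and two assumptions that the notes do not state: the gate admits a model whose count is at or below the ceiling, and weights and gradients each take 2 bytes per parameter (bf16), measured in decimal GB. The helper names are hypothetical and are not Backpropagate functions.

```python
# Parameter counts in billions, as quoted in the release note above.
PRESETS_B = {
    "SmolLM3-3B": 3.08,
    "Qwen2.5-3B": 3.09,
    "Llama-3.2-3B": 3.21,
}


def admitted(params_b: float, ceiling_b: float) -> bool:
    """Assumed gate rule: admit when the real count is <= the ceiling."""
    return params_b <= ceiling_b


def weights_plus_grads_gb(params_b: float, bytes_per_value: int = 2) -> float:
    """Lower bound on memory: one weight plus one gradient per parameter.

    Assumes bf16 (2 bytes) for both and decimal GB. Optimizer state and
    activations are deliberately left out, so real usage is higher.
    """
    return params_b * bytes_per_value * 2


if __name__ == "__main__":
    for name, params_b in PRESETS_B.items():
        print(
            f"{name}: old ceiling 3.0 -> {admitted(params_b, 3.0)}, "
            f"new ceiling 4.0 -> {admitted(params_b, 4.0)}, "
            f"weights+grads >= {weights_plus_grads_gb(params_b):.2f} GB"
        )
    # Under these assumptions a 4.0B model needs 16 GB for weights and
    # gradients alone, in line with the note that the 3.8-4B class
    # calls for a 24GB+ card.
    print(f"4.0B: weights+grads >= {weights_plus_grads_gb(4.0):.2f} GB")
```

Under these assumptions every marketed "3B" preset fails a 3.0 ceiling and passes a 4.0 ceiling, and the estimate for SmolLM3-3B sits below the ~13.6GB peak reported above, leaving the remainder for optimizer state and activations.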