Added
ORPO: reference-free preference tuning. --method orpo / Trainer(method="orpo") trains on {prompt, chosen, rejected} (or {chosen, rejected}) preference data via TRL's ORPOTrainer. It is single-stage with no reference model, so it fits the same single-GPU LoRA envelope as SFT. Only mode="lora" is supported in v1.5; the default learning rate auto-lowers to 8e-6; preference formats are auto-detected alongside the SFT formats. Ships a cross-version shim so it trains on transformers 4.x and 5.x (trl 0.24's ORPOTrainer writes model.warnings_issued, which transformers 5.x removed).
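As a rough illustration of why ORPO needs no reference model, the sketch below computes the published ORPO objective for a single example: the SFT loss on the chosen response plus a weighted odds-ratio penalty. The function names and the beta value are illustrative assumptions, not this project's or TRL's code, and the log-probabilities are assumed to be length-normalized averages.

```python
import math

def log_odds(logp):
    """log(p / (1 - p)) given logp = log(p), for p strictly between 0 and 1."""
    return logp - math.log1p(-math.exp(logp))

def orpo_loss(nll_chosen, logp_chosen, logp_rejected, beta=0.1):
    """SFT loss on the chosen response plus a beta-weighted odds-ratio term.

    beta=0.1 is only an illustrative value (assumption).
    """
    log_odds_ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    # -log(sigmoid(x)) is the same as log(1 + exp(-x))
    odds_ratio_term = math.log1p(math.exp(-log_odds_ratio))
    return nll_chosen + beta * odds_ratio_term
```

Both terms depend only on the policy being trained, which is why a single model copy is enough.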
backprop data report <jsonl> — dataset-quality report. Exact + near-duplicate clusters (pure-stdlib MinHash/LSH), format distribution, token-length histogram, length outliers, empty / no-assistant turns, and train/test contamination (--against). Advisory by default; --fail-on-dups / --fail-on-contamination / --max-outlier-rate / --strict turn signals into hard gates (exit 65). --json for CI.
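For readers unfamiliar with MinHash, here is a minimal stdlib-only sketch of how near-duplicate similarity can be estimated. It is not the tool's implementation: the shingle size, the number of hash functions, and all function names are assumptions, and the LSH banding step that groups signatures into candidate clusters is omitted.

```python
import hashlib

def shingles(text, k=5):
    """Character k-shingles of lowercased, whitespace-normalized text."""
    norm = " ".join(text.lower().split())
    return {norm[i:i + k] for i in range(max(1, len(norm) - k + 1))}

def minhash_signature(text, num_perm=64):
    """Keep the minimum hash per salted hash function."""
    items = [s.encode("utf-8") for s in shingles(text)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(hashlib.blake2b(item, digest_size=8, salt=salt).digest(), "big")
            for item in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching slots estimates the Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two rows that differ by a character or two share almost all shingles, so most signature slots match; unrelated rows match in almost none.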
backprop eval <run-id> — post-train evaluation. Held-out loss + perplexity + N sample generations against a trained adapter; --vs diffs two runs; --gate-against rejects a held-out-loss regression (the seam the eval-gated merge consumes).
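The relationship between the two reported numbers, and the shape of a regression gate, can be sketched as follows. Perplexity is the exponential of the mean held-out loss; the gate function and its tolerance parameter are hypothetical and only illustrate the idea of rejecting a run whose held-out loss is worse than a baseline.

```python
import math

def perplexity(mean_heldout_loss):
    """Perplexity is exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_heldout_loss)

def gate_passes(candidate_loss, baseline_loss, tolerance=0.0):
    """Hypothetical gate: pass only if the candidate is no worse than the baseline."""
    return candidate_loss <= baseline_loss + tolerance
```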
FP8 compute path (experimental). --fp8 routes base-weight GEMMs through torchao float8 on Blackwell/Hopper (sm_90+): the base runs in float8 (~1.4× throughput, ~60% less base memory), the LoRA adapter and optimizer stay bf16, and the result is still mergeable. New [fp8] extra. Gated to mode="lora" + method="sft"; degrades to bf16 with a warning on unsupported hardware (never a crash).
Multi-strategy, eval-gated merge framework for multi-run SLAO. --merge-strategy {qiao_mahdavi (default), linear, ties, dare}; a drift gate (--drift-gate) decides merge-vs-branch via LoRA-B cosine similarity; an eval gate (--eval-gate) rejects a merge that regresses held-out loss and restores the pre-merge accumulator. Default (qiao_mahdavi, gates off) is byte-identical to v1.4.
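A drift gate of this kind can be pictured with the sketch below, which compares two LoRA-B matrices by cosine similarity. Everything here is an assumption for illustration: the threshold value, the function names, and in particular the direction of the decision (high similarity leads to a merge, low similarity to a branch), which the notes above do not spell out.

```python
import math

def flatten(matrix):
    return [value for row in matrix for value in row]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero matrix as having no measurable direction
    return dot / (norm_a * norm_b)

def merge_or_branch(prev_b, new_b, threshold=0.5):
    """Assumed rule: similar update directions merge, divergent ones branch."""
    similarity = cosine_similarity(flatten(prev_b), flatten(new_b))
    return "merge" if similarity >= threshold else "branch"
```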
Adapter-native export. export --format ollama-adapter registers a LoRA adapter on a base via an Ollama FROM+ADAPTER Modelfile (safetensors, no merge); backprop ollama shelf <base> lists the adapters registered on a base.
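For orientation, a FROM+ADAPTER Modelfile is just two instructions. The helper below is a hypothetical sketch of rendering one; the function name, base model name, and adapter path are placeholders, and the exporter's actual output may contain more than this.

```python
def render_adapter_modelfile(base_model, adapter_path):
    """Render a minimal Ollama Modelfile that layers an adapter on a base model."""
    return f"FROM {base_model}\nADAPTER {adapter_path}\n"

# Placeholder values, for illustration only.
print(render_adapter_modelfile("llama3.2", "./adapter"))
```

A file like this is what `ollama create <name> -f Modelfile` consumes.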
rsLoRA. --use-rslora / use_rslora=True enables rank-stabilized LoRA (α/√r scaling): zero inference cost, mergeable, benefit grows with rank.
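The difference from standard LoRA is only the scaling factor applied to the adapter update, which a short sketch makes concrete. The function name is illustrative; standard LoRA scales by α/r, while rsLoRA scales by α/√r.

```python
import math

def lora_scale(alpha, r, use_rslora=False):
    """Scaling applied to the low-rank update B @ A."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r

for rank in (8, 64, 256):
    print(rank, lora_scale(16, rank), lora_scale(16, rank, use_rslora=True))
```

At α = 16 the standard factor shrinks from 2.0 to 0.0625 as the rank goes from 8 to 256, while the rank-stabilized factor only drops to 1.0, which is why the benefit grows with rank.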
Reasoning-trace SFT. --reasoning-trace keeps the <think>…</think> chain-of-thought in the training target (plain text, so still mergeable/exportable with no embedding resize), filters empty / over-long / unbalanced traces, and raises the default max_seq_length to 8192. backprop data report now surfaces a <think> rate + a trace-length histogram.
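The three filter conditions can be sketched as below. This is an assumption-laden illustration rather than the shipped filter: the function name is hypothetical, length is measured in characters as a stand-in for tokens, and rows without any <think> block are assumed to pass through.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def trace_problem(text, max_trace_chars=4000):
    """Return None if the trace is usable, else the reason it would be filtered."""
    opens, closes = text.count("<think>"), text.count("</think>")
    traces = THINK_BLOCK.findall(text)
    # Mismatched counts, nesting, or a close before an open all count as unbalanced.
    if opens != closes or len(traces) != opens:
        return "unbalanced"
    for trace in traces:
        if not trace.strip():
            return "empty"
        if len(trace) > max_trace_chars:
            return "over-long"
    return None
```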
MLX / Apple-Silicon backend (experimental). --backend {auto,cuda,mlx} + a standalone [mlx] extra route LoRA SFT through mlx_lm on Apple Silicon; CUDA stays canonical, auto picks by hardware. Built + unit-tested; pending dogfood verification on real Apple Silicon (mlx-lm is Apple-only and cannot run on the CUDA dev rig).
New error codes: RUNTIME_FP8_UNSUPPORTED, DEP_MLX_UNAVAILABLE, INPUT_EVAL_RUN_NOT_FOUND, INPUT_EVAL_HELDOUT_UNRESOLVED, RUNTIME_EVAL_FAILED, RUNTIME_EVAL_GATE_REGRESSED, INPUT_DATASET_REPORT_THRESHOLD.
Changed
README capability boundary reframed: "single-stage SFT + reference-free preference tuning (ORPO; SimPO/KTO planned); no online RL (PPO/GRPO/RLVR) — for those use TRL or LLaMA-Factory."
Reflex Web UI: recovery_banner / error_callout / status_pill / runs / run-detail render-compile fixes, plus a new CI dry-run compile-smoke gate (app._compile(dry_run=True)) so a render-time Var error can never ship green again.
Fixed
Composed re-audit remediation (v1.5 feature seams). backprop train --help and multi-run --help no longer crash (an unescaped % in --fp8 help → argparse TypeError; non-cp1252 glyphs in three multi-run flags → Windows-console UnicodeEncodeError); a new test renders format_help() for every subparser. SFT on a preference dataset now trains on the chosen response instead of silently producing a 0-row dataset (and backprop eval works on such runs). The model-card reproduce command now reflects --method / --fp8 / --use-rslora / --reasoning-trace / --backend so it actually reproduces the run. --fp8 on unsupported hardware no longer strips the operator's default 4-bit (OOM risk); orpo_beta is validated > 0; reasoning_trace is an honest no-op under ORPO and active on the MLX rail; the eval gate restores the accumulator on a mid-gate exception; FP8-trained merges cast to bf16 before save (so the GGUF disk pre-check holds); the _count_tokens_approx CJK caveat direction was corrected.
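The --help crash class is easy to reproduce in isolation, because argparse applies %-formatting to help strings. The sketch below is a minimal stand-in, not the project's test: the parser layout and helper names are hypothetical. It renders help for every parser it builds and catches both exception types, since newer Python versions may reject a malformed help string earlier, when the argument is added.

```python
import argparse

def build_parsers(fp8_help):
    """Build a toy CLI and return every parser so each can be rendered."""
    root = argparse.ArgumentParser(prog="demo")
    sub = root.add_subparsers(dest="command")
    train = sub.add_parser("train")
    train.add_argument("--fp8", action="store_true", help=fp8_help)
    return {"demo": root, "train": train}

def help_renders(fp8_help):
    """True if every parser's help text can be formatted without raising."""
    try:
        for parser in build_parsers(fp8_help).values():
            parser.format_help()
    except (TypeError, ValueError):
        return False
    return True

print(help_renders("about 60% less base memory"))   # unescaped percent sign
print(help_renders("about 60%% less base memory"))  # escaped as %%
```

Doubling the percent sign is the standard escape; rendering help for every subparser in a test catches the problem before a user runs --help.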
mode="full" now performs genuine full fine-tuning instead of silently running QLoRA. v1.4.0 shipped mode="full" advertising full fine-tuning (every weight updated) for ≤3B models on a 16GB GPU, but both model loaders (_load_with_transformers, _load_with_unsloth) applied 4-bit quantization + a LoRA adapter unconditionally; neither had a self.mode branch. _build_sft_config(mode="full") correctly dropped the adapter and switched to paged 8-bit AdamW + gradient checkpointing, but it was handed a model that was already 4-bit + get_peft_model'd, so TRL trained the LoRA adapter on a frozen 4-bit base. Net effect: mode="full" was QLoRA, the opposite of its documented contract, and the saved artifact was a LoRA adapter, not full weights. The loaders are now mode-aware: mode="full" loads full-precision weights (bf16 on Ampere+, fp16 on pre-Ampere, fp32 on CPU) with no BitsAndBytesConfig and no get_peft_model / prepare_model_for_kbit_training (transformers path), and full_finetuning=True + load_in_4bit=False with no adapter (Unsloth path); the plain full model goes straight to SFTTrainer. mode="lora" (the default) is byte-identical to before. This also fixes a crash where mode="full"'s 4-bit load + device_map="auto" had no CUDA guard (bitsandbytes 4-bit is CUDA-only); full-precision loads now omit device_map on CPU runners. Verified end-to-end on an RTX 5090: a mode="full" train produces a 100%-trainable full bf16 model and saves model.safetensors (full weights), with no adapter_config.json. Regression tests (tests/test_wave6b_features.py::TestModeFullLoadsGenuineFullModel) assert both loaders produce a non-PEFT, non-4-bit model and FAIL on the pre-fix loaders.
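The precision rule described above (bf16 on Ampere+, fp16 on pre-Ampere, fp32 on CPU, no device_map on CPU) can be sketched as a small pure function. This is a hypothetical helper, not the loader's code: the name, the string dtype labels, and the choice of device_map="auto" on CUDA are assumptions. Ampere corresponds to CUDA compute capability 8.x.

```python
def full_precision_load_kwargs(has_cuda, compute_capability=None):
    """Pick load settings for a full fine-tune (no quantization, no adapter).

    compute_capability is a (major, minor) tuple, e.g. (8, 6); ignored on CPU.
    """
    if not has_cuda:
        # CPU runners: fp32 and no device_map at all.
        return {"dtype": "float32"}
    major = compute_capability[0]
    dtype = "bfloat16" if major >= 8 else "float16"  # Ampere is 8.x
    return {"dtype": dtype, "device_map": "auto"}  # device_map on CUDA is an assumption
```

The point of the fix is that this decision branches on the training mode at load time, rather than every mode receiving a 4-bit, adapter-wrapped model.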
mode="full" parameter ceiling raised 3B → 4B so the marketed "3B" presets actually work. The v1.4.0 ceiling was _FULL_FT_PARAM_CEILING_BILLIONS = 3.0, but the load-time gate checks the model's authoritative num_parameters(), and every marketed "3B" model is really 3.08–3.24B (SmolLM3-3B 3.08B, Qwen2.5-3B 3.09B, Llama-3.2-3B 3.21B). So Trainer(model="smollm3-3b", mode="full").load_model() raised RUNTIME_FULL_FT_MODEL_TOO_LARGE at load, and mode="full" was unusable for its headline target models (only ≤~2B passed). The ceiling is now 4.0B: the genuine ~3B presets fit a 16GB card, and the 3.8–4B class (Phi-4-mini-3.8B, Qwen-3.5-4B) is also admitted; those need a 24GB+ card for the VRAM (weights + gradients alone approach 16GB), as documented in the README envelope table. To match, the RUNTIME_FULL_FT_MODEL_TOO_LARGE recovery hint no longer misdirects operators to presets above the ceiling, and a broken handbook/full-finetuning.md link in the hint was corrected to handbook/full-fine-tuning.md. Envelope confirmed empirically on an RTX 5090 capped to 16GB: a SmolLM3-3B (3.08B) mode="full" step peaks at ~13.6GB of VRAM (100% of parameters trainable, not a PEFT adapter), comfortably inside the 16GB consumer-card budget.
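The ceiling check and the "weights + gradients" arithmetic can be reproduced with a few lines. The helper names are hypothetical, the inclusive comparison is an assumption, and the estimate covers only bf16 weights and gradients at 2 bytes each with 1 GB taken as 1e9 bytes; optimizer state and activations come on top.

```python
def within_ceiling(num_params, ceiling_billions=4.0):
    """Compare the real parameter count, not the marketing name, to the ceiling."""
    return num_params / 1e9 <= ceiling_billions

def weights_plus_grads_gb(num_params, bytes_per_value=2):
    """bf16 weights plus same-sized gradients, in GB (1e9 bytes)."""
    return 2 * num_params * bytes_per_value / 1e9

for name, params in [("SmolLM3-3B", 3.08e9), ("Phi-4-mini-3.8B", 3.8e9)]:
    print(name, within_ceiling(params, 3.0), within_ceiling(params), weights_plus_grads_gb(params))
```

Under these assumptions a 3.08B model fails a 3.0B ceiling but passes at 4.0B with about 12.3GB for weights and gradients, while a 3.8B model already needs about 15.2GB before anything else is counted, which matches the 24GB+ recommendation.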