Features
SFT default loss_type is now "chunked_nll"
The flip announced in v1.6 has landed. Setting loss_type is optional, and the default now resolves to "chunked_nll" — giving every SFTTrainer run ~30% less peak VRAM on average (up to ~50% on large-vocab models) with wall-clock time neutral or slightly faster. No action needed.
The auto-resolve falls back to "nll" when use_liger_kernel=True (the two paths are incompatible). If you want the old behavior — e.g. for custom heads — pin it explicitly:
SFTConfig(loss_type="nll")
by @qgallouedec in #5846
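To make the memory claim concrete, here is a minimal pure-Python sketch of the idea behind a chunked negative log-likelihood. It assumes the saving comes from projecting hidden states to vocabulary logits one chunk of tokens at a time rather than for the whole sequence at once. The helper names (`project`, `token_nll`, `nll_full`, `nll_chunked`) are illustrative and are not TRL functions; the real implementation operates on tensors and may differ in detail.

```python
import math

IGNORE_INDEX = -100  # label value that marks a token as not trainable


def project(hidden, lm_head):
    """Logits for one token: dot product of the hidden vector with each vocab row."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in lm_head]


def token_nll(logits, label):
    """Negative log-likelihood of `label` under softmax(logits), computed stably."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_z - logits[label]


def nll_full(hiddens, lm_head, labels):
    """Reference path: materializes logits for every token before reducing."""
    all_logits = [project(h, lm_head) for h in hiddens]  # tokens x vocab at once
    losses = [token_nll(row, y) for row, y in zip(all_logits, labels) if y != IGNORE_INDEX]
    return sum(losses) / len(losses)


def nll_chunked(hiddens, lm_head, labels, chunk_size=2):
    """Chunked path: at most chunk_size rows of logits exist at any time."""
    total, count = 0.0, 0
    for start in range(0, len(hiddens), chunk_size):
        stop = start + chunk_size
        chunk_logits = [project(h, lm_head) for h in hiddens[start:stop]]
        for row, y in zip(chunk_logits, labels[start:stop]):
            if y != IGNORE_INDEX:
                total += token_nll(row, y)
                count += 1
    return total / count
```

Both functions assume at least one trainable token. They return the same mean loss; only the number of logit rows held in memory at once differs, which is why the saving grows with vocabulary size.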
MoE auxiliary loss in GRPO / RLOO / AsyncGRPO
Post-training MoE models now correctly include the router load-balancing auxiliary loss, matching the model's own reference forward and SFTTrainer. Enable via model_init_kwargs:
GRPOConfig(
    ...,
    model_init_kwargs={"output_router_logits": True, "router_aux_loss_coef": 0.001},
)
The auxiliary loss is plumbed through _get_per_token_logps_and_entropies (which now returns a 3-tuple including aux_loss), folded into the policy loss with gradient-accumulation scaling matched per trainer, and logged as aux_loss. AsyncGRPO recomputes it via load_balancing_loss_func in the chunked LM-head path (same as SFT's chunked path).
by @AmineDiro in #6083, plus router_aux_loss_coef config wiring by @qgallouedec in #6085
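For readers unfamiliar with the router auxiliary loss, the sketch below shows a simplified, top-1, Switch-Transformer-style load-balancing term and how it might be folded into a policy loss. It is an illustration under stated assumptions, not the load_balancing_loss_func from transformers, which handles top-k routing, multiple layers, and attention masks. The function names are hypothetical.

```python
def load_balancing_loss(router_probs, num_experts):
    """Simplified top-1 auxiliary loss: num_experts * sum_e f_e * p_e.

    router_probs: one list of per-expert probabilities for each token.
    """
    num_tokens = len(router_probs)
    # f[e]: fraction of tokens whose highest-probability expert is e.
    counts = [0] * num_experts
    for probs in router_probs:
        top_expert = max(range(num_experts), key=probs.__getitem__)
        counts[top_expert] += 1
    f = [c / num_tokens for c in counts]
    # p[e]: mean router probability assigned to expert e across tokens.
    p = [sum(probs[e] for probs in router_probs) / num_tokens for e in range(num_experts)]
    return num_experts * sum(fe * pe for fe, pe in zip(f, p))


def total_loss(policy_loss, aux_loss, router_aux_loss_coef=0.001):
    """Fold the auxiliary term into the policy loss with a small coefficient."""
    return policy_loss + router_aux_loss_coef * aux_loss
```

In this simplified form the term is smallest when tokens are spread evenly across experts and grows as routing collapses onto a few experts, which is the pressure the coefficient controls.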
New experimental GMPO trainer
Geometric-Mean Policy Optimization lands as an experimental trainer. Replaces GRPO's per-token arithmetic mean of importance ratios with a sequence-level geometric mean (mean of clipped log-ratios, then exp); clipping is one-sided by advantage sign and applied in log space. Default epsilon=0.4 per the paper.
from trl.experimental.gmpo import GMPOConfig, GMPOTrainer
trainer = GMPOTrainer(
    model="Qwen/Qwen3-4B",
    args=GMPOConfig(epsilon=0.4),
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
)
by @raghulchandramouli in #6078
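The sequence-level objective described above can be sketched in a few lines of plain Python. This is one reading of the description (mean of clipped log-ratios, then exp, with one-sided clipping chosen by the sign of the advantage), not the trainer's actual code. The function name and the single symmetric `epsilon` are assumptions made for illustration.

```python
import math


def gmpo_sequence_objective(logp_new, logp_old, advantage, epsilon=0.4):
    """Advantage-weighted geometric mean of per-token importance ratios.

    logp_new / logp_old: per-token log-probabilities under the current and
    the sampling policy for one completion.
    """
    log_ratios = [new - old for new, old in zip(logp_new, logp_old)]
    if advantage >= 0:
        # Positive advantage: cap how far a token's ratio can rise.
        clipped = [min(r, epsilon) for r in log_ratios]
    else:
        # Negative advantage: cap how far a token's ratio can fall.
        clipped = [max(r, -epsilon) for r in log_ratios]
    # Mean in log space followed by exp is the geometric mean of the ratios.
    geometric_mean_ratio = math.exp(sum(clipped) / len(clipped))
    return advantage * geometric_mean_ratio
```

A training loss would be the negative of this objective averaged over completions. Because the mean is taken in log space, a single outlier token moves the sequence-level ratio less than it would under an arithmetic mean.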
Transformers continuous batching in GRPO / RLOO
use_transformers_paged was deprecated in v1.4; it's now replaced with proper transformers continuous batching. The old branch silently bypassed importance-sampling correction (logprobs = None); the new path captures logprobs from output.logprobs and exposes a ContinuousBatchingConfig for KV-cache tuning.
GRPOConfig(
    ...,
    use_transformers_continuous_batching=True,
    transformers_continuous_batching_config={
        "use_cuda_graph": False,
        "max_memory_percent": 0.4,  # leave headroom for training
    },
)
Benchmark (Llama-3.2-1B-Instruct, A100 80GB, GSM8K): 1.25× faster at N=64 generations, with 16 GB lower peak VRAM than the default generate(). Use when N ≥ 32 with variable completion lengths.
use_transformers_paged=True still works and forwards to the new flag with a FutureWarning. Requires transformers>=5.8.0.
by @sergiopaniego in #5765
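The backward-compatible forwarding of the old flag follows a common deprecation pattern, sketched here with a hypothetical stand-in dataclass. `GenerationFlags` is not the real GRPOConfig, and the warning text is invented for illustration. Only the two field names come from the notes above.

```python
import warnings
from dataclasses import dataclass


@dataclass
class GenerationFlags:
    """Hypothetical stand-in for the relevant config fields."""

    use_transformers_paged: bool = False  # deprecated spelling
    use_transformers_continuous_batching: bool = False

    def __post_init__(self):
        if self.use_transformers_paged:
            # Keep old configs working, but tell the user which flag to use.
            warnings.warn(
                "use_transformers_paged is deprecated; "
                "use use_transformers_continuous_batching instead.",
                FutureWarning,
                stacklevel=2,
            )
            self.use_transformers_continuous_batching = True
```

Existing configs that still set the old flag would keep running under this pattern, while the warning points users to the new name.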
AsyncGRPO: native weight sync with vLLM ≥ 0.22.0
WeightTransferClient now drives vLLM's native 4-phase RL weight-transfer API instead of the older 2-call flow: pause(mode="keep") → start_weight_update → threaded update_weights + NCCL broadcast → finish_weight_update → resume. Validated end-to-end on H100 across single-node, FSDP2×4 + TP=4, and 2-node FSDP2×4 + DP=2×TP=4 (weight-sync time ≈ 0.18-0.8 s).
by @AmineDiro in #5892
Padding-free training in AsyncGRPO
AsyncGRPO now supports the same padding-free path SFT already had. Flattens the batch and uses position_ids-based document boundaries instead of right-padding to the longest sequence — meaningful speedup and memory savings on heterogeneous-length workloads.
by @qgallouedec in #5854
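As a rough illustration of the flattening step, the sketch below packs variable-length sequences into one row and restarts position_ids at each document boundary, next to a right-padding baseline for comparison. The helper names and the `cu_seqlens` return value are assumptions for the example. The actual AsyncGRPO code path works on tensors and is not reproduced here.

```python
def flatten_padding_free(sequences):
    """Concatenate sequences; position_ids restart at 0 for each document."""
    input_ids, position_ids, cu_seqlens = [], [], [0]
    for seq in sequences:
        input_ids.extend(seq)
        position_ids.extend(range(len(seq)))
        # Cumulative lengths mark where each document starts and ends.
        cu_seqlens.append(cu_seqlens[-1] + len(seq))
    return input_ids, position_ids, cu_seqlens


def pad_right(sequences, pad_id=0):
    """Baseline: pad every sequence to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (width - len(seq)) for seq in sequences]
```

With lengths 3, 1 and 2, the padded batch holds 9 token slots while the flattened one holds 6, and a position_id of 0 marks each boundary. The gap widens as sequence lengths become more heterogeneous.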
Experimental Harbor integration
A new trl.experimental.harbor adapter plugs Harbor agentic task suites into GRPOTrainer via environment_factory. Same pattern as the OpenReward integration — one spec wires all three trainer slots:
from trl import GRPOConfig, GRPOTrainer
from trl.experimental.harbor import HarborSpec
spec = HarborSpec("AdithyaSK/data_agent_rl_environment_train", agent="bash", num_tasks=64)
trainer = GRPOTrainer(
    model="Qwen/Qwen3-4B",
    args=GRPOConfig(num_generations=8, max_steps=50, max_tool_calling_iterations=25),
    train_dataset=spec.train_dataset,
    environment_factory=spec.environment_factory,
    reward_funcs=spec.reward_funcs,
)
Built-in bash harness, plus jupyter and terminal_notes example harnesses. Gated by the new trl[harbor] extra.
by @adithya-s-k in #6018
trust_remote_code in trainer configs
A single trust_remote_code: bool = False field on the trainer configs now covers the whole load surface — model, processor / tokenizer, reference model, reward model, reward tokenizer, teacher — instead of forcing users to thread it through several independent kwarg dicts.
SFTConfig(trust_remote_code=True)
ModelConfig.trust_remote_code is removed to avoid duplicate --trust_remote_code when combining dataclasses; CLI behavior is unchanged.
by @qgallouedec in #5802
KTO ↔ DPO alignment: tests, evaluate, sync_ref_model
The last alignment cycle before graduation: KTO now has parity with DPO on pad_to_multiple_of, sync_ref_model, evaluate(), method order/signature, metric placement (all moved into _compute_loss), and a real text + VLM (including multi-image) test suite.
PRs all by @albertvillanova: #6029, #6030, #6033, #6034, #6035, #6080, #6093, #6148, #6149, #6150, #6152, #6160, #6163.
LFM2-VL multimodal inputs in GRPO / RLOO
GRPO and RLOO now support LFM2-VL multimodal inputs end-to-end.
by @zwischenraum in #6114
New built-in reward helpers
- get_repetition_penalty_reward by @qgallouedec in #6058
- get_cosine_scaled_reward by @qgallouedec in #6066
SFT refactor: build labels during dataset preparation
Label construction moves out of the collator and into dataset preparation, so "what's trainable" is defined in exactly one place. A single batched map produces a labels column where each token keeps its ID when every applicable mask is 1, else -100. Plain LM stays storage-neutral; pre-tokenized datasets with mask columns now go through the same path. Step 1 toward fixing #3927.
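The labeling rule stated above (keep the token ID where every applicable mask is 1, otherwise -100) can be sketched as follows. `build_labels` is a hypothetical helper for one example, not the batched map TRL uses, and the mask names in the usage note are illustrative.

```python
IGNORE_INDEX = -100  # label value that marks a token as not trainable


def build_labels(input_ids, *masks):
    """Return labels that keep a token ID only where all masks are 1."""
    for mask in masks:
        if len(mask) != len(input_ids):
            raise ValueError("each mask must have the same length as input_ids")
    return [
        token if all(mask[i] == 1 for mask in masks) else IGNORE_INDEX
        for i, token in enumerate(input_ids)
    ]
```

For example, `build_labels([5, 6, 7, 8], completion_mask, assistant_mask)` with masks `[0, 1, 1, 1]` and `[1, 1, 0, 1]` keeps only the second and fourth tokens. In this sketch, calling it with no masks leaves every token trainable.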
Idefics3 chat template
{% generation %}-marker training template for Idefics3, enabling assistant_only_loss=True.
vLLM version sweep
- Support vLLM 0.19.1 by @qgallouedec in #6107
- Support vLLM 0.20.0 by @qgallouedec in #6108
- Support vLLM 0.22.1 by @qgallouedec in #6119
- Support vLLM 0.23.0 by @qgallouedec in #6153
- Drop vLLM 0.12 by @qgallouedec in #6109
- Drop vLLM 0.13 by @qgallouedec in #6154
Other
- Normalize JSD distillation loss by num_items_in_batch for gradient accumulation by @behroozazarkhalili in #6006
- [AsyncGRPO] Rollout worker: set aiohttp limit to max(100, max_inflight_tasks) by @ggcr in #5861
- Align SDPO with GRPO/RLOO: drop NaN values when averaging logged metrics by @anshulkulhari7 in #6055
- Make evaluate() accept the same dataset types as the trainer by @qgallouedec in #6116
- Keep extra columns in unpair_preference_dataset by @albertvillanova in #6161
- Warn when sequence-level importance sampling is combined with a token-summed loss type by @discobot in #6042
- fix(profiling): log ProfilingContext metrics to Trackio backend by @Anai-Guo in #5979
- Move experimental example scripts out of the packaged tree by @sergiopaniego in #6141
- Remove redundant .contiguous() calls by @qgallouedec in #6045 and #6046
- Harmonize logger imports by @qgallouedec in #6142
Fixes
- Share frozen layers with reference model instead of duplicating in memory: create_reference_model with num_shared_layers was double-allocating the "shared" frozen layers because the loop never assigned _ref_param back. Now it does, so shared layers are held once. By @behroozazarkhalili in #6053
- Fix chunked_nll mixed Tensor/DTensor error under FSDP2 + PEFT by @albertvillanova in #6065
- Fix per-chunk lm_head.weight all-gathers under FSDP2 + chunked_nll by @albertvillanova in #6077
- Fix ZeRO-3 + PEFT mixed-dtype error for core trainers by @albertvillanova in #6091 and KTO by @albertvillanova in #6093
- Fix ref adapter creation when the LoRA config uses target_parameters by @discobot in #6043
- Raise when use_liger_kernel is combined with a PEFT adapter on lm_head by @akshansh47 in #5977
- [fix] GLM-4-MoE template: turn-terminating token to the turn itself by @qgallouedec in #6044
- Fix unpair_preference_dataset dropping extra columns by @albertvillanova in #6059
- fix(gold): preserve vllm prompt special tokens by @he-yufeng in #6063
- fix(sft): reject transformed datasets during preparation by @he-yufeng in #6054
- Fix broken light-mode banner image in README by @strickvl in #6112
- Fix BrowserGym OpenEnv example dependency and task wiring by @burtenshaw in #6117
- Fix signature of _unpair_row by @albertvillanova in #6062
- Remove silently-ignored W&B/Hub fields from GOLD and Distillation configs by @DaoyuanLi2816 in #6023
Documentation and Examples
- docs: sync experimental trainer docstrings with their __init__ signatures by @DaoyuanLi2816 in #6011
- docs: fix stale default values in config docstrings by @DaoyuanLi2816 in #6015
- docs: fix function docstrings that drifted from their signatures by @DaoyuanLi2816 in #6022
- Fix broken doc links in paper_index (experimental page was split) by @DaoyuanLi2816 in #6070
- Fix broken import in GOLD trainer docs (GOLD is experimental) by @DaoyuanLi2816 in #6073
- Fix broken section anchors in docs by @DaoyuanLi2816 in #6071
- Fix broken example link in BCO trainer docs by @DaoyuanLi2816 in #6084
- docs(rloo): add clarifying comment on KL penalty formula divergence from GRPO by @abderahmane-ai in #6096
- Fix async GRPO docs to require vllm>=0.22.0 by @sergiopaniego in #6101
- Fix broken internal doc link to GKD Trainer in MiniLLM docs by @ShamSaleem in #6131
- Align epsilon help/docstring wording by @qgallouedec in #6014
- Fix style of optional parameters in docstrings by @albertvillanova in #6081
- Align format of code examples in docstrings by @albertvillanova in #6147
- Add license to Harbor examples by @qgallouedec in #6104
CI
- Extract VLM tests into dedicated classes for core trainers by @albertvillanova in #6033
- Add raises test with vision dataset and text model for DPO and SFT by @albertvillanova in #6026
- Extract _push_param_to_vllm helper in VLLMGeneration by @albertvillanova in #6004
- Use relative imports in async_grpo to match the rest of trl/experimental by @qgallouedec in #6012
- Align AsyncGRPO with GRPO: num_completions_to_print, epsilon_high fallback, logging_steps docstring, loss variable names, clip-ratio metrics by @qgallouedec in #6020, #6019, #6016, #6013 and #6021
- Add vision requirement marker to Idefics3 test parameter by @qgallouedec in #6106
- Update distributed SFT test after default chunked_nll loss by @albertvillanova in #6074
- test: bound memory in test_gkd_trainer_with_liger to avoid OOM on shared runners by @behroozazarkhalili in #6103
- Fix sft_fa2 invariant test by @qgallouedec in #6069
- Pin SHA instead of version tag for CI actions/checkout, astral-sh/setup-uv, actions/setup-python, pre-commit/action by @albertvillanova in #6097, #6098, #6099 and #6100
- Replace parse_version with Version by @albertvillanova in #6164
- Hotfix CI: temporarily pin deepspeed < 0.19.2 by @albertvillanova in #6090
- chore: update tests_transformers_branch.yml by @hf-security-analysis[bot] in #6051
- chore: update clear_cache.yml by @hf-security-analysis[bot] in #6047
- chore: update docker-build.yml by @hf-security-analysis[bot] in #6048
- Delete CI pr_style_bot workflow by @albertvillanova in #6082
- Remove issue labeller by @qgallouedec in #6052
- Bump the actions group with 4 updates by @dependabot[bot] in #6041
- Change _get_train_sampler comment to mention num_iterations > 1 by @anidoesdev in #6125
New Contributors
- @Anai-Guo made their first contribution in #5979
- @ggcr made their first contribution in #5861
- @he-yufeng made their first contribution in #6063
- @discobot made their first contribution in #6043
- @anshulkulhari7 made their first contribution in #6055
- @abderahmane-ai made their first contribution in #6096
- @akshansh47 made their first contribution in #5977
- @strickvl made their first contribution in #6112
- @ShamSaleem made their first contribution in #6131
- @0xadvait made their first contribution in #6037
- @anidoesdev made their first contribution in #6125
- @zwischenraum made their first contribution in #6114
What's Changed
- ⬆️ Bump dev version by @qgallouedec in #6010
- docs: sync experimental trainer docstrings with their __init__ signatures by @DaoyuanLi2816 in #6011
- docs: fix stale default values in config docstrings by @DaoyuanLi2816 in #6015
- fix(profiling): log ProfilingContext metrics to Trackio backend by @Anai-Guo in #5979
- docs: fix function docstrings that drifted from their signatures by @DaoyuanLi2816 in #6022
- Add raises test with vision dataset and text model for DPO and SFT by @albertvillanova in #6026
- Align KTO with DPO: Unwrap VLM batch dimension for text-only data in _tokenize by @albertvillanova in #6029
- Extract VLM tests into dedicated classes for core trainers by @albertvillanova in #6033
- Extract _push_param_to_vllm helper in VLLMGeneration by @albertvillanova in #6004
- Align KTO with DPO: Add tests for text data collator by @albertvillanova in #6034
- Align KTO with DPO: Support pad_to_multiple_of by @albertvillanova in #6035
- Align KTO with DPO: Add VLM tests by @albertvillanova in #6030
- Normalize JSD distillation loss by num_items_in_batch for gradient accumulation by @behroozazarkhalili in #6006
- Bump the actions group with 4 updates by @dependabot[bot] in #6041
- [fix] GLM-4-MoE template: turn-terminating token to the turn itself by @qgallouedec in #6044
- fix: share frozen layers with reference model instead of duplicating in memory by @behroozazarkhalili in #6053
- Default SFT loss to chunked_nll by @qgallouedec in #5846
- Remove redundant .contiguous() calls by @qgallouedec in #6045
- Remove silently-ignored W&B/Hub fields from GOLD and Distillation configs by @DaoyuanLi2816 in #6023
- [AsyncGRPO] Rollout worker: set aiohttp limit to max(100, max_inflight_tasks) by @ggcr in #5861
- async grpo native weight sync with vllm>=0.22.0 by @AmineDiro in #5892
- Fix unpair_preference_dataset dropping extra columns by @albertvillanova in #6059
- fix(gold): preserve vllm prompt special tokens by @he-yufeng in #6063
- Fix broken doc links in paper_index (experimental page was split) by @DaoyuanLi2816 in #6070
- chore: update tests_transformers_branch.yml by @hf-security-analysis[bot] in #6051
- Fix chunked_nll mixed Tensor/DTensor error under FSDP2 + PEFT by @albertvillanova in #6065
- Fix signature of _unpair_row by @albertvillanova in #6062
- fix(sft): reject transformed datasets during preparation by @he-yufeng in #6054
- Fix broken import in GOLD trainer docs (GOLD is experimental) by @DaoyuanLi2816 in #6073
- Fix broken section anchors in docs (slug mismatches, renamed/removed sections) by @DaoyuanLi2816 in #6071
- Update distributed SFT test after default chunked_nll loss by @albertvillanova in #6074
- Add GMPO (Geometric-Mean Policy Optimization) experimental trainer by @raghulchandramouli in #6078
- Fix style of optional parameters in docstrings by @albertvillanova in #6081
- Align KTO with DPO: Support config pad_to_multiple_of by @albertvillanova in #6080
- chore: update clear_cache.yml by @hf-security-analysis[bot] in #6047
- chore: update docker-build.yml by @hf-security-analysis[bot] in #6048
- Padding-free training in AsyncGRPO by @qgallouedec in #5854
- Delete CI pr_style_bot workflow by @albertvillanova in #6082
- Fix broken example link in BCO trainer docs by @DaoyuanLi2816 in #6084
- Hotfix CI: Temporarily pin deepspeed < 0.19.2 by @albertvillanova in #6090
- Add MoE auxiliary loss to GRPO, RLOO, and AsyncGRPO trainers by @AmineDiro in #6083
- Add experimental Harbor integration for GRPO environment training by @adithya-s-k in #6018
- Fix sft_fa2 invariant test by @qgallouedec in #6069
- Use relative imports in async_grpo to match the rest of trl/experimental by @qgallouedec in #6012
- Fix ref adapter creation when the LoRA config uses target_parameters by @discobot in #6043
- Remove issue labeller by @qgallouedec in #6052
- Add Idefics3 original and training chat template with generation markers by @aazizyan in #5871
- Align SDPO with GRPO/RLOO: drop NaN values when averaging logged metrics by @anshulkulhari7 in #6055
- docs(rloo): add clarifying comment on KL penalty formula divergence from GRPO by @abderahmane-ai in #6096
- feat(grpo): replace deprecated use_transformers_paged with transformers continuous batching by @sergiopaniego in #5765
- test: bound memory in test_gkd_trainer_with_liger to avoid OOM on shared runners by @behroozazarkhalili in #6103
- Add vision requirement marker to Idefics3 test parameter by @qgallouedec in #6106
- Pin SHA instead of version tag for CI actions/checkout by @albertvillanova in #6097
- Pin SHA instead of version tag for CI astral-sh/setup-uv by @albertvillanova in #6098
- Pin SHA instead of version tag for CI actions/setup-python by @albertvillanova in #6099
- Pin SHA instead of version tag for CI pre-commit/action by @albertvillanova in #6100
- Fix ZeRO-3 + PEFT mixed-dtype error for core trainers by @albertvillanova in #6091
- Align KTO with DPO: Fix ZeRO-3 + PEFT dtype mismatch for non-quantized models by @albertvillanova in #6093
- Fix per-chunk lm_head.weight all-gathers under FSDP2 + chunked_nll by @albertvillanova in #6077
- Raise when use_liger_kernel is combined with a PEFT adapter on lm_head by @akshansh47 in #5977
- Align epsilon help/docstring wording by @qgallouedec in #6014
- Fix broken light-mode banner image in README by @strickvl in #6112
- Drop vLLM 0.12 support by @qgallouedec in #6109
- Add router_aux_loss_coef by @qgallouedec in #6085
- Add support for vLLM 0.19.1 by @qgallouedec in #6107
- Add license to Harbor examples by @qgallouedec in #6104
- Remove redundant .contiguous() from the shift logits/labels pattern by @qgallouedec in #6046
- Fix BrowserGym OpenEnv example dependency and task wiring by @burtenshaw in #6117
- Fix async GRPO docs to require vllm>=0.22.0 by @sergiopaniego in #6101
- Add get_repetition_penalty_reward by @qgallouedec in #6058
- Add trust_remote_code to trainer configs by @qgallouedec in #5802
- Align AsyncGRPO num_completions_to_print with GRPO (int | None) by @qgallouedec in #6020
- Align AsyncGRPO epsilon_high with GRPO (None fallback to epsilon) by @qgallouedec in #6019
- Fix logging_steps default mentioned in AsyncGRPOConfig docstring by @qgallouedec in #6016
- Align async GRPO loss variable names with GRPOTrainer by @qgallouedec in #6013
- Make evaluate() accept the same dataset types as the trainer by @qgallouedec in #6116
- Add support for vLLM 0.20.0 by @qgallouedec in #6108
- Add support for vLLM 0.22.1 by @qgallouedec in #6119
- Fix broken internal doc link to GKD Trainer in MiniLLM docs by @ShamSaleem in #6131
- Align AsyncGRPO clip-ratio metrics with GRPOTrainer by @qgallouedec in #6021
- Add get_cosine_scaled_reward by @qgallouedec in #6066
- Move experimental example scripts out of the packaged tree by @sergiopaniego in #6141
- Harmonize logger imports by @qgallouedec in #6142
- refactor(sft): build labels during dataset preparation instead of collation by @0xadvait in #6037
- Add support for vLLM 0.23.0 by @qgallouedec in #6153
- Change _get_train_sampler comment to mention num_iterations > 1 by @anidoesdev in #6125
- Warn when sequence-level importance sampling is combined with a token-summed loss type by @discobot in #6042
- Drop vLLM 0.13 support by @qgallouedec in #6154
- Align KTO with DPO: Add evaluate() override by @albertvillanova in #6148
- Align KTO with DPO: Align order and signature of methods by @albertvillanova in #6149
- Align KTO with DPO: Move all metrics computation from log to _compute_loss by @albertvillanova in #6150
- Align KTO with DPO: Support sync_ref_model by @albertvillanova in #6152
- Replace parse_version with Version by @albertvillanova in #6164
- Keep extra columns in unpair_preference_dataset by @albertvillanova in #6161
- Align KTO with DPO: Add tests by @albertvillanova in #6160
- Align format of code examples in docstrings by @albertvillanova in #6147
- Align KTO with DPO: Add VLM multi-image test by @albertvillanova in #6163
- Support LFM2-VL multimodal inputs in GRPO and RLOO by @zwischenraum in #6114
Full Changelog: v1.6.0...v1.7.0