Release v1.7.0 · huggingface/trl

Features

SFT default `loss_type` is now `"chunked_nll"`

The flip announced in v1.6 has landed. Setting loss_type is optional, and when it is unset the default now resolves to "chunked_nll", which cuts peak VRAM for SFTTrainer runs by ~30% on average (up to ~50% on large-vocab models) with wall-clock time neutral or slightly faster. No action needed.

The auto-resolve falls back to "nll" when use_liger_kernel=True (the two paths are incompatible). If you want the old behavior — e.g. for custom heads — pin it explicitly:

SFTConfig(loss_type="nll")
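
To illustrate why a chunked loss lowers peak memory, here is a minimal pure-Python sketch, not TRL's implementation. It assumes the chunking happens at the LM-head projection, so that only `chunk_size` rows of logits exist at any time instead of the full `seq_len × vocab` matrix. The function names and toy dimensions are hypothetical.

```python
import math

IGNORE_INDEX = -100  # label value excluded from the loss


def project(hidden, lm_head):
    """Logits for one token: one dot product per vocabulary row."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in lm_head]


def token_nll(logits, label):
    """Negative log-likelihood of `label`, using a stable log-sum-exp."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[label]


def full_nll(hiddens, lm_head, labels):
    """Reference path: materialize logits for every token, then reduce."""
    all_logits = [project(h, lm_head) for h in hiddens]  # seq_len x vocab
    losses = [token_nll(lg, y) for lg, y in zip(all_logits, labels) if y != IGNORE_INDEX]
    return sum(losses) / len(losses)


def chunked_nll(hiddens, lm_head, labels, chunk_size=2):
    """Chunked path: at most `chunk_size` rows of logits exist at a time."""
    total, count = 0.0, 0
    for start in range(0, len(hiddens), chunk_size):
        stop = start + chunk_size
        chunk_logits = [project(h, lm_head) for h in hiddens[start:stop]]
        for lg, y in zip(chunk_logits, labels[start:stop]):
            if y != IGNORE_INDEX:
                total += token_nll(lg, y)
                count += 1
    return total / count


# Toy inputs (assumed values): 5 tokens, hidden size 2, vocabulary size 3.
hiddens = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2], [0.0, 0.7], [0.9, -0.1]]
lm_head = [[0.2, 0.1], [-0.3, 0.5], [0.7, -0.4]]
labels = [2, IGNORE_INDEX, 0, 1, 2]

print(full_nll(hiddens, lm_head, labels), chunked_nll(hiddens, lm_head, labels))
```

Both paths return the same mean loss; the difference is only in how many logits are alive at once, which is where the memory saving would come from on large vocabularies.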

by @qgallouedec in #5846

MoE auxiliary loss in GRPO / RLOO / AsyncGRPO

Post-training MoE models now correctly include the router load-balancing auxiliary loss, matching the model's own reference forward and SFTTrainer. Enable via model_init_kwargs:

GRPOConfig(
    ...,
    model_init_kwargs={"output_router_logits": True, "router_aux_loss_coef": 0.001},
)

The auxiliary loss is plumbed through _get_per_token_logps_and_entropies (which now returns a 3-tuple including aux_loss), folded into the policy loss with gradient-accumulation scaling matched per trainer, and logged as aux_loss. AsyncGRPO recomputes it via load_balancing_loss_func in the chunked LM-head path (same as SFT's chunked path).
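
For intuition about what the auxiliary term measures, the following is a rough sketch of a Switch-Transformer-style load-balancing loss for top-1 routing, plus the way a coefficient would fold it into the policy loss. It is an assumption-laden illustration: the exact reduction in transformers' load_balancing_loss_func (top-k handling, masking) may differ, and the helper names here are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def load_balancing_loss(router_logits):
    """router_logits: one list of per-expert logits per token (top-1 routing)."""
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    # f_i: fraction of tokens whose top-1 expert is i
    counts = [0] * num_experts
    for p in probs:
        counts[p.index(max(p))] += 1
    frac_tokens = [c / num_tokens for c in counts]

    # P_i: mean router probability assigned to expert i
    mean_prob = [sum(p[i] for p in probs) / num_tokens for i in range(num_experts)]

    return num_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))


def total_loss(policy_loss, aux_loss, router_aux_loss_coef=0.001):
    """Fold the auxiliary loss into the policy loss with a small coefficient."""
    return policy_loss + router_aux_loss_coef * aux_loss


balanced = load_balancing_loss([[10.0, -10.0], [-10.0, 10.0]])
collapsed = load_balancing_loss([[10.0, -10.0], [10.0, -10.0]])
print(balanced, collapsed)
```

In this formulation a router that spreads tokens evenly scores close to 1, while one that sends every token to a single expert scores close to the number of experts, so minimizing the term pushes toward balanced routing.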

by @AmineDiro in #6083, plus router_aux_loss_coef config wiring by @qgallouedec in #6085

New experimental GMPO trainer

Geometric-Mean Policy Optimization (GMPO) lands as an experimental trainer. It replaces GRPO's per-token arithmetic mean of importance ratios with a sequence-level geometric mean (the mean of clipped log-ratios, then exp); clipping is one-sided by advantage sign and applied in log space. The default is epsilon=0.4, per the paper.

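from trl.experimental.gmpo import GMPOConfig, GMPOTrainer
from trl.rewards import accuracy_reward

# `dataset` is assumed to be a prompt dataset loaded beforehand
trainer = GMPOTrainer(
    model="Qwen/Qwen3-4B",
    args=GMPOConfig(epsilon=0.4),
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
)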
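
The sequence-level weight described above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the trainer's code: the clipping direction (cap the log-ratio from above for positive advantages, from below for negative ones) is one reading of "one-sided by advantage sign", by analogy with PPO, and the function names are hypothetical.

```python
import math


def gmpo_sequence_weight(logp_new, logp_old, advantage, epsilon=0.4):
    """Geometric mean of per-token importance ratios, clipped in log space."""
    clipped = []
    for new, old in zip(logp_new, logp_old):
        log_ratio = new - old
        if advantage >= 0:
            # Assumed: positive advantages cap how far the ratio can grow.
            log_ratio = min(log_ratio, epsilon)
        else:
            # Assumed: negative advantages cap how far the ratio can shrink.
            log_ratio = max(log_ratio, -epsilon)
        clipped.append(log_ratio)
    # Mean of log-ratios, then exp: the geometric mean of the clipped ratios.
    return math.exp(sum(clipped) / len(clipped))


def gmpo_objective(logp_new, logp_old, advantage, epsilon=0.4):
    """Per-sequence surrogate: advantage times the geometric-mean weight."""
    return advantage * gmpo_sequence_weight(logp_new, logp_old, advantage, epsilon)


# Toy per-token log-probabilities (assumed values).
logp_new = [-1.0, -0.2, -2.0]
logp_old = [-1.5, -1.0, -1.9]
print(gmpo_sequence_weight(logp_new, logp_old, advantage=1.0))
print(gmpo_sequence_weight(logp_new, logp_old, advantage=-1.0))
```

Because the mean is taken over log-ratios, a single outlier token moves the sequence weight far less than it would move an arithmetic mean of ratios.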

by @raghulchandramouli in #6078

Transformers continuous batching in GRPO / RLOO

use_transformers_paged was deprecated in v1.4; it's now replaced with proper transformers continuous batching. The old branch silently bypassed importance-sampling correction (logprobs = None); the new path captures logprobs from output.logprobs and exposes a ContinuousBatchingConfig for KV-cache tuning.

GRPOConfig(
    ...,
    use_transformers_continuous_batching=True,
    transformers_continuous_batching_config={
        "use_cuda_graph": False,
        "max_memory_percent": 0.4,  # leave headroom for training
    },
)

Benchmark (Llama-3.2-1B-Instruct, A100 80GB, GSM8K): 1.25× faster at N=64 generations, with 16 GB lower peak VRAM than the default generate(). Use it when N ≥ 32 and completion lengths vary.

use_transformers_paged=True still works and forwards to the new flag with a FutureWarning. Requires transformers>=5.8.0.

by @sergiopaniego in #5765

AsyncGRPO: native weight sync with vLLM ≥ 0.22.0

WeightTransferClient now drives vLLM's native 4-phase RL weight-transfer API instead of the older 2-call flow: pause(mode="keep") → start_weight_update → threaded update_weights + NCCL broadcast → finish_weight_update → resume. Validated end-to-end on H100 across single-node, FSDP2×4 + TP=4, and 2-node FSDP2×4 + DP=2×TP=4 (weight-sync time ≈ 0.18-0.8 s).
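
The ordering of the phases can be illustrated with a small stand-in that only records calls. Everything here is hypothetical: `RecordingEngine` and `sync_weights` are not TRL or vLLM classes, the method names merely mirror the phase names listed above, and the `try`/`finally` around `resume` is a design choice of this sketch rather than documented behavior.

```python
import threading


class RecordingEngine:
    """Hypothetical stand-in for the inference engine: it only logs calls."""

    def __init__(self):
        self.calls = []
        self._lock = threading.Lock()

    def record(self, name):
        with self._lock:
            self.calls.append(name)

    def pause(self, mode):
        self.record(f"pause(mode={mode!r})")

    def start_weight_update(self):
        self.record("start_weight_update")

    def update_weights(self):
        self.record("update_weights")

    def finish_weight_update(self):
        self.record("finish_weight_update")

    def resume(self):
        self.record("resume")


def sync_weights(engine, broadcast):
    """Run the phases in order; the receive side runs in its own thread."""
    engine.pause(mode="keep")
    try:
        engine.start_weight_update()
        receiver = threading.Thread(target=engine.update_weights)
        receiver.start()   # engine side waits for incoming tensors
        broadcast()        # trainer side sends them (NCCL in the real flow)
        receiver.join()
        engine.finish_weight_update()
    finally:
        engine.resume()    # resume generation even if the update failed


engine = RecordingEngine()
sync_weights(engine, broadcast=lambda: engine.record("broadcast"))
print(engine.calls)
```

The receive call and the broadcast run concurrently, so their relative order in the log can vary; the surrounding phases are strictly ordered.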

by @AmineDiro in #5892

Padding-free training in AsyncGRPO

AsyncGRPO now supports the same padding-free path that SFT already had. It flattens the batch and uses position_ids-based document boundaries instead of right-padding to the longest sequence, which yields a meaningful speedup and memory savings on heterogeneous-length workloads.
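
A minimal sketch of the idea, in plain Python with hypothetical helper names: sequences are concatenated into one flat row, and position IDs restart at 0 for each sequence, so a position of 0 marks a document boundary.

```python
def flatten_padding_free(sequences):
    """Concatenate sequences and restart position IDs at each boundary."""
    input_ids, position_ids = [], []
    for seq in sequences:
        input_ids.extend(seq)
        position_ids.extend(range(len(seq)))
    return input_ids, position_ids


def document_starts(position_ids):
    """Recover sequence boundaries: every position 0 opens a new document."""
    return [i for i, pos in enumerate(position_ids) if pos == 0]


def right_pad(sequences, pad_id=0):
    """Baseline for comparison: pad every sequence to the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


sequences = [[11, 12, 13], [21], [31, 32]]
input_ids, position_ids = flatten_padding_free(sequences)
padded = right_pad(sequences)

print(input_ids)      # flat tokens, no padding
print(position_ids)   # restarts at 0 for each sequence
print(len(input_ids), sum(len(row) for row in padded))  # tokens processed: flat vs padded
```

In this toy batch the flat layout processes 6 tokens where right-padding would process 9; the gap widens as sequence lengths become more uneven.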

by @qgallouedec in #5854

Experimental Harbor integration

A new trl.experimental.harbor adapter plugs Harbor agentic task suites into GRPOTrainer via environment_factory. Same pattern as the OpenReward integration — one spec wires all three trainer slots:

from trl import GRPOConfig, GRPOTrainer
from trl.experimental.harbor import HarborSpec

spec = HarborSpec("AdithyaSK/data_agent_rl_environment_train", agent="bash", num_tasks=64)

trainer = GRPOTrainer(
    model="Qwen/Qwen3-4B",
    args=GRPOConfig(num_generations=8, max_steps=50, max_tool_calling_iterations=25),
    train_dataset=spec.train_dataset,
    environment_factory=spec.environment_factory,
    reward_funcs=spec.reward_funcs,
)

Built-in bash harness, plus jupyter and terminal_notes example harnesses. Gated by the new trl[harbor] extra.

by @adithya-s-k in #6018

`trust_remote_code` in trainer configs

A single trust_remote_code: bool = False field on the trainer configs now covers the whole load surface — model, processor / tokenizer, reference model, reward model, reward tokenizer, teacher — instead of forcing users to thread it through several independent kwarg dicts.

SFTConfig(trust_remote_code=True)

ModelConfig.trust_remote_code is removed to avoid duplicate --trust_remote_code when combining dataclasses; CLI behavior is unchanged.

by @qgallouedec in #5802

KTO ↔ DPO alignment: tests, evaluate, sync_ref_model

The last alignment cycle before graduation: KTO now has parity with DPO on pad_to_multiple_of, sync_ref_model, evaluate(), method order/signature, metric placement (all moved into _compute_loss), and a real text + VLM (including multi-image) test suite.

PRs all by @albertvillanova: #6029, #6030, #6033, #6034, #6035, #6080, #6093, #6148, #6149, #6150, #6152, #6160, #6163.

LFM2-VL multimodal inputs in GRPO / RLOO

GRPO and RLOO now support LFM2-VL multimodal inputs end-to-end.

by @zwischenraum in #6114

New built-in reward helpers

get_repetition_penalty_reward by @qgallouedec in #6058
get_cosine_scaled_reward by @qgallouedec in #6066

SFT refactor: build labels during dataset preparation

Label construction moves out of the collator and into dataset preparation, so "what's trainable" is defined in exactly one place. A single batched map produces a labels column where each token keeps its ID when every applicable mask is 1, else -100. Plain LM stays storage-neutral; pre-tokenized datasets with mask columns now go through the same path. Step 1 toward fixing #3927.
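
The masking rule can be sketched as a batched map function in plain Python. This is an illustration, not the code from the PR: the mask column names `completion_mask` and `assistant_masks` and the helper names are assumptions.

```python
IGNORE_INDEX = -100  # positions with this label do not contribute to the loss


def build_labels(input_ids, masks):
    """Keep a token ID only where every mask is 1; otherwise emit -100."""
    return [
        tok if all(mask[i] == 1 for mask in masks) else IGNORE_INDEX
        for i, tok in enumerate(input_ids)
    ]


def add_labels(batch, mask_columns=("completion_mask", "assistant_masks")):
    """Batched map function: `batch` maps column names to lists of rows."""
    present = [col for col in mask_columns if col in batch]
    labels = []
    for row, input_ids in enumerate(batch["input_ids"]):
        masks = [batch[col][row] for col in present]
        labels.append(build_labels(input_ids, masks))
    return {"labels": labels}


batch = {
    "input_ids": [[5, 6, 7, 8], [9, 10, 11]],
    "completion_mask": [[0, 1, 1, 1], [0, 0, 1]],
    "assistant_masks": [[1, 1, 0, 1], [1, 1, 1]],
}
print(add_labels(batch))
```

A function of this shape could be passed to a batched dataset map; when no mask column is present, every token keeps its ID.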

by @0xadvait in #6037

Idefics3 chat template

{% generation %}-marker training template for Idefics3, enabling assistant_only_loss=True.

by @aazizyan in #5871

vLLM version sweep

Support vLLM 0.19.1 by @qgallouedec in #6107
Support vLLM 0.20.0 by @qgallouedec in #6108
Support vLLM 0.22.1 by @qgallouedec in #6119
Support vLLM 0.23.0 by @qgallouedec in #6153
Drop vLLM 0.12 by @qgallouedec in #6109
Drop vLLM 0.13 by @qgallouedec in #6154

Other

Normalize JSD distillation loss by num_items_in_batch for gradient accumulation by @behroozazarkhalili in #6006
[AsyncGRPO] Rollout worker: set aiohttp limit to max(100, max_inflight_tasks) by @ggcr in #5861
Align SDPO with GRPO/RLOO: drop NaN values when averaging logged metrics by @anshulkulhari7 in #6055
Make evaluate() accept the same dataset types as the trainer by @qgallouedec in #6116
Keep extra columns in unpair_preference_dataset by @albertvillanova in #6161
Warn when sequence-level importance sampling is combined with a token-summed loss type by @discobot in #6042
fix(profiling): log ProfilingContext metrics to Trackio backend by @Anai-Guo in #5979
Move experimental example scripts out of the packaged tree by @sergiopaniego in #6141
Remove redundant .contiguous() calls by @qgallouedec in #6045 and #6046
Harmonize logger imports by @qgallouedec in #6142

Fixes

Share frozen layers with reference model instead of duplicating in memory — create_reference_model with num_shared_layers was double-allocating the "shared" frozen layers because the loop never assigned _ref_param back. Now it does, so shared layers are held once. By @behroozazarkhalili in #6053
Fix chunked_nll mixed Tensor/DTensor error under FSDP2 + PEFT by @albertvillanova in #6065
Fix per-chunk lm_head.weight all-gathers under FSDP2 + chunked_nll by @albertvillanova in #6077
Fix ZeRO-3 + PEFT mixed-dtype error for core trainers by @albertvillanova in #6091 and KTO by @albertvillanova in #6093
Fix ref adapter creation when the LoRA config uses target_parameters by @discobot in #6043
Raise when use_liger_kernel is combined with a PEFT adapter on lm_head by @akshansh47 in #5977
[fix] GLM-4-MoE template: turn-terminating token to the turn itself by @qgallouedec in #6044
Fix unpair_preference_dataset dropping extra columns by @albertvillanova in #6059
fix(gold): preserve vllm prompt special tokens by @he-yufeng in #6063
fix(sft): reject transformed datasets during preparation by @he-yufeng in #6054
Fix broken light-mode banner image in README by @strickvl in #6112
Fix BrowserGym OpenEnv example dependency and task wiring by @burtenshaw in #6117
Fix signature of _unpair_row by @albertvillanova in #6062
Remove silently-ignored W&B/Hub fields from GOLD and Distillation configs by @DaoyuanLi2816 in #6023

Documentation and Examples

docs: sync experimental trainer docstrings with their __init__ signatures by @DaoyuanLi2816 in #6011
docs: fix stale default values in config docstrings by @DaoyuanLi2816 in #6015
docs: fix function docstrings that drifted from their signatures by @DaoyuanLi2816 in #6022
Fix broken doc links in paper_index (experimental page was split) by @DaoyuanLi2816 in #6070
Fix broken import in GOLD trainer docs (GOLD is experimental) by @DaoyuanLi2816 in #6073
Fix broken section anchors in docs by @DaoyuanLi2816 in #6071
Fix broken example link in BCO trainer docs by @DaoyuanLi2816 in #6084
docs(rloo): add clarifying comment on KL penalty formula divergence from GRPO by @abderahmane-ai in #6096
Fix async GRPO docs to require vllm>=0.22.0 by @sergiopaniego in #6101
Fix broken internal doc link to GKD Trainer in MiniLLM docs by @ShamSaleem in #6131
Align epsilon help/docstring wording by @qgallouedec in #6014
Fix style of optional parameters in docstrings by @albertvillanova in #6081
Align format of code examples in docstrings by @albertvillanova in #6147
Add license to Harbor examples by @qgallouedec in #6104

CI

Extract VLM tests into dedicated classes for core trainers by @albertvillanova in #6033
Add raises test with vision dataset and text model for DPO and SFT by @albertvillanova in #6026
Extract _push_param_to_vllm helper in VLLMGeneration by @albertvillanova in #6004
Use relative imports in async_grpo to match the rest of trl/experimental by @qgallouedec in #6012
Align AsyncGRPO with GRPO: num_completions_to_print, epsilon_high fallback, logging_steps docstring, loss variable names, clip-ratio metrics by @qgallouedec in #6020, #6019, #6016, #6013 and #6021
Add vision requirement marker to Idefics3 test parameter by @qgallouedec in #6106
Update distributed SFT test after default chunked_nll loss by @albertvillanova in #6074
test: bound memory in test_gkd_trainer_with_liger to avoid OOM on shared runners by @behroozazarkhalili in #6103
Fix sft_fa2 invariant test by @qgallouedec in #6069
Pin SHA instead of version tag for CI actions/checkout / astral-sh/setup-uv / actions/setup-python / pre-commit/action by @albertvillanova in #6097, #6098, #6099 and #6100
Replace parse_version with Version by @albertvillanova in #6164
Hotfix CI: temporarily pin deepspeed < 0.19.2 by @albertvillanova in #6090
chore: update tests_transformers_branch.yml by @hf-security-analysis[bot] in #6051
chore: update clear_cache.yml by @hf-security-analysis[bot] in #6047
chore: update docker-build.yml by @hf-security-analysis[bot] in #6048
Delete CI pr_style_bot workflow by @albertvillanova in #6082
Remove issue labeller by @qgallouedec in #6052
Bump the actions group with 4 updates by @dependabot[bot] in #6041
Change _get_train_sampler comment to mention num_iterations > 1 by @anidoesdev in #6125

New Contributors

@Anai-Guo made their first contribution in #5979
@ggcr made their first contribution in #5861
@he-yufeng made their first contribution in #6063
@discobot made their first contribution in #6043
@anshulkulhari7 made their first contribution in #6055
@abderahmane-ai made their first contribution in #6096
@akshansh47 made their first contribution in #5977
@strickvl made their first contribution in #6112
@ShamSaleem made their first contribution in #6131
@0xadvait made their first contribution in #6037
@anidoesdev made their first contribution in #6125
@zwischenraum made their first contribution in #6114

What's Changed

⬆️ Bump dev version by @qgallouedec in #6010
docs: sync experimental trainer docstrings with their __init__ signatures by @DaoyuanLi2816 in #6011
docs: fix stale default values in config docstrings by @DaoyuanLi2816 in #6015
fix(profiling): log ProfilingContext metrics to Trackio backend by @Anai-Guo in #5979
docs: fix function docstrings that drifted from their signatures by @DaoyuanLi2816 in #6022
Add raises test with vision dataset and text model for DPO and SFT by @albertvillanova in #6026
Align KTO with DPO: Unwrap VLM batch dimension for text-only data in _tokenize by @albertvillanova in #6029
Extract VLM tests into dedicated classes for core trainers by @albertvillanova in #6033
Extract _push_param_to_vllm helper in VLLMGeneration by @albertvillanova in #6004
Align KTO with DPO: Add tests for text data collator by @albertvillanova in #6034
Align KTO with DPO: Support pad_to_multiple_of by @albertvillanova in #6035
Align KTO with DPO: Add VLM tests by @albertvillanova in #6030
Normalize JSD distillation loss by num_items_in_batch for gradient accumulation by @behroozazarkhalili in #6006
Bump the actions group with 4 updates by @dependabot[bot] in #6041
[fix] GLM-4-MoE template: turn-terminating token to the turn itself by @qgallouedec in #6044
fix: share frozen layers with reference model instead of duplicating in memory by @behroozazarkhalili in #6053
Default SFT loss to chunked_nll by @qgallouedec in #5846
Remove redundant .contiguous() calls by @qgallouedec in #6045
Remove silently-ignored W&B/Hub fields from GOLD and Distillation configs by @DaoyuanLi2816 in #6023
[AsyncGRPO] Rollout worker: set aiohttp limit to max(100, max_inflight_tasks) by @ggcr in #5861
async grpo native weight sync with vllm>=0.22.0 by @AmineDiro in #5892
Fix unpair_preference_dataset dropping extra columns by @albertvillanova in #6059
fix(gold): preserve vllm prompt special tokens by @he-yufeng in #6063
Fix broken doc links in paper_index (experimental page was split) by @DaoyuanLi2816 in #6070
chore: update tests_transformers_branch.yml by @hf-security-analysis[bot] in #6051
Fix chunked_nll mixed Tensor/DTensor error under FSDP2 + PEFT by @albertvillanova in #6065
Fix signature of _unpair_row by @albertvillanova in #6062
fix(sft): reject transformed datasets during preparation by @he-yufeng in #6054
Fix broken import in GOLD trainer docs (GOLD is experimental) by @DaoyuanLi2816 in #6073
Fix broken section anchors in docs (slug mismatches, renamed/removed sections) by @DaoyuanLi2816 in #6071
Update distributed SFT test after default chunked_nll loss by @albertvillanova in #6074
Add GMPO (Geometric-Mean Policy Optimization) experimental trainer by @raghulchandramouli in #6078
Fix style of optional parameters in docstrings by @albertvillanova in #6081
Align KTO with DPO: Support config pad_to_multiple_of by @albertvillanova in #6080
chore: update clear_cache.yml by @hf-security-analysis[bot] in #6047
chore: update docker-build.yml by @hf-security-analysis[bot] in #6048
Padding-free training in AsyncGRPO by @qgallouedec in #5854
Delete CI pr_style_bot workflow by @albertvillanova in #6082
Fix broken example link in BCO trainer docs by @DaoyuanLi2816 in #6084
Hotfix CI: Temporarily pin deepspeed < 0.19.2 by @albertvillanova in #6090
Add MoE auxiliary loss to GRPO, RLOO, and AsyncGRPO trainers by @AmineDiro in #6083
Add experimental Harbor integration for GRPO environment training by @adithya-s-k in #6018
Fix sft_fa2 invariant test by @qgallouedec in #6069
Use relative imports in async_grpo to match the rest of trl/experimental by @qgallouedec in #6012
Fix ref adapter creation when the LoRA config uses target_parameters by @discobot in #6043
Remove issue labeller by @qgallouedec in #6052
Add Idefics3 original and training chat template with generation markers by @aazizyan in #5871
Align SDPO with GRPO/RLOO: drop NaN values when averaging logged metrics by @anshulkulhari7 in #6055
docs(rloo): add clarifying comment on KL penalty formula divergence from GRPO by @abderahmane-ai in #6096
feat(grpo): replace deprecated use_transformers_paged with transformers continuous batching by @sergiopaniego in #5765
test: bound memory in test_gkd_trainer_with_liger to avoid OOM on shared runners by @behroozazarkhalili in #6103
Add vision requirement marker to Idefics3 test parameter by @qgallouedec in #6106
Pin SHA instead of version tag for CI actions/checkout by @albertvillanova in #6097
Pin SHA instead of version tag for CI astral-sh/setup-uv by @albertvillanova in #6098
Pin SHA instead of version tag for CI actions/setup-python by @albertvillanova in #6099
Pin SHA instead of version tag for CI pre-commit/action by @albertvillanova in #6100
Fix ZeRO-3 + PEFT mixed-dtype error for core trainers by @albertvillanova in #6091
Align KTO with DPO: Fix ZeRO-3 + PEFT dtype mismatch for non-quantized models by @albertvillanova in #6093
Fix per-chunk lm_head.weight all-gathers under FSDP2 + chunked_nll by @albertvillanova in #6077
Raise when use_liger_kernel is combined with a PEFT adapter on lm_head by @akshansh47 in #5977
Align epsilon help/docstring wording by @qgallouedec in #6014
Fix broken light-mode banner image in README by @strickvl in #6112
Drop vLLM 0.12 support by @qgallouedec in #6109
Add router_aux_loss_coef by @qgallouedec in #6085
Add support for vLLM 0.19.1 by @qgallouedec in #6107
Add license to Harbor examples by @qgallouedec in #6104
Remove redundant .contiguous() from the shift logits/labels pattern by @qgallouedec in #6046
Fix BrowserGym OpenEnv example dependency and task wiring by @burtenshaw in #6117
Fix async GRPO docs to require vllm>=0.22.0 by @sergiopaniego in #6101
Add get_repetition_penalty_reward by @qgallouedec in #6058
Add trust_remote_code to trainer configs by @qgallouedec in #5802
Align AsyncGRPO num_completions_to_print with GRPO (int | None) by @qgallouedec in #6020
Align AsyncGRPO epsilon_high with GRPO (None fallback to epsilon) by @qgallouedec in #6019
Fix logging_steps default mentioned in AsyncGRPOConfig docstring by @qgallouedec in #6016
Align async GRPO loss variable names with GRPOTrainer by @qgallouedec in #6013
Make evaluate() accept the same dataset types as the trainer by @qgallouedec in #6116
Add support for vLLM 0.20.0 by @qgallouedec in #6108
Add support for vLLM 0.22.1 by @qgallouedec in #6119
Fix broken internal doc link to GKD Trainer in MiniLLM docs by @ShamSaleem in #6131
Align AsyncGRPO clip-ratio metrics with GRPOTrainer by @qgallouedec in #6021
Add get_cosine_scaled_reward by @qgallouedec in #6066
Move experimental example scripts out of the packaged tree by @sergiopaniego in #6141
Harmonize logger imports by @qgallouedec in #6142
refactor(sft): build labels during dataset preparation instead of collation by @0xadvait in #6037
Add support for vLLM 0.23.0 by @qgallouedec in #6153
Change _get_train_sampler comment to mention num_iterations > 1 by @anidoesdev in #6125
Warn when sequence-level importance sampling is combined with a token-summed loss type by @discobot in #6042
Drop vLLM 0.13 support by @qgallouedec in #6154
Align KTO with DPO: Add evaluate() override by @albertvillanova in #6148
Align KTO with DPO: Align order and signature of methods by @albertvillanova in #6149
Align KTO with DPO: Move all metrics computation from log to _compute_loss by @albertvillanova in #6150
Align KTO with DPO: Support sync_ref_model by @albertvillanova in #6152
Replace parse_version with Version by @albertvillanova in #6164
Keep extra columns in unpair_preference_dataset by @albertvillanova in #6161
Align KTO with DPO: Add tests by @albertvillanova in #6160
Align format of code examples in docstrings by @albertvillanova in #6147
Align KTO with DPO: Add VLM multi-image test by @albertvillanova in #6163
Support LFM2-VL multimodal inputs in GRPO and RLOO by @zwischenraum in #6114

Full Changelog: v1.6.0...v1.7.0
