v1.7.0

Latest

@qgallouedec qgallouedec released this 25 Jun 22:52
· 2 commits to main since this release
06b42c7

Features

SFT default loss_type is now "chunked_nll"

The flip announced in v1.6 has landed. Setting loss_type is optional, and the default now resolves to "chunked_nll" — giving every SFTTrainer run ~30% less peak VRAM on average (up to ~50% on large-vocab models) with wall-clock time neutral or slightly faster. No action needed.

The auto-resolve falls back to "nll" when use_liger_kernel=True (the two paths are incompatible). If you want the old behavior — e.g. for custom heads — pin it explicitly:

SFTConfig(loss_type="nll")
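
To illustrate why chunking lowers peak memory, here is a minimal pure-Python sketch of the general idea, not TRL's implementation. It assumes the LM head can be applied to a slice of positions at a time; the helper names (log_softmax_nll, full_nll, chunked_nll) and the logits_fn callback are hypothetical. Only chunk_size rows of logits exist at once, yet the mean loss matches the all-at-once computation.

```python
import math

IGNORE_INDEX = -100  # positions with this target are excluded from the loss

def log_softmax_nll(logits_row, target):
    """NLL of one token: logsumexp(logits) - logits[target]."""
    m = max(logits_row)
    lse = m + math.log(sum(math.exp(x - m) for x in logits_row))
    return lse - logits_row[target]

def full_nll(logits, targets):
    """Mean NLL with the full (tokens x vocab) logits materialized."""
    losses = [log_softmax_nll(row, t)
              for row, t in zip(logits, targets) if t != IGNORE_INDEX]
    return sum(losses) / len(losses)

def chunked_nll(logits_fn, num_tokens, targets, chunk_size):
    """Mean NLL accumulated chunk by chunk.

    logits_fn(start, end) stands in for applying the LM head to a slice
    of hidden states (an assumption made for this sketch).
    """
    total, count = 0.0, 0
    for start in range(0, num_tokens, chunk_size):
        end = min(start + chunk_size, num_tokens)
        chunk_logits = logits_fn(start, end)  # only this slice is alive
        for row, t in zip(chunk_logits, targets[start:end]):
            if t != IGNORE_INDEX:
                total += log_softmax_nll(row, t)
                count += 1
    return total / count

# Toy data: 5 positions, vocabulary of 4, one ignored position.
logits = [
    [2.0, 0.5, -1.0, 0.0],
    [0.1, 0.2, 0.3, 0.4],
    [1.5, -0.5, 0.0, 2.5],
    [0.0, 0.0, 0.0, 0.0],
    [-1.0, 3.0, 0.5, 0.2],
]
targets = [0, IGNORE_INDEX, 3, 1, 1]

full = full_nll(logits, targets)
chunked = chunked_nll(lambda s, e: logits[s:e], len(logits), targets, chunk_size=2)
```

In a real model the saving comes from never holding the full tokens-by-vocabulary logits tensor, which is why the effect grows with vocabulary size.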

by @qgallouedec in #5846

MoE auxiliary loss in GRPO / RLOO / AsyncGRPO

Post-training MoE models now correctly include the router load-balancing auxiliary loss, matching the model's own reference forward and SFTTrainer. Enable via model_init_kwargs:

GRPOConfig(
    ...,
    model_init_kwargs={"output_router_logits": True, "router_aux_loss_coef": 0.001},
)

Plumbed through _get_per_token_logps_and_entropies (now returns a 3-tuple including aux_loss), folded into the policy loss with grad-accum scaling matched per trainer, and logged as aux_loss. AsyncGRPO recomputes it via load_balancing_loss_func in the chunked LM-head path (same as SFT's chunked path).
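
As a rough illustration of what a router load-balancing term measures, the sketch below uses the common Switch-style formulation (number of experts times the sum over experts of routed fraction times mean router probability). This is an assumption about the general technique, not the exact code path in TRL or transformers. The function names are hypothetical, and grad-accum scaling is left out.

```python
def load_balancing_loss(router_probs, top_k=1):
    """Switch-style auxiliary loss for one layer (illustrative only).

    router_probs: one list of expert probabilities per token.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    routed_fraction = [0.0] * num_experts  # share of tokens sent to each expert
    mean_prob = [0.0] * num_experts        # average router probability per expert
    for probs in router_probs:
        chosen = sorted(range(num_experts), key=lambda e: probs[e], reverse=True)[:top_k]
        for e in chosen:
            routed_fraction[e] += 1.0 / num_tokens
        for e in range(num_experts):
            mean_prob[e] += probs[e] / num_tokens
    return num_experts * sum(f * p for f, p in zip(routed_fraction, mean_prob))

def fold_into_policy_loss(policy_loss, aux_loss, router_aux_loss_coef=0.001):
    """Add the weighted auxiliary term to the policy loss."""
    return policy_loss + router_aux_loss_coef * aux_loss

# Balanced routing: each of 4 tokens prefers a different expert.
balanced = load_balancing_loss([
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
])
# Collapsed routing: every token prefers expert 0.
collapsed = load_balancing_loss([[0.7, 0.1, 0.1, 0.1]] * 4)
```

The collapsed router scores higher than the balanced one, so adding the term to the policy loss penalizes routing that concentrates on a few experts.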

by @AmineDiro in #6083, plus router_aux_loss_coef config wiring by @qgallouedec in #6085

New experimental GMPO trainer

Geometric-Mean Policy Optimization lands as an experimental trainer. Replaces GRPO's per-token arithmetic mean of importance ratios with a sequence-level geometric mean (mean of clipped log-ratios, then exp); clipping is one-sided by advantage sign and applied in log space. Default epsilon=0.4 per the paper.

from trl.experimental.gmpo import GMPOConfig, GMPOTrainer

trainer = GMPOTrainer(
    model="Qwen/Qwen3-4B",
    args=GMPOConfig(epsilon=0.4),
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
)
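
The sequence-level objective described above can be sketched in a few lines of plain Python. This is a reading of the description (clip each token's log-ratio on one side depending on the advantage sign, average, then exponentiate), not the trainer's actual code; gmpo_sequence_objective is a hypothetical name.

```python
import math

def gmpo_sequence_objective(logp_new, logp_old, advantage, epsilon=0.4):
    """Advantage-weighted geometric mean of per-token importance ratios."""
    clipped = []
    for new, old in zip(logp_new, logp_old):
        log_ratio = new - old
        if advantage > 0:
            # Positive advantage: cap how far a token's ratio may rise.
            log_ratio = min(log_ratio, epsilon)
        else:
            # Negative advantage: cap how far a token's ratio may fall.
            log_ratio = max(log_ratio, -epsilon)
        clipped.append(log_ratio)
    # Mean of log-ratios followed by exp is the geometric mean of the ratios.
    geo_mean_ratio = math.exp(sum(clipped) / len(clipped))
    return advantage * geo_mean_ratio

logp_new = [-1.0, -0.2, -2.0]
logp_old = [-1.5, -1.2, -1.0]  # per-token log-ratios: 0.5, 1.0, -1.0

pos = gmpo_sequence_objective(logp_new, logp_old, advantage=1.0)
neg = gmpo_sequence_objective(logp_new, logp_old, advantage=-1.0)
```

With a positive advantage only the two large ratios are clipped (to 0.4); with a negative advantage only the small one is clipped (to -0.4). Because the mean is taken in log space, a single outlier token moves the sequence-level ratio less than it would under an arithmetic mean.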

by @raghulchandramouli in #6078

Transformers continuous batching in GRPO / RLOO

use_transformers_paged was deprecated in v1.4; it's now replaced with proper transformers continuous batching. The old branch silently bypassed importance-sampling correction (logprobs = None); the new path captures logprobs from output.logprobs and exposes a ContinuousBatchingConfig for KV-cache tuning.

GRPOConfig(
    ...,
    use_transformers_continuous_batching=True,
    transformers_continuous_batching_config={
        "use_cuda_graph": False,
        "max_memory_percent": 0.4,  # leave headroom for training
    },
)

Benchmark (Llama-3.2-1B-Instruct, A100 80GB, GSM8K): 1.25× faster at N=64 generations with -16 GB peak VRAM vs default generate(). Use when N ≥ 32 with variable completion lengths.

use_transformers_paged=True still works and forwards to the new flag with a FutureWarning. Requires transformers>=5.8.0.

by @sergiopaniego in #5765

AsyncGRPO: native weight sync with vLLM ≥ 0.22.0

WeightTransferClient now drives vLLM's native 4-phase RL weight-transfer API instead of the older 2-call flow: pause(mode="keep") → start_weight_update → threaded update_weights + NCCL broadcast → finish_weight_update → resume. Validated end-to-end on H100 across single-node, FSDP2×4 + TP=4, and 2-node FSDP2×4 + DP=2×TP=4 (weight-sync time ≈ 0.18-0.8 s).

by @AmineDiro in #5892

Padding-free training in AsyncGRPO

AsyncGRPO now supports the same padding-free path SFT already had. Flattens the batch and uses position_ids-based document boundaries instead of right-padding to the longest sequence — meaningful speedup and memory savings on heterogeneous-length workloads.
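
A minimal sketch of the flattening idea, assuming non-empty sequences; flatten_batch and sequence_lengths are hypothetical helpers, not TRL functions. The position_ids restart at 0 at each sequence start, which is what lets document boundaries be recovered without any padding tokens.

```python
def flatten_batch(sequences):
    """Concatenate variable-length sequences without padding.

    Returns (input_ids, position_ids); position_ids restart at 0 at the
    start of every sequence.
    """
    input_ids, position_ids = [], []
    for seq in sequences:
        input_ids.extend(seq)
        position_ids.extend(range(len(seq)))
    return input_ids, position_ids

def sequence_lengths(position_ids):
    """Recover per-sequence lengths from the resets in position_ids."""
    starts = [i for i, p in enumerate(position_ids) if p == 0]
    ends = starts[1:] + [len(position_ids)]
    return [end - start for start, end in zip(starts, ends)]

batch = [[11, 12, 13, 14, 15], [21, 22], [31, 32, 33]]
input_ids, position_ids = flatten_batch(batch)

# Tokens processed with right-padding to the longest sequence vs. flattened.
padded_tokens = len(batch) * max(len(seq) for seq in batch)
flat_tokens = len(input_ids)
```

In this toy batch, right-padding would process 15 token slots while the flattened form processes 10; the gap widens as completion lengths become more uneven.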

by @qgallouedec in #5854

Experimental Harbor integration

A new trl.experimental.harbor adapter plugs Harbor agentic task suites into GRPOTrainer via environment_factory. Same pattern as the OpenReward integration — one spec wires all three trainer slots:

from trl import GRPOConfig, GRPOTrainer
from trl.experimental.harbor import HarborSpec

spec = HarborSpec("AdithyaSK/data_agent_rl_environment_train", agent="bash", num_tasks=64)

trainer = GRPOTrainer(
    model="Qwen/Qwen3-4B",
    args=GRPOConfig(num_generations=8, max_steps=50, max_tool_calling_iterations=25),
    train_dataset=spec.train_dataset,
    environment_factory=spec.environment_factory,
    reward_funcs=spec.reward_funcs,
)

Built-in bash harness, plus jupyter and terminal_notes example harnesses. Gated by the new trl[harbor] extra.

by @adithya-s-k in #6018

trust_remote_code in trainer configs

A single trust_remote_code: bool = False field on the trainer configs now covers the whole load surface — model, processor / tokenizer, reference model, reward model, reward tokenizer, teacher — instead of forcing users to thread it through several independent kwarg dicts.

SFTConfig(trust_remote_code=True)

ModelConfig.trust_remote_code is removed to avoid duplicate --trust_remote_code when combining dataclasses; CLI behavior is unchanged.

by @qgallouedec in #5802

KTO ↔ DPO alignment: tests, evaluate, sync_ref_model

The last alignment cycle before graduation: KTO now has parity with DPO on pad_to_multiple_of, sync_ref_model, evaluate(), method order/signature, metric placement (all moved into _compute_loss), and a real text + VLM (including multi-image) test suite.

PRs all by @albertvillanova: #6029, #6030, #6033, #6034, #6035, #6080, #6093, #6148, #6149, #6150, #6152, #6160, #6163.

LFM2-VL multimodal inputs in GRPO / RLOO

GRPO and RLOO now support LFM2-VL multimodal inputs end-to-end.

by @zwischenraum in #6114

New built-in reward helpers

SFT refactor: build labels during dataset preparation

Label construction moves out of the collator and into dataset preparation, so "what's trainable" is defined in exactly one place. A single batched map produces a labels column where each token keeps its ID when every applicable mask is 1, else -100. Plain LM stays storage-neutral; pre-tokenized datasets with mask columns now go through the same path. Step 1 toward fixing #3927.
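
The labeling rule can be sketched as a small function. This is an illustration of the stated rule only; build_labels and the way masks are passed in are assumptions, and the real implementation runs as a batched dataset map.

```python
IGNORE_INDEX = -100  # label value that the loss skips

def build_labels(input_ids, masks):
    """Keep a token's ID only where every applicable mask is 1.

    masks: a list of 0/1 lists, each the same length as input_ids.
    With no masks, every token is trainable.
    """
    labels = []
    for i, token_id in enumerate(input_ids):
        trainable = all(mask[i] == 1 for mask in masks)
        labels.append(token_id if trainable else IGNORE_INDEX)
    return labels

# Two masks over four tokens: a token is trainable only if both allow it.
labels = build_labels([5, 6, 7, 8], [[0, 1, 1, 1], [1, 1, 0, 1]])
```

Combining the masks with a logical AND in one place means a position excluded by any mask can never leak back into the loss further downstream.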

by @0xadvait in #6037

Idefics3 chat template

{% generation %}-marker training template for Idefics3, enabling assistant_only_loss=True.

by @aazizyan in #5871

vLLM version sweep

Other

Fixes

  • Share frozen layers with reference model instead of duplicating in memory: create_reference_model with num_shared_layers was double-allocating the "shared" frozen layers because the loop never assigned _ref_param back. Now it does, so shared layers are held once. By @behroozazarkhalili in #6053
  • Fix chunked_nll mixed Tensor/DTensor error under FSDP2 + PEFT by @albertvillanova in #6065
  • Fix per-chunk lm_head.weight all-gathers under FSDP2 + chunked_nll by @albertvillanova in #6077
  • Fix ZeRO-3 + PEFT mixed-dtype error for core trainers by @albertvillanova in #6091 and KTO by @albertvillanova in #6093
  • Fix ref adapter creation when the LoRA config uses target_parameters by @discobot in #6043
  • Raise when use_liger_kernel is combined with a PEFT adapter on lm_head by @akshansh47 in #5977
  • [fix] GLM-4-MoE template: turn-terminating token to the turn itself by @qgallouedec in #6044
  • Fix unpair_preference_dataset dropping extra columns by @albertvillanova in #6059
  • fix(gold): preserve vllm prompt special tokens by @he-yufeng in #6063
  • fix(sft): reject transformed datasets during preparation by @he-yufeng in #6054
  • Fix broken light-mode banner image in README by @strickvl in #6112
  • Fix BrowserGym OpenEnv example dependency and task wiring by @burtenshaw in #6117
  • Fix signature of _unpair_row by @albertvillanova in #6062
  • Remove silently-ignored W&B/Hub fields from GOLD and Distillation configs by @DaoyuanLi2816 in #6023

Documentation and Examples

CI

New Contributors

What's Changed

Full Changelog: v1.6.0...v1.7.0