
v1.6.0

@qgallouedec released this 11 Jun 22:00 · 78 commits to main since this release · 0dac440

Features

AsyncGRPO rollout worker now runs in a separate process

AsyncRolloutWorker is no longer a thread — it's a spawned child process with its own GIL. The trainer's autograd engine no longer competes with recursive_parse / accuracy_reward for the GIL, which was causing 1-5s stalls in real Qwen3-30B-A3B @ 16k runs and ultimately NCCL watchdog timeouts on other ranks.

Architectural changes:

  • AsyncRolloutWorker (parent) owns the child process + shared mp.Queue / mp.Value / mp.Event.
  • _AsyncRolloutLoop (child-only) handles tokenization, dataset iteration, reward funcs, and asyncio loops.
  • A new WeightTransferClient owns the NCCL group with vLLM (/pause, /resume, /init_weight_transfer_engine, /update_weights); the rollout child only talks to /v1/completions.

Two correctness fixes shipped alongside (they would have conflicted otherwise):

  • Broader aiohttp retry: it now also catches ClientPayloadError and uses bounded exponential backoff.
  • All-NaN reward columns are now preserved. np.nansum was silently returning 0 for them, which gave unscorable completions a real advantage signal and pushed the policy away from correct answers (~30% of DeepMath / OpenR1-Math rows).
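The NaN fix can be illustrated with a minimal sketch. It assumes each completion has one score per reward function, with NaN meaning "this function could not score it". The helpers `nansum_like`, `nan_preserving_sum`, and `group_advantages` are hypothetical names for illustration, not TRL code, and `nansum_like` is a pure-Python stand-in that mimics how np.nansum skips NaNs.

```python
import math

def nansum_like(values):
    """Stand-in for np.nansum: NaNs are skipped, so an all-NaN input sums to 0.0."""
    return sum((v for v in values if not math.isnan(v)), 0.0)

def nan_preserving_sum(values):
    """Same sum, except an all-NaN input stays NaN instead of collapsing to 0.0."""
    if all(math.isnan(v) for v in values):
        return math.nan
    return nansum_like(values)

def group_advantages(rewards):
    """Mean-centred advantages over scorable completions; NaN rewards stay NaN."""
    scored = [r for r in rewards if not math.isnan(r)]
    if not scored:
        return [math.nan] * len(rewards)
    mean = sum(scored) / len(scored)
    return [r - mean for r in rewards]  # NaN - mean is still NaN

# One row per completion, one entry per reward function (assumed layout).
per_func = [[1.0, math.nan], [math.nan, math.nan], [1.0, math.nan]]

old_totals = [nansum_like(row) for row in per_func]         # unscorable row becomes 0.0
new_totals = [nan_preserving_sum(row) for row in per_func]  # unscorable row stays NaN

print(group_advantages(old_totals))  # the unscorable completion gets a negative advantage
print(group_advantages(new_totals))  # the unscorable completion stays NaN and can be masked out
```

With the old behaviour the unscorable completion is treated as a wrong answer and drags the group mean down. With NaN preserved it can simply be excluded from the update.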

Note

reward_funcs / tools / environment_factory must now be picklable, and the child runs CPU-only (CUDA_VISIBLE_DEVICES="").
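A quick way to check the picklability requirement before launching a run is to try pickling each callable yourself. The sketch below uses only the standard library. `is_picklable` and `length_reward` are hypothetical helpers for illustration, and the `(completions, **kwargs)` signature is assumed to match your reward functions.

```python
import pickle
from functools import partial

def is_picklable(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True

def length_reward(completions, max_len=100, **kwargs):
    """Toy reward: 1.0 for completions of at most max_len characters, else 0.0."""
    return [1.0 if len(c) <= max_len else 0.0 for c in completions]

# Binding a parameter with functools.partial keeps a module-level function picklable.
short_reward = partial(length_reward, max_len=20)

# Lambdas (and functions defined inside other functions) are pickled by name
# lookup, which fails, so they cannot be sent to a spawned child process.
lambda_reward = lambda completions, **kwargs: [0.0 for _ in completions]

for name, fn in [("short_reward", short_reward), ("lambda_reward", lambda_reward)]:
    print(name, "picklable:", is_picklable(fn))
```

Moving a lambda or closure to a module-level function, optionally wrapped in `functools.partial`, is usually enough to satisfy the requirement.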

by @AmineDiro in #5749

New experimental A2PO trainer (Optimal Advantage Regression)

A new A2POTrainer implements A*-PO from "Accelerating RL for LLM Reasoning with Optimal Advantage Regression". Two stages: an offline V* estimation pass from reference policy samples (with optional filter_all_incorrect to drop prompts where every reference completion fails), then on-policy training with one generation per prompt and a plain least-squares loss on β₂·log(π/π_ref) vs r − V*. No group, no critic, no clipping, no reward normalization.

from trl.experimental.a2po import A2POConfig, A2POTrainer
from trl.rewards import accuracy_reward

# `dataset` is assumed to be a prompt dataset loaded beforehand.
trainer = A2POTrainer(
    model="Qwen/Qwen3-4B",
    args=A2POConfig(num_value_samples=8, filter_all_incorrect=True),
    train_dataset=dataset,
    reward_funcs=accuracy_reward,
)
trainer.train()

Designed for binary verifiable rewards (math/code), not open-ended problems.
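The two stages can be sketched numerically as follows. This is not the A2POTrainer implementation. The log-mean-exp estimate of V* follows the paper's formulation as understood here, and `beta1`, `estimate_v_star`, `a2po_loss`, and `keep_prompt` are assumptions introduced for illustration that do not appear in these notes.

```python
import math

def estimate_v_star(ref_rewards, beta1):
    """Stage 1: V*(x) ~ beta1 * log(mean(exp(r / beta1))) over reference-policy samples."""
    scaled = [r / beta1 for r in ref_rewards]
    m = max(scaled)  # subtract the max before exponentiating for numerical stability
    mean_exp = sum(math.exp(s - m) for s in scaled) / len(scaled)
    return beta1 * (m + math.log(mean_exp))

def keep_prompt(ref_rewards):
    """Sketch of filter_all_incorrect: keep a prompt only if some reference sample scored > 0."""
    return any(r > 0 for r in ref_rewards)

def a2po_loss(logp, ref_logp, reward, v_star, beta2):
    """Stage 2: squared error between beta2 * log(pi / pi_ref) and the advantage r - V*."""
    return (beta2 * (logp - ref_logp) - (reward - v_star)) ** 2

# Binary rewards from 8 reference completions of one prompt.
ref_rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
if keep_prompt(ref_rewards):
    v_star = estimate_v_star(ref_rewards, beta1=0.5)
    # One on-policy generation: sequence log-probs under the policy and the reference.
    print(a2po_loss(logp=-12.0, ref_logp=-12.5, reward=1.0, v_star=v_star, beta2=0.1))
```

Each prompt needs only one on-policy generation during training because V* is fixed offline, which is why no group baseline or critic appears in the loss.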

by @raghulchandramouli in #5940

KTO now supports VLMs + big alignment push

The biggest KTO ↔ DPO alignment cycle yet — KTOTrainer now supports vision-language models, plus a deep restructuring of compute_loss, KL dataset generation, ref-logp precomputation, activation offloading, sampler strategy, metrics, and more. KTO graduation is very close.

from trl.experimental.kto import KTOConfig, KTOTrainer

trainer = KTOTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    args=KTOConfig(...),
    train_dataset=vision_kto_dataset,
)

VLM support by @albertvillanova in #5939, plus ~20 alignment PRs, all by @albertvillanova: #5820, #5849, #5852, #5850, #5866, #5864, #5856, #5872, #5875, #5900, #5901, #5899, #5906, #5909, #5914, #5982, #5936, #5996, #5998, #5999.

Cross-tokenizer alignment in GOLD via byte offsets

The GOLD distillation trainer used to align student/teacher tokens by extending two decoded strings and flushing on equality. It silently broke on any byte-level disagreement — including the common case of one tokenizer prepending BOS while the other doesn't (Llama-3 ↔ Qwen-3). The X-Token paper called this out by name.

Each side now carries (start_byte, end_byte) spans derived once from the fast tokenizer's char offsets, and the walker syncs on cumulative byte boundaries. On the on-policy path, spans come from piece_byte_len over the sampled token ids (not from re-encoding the decoded completion — BPE makes that round-trip non-injective).
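The idea of syncing on cumulative byte boundaries can be sketched in plain Python. This is an illustration under simplifying assumptions, not the GOLD code. Token pieces are given as strings, and `byte_spans` and `align_by_bytes` are hypothetical names.

```python
def byte_spans(pieces):
    """Turn token pieces into (start_byte, end_byte) spans over the UTF-8 text."""
    spans, pos = [], 0
    for piece in pieces:
        n = len(piece.encode("utf-8"))
        spans.append((pos, pos + n))
        pos += n
    return spans

def align_by_bytes(s_spans, t_spans):
    """Group student/teacher token indices whose spans end on the same byte boundary."""
    groups = []
    i = j = 0
    while i < len(s_spans) and j < len(t_spans):
        s_group, t_group = [i], [j]
        s_end, t_end = s_spans[i][1], t_spans[j][1]
        i, j = i + 1, j + 1
        # Extend whichever side is behind until both end on the same byte.
        while s_end != t_end:
            if s_end < t_end:
                if i == len(s_spans):
                    raise ValueError("student ran out of tokens before a shared boundary")
                s_group.append(i)
                s_end = s_spans[i][1]
                i += 1
            else:
                if j == len(t_spans):
                    raise ValueError("teacher ran out of tokens before a shared boundary")
                t_group.append(j)
                t_end = t_spans[j][1]
                j += 1
        groups.append((s_group, t_group))
    if i != len(s_spans) or j != len(t_spans):
        raise ValueError("unaligned trailing pieces on one side")
    return groups

student = byte_spans(["hel", "lo", " world"])
teacher = byte_spans(["hello", " wor", "ld"])
print(align_by_bytes(student, teacher))

# A zero-width BOS piece on one side only is absorbed into the first group.
print(align_by_bytes(byte_spans(["", "hello"]), byte_spans(["hello"])))
```

Because the walker compares byte positions rather than decoded strings, a special token that contributes no bytes cannot desynchronize the two sides.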

Two related fixes shipped: long rows no longer lose the completion (now keeping the last max_length tokens), and the vLLM on-policy original_prompt_text is now decoded from the truncated ids the student actually consumed.

by @kashif in #5885

SDFT / SDPO: live teacher logprobs from the vLLM server

When teacher_model_kind="live" and vllm_mode="server", the vLLM generation server already holds the current student weights (synced every step for rollouts). The new use_teacher_server=True flag scores the teacher's log-probs on that same server instead of running a separate local teacher forward — removing the teacher from the training step entirely.

Supported modes: sampled_token (reverse KL on the realized token) and topk_logits. When buffered batches reuse steps (num_iterations > 1), weights are re-synced before scoring so the teacher never scores with stale weights.
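For intuition on the sampled_token mode: the log-ratio at the token the student actually sampled is a single-sample estimate of the reverse KL, and its expectation under the student equals the exact value. The sketch below shows only that identity. It is not the trainer's loss code, and both function names are hypothetical.

```python
import math

def sampled_token_reverse_kl(student_logp, teacher_logp):
    """Single-sample estimate of KL(student || teacher) at the sampled token."""
    return student_logp - teacher_logp

def exact_reverse_kl(student_probs, teacher_probs):
    """Exact KL(student || teacher) over a full next-token distribution."""
    return sum(
        p * (math.log(p) - math.log(q))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0
    )

student_probs = [0.7, 0.2, 0.1]
teacher_probs = [0.5, 0.3, 0.2]

# Averaging the per-token estimate over the student's own sampling distribution
# recovers the exact reverse KL.
expected_estimate = sum(
    p * sampled_token_reverse_kl(math.log(p), math.log(q))
    for p, q in zip(student_probs, teacher_probs)
)
print(expected_estimate, exact_reverse_kl(student_probs, teacher_probs))
```

Only one teacher log-prob per generated token is needed in this mode, which is what makes scoring on the generation server cheap compared with a full local teacher forward pass.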

by @kashif in #5989

Bidirectional masked importance sampling (MIS) for IcePop

vLLM importance sampling in GRPO now uses a two-sided band [C_min, C_max] instead of a single upper cap, aligning TIS/MIS with IcePop's bidirectional handling of train–inference ratio outliers.

from trl import GRPOConfig

config = GRPOConfig(
    vllm_importance_sampling_clip_min=0.5,
    vllm_importance_sampling_clip_max=2.0,
    vllm_importance_sampling_correction="mask",  # or "truncate"
)

The old vllm_importance_sampling_cap is deprecated and maps to clip_max.
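As a rough illustration of the two-sided band, the sketch below computes per-token importance weights from trainer and vLLM log-probs. It is not the GRPO implementation. `importance_weights` is a hypothetical helper, and the assumption that "truncate" clamps on both sides of the band is ours.

```python
import math

def importance_weights(train_logps, vllm_logps, clip_min, clip_max, mode="mask"):
    """Per-token weights from the train/inference probability ratio.

    mode="mask":     ratios outside [clip_min, clip_max] get weight 0.
    mode="truncate": ratios are clamped into [clip_min, clip_max] (assumed behaviour).
    """
    weights = []
    for lp_train, lp_vllm in zip(train_logps, vllm_logps):
        ratio = math.exp(lp_train - lp_vllm)
        if mode == "mask":
            weights.append(ratio if clip_min <= ratio <= clip_max else 0.0)
        elif mode == "truncate":
            weights.append(min(max(ratio, clip_min), clip_max))
        else:
            raise ValueError(f"unknown mode: {mode!r}")
    return weights

# Three tokens whose train/inference ratios are about 0.2, 1.0 and 3.0.
train_logps = [math.log(0.2), 0.0, math.log(3.0)]
vllm_logps = [0.0, 0.0, 0.0]

print(importance_weights(train_logps, vllm_logps, 0.5, 2.0, mode="mask"))
print(importance_weights(train_logps, vllm_logps, 0.5, 2.0, mode="truncate"))
```

A single upper cap would have left the 0.2 ratio untouched. The lower bound is what lets the band handle outliers in both directions.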

by @casinca in #4732

NemotronH and Nemotron 3 Ultra support

Day-zero training support for NVIDIA's new model families.

Even more training chat templates

Three more model families with {% generation %} markers (assistant-only loss out of the box):

Distributed backend boilerplate, hidden

A new trl/distributed.py introduces a single DistributedBackend class that detects ZeRO stage and FSDP version once, then exposes two context managers (gather_params, summon_full_params) used everywhere. Replaces the scattered getattr(state, "fsdp_plugin", None) / gather_if_zero3 / summon_full_params if ... else nullcontext() boilerplate spread across vllm_generation.py, models/utils.py, and the main trainers. Future deprecations land in one place.
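The pattern described here (detect the setup once, then hand callers a context manager they can use unconditionally) can be sketched with a toy class. `ToyDistributedBackend` is purely illustrative. It performs no real parameter gathering and does not reflect the actual DistributedBackend signature.

```python
from contextlib import contextmanager, nullcontext

class ToyDistributedBackend:
    """Toy stand-in: the sharding setup is decided once, in the constructor."""

    def __init__(self, zero_stage=0):
        self.zero_stage = zero_stage
        self.gather_calls = 0  # counts how often a real gather would have happened

    def gather_params(self, params):
        """Return a context manager; callers never branch on the backend themselves."""
        if self.zero_stage == 3:
            return self._gather(params)
        return nullcontext(params)

    @contextmanager
    def _gather(self, params):
        self.gather_calls += 1  # placeholder for gathering sharded parameters
        try:
            yield params
        finally:
            pass  # a real backend would re-partition the parameters here

# Call sites look identical whether or not parameters are sharded.
for backend in (ToyDistributedBackend(zero_stage=0), ToyDistributedBackend(zero_stage=3)):
    with backend.gather_params(["w1", "w2"]) as params:
        print(backend.zero_stage, params, backend.gather_calls)
```

Keeping the branch inside one class is what allows call sites to drop the `... if ... else nullcontext()` boilerplate.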

by @albertvillanova in #6000

Decoupled self-distillation trainers

A two-PR refactor that disentangles SDPO, SDFT, and other self-distillation trainers from their shared base, making each one self-contained and consistent with the rest of the codebase.

by @LeonEricsson in #5862 and #5883

Heads-up: SFT default loss_type will change in 1.7

Setting SFTConfig.loss_type is now optional, and leaving it unset emits a FutureWarning: in TRL 1.7 the default will switch from "nll" to "chunked_nll". No action needed — you'll just get the new default automatically on upgrade — unless you want to pin the current behavior (e.g. for custom models) with loss_type="nll".

by @qgallouedec in #5997

Other

Fixes

Documentation and Examples

CI

  • Refresh sft.json / dpo.json snapshots after transformers num_items_in_batch fix by @qgallouedec in #5845
  • Add testing for Olmo 3 by @qgallouedec in #5962
  • Align trainer train tests by @qgallouedec in #5963
  • Align trainers: Remove redundant else branch by @albertvillanova in #5983
  • [CI] Check that training chat templates keep the stop token in the loss mask by @kashif in #5988
  • Create CI workflow to sync TRL skill with huggingface/skills by @albertvillanova in #5950
  • Simplify agent skills target and default to .agents by @albertvillanova in #5987
  • chore: enable Dependabot weekly GitHub Actions bumps by @hf-dependantbot-rollout[bot] in #5910
  • Bump the actions group with 9 updates by @dependabot[bot] in #5913
  • Bump the actions group with 4 updates by @dependabot[bot] in #5954
  • chore: update docker-build.yml with version parsing by @hf-security-analysis[bot] in #5920
  • ci: use GitHub App auth for doc preview comment bot by @sergiopaniego in #5915

New Contributors

What's Changed

Full Changelog: v1.5.0...v1.6.0