Release v1.6.0 · huggingface/trl

Features

AsyncGRPO rollout worker now runs in a separate process

AsyncRolloutWorker is no longer a thread — it's a spawned child process with its own GIL. The trainer's autograd engine no longer competes with recursive_parse / accuracy_reward for the GIL, which was causing 1-5s stalls in real Qwen3-30B-A3B @ 16k runs and ultimately NCCL watchdog timeouts on other ranks.

Architectural changes:

AsyncRolloutWorker (parent) owns the child process + shared mp.Queue / mp.Value / mp.Event.
_AsyncRolloutLoop (child-only) handles tokenization, dataset iteration, reward funcs, and asyncio loops.
A new WeightTransferClient owns the NCCL group with vLLM (/pause, /resume, /init_weight_transfer_engine, /update_weights); the rollout child only talks to /v1/completions.

Two correctness fixes shipped alongside (they would have conflicted otherwise). First, the aiohttp retry is broader (it now also catches ClientPayloadError) and uses bounded exponential backoff. Second, all-NaN reward columns are now preserved: np.nansum was silently returning 0 for them, which gave unscorable completions a real advantage signal and pushed the policy away from correct answers (~30% of DeepMath / OpenR1-Math rows).
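To make the NaN fix concrete, here is a minimal pure-Python sketch, not the TRL implementation. The helper names (naive_total, total_preserving_nan, group_advantages) are hypothetical, and giving unscored completions a zero advantage is an assumption made for illustration.

```python
import math

def naive_total(rewards):
    """Mimics np.nansum: NaNs are skipped, so an all-NaN row sums to 0."""
    return sum(r for r in rewards if not math.isnan(r))

def total_preserving_nan(rewards):
    """Sum the valid rewards, but return NaN when nothing scored the row."""
    valid = [r for r in rewards if not math.isnan(r)]
    return sum(valid) if valid else math.nan

def group_advantages(totals):
    """Mean-centered advantages over scored completions only.

    Assumption for this sketch: unscored (NaN) completions get advantage 0.0
    and are left out of the baseline.
    """
    scored = [t for t in totals if not math.isnan(t)]
    baseline = sum(scored) / len(scored) if scored else 0.0
    return [0.0 if math.isnan(t) else t - baseline for t in totals]

# Four completions for one prompt, two reward functions each.
# The second completion could not be scored by either function.
rows = [[1.0, math.nan], [math.nan, math.nan], [0.0, math.nan], [1.0, math.nan]]

naive = [naive_total(r) for r in rows]
preserved = [total_preserving_nan(r) for r in rows]

print(group_advantages(naive))      # the unscorable row is penalized like a wrong answer
print(group_advantages(preserved))  # the unscorable row carries no signal
```

With the naive sum, the unscorable completion is treated exactly like an incorrect one and also drags the baseline down for the rest of the group.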

Note

reward_funcs / tools / environment_factory must now be picklable, and the child runs CPU-only (CUDA_VISIBLE_DEVICES="").
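A quick way to check the picklability requirement before launching a run is to try pickling each callable up front. The sketch below uses only the standard library; is_picklable and find_unpicklable are hypothetical helper names, not TRL APIs.

```python
import pickle

def is_picklable(obj) -> bool:
    """Return True if obj survives pickle.dumps, as a spawned child would require."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

def find_unpicklable(named_objects: dict) -> list:
    """Return the names of the objects that cannot be pickled."""
    return [name for name, obj in named_objects.items() if not is_picklable(obj)]

# A lambda has no importable name, so pickling it by reference fails.
# A builtin such as len is pickled by reference and works.
candidates = {
    "lambda_reward": lambda completions, **kwargs: [0.0] * len(completions),
    "builtin_len": len,
}
print(find_unpicklable(candidates))
```

In practice this usually means defining reward functions at module level rather than as lambdas or closures.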

by @AmineDiro in #5749

New experimental A2PO trainer (Optimal Advantage Regression)

A new A2POTrainer implements A*-PO from "Accelerating RL for LLM Reasoning with Optimal Advantage Regression". Two stages: an offline V* estimation pass from reference policy samples (with optional filter_all_incorrect to drop prompts where every reference completion fails), then on-policy training with one generation per prompt and a plain least-squares loss on β₂·log(π/π_ref) vs r − V*. No group, no critic, no clipping, no reward normalization.

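```python
from trl.experimental.a2po import A2POConfig, A2POTrainer
from trl.rewards import accuracy_reward

# `dataset` is assumed to be a prompt dataset with verifiable answers, loaded beforehand
trainer = A2POTrainer(
    model="Qwen/Qwen3-4B",
    args=A2POConfig(num_value_samples=8, filter_all_incorrect=True),
    train_dataset=dataset,
    reward_funcs=accuracy_reward,
)
trainer.train()
```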

Designed for binary verifiable rewards (math/code), not open-ended problems.
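The two stages can be sketched on scalars as follows. This is an illustration under stated assumptions, not the trainer's code. The log-mean-exp form of the V* estimate and the β₁ temperature follow the cited paper rather than these notes, and the function names are hypothetical.

```python
import math

def estimate_v_star(ref_rewards, beta1):
    """Stage 1 (assumed form): V* ~ beta1 * log(mean(exp(r / beta1)))
    over rewards of completions sampled from the reference policy."""
    scaled = [r / beta1 for r in ref_rewards]
    peak = max(scaled)  # subtract the max for numerical stability
    mean_exp = sum(math.exp(s - peak) for s in scaled) / len(scaled)
    return beta1 * (peak + math.log(mean_exp))

def keep_prompt(ref_rewards):
    """filter_all_incorrect-style check: keep a prompt only if at least one
    reference completion earned a positive reward."""
    return any(r > 0 for r in ref_rewards)

def a_star_po_loss(logp, ref_logp, reward, v_star, beta2):
    """Stage 2: squared error between beta2 * log(pi / pi_ref) and r - V*."""
    residual = beta2 * (logp - ref_logp) - (reward - v_star)
    return residual ** 2

ref_rewards = [1.0, 0.0]  # binary rewards from two reference samples
v_star = estimate_v_star(ref_rewards, beta1=1.0)
loss = a_star_po_loss(logp=-1.0, ref_logp=-1.5, reward=1.0, v_star=v_star, beta2=1.0)
print(keep_prompt(ref_rewards), v_star, loss)
```

Because V* is fixed offline, the second stage needs only one sampled completion per prompt and no group statistics.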

by @raghulchandramouli in #5940

KTO now supports VLMs + big alignment push

The biggest KTO ↔ DPO alignment cycle yet: KTOTrainer now supports vision-language models, alongside a deep restructuring of compute_loss, KL dataset generation, ref-logp precomputation, activation offloading, sampler strategy, metrics, and more. KTO's graduation out of trl.experimental is very close.

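```python
from trl.experimental.kto import KTOConfig, KTOTrainer

# `vision_kto_dataset` is assumed to be a KTO-format dataset with images, loaded beforehand
trainer = KTOTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    args=KTOConfig(...),
    train_dataset=vision_kto_dataset,
)
```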

VLM support by @albertvillanova in #5939, plus 20 alignment PRs, all by @albertvillanova: #5820, #5849, #5852, #5850, #5866, #5864, #5856, #5872, #5875, #5900, #5901, #5899, #5906, #5909, #5914, #5982, #5936, #5996, #5998, #5999.

Cross-tokenizer alignment in GOLD via byte offsets

The GOLD distillation trainer used to align student/teacher tokens by extending two decoded strings and flushing on equality. It silently broke on any byte-level disagreement — including the common case of one tokenizer prepending BOS while the other doesn't (Llama-3 ↔ Qwen-3). The X-Token paper called this out by name.

Each side now carries (start_byte, end_byte) spans derived once from the fast tokenizer's char offsets, and the walker syncs on cumulative byte boundaries. On the on-policy path, spans come from piece_byte_len over the sampled token ids (not from re-encoding the decoded completion — BPE makes that round-trip non-injective).
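The idea of syncing on cumulative byte boundaries can be sketched as below. This is a simplified illustration, not the GOLD trainer's walker: it takes token pieces as plain strings instead of deriving spans from a fast tokenizer's offsets, and byte_spans / align_by_bytes are hypothetical names.

```python
def byte_spans(pieces):
    """Turn a list of token strings into (start_byte, end_byte) spans in UTF-8."""
    spans, pos = [], 0
    for piece in pieces:
        width = len(piece.encode("utf-8"))
        spans.append((pos, pos + width))
        pos += width
    return spans

def align_by_bytes(student_spans, teacher_spans):
    """Group token indices from both sides whenever their end bytes coincide."""
    groups = []
    i = j = 0
    s_start = t_start = 0
    while i < len(student_spans) and j < len(teacher_spans):
        s_end = student_spans[i][1]
        t_end = teacher_spans[j][1]
        if s_end == t_end:
            # Both sides reached the same byte boundary: flush one aligned group.
            groups.append((list(range(s_start, i + 1)), list(range(t_start, j + 1))))
            i += 1
            j += 1
            s_start, t_start = i, j
        elif s_end < t_end:
            i += 1  # student is behind, consume another student token
        else:
            j += 1  # teacher is behind, consume another teacher token
    return groups

# Two different tokenizations of the same text ("é" is two bytes in UTF-8).
student = ["caf", "é", " au", " lait"]
teacher = ["café", " au la", "it"]
groups = align_by_bytes(byte_spans(student), byte_spans(teacher))
print(groups)  # [([0, 1], [0]), ([2, 3], [1, 2])]
```

Each group pairs the student token indices with the teacher token indices that cover the same byte range, so the two sides never have to agree on token boundaries or decoded strings.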

Two related fixes shipped: long rows no longer lose the completion (now keeping the last max_length tokens), and the vLLM on-policy original_prompt_text is now decoded from the truncated ids the student actually consumed.

by @kashif in #5885

SDFT / SDPO: live teacher logprobs from the vLLM server

When teacher_model_kind="live" and vllm_mode="server", the vLLM generation server already holds the current student weights (synced every step for rollouts). The new use_teacher_server=True flag scores the teacher's log-probs on that same server instead of running a separate local teacher forward — removing the teacher from the training step entirely.

Supported modes: sampled_token (reverse KL on the realized token) and topk_logits. When buffered batches are reused across steps (num_iterations > 1), weights are re-synced before scoring so the teacher never scores with stale weights.
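As a rough illustration of the sampled_token mode, a reverse-KL term on the realized token can be estimated from the two log-probabilities of the token that was actually sampled. The sketch below assumes a single-sample estimator, log p_student(y) - log p_teacher(y), averaged over unmasked completion tokens. The function name, sign convention, and masking are assumptions, not the trainer's exact loss.

```python
def sampled_token_reverse_kl(student_logps, teacher_logps, mask):
    """Average of (student log-prob - teacher log-prob) over unmasked tokens.

    All three arguments are equal-length lists covering one completion;
    mask entries are 1 for completion tokens and 0 for padding.
    """
    terms = [(s - t) * m for s, t, m in zip(student_logps, teacher_logps, mask)]
    n_tokens = sum(mask)
    return sum(terms) / n_tokens if n_tokens else 0.0

student_logps = [-1.0, -2.0, -0.5]
teacher_logps = [-1.5, -1.0, -0.5]  # in server mode these would come back from vLLM
mask = [1, 1, 0]
print(sampled_token_reverse_kl(student_logps, teacher_logps, mask))  # -0.25
```

Only one log-probability per token is needed from the server in this mode, which keeps the response payload small compared with full top-k logits.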

by @kashif in #5989

Bidirectional masked importance sampling (MIS) for IcePop

vLLM importance sampling in GRPO now uses a two-sided band [C_min, C_max] instead of a single upper cap, aligning TIS/MIS with IcePop's bidirectional handling of train–inference ratio outliers.

```python
from trl import GRPOConfig

config = GRPOConfig(
    vllm_importance_sampling_clip_min=0.5,
    vllm_importance_sampling_clip_max=2.0,
    vllm_importance_sampling_correction="mask",  # or "truncate"
)
```

The old vllm_importance_sampling_cap is deprecated and maps to clip_max.
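One plausible reading of the two correction modes on a single importance ratio is sketched below. It is a scalar illustration under assumptions, not the GRPO code, which operates on tensors and may differ in detail. correct_is_ratio is a hypothetical name.

```python
def correct_is_ratio(ratio, clip_min, clip_max, mode):
    """Apply a two-sided band [clip_min, clip_max] to one importance ratio.

    "truncate": clamp the ratio into the band.
    "mask": keep in-band ratios unchanged and zero out everything outside.
    """
    if mode == "truncate":
        return min(max(ratio, clip_min), clip_max)
    if mode == "mask":
        return ratio if clip_min <= ratio <= clip_max else 0.0
    raise ValueError(f"unknown correction mode: {mode!r}")

for ratio in (0.1, 1.2, 3.0):
    print(
        ratio,
        correct_is_ratio(ratio, 0.5, 2.0, "truncate"),
        correct_is_ratio(ratio, 0.5, 2.0, "mask"),
    )
```

With a single upper cap, a ratio of 0.1 would pass through untouched. The lower bound is what lets both modes also handle tokens the trainer finds far less likely than the inference engine did.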

by @casinca in #4732

NemotronH and Nemotron 3 Ultra support

Day-zero training support for NVIDIA's new model families.

NemotronH integration by @qgallouedec in #5938
Nemotron 3 Ultra support by @qgallouedec in #5942
Enable gradient checkpointing in Nemotron 3 SFT example by @sergiopaniego in #5944

Even more training chat templates

Three more model families with {% generation %} markers (assistant-only loss out of the box):

Qwen2.5-VL by @aazizyan in #5838
Qwen2-VL by @aazizyan in #5839
Llava-Next by @aazizyan in #5959

Distributed backend boilerplate, hidden

A new trl/distributed.py introduces a single DistributedBackend class that detects ZeRO stage and FSDP version once, then exposes two context managers (gather_params, summon_full_params) used everywhere. Replaces the scattered getattr(state, "fsdp_plugin", None) / gather_if_zero3 / summon_full_params if ... else nullcontext() boilerplate spread across vllm_generation.py, models/utils.py, and the main trainers. Future deprecations land in one place.
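The pattern can be illustrated with a toy class that decides once whether gathering is needed and then exposes context managers that are no-ops otherwise. ToyDistributedBackend below is hypothetical and only records events. It does not reproduce the real DistributedBackend's constructor or detection logic.

```python
from contextlib import contextmanager

class ToyDistributedBackend:
    """Toy stand-in: stores the detected setup once, hides the branching."""

    def __init__(self, zero_stage=0, fsdp_version=0):
        self.zero_stage = zero_stage
        self.fsdp_version = fsdp_version
        self.events = []  # records what a real backend would actually do

    @contextmanager
    def gather_params(self, params):
        needed = self.zero_stage == 3  # only ZeRO-3 shards parameters
        if needed:
            self.events.append(("gather", len(params)))
        try:
            yield params
        finally:
            if needed:
                self.events.append(("release", len(params)))

    @contextmanager
    def summon_full_params(self, module):
        needed = self.fsdp_version > 0
        if needed:
            self.events.append(("summon", module))
        try:
            yield module
        finally:
            if needed:
                self.events.append(("unsummon", module))

# Call sites look the same whether or not anything is sharded.
zero3 = ToyDistributedBackend(zero_stage=3)
with zero3.gather_params(["w1", "w2"]):
    pass
plain = ToyDistributedBackend()
with plain.gather_params(["w1", "w2"]), plain.summon_full_params("model"):
    pass
print(zero3.events, plain.events)
```

The call site no longer needs an `if ... else nullcontext()` branch, which is the boilerplate the refactor removes.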

by @albertvillanova in #6000

Decoupled self-distillation trainers

A two-PR refactor that disentangles SDPO, SDFT, and other self-distillation trainers from their shared base, making each one self-contained and consistent with the rest of the codebase.

by @LeonEricsson in #5862 and #5883

Heads-up: SFT default `loss_type` will change in 1.7

Setting SFTConfig.loss_type is now optional, and leaving it unset emits a FutureWarning: in TRL 1.7 the default will switch from "nll" to "chunked_nll". No action needed — you'll just get the new default automatically on upgrade — unless you want to pin the current behavior (e.g. for custom models) with loss_type="nll".
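The behavior described above follows a common deprecation pattern: an unset value resolves to the current default and emits a FutureWarning, while an explicit value is passed through silently. The sketch below illustrates that pattern with the standard library only. resolve_loss_type is a hypothetical helper, not the SFTConfig code.

```python
import warnings

def resolve_loss_type(loss_type=None):
    """Return the effective loss type, warning when the caller relied on the default."""
    if loss_type is None:
        warnings.warn(
            "loss_type is unset; the default will change from 'nll' to "
            "'chunked_nll' in 1.7. Pass loss_type='nll' to keep the current behavior.",
            FutureWarning,
            stacklevel=2,
        )
        return "nll"
    return loss_type

# Capture warnings so the demo shows exactly when one is emitted.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    unset_result = resolve_loss_type()        # relies on the default, warns
    pinned_result = resolve_loss_type("nll")  # explicit value, silent
print(unset_result, pinned_result, len(caught))
```

Pinning the value explicitly is what silences the warning and keeps the behavior stable across the upgrade.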

by @qgallouedec in #5997

Other

Support 'None' as CLI value for Optional[T] fields by @qgallouedec in #5843
Support non-lm_head output projections in chunked SFT loss (GPTNeoX) by @qgallouedec in #5857
SFTTrainer: merge entropy and accuracy computation to eliminate redundant logits copy by @flutist in #5897
Remove redundant .contiguous() calls in DPOTrainer to reduce peak memory by @flutist in #5926
Remove unnecessary explicit .contiguous() before entropy_from_logits by @qgallouedec in #5930
Exclude None reward completions from GRPO/RLOO advantage baseline by @AmineDiro in #5902
Support multimodal config in PPO ValueHead by @albertvillanova in #5907
Support vision datasets for Liger in DPO by @albertvillanova in #5943
Raise if precompute_ref_log_probs with vision datasets in DPO by @albertvillanova in #5867
🔒 Gate trainer telemetry on an explicit class-name allowlist by @qgallouedec in #5851
Update vLLM version support to 0.19.0 by @sergiopaniego in #5879
Improve error message when image tokens are truncated by max_length by @lxk8998 in #5927
Padding-free invariance test by @qgallouedec in #5842
Per-field invariance tolerances, calibrated by @qgallouedec in #5844

Fixes

Fix loss_type="chunked_nll" under DeepSpeed ZeRO-3 by @qgallouedec in #5873
Fix GRPO use_liger_kernel under DeepSpeed ZeRO-3 by @kashif in #5891
async_grpo: don't return on queue.Empty by @AmineDiro in #5751
Don't treat ROCm GPUs as Ampere by @kashif in #5917
Route liger student forward through DDP wrapper in GKD, GOLD, and Distillation trainers by @albertvillanova in #5934
Fix backbone access in GRPO by aligning with SFT by @albertvillanova in #5949
Fix priority order in PPO ValueHead and raise ValueError for unsupported config by @albertvillanova in #5908
Fix generate_batch: inference tensors block inplace ops in background thread by @albertvillanova in #5818 (cross-listed from v1.5 changelog window)
Fix SFT padding-free test config by @kashif in #5923
Specify encoding="utf-8" when reading .jinja chat templates on Windows by @ColebyPearson in #5869
Fix ValueError by pinning kernels < 0.15.1 by @albertvillanova in #5880
Set kernels optional dependency via transformers by @albertvillanova in #5884
Support kernels extra for transformers < 5.1.0 by @albertvillanova in #5928
Add missing use_liger_kernel guard to SDPO teacher-server validation by @DaoyuanLi2816 in #5994
Flash Attention capitalization fix by @qgallouedec in #5855

Documentation and Examples

Remove NeMo Gym Integration Guide (broken) by @cmunley1 in #5840
docs(GRPOTrainer): remove duplicate sentence by @zafstojano in #5957
docs(RLOOTrainer): fix blockquote math not rendering by @zafstojano in #5958
docs: highlight the role of KL in RLOO compared to GRPO by @zafstojano in #5966
docs: clarify PPO entropy metrics in PPO trainer docs by @biefan in #5289
docs: update OpenEnv GitHub org references and package name by @sergiopaniego in #5919
docs: update OpenEnv doc URLs to huggingface.co/docs/openenv by @sergiopaniego in #5929
docs: sync SDFT/SDPO config docstrings with their fields by @DaoyuanLi2816 in #5992
docs: sync Distillation/GOLD/OnlineDPO config docstrings with their fields by @DaoyuanLi2816 in #5995
docs: Document bnb_4bit_quant_storage and normalize docstring param headers by @DaoyuanLi2816 in #5993
docs: fix rendering typos by @zafstojano in #5991
fix(docs): correct broken GKD Trainer link in MiniLLM docs by @DaoyuanLi2816 in #5960 and #5961
fix(docs): drop duplicate "a" in online_dpo_vlm example description by @DaoyuanLi2816 in #5978
Fix broken doc links by @DaoyuanLi2816 in #5971
Fix broken code examples in docs (RLOO syntax, SFTConfig max_length) by @DaoyuanLi2816 in #5970
Fix malformed ScaleRL paper link in GRPOConfig epsilon_high help by @DaoyuanLi2816 in #5972
fix(cli): drop duplicate "to" in trl skills install description by @DaoyuanLi2816 in #6008
Remove invalid max_prompt_length argument from GRPO example by @DaoyuanLi2816 in #5964

CI

Refresh sft.json / dpo.json snapshots after transformers num_items_in_batch fix by @qgallouedec in #5845
Add testing for Olmo 3 by @qgallouedec in #5962
Align trainer train tests by @qgallouedec in #5963
Align trainers: Remove redundant else branch by @albertvillanova in #5983
[CI] Check that training chat templates keep the stop token in the loss mask by @kashif in #5988
Create CI workflow to sync TRL skill with huggingface/skills by @albertvillanova in #5950
Simplify agent skills target and default to .agents by @albertvillanova in #5987
chore: enable Dependabot weekly GitHub Actions bumps by @hf-dependantbot-rollout[bot] in #5910
Bump the actions group with 9 updates by @dependabot[bot] in #5913
Bump the actions group with 4 updates by @dependabot[bot] in #5954
chore: update docker-build.yml with version parsing by @hf-security-analysis[bot] in #5920
ci: use GitHub App auth for doc preview comment bot by @sergiopaniego in #5915

New Contributors

@ColebyPearson made their first contribution in #5869
@hf-dependantbot-rollout[bot] made their first contribution in #5910
@raghulchandramouli made their first contribution in #5940
@zafstojano made their first contribution in #5957
@DaoyuanLi2816 made their first contribution in #5964
@lxk8998 made their first contribution in #5927
@biefan made their first contribution in #5289

What's Changed

⬆️ Bump dev version by @qgallouedec in #5836
Add Qwen2.5-VL original and training chat template with generation markers by @aazizyan in #5838
Align KTO with DPO: Simplify metrics from sum/count to direct averages by @albertvillanova in #5820
async_grpo don't return on queue.Empty by @AmineDiro in #5751
Align KTO with DPO: Refactor forward by @albertvillanova in #5849
Per-field invariance tolerances, calibrated by @qgallouedec in #5844
Add Qwen2-VL original and training chat template with generation markers by @aazizyan in #5839
Remove NeMo Gym Integration Guide (broken) by @cmunley1 in #5840
Align KTO with DPO: Align compute_ref_log_probs by @albertvillanova in #5852
Align KTO with DPO: Align precompute_ref_logps by @albertvillanova in #5850
Flash Attention capitalization fix by @qgallouedec in #5855
🔒 Gate trainer telemetry on an explicit class-name allowlist by @qgallouedec in #5851
Align KTO with DPO: Support remove_unused_columns by @albertvillanova in #5866
Raise if precompute_ref_log_probs with vision datasets in DPO by @albertvillanova in #5867
Support 'None' as CLI value for Optional[T] fields by @qgallouedec in #5843
KTO: Replace _get_train_sampler with train_sampling_strategy for transformers >= 5.2.0 by @albertvillanova in #5864
Fix: specify encoding="utf-8" when reading .jinja chat templates on Windows by @ColebyPearson in #5869
Align KTO with DPO: Align ref log probability names by @albertvillanova in #5856
KTO: Support non-sequential train_sampling_strategy for apo_zero_unpaired by @albertvillanova in #5872
Align KTO with DPO: Remove null_ref_context by @albertvillanova in #5875
Fix ValueError by pinning kernels < 0.15.1 by @albertvillanova in #5880
Update vLLM version support to 0.19.0 by @sergiopaniego in #5879
Set kernels optional dependency via transformers by @albertvillanova in #5884
Support non-lm_head output projections in chunked SFT loss (GPTNeoX) by @qgallouedec in #5857
SFTTrainer: merge entropy and accuracy computation to eliminate redundant logits copy by @flutist in #5897
Align KTO with DPO: Add disable_gradient_checkpointing to ref model forward passes by @albertvillanova in #5900
Align KTO with DPO: Add activation offloading support by @albertvillanova in #5901
Align KTO with DPO: Decouple KL dataset generation by @albertvillanova in #5899
Fix GRPO use_liger_kernel under DeepSpeed ZeRO-3 by @kashif in #5891
Replace custom numpy cache in precompute_ref_logps with native datasets by @albertvillanova in #5906
[1/2] refactor: decoupled self distillation trainers (sdpo, sdft, ...) by @LeonEricsson in #5862
Align KTO with DPO: Use datasets caching in precompute_ref_logps by @albertvillanova in #5909
Support multimodal config in PPO ValueHead by @albertvillanova in #5907
Fix priority order in PPO ValueHead and raise ValueError for unsupported config by @albertvillanova in #5908
Fix loss_type="chunked_nll" under DeepSpeed ZeRO-3 by @qgallouedec in #5873
chore: enable Dependabot weekly GitHub Actions bumps by @hf-dependantbot-rollout[bot] in #5910
Exclude None reward completions from GRPO/RLOO advantage baseline by @AmineDiro in #5902
Don't treat ROCm GPUs as Ampere by @kashif in #5917
ci: use GitHub App auth for doc preview comment bot by @sergiopaniego in #5915
Bump the actions group with 9 updates by @dependabot[bot] in #5913
Align KTO with DPO: Replace completion_labels/get_batch_logps with completion_mask by @albertvillanova in #5914
chore: update docker-build.yml with version parsing by @hf-security-analysis[bot] in #5920
docs: update OpenEnv GitHub org references and package name by @sergiopaniego in #5919
Support kernels extra for transformers < 5.1.0 by @albertvillanova in #5928
feat: move async rollout worker to separate process by @AmineDiro in #5749
Remove redundant .contiguous() calls in DPOTrainer to reduce peak memory by @flutist in #5926
docs: update OpenEnv doc URLs from meta-pytorch.org to huggingface.co/docs/openenv by @sergiopaniego in #5929
Refresh sft.json / dpo.json snapshots after transformers num_items_in_batch fix by @qgallouedec in #5845
NemotronH integration by @qgallouedec in #5938
Nemotron 3 Ultra support by @qgallouedec in #5942
Enable gradient checkpointing in Nemotron 3 SFT example (transformers>=5.7.0) by @sergiopaniego in #5944
Fix SFT padding-free test config by @kashif in #5923
Add experimental A2PO trainer (Optimal Advantage Regression) by @raghulchandramouli in #5940
Align KTO with DPO: Inline _compute_logps into _compute_loss by @albertvillanova in #5936
Fix: Route liger student forward through DDP wrapper in GKD, GOLD, and Distillation trainers by @albertvillanova in #5934
Fix backbone access in GRPO by aligning with SFT by @albertvillanova in #5949
fix(docs): Remove duplicate sentence in GRPOTrainer docs by @zafstojano in #5957
fix(docs): Blockquote math not rendering in RLooTrainer docs by @zafstojano in #5958
Remove invalid max_prompt_length argument from GRPO example by @DaoyuanLi2816 in #5964
Bump the actions group with 4 updates by @dependabot[bot] in #5954
Add Llava-Next training templates support with generation markers by @aazizyan in #5959
fix(docs): correct broken GKD Trainer link in MiniLLM docs by @DaoyuanLi2816 in #5961
Remove unnecessary explicit .contiguous() before entropy_from_logits by @qgallouedec in #5930
Improve error message when image tokens are truncated by max_length by @lxk8998 in #5927
fix(docs): correct broken GKD Trainer link in MiniLLM docs by @DaoyuanLi2816 in #5960
Cross-tokenizer alignment via byte offsets in GOLD trainer by @kashif in #5885
[2/2] refactor: decoupled self distillation trainers; cleanup by @LeonEricsson in #5883
Create CI workflow to sync TRL skill with huggingface/skills by @albertvillanova in #5950
Support vision datasets for Liger in DPO by @albertvillanova in #5943
Align trainer train tests by @qgallouedec in #5963
Fix malformed ScaleRL paper link in GRPOConfig epsilon_high help by @DaoyuanLi2816 in #5972
Add testing for Olmo3 by @qgallouedec in #5962
chore(docs): Highlight the role of KL in RLOO compared to GRPO by @zafstojano in #5966
fix(docs): drop duplicate "a" in online_dpo_vlm example description by @DaoyuanLi2816 in #5978
Fix broken doc links (CONTRIBUTING online DPO paths, async GRPO anchor) by @DaoyuanLi2816 in #5971
Align KTO with DPO: Support VLM by @albertvillanova in #5939
Fix broken code examples in docs (RLOO syntax, SFTConfig max_length) by @DaoyuanLi2816 in #5970
Align KTO with DPO: Improve error message for VLM truncation by @albertvillanova in #5982
Align trainers: Remove redundant else branch by @albertvillanova in #5983
SDFT/SDPO: live teacher logprobs from the vLLM server by @kashif in #5989
docs: sync SDFT/SDPO config docstrings with their fields by @DaoyuanLi2816 in #5992
fix(docs): Fix rendering typos by @zafstojano in #5991
Add missing use_liger_kernel guard to SDPO teacher-server validation by @DaoyuanLi2816 in #5994
docs: sync Distillation/GOLD/OnlineDPO config docstrings with their fields by @DaoyuanLi2816 in #5995
feat: Bidirectional masked importance sampling ratio (MIS) for IcePop by @casinca in #4732
Simplify agent skills target and default to .agents by @albertvillanova in #5987
Align KTO with DPO: Remove unused use_dpo_data_collator attribute by @albertvillanova in #5996
Align KTO with DPO: Rename kto_loss_fn to liger_loss_fn by @albertvillanova in #5998
Align KTO with DPO: Inline kto_loss in _compute_loss by @albertvillanova in #5999
Document bnb_4bit_quant_storage and normalize docstring param headers by @DaoyuanLi2816 in #5993
[CI] Check that training chat templates keep the stop token in the loss mask by @kashif in #5988
Announce upcoming SFT loss_type default change from 'nll' to 'chunked_nll' by @qgallouedec in #5997
Padding-free invariance test by @qgallouedec in #5842
Hide DeepSpeed/FSDP distributed backend boilerplate by @albertvillanova in #6000
fix(cli): drop duplicate "to" in trl skills install description by @DaoyuanLi2816 in #6008
docs: clarify PPO entropy metrics in PPO trainer docs by @biefan in #5289
Release: v1.6 by @qgallouedec in #6009

Full Changelog: v1.5.0...v1.6.0
