Features
AsyncGRPO rollout worker now runs in a separate process
AsyncRolloutWorker is no longer a thread — it's a spawned child process with its own GIL. The trainer's autograd engine no longer competes with recursive_parse / accuracy_reward for the GIL, which was causing 1-5s stalls in real Qwen3-30B-A3B @ 16k runs and ultimately NCCL watchdog timeouts on other ranks.
Architectural changes:
- AsyncRolloutWorker (parent) owns the child process + shared mp.Queue / mp.Value / mp.Event.
- _AsyncRolloutLoop (child-only) handles tokenization, dataset iteration, reward funcs, and asyncio loops.
- A new WeightTransferClient owns the NCCL group with vLLM (/pause, /resume, /init_weight_transfer_engine, /update_weights); the rollout child only talks to /v1/completions.
Two correctness fixes shipped alongside (they would have conflicted otherwise): broader aiohttp retry (now catches ClientPayloadError) with bounded exponential backoff, and all-NaN reward columns are now preserved — np.nansum was silently returning 0, giving unscorable completions a real advantage signal and pushing the policy away from correct answers (~30% of DeepMath / OpenR1-Math rows).
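The np.nansum pitfall is easy to reproduce in isolation. The sketch below assumes a reward matrix with one row per completion and one column per reward function; the helper name sum_rewards_keep_nan is hypothetical and is not the trainer's actual code.

```python
import numpy as np


def sum_rewards_keep_nan(rewards: np.ndarray) -> np.ndarray:
    """Sum rewards per completion, keeping NaN where no reward function scored it."""
    total = np.nansum(rewards, axis=1)
    # A row that is NaN everywhere has no valid score at all, so restore NaN
    # instead of the 0.0 that nansum reports for it.
    unscorable = np.isnan(rewards).all(axis=1)
    return np.where(unscorable, np.nan, total)


rewards = np.array(
    [
        [1.0, np.nan],     # scored by one reward function
        [np.nan, np.nan],  # unscorable completion
        [0.0, 0.5],        # scored by both
    ]
)
naive = np.nansum(rewards, axis=1)       # the unscorable row silently becomes 0.0
fixed = sum_rewards_keep_nan(rewards)    # the unscorable row stays NaN
print(naive, fixed)
```

With the NaN preserved, downstream code can exclude the completion instead of treating it as a genuine zero reward.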
Note
reward_funcs / tools / environment_factory must now be picklable, and the child runs CPU-only (CUDA_VISIBLE_DEVICES="").
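A quick way to check the picklability requirement before launching a run is to try pickling the callable yourself. This is a minimal sketch: the reward signature and the is_picklable helper are illustrative assumptions, not part of TRL.

```python
import pickle


def exact_match_reward(completions, answers):
    """Module-level reward (hypothetical signature): 1.0 on an exact match."""
    return [float(c.strip() == a.strip()) for c, a in zip(completions, answers)]


def is_picklable(obj) -> bool:
    """Return True if obj survives pickle.dumps, as a spawned child requires."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# Lambdas and locally defined closures cannot be pickled by reference.
lambda_reward = lambda completions, answers: [0.0 for _ in completions]

print(is_picklable(exact_match_reward))
print(is_picklable(lambda_reward))
```

Functions defined at module level are pickled by name, so moving a lambda or nested function to the top of an importable module is usually enough.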
by @AmineDiro in #5749
New experimental A2PO trainer (Optimal Advantage Regression)
A new A2POTrainer implements A*-PO from "Accelerating RL for LLM Reasoning with Optimal Advantage Regression". Two stages: an offline V* estimation pass from reference policy samples (with optional filter_all_incorrect to drop prompts where every reference completion fails), then on-policy training with one generation per prompt and a plain least-squares loss on β₂·log(π/π_ref) vs r − V*. No group, no critic, no clipping, no reward normalization.
from trl.experimental.a2po import A2POConfig, A2POTrainer
trainer = A2POTrainer(
model="Qwen/Qwen3-4B",
args=A2POConfig(num_value_samples=8, filter_all_incorrect=True),
train_dataset=dataset,
reward_funcs=accuracy_reward,
)
trainer.train()
Designed for binary verifiable rewards (math/code), not open-ended problems.
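The two stages can be sketched in plain Python. The log-mean-exp form of the V* estimate and the β₁ temperature are assumptions based on our reading of the paper (the notes above only name β₂); the function names are hypothetical and do not mirror A2POTrainer internals.

```python
import math


def estimate_v_star(ref_rewards, beta1):
    """Stage 1 (assumed form): beta1 * log-mean-exp of reference-sample rewards."""
    shift = max(r / beta1 for r in ref_rewards)  # subtract the max for stability
    mean_exp = sum(math.exp(r / beta1 - shift) for r in ref_rewards) / len(ref_rewards)
    return beta1 * (shift + math.log(mean_exp))


def keep_prompt(ref_rewards):
    """Mimics filter_all_incorrect: drop prompts where every reference sample fails."""
    return any(r > 0 for r in ref_rewards)


def a2po_loss(logp_policy, logp_ref, reward, v_star, beta2):
    """Stage 2: squared error between beta2 * log(pi / pi_ref) and r - V*."""
    return (beta2 * (logp_policy - logp_ref) - (reward - v_star)) ** 2


v_star = estimate_v_star([1.0, 0.0], beta1=1.0)
loss = a2po_loss(logp_policy=-1.2, logp_ref=-1.0, reward=1.0, v_star=v_star, beta2=0.1)
print(v_star, loss)
```

Because V* is computed once offline, the on-policy stage needs only a single generation per prompt to form the regression target r − V*.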
by @raghulchandramouli in #5940
KTO now supports VLMs + big alignment push
The biggest KTO ↔ DPO alignment cycle yet — KTOTrainer now supports vision-language models, plus a deep restructuring of compute_loss, KL dataset generation, ref-logp precomputation, activation offloading, sampler strategy, metrics, and more. KTO graduation is very close.
from trl.experimental.kto import KTOConfig, KTOTrainer
trainer = KTOTrainer(
model="Qwen/Qwen2.5-VL-3B-Instruct",
args=KTOConfig(...),
train_dataset=vision_kto_dataset,
)
VLM support by @albertvillanova in #5939. Plus ~20 alignment PRs all by @albertvillanova: #5820, #5849, #5852, #5850, #5866, #5864, #5856, #5872, #5875, #5900, #5901, #5899, #5906, #5909, #5914, #5982, #5936, #5996, #5998, #5999.
Cross-tokenizer alignment in GOLD via byte offsets
The GOLD distillation trainer used to align student/teacher tokens by extending two decoded strings and flushing on equality. It silently broke on any byte-level disagreement — including the common case of one tokenizer prepending BOS while the other doesn't (Llama-3 ↔ Qwen-3). The X-Token paper called this out by name.
Each side now carries (start_byte, end_byte) spans derived once from the fast tokenizer's char offsets, and the walker syncs on cumulative byte boundaries. On the on-policy path, spans come from piece_byte_len over the sampled token ids (not from re-encoding the decoded completion — BPE makes that round-trip non-injective).
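The idea of syncing on cumulative byte boundaries can be illustrated with a small walker. This is a simplified sketch, not the GOLD trainer's code: byte_spans and align_by_bytes are hypothetical helpers, token pieces are given as plain strings, and a BOS token is modeled as an empty piece with a zero-width span.

```python
def byte_spans(pieces):
    """Turn token strings into cumulative (start_byte, end_byte) spans."""
    spans, pos = [], 0
    for piece in pieces:
        n = len(piece.encode("utf-8"))
        spans.append((pos, pos + n))
        pos += n
    return spans


def align_by_bytes(student_spans, teacher_spans):
    """Group token indices each time both sides reach the same byte boundary."""
    # Zero-width spans (e.g. a BOS that contributes no text) are skipped.
    s = [(k, end) for k, (start, end) in enumerate(student_spans) if end > start]
    t = [(k, end) for k, (start, end) in enumerate(teacher_spans) if end > start]
    groups, s_grp, t_grp = [], [], []
    i = j = 0
    s_end = t_end = 0
    while i < len(s) or j < len(t):
        # Advance whichever side is behind in bytes (student first on ties).
        if j >= len(t) or (i < len(s) and s_end <= t_end):
            s_grp.append(s[i][0])
            s_end = s[i][1]
            i += 1
        else:
            t_grp.append(t[j][0])
            t_end = t[j][1]
            j += 1
        if s_end == t_end and s_grp and t_grp:
            groups.append((s_grp, t_grp))
            s_grp, t_grp = [], []
    return groups


student = byte_spans(["", "Hello", " world"])      # leading "" stands in for BOS
teacher = byte_spans(["Hel", "lo", " wor", "ld"])  # no BOS on this side
groups = align_by_bytes(student, teacher)
print(groups)
```

Here the student's BOS no longer shifts the alignment: "Hello" pairs with "Hel" + "lo", and " world" pairs with " wor" + "ld".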
Two related fixes shipped: long rows no longer lose the completion (now keeping the last max_length tokens), and the vLLM on-policy original_prompt_text is now decoded from the truncated ids the student actually consumed.
SDFT / SDPO: live teacher logprobs from the vLLM server
When teacher_model_kind="live" and vllm_mode="server", the vLLM generation server already holds the current student weights (synced every step for rollouts). The new use_teacher_server=True flag scores the teacher's log-probs on that same server instead of running a separate local teacher forward — removing the teacher from the training step entirely.
Supported modes: sampled_token (reverse KL on the realized token) and topk_logits. When buffered batches are reused across steps (num_iterations > 1), weights are re-synced before scoring so the teacher never scores with stale weights.
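One way to read the sampled_token mode is as a single-sample estimate of reverse KL: at each position, the student's log-prob of the token it actually sampled minus the teacher's log-prob of that same token. The sketch below assumes a simple masked mean over completion tokens; the function name and the reduction are illustrative, not the trainers' exact loss.

```python
def sampled_token_reverse_kl(student_logps, teacher_logps, mask):
    """Masked mean of log p_student(y_t) - log p_teacher(y_t) over sampled tokens."""
    terms = [s - t for s, t, m in zip(student_logps, teacher_logps, mask) if m]
    return sum(terms) / len(terms) if terms else 0.0


# Per-token log-probs of the realized tokens; the last position is padding.
student_logps = [-1.0, -2.0, -0.5]
teacher_logps = [-1.5, -2.0, -3.0]
mask = [1, 1, 0]
estimate = sampled_token_reverse_kl(student_logps, teacher_logps, mask)
print(estimate)
```

Only one teacher log-prob per token is needed in this mode, which keeps the payload returned by the server small.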
Bidirectional masked importance sampling (MIS) for IcePop
vLLM importance sampling in GRPO now uses a two-sided band [C_min, C_max] instead of a single upper cap, aligning TIS/MIS with IcePop's bidirectional handling of train–inference ratio outliers.
from trl import GRPOConfig
config = GRPOConfig(
vllm_importance_sampling_clip_min=0.5,
vllm_importance_sampling_clip_max=2.0,
vllm_importance_sampling_correction="mask", # or "truncate"
)
The old vllm_importance_sampling_cap is deprecated and maps to clip_max.
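The difference between the two corrections can be sketched on scalar ratios. This assumes "mask" zeroes out any ratio outside [C_min, C_max] and "truncate" clamps it into the band; the helper importance_weight is hypothetical and the trainer's exact semantics may differ.

```python
def importance_weight(ratio, clip_min, clip_max, correction="mask"):
    """Apply a two-sided band to a train/inference probability ratio."""
    if correction == "mask":
        # Outliers on either side contribute nothing to the loss.
        return ratio if clip_min <= ratio <= clip_max else 0.0
    if correction == "truncate":
        # Outliers are clamped to the nearest edge of the band.
        return min(max(ratio, clip_min), clip_max)
    raise ValueError(f"unknown correction: {correction!r}")


ratios = [0.2, 0.9, 1.4, 3.0]
masked = [importance_weight(r, 0.5, 2.0, "mask") for r in ratios]
truncated = [importance_weight(r, 0.5, 2.0, "truncate") for r in ratios]
print(masked)     # [0.0, 0.9, 1.4, 0.0]
print(truncated)  # [0.5, 0.9, 1.4, 2.0]
```

A single upper cap would have let the 0.2 ratio through unchanged; the lower bound is what makes the handling bidirectional.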
NemotronH and Nemotron 3 Ultra support
Day-zero training support for NVIDIA's new model families.
- NemotronH integration by @qgallouedec in #5938
- Nemotron 3 Ultra support by @qgallouedec in #5942
- Enable gradient checkpointing in Nemotron 3 SFT example by @sergiopaniego in #5944
Even more training chat templates
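Three more model families with {% generation %} markers (assistant-only loss out of the box):
- Qwen2.5-VL by @aazizyan in #5838
- Qwen2-VL by @aazizyan in #5839
- Llava-Next by @aazizyan in #5959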
Distributed backend boilerplate, hidden
A new trl/distributed.py introduces a single DistributedBackend class that detects ZeRO stage and FSDP version once, then exposes two context managers (gather_params, summon_full_params) used everywhere. Replaces the scattered getattr(state, "fsdp_plugin", None) / gather_if_zero3 / summon_full_params if ... else nullcontext() boilerplate spread across vllm_generation.py, models/utils.py, and the main trainers. Future deprecations land in one place.
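The pattern is roughly "detect once, then hand out either a real context manager or a no-op". The toy class below only illustrates that shape: ToyDistributedBackend, its constructor arguments, and the tracing placeholder are all hypothetical, and the real DistributedBackend signatures may differ.

```python
from contextlib import contextmanager, nullcontext


class ToyDistributedBackend:
    """Illustrative stand-in, not TRL's DistributedBackend."""

    def __init__(self, zero_stage=0, fsdp_version=0):
        # Detection would happen once here; the sketch takes the values directly.
        self.zero_stage = zero_stage
        self.fsdp_version = fsdp_version
        self.log = []

    @contextmanager
    def _traced(self, label):
        # Placeholder for the real gather/summon logic: only records enter/exit.
        self.log.append(f"enter {label}")
        try:
            yield
        finally:
            self.log.append(f"exit {label}")

    def gather_params(self):
        return self._traced("zero3 gather") if self.zero_stage == 3 else nullcontext()

    def summon_full_params(self):
        return self._traced("fsdp summon") if self.fsdp_version else nullcontext()


backend = ToyDistributedBackend(zero_stage=3)
with backend.gather_params():
    pass  # parameters would be fully materialized here
with backend.summon_full_params():
    pass  # no FSDP in this toy setup, so this is a no-op
print(backend.log)
```

Call sites then use the same with-statement regardless of the backend, which is what removes the per-file if/else nullcontext() branches.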
by @albertvillanova in #6000
Decoupled self-distillation trainers
A two-PR refactor that disentangles SDPO, SDFT, and other self-distillation trainers from their shared base, making each one self-contained and consistent with the rest of the codebase.
by @LeonEricsson in #5862 and #5883
Heads-up: SFT default loss_type will change in 1.7
Setting SFTConfig.loss_type is now optional, and leaving it unset emits a FutureWarning: in TRL 1.7 the default will switch from "nll" to "chunked_nll". No action needed — you'll just get the new default automatically on upgrade — unless you want to pin the current behavior (e.g. for custom models) with loss_type="nll".
by @qgallouedec in #5997
Other
- Support 'None' as CLI value for Optional[T] fields by @qgallouedec in #5843
- Support non-lm_head output projections in chunked SFT loss (GPTNeoX) by @qgallouedec in #5857
- SFTTrainer: merge entropy and accuracy computation to eliminate redundant logits copy by @flutist in #5897
- Remove redundant .contiguous() calls in DPOTrainer to reduce peak memory by @flutist in #5926
- Remove unnecessary explicit .contiguous() before entropy_from_logits by @qgallouedec in #5930
- Exclude None reward completions from GRPO/RLOO advantage baseline by @AmineDiro in #5902
- Support multimodal config in PPO ValueHead by @albertvillanova in #5907
- Support vision datasets for Liger in DPO by @albertvillanova in #5943
- Raise if precompute_ref_log_probs with vision datasets in DPO by @albertvillanova in #5867
- 🔒 Gate trainer telemetry on an explicit class-name allowlist by @qgallouedec in #5851
- Update vLLM version support to 0.19.0 by @sergiopaniego in #5879
- Improve error message when image tokens are truncated by max_length by @lxk8998 in #5927
- Padding-free invariance test by @qgallouedec in #5842
- Per-field invariance tolerances, calibrated by @qgallouedec in #5844
Fixes
- Fix loss_type="chunked_nll" under DeepSpeed ZeRO-3 by @qgallouedec in #5873
- Fix GRPO use_liger_kernel under DeepSpeed ZeRO-3 by @kashif in #5891
- async_grpo: don't return on queue.Empty by @AmineDiro in #5751
- Don't treat ROCm GPUs as Ampere by @kashif in #5917
- Route liger student forward through DDP wrapper in GKD, GOLD, and Distillation trainers by @albertvillanova in #5934
- Fix backbone access in GRPO by aligning with SFT by @albertvillanova in #5949
- Fix priority order in PPO ValueHead and raise ValueError for unsupported config by @albertvillanova in #5908
- Fix generate_batch: inference tensors block inplace ops in background thread by @albertvillanova in #5818 (cross-listed from v1.5 changelog window)
- Fix SFT padding-free test config by @kashif in #5923
- Specify encoding="utf-8" when reading .jinja chat templates on Windows by @ColebyPearson in #5869
- Fix ValueError by pinning kernels < 0.15.1 by @albertvillanova in #5880
- Set kernels optional dependency via transformers by @albertvillanova in #5884
- Support kernels extra for transformers < 5.1.0 by @albertvillanova in #5928
- Add missing use_liger_kernel guard to SDPO teacher-server validation by @DaoyuanLi2816 in #5994
- Flash Attention capitalization fix by @qgallouedec in #5855
Documentation and Examples
- Remove NeMo Gym Integration Guide (broken) by @cmunley1 in #5840
- docs(GRPOTrainer): remove duplicate sentence by @zafstojano in #5957
- docs(RLOOTrainer): fix blockquote math not rendering by @zafstojano in #5958
- docs: highlight the role of KL in RLOO compared to GRPO by @zafstojano in #5966
- docs: clarify PPO entropy metrics in PPO trainer docs by @biefan in #5289
- docs: update OpenEnv GitHub org references and package name by @sergiopaniego in #5919
- docs: update OpenEnv doc URLs to huggingface.co/docs/openenv by @sergiopaniego in #5929
- docs: sync SDFT/SDPO config docstrings with their fields by @DaoyuanLi2816 in #5992
- docs: sync Distillation/GOLD/OnlineDPO config docstrings with their fields by @DaoyuanLi2816 in #5995
- docs: Document bnb_4bit_quant_storage and normalize docstring param headers by @DaoyuanLi2816 in #5993
- docs: fix rendering typos by @zafstojano in #5991
- fix(docs): correct broken GKD Trainer link in MiniLLM docs by @DaoyuanLi2816 in #5960 and #5961
- fix(docs): drop duplicate "a" in online_dpo_vlm example description by @DaoyuanLi2816 in #5978
- Fix broken doc links by @DaoyuanLi2816 in #5971
- Fix broken code examples in docs (RLOO syntax, SFTConfig max_length) by @DaoyuanLi2816 in #5970
- Fix malformed ScaleRL paper link in GRPOConfig epsilon_high help by @DaoyuanLi2816 in #5972
- fix(cli): drop duplicate "to" in trl skills install description by @DaoyuanLi2816 in #6008
- Remove invalid max_prompt_length argument from GRPO example by @DaoyuanLi2816 in #5964
CI
- Refresh sft.json/dpo.json snapshots after transformers num_items_in_batch fix by @qgallouedec in #5845
- Add testing for Olmo 3 by @qgallouedec in #5962
- Align trainer train tests by @qgallouedec in #5963
- Align trainers: Remove redundant else branch by @albertvillanova in #5983
- [CI] Check that training chat templates keep the stop token in the loss mask by @kashif in #5988
- Create CI workflow to sync TRL skill with huggingface/skills by @albertvillanova in #5950
- Simplify agent skills target and default to .agents by @albertvillanova in #5987
- chore: enable Dependabot weekly GitHub Actions bumps by @hf-dependantbot-rollout[bot] in #5910
- Bump the actions group with 9 updates by @dependabot[bot] in #5913
- Bump the actions group with 4 updates by @dependabot[bot] in #5954
- chore: update docker-build.yml with version parsing by @hf-security-analysis[bot] in #5920
- ci: use GitHub App auth for doc preview comment bot by @sergiopaniego in #5915
New Contributors
- @ColebyPearson made their first contribution in #5869
- @hf-dependantbot-rollout[bot] made their first contribution in #5910
- @raghulchandramouli made their first contribution in #5940
- @zafstojano made their first contribution in #5957
- @DaoyuanLi2816 made their first contribution in #5964
- @lxk8998 made their first contribution in #5927
- @biefan made their first contribution in #5289
What's Changed
- ⬆️ Bump dev version by @qgallouedec in #5836
- Add Qwen2.5-VL original and training chat template with generation markers by @aazizyan in #5838
- Align KTO with DPO: Simplify metrics from sum/count to direct averages by @albertvillanova in #5820
- async_grpo don't return on queue.Empty by @AmineDiro in #5751
- Align KTO with DPO: Refactor forward by @albertvillanova in #5849
- Per-field invariance tolerances, calibrated by @qgallouedec in #5844
- Add Qwen2-VL original and training chat template with generation markers by @aazizyan in #5839
- Remove NeMo Gym Integration Guide (broken) by @cmunley1 in #5840
- Align KTO with DPO: Align compute_ref_log_probs by @albertvillanova in #5852
- Align KTO with DPO: Align precompute_ref_logps by @albertvillanova in #5850
- Flash Attention capitalization fix by @qgallouedec in #5855
- 🔒 Gate trainer telemetry on an explicit class-name allowlist by @qgallouedec in #5851
- Align KTO with DPO: Support remove_unused_columns by @albertvillanova in #5866
- Raise if precompute_ref_log_probs with vision datasets in DPO by @albertvillanova in #5867
- Support 'None' as CLI value for Optional[T] fields by @qgallouedec in #5843
- KTO: Replace _get_train_sampler with train_sampling_strategy for transformers >= 5.2.0 by @albertvillanova in #5864
- Fix: specify encoding="utf-8" when reading .jinja chat templates on Windows by @ColebyPearson in #5869
- Align KTO with DPO: Align ref log probability names by @albertvillanova in #5856
- KTO: Support non-sequential train_sampling_strategy for apo_zero_unpaired by @albertvillanova in #5872
- Align KTO with DPO: Remove null_ref_context by @albertvillanova in #5875
- Fix ValueError by pinning kernels < 0.15.1 by @albertvillanova in #5880
- Update vLLM version support to 0.19.0 by @sergiopaniego in #5879
- Set kernels optional dependency via transformers by @albertvillanova in #5884
- Support non-lm_head output projections in chunked SFT loss (GPTNeoX) by @qgallouedec in #5857
- SFTTrainer: merge entropy and accuracy computation to eliminate redundant logits copy by @flutist in #5897
- Align KTO with DPO: Add disable_gradient_checkpointing to ref model forward passes by @albertvillanova in #5900
- Align KTO with DPO: Add activation offloading support by @albertvillanova in #5901
- Align KTO with DPO: Decouple KL dataset generation by @albertvillanova in #5899
- Fix GRPO use_liger_kernel under DeepSpeed ZeRO-3 by @kashif in #5891
- Replace custom numpy cache in precompute_ref_logps with native datasets by @albertvillanova in #5906
- [1/2] refactor: decoupled self distillation trainers (sdpo, sdft, ...) by @LeonEricsson in #5862
- Align KTO with DPO: Use datasets caching in precompute_ref_logps by @albertvillanova in #5909
- Support multimodal config in PPO ValueHead by @albertvillanova in #5907
- Fix priority order in PPO ValueHead and raise ValueError for unsupported config by @albertvillanova in #5908
- Fix loss_type="chunked_nll" under DeepSpeed ZeRO-3 by @qgallouedec in #5873
- chore: enable Dependabot weekly GitHub Actions bumps by @hf-dependantbot-rollout[bot] in #5910
- Exclude None reward completions from GRPO/RLOO advantage baseline by @AmineDiro in #5902
- Don't treat ROCm GPUs as Ampere by @kashif in #5917
- ci: use GitHub App auth for doc preview comment bot by @sergiopaniego in #5915
- Bump the actions group with 9 updates by @dependabot[bot] in #5913
- Align KTO with DPO: Replace completion_labels/get_batch_logps with completion_mask by @albertvillanova in #5914
- chore: update docker-build.yml with version parsing by @hf-security-analysis[bot] in #5920
- docs: update OpenEnv GitHub org references and package name by @sergiopaniego in #5919
- Support kernels extra for transformers < 5.1.0 by @albertvillanova in #5928
- feat: move async rollout worker to separate process by @AmineDiro in #5749
- Remove redundant .contiguous() calls in DPOTrainer to reduce peak memory by @flutist in #5926
- docs: update OpenEnv doc URLs from meta-pytorch.org to huggingface.co/docs/openenv by @sergiopaniego in #5929
- Refresh sft.json/dpo.json snapshots after transformers num_items_in_batch fix by @qgallouedec in #5845
- NemotronH integration by @qgallouedec in #5938
- Nemotron 3 Ultra support by @qgallouedec in #5942
- Enable gradient checkpointing in Nemotron 3 SFT example (transformers>=5.7.0) by @sergiopaniego in #5944
- Fix SFT padding-free test config by @kashif in #5923
- Add experimental A2PO trainer (Optimal Advantage Regression) by @raghulchandramouli in #5940
- Align KTO with DPO: Inline _compute_logps into _compute_loss by @albertvillanova in #5936
- Fix: Route liger student forward through DDP wrapper in GKD, GOLD, and Distillation trainers by @albertvillanova in #5934
- Fix backbone access in GRPO by aligning with SFT by @albertvillanova in #5949
- fix(docs): Remove duplicate sentence in GRPOTrainer docs by @zafstojano in #5957
- fix(docs): Blockquote math not rendering in RLooTrainer docs by @zafstojano in #5958
- Remove invalid max_prompt_length argument from GRPO example by @DaoyuanLi2816 in #5964
- Bump the actions group with 4 updates by @dependabot[bot] in #5954
- Add Llava-Next training templates support with generation markers by @aazizyan in #5959
- fix(docs): correct broken GKD Trainer link in MiniLLM docs by @DaoyuanLi2816 in #5961
- Remove unnecessary explicit .contiguous() before entropy_from_logits by @qgallouedec in #5930
- Improve error message when image tokens are truncated by max_length by @lxk8998 in #5927
- fix(docs): correct broken GKD Trainer link in MiniLLM docs by @DaoyuanLi2816 in #5960
- Cross-tokenizer alignment via byte offsets in GOLD trainer by @kashif in #5885
- [2/2] refactor: decoupled self distillation trainers; cleanup by @LeonEricsson in #5883
- Create CI workflow to sync TRL skill with huggingface/skills by @albertvillanova in #5950
- Support vision datasets for Liger in DPO by @albertvillanova in #5943
- Align trainer train tests by @qgallouedec in #5963
- Fix malformed ScaleRL paper link in GRPOConfig epsilon_high help by @DaoyuanLi2816 in #5972
- Add testing for Olmo3 by @qgallouedec in #5962
- chore(docs): Highlight the role of KL in RLOO compared to GRPO by @zafstojano in #5966
- fix(docs): drop duplicate "a" in online_dpo_vlm example description by @DaoyuanLi2816 in #5978
- Fix broken doc links (CONTRIBUTING online DPO paths, async GRPO anchor) by @DaoyuanLi2816 in #5971
- Align KTO with DPO: Support VLM by @albertvillanova in #5939
- Fix broken code examples in docs (RLOO syntax, SFTConfig max_length) by @DaoyuanLi2816 in #5970
- Align KTO with DPO: Improve error message for VLM truncation by @albertvillanova in #5982
- Align trainers: Remove redundant else branch by @albertvillanova in #5983
- SDFT/SDPO: live teacher logprobs from the vLLM server by @kashif in #5989
- docs: sync SDFT/SDPO config docstrings with their fields by @DaoyuanLi2816 in #5992
- fix(docs): Fix rendering typos by @zafstojano in #5991
- Add missing use_liger_kernel guard to SDPO teacher-server validation by @DaoyuanLi2816 in #5994
- docs: sync Distillation/GOLD/OnlineDPO config docstrings with their fields by @DaoyuanLi2816 in #5995
- feat: Bidirectional masked importance sampling ratio (MIS) for IcePop by @casinca in #4732
- Simplify agent skills target and default to .agents by @albertvillanova in #5987
- Align KTO with DPO: Remove unused use_dpo_data_collator attribute by @albertvillanova in #5996
- Align KTO with DPO: Rename kto_loss_fn to liger_loss_fn by @albertvillanova in #5998
- Align KTO with DPO: Inline kto_loss in _compute_loss by @albertvillanova in #5999
- Document bnb_4bit_quant_storage and normalize docstring param headers by @DaoyuanLi2816 in #5993
- [CI] Check that training chat templates keep the stop token in the loss mask by @kashif in #5988
- Announce upcoming SFT loss_type default change from 'nll' to 'chunked_nll' by @qgallouedec in #5997
- Padding-free invariance test by @qgallouedec in #5842
- Hide DeepSpeed/FSDP distributed backend boilerplate by @albertvillanova in #6000
- fix(cli): drop duplicate "to" in trl skills install description by @DaoyuanLi2816 in #6008
- docs: clarify PPO entropy metrics in PPO trainer docs by @biefan in #5289
- Release: v1.6 by @qgallouedec in #6009
Full Changelog: v1.5.0...v1.6.0