Release v5.11.0
New Model additions
DiffusionGemma
DiffusionGemma is designed to reduce the sequential bottleneck of standard causal language models, using an encoder-decoder architecture optimized for inference speed. During inference it uses multi-canvas sampling: rather than generating one token at a time, the model iteratively denoises a full block of tokens with a diffusion sampler. This block-autoregressive approach generates text faster than token-by-token decoding.
Links: Documentation
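The control flow of block-autoregressive denoising can be illustrated with a toy sketch. This is not DiffusionGemma's sampler: the toy_denoiser function, the mask-and-commit schedule, and all parameter names are assumptions made for illustration, and the "model" just proposes random tokens with random confidences. The point is only the loop structure: blocks are produced left to right, and each block is refined over a few denoising steps instead of one model call per token.

```python
import random

MASK = "<mask>"

def toy_denoiser(context, block, rng):
    """Hypothetical stand-in for a model: proposes a (token, confidence)
    pair for every position in the block that is still masked."""
    vocab = ["a", "b", "c", "d"]
    return {i: (rng.choice(vocab), rng.random())
            for i, tok in enumerate(block) if tok == MASK}

def generate(num_blocks, block_size, steps, seed=0):
    rng = random.Random(seed)
    output, model_calls = [], 0
    per_step = -(-block_size // steps)  # ceil division: tokens committed per step
    for _ in range(num_blocks):         # autoregressive over blocks
        block = [MASK] * block_size     # every block starts fully masked
        while MASK in block:            # iterative denoising within the block
            proposals = toy_denoiser(output, block, rng)
            model_calls += 1
            # Commit only the most confident proposals at this step.
            best = sorted(proposals, key=lambda i: proposals[i][1],
                          reverse=True)[:per_step]
            for i in best:
                block[i] = proposals[i][0]
        output.extend(block)            # finished block becomes context
    return output, model_calls

tokens, calls = generate(num_blocks=3, block_size=8, steps=4)
print(len(tokens), calls)  # 24 12
```

In this toy setting, 24 tokens are produced with 12 denoiser calls, where one-token-at-a-time decoding would need 24.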
DeepSeek-V3.2
DeepSeek-V3.2-Exp is an experimental model from DeepSeek-AI that introduces DeepSeek Sparse Attention (DSA), a trainable, fine-grained sparse attention mechanism designed to improve training and inference efficiency in long-context scenarios. It is built on top of DeepSeek-V3.1-Terminus with a 685B-parameter Mixture-of-Experts backbone. By attending only to a selected subset of past tokens, it reduces the quadratic cost of attention over long sequences while maintaining virtually identical benchmark performance. The work was extended in DeepSeek-V3.2, which pairs DSA with scalable reinforcement learning and achieves gold-medal-level results on competition math and competitive programming benchmarks.
Links: Documentation | Paper
- Add deepseek 3.2 exp (#41251) by @ArthurZucker in #41251
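The idea of attending only to a selected subset of past tokens can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the DSA implementation: the per-token selection scores are passed in as a hypothetical index_scores list, the top-k selection rule is a simplification, and the function name and signature are invented for this example.

```python
import math

def sparse_attention(query, keys, values, index_scores, k):
    """Attend from one query to only the k past tokens with the highest
    selection score (index_scores is a hypothetical input)."""
    # Keep the k highest-scoring past positions; the rest are never touched.
    selected = sorted(range(len(keys)), key=lambda i: index_scores[i],
                      reverse=True)[:k]
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * x for q, x in zip(query, keys[i]))
              for i in selected]
    # Numerically stable softmax over the selected positions only.
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, selected))
           for d in range(dim)]
    return out, sorted(selected)

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
index_scores = [0.9, 0.1, 0.8, 0.2]

out, picked = sparse_attention(query, keys, values, index_scores, k=2)
print(picked)  # [0, 2]
```

With a fixed k, the softmax and weighted sum for each query cost O(k) rather than O(sequence length), which is where the saving on long sequences comes from in this simplified picture.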
Kernels
The KernelConfig API was extended to support n-to-1 module fusion and parameter transformation, simplifying how custom kernels are integrated with Transformers modules. Additional fixes include resolving a dtype mismatch in the Mamba2 CUDA kernel path for NemotronH/Zamba2, adding fine-grained fp8/fp4 Triton kernel support, and correcting the FalconMamba fast-path warning to recommend pip install kernels instead of mamba-ssm.
- Extended & simplified n-to-1 kernel fusion via KernelConfig (#46339) by @michaelbenayoun in [#46339]
- Triton finegrained fp8/fp4 (#46407) by @IlyasMoutawwakil in [#46407]
- Fix dtype mismatch in NemotronH/Zamba2 Mamba2 CUDA-kernel path (out_proj) (#46487) by @yuekaizhang in [#46487]
- fix(falcon_mamba): recommend pip install kernels in fast-path warning (#46343) by @Anai-Guo in [#46343]
Parallelization
Fixed model parallel beam search bugs in the Qwen2-VL, Qwen2.5-VL, and Qwen3-VL MoE model families, and added documentation for tensor parallelism support with continuous batching.
- [docs] tp for continuous batching (#46019) by @stevhliu in [#46019]
- revisit history parallel beam search tests to avoid unnecessary fix (#46495) by @kaixuanliu in [#46495]
- fix qwen series VL model's model parallel bug (#46316) by @kaixuanliu in [#46316]
Bugfixes and improvements
- Fix the offsets in processing (#46525) by @zucchini-nlp in [#46525]
- Fix buggy action sha pin (#46534) by @ydshieh in [#46534]
- Fix trailing comma bug in DataCollatorForLanguageModeling example (#46527) by @JemmaUZH in [#46527]
- Fix missing Gemma4Processor._compute_audio_num_tokens (#46416) by @csantosbh in [#46416]
- Fix InternVL models (#46524) by @hmellor in [#46524]
- fix(afmoe): reduce tokens in test_compile_static_cache to avoid flaky bfloat16 drift (#46521) by @ydshieh in [#46521]
- [CB] Add a "max_requests_per_batch" parameter (#46434) by @remi-or in [#46434]
- revamp cv docs and fix rf-detr (#46219) by @merveenoyan in [#46219]
- Update hub metadata (#46379) by @zucchini-nlp in [#46379]
- extend DeepseekV4FlashIntegrationTest to non-cuda device (#46517) by @sywangyi in [#46517]
- [docs] deepgemm (#46361) by @stevhliu in [#46361]
- [fix] regression introduced by #45534 (#46456) by @eustlb in [#46456]
- Use torchvision's native LANCZOS interpolation instead of PIL fallback (#46496) by @NicolasHug in [#46496]
- Add debugging info in pr-ci-caller.yml (#46505) by @ydshieh in [#46505]
- Fix tests: 'Cohere2MoeModel' object has no attribute 'hf_device_map' (#46337) by @kaixuanliu in [#46337]
- Bump the actions group across 1 directory with 19 updates (#46414) by @dependabot[bot] in [#46414]
- Log some information in .github/workflows/pr-ci-post-dashboard-link.yml (#46499) by @ydshieh in [#46499]
- feat(quantizers): support non-weight param names in TorchAo safetensors loading (#46325) by @agesf in [#46325]
- docs: fix typo in make_list_of_images docstring (#46469) by @ramkumar27072006 in [#46469]
- add XPU expectation for deepseek_ocr2 model tests (#46492) by @kaixuanliu in [#46492]
- Fix sapiens2 tests: add XPU device expectations (#46488) by @kaixuanliu in [#46488]
- Add vLLM smoke test to CI (#46383) by @hmellor in [#46383]
- extend deepseek v4 test to xpu (#46366) by @sywangyi in [#46366]
- Added cosmos3 model (#46146) by @MaciejBalaNV in [#46146]
- fbgemm_fp8:Keep the current device aligned with the input tensor (#46403) by @kaixuanliu in [#46403]
- [Modular] Add no_inherit_decorators and fixup wrong RoPE related inheritances (#46440) by @Bissmella in [#46440]
- skip deepgemm test except cuda (#46090) by @jiqing-feng in [#46090]
- Fix/video classification pipeline video processor (#46256) by @J3r3myPerera in [#46256]
- ci: less flaky test_assisted_decoding_matches_greedy_search_1_same (#46445) by @ydshieh in [#46445]
- Fix flip_back graph break (#46344) by @guarin in [#46344]
- Add the other processors to auto-mappings (#46046) by @zucchini-nlp in [#46046]
- fix: compatibility with torch<=2.7 (#46393) by @andylin-hao in [#46393]
- fix: remove dynamic per-actor Slack ID lookup in ssh-runner workflow (#46327) by @ydshieh in [#46327]
- [docs] Romanian translation of pipeline_tutorial.md, pipeline_gradio.md, pipeline_webserver.md and add_new_pipeline.md. (#46388) by @filipinescu in [#46388]
- [docs] gemma4 typos (#46351) by @stevhliu in [#46351]
- [docs] padding-free training (#46333) by @stevhliu in [#46333]
- fix[vLLM x v5]: Default untied embeddings in AudioFlamingo3 and VibeVoice (#46400) by @harshaljanjani in [#46400]
- Fix deepspeed docker (#46108) by @SunMarc in [#46108]
- Fix conversion for clip models (#46406) by @zucchini-nlp in [#46406]
- ci: mention code quality failure in CI dashboard comment (#46415) by @ydshieh in [#46415]
- Fix noisy logging from image_processing module aliases issue - 46298 (#46350) by @skshmjn in [#46350]
- Raise tqdm minimum to 4.60 to match tqdm.contrib.logging import (#46397) by @n0gu-furiosa in [#46397]
- fix(gemma4_unified): conversion script and config bugs (#46398) by @douglas-reid in [#46398]
- [docs] remove sparsity from compressed-tensors (#46387) by @stevhliu in [#46387]
- [CB] Fix crashes when fork is not possible (#46251) by @remi-or in [#46251]
- Improve CI dashboard comment: rename and deduplicate (#46412) by @ydshieh in [#46412]
- Fix missing f-string prefixes in error messages (#46354) by @joaopedroassad in [#46354]
- Add workflow to post CI Grafana dashboard link to PR (#46410) by @ydshieh in [#46410]
- [docs] Romanian translation of fast_tokenizers.md, custom_tokenizers.md, tokenizer_summary.md, image_processors.md and video_processors.md. (#46356) by @filipinescu in [#46356]
- Clean up new models after release (#46092) by @zucchini-nlp in [#46092]
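Two entries in this list (#46527 and #46354) concern common Python pitfalls: a stray trailing comma and a missing f-string prefix. The snippet below is a generic illustration of both, not the code changed in those PRs; the variable names and values are hypothetical.

```python
# Pitfall 1: a trailing comma silently turns a value into a one-element tuple.
mlm_probability = 0.15,          # hypothetical value; this is (0.15,), not 0.15
fixed_probability = 0.15         # without the comma it is a float

# Pitfall 2: without the f prefix, placeholders are emitted literally.
model_name = "my-model"          # hypothetical name
bad_message = "Unknown model {model_name}"
good_message = f"Unknown model {model_name}"

print(type(mlm_probability).__name__, type(fixed_probability).__name__)
print(bad_message)
print(good_message)
```

Neither mistake raises an error where it is written, which is why such bugs tend to survive in examples and error messages until someone reads the output.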
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @ArthurZucker
  - Add deepseek 3.2 exp (#41251)
- @gante
  - GPU go brr (#46540)
- @merveenoyan
  - revamp cv docs and fix rf-detr (#46219)
- @sgerrard
  - Quantization for small models (#46449)
- @MaciejBalaNV
  - Added cosmos3 model (#46146)
- @J3r3myPerera
  - Fix/video classification pipeline video processor (#46256)
- @filipinescu