
Release v5.13.0


@vasqu released this 03 Jul 16:06


New Model additions

Kimi K2.5, K2.6, and K2.7


This release includes the Kimi K2.5 architecture, which is shared by Kimi K2.5 through K2.7:

Kimi K2.5 is an open-source, native multimodal agentic model that advances practical capabilities in long-horizon coding, coding-driven design, proactive autonomous execution, and swarm-based task orchestration. The model was proposed in Kimi K2.5: Visual Agentic Intelligence and further improved in Kimi K2.6: Advancing Open-Source Coding.

Kimi K2.5 achieves significant improvements on complex, end-to-end coding tasks, generalizing robustly across programming languages (Rust, Go, Python) and domains spanning front-end, DevOps, and performance optimization. The model is capable of transforming simple prompts and visual inputs into production-ready interfaces and lightweight full-stack workflows, generating structured layouts, interactive elements, and rich animations with deliberate aesthetic precision.

Links: Documentation

MiMo-V2-Flash


MiMo-V2-Flash is a Mixture-of-Experts (MoE) language model developed by the Xiaomi MiMo team. Designed to establish a new balance between long-context modeling capabilities and inference efficiency, the model is built for strong performance in complex reasoning and agentic tasks. Trained on 27T tokens with native 32k sequence lengths, MiMo-V2-Flash seamlessly supports an extended 256K context window while significantly reducing KV-cache storage compared to standard global attention models.

Links: Documentation

Nemotron 3.5 ASR


Nemotron 3.5 ASR is a 600M-parameter multilingual speech recognition model from NVIDIA, built for high-quality transcription in both low-latency streaming and high-throughput batch settings, with native punctuation and capitalization. For streaming, it offers configurable chunk sizes (80 ms, 160 ms, 560 ms, and 1120 ms), letting users trade off latency against accuracy to suit their application. Its cache-aware FastConformer-RNNT architecture is central to this capability: unlike traditional buffered streaming, which repeatedly reprocesses overlapping audio windows, the model processes only each new incoming chunk while reusing cached encoder context from prior chunks. This eliminates redundant computation, significantly improves efficiency, and minimizes end-to-end delay without sacrificing accuracy, making it well suited to real-time transcription workloads.

Links: Documentation

NemotronAsrStreaming

Nemotron ASR Streaming is a 600M-parameter English speech recognition model from NVIDIA, built for high-quality transcription in both low-latency streaming and high-throughput batch settings, with native punctuation and capitalization. For streaming, it offers configurable chunk sizes (80 ms, 160 ms, 560 ms, and 1120 ms), letting users trade off latency against accuracy to suit their application. Its cache-aware FastConformer-RNNT architecture is central to this capability: unlike traditional buffered streaming, which repeatedly reprocesses overlapping audio windows, the model processes only each new incoming chunk while reusing cached encoder context from prior chunks. This eliminates redundant computation, significantly improves efficiency, and minimizes end-to-end delay without sacrificing accuracy, making it well suited to real-time transcription workloads.
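
The saving from cache-aware streaming can be estimated with simple arithmetic. The sketch below is not the model's implementation: it only counts how many milliseconds of audio an encoder would process under each scheme. The function names, the clip length, and the 2240 ms left-context window are assumptions chosen for illustration; only the chunk sizes come from the description above.

```python
CHUNK_SIZES_MS = (80, 160, 560, 1120)  # chunk sizes listed in the release notes


def buffered_encoded_ms(total_ms, chunk_ms, context_ms):
    """Audio encoded (in ms) when every step re-encodes past context."""
    encoded = 0
    for start in range(0, total_ms, chunk_ms):
        new_audio = min(chunk_ms, total_ms - start)
        overlap = min(context_ms, start)  # nothing to re-encode before t=0
        encoded += overlap + new_audio
    return encoded


def cache_aware_encoded_ms(total_ms, chunk_ms):
    """Audio encoded (in ms) when earlier context comes from a cache."""
    encoded = 0
    for start in range(0, total_ms, chunk_ms):
        encoded += min(chunk_ms, total_ms - start)  # only the new chunk
    return encoded


# Hypothetical clip length and left-context window, both in milliseconds.
total_ms, context_ms = 11_200, 2_240
for chunk_ms in CHUNK_SIZES_MS:
    buffered = buffered_encoded_ms(total_ms, chunk_ms, context_ms)
    cached = cache_aware_encoded_ms(total_ms, chunk_ms)
    print(f"chunk={chunk_ms:>4} ms  buffered={buffered:>7} ms  cache-aware={cached} ms")
```

Under these assumptions the buffered cost grows as chunks get smaller, because the same context is re-encoded more often, while the cache-aware cost stays equal to the clip length.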

Links: Documentation

Qwen3 ASR


Qwen3 ASR is an automatic speech recognition model from Alibaba's Qwen team that combines a Whisper-style audio encoder with a Qwen3 language model decoder for speech-to-text transcription. The model supports automatic language detection and multilingual transcription.

A forced aligner model is also included. It can be used to timestamp a provided transcript and its audio. It uses the same audio encoder model with a classification head that predicts a word's length. This model can be used with the transcript from any ASR model (see the example below with Parakeet CTC).

Links: Documentation

ZAYA


ZAYA1 is a 760M active / 8.4B total parameter MoE language model trained by Zyphra. It combines Compressed Convolutional Attention (CCA), a nonlinear ZAYA1 router, and residual scaling.

Links: Documentation

VideoPrism

The VideoPrism model was proposed in the paper VideoPrism: A Foundational Visual Encoder for Video Understanding by Google DeepMind (blog post).

VideoPrism is a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model. The model is pretrained on a large-scale heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips with noisy parallel text (e.g., ASR transcripts). The pretraining approach improves upon masked autoencoding through global-local distillation of semantic video embeddings and a token shuffling scheme, enabling the model to focus primarily on the video modality while leveraging text associated with videos. VideoPrism achieves state-of-the-art performance on 31 out of 33 video understanding benchmarks across four broad task groups, from web video question answering to computer vision for science.

Links: Documentation

RADIO

RADIO (Reduce All Domains Into One) is a family of vision foundation models from NVIDIA trained by multi-teacher distillation (e.g. CLIP, DINOv2, SAM) into a single ViT backbone. It produces both an image-level summary embedding and dense spatial features, and supports variable input resolutions through a Cropped Position Embedding (CPE) patch generator.

Links: Documentation

MiniCPM3

MiniCPM3 is the third-generation MiniCPM dense language model from OpenBMB. The 4B variant (openbmb/MiniCPM3-4B) outperforms many 7B–9B open models on standard benchmarks while remaining lightweight enough for on-device usage.

MiniCPM3 combines several architectural ideas:

  • Multi-head Latent Attention (MLA) from DeepSeek-V2, which compresses the key/value cache
    into a low-rank latent representation while still using rotary embeddings on a portion of the
    query/key heads.
  • A standard SwiGLU MLP (no MoE).
  • Three scalar scaling factors that govern signal flow:
    • scale_emb — scales input embeddings.
    • scale_depth / sqrt(num_hidden_layers) — scales residual connections.
    • hidden_size / dim_model_base — scales hidden states before the language model head.
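
As a rough illustration of where the three scaling factors act, the sketch below applies them to plain Python lists. It is not the Transformers implementation: the helper names and the numeric values are assumptions for illustration, and the last factor is assumed to be applied as a division of the hidden states before the head.

```python
import math

# Illustrative values only (assumptions, not read from a checkpoint config).
SCALE_EMB = 12.0
SCALE_DEPTH = 1.4
NUM_HIDDEN_LAYERS = 62
HIDDEN_SIZE = 2560
DIM_MODEL_BASE = 256


def scale_embeddings(embeddings, scale_emb=SCALE_EMB):
    """Multiply the input embeddings by scale_emb."""
    return [scale_emb * value for value in embeddings]


def residual_add(residual, branch_output,
                 scale_depth=SCALE_DEPTH, num_layers=NUM_HIDDEN_LAYERS):
    """Add a sublayer output to the residual stream, scaled by
    scale_depth / sqrt(num_hidden_layers)."""
    factor = scale_depth / math.sqrt(num_layers)
    return [r + factor * b for r, b in zip(residual, branch_output)]


def scale_before_lm_head(hidden, hidden_size=HIDDEN_SIZE, base=DIM_MODEL_BASE):
    """Rescale hidden states before the language model head.
    Assumption: the factor hidden_size / dim_model_base is a divisor."""
    factor = hidden_size / base
    return [h / factor for h in hidden]


print(scale_embeddings([0.5, -0.25]))
print(residual_add([1.0, 1.0], [2.0, -2.0]))
print(scale_before_lm_head([10.0, 20.0]))
```

The residual factor shrinks as the model gets deeper, which keeps the accumulated contribution of many layers bounded.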

Links: Documentation

Breaking changes

A broad set of modeling changes has been made to standardize layer declarations, mask/cache construction, and hybrid-attention handling, making many models cleanly exportable (ONNX, torch.export, ExecuTorch) and fullgraph-compilable. Users relying on internal modeling APIs may need to update their code accordingly.

Attention masking for image tokens in Gemma 3/4 models has been fixed to correctly respect sliding window boundaries in local layers, which changes model behavior and may affect reproducibility of previous results.

  • 🚨 [gemma 3/4] Fix bidirectional attention masking crossing sliding window boundaries (#46850) by @douglas-reid
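
To make the behavior change concrete, the following toy mask builder shows the intended rule: image tokens may attend bidirectionally within their own image, but in a local layer nothing may reach past the sliding window. This is an illustrative sketch, not the library's mask code. The function name, the `image_ids` encoding, and the window convention (`abs(q - k) < window`) are assumptions.

```python
def build_mask(image_ids, window=None):
    """Return mask[q][k] == True when query q may attend to key k.

    image_ids[i] is None for a text token, or an integer identifying the
    image block that token i belongs to. `window` is the sliding-window
    size of a local layer, or None for a global layer.
    """
    n = len(image_ids)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            causal = k <= q
            same_image = image_ids[q] is not None and image_ids[q] == image_ids[k]
            allowed = causal or same_image  # bidirectional inside an image
            if window is not None:
                # The window bound applies to bidirectional links as well.
                allowed = allowed and abs(q - k) < window
            mask[q][k] = allowed
    return mask


# One text token, a four-token image, then another text token.
ids = [None, 0, 0, 0, 0, None]
global_mask = build_mask(ids)
local_mask = build_mask(ids, window=2)
print(global_mask[1][4], local_mask[1][4])
```

In the global layer the first image token sees the last one; in the local layer with a window of 2 it does not, which is the boundary the fix enforces.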

The Expert Parallelism (EP) router contract has been corrected across many models, and FP8 scale format handling has been fixed. Users of EP or FP8 quantization with affected models should verify their configurations and may need to update conversion mappings.

The Kernels integration has been synced to the latest version, which includes a breaking change: model-type repositories are no longer accepted by the kernels interface. Users must migrate to the updated kernel repository format, as shown in the updated tests.

HfExporters: Native, Unified export for PyTorch / ONNX / ExecuTorch


This release adds a native, in-Transformers export pipeline built on one base class (HfExporter), with three subclasses for the targeted runtimes and one unified API:

Exporter            Output                    Runtime
DynamoExporter      ExportedProgram           Any PyTorch runtime, AOT compilation
OnnxExporter        ONNXProgram               Any ONNX runtime (ORT, TensorRT, OpenVINO, …)
ExecutorchExporter  ExecutorchProgramManager  Mobile and edge (ExecuTorch)

Same call shape across all three. Dynamic shapes by default. Generation-style models split automatically into prefill + decode (+ vision/audio sub-encoders for VLMs).

import torch  # needed for torch.testing.assert_close below
from transformers import AutoModelForMaskedLM, AutoTokenizer
from transformers.exporters import OnnxExporter, OnnxConfig

model_id = "hf-internal-testing/tiny-random-BertForMaskedLM"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id).eval()
inputs = tokenizer(["Hello, my dog is cute"] * 2, return_tensors="pt")
onnx_program = OnnxExporter().export(model, inputs, config=OnnxConfig(dynamic=True))

new_input = tokenizer("Hello, my cat is so adorable!", return_tensors="pt")
torch.testing.assert_close(
    onnx_program.call_reference(**new_input)[0],   # numpy reference
    onnx_program(**new_input)[0],                  # onnxruntime
    rtol=1e-4, atol=1e-4,
)

To target another runtime, swap one line: use DynamoExporter() with DynamoConfig, or ExecutorchExporter() with ExecutorchConfig(backend=...).

For generative models the prefill/decode split is captured automatically:

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.exporters import OnnxExporter, OnnxConfig

model_id = "hf-internal-testing/tiny-random-LlamaForCausalLM"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()
inputs = tokenizer(["Hello, my dog is cute"] * 2, return_tensors="pt")

artifacts = OnnxExporter().export_for_generation(model, inputs, config=OnnxConfig(dynamic=True))
# {"prefill": ONNXProgram, "decode": ONNXProgram}
# For VLMs: also vision_encoder, audio_encoder, multi_modal_projector, language_model, lm_head

Kernels

Fixed a silent SDPA math-kernel fallback for GQA models with head_dim > 256 (e.g., Gemma4) that caused O(S²) memory materialization, and resolved a regression where use_kernels=True failed to apply kernel mappings. Additional improvements include lazy loading of the default kernel mapping to prevent import failures with incompatible kernel versions, ROCm routing to AITER Triton kernels for AMD GPUs, GB10/SM121 Hub-kernel support for Qwen3.6 Gated DeltaNet, and expanded documentation for the kernel API.
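
The cost of the O(S²) fallback is easy to estimate: the math kernel materializes a full sequence-by-sequence score matrix for every head, whereas fused kernels avoid storing it. The helper below is a back-of-the-envelope sketch; the function name and the example sizes (16 heads, bf16, 32K tokens) are assumptions for illustration, not measurements from any model.

```python
def attention_scores_bytes(batch, num_heads, seq_len, bytes_per_elem=2):
    """Bytes needed to hold the full (seq_len x seq_len) score matrix
    for every head, as the unfused math path does."""
    return batch * num_heads * seq_len * seq_len * bytes_per_elem


GIB = 2**30
for seq_len in (4_096, 32_768, 65_536):
    size = attention_scores_bytes(batch=1, num_heads=16, seq_len=seq_len)
    print(f"S={seq_len:>6}: {size / GIB:.1f} GiB for the score matrix alone")
```

Doubling the sequence length quadruples this figure, which is why a silent fallback can turn a long-context run into an out-of-memory error.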

Generation

Several generation bugs were fixed, including Mamba2 chunked-prefill and speculative decoding for hybrid models (Zamba2, Nemotron-H, Bamba, FalconH1, GraniteMoeHybrid), beam search for Mamba models, prompt lookup decoding crashes with no EOS token, and incorrect stateful model handling for LFM2. Additional improvements include reduced unnecessary generation warnings, a fix for continuous batching output mutation, and a new option to keep input tensors on CPU during generation to avoid retracing on Neuron/TPU devices.
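
For readers unfamiliar with prompt lookup decoding, the idea is to propose draft tokens by finding an earlier occurrence of the most recent n-gram and copying what followed it. The sketch below is a minimal, library-independent version. The function name and parameters are hypothetical, and the optional `eos_token_id` handling only illustrates that a missing EOS token has to be tolerated; it is not the code that was fixed.

```python
def prompt_lookup_candidates(tokens, ngram_size=2, num_candidates=3,
                             eos_token_id=None):
    """Propose draft tokens by matching the trailing n-gram earlier in `tokens`."""
    if len(tokens) < ngram_size:
        return []
    pattern = tokens[-ngram_size:]
    # Scan backwards so the most recent earlier match wins; the trailing
    # n-gram itself is excluded from the search.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == pattern:
            end = start + ngram_size
            candidates = tokens[end:end + num_candidates]
            # Only truncate at EOS when an EOS token is actually configured.
            if eos_token_id is not None and eos_token_id in candidates:
                candidates = candidates[:candidates.index(eos_token_id) + 1]
            if candidates:
                return candidates
    return []


history = [5, 6, 7, 8, 9, 5, 6]
print(prompt_lookup_candidates(history))
```

The proposed tokens are then verified by the main model in a single forward pass, so a wrong guess costs little and a right one saves several decoding steps.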

Attention

Several attention-related bugs were fixed in this release, including silent SDPA math-kernel fallbacks for GQA with large head dimensions, broken Flash Attention with StaticCache, incorrect causal masking in Xcodec2, a cross-attention reshape regression in Blip2, and eager GQA support in Evolla. Accelerate hook handling was also corrected for models using linear attention to prevent silently wrong results during offloading.

Cache

Cache APIs were improved by consolidating redundant getters into a cleaner get_max_length method and updating documentation accordingly. Several bug fixes were also applied, including correcting mask generation beyond sliding windows, fixing a dimension issue in cumulative length tracking, resolving device mismatches in offloaded cache for hybrid models, and fixing crashes when loading trust_remote_code models from symlinked local caches.

Serve

Several fixes and improvements were made to the Serve functionality, including lazy imports to prevent CLI crashes when the optional serve extra is not installed, a fix for dropped attributes during serialization of subclassed Pydantic models, and added documentation for the kernel API.

  • fix(cli/serve): import serve handlers lazily so the CLI works without the serve extra (#46473) by @ in [#46473]
  • [Fix] Serve drops some attributes at serialization (#46680) by @remi-or in [#46680]
  • Reduce per_page from 100 to 50 in GitHub API calls to avoid server errors (#46678) by @ydshieh in [#46678]

Quantization

Fixed dtype casting bugs in Gemma4's vision and audio multimodal embedders when using BitsAndBytes quantization, where inputs were incorrectly cast to integer storage dtypes (uint8/int8) instead of the actual compute dtype. Also corrected FP8 quantization to round block scales before quantizing weights, ensuring dequantization produces correct values for ue8m0 (DeepSeek-V4 style) format.
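
The ordering issue in the FP8 fix can be shown with a toy example. In the ue8m0 format a block scale has exponent bits only, so it must be a power of two. If weights are quantized with the unrounded scale and later dequantized with the rounded one, every value is off by the ratio of the two scales. The sketch below is an assumption-laden illustration: integer rounding stands in for the FP8 cast, the scale is rounded up to the next power of two, and all function names are hypothetical.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude of the FP8 E4M3 format


def raw_scale(block):
    """Scale that maps the largest magnitude in the block to FP8_E4M3_MAX."""
    return max(abs(w) for w in block) / FP8_E4M3_MAX


def round_to_ue8m0(scale):
    """Round a positive scale up to a power of two (exponent-only format)."""
    return 2.0 ** math.ceil(math.log2(scale))


def quantize(block, scale):
    """Stand-in for the FP8 cast: divide, round to nearest integer, clamp."""
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, round(w / scale))) for w in block]


def dequantize(quantized, scale):
    return [q * scale for q in quantized]


block = [0.5, -1.5, 3.0]
stored_scale = round_to_ue8m0(raw_scale(block))  # what ends up in the checkpoint

# Round the scale first, then quantize with that same scale.
correct = dequantize(quantize(block, stored_scale), stored_scale)

# Quantize with the unrounded scale, then dequantize with the stored one.
mismatched = dequantize(quantize(block, raw_scale(block)), stored_scale)

print(correct)
print(mismatched)
```

With the scale rounded first, the round trip reproduces the block; with the mismatched order, the values come back inflated by the gap between the raw and rounded scales.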

Bugfixes and improvements

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @ydshieh
    • Update workflow callers to use transformers-ci (#47040)
    • Add tiny_model_id support to ProcessorTesterMixin for memory-sensitive tests (#47005)
    • Install in docker (#46910)
    • [CI] Use pre-computed _OLD_MODELS in test_new_models_require_torchvision_backend (#46882)
    • Fix secondary rate limit when downloading artifacts in slack report (#46796)
    • [CI] Fix artifact download path in self-comment-ci workflow (#46769)
    • ci: add comment explaining why secrets are not inherited in security gate (#46750)
    • ci: trigger PR CI on ci-* branches (#46746)
    • ci: disable CircleCI by replacing config with no-op (#46721)
    • ci: grant pull-requests:write to the security gate caller (#46715)
    • Reduce per_page from 100 to 50 in GitHub API calls to avoid server errors (#46678)
    • ci: add NO_COLOR=1 to suppress ANSI color codes in CI output (#46659)
    • ci: add merge_group trigger to pr-ci-caller.yml (#46668)
    • Revert "Disable PR CI workflow for PRs from forked repo. during the weekend" (#46652)
    • Disable PR CI workflow for PRs from forked repo. during the weekend (#46609)
  • @Mi-Jiazhi
    • Add HunYuan VL model (#46417)
  • @tarekziade
    • chore(linter): add TRF018 modeling rule (#46259)
    • only in the original repo (#46982)
    • the CI status should be a comment (#46976)
    • Insert a Grafana badge in the PR (#46774)
    • call transformers-ci in a nightly run (#46811)
  • @casinca
  • @JJJYmmm
    • [new model] Add Zyphra/ZAYA1-8B (#45862)
  • @ebezzam
    • Fix typo in Qwen3 ASR no_split_module (#47002)
    • Fix Xcodec2 attention to be non-causal. (#46963)
    • Use common floats_list method for feature extractor tests. (#46956)
    • Add xcodec2 model (#44178)
  • @meatybobby
    • Add support for RADIO models (#46425)
  • @douglas-reid
    • 🚨 [gemma 3/4] Fix bidirectional attention masking crossing sliding window boundaries (#46850)
  • @Sunt-ing
    • Fix Mamba2 chunked-prefill / speculative decoding for Zamba2, Nemotron-H, Bamba, FalconH1 and GraniteMoeHybrid (#46741)
    • Reject assisted generation for LFM2 and LFM2-MoE (set _is_stateful) (#46937)
    • Don't pin the gated delta net norm to cuda:0 with a hardcoded device (#46817)
    • Fix prompt lookup decoding crash when no EOS token is configured (#46790)
    • Fix left-padding token selection in BioGptForSequenceClassification (#46782)
    • Fix offloaded cache device mismatch on hybrid models (#46748)
    • Fall back to the for-loop grouped_mm on CPU (#46743)
  • @eustlb
    • Add Nemotron 3.5 ASR Streaming (#46565)
    • [NemotronAsrStreaming] fix pipeline (#46870)
    • [NemotronAsrStreaming] processor without modular (#46865)
    • Add Nemotron ASR Streaming (#46332)
    • [fix] enable base64 str audio in load_audio (#46694)
  • @vasqu
    • [Dia] Fix docs (#46923)
    • [CB] Add FA2 to the fast path (#46729)
    • [Kernels] Trigger proper kernelization on use_kernels=True (#46755)
    • [CI] Fix some failures introduced by myself 😬 (#46751)
    • 🚨 [Kernels] Sync to latest version (#46039)
    • [Templates] Update members (#46720)
    • [Blip2] Fix cross attention reshape (#46695)
    • Update post release (#46608)
  • @mbtariq82
    • Qwen3 ASR and Forced Aligner (#43838)
  • @remi-or
    • [CB] Changes to increase max_batch_tokens (#46712)
    • [CB] Fix issues with FA read / writes (#46765)
    • [CB] Fix offloading (#46587)
    • [Fix] Serve drops some attributes at serialization (#46680)
    • [CB] Slice logits inside the model (#46660)
    • [CB] Fix seqlens and use TypedDict (#46593)
  • @jiqing-feng
    • Fix BitNet packed-weight unpacking dtype (F.linear dtype mismatch) (#46808)
    • Fix Evolla eager attention for the GQA text decoder (#46860)
    • Fix flex_attention block mask creation when get_seq_length returns a tensor (#46802)
    • Lazily build the default kernel mapping to decouple kernels from normal transformers usage (#46681)
  • @bzantium
  • @MHRDYN7
  • @YangKai0616
    • [RecurrentGemma] Support attn_implementation dispatch (#46320)
    • [blip_2] Support attn_implementation dispatch (#46401)
    • [CTRL] Support attn_implementation dispatch (#46073)