Conversation


@dependabot dependabot bot commented on behalf of github Jan 13, 2026

Bumps vllm from 0.8.3 to 0.12.0.
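
Since 0.12.0 also moves the stack to PyTorch 2.9.0 (CUDA 12.9) and removes several V0 code paths (see the release notes below), it is worth sanity-checking the target environment before merging. A minimal pre-upgrade check sketch, assuming a standard Python environment with `packaging` installed; the version thresholds come from the release notes, everything else is illustrative:

```python
from importlib.metadata import PackageNotFoundError, version
from packaging.version import Version

# Thresholds taken from the vLLM 0.12.0 release notes (assumed minimums):
# PyTorch 2.9.0 built against CUDA 12.9.
MIN_TORCH = Version("2.9.0")
TARGET_VLLM = Version("0.12.0")

def check_environment() -> None:
    try:
        # Strip local version suffixes like "2.9.0+cu129" before comparing.
        torch_version = Version(version("torch").split("+")[0])
    except PackageNotFoundError:
        raise SystemExit("torch is not installed; vLLM 0.12.0 expects PyTorch 2.9.0")
    if torch_version < MIN_TORCH:
        raise SystemExit(f"torch {torch_version} < {MIN_TORCH}: upgrade PyTorch before bumping vllm")

    try:
        vllm_version = Version(version("vllm"))
    except PackageNotFoundError:
        vllm_version = None
    if vllm_version is None or vllm_version < TARGET_VLLM:
        print(f"installed vllm is {vllm_version}; this PR pins {TARGET_VLLM}")

if __name__ == "__main__":
    check_environment()
```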

Release notes

Sourced from vllm's releases.

v0.12.0

vLLM v0.12.0 Release Notes

Highlights

This release features 474 commits from 213 contributors (57 new)!

Breaking Changes: This release includes the PyTorch 2.9.0 upgrade (CUDA 12.9), V0 deprecations including the xformers backend, and scheduled removals; please review the changelog carefully.

Major Features:

  • EAGLE Speculative Decoding Improvements: Multi-step CUDA graph support (#29559), DP>1 support (#26086), and multimodal support with Qwen3VL (#29594).
  • Significant Performance Optimizations: 18.1% throughput improvement from batch invariant BMM (#29345), 2.2% throughput improvement from shared experts overlap (#28879).
  • AMD ROCm Expansion: DeepSeek v3.2 + SparseMLA support (#26670), FP8 MLA decode (#28032), AITER attention backend (#28701).

Model Support

  • New model families: PLaMo-3 (#28834), OpenCUA-7B (#29068), HunyuanOCR (#29327), Mistral Large 3 and Ministral 3 (#29757).
  • Format support: Gemma3 GGUF multimodal support (#27772).
  • Multimodal enhancements: Qwen3 Omni audio-in-video support (#27721), Eagle3 multimodal support for Qwen3VL (#29594).
  • Performance: QwenVL cos/sin cache optimization (#28798).

Engine Core

  • GPU Model Runner V2 (Experimental) (#25266): Complete refactoring of model execution pipeline:

    • No "reordering" or complex bookkeeping with persistent batch removal
    • GPU-persistent block tables for better scalability with max_model_len and num_kv_groups
    • Triton-native sampler: no -1 temperature hack, efficient per-request seeds, memory-efficient prompt logprobs
    • Simplified DP and CUDA graph implementations
    • Efficient structured outputs support
  • Prefill Context Parallel (PCP) (Preparatory) (#28718): Partitions the sequence dimension during prefill for improved long-sequence inference. Complements existing Decode Context Parallel (DCP). See RFC #25749 for details.

  • RLHF Support: Pause and Resume Generation for Asynchronous RL Training (#28037).

  • KV Cache Enhancements: Cross-layer KV blocks support (#27743), KV cache residency metrics (#27793).

  • Audio support: Audio embeddings support in chat completions (#29059).

  • Speculative Decoding:

    • Multi-step Eagle with CUDA graph (#29559)
    • EAGLE DP>1 support (#26086)
    • EAGLE3 heads without use_aux_hidden_states (#27688)
    • Eagle multimodal CUDA graphs with MRoPE (#28896)
    • Logprobs support with spec decode + async scheduling (#29223)
  • Configuration: Flexible inputs_embeds_size separate from hidden_size (#29741), --fully-sharded-loras for fused_moe (#28761).

Hardware & Performance

  • NVIDIA Performance:

... (truncated)

Changelog

Sourced from vllm's changelog.

Releasing vLLM

vLLM releases offer a reliable version of the code base, packaged into a binary format that can be conveniently accessed via PyPI. These releases also serve as key milestones for the development team to communicate with the community about newly available features, improvements, and upcoming changes that could affect users, including potential breaking changes.

Release Cadence and Versioning

We aim to have a regular release every 2 weeks. Since v0.12.0, regular releases increment the minor version rather than the patch version. The list of past releases can be found here.

Our version numbers are expressed in the form vX.Y.Z, where X is the major version, Y is the minor version, and Z is the patch version. They are incremented according to the following rules:

  • Major releases are reserved for architectural milestones involving sweeping API changes, similar to PyTorch 2.0.
  • Minor releases correspond to regular releases, which include new features, bug fixes and other backwards-compatible changes.
  • Patch releases correspond to special releases for new models, as well as emergency patches for critical performance, functionality and security issues.

This versioning scheme is similar to SemVer for compatibility purposes, except that backwards compatibility is only guaranteed for a limited number of minor releases (see our deprecation policy for details).
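
A small illustration of what that means for downstream pins, assuming a requirements-style constraint checked with the `packaging` library (the pin range shown is an example, not an official recommendation):

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

# Track a single minor series: regular (minor) releases may carry
# backwards-incompatible changes once the limited compatibility window passes.
pin = SpecifierSet(">=0.12.0,<0.13")

for candidate in ["0.8.3", "0.12.0", "0.12.1", "0.13.0"]:
    status = "ok" if Version(candidate) in pin else "outside pin"
    print(f"vllm {candidate}: {status}")
# 0.8.3 and 0.13.0 fall outside the pin; 0.12.x patch releases satisfy it.
```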

Release Branch

Each release is built from a dedicated release branch.

  • For major and minor releases, the release branch cut is performed 1-2 days before the release goes live.
  • For patch releases, the previously cut release branch is reused.
  • Release builds are triggered by pushing an RC tag such as vX.Y.Z-rc1 (see the sketch after this list). This enables us to build and test multiple RCs for each release.
  • The final tag vX.Y.Z does not trigger a build; it is only used for the release notes and assets.
  • After the branch cut, we monitor the main branch for any reverts and apply them to the release branch.
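
A minimal sketch of the tag convention described above, assuming vX.Y.Z-rcN for release candidates and vX.Y.Z for the final tag (the regex and helper are illustrative, not the actual CI trigger):

```python
import re

# RC tags (vX.Y.Z-rcN) trigger release builds; the final tag (vX.Y.Z) does not,
# and is only used for release notes and assets.
TAG_PATTERN = re.compile(r"^v(\d+)\.(\d+)\.(\d+)(?:-rc(\d+))?$")

def classify_tag(tag: str) -> str:
    match = TAG_PATTERN.match(tag)
    if match is None:
        return "not a release tag"
    return "release candidate (triggers build)" if match.group(4) else "final tag (notes and assets only)"

for tag in ["v0.12.0-rc1", "v0.12.0-rc2", "v0.12.0", "main"]:
    print(f"{tag}: {classify_tag(tag)}")
```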

Cherry-Pick Criteria

After the branch cut, we finalize the release branch using clear criteria for which cherry-picks are allowed in. Note: a cherry-pick is the process of landing a PR on the release branch after the branch cut. Cherry-picks are typically limited to ensure that the team has sufficient time to complete a thorough round of testing on a stable code base.

  • Regression fixes: fixes for functional or performance regressions against the most recent release (e.g. 0.7.0 for the 0.7.1 release)
  • Critical fixes: fixes for severe issues such as silent incorrectness, backwards-compatibility breaks, crashes, deadlocks, and (large) memory leaks
  • Fixes to new features introduced in the most recent release (e.g. 0.7.0 for the 0.7.1 release)
  • Documentation improvements
  • Release-branch-specific changes (e.g. version identifier changes or CI fixes)

Please note: no feature work is allowed in cherry-picks. All PRs considered for cherry-picking must already be merged on trunk; the only exception is release-branch-specific changes.

Manual validations

E2E Performance Validation

Before each release, we perform end-to-end performance validation to ensure no regressions are introduced. This validation uses the vllm-benchmark workflow on PyTorch CI.

Current Coverage:

  • Models: Llama3, Llama4, and Mixtral
  • Hardware: NVIDIA H100 and AMD MI300x
  • Note: Coverage may change based on new model releases and hardware availability

... (truncated)

Commits
  • 4fd9d6a [Core] Rename PassConfig flags as per RFC #27995 (#29646)
  • a1d627e [BugFix] Fix assert in build_for_cudagraph_capture (#29893)
  • 2f055ec [Bugfix] Fix incorrect channel order for idefics3 in edge case (#29881)
  • 6a61085 [BUGFIX] Fix regex pattern for Mistral Tool Call (#29918)
  • 9057fc2 [BUGFIX] llama_4_scaling wrongly passed to DeepseekAttention (#29908)
  • a05b580 [Bugfix] fix --scheduling-policy=priority & n>1 crashes engine (#29764)
  • b6ae5ae [Bugfix][EPLB] Prevent user-provided EPLB config from being overwritten with ...
  • 5c7c09a [Perf] Avoid pageable HtoD transfer in MinTokensLogitsProcessor (#29826)
  • 7f71816 [CI/Build] Fixes missing runtime dependencies (#29822)
  • 339e84c [Bugfix] Fix DeepSeek R1 MTP weight loading (#29545)
  • Additional commits viewable in compare view

Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot merge will merge this PR after your CI passes on it
  • @dependabot squash and merge will squash and merge this PR after your CI passes on it
  • @dependabot cancel merge will cancel a previously requested merge and block automerging
  • @dependabot reopen will reopen this PR if it is closed
  • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

You can disable automated security fix PRs for this repo from the Security Alerts page.

Bumps [vllm](https://github.com/vllm-project/vllm) from 0.8.3 to 0.12.0.
- [Release notes](https://github.com/vllm-project/vllm/releases)
- [Changelog](https://github.com/vllm-project/vllm/blob/main/RELEASE.md)
- [Commits](vllm-project/vllm@v0.8.3...v0.12.0)

---
updated-dependencies:
- dependency-name: vllm
  dependency-version: 0.12.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
@dependabot dependabot bot added the dependencies and python labels on Jan 13, 2026