
Fix vLLM v0.19 MLA merge validation and CacheOnly KV cache registration#76

Merged

yubofredwang merged 1 commit into main from fix/vllm-mla-merge-and-cache-only-spec on Apr 15, 2026

Conversation

@yubofredwang
Collaborator

Summary

  • MLAAttentionSpec.merge() dimension validation: Without this fix, CacheOnlyAttentionLayer (used by extract_hidden_states) gets incorrectly merged with MLA attention layers into a single KV cache group, causing the CacheOnly KV cache to be reshaped with MLA dimensions (1 head, 576 dim) instead of hidden-state dimensions (num_aux_layers heads, hidden_size dim).
  • Register _CacheOnlyKVCacheSpec in spec_manager_map: extract_hidden_states uses _CacheOnlyKVCacheSpec (a subclass of AttentionSpec), which is missing from vLLM's spec_manager_map, causing a KeyError during engine init. The fix routes it to FullAttentionManager.
  • Integration test improvements: Update defaults to Kimi-K2.5 model, add CLI flags for --load-format, --enforce-eager, and --max-model-len.
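The merge-time validation described in the first bullet can be sketched roughly as follows. This is a simplified stand-in, not the actual vLLM v0.19 code: the class names match the PR, but the fields and the `merge()` signature are assumptions for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AttentionSpec:
    # Illustrative fields; the real vLLM spec carries more state.
    block_size: int
    num_kv_heads: int
    head_size: int
    dtype: str


@dataclass(frozen=True)
class MLAAttentionSpec(AttentionSpec):
    @classmethod
    def merge(cls, specs):
        # Before the patch, specs were merged without checking that all
        # fields match, so a CacheOnly spec could be silently folded into
        # an MLA KV cache group and later reshaped with MLA dimensions
        # (e.g. 1 head, 576 head dim) instead of hidden-state dimensions.
        first = specs[0]
        for spec in specs[1:]:
            for f in fields(first):
                if getattr(spec, f.name) != getattr(first, f.name):
                    raise ValueError(
                        f"cannot merge KV cache specs: field {f.name!r} "
                        f"differs ({getattr(first, f.name)} vs "
                        f"{getattr(spec, f.name)})"
                    )
        return first
```

With this check in place, an incompatible spec fails loudly at merge time instead of producing a mis-shaped KV cache later.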

Test plan

  • Run vLLM engine integration test with Kimi-K2.5 and verify engine init succeeds without KeyError
  • Verify MLA and CacheOnly layers are placed in separate KV cache groups

Add two new patches to the vLLM v0.19.0 patch set:
- MLAAttentionSpec.merge() now validates that all fields match, preventing
  CacheOnlyAttentionLayer from being silently merged with MLA layers into
  a single KV cache group with wrong dimensions.
- Register _CacheOnlyKVCacheSpec in spec_manager_map so extract_hidden_states
  doesn't hit a KeyError during engine init.

Update integration test to default to Kimi-K2.5, add CLI flags for
--load-format, --enforce-eager, and --max-model-len.
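The new integration-test flags might be wired up along these lines with argparse. The flag names come from the PR (the review below notes the `--[no-]enforce-eager` pair); the defaults shown, other than the Kimi-K2.5 model name, are illustrative assumptions.

```python
import argparse

parser = argparse.ArgumentParser(description="vLLM engine integration test")
parser.add_argument("--model", default="Kimi-K2.5")  # new default per the PR
parser.add_argument("--load-format", default="auto")  # assumed default
# BooleanOptionalAction generates both --enforce-eager and --no-enforce-eager.
parser.add_argument(
    "--enforce-eager",
    action=argparse.BooleanOptionalAction,
    default=True,  # assumed default
)
parser.add_argument("--max-model-len", type=int, default=4096)  # assumed default

# Example invocation overriding two of the flags.
args = parser.parse_args(["--max-model-len", "8192", "--no-enforce-eager"])
```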
Copilot AI review requested due to automatic review settings April 15, 2026 06:38
Contributor

Copilot AI left a comment

Pull request overview

This PR updates the vLLM v0.19 patchset to prevent incorrect KV-cache grouping between MLA attention layers and CacheOnly attention (used by extract_hidden_states), and to ensure the CacheOnly KV-cache spec is properly registered during engine initialization. It also extends the vLLM integration test CLI to better match the intended model/config defaults and allow additional runtime configuration.

Changes:

  • Add stricter merge-time validation for MLAAttentionSpec.merge() to prevent merging incompatible attention specs into the same KV cache group.
  • Register _CacheOnlyKVCacheSpec in vLLM’s KV cache manager spec_manager_map to avoid KeyError at engine init.
  • Enhance the vLLM engine integration test script with new CLI flags (--load-format, --[no-]enforce-eager, --max-model-len) and update default model/TP.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Files reviewed:

  • tests/test_vllm_engine_integration.py: Adds CLI/config plumbing for load_format, eager mode, and max model length; updates defaults for the integration test script.
  • patches/vllm/v0.19.0/vllm.patch: Extends the vLLM v0.19.0 patchset to validate MLA merge compatibility and register the CacheOnly KV-cache spec with the appropriate manager.

Comment thread: tests/test_vllm_engine_integration.py
@yubofredwang yubofredwang merged commit 59d58e4 into main Apr 15, 2026
5 checks passed
@yubofredwang yubofredwang deleted the fix/vllm-mla-merge-and-cache-only-spec branch April 15, 2026 06:46