Fix vLLM v0.19 MLA merge validation and CacheOnly KV cache registration #76
Merged
yubofredwang merged 1 commit into main on Apr 15, 2026
Conversation
Add two new patches to the vLLM v0.19.0 patch set:
- MLAAttentionSpec.merge() now validates that all fields match, preventing CacheOnlyAttentionLayer from being silently merged with MLA layers into a single KV cache group with wrong dimensions.
- Register _CacheOnlyKVCacheSpec in spec_manager_map so extract_hidden_states doesn't hit a KeyError during engine init.

Update the integration test to default to Kimi-K2.5 and add CLI flags for --load-format, --enforce-eager, and --max-model-len.
Contributor
Pull request overview
This PR updates the vLLM v0.19 patchset to prevent incorrect KV-cache grouping between MLA attention layers and CacheOnly attention (used by extract_hidden_states), and to ensure the CacheOnly KV-cache spec is properly registered during engine initialization. It also extends the vLLM integration test CLI to better match the intended model/config defaults and allow additional runtime configuration.
Changes:
- Add stricter merge-time validation for MLAAttentionSpec.merge() to prevent merging incompatible attention specs into the same KV cache group.
- Register _CacheOnlyKVCacheSpec in vLLM's KV cache manager spec_manager_map to avoid a KeyError at engine init.
- Enhance the vLLM engine integration test script with new CLI flags (--load-format, --[no-]enforce-eager, --max-model-len) and update the default model/TP.
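The merge-time validation described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not vLLM's actual code: the class shapes and field names (block_size, num_kv_heads, head_size) are assumptions; only the class names and the merge-all-fields-must-match behavior come from the PR description.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AttentionSpec:
    """Illustrative stand-in for vLLM's AttentionSpec (fields are assumed)."""
    block_size: int
    num_kv_heads: int
    head_size: int


@dataclass(frozen=True)
class MLAAttentionSpec(AttentionSpec):
    def merge(self, specs: list) -> "MLAAttentionSpec":
        # Stricter merge-time validation: every field of every spec being
        # merged into this KV cache group must match exactly, so a
        # cache-only layer with different dimensions can no longer be
        # silently folded in with the MLA layers.
        for other in specs:
            for f in fields(self):
                if getattr(self, f.name) != getattr(other, f.name):
                    raise ValueError(
                        f"cannot merge attention specs into one KV cache "
                        f"group: field {f.name!r} differs "
                        f"({getattr(self, f.name)!r} vs {getattr(other, f.name)!r})"
                    )
        return self
```

With this check in place, an attempt to group a spec with mismatched head count or head size fails loudly at merge time instead of producing a KV cache with the wrong shape.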
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/test_vllm_engine_integration.py | Adds CLI/config plumbing for load_format, eager mode, and max model length; updates defaults for the integration test script. |
| patches/vllm/v0.19.0/vllm.patch | Extends the vLLM v0.19.0 patchset to validate MLA merge compatibility and register CacheOnly KV-cache spec to the appropriate manager. |
Summary
- CacheOnlyAttentionLayer (used by extract_hidden_states) gets incorrectly merged with MLA attention layers into a single KV cache group, causing the CacheOnly KV cache to be reshaped with MLA dimensions (1 head, 576 dim) instead of hidden-state dimensions (num_aux_layers heads, hidden_size dim). MLAAttentionSpec.merge() now validates that all fields match before merging.
- Register _CacheOnlyKVCacheSpec in spec_manager_map: extract_hidden_states uses _CacheOnlyKVCacheSpec (a subclass of AttentionSpec), which is not in vLLM's spec_manager_map, causing a KeyError during engine init. Routes it to FullAttentionManager.
- Extend the integration test CLI with --load-format, --enforce-eager, and --max-model-len.

Test plan
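The CLI plumbing added to the integration test could look roughly like this; the option names come from the PR, while the defaults and help text are illustrative assumptions:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical sketch of the new flags in
    # tests/test_vllm_engine_integration.py.
    parser = argparse.ArgumentParser(description="vLLM engine integration test")
    parser.add_argument("--load-format", default="auto",
                        help="weight load format passed through to vLLM")
    # BooleanOptionalAction gives the --[no-]enforce-eager pair noted
    # in the review summary.
    parser.add_argument("--enforce-eager", action=argparse.BooleanOptionalAction,
                        default=False,
                        help="disable CUDA graphs")
    parser.add_argument("--max-model-len", type=int, default=None,
                        help="maximum model context length")
    return parser
```

A run exercising all three flags would then be something like `python tests/test_vllm_engine_integration.py --load-format dummy --enforce-eager --max-model-len 4096`.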