[RFC] Hybrid model ExtractHiddenStates: CacheOnly as filtered KV cache group #160
Draft
rahul-tuli wants to merge 2 commits into main
- Add CacheOnlySpec(MLAAttentionSpec) to kv_cache_interface.py so it
duck-types through all existing AttentionSpec code paths
- Pre-filter CacheOnlySpec in get_kv_cache_groups() before type-
unification routing to prevent crashes with mixed spec types
- Joint budget calculation in get_kv_cache_config_from_groups() via
extra_bytes_per_block parameter on get_num_blocks()
- Gate HMA disable in config with supports_hma() check so hybrid
models keep their per-group block allocators
- Add SupportsHMA to ExampleHiddenStatesConnector with correct
cache_group_idx for block_ids
- Resolve CacheOnly slot_mapping from per-layer mappings in the
proposer instead of using main group's common_attn_metadata
Signed-off-by: Rahul-Tuli <rtuli@redhat.com>
CacheOnlySpec inherits MLAAttentionSpec -> FullAttentionSpec, so
isinstance(cache_only, FullAttentionSpec) returns True. This causes
build_block_map_addrs to include CacheOnly in page size uniformity
checks, crashing with "Non-uniform page sizes" on hybrid models.
Add explicit CacheOnlySpec filter after the FullAttentionSpec gate.

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
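The inheritance problem described in this commit can be illustrated with a minimal sketch. The three classes below are hypothetical stand-ins that mirror only the inheritance chain stated above; `specs_for_uniformity_check` and `check_uniform_page_size` are illustrative names, not the actual vLLM functions, and the page sizes are made up.

```python
from dataclasses import dataclass


# Hypothetical stand-ins: only the inheritance chain from the commit
# message (CacheOnlySpec -> MLAAttentionSpec -> FullAttentionSpec) is mirrored.
@dataclass(frozen=True)
class FullAttentionSpec:
    page_size_bytes: int


@dataclass(frozen=True)
class MLAAttentionSpec(FullAttentionSpec):
    pass


@dataclass(frozen=True)
class CacheOnlySpec(MLAAttentionSpec):
    pass


def specs_for_uniformity_check(specs):
    """Apply the FullAttentionSpec gate, then explicitly drop CacheOnlySpec."""
    return [
        s for s in specs
        if isinstance(s, FullAttentionSpec) and not isinstance(s, CacheOnlySpec)
    ]


def check_uniform_page_size(specs):
    """Raise if the given specs do not all share one page size."""
    sizes = {s.page_size_bytes for s in specs}
    if len(sizes) > 1:
        raise ValueError("Non-uniform page sizes")
    return sizes.pop() if sizes else None


# Illustrative values: the CacheOnly page size differs from the others.
specs = [FullAttentionSpec(4096), MLAAttentionSpec(4096), CacheOnlySpec(1024)]

# The isinstance gate alone lets CacheOnlySpec through, because it is a subclass.
leaked = [s for s in specs if isinstance(s, FullAttentionSpec)]

# With the explicit filter, only the two uniform specs reach the check.
filtered = specs_for_uniformity_check(specs)
uniform_size = check_uniform_page_size(filtered)
```

Calling `check_uniform_page_size(leaked)` in this sketch raises `ValueError`, which corresponds to the crash the commit describes; the filtered list passes.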
This was referenced Apr 15, 2026
Summary
- Hybrid model support for extract_hidden_states speculative decoding by modeling CacheOnly as its own KV cache group, filtered from page-size uniformity checks and group coordination
- CacheOnlySpec(MLAAttentionSpec) is pre-filtered in get_kv_cache_groups() before type-unification, then appended as a separate group with joint memory budget accounting
- HMA disable is gated with a supports_hma() check so hybrid models keep per-group block allocators when the connector supports it

This is 1 of 3 alternative approaches; see the RFC document and sister PRs for comparison.
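The joint memory budget accounting can be sketched roughly as follows. This is a simplified illustration, not the signature of the real get_num_blocks(): the function name `num_blocks_joint`, its parameters other than `extra_bytes_per_block`, and all numbers are assumptions. It also assumes that every block in the main groups is paired with one CacheOnly block, so the CacheOnly cost can be expressed as a fixed number of extra bytes per block.

```python
def num_blocks_joint(available_bytes, page_size_bytes, num_layers,
                     extra_bytes_per_block=0):
    """Hypothetical helper: number of blocks that fit in the memory budget.

    Each block costs one page per layer in the main groups, plus
    extra_bytes_per_block for the CacheOnly group (assumed here to be
    allocated one-for-one with the main blocks).
    """
    bytes_per_block = page_size_bytes * num_layers + extra_bytes_per_block
    return available_bytes // bytes_per_block


# Illustrative numbers only.
available = 1_000_000   # bytes available for KV cache
page_size = 1_000       # bytes per page in the main groups
layers = 4              # layers sharing each block in the main groups
cache_only_page = 1_000 # bytes per CacheOnly block

# Ignoring CacheOnly overestimates how many blocks fit.
blocks_without = num_blocks_joint(available, page_size, layers)

# Joint accounting charges the CacheOnly bytes against the same budget.
blocks_joint = num_blocks_joint(available, page_size, layers,
                                extra_bytes_per_block=cache_only_page)
```

Under these assumptions, the joint count is the smaller one, so the main groups and the CacheOnly group together stay within the same memory budget.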
Test plan
- pre-commit run --all-files passes on changed files
- tests/v1/core/test_kv_cache_utils.py

🤖 Generated with Claude Code