[RFC] Hybrid model ExtractHiddenStates: CacheOnly as supplementary tensors#161

Draft
rahul-tuli wants to merge 1 commit into main from
hma/extract-hidden-states-supplementary
Conversation

@rahul-tuli
Member

Summary

  • Adds hybrid model support (e.g. Qwen3.5) for extract_hidden_states speculative decoding by modeling CacheOnly as supplementary tensors that share group 0's block table
  • Introduces supplementary_specs field on KVCacheConfig — CacheOnly layers get their own KV cache tensors but are invisible to the KV cache coordinator
  • Shared _reshape_one_layer() helper in attn_utils.py eliminates reshape duplication
  • Uses GPU-authoritative attn_metadata.slot_mapping instead of scheduler-computed slot mappings
  • Gates HMA disable with supports_hma() check so hybrid models keep per-group block allocators
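
The core idea in the first two bullets can be sketched as follows. This is a simplified, illustrative model only; `KVCacheSpec`, `KVCacheConfig`, and `split_supplementary_specs()` here are stand-ins for the real vLLM types touched by this PR, with hypothetical fields, not the actual signatures.

```python
from dataclasses import dataclass, field

@dataclass
class KVCacheSpec:
    """Simplified stand-in for a per-layer KV cache spec."""
    layer_name: str
    page_size_bytes: int
    cache_only: bool = False  # True for CacheOnly (non-attention) layers

@dataclass
class KVCacheConfig:
    """Simplified stand-in: grouped specs drive the KV cache coordinator;
    supplementary specs get their own tensors but are never routed into a
    group, so group coordination logic never sees them."""
    kv_cache_specs: list = field(default_factory=list)
    supplementary_specs: list = field(default_factory=list)

def split_supplementary_specs(specs):
    """Separate CacheOnly specs from ordinary attention specs before
    group routing, mirroring the split this PR describes."""
    grouped = [s for s in specs if not s.cache_only]
    supplementary = [s for s in specs if s.cache_only]
    return grouped, supplementary
```

CacheOnly layers then share group 0's block table at lookup time rather than owning a group of their own, which is what keeps them invisible to the coordinator.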

This is one of three alternative approaches; see the RFC document and the sister PRs for a comparison.

Test plan

  • Verified on Qwen3.5-9B with TP=4, non-zero hidden states extracted
  • Verified on standard (non-hybrid) model — no regression
  • pre-commit run --all-files passes on changed files
  • Unit tests added in tests/v1/core/test_kv_cache_utils.py

🤖 Generated with Claude Code

…nsors

CacheOnly layers are modeled as supplementary tensors that share group 0's
block table rather than as separate KV cache groups. This avoids polluting
group coordination logic while properly managing memory.

Key changes:
- Add CacheOnlySpec(MLAAttentionSpec) to kv_cache_interface.py
- Add supplementary_specs field to KVCacheConfig for non-group tensors
- split_supplementary_specs() separates CacheOnly before group routing
- Memory accounting includes supplementary bytes in budget calculation
- attn_utils: _reshape_one_layer() shared helper, supplementary init/reshape
- gpu_model_runner: supplementary alloc/reshape/slot_mapping support
- Connector uses GPU-authoritative slot_mapping from attn_metadata
- Gate HMA disable with supports_hma() check for hybrid models
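
The memory-accounting change above can be illustrated with a minimal sketch. The function names and the flat list-of-byte-sizes interface are hypothetical, chosen only to show the budgeting rule: supplementary (CacheOnly) bytes must be charged per block, otherwise the block count derived from the memory budget would overcommit GPU memory.

```python
def kv_cache_bytes_per_block(group_page_sizes, supplementary_page_sizes):
    """Per-block KV cache footprint in bytes. Supplementary (CacheOnly)
    tensors are counted even though they are invisible to the KV cache
    coordinator, so the budget stays safe."""
    return sum(group_page_sizes) + sum(supplementary_page_sizes)

def max_num_blocks(available_bytes, group_page_sizes, supplementary_page_sizes):
    """How many KV cache blocks fit once supplementary bytes are included."""
    per_block = kv_cache_bytes_per_block(group_page_sizes, supplementary_page_sizes)
    return available_bytes // per_block
```

Without the supplementary term, the same budget would report more blocks than actually fit once the CacheOnly tensors are allocated.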

Signed-off-by: Rahul Tuli <rtuli@redhat.com>