webgpu: Add configurable multi rotary cache concat offset parameter #27434
Open
Conversation
Add the `multiRotaryCacheConcatOffset` EP option to support models with different context lengths for multi-rotary embedding cache concatenation.

Changes:
- Add a uint32 `multiRotaryCacheConcatOffset` config parameter (default: 4096)
- Keep the `useMultiRotaryCacheConcat` boolean flag for enable/disable
- Update the `WebGpuExecutionProviderConfig` struct with the new offset field
- Add a `MultiRotaryCacheConcatOffset()` accessor to the EP and ComputeContext
- Add the `kMultiRotaryCacheConcatOffset` config option constant
- Update webgpu_provider_factory.cc to parse and validate the offset value
- Pass the offset parameter to the SplitPackedQKVWithRotaryEmbedding programs
- Update the WGSL templates to use the configurable offset instead of the hardcoded 4096
- Reorder template parameters alphabetically for consistency

The offset parameter allows models with different `original_context_length` values (e.g., 2048, 8192) to specify their cache concatenation boundary, replacing the value previously hardcoded to 4096 for Phi models.
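The parse-and-validate step in the provider factory could look roughly like the following sketch. The function name and error handling are assumptions for illustration, not the actual ONNX Runtime API; only the semantics (uint32 value, default 4096) come from the change description above.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: parse a provider option string such as
// "multiRotaryCacheConcatOffset" into a uint32, falling back to the
// documented default of 4096 when the option is absent.
uint32_t ParseOffsetOption(const std::string& value,
                           uint32_t default_value = 4096) {
  if (value.empty()) return default_value;
  size_t pos = 0;
  unsigned long parsed = std::stoul(value, &pos);
  // Reject trailing garbage and values that do not fit in uint32.
  if (pos != value.size() || parsed > UINT32_MAX) {
    throw std::invalid_argument(
        "multiRotaryCacheConcatOffset must be a uint32, got: " + value);
  }
  return static_cast<uint32_t>(parsed);
}
```

A caller would typically invoke this once per session creation and store the result in the provider config struct.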
…parameter

- Remove the redundant `use_multi_rotary_cache_concat` boolean flag from the public API
- Keep only the `multi_rotary_cache_concat_offset` parameter (0 = disabled)
- Internally derive the boolean from offset > 0 for shader template compilation
- Add the offset to the CacheHint for proper shader variant caching
- Update logging to remove the reference to the removed boolean field

Benefits:
- Simpler API: a single parameter instead of two
- Zero runtime overhead when disabled (conditional compilation)
- Self-documenting: 0 clearly indicates the disabled state
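The derived-boolean design described above can be sketched as follows. The struct and member names are hypothetical stand-ins for the real config fields; the point is that the enable state is computed from the offset, and the offset participates in the shader cache key so each offset value compiles its own shader variant.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Illustrative sketch of the single-parameter design: 0 means disabled,
// any positive value both enables the feature and sets the boundary.
struct RotaryCacheConfig {
  uint32_t multi_rotary_cache_concat_offset = 0;  // 0 = disabled

  // The boolean is derived, never stored, so the two can't disagree.
  bool UseMultiRotaryCacheConcat() const {
    return multi_rotary_cache_concat_offset > 0;
  }

  // Including the offset in the cache hint keeps shader variants for
  // different offsets from colliding in the program cache.
  std::string CacheHint() const {
    return "mrcc_offset=" + std::to_string(multi_rotary_cache_concat_offset);
  }
};
```

With this shape, "zero runtime overhead when disabled" follows naturally: when the derived boolean is false, the template branch that reads the offset is simply not compiled into the shader.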
sushraja-msft approved these changes on Feb 24, 2026
This pull request adds support for a new configuration option, `multiRotaryCacheConcatOffset`, to the WebGPU execution provider in ONNX Runtime. The option allows the use of a concatenated multi-rotary cache in rotary embedding operations and can be enabled or disabled via provider configuration. The changes propagate the option through the provider, the compute context, and the shader programs, and update the relevant WGSL shader templates to use the offset when it is specified. This enables more flexible and efficient handling of rotary cache concatenation in transformer models running on WebGPU.

In the GenAI builder.py, when enable_webgpu_graph is enabled, xxx_cache_small and xxx_cache_large are concatenated together. In RoPE mode, the kernel needs this offset to look up the correct rotary cache entries.
This fixes incorrect results when the total sequence length exceeds 4096 for Phi models generated by builder.py with enable_webgpu_graph enabled.
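The lookup that the WGSL templates perform over the concatenated caches can be illustrated with the following sketch. The names, the two-region layout, and the return type are assumptions for illustration; the source only states that a small and a large cache are concatenated and that the offset marks the boundary between them.

```cpp
#include <cassert>
#include <cstdint>

// Which of the two concatenated cache regions a position falls in,
// and the row within that region. (Hypothetical layout: the "large"
// cache is stored immediately after the "small" one.)
struct CacheLookup {
  bool use_large;  // true if the position is at or beyond the offset
  uint32_t row;    // row index within the selected region
};

// Positions below the concat offset index the small cache; positions at
// or beyond it index the large cache, shifted down by the offset.
CacheLookup SelectRotaryCacheRow(uint32_t position, uint32_t concat_offset) {
  if (position < concat_offset) {
    return {false, position};
  }
  return {true, position - concat_offset};
}
```

Under this model, the previously hardcoded boundary of 4096 is exactly why sequences longer than 4096 read the wrong rows for models whose `original_context_length` differs; making the boundary configurable lets each model supply its own.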