webgpu: Add configurable multi rotary cache concat offset parameter #27434

Open
qjia7 wants to merge 3 commits into main from jiajiaqin/webgpu-multi-rotary-cache-offset
qjia7 commented Feb 24, 2026

This pull request adds a new configuration option, multiRotaryCacheConcatOffset, to the WebGPU execution provider in ONNX Runtime. The option lets rotary embedding operations work with a concatenated multi rotary cache, and it can be enabled or disabled via provider configuration. The change propagates the option through the provider, the compute context, and the shader programs, and updates the relevant WGSL shader templates to use the offset when it is specified.

These changes allow more flexible handling of rotary cache concatenation in transformer models running on WebGPU. In the GenAI builder.py, when enable_webgpu_graph is enabled, xxx_cache_small and xxx_cache_large are concatenated into a single cache. The rotary embedding (RoPE) path needs to know where that boundary lies in order to read the correct cache entries.

This fixes incorrect results when the total sequence length exceeds 4096 for Phi models generated by builder.py with enable_webgpu_graph enabled.
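
To make the indexing concrete, here is a minimal Python sketch of how a row in a concatenated cos/sin cache could be selected. It is not the WGSL from this PR. It assumes the small cache occupies rows `[0, concat_offset)` and the large cache follows it, and that the large segment is used once the total sequence length exceeds the offset. The function name `rotary_cache_row` is hypothetical.

```python
def rotary_cache_row(position: int, total_sequence_length: int, concat_offset: int) -> int:
    """Return the row of the concatenated cos/sin cache to read for `position`.

    Assumed layout (illustrative, not taken from the PR's shader code):
      rows [0, concat_offset)  -> entries of the small cache
      rows [concat_offset, ..) -> entries of the large cache
    """
    if total_sequence_length > concat_offset:
        # Long-context case: skip past the small cache segment.
        return concat_offset + position
    # Short-context case: read directly from the small cache segment.
    return position


# Usage: same position, different total sequence lengths.
short_row = rotary_cache_row(position=10, total_sequence_length=100, concat_offset=4096)
long_row = rotary_cache_row(position=10, total_sequence_length=5000, concat_offset=4096)
```

Under these assumptions, a hardcoded boundary of 4096 would select the wrong rows for any model whose small cache has a different length, which is the case the configurable offset is meant to cover.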

Add multiRotaryCacheConcatOffset EP option to support models with different
context lengths for multi-rotary embedding cache concatenation.

Changes:
- Add uint32 multiRotaryCacheConcatOffset config parameter (default: 4096)
- Keep useMultiRotaryCacheConcat boolean flag for enable/disable
- Update WebGpuExecutionProviderConfig struct with new offset field
- Add MultiRotaryCacheConcatOffset() accessor to EP and ComputeContext
- Add kMultiRotaryCacheConcatOffset config option constant
- Update webgpu_provider_factory.cc to parse and validate offset value
- Pass offset parameter to SplitPackedQKVWithRotaryEmbedding programs
- Update WGSL templates to use configurable offset instead of hardcoded 4096
- Reorder template parameters alphabetically for consistency

The offset parameter allows models with different original_context_length
values (e.g., 2048, 8192) to specify their cache concatenation boundary,
replacing the previously hardcoded 4096 value for Phi models.
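
The "parse and validate" step listed above might look roughly like the following Python sketch. The actual logic lives in webgpu_provider_factory.cc and is not shown on this page, so everything here is an assumption: the option key string, the function name `parse_concat_offset`, and the error messages are hypothetical. Only the uint32 type and the default of 4096 come from the commit message.

```python
UINT32_MAX = 2**32 - 1
DEFAULT_OFFSET = 4096  # default stated in the commit message

# Hypothetical key string, for illustration only; the real constant is
# kMultiRotaryCacheConcatOffset and its value is not shown in this PR page.
OFFSET_KEY = "multiRotaryCacheConcatOffset"


def parse_concat_offset(options: dict[str, str]) -> int:
    """Read the offset from string-valued provider options and validate it."""
    raw = options.get(OFFSET_KEY)
    if raw is None:
        return DEFAULT_OFFSET
    try:
        value = int(raw, 10)
    except ValueError:
        raise ValueError(f"{OFFSET_KEY} must be an unsigned integer, got {raw!r}") from None
    if not 0 <= value <= UINT32_MAX:
        raise ValueError(f"{OFFSET_KEY} must fit in uint32, got {value}")
    return value


# Usage: a model with original_context_length 8192 sets its own boundary.
offset = parse_concat_offset({OFFSET_KEY: "8192"})
```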
…parameter

- Remove redundant use_multi_rotary_cache_concat boolean flag from public API
- Keep only multi_rotary_cache_concat_offset parameter (0 = disabled)
- Internally derive boolean from offset > 0 for shader template compilation
- Add offset to CacheHint for proper shader variant caching
- Update logging to remove reference to removed boolean field

Benefits:
- Simpler API: single parameter instead of two
- Zero runtime overhead when disabled (conditional compilation)
- Self-documenting: 0 clearly indicates disabled state
@qjia7 qjia7 requested review from fs-eire and guschmue February 24, 2026 13:02
@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Feb 24, 2026