webgpu: Add configurable multi rotary cache concat offset parameter #27434
Open
Conversation
Add the `multiRotaryCacheConcatOffset` EP option to support models with different context lengths for multi-rotary embedding cache concatenation.

Changes:
- Add a uint32 `multiRotaryCacheConcatOffset` config parameter (default: 4096)
- Keep the `useMultiRotaryCacheConcat` boolean flag for enable/disable
- Update the `WebGpuExecutionProviderConfig` struct with the new offset field
- Add a `MultiRotaryCacheConcatOffset()` accessor to the EP and ComputeContext
- Add the `kMultiRotaryCacheConcatOffset` config option constant
- Update webgpu_provider_factory.cc to parse and validate the offset value
- Pass the offset parameter to the SplitPackedQKVWithRotaryEmbedding programs
- Update the WGSL templates to use the configurable offset instead of the hardcoded 4096
- Reorder template parameters alphabetically for consistency

The offset parameter allows models with different `original_context_length` values (e.g., 2048, 8192) to specify their cache concatenation boundary, replacing the value previously hardcoded to 4096 for Phi models.
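The parse-and-validate step in the provider factory could look roughly like the following sketch. The function name and error handling are assumptions for illustration, not the actual ONNX Runtime API; only the semantics (uint32 value, default 4096) come from the change description above.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: parse a provider option string such as
// "multiRotaryCacheConcatOffset" into a uint32, falling back to the
// documented default of 4096 when the option is absent.
uint32_t ParseOffsetOption(const std::string& value,
                           uint32_t default_value = 4096) {
  if (value.empty()) return default_value;
  size_t pos = 0;
  unsigned long parsed = std::stoul(value, &pos);
  // Reject trailing garbage and values that do not fit in uint32.
  if (pos != value.size() || parsed > UINT32_MAX) {
    throw std::invalid_argument(
        "multiRotaryCacheConcatOffset must be a uint32, got: " + value);
  }
  return static_cast<uint32_t>(parsed);
}
```

A caller would typically invoke this once per session creation and store the result in the provider config struct.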
…parameter

- Remove the redundant `use_multi_rotary_cache_concat` boolean flag from the public API
- Keep only the `multi_rotary_cache_concat_offset` parameter (0 = disabled)
- Internally derive the boolean from offset > 0 for shader template compilation
- Add the offset to the CacheHint for proper shader variant caching
- Update logging to remove the reference to the removed boolean field

Benefits:
- Simpler API: a single parameter instead of two
- Zero runtime overhead when disabled (conditional compilation)
- Self-documenting: 0 clearly indicates the disabled state
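The derived-boolean design described above can be sketched as follows. The struct and member names are hypothetical stand-ins for the real config fields; the point is that the enable state is computed from the offset, and the offset participates in the shader cache key so each offset value compiles its own shader variant.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Illustrative sketch of the single-parameter design: 0 means disabled,
// any positive value both enables the feature and sets the boundary.
struct RotaryCacheConfig {
  uint32_t multi_rotary_cache_concat_offset = 0;  // 0 = disabled

  // The boolean is derived, never stored, so the two can't disagree.
  bool UseMultiRotaryCacheConcat() const {
    return multi_rotary_cache_concat_offset > 0;
  }

  // Including the offset in the cache hint keeps shader variants for
  // different offsets from colliding in the program cache.
  std::string CacheHint() const {
    return "mrcc_offset=" + std::to_string(multi_rotary_cache_concat_offset);
  }
};
```

With this shape, "zero runtime overhead when disabled" follows naturally: when the derived boolean is false, the template branch that reads the offset is simply not compiled into the shader.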
sushraja-msft approved these changes on Feb 24, 2026
This pull request adds support for a new configuration option, `multiRotaryCacheConcatOffset`, to the WebGPU execution provider in ONNX Runtime. The option allows the use of a concatenated multi-rotary cache in rotary embedding operations and can be enabled or disabled via provider configuration. The changes propagate the option through the provider, the compute context, and the shader programs, and update the relevant WGSL shader templates to use the offset when it is specified. This enables more flexible and efficient handling of rotary cache concatenation in transformer models running on WebGPU.

In the GenAI builder.py, when enable_webgpu_graph is enabled, xxx_cache_small and xxx_cache_large are concatenated together. In RoPE mode, the kernel needs this offset to look up the correct rotary cache entries.
This fixes incorrect results when the total sequence length exceeds 4096 for Phi models generated by builder.py with enable_webgpu_graph enabled.
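The lookup that the WGSL templates perform over the concatenated caches can be illustrated with the following sketch. The names, the two-region layout, and the return type are assumptions for illustration; the source only states that a small and a large cache are concatenated and that the offset marks the boundary between them.

```cpp
#include <cassert>
#include <cstdint>

// Which of the two concatenated cache regions a position falls in,
// and the row within that region. (Hypothetical layout: the "large"
// cache is stored immediately after the "small" one.)
struct CacheLookup {
  bool use_large;  // true if the position is at or beyond the offset
  uint32_t row;    // row index within the selected region
};

// Positions below the concat offset index the small cache; positions at
// or beyond it index the large cache, shifted down by the offset.
CacheLookup SelectRotaryCacheRow(uint32_t position, uint32_t concat_offset) {
  if (position < concat_offset) {
    return {false, position};
  }
  return {true, position - concat_offset};
}
```

Under this model, the previously hardcoded boundary of 4096 is exactly why sequences longer than 4096 read the wrong rows for models whose `original_context_length` differs; making the boundary configurable lets each model supply its own.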