Add M-tile loop with dispatch capping for Intel Xe2/3-LPG by jchen10 · Pull Request #28250 · microsoft/onnxruntime

jchen10 · 2026-04-28T06:05:48Z

- Wrap the 8x16x16 MatMulNBits (SubgroupMatrix) kernel body in an M-tile loop that uses uniforms.m_tiles_per_wg to assign tiles to each workgroup
- Cap dispatch_y on Xe2/3-LPG when M > 2k, with an occupancy factor of 16x
- Non-Intel and small-M paths pass m_tiles_per_wg=1 (no behavior change)
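
To make the relationship between the capped dispatch_y and m_tiles_per_wg concrete, here is a minimal host-side sketch. It is not the code from this PR: `DispatchPlan`, `CeilDiv`, `PlanDispatch`, and the `max_dispatch_y` parameter are hypothetical names, and how the real cap is derived from the 16x occupancy factor is not shown in this description. The sketch only assumes that the tile count is split evenly across a capped number of workgroups.

```cpp
#include <cstdint>

// Hypothetical illustration; names and the cap value are assumptions, not PR code.
struct DispatchPlan {
  uint32_t dispatch_y;      // workgroups launched along the M dimension
  uint32_t m_tiles_per_wg;  // M-tiles each workgroup iterates over
};

constexpr uint32_t CeilDiv(uint32_t a, uint32_t b) { return (a + b - 1) / b; }

// total_m_tiles: number of M-tiles needed to cover M.
// max_dispatch_y: assumed upper bound on dispatch_y; 0 means "do not cap"
// (the non-Intel / small-M path, which keeps one tile per workgroup).
constexpr DispatchPlan PlanDispatch(uint32_t total_m_tiles, uint32_t max_dispatch_y) {
  if (max_dispatch_y == 0 || total_m_tiles <= max_dispatch_y) {
    return DispatchPlan{total_m_tiles, 1};
  }
  // Smallest per-workgroup tile count that fits under the cap.
  const uint32_t tiles_per_wg = CeilDiv(total_m_tiles, max_dispatch_y);
  // Launch only as many workgroups as that tile count actually requires.
  return DispatchPlan{CeilDiv(total_m_tiles, tiles_per_wg), tiles_per_wg};
}
```

For example, under these assumptions 260 M-tiles with a cap of 64 yields 5 tiles per workgroup and a dispatch_y of 52, while any tile count at or below the cap keeps m_tiles_per_wg at 1.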

jchen10 · 2026-04-28T06:11:43Z

We observed a sharp prefill performance drop for long prompts (>4k) on PTL. This PR largely alleviates the problem.

- Wrap 8x16x16 MatMulNBits(SubgroupMatrix) kernel body in M-tile loop using uniforms.m_tiles_per_wg for tile assignment per workgroup
- Cap dispatch_y on Xe2/3-LPG when M > 2k, with occupancy factor 16x
- Non-Intel or small-M paths pass m_tiles_per_wg=1 (no behavior change)

Copilot

Pull request overview

This PR updates the WebGPU SubgroupMatrix MatMulNBits 8x16x16 path to reduce dispatch overhead on large-M Intel Xe2/Xe3-LPG devices by having each workgroup process multiple M-tiles sequentially, driven by a new m_tiles_per_wg uniform and a capped dispatch_y.

Changes:

Wrap the 8x16x16 WGSL kernel body in an outer M-tile loop controlled by uniforms.m_tiles_per_wg.
Add m_tiles_per_wg to the program’s uniform interface and pass it from the CPU side.
Cap dispatch_y for large M on Intel Xe2/Xe3-LPG and derive m_tiles_per_wg accordingly.
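
The sequential per-workgroup tile loop can be modeled on the CPU to check that a given (dispatch_y, m_tiles_per_wg) pair still covers every M-tile exactly once. The sketch below is an assumption-laden model, not the WGSL from this PR: it assumes each workgroup owns a contiguous run of tiles starting at `wg_y * m_tiles_per_wg` and skips tiles past the end, and `CoversEachTileOnce` is a hypothetical helper name.

```cpp
#include <cstdint>

// Hypothetical CPU model of the outer M-tile loop; the contiguous tile
// assignment is an assumption, the real shader may map tiles differently.
constexpr bool CoversEachTileOnce(uint32_t total_m_tiles,
                                  uint32_t dispatch_y,
                                  uint32_t m_tiles_per_wg) {
  uint32_t next_expected = 0;  // tiles must be visited as 0, 1, 2, ...
  for (uint32_t wg_y = 0; wg_y < dispatch_y; ++wg_y) {
    for (uint32_t t = 0; t < m_tiles_per_wg; ++t) {
      const uint32_t tile = wg_y * m_tiles_per_wg + t;
      if (tile >= total_m_tiles) break;  // last workgroup may own fewer tiles
      // In the kernel, the accumulators would be reset here before the
      // tile's matmul work, so results from one tile do not leak into the next.
      if (tile != next_expected) return false;
      ++next_expected;
    }
  }
  return next_expected == total_m_tiles;  // false if some tiles were never reached
}
```

With m_tiles_per_wg set to 1 the model degenerates to one tile per workgroup, which matches the stated "no behavior change" path; a dispatch_y that is too small for the chosen tile count makes the check fail.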

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `onnxruntime/contrib_ops/webgpu/quantization/subgroup_matrix_matmul_nbits_8x16x16.wgsl.template` | Adds an outer M-tile loop and resets accumulators per tile using `m_tiles_per_wg`. |
| `onnxruntime/contrib_ops/webgpu/quantization/subgroup_matrix_matmul_nbits.h` | Extends the uniform variable list with `m_tiles_per_wg`. |
| `onnxruntime/contrib_ops/webgpu/quantization/subgroup_matrix_matmul_nbits.cc` | Computes capped `dispatch_y` on Intel Xe2/Xe3-LPG and passes `m_tiles_per_wg` to the shader. |

jchen10 force-pushed the large_prefill branch from fa313d1 to 1f25ee4 Compare April 29, 2026 01:39

guschmue added the ep:WebGPU ort-web webgpu provider label May 1, 2026

guschmue requested a review from Copilot May 1, 2026 16:51

Copilot started reviewing on behalf of guschmue May 1, 2026 16:52

Copilot AI reviewed May 1, 2026

Comment thread onnxruntime/contrib_ops/webgpu/quantization/subgroup_matrix_matmul_nbits_8x16x16.wgsl.template

guschmue approved these changes May 6, 2026

guschmue merged commit 5f071fb into microsoft:main May 6, 2026
90 of 91 checks passed

guschmue merged 1 commit into microsoft:main from jchen10:large_prefill
