webgpu: Increase MatMulNBits K-parallelism with tile_size_k_vec=32 #27834
Open
Pull request overview
Updates the WebGPU MatMulNBits default shader variant to increase K-reduction parallelism on non-Intel GPUs by making tile_size_k_vec configurable and selecting a larger default for better throughput.
Changes:
- Add a `tile_size_k_vec` parameter (default `16`) to `MatMulNBitsProgram` so K-parallelism can be tuned per device.
- Use `tile_size_k_vec = 32` for non-Intel adapters and keep `16` for Intel adapters when constructing the default MatMulNBits program.
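The two changes above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `MatMulNBitsProgramSketch` and `SelectTileSizeKVec` are hypothetical names, and comparing a lowercase vendor string against `"intel"` is an assumption about how the adapter vendor is exposed.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for MatMulNBitsProgram; only the new parameter is modeled.
class MatMulNBitsProgramSketch {
 public:
  // The default of 16 matches the default described in the change list.
  explicit MatMulNBitsProgramSketch(uint32_t tile_size_k_vec = 16)
      : tile_size_k_vec_(tile_size_k_vec) {}

  uint32_t tile_size_k_vec() const { return tile_size_k_vec_; }

 private:
  uint32_t tile_size_k_vec_;  // would be passed on to shader generation
};

// Hypothetical helper: keep 16 on Intel adapters, use 32 everywhere else.
// The vendor string format ("intel") is an assumption made for this sketch.
uint32_t SelectTileSizeKVec(const std::string& vendor) {
  return vendor == "intel" ? 16u : 32u;
}
```

A caller would then construct the program with something like `MatMulNBitsProgramSketch program{SelectTileSizeKVec(vendor)};`, while call sites that omit the argument keep the previous value of 16.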
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits.h` | Extends `MatMulNBitsProgram` to store a configurable `tile_size_k_vec_` used during shader generation. |
| `onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits.cc` | Plumbs `tile_size_k_vec_` into WGSL template parameters and selects 16 vs 32 based on adapter vendor. |
Use tile_size_k_vec=32 (instead of 16) for the MatMulNBits default kernel, doubling the number of threads working on K-dimension reduction per output row. This improves token generation throughput by ~3% on NVIDIA GPUs by better utilizing memory bandwidth.

Intel devices retain tile_size_k_vec=16 due to different subgroup and cache characteristics.

Changes:
- matmul_nbits.h: Add tile_size_k_vec parameter (default 16) to MatMulNBitsProgram constructor.
- matmul_nbits.cc: Select tile_size_k_vec=32 for non-Intel vendors, pass to program constructor.
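To illustrate what doubling the number of K-reduction threads means, here is a CPU-side model of a split-K dot product for one output row. It is only a sketch under stated assumptions: the real kernel is a WGSL template that is not shown in this PR page, and `StepsPerLane`, `SplitKDot`, and `elems_per_lane` are hypothetical names. The model assumes each of the `tile_size_k_vec` lanes handles a fixed-width chunk of K per step and that the partial sums are combined at the end.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical: how many loop steps each lane needs to cover K when
// tile_size_k_vec lanes each consume elems_per_lane elements per step.
uint32_t StepsPerLane(uint32_t k, uint32_t tile_size_k_vec, uint32_t elems_per_lane) {
  const uint32_t span = tile_size_k_vec * elems_per_lane;
  return (k + span - 1) / span;  // ceiling division
}

// CPU model of the reduction for one output row: every lane accumulates its
// own partial sum over its chunks of K, then the partial sums are added up.
// Assumes a.size() == b.size().
int64_t SplitKDot(const std::vector<int32_t>& a, const std::vector<int32_t>& b,
                  uint32_t tile_size_k_vec, uint32_t elems_per_lane) {
  std::vector<int64_t> partial(tile_size_k_vec, 0);
  const std::size_t k = a.size();
  const std::size_t span = static_cast<std::size_t>(tile_size_k_vec) * elems_per_lane;
  for (std::size_t base = 0; base < k; base += span) {
    for (uint32_t lane = 0; lane < tile_size_k_vec; ++lane) {
      const std::size_t begin = base + static_cast<std::size_t>(lane) * elems_per_lane;
      for (std::size_t i = begin; i < begin + elems_per_lane && i < k; ++i) {
        partial[lane] += static_cast<int64_t>(a[i]) * b[i];
      }
    }
  }
  int64_t sum = 0;
  for (int64_t p : partial) sum += p;  // final cross-lane reduction
  return sum;
}
```

In this model, going from 16 to 32 lanes halves the number of steps each lane performs for a given K while producing the same result. Whether that translates into higher throughput depends on the device, which is consistent with the PR keeping 16 for Intel adapters.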
183b6e3 to 304383d
guschmue approved these changes Mar 25, 2026