[Don't review] webgpu: Refactor SubgroupMatrixMatMulNBits to vendor-agnostic config … by qjia7 · Pull Request #28109 · microsoft/onnxruntime

qjia7 · 2026-04-17T01:50:50Z

…+ add NVIDIA 16x16x16

Refactor subgroup matrix MatMulNBits support from vendor-specific (Apple/Intel) to a vendor-agnostic config-based approach. Any GPU reporting a matching subgroup matrix config from Dawn is now automatically supported.

Key changes:

- Replace vendor-specific config table with SupportedSubgroupMatrixConfig struct containing {componentType, resultComponentType, M, N, K, subgroupMinSize, subgroupMaxSize, needsPrepack}. No architecture or backendType required.
- Remove vendor_ member from SubgroupMatrixMatMulNBitsProgram. Shader selection is now driven by config dimensions (8x8x8, 8x16x16, 16x16x16).
- Remove vendor gate in matmul_nbits.cc call site.
- Rename shader templates: _apple -> _8x8x8, _intel -> _8x16x16.
- Add new 16x16x16 shader template for NVIDIA Blackwell (RTX 5080).
  - 4 subgroups x 32 lanes = 128 threads per workgroup
  - 64x64 tile with 16x16 subgroup matrices
  - Bounds-checked output via scratch buffer for partial M tiles
- Fix prepack shader OOB reads: add scalar fallback with zero-fill for partial blocks where M is not a multiple of kSgMatM.
- Prioritize larger configs (16x16x16 > 8x16x16 > 8x8x8) when multiple match.
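
The config-driven selection described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the struct field names come from the description, but the field types, the `ReportedConfig` type, the helper names `SupportedConfigs`, `SelectConfig` and `ShaderSuffix`, the subgroup-size matching rule, and the subgroup sizes and `needsPrepack` values for the 8x16x16 and 8x8x8 entries are all assumptions made for the example.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the component types an adapter can report.
enum class ComponentType { F16, F32 };

// Field names follow the PR description; the field types are assumptions.
struct SupportedSubgroupMatrixConfig {
  ComponentType componentType;
  ComponentType resultComponentType;
  uint32_t M;
  uint32_t N;
  uint32_t K;
  uint32_t subgroupMinSize;
  uint32_t subgroupMaxSize;
  bool needsPrepack;
};

// Hypothetical shape of one subgroup matrix config reported by the adapter.
struct ReportedConfig {
  ComponentType componentType;
  ComponentType resultComponentType;
  uint32_t M;
  uint32_t N;
  uint32_t K;
};

// Table ordered by priority: larger configs first (16x16x16 > 8x16x16 > 8x8x8).
// Subgroup sizes and needsPrepack for the last two rows are placeholders.
inline const std::vector<SupportedSubgroupMatrixConfig>& SupportedConfigs() {
  static const std::vector<SupportedSubgroupMatrixConfig> configs = {
      {ComponentType::F16, ComponentType::F16, 16, 16, 16, 32, 32, false},
      {ComponentType::F16, ComponentType::F16, 8, 16, 16, 16, 32, true},
      {ComponentType::F16, ComponentType::F16, 8, 8, 8, 32, 32, false},
  };
  return configs;
}

// Returns the highest-priority supported config that the adapter reports,
// or nullptr if none matches. No vendor, architecture, or backend check.
inline const SupportedSubgroupMatrixConfig* SelectConfig(
    const std::vector<ReportedConfig>& reported,
    uint32_t adapterSubgroupMinSize,
    uint32_t adapterSubgroupMaxSize) {
  for (const auto& supported : SupportedConfigs()) {
    // Assumed rule: the adapter's subgroup size range must fit the config's.
    if (adapterSubgroupMinSize < supported.subgroupMinSize ||
        adapterSubgroupMaxSize > supported.subgroupMaxSize) {
      continue;
    }
    for (const auto& r : reported) {
      if (r.componentType == supported.componentType &&
          r.resultComponentType == supported.resultComponentType &&
          r.M == supported.M && r.N == supported.N && r.K == supported.K) {
        return &supported;
      }
    }
  }
  return nullptr;
}

// Shader template suffix derived from the config dimensions, e.g. "_8x8x8".
inline std::string ShaderSuffix(const SupportedSubgroupMatrixConfig& c) {
  return "_" + std::to_string(c.M) + "x" + std::to_string(c.N) + "x" +
         std::to_string(c.K);
}
```

Because the table is scanned in priority order, an adapter that reports both 8x8x8 and 16x16x16 would get the 16x16x16 template in this sketch.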

Verified on NVIDIA RTX 5080 (Blackwell, Vulkan backend):

- Correctness: model-qa.py with phi4-graph-prune produces identical output to D3D12 baseline
- Prefill (phi4, l=1024):

| phi4-graph-prune | D3D12 DP4A | Vulkan DP4A | Vulkan TC (16x16x16) |
| --- | --- | --- | --- |
| Prefill (tps) | 3,134 | 6,389 | 7,089 |

- NVIDIA reports ChromiumExperimentalSubgroupMatrix with F16/F16 16x16x16 config

Description

Motivation and Context

…+ add NVIDIA 16x16x16

Refactor subgroup matrix MatMulNBits support from vendor-specific (Apple/Intel) to a vendor-agnostic config-based approach. Any GPU reporting a matching subgroup matrix config from Dawn is now automatically supported.

Key changes:

- Replace vendor-specific config table with SupportedSubgroupMatrixConfig struct containing {componentType, resultComponentType, M, N, K, subgroupMinSize, subgroupMaxSize, needsPrepack}. No architecture or backendType required.
- Remove vendor_ member from SubgroupMatrixMatMulNBitsProgram. Shader selection is now driven by config dimensions (8x8x8, 8x16x16, 16x16x16).
- Remove vendor gate in matmul_nbits.cc call site.
- Rename shader templates: _apple -> _8x8x8, _intel -> _8x16x16.
- Add new 16x16x16 shader template for NVIDIA Blackwell (RTX 5080).
  - 4 subgroups x 32 lanes = 128 threads per workgroup
  - 64x64 tile with 16x16 subgroup matrices
  - Bounds-checked output via scratch buffer for partial M tiles
- Fix prepack shader OOB reads: add scalar fallback with zero-fill for partial blocks where M is not a multiple of kSgMatM.
- Prioritize larger configs (16x16x16 > 8x16x16 > 8x8x8) when multiple match.

Verified on NVIDIA RTX 5080 (Blackwell, Vulkan backend):

- Correctness: model-qa.py with phi4-graph-prune produces identical output to D3D12 baseline
- Prefill (phi4, l=1024):
  - D3D12 DP4A baseline: 3,006 tps
  - Vulkan DP4A baseline: 6,155 tps
  - Vulkan tensor core (this change): 6,759 tps (+10% vs Vulkan DP4A, +125% vs D3D12)
- NVIDIA reports ChromiumExperimentalSubgroupMatrix with F16/F16 16x16x16 config
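
The workgroup geometry quoted for the 16x16x16 template can be checked with a few lines of arithmetic. The sketch below is illustrative only: `TileGeometry` and `ComputeTileGeometry` are hypothetical names, and the even split of subgroup-matrix tiles across subgroups is an assumption, since the commit message does not say how tiles are assigned.

```cpp
#include <cstdint>

// Hypothetical helper type; not part of the PR.
struct TileGeometry {
  uint32_t threadsPerWorkgroup;     // subgroups * lanes
  uint32_t sgMatTilesPerWorkgroup;  // subgroup-matrix tiles covering the workgroup tile
  uint32_t sgMatTilesPerSubgroup;   // assumes an even split across subgroups
};

constexpr TileGeometry ComputeTileGeometry(uint32_t subgroups, uint32_t lanes,
                                           uint32_t tileM, uint32_t tileN,
                                           uint32_t sgMatM, uint32_t sgMatN) {
  return TileGeometry{subgroups * lanes,
                      (tileM / sgMatM) * (tileN / sgMatN),
                      ((tileM / sgMatM) * (tileN / sgMatN)) / subgroups};
}

// Numbers from the commit message: 4 subgroups x 32 lanes, 64x64 tile,
// 16x16 subgroup matrices.
constexpr TileGeometry kNvidia16x16x16 = ComputeTileGeometry(4, 32, 64, 64, 16, 16);

static_assert(kNvidia16x16x16.threadsPerWorkgroup == 128, "4 x 32 lanes");
static_assert(kNvidia16x16x16.sgMatTilesPerWorkgroup == 16, "(64/16) x (64/16)");
static_assert(kNvidia16x16x16.sgMatTilesPerSubgroup == 4, "16 tiles over 4 subgroups");
```

Under that even-split assumption, each subgroup would own four 16x16 output tiles of the 64x64 workgroup tile.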

…barrier placement

- Use fast subgroupMatrixStore directly to output for full M blocks (sg_m_base + kSgMatM <= M), avoiding scratch overhead for the common case.
- Use scratch + scalar write only for partial M blocks at the boundary.
- Move workgroupBarrier outside the if/else to avoid divergent barrier (WGSL disallows workgroupBarrier in non-uniform control flow).
- Make scratch array unconditional (needed for both bias and non-bias paths).

This fixes the Invalid ShaderModule crash that occurred when the barrier was inside a branch that different subgroups could take different sides of.
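
The full-block versus partial-block split can be modeled on the CPU as below. This is a sketch of the bounds logic only, not the WGSL shader: `IsFullMBlock` and `StoreTile` are hypothetical names, `kSgMatM` and `kSgMatN` are set to 16 for the example, N is assumed to be a multiple of `kSgMatN`, and the scratch buffer and `workgroupBarrier` are not modeled.

```cpp
#include <cstdint>
#include <vector>

constexpr uint32_t kSgMatM = 16;  // rows per subgroup-matrix tile (example value)
constexpr uint32_t kSgMatN = 16;  // columns per subgroup-matrix tile (example value)

// Condition from the commit message: the whole tile lies inside the M rows.
inline bool IsFullMBlock(uint32_t sg_m_base, uint32_t M) {
  return sg_m_base + kSgMatM <= M;
}

// Copies one kSgMatM x kSgMatN tile into a row-major M x N output.
// On the GPU the full-block case would be a direct subgroupMatrixStore and
// the partial case a scratch buffer plus bounds-checked scalar writes; here
// both reduce to a loop, and only the number of rows written differs.
inline void StoreTile(const std::vector<float>& tile, uint32_t sg_m_base,
                      uint32_t n_base, uint32_t M, uint32_t N,
                      std::vector<float>& output) {
  uint32_t rows = 0;
  if (IsFullMBlock(sg_m_base, M)) {
    rows = kSgMatM;  // fast path: every row is in range
  } else if (sg_m_base < M) {
    rows = M - sg_m_base;  // boundary path: write only the valid rows
  }
  for (uint32_t r = 0; r < rows; ++r) {
    for (uint32_t c = 0; c < kSgMatN; ++c) {
      output[(sg_m_base + r) * N + n_base + c] = tile[r * kSgMatN + c];
    }
  }
}
```

With M = 20, the block starting at row 0 takes the fast path and the block starting at row 16 writes only rows 16 through 19, so nothing lands past the end of the output.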

qjia7 added 2 commits April 17, 2026 09:48

qjia7 wants to merge 2 commits into main from opt/webgpu-vulkan-perf
