[Don't review] webgpu: Refactor SubgroupMatrixMatMulNBits to vendor-agnostic config …#28109

Draft
qjia7 wants to merge 2 commits into main from opt/webgpu-vulkan-perf


qjia7 commented Apr 17, 2026

…+ add NVIDIA 16x16x16

Refactor subgroup matrix MatMulNBits support from vendor-specific (Apple/Intel) to a vendor-agnostic config-based approach. Any GPU reporting a matching subgroup matrix config from Dawn is now automatically supported.

Key changes:

  • Replace vendor-specific config table with SupportedSubgroupMatrixConfig struct containing {componentType, resultComponentType, M, N, K, subgroupMinSize, subgroupMaxSize, needsPrepack}. No architecture or backendType required.
  • Remove vendor_ member from SubgroupMatrixMatMulNBitsProgram. Shader selection is now driven by config dimensions (8x8x8, 8x16x16, 16x16x16).
  • Remove vendor gate in matmul_nbits.cc call site.
  • Rename shader templates: _apple -> _8x8x8, _intel -> _8x16x16.
  • Add new 16x16x16 shader template for NVIDIA Blackwell (RTX 5080).
    • 4 subgroups x 32 lanes = 128 threads per workgroup
    • 64x64 tile with 16x16 subgroup matrices
    • Bounds-checked output via scratch buffer for partial M tiles
  • Fix prepack shader OOB reads: add scalar fallback with zero-fill for partial blocks where M is not a multiple of kSgMatM.
  • Prioritize larger configs (16x16x16 > 8x16x16 > 8x8x8) when multiple match.
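The config matching and prioritization described in the list above can be sketched roughly as follows. This is only an illustration, not the PR's code: the struct fields come from the description, but their types, the `ComponentType` enum, the `ReportedConfig` struct, the `Matches` and `SelectConfig` helpers, and the rule that the adapter's subgroup size range must fall inside the config's range are all assumptions. "Larger" is interpreted here as a larger M*N*K product, which reproduces the stated order 16x16x16 > 8x16x16 > 8x8x8.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the component types reported by Dawn.
enum class ComponentType { F16, F32 };

// Fields follow the PR description; the concrete types are assumed.
struct SupportedSubgroupMatrixConfig {
  ComponentType componentType;
  ComponentType resultComponentType;
  uint32_t M;
  uint32_t N;
  uint32_t K;
  uint32_t subgroupMinSize;
  uint32_t subgroupMaxSize;
  bool needsPrepack;
};

// Hypothetical description of one subgroup matrix config reported by the adapter.
struct ReportedConfig {
  ComponentType componentType;
  ComponentType resultComponentType;
  uint32_t M;
  uint32_t N;
  uint32_t K;
};

// Assumed matching rule: same types and dimensions, and the adapter's
// subgroup size range lies within the range the config supports.
inline bool Matches(const SupportedSubgroupMatrixConfig& s, const ReportedConfig& r,
                    uint32_t adapterMinSize, uint32_t adapterMaxSize) {
  return s.componentType == r.componentType &&
         s.resultComponentType == r.resultComponentType &&
         s.M == r.M && s.N == r.N && s.K == r.K &&
         adapterMinSize >= s.subgroupMinSize &&
         adapterMaxSize <= s.subgroupMaxSize;
}

// Returns the matching supported config with the largest M*N*K,
// or nullptr if nothing matches. No vendor or backend check is involved.
inline const SupportedSubgroupMatrixConfig* SelectConfig(
    const std::vector<SupportedSubgroupMatrixConfig>& supported,
    const std::vector<ReportedConfig>& reported,
    uint32_t adapterMinSize, uint32_t adapterMaxSize) {
  const SupportedSubgroupMatrixConfig* best = nullptr;
  uint64_t bestVolume = 0;
  for (const auto& s : supported) {
    for (const auto& r : reported) {
      if (!Matches(s, r, adapterMinSize, adapterMaxSize)) continue;
      const uint64_t volume = static_cast<uint64_t>(s.M) * s.N * s.K;
      if (volume > bestVolume) {
        best = &s;
        bestVolume = volume;
      }
    }
  }
  return best;
}
```

Under these assumptions, an adapter that reports both an 8x8x8 and a 16x16x16 F16/F16 config would be routed to the 16x16x16 shader template.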

Verified on NVIDIA RTX 5080 (Blackwell, Vulkan backend):

  • Correctness: model-qa.py with phi4-graph-prune produces identical output to D3D12 baseline
  • Prefill (phi4, l=1024):
| phi4-graph-prune | D3D12 DP4A | Vulkan DP4A | Vulkan TC (16x16x16) |
| --- | --- | --- | --- |
| Prefill (tps) | 3,134 | 6,389 | 7,089 |
  • NVIDIA reports ChromiumExperimentalSubgroupMatrix with F16/F16 16x16x16 config

Description

Motivation and Context

qjia7 added 2 commits April 17, 2026 09:48
…+ add NVIDIA 16x16x16

Refactor subgroup matrix MatMulNBits support from vendor-specific (Apple/Intel)
to a vendor-agnostic config-based approach. Any GPU reporting a matching
subgroup matrix config from Dawn is now automatically supported.

Key changes:
- Replace vendor-specific config table with SupportedSubgroupMatrixConfig struct
  containing {componentType, resultComponentType, M, N, K, subgroupMinSize,
  subgroupMaxSize, needsPrepack}. No architecture or backendType required.
- Remove vendor_ member from SubgroupMatrixMatMulNBitsProgram. Shader selection
  is now driven by config dimensions (8x8x8, 8x16x16, 16x16x16).
- Remove vendor gate in matmul_nbits.cc call site.
- Rename shader templates: _apple -> _8x8x8, _intel -> _8x16x16.
- Add new 16x16x16 shader template for NVIDIA Blackwell (RTX 5080).
  - 4 subgroups x 32 lanes = 128 threads per workgroup
  - 64x64 tile with 16x16 subgroup matrices
  - Bounds-checked output via scratch buffer for partial M tiles
- Fix prepack shader OOB reads: add scalar fallback with zero-fill for
  partial blocks where M is not a multiple of kSgMatM.
- Prioritize larger configs (16x16x16 > 8x16x16 > 8x8x8) when multiple match.
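As a rough CPU-side illustration of the prepack fix mentioned above, the sketch below pads a row-major M x K matrix up to a multiple of `kSgMatM` rows, copying full blocks as-is and zero-filling the rows of a partial block that lie past M instead of reading them. The value `kSgMatM = 16`, the use of `float` in place of f16, the padded row-major output layout, and the function name `PrepackRowsZeroFill` are all assumptions made for the example; the actual shader layout may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed subgroup matrix height for the example (the 16x16x16 config).
constexpr uint32_t kSgMatM = 16;

// Hypothetical CPU reference: returns a copy of `a` (M x K, row-major)
// padded to ceil(M / kSgMatM) * kSgMatM rows.
std::vector<float> PrepackRowsZeroFill(const std::vector<float>& a,
                                       uint32_t M, uint32_t K) {
  const uint32_t blocks = (M + kSgMatM - 1) / kSgMatM;
  std::vector<float> out(static_cast<size_t>(blocks) * kSgMatM * K, 0.0f);
  for (uint32_t block = 0; block < blocks; ++block) {
    const uint32_t row_base = block * kSgMatM;
    // Full blocks correspond to the fast path; only the last block can be partial.
    const bool full_block = row_base + kSgMatM <= M;
    for (uint32_t r = 0; r < kSgMatM; ++r) {
      const uint32_t row = row_base + r;
      // Scalar fallback: never read rows at or beyond M, leave them zero.
      if (!full_block && row >= M) continue;
      for (uint32_t k = 0; k < K; ++k) {
        const size_t idx = static_cast<size_t>(row) * K + k;
        out[idx] = a[idx];
      }
    }
  }
  return out;
}
```

Zero-filled rows contribute nothing to the matrix product, so the padded rows do not affect the valid part of the result.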

Verified on NVIDIA RTX 5080 (Blackwell, Vulkan backend):
- Correctness: model-qa.py with phi4-graph-prune produces identical output
  to D3D12 baseline
- Prefill (phi4, l=1024):
  - D3D12 DP4A baseline: 3,006 tps
  - Vulkan DP4A baseline: 6,155 tps
  - Vulkan tensor core (this change): 6,759 tps (+10% vs Vulkan DP4A, +125% vs D3D12)
- NVIDIA reports ChromiumExperimentalSubgroupMatrix with F16/F16 16x16x16 config
…barrier placement

- Use fast subgroupMatrixStore directly to output for full M blocks
  (sg_m_base + kSgMatM <= M), avoiding scratch overhead for the common case.
- Use scratch + scalar write only for partial M blocks at the boundary.
- Move workgroupBarrier outside the if/else to avoid divergent barrier
  (WGSL disallows workgroupBarrier in non-uniform control flow).
- Make scratch array unconditional (needed for both bias and non-bias paths).

This fixes the Invalid ShaderModule crash that occurred when the barrier
sat inside a branch whose two sides could be taken by different subgroups.
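The control-flow shape described in this commit can be mimicked on the CPU as a minimal sketch. It is not the WGSL shader: `StoreTile`, the `Tile` alias, the tile size constants, and the assumption that only the M dimension needs bounds checks (N is taken to be a multiple of the tile width) are all introduced here for illustration. The comment marks where the single `workgroupBarrier` would sit in the shader, outside the branch so that every invocation reaches it.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed subgroup matrix tile size for the example (16x16 result tile).
constexpr uint32_t kSgMatM = 16;
constexpr uint32_t kSgMatN = 16;
using Tile = std::array<float, kSgMatM * kSgMatN>;

// Hypothetical CPU emulation of one subgroup writing its result tile into a
// row-major M x N output. N is assumed to be a multiple of kSgMatN.
void StoreTile(const Tile& tile, std::vector<float>& output,
               uint32_t M, uint32_t N,
               uint32_t sg_m_base, uint32_t sg_n_base) {
  const bool full_block = sg_m_base + kSgMatM <= M;
  Tile scratch{};  // Declared unconditionally, as in the commit message.

  if (full_block) {
    // Fast path: stands in for subgroupMatrixStore straight to the output.
    for (uint32_t r = 0; r < kSgMatM; ++r) {
      for (uint32_t c = 0; c < kSgMatN; ++c) {
        const size_t dst = static_cast<size_t>(sg_m_base + r) * N + sg_n_base + c;
        output[dst] = tile[r * kSgMatN + c];
      }
    }
  } else {
    // Boundary path: stage the tile in scratch first.
    scratch = tile;
  }

  // In the shader, the single workgroupBarrier() would go here, after the
  // if/else, so it executes in uniform control flow.

  if (!full_block) {
    // Scalar write with a bounds check on the row index.
    for (uint32_t r = 0; r < kSgMatM && sg_m_base + r < M; ++r) {
      for (uint32_t c = 0; c < kSgMatN; ++c) {
        const size_t dst = static_cast<size_t>(sg_m_base + r) * N + sg_n_base + c;
        output[dst] = scratch[r * kSgMatN + c];
      }
    }
  }
}
```

With M = 20, for example, the tile starting at row 0 takes the direct path, while the tile starting at row 16 goes through scratch and writes only rows 16 to 19.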