[WebGPU] QKV and MLP fusions for Qwen3 #28280

Open
hariharans29 wants to merge 25 commits into main from hari/webgpu_perf_1

@hariharans29 commented Apr 30, 2026

Description

Summary

Adds two WebGPU-only graph fusions and the contrib ops they target, plus a small
refactor of the existing MatMulNBits dispatch logic so the new fused kernels
can share its predicates.

| Component | Files | Purpose |
| --- | --- | --- |
| MatMulNBitsMlp op + kernel | contrib_ops/webgpu/quantization/matmul_nbits_mlp.{cc,h}, *.wgsl.template (3) | Fuses the SwiGLU MLP block: optional (Skip)SimplifiedLayerNormalization + two MatMulNBits projections (gate, up) + optional biases + Sigmoid/Mul (SiLU) + element-wise Mul. Single dispatch instead of 5–7. |
| MatMulNBitsQkv op + kernel | contrib_ops/webgpu/quantization/matmul_nbits_qkv.{cc,h}, *.wgsl.template | Fuses (Skip)SimplifiedLayerNormalization + three MatMulNBits projections (Q, K, V) sharing the same input. Single dispatch instead of 4. |
| Op schemas | core/graph/contrib_ops/contrib_defs.cc | MatMulNBitsMlp and MatMulNBitsQkv contrib op schemas (kMSDomain, opset 1). |
| Graph transformers | core/optimizer/matmul_nbits_{mlp,qkv}_fusion.{cc,h} | Pattern-match the source subgraphs and emit the fused ops. EP-gated to WebGPU only; no impact on other EPs. Registered in graph_transformer_utils.cc. |
| Dispatch helpers | contrib_ops/webgpu/quantization/matmul_nbits_common.{cc,h} + matmul_nbits.cc | Extracts the "would this dispatch use Subgroup-Matrix / DP4A / WideTile?" predicates into pure functions reusable by the fused kernels. No behavior change in the unfused MatMulNBits path. |
| Tests | test/optimizer/matmul_nbits_{mlp,qkv}_fusion_test.cc, graph_transform_utils_test.cc | Unit tests for the new transformers (positive + negative cases). |
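
For readers who want the math behind the two fused ops, here is a minimal reference sketch in plain Python. It is not the WGSL kernel: the weights are dense floats standing in for the quantized MatMulNBits operands, biases and the Skip variant are omitted, SimplifiedLayerNormalization is taken to be RMS normalization, and every function name is hypothetical.

```python
import math


def rms_norm(x, scale, eps=1e-6):
    """SimplifiedLayerNormalization read as RMS norm: x / sqrt(mean(x^2) + eps) * scale."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * s for v, s in zip(x, scale)]


def matvec(w, x):
    """Dense stand-in for one MatMulNBits projection; w holds one row per output feature."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def silu(v):
    """SiLU written as the Sigmoid/Mul pair from the source subgraph: v * sigmoid(v)."""
    return v * (1.0 / (1.0 + math.exp(-v)))


def mlp_reference(x, w_gate, w_up, norm_scale=None):
    """What the fused MLP op is meant to equal: SiLU(gate(h)) * up(h)."""
    h = rms_norm(x, norm_scale) if norm_scale is not None else x
    gate = matvec(w_gate, h)
    up = matvec(w_up, h)
    return [silu(g) * u for g, u in zip(gate, up)]


def qkv_reference(x, w_q, w_k, w_v, norm_scale):
    """What the fused QKV op is meant to equal: one norm feeding three projections."""
    h = rms_norm(x, norm_scale)  # computed once and shared by Q, K and V
    return matvec(w_q, h), matvec(w_k, h), matvec(w_v, h)
```

In the unfused graph each of these helper calls corresponds to at least one GPU dispatch, which is where the "5–7" and "4" dispatch counts in the table come from.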

Motivation and Context

Roughly 25-30% higher decode throughput (tokens per second, TPS) on the WebGPU EP with the D3D backend on Windows. Measured on an RTX 5060 Ti running Qwen3-1.7B.

BEFORE (95 decode TPS): main branch

AFTER (120+ decode TPS): PR branch
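
As a quick sanity check on the quoted range, the two reported figures can be converted into a relative gain and a per-token latency difference. The sketch uses 120 TPS as a lower bound for the "120+" figure; the variable names are illustrative only.

```python
before_tps = 95.0   # main branch, as reported
after_tps = 120.0   # PR branch, lower bound of the reported "120+"

speedup = after_tps / before_tps
gain_pct = (speedup - 1.0) * 100.0
# Time per decoded token in milliseconds, before minus after.
per_token_ms_saved = 1000.0 / before_tps - 1000.0 / after_tps

print(f"{speedup:.2f}x, +{gain_pct:.1f}%, {per_token_ms_saved:.2f} ms saved per token")
```

At exactly 120 TPS the gain is about 26%, so the upper end of the 25-30% range corresponds to runs somewhat above 120 TPS.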

@hariharans29 changed the title from "[DO NOT REVIEW]: Title-TODO" to "[DO NOT REVIEW]: TODO" on Apr 30, 2026
@github-actions (bot) left a comment

You can commit the suggested changes from lintrunner.

Comment thread onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits.cc Outdated
Comment thread onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_common.h Outdated
Comment thread onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_mlp.cc Outdated
Comment thread onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_mlp.cc Outdated
Comment thread onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_mlp.cc Outdated
Comment thread onnxruntime/core/optimizer/matmul_nbits_qkv_fusion.cc Outdated
Comment thread onnxruntime/core/optimizer/matmul_nbits_qkv_fusion.cc Outdated
Comment thread onnxruntime/core/providers/webgpu/allocator.cc Outdated
Comment thread onnxruntime/core/providers/webgpu/allocator.cc Outdated
Comment thread onnxruntime/test/onnx/microbenchmark/webgpu_matmul_nbits_decode.cc Outdated
Comment thread onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_mlp.cc Fixed
Comment thread onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_mlp.h Fixed
Comment thread onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_qkv.cc Fixed
Comment thread onnxruntime/core/optimizer/matmul_nbits_mlp_fusion.cc Fixed
Comment thread onnxruntime/core/optimizer/matmul_nbits_qkv_fusion.cc Fixed
Comment thread onnxruntime/test/onnx/microbenchmark/webgpu_matmul_nbits_decode.cc Fixed
Comment thread onnxruntime/test/optimizer/matmul_nbits_mlp_fusion_test.cc Fixed
Copilot AI left a comment

Pull request overview

This PR adds WebGPU-focused fused operators and optimizer passes for decoder-style MatMulNBits patterns (MLP gate/up and QKV projections), along with tests and a microbenchmark to evaluate decode performance/correctness.

Changes:

  • Introduces new contrib ops MatMulNBitsMlp and MatMulNBitsQkv (schemas + WebGPU kernels + WGSL templates).
  • Adds graph transformers MatMulNBitsMlpFusion / MatMulNBitsQkvFusion and corresponding optimizer tests.
  • Improves WebGPU runtime support (graph-capture buffer manager activation, queue-idle wait helper, better shader compilation diagnostics) and adds a decode microbenchmark.
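
To make the "pattern-match and emit" step concrete, here is a toy matcher for the gate/up SwiGLU subgraph over a simplified node list. It is a sketch under stated assumptions, not the transformer in this PR: the `Node` class, the `match_swiglu` function and the EP name string are all hypothetical, and a real fusion would also need to check that intermediate outputs have no other consumers, handle the optional normalization and biases, and rewrite the graph.

```python
from dataclasses import dataclass

WEBGPU_EP = "WebGpuExecutionProvider"  # assumed EP name string, for illustration


@dataclass
class Node:
    op_type: str
    inputs: list
    output: str
    ep: str = WEBGPU_EP


def match_swiglu(nodes):
    """Return (gate, up, final_mul) triples for Mul(Mul(gate, Sigmoid(gate)), up)."""
    producer = {n.output: n for n in nodes}

    def produced_by(name, op_type):
        # EP gating: a node assigned to any other EP never matches.
        n = producer.get(name)
        return n if n is not None and n.op_type == op_type and n.ep == WEBGPU_EP else None

    matches = []
    for final in nodes:
        if final.op_type != "Mul" or final.ep != WEBGPU_EP or len(final.inputs) != 2:
            continue
        # Mul is commutative, so try both input orders.
        for silu_name, up_name in (final.inputs, final.inputs[::-1]):
            silu_mul = produced_by(silu_name, "Mul")
            up = produced_by(up_name, "MatMulNBits")
            if silu_mul is None or up is None or len(silu_mul.inputs) != 2:
                continue
            for gate_name, sig_name in (silu_mul.inputs, silu_mul.inputs[::-1]):
                sig = produced_by(sig_name, "Sigmoid")
                gate = produced_by(gate_name, "MatMulNBits")
                if sig is None or gate is None:
                    continue
                # Sigmoid must read the gate output, and both projections
                # must share the same activation input.
                if sig.inputs == [gate_name] and gate.inputs[0] == up.inputs[0]:
                    matches.append((gate, up, final))
    return matches
```

Matching backwards from the final Mul keeps the search local: every candidate is rejected as soon as one producer has the wrong op type or the wrong EP assignment.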

Reviewed changes

Copilot reviewed 33 out of 33 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| onnxruntime/test/optimizer/matmul_nbits_qkv_fusion_test.cc | New unit tests validating QKV fusion and output contracts on WebGPU. |
| onnxruntime/test/optimizer/matmul_nbits_mlp_fusion_test.cc | New unit tests validating MLP fusion (simplified/skip + passthrough) on WebGPU. |
| onnxruntime/test/optimizer/graph_transform_utils_test.cc | Minor formatting-only tweak (blank line). |
| onnxruntime/test/onnx/microbenchmark/webgpu_matmul_nbits_decode.cc | New benchmark harness for fused/unfused decode paths on WebGPU. |
| onnxruntime/test/onnx/microbenchmark/main.cc | Adjusts benchmark env logging severity. |
| onnxruntime/core/session/ort_version_check.h | Makes version parsing consteval-friendly with a macro fallback. |
| onnxruntime/core/providers/webgpu/webgpu_execution_provider.h | Tracks when graph-capture buffer manager is active. |
| onnxruntime/core/providers/webgpu/webgpu_execution_provider.cc | Lazily creates/activates graph buffer manager for capture; allocator uses dynamic buffer manager getter. |
| onnxruntime/core/providers/webgpu/webgpu_context.h | Adds WaitForQueueIdle() declaration. |
| onnxruntime/core/providers/webgpu/webgpu_context.cc | Implements WaitForQueueIdle() using OnSubmittedWorkDone. |
| onnxruntime/core/providers/webgpu/program_manager.cc | Enhances pipeline build failures with shader compilation diagnostics. |
| onnxruntime/core/providers/webgpu/compute_context.h | Adds FlushAndWait() convenience for flushing + waiting on queue idle. |
| onnxruntime/core/providers/webgpu/allocator.h | Adds allocator ctor that accepts a buffer-manager getter function. |
| onnxruntime/core/providers/webgpu/allocator.cc | Implements getter-based allocator to support switching buffer managers. |
| onnxruntime/core/optimizer/matmul_nbits_qkv_fusion.h | New transformer declaration for QKV fusion. |
| onnxruntime/core/optimizer/matmul_nbits_qkv_fusion.cc | New transformer implementation for QKV fusion. |
| onnxruntime/core/optimizer/matmul_nbits_mlp_fusion.h | New transformer declaration for MLP fusion. |
| onnxruntime/core/optimizer/matmul_nbits_mlp_fusion.cc | New transformer implementation for MLP fusion. |
| onnxruntime/core/optimizer/graph_transformer_utils.cc | Registers the new fusion transformers. |
| onnxruntime/core/graph/contrib_ops/contrib_defs.cc | Adds contrib operator schemas/docs for MatMulNBitsMlp and MatMulNBitsQkv. |
| onnxruntime/contrib_ops/webgpu/webgpu_contrib_kernels.cc | Registers WebGPU kernels for the new fused ops. |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_qkv.wgsl.template | New WGSL template implementing fused QKV decode kernel. |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_qkv.h | New WebGPU kernel wrapper for MatMulNBitsQkv. |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_qkv.cc | New WebGPU kernel implementation for MatMulNBitsQkv. |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_mlp_wide_tile_m1.wgsl.template | New WGSL template for an MLP wide-tile variant. |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_mlp.wgsl.template | New WGSL template implementing fused MLP (optionally with norm/skip/passthrough). |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_mlp.h | New WebGPU kernel wrapper for MatMulNBitsMlp. |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_mlp.cc | New WebGPU kernel implementation for MatMulNBitsMlp. |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_common.h | Adds declarations for “would apply” dispatch-selection helpers and shared constants. |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_common.cc | Implements the new dispatch-selection helpers. |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits.cc | Refactors path selection to use the new “would apply” helpers. |
| onnxruntime/contrib_ops/webgpu/quantization/dp4a_matmul_mlp.wgsl.template | Adds WGSL template for DP4A MLP path. |
| cmake/onnxruntime_unittests.cmake | Wires the new WebGPU decode benchmark into the benchmark target sources. |
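
The “would apply” helpers in matmul_nbits_common follow a familiar refactoring shape: path-selection conditions become side-effect-free predicates, so more than one caller can ask the same question. The sketch below shows only that shape. Every name, field and threshold in it is hypothetical and chosen for illustration; none of them are the conditions used in the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    m: int  # activation rows (1 for single-token decode)
    n: int  # output features
    k: int  # reduction dimension


@dataclass(frozen=True)
class Device:
    subgroup_matrix: bool
    dp4a: bool


# Pure predicates: the result depends only on the arguments.
# Thresholds are made up for this example.
def would_apply_subgroup_matrix(s, d):
    return d.subgroup_matrix and s.m >= 16


def would_apply_dp4a(s, d):
    return d.dp4a and s.k % 128 == 0


def would_apply_wide_tile(s, d):
    return s.m >= 4


def select_unfused_path(s, d):
    """The existing dispatcher, expressed in terms of the shared predicates."""
    if would_apply_subgroup_matrix(s, d):
        return "subgroup_matrix"
    if would_apply_dp4a(s, d):
        return "dp4a"
    if would_apply_wide_tile(s, d):
        return "wide_tile"
    return "default"


def select_fused_mlp_path(s, d):
    """A second caller reusing the same predicates without dispatching anything."""
    if would_apply_dp4a(s, d):
        return "dp4a_mlp"
    return "generic_mlp"
```

Because the predicates hold no state, the unfused path can be rewritten on top of them without changing which kernel it picks, which matches the "no behavior change" claim in the PR description.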

Comment thread onnxruntime/core/optimizer/matmul_nbits_qkv_fusion.cc Outdated
Comment thread onnxruntime/core/optimizer/matmul_nbits_qkv_fusion.cc
Comment thread onnxruntime/core/optimizer/matmul_nbits_mlp_fusion.cc Outdated
Comment thread onnxruntime/core/optimizer/matmul_nbits_mlp_fusion.cc
…shader diagnostics

These changes are kept on hari/webgpu_perf_1_full locally. The lazy buffer-mgr fix is being submitted as a separate PR (branch hari/webgpu_graph_capture_buffer_fix) because it is an independent correctness fix for a pre-existing latent bug, exposed but not introduced by these fusions.
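
One way to picture the getter-based allocator mentioned in the earlier file summary is the contrast below: an allocator that captures a buffer manager once versus one that asks for the current manager on every allocation. This is a simplified model with hypothetical class names, not the WebGPU EP code, and it does not claim to reproduce the exact latent bug.

```python
class BufferManager:
    def __init__(self, name):
        self.name = name

    def create(self, size):
        # Tag each buffer with the manager that produced it.
        return (self.name, size)


class CachingAllocator:
    """Captures one manager at construction; later switches are invisible to it."""

    def __init__(self, manager):
        self._manager = manager

    def alloc(self, size):
        return self._manager.create(size)


class GetterAllocator:
    """Asks for the current manager on every allocation."""

    def __init__(self, get_manager):
        self._get_manager = get_manager

    def alloc(self, size):
        return self._get_manager().create(size)


class Provider:
    def __init__(self):
        self.default_mgr = BufferManager("default")
        self.graph_mgr = None  # created lazily, on first capture
        self.capture_active = False

    def current_manager(self):
        return self.graph_mgr if self.capture_active else self.default_mgr

    def begin_capture(self):
        if self.graph_mgr is None:
            self.graph_mgr = BufferManager("graph")
        self.capture_active = True

    def end_capture(self):
        self.capture_active = False
```

With a lazily created graph manager there is nothing to capture at construction time, so the getter form is the only one of the two that can follow the switch.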
@hariharans29 changed the title from "[DO NOT REVIEW]: TODO" to "[WebGPU]: QKV and MLP fusions for Qwen3" on May 2, 2026
This template file was added speculatively but is not referenced by any kernel, include, or build rule. Removing to keep the PR clean.
@hariharans29 requested a review from Copilot May 2, 2026 04:05
Copilot AI left a comment

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 5 comments.


Comment thread onnxruntime/core/optimizer/matmul_nbits_mlp_fusion.h Outdated
Comment thread onnxruntime/test/optimizer/matmul_nbits_qkv_fusion_test.cc
Comment thread onnxruntime/core/optimizer/matmul_nbits_qkv_fusion.cc
Comment thread onnxruntime/core/optimizer/matmul_nbits_qkv_fusion.cc Outdated
Comment thread onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_common.cc Outdated
@hariharans29 changed the title from "[WebGPU]: QKV and MLP fusions for Qwen3" to "[WebGPU] QKV and MLP fusions for Qwen3" on May 2, 2026
The shared-EP path through TransformerTester triggers a SEH 0xC0000005 in CI
when the EP outlives a per-session profiler whose pointer is still cached on
the EP. A separate fix to the WebGPU EP's session_profiler_ lifetime is in
flight; meanwhile, switch the 8 MatMulNBits MLP and QKV WebGPU fusion-vs-
unfused tests to a small RunWebGpuFusionTransformerTest helper that creates a
fresh execution provider per session via a factory lambda. Production code is
unchanged.
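
The lifetime problem and the test-side workaround can be modelled in a few lines. In this sketch a closed profiler stands in for a freed one, and all class and function names are hypothetical; it illustrates why a fresh execution provider per session avoids the stale pointer, not how the actual test helper is written.

```python
class Profiler:
    def __init__(self):
        self.alive = True

    def close(self):
        self.alive = False


class ExecutionProvider:
    def __init__(self):
        self.session_profiler = None  # cached pointer to a session-owned profiler


class Session:
    def __init__(self, ep):
        self.ep = ep
        self.profiler = Profiler()
        ep.session_profiler = self.profiler  # the EP caches the pointer

    def close(self):
        self.profiler.close()  # the EP still holds its cached reference


def stale_pointer_seen(make_ep, sessions):
    """For each session, report whether its EP started out holding a dead profiler."""
    seen = []
    for _ in range(sessions):
        ep = make_ep()
        seen.append(ep.session_profiler is not None and not ep.session_profiler.alive)
        Session(ep).close()
    return seen


shared = ExecutionProvider()
shared_result = stale_pointer_seen(lambda: shared, 3)        # same EP reused
fresh_result = stale_pointer_seen(ExecutionProvider, 3)      # factory: new EP each time
```

With the shared EP, every session after the first starts with a pointer to a profiler that has already been torn down; with the factory, no EP ever outlives the profiler it cached.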