QMoE CUDA: Rename build options, refactor PrePack, add GPU kernels by tianleiwu · Pull Request #28583 · microsoft/onnxruntime

tianleiwu · 2026-05-20T07:42:32Z

Description

Follow-up refinements to QMoE CUDA EP (#28467): rename build options for consistency, fix CodeQL warnings, refactor PrePack from nested lambdas into named helper methods, and replace CPU data-transformation loops with GPU kernels.

Motivation and Context

PR #28467 introduced the QMoE operator with a 373-line PrePack function containing 5 nested lambdas that performed weight/scale prepacking at model load time. Reviewer feedback requested:

- Rename onnxruntime_ENABLE_CUDA_* cmake options to onnxruntime_USE_* for naming consistency.
- Fix CodeQL empty-except warnings in test code.
- Extract lambdas into named private methods for readability and testability.
- Replace CPU loops (block-scale swizzle, FP4 col-to-row repack) with GPU kernels to avoid unnecessary CPU↔GPU round-trips during model loading.
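
The lambda-to-helper request can be pictured with a minimal sketch. The class `QMoEPrePackSketch` is hypothetical, and the parameters and bodies of the two helper methods are placeholders chosen for illustration; only the helper names come from this PR. Each step just records its name so the two variants can be compared.

```cpp
#include <string>
#include <vector>

// Illustrative stand-in, not the onnxruntime operator class.
class QMoEPrePackSketch {
 public:
  // Before: the steps are lambdas nested inside one long function and
  // share state through captures.
  std::vector<std::string> PrePackWithLambdas() const {
    std::vector<std::string> steps;
    auto transpose_and_pack = [&steps]() { steps.push_back("transpose_and_pack"); };
    auto copy_to_gpu = [&steps]() { steps.push_back("copy_to_gpu"); };
    transpose_and_pack();
    copy_to_gpu();
    return steps;
  }

  // After: the same steps as named private methods with explicit parameters.
  std::vector<std::string> PrePackWithHelpers() const {
    std::vector<std::string> steps;
    PrePackTransposeAndPack(steps);
    PrePackCopyToGpu(steps);
    return steps;
  }

 private:
  // Placeholder signature and body (assumption).
  void PrePackTransposeAndPack(std::vector<std::string>& steps) const {
    steps.push_back("transpose_and_pack");
  }
  // Placeholder signature and body (assumption).
  void PrePackCopyToGpu(std::vector<std::string>& steps) const {
    steps.push_back("copy_to_gpu");
  }
};
```

With explicit parameters instead of captures, each helper states its inputs in its signature, which is what makes the steps easier to read and test in isolation.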

Key Changes

| Commit | Scope | Description |
|---|---|---|
| `594642a` | Build system | Rename `ENABLE_CUDA_FP4_QMOE`→`USE_FP4_QMOE`, `ENABLE_CUDA_FP8_QMOE`→`USE_FP8_QMOE` in cmake, C++ defines, and 340+ generated .cu files |
| `594642a` | Test | Fix CodeQL empty-except warning in `test_qmoe_cuda.py` |
| `e8d364b` | QMoE operator | Extract 5 lambdas into private helper methods: `PrePackTransposeAndPack`, `PrePackCopyToGpu`, `PrePackSwizzleBlockScales`, `PrePackRepackFP4Weights`, `PrePackComputeBias` |
| `e8d364b` | CUDA kernels | Add `QMoERepackFP4ColToRowKernel`, which repacks column-major FP4 packed weights to row-major on GPU (replaces per-expert CPU loop) |
| `e8d364b` | CUDA kernels | `PrePackSwizzleBlockScales` now calls existing `LaunchQMoEBlockScaleInterleave` GPU kernel (replaces CPU `SwizzleMXFPXBlockScalesToGpu` loop) |
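
As a rough illustration of what a column-major to row-major repack of packed 4-bit values involves, here is a CPU reference sketch. It is not the code from this PR: the function names are hypothetical, and the nibble layout (two values per byte, even logical index in the low nibble) is an assumption made for the example. The actual layout used by `QMoERepackFP4ColToRowKernel` may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Read the 4-bit value at logical index i.
// Assumed layout: even index in the low nibble, odd index in the high nibble.
inline uint8_t GetNibble(const std::vector<uint8_t>& buf, size_t i) {
  const uint8_t byte = buf[i / 2];
  return (i % 2 == 0) ? static_cast<uint8_t>(byte & 0x0F)
                      : static_cast<uint8_t>(byte >> 4);
}

// Write the 4-bit value v at logical index i, leaving the other nibble intact.
inline void SetNibble(std::vector<uint8_t>& buf, size_t i, uint8_t v) {
  uint8_t& byte = buf[i / 2];
  if (i % 2 == 0) {
    byte = static_cast<uint8_t>((byte & 0xF0) | (v & 0x0F));
  } else {
    byte = static_cast<uint8_t>((byte & 0x0F) | ((v & 0x0F) << 4));
  }
}

// Repack a rows x cols matrix of 4-bit values from column-major to row-major.
// rows * cols is assumed to be even so the packed buffer has no half byte.
std::vector<uint8_t> RepackFp4ColToRow(const std::vector<uint8_t>& src,
                                       size_t rows, size_t cols) {
  std::vector<uint8_t> dst(rows * cols / 2, 0);
  for (size_t r = 0; r < rows; ++r) {
    for (size_t c = 0; c < cols; ++c) {
      const uint8_t v = GetNibble(src, c * rows + r);  // column-major read
      SetNibble(dst, r * cols + c, v);                 // row-major write
    }
  }
  return dst;
}
```

Every output element depends on exactly one input element, so the same index mapping can be evaluated independently per element on the GPU. One plausible design is one thread per output byte, so that no two threads write to the same byte.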

Impact

- No behavioral change: all transformations produce identical output tensors.
- Model load only: PrePack runs once during InferenceSession::Initialize(), not on the inference hot path.
- Performance: Eliminates CPU→GPU→CPU→GPU round-trips for block-scale swizzling and FP4 weight repacking. Data stays on GPU throughout.

Testing

- Build verified with onnxruntime_USE_FP4_QMOE=ON onnxruntime_USE_FP8_QMOE=ON (CUDA 13.0, SM90).
- All new symbols confirmed linked in libonnxruntime_providers_cuda.so via nm.
- Existing test_qmoe_fp4_cuda.py, test_qmoe_wfp4afp8_cuda.py, test_qmoe_cuda.py cover the affected code paths.

Rename cmake options to match the onnxruntime_USE_* naming convention:
- onnxruntime_ENABLE_CUDA_FP4_QMOE -> onnxruntime_USE_FP4_QMOE
- onnxruntime_ENABLE_CUDA_FP8_QMOE -> onnxruntime_USE_FP8_QMOE

Also rename the corresponding C preprocessor defines:
- ENABLE_CUDA_FP4_QMOE -> USE_FP4_QMOE
- ENABLE_CUDA_FP8_QMOE -> USE_FP8_QMOE

Fix CodeQL empty-except warning in test_qmoe_cuda.py by adding an explanatory comment and removing unused exception variable.
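
For code outside the tree that keyed on the old defines, the rename means the guards have to change as well. The sketch below shows one way such code might look after the rename. The constants `kHasFp4QMoE` and `kHasFp8QMoE` and the `#error` migration guard are hypothetical and are not part of this PR; only the macro names are taken from the commit message.

```cpp
// Hypothetical migration guard: fail early if a build still passes an old name.
#if defined(ENABLE_CUDA_FP4_QMOE) || defined(ENABLE_CUDA_FP8_QMOE)
#error "ENABLE_CUDA_*_QMOE was renamed; define USE_FP4_QMOE / USE_FP8_QMOE instead."
#endif

// Compile-time flags derived from the renamed defines
// (for example, compiling with -DUSE_FP4_QMOE enables the first one).
#if defined(USE_FP4_QMOE)
constexpr bool kHasFp4QMoE = true;
#else
constexpr bool kHasFp4QMoE = false;
#endif

#if defined(USE_FP8_QMOE)
constexpr bool kHasFp8QMoE = true;
#else
constexpr bool kHasFp8QMoE = false;
#endif
```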

Copilot

Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.

tianleiwu added 2 commits May 19, 2026 23:13

PrePack refactoring (Lambda to helper; use GPU kernel)

e8d364b

tianleiwu requested a review from Copilot May 20, 2026 07:46

Copilot AI reviewed May 20, 2026

fix(cuda): validate QMoE prepack GPU paths

37c955f

tianleiwu wants to merge 3 commits into main from tlwu/20260520/qmoe_cuda_refine

2 participants
