
QMoE CUDA: Rename build options, refactor PrePack, add GPU kernels #28583

Open
tianleiwu wants to merge 3 commits into main from tlwu/20260520/qmoe_cuda_refine


@tianleiwu
Contributor

Description

Follow-up refinements to QMoE CUDA EP (#28467): rename build options for consistency, fix CodeQL warnings, refactor PrePack from nested lambdas into named helper methods, and replace CPU data-transformation loops with GPU kernels.

Motivation and Context

PR #28467 introduced the QMoE operator with a 373-line PrePack function containing 5 nested lambdas that performed weight/scale prepacking at model load time. Reviewer feedback requested:

  1. Rename onnxruntime_ENABLE_CUDA_* cmake options to onnxruntime_USE_* for naming consistency.
  2. Fix CodeQL empty-except warnings in test code.
  3. Extract lambdas into named private methods for readability and testability.
  4. Replace CPU loops (block-scale swizzle, FP4 col-to-row repack) with GPU kernels to avoid unnecessary CPU↔GPU round-trips during model loading.

Key Changes

| Commit | Scope | Description |
|---|---|---|
| 594642a | Build system | Rename ENABLE_CUDA_FP4_QMOE → USE_FP4_QMOE, ENABLE_CUDA_FP8_QMOE → USE_FP8_QMOE in cmake, C++ defines, and 340+ generated .cu files |
| 594642a | Test | Fix CodeQL empty-except warning in test_qmoe_cuda.py |
| e8d364b | QMoE operator | Extract 5 lambdas into private helper methods: PrePackTransposeAndPack, PrePackCopyToGpu, PrePackSwizzleBlockScales, PrePackRepackFP4Weights, PrePackComputeBias |
| e8d364b | CUDA kernels | Add QMoERepackFP4ColToRowKernel, which repacks column-major FP4 packed weights to row-major on the GPU (replaces the per-expert CPU loop) |
| e8d364b | CUDA kernels | PrePackSwizzleBlockScales now calls the existing LaunchQMoEBlockScaleInterleave GPU kernel (replaces the CPU SwizzleMXFPXBlockScalesToGpu loop) |
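To make the column-major to row-major FP4 repack concrete, here is a minimal CPU reference sketch in Python. It is not the code of QMoERepackFP4ColToRowKernel. The function names, the "two 4-bit values per byte, low nibble first" packing, and the even-dimension requirement are all assumptions made for illustration, since the PR description does not spell out the exact layout.

```python
def unpack_nibbles(packed):
    """Split each byte into two 4-bit values (assumed order: low nibble first)."""
    out = []
    for b in packed:
        out.append(b & 0x0F)
        out.append(b >> 4)
    return out


def pack_nibbles(values):
    """Pack pairs of 4-bit values into bytes (assumed order: low nibble first)."""
    if len(values) % 2 != 0:
        raise ValueError("need an even number of 4-bit values")
    return bytes(values[i] | (values[i + 1] << 4) for i in range(0, len(values), 2))


def repack_fp4_col_to_row(packed_col_major, rows, cols):
    """Hypothetical reference: repack a rows x cols matrix of 4-bit values
    from column-major packed order to row-major packed order."""
    if rows % 2 != 0 or cols % 2 != 0:
        raise ValueError("this sketch assumes even rows and cols")
    if len(packed_col_major) * 2 != rows * cols:
        raise ValueError("packed buffer size does not match rows * cols")
    # In column-major order, element (r, c) sits at flat index c * rows + r.
    flat = unpack_nibbles(packed_col_major)
    # Walk the matrix row by row to produce the row-major element order.
    row_major = [flat[c * rows + r] for r in range(rows) for c in range(cols)]
    return pack_nibbles(row_major)


# Usage: a 2 x 4 matrix whose element (r, c) holds the value r * 4 + c.
col_major = bytes([0x40, 0x51, 0x62, 0x73])
row_major = repack_fp4_col_to_row(col_major, rows=2, cols=4)
```

A reference like this could serve as a test oracle for a GPU kernel, under the assumption that the kernel uses the same nibble order. Each output byte depends only on two input nibbles, which is why the transformation parallelizes naturally on a GPU.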

Impact

  • No behavioral change — all transformations produce identical output tensors.
  • Model load only: PrePack runs once during InferenceSession::Initialize(), not on the inference hot path.
  • Performance: Eliminates CPU→GPU→CPU→GPU round-trips for block-scale swizzling and FP4 weight repacking. Data stays on GPU throughout.

Testing

  • Build verified with onnxruntime_USE_FP4_QMOE=ON onnxruntime_USE_FP8_QMOE=ON (CUDA 13.0, SM90).
  • All new symbols confirmed linked in libonnxruntime_providers_cuda.so via nm.
  • Existing test_qmoe_fp4_cuda.py, test_qmoe_wfp4afp8_cuda.py, test_qmoe_cuda.py cover the affected code paths.

tianleiwu added 2 commits May 19, 2026 23:13
Rename cmake options to match the onnxruntime_USE_* naming convention:
- onnxruntime_ENABLE_CUDA_FP4_QMOE -> onnxruntime_USE_FP4_QMOE
- onnxruntime_ENABLE_CUDA_FP8_QMOE -> onnxruntime_USE_FP8_QMOE

Also rename the corresponding C preprocessor defines:
- ENABLE_CUDA_FP4_QMOE -> USE_FP4_QMOE
- ENABLE_CUDA_FP8_QMOE -> USE_FP8_QMOE

Fix CodeQL empty-except warning in test_qmoe_cuda.py by adding an
explanatory comment and removing unused exception variable.
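
The empty-except fix described above generally follows the pattern sketched below. This is not the actual code from test_qmoe_cuda.py: the function name and the optional-import scenario are hypothetical, chosen only to show an except block that intentionally does nothing, carries a comment saying why, and binds no unused exception variable.

```python
import importlib


def optional_module_available(module_name):
    """Return True if module_name can be imported, False otherwise."""
    available = False
    try:
        importlib.import_module(module_name)
        available = True
    except ImportError:
        # The dependency is optional: leave `available` as False so the
        # caller can skip the tests that need it. No `as exc` binding is
        # used because the exception object is never read.
        pass
    return available
```

Catching the narrowest applicable exception type, rather than a bare `except:`, also keeps unrelated failures visible.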
@tianleiwu tianleiwu requested a review from Copilot May 20, 2026 07:46
Contributor

Copilot AI left a comment


Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.
