Add CPU QMoE 2-bit support and LUT GEMM fast path by tianleiwu · Pull Request #28185 · microsoft/onnxruntime

tianleiwu · 2026-04-22T16:35:05Z

Description

This PR adds expert_weight_bits=2 support to the CPU QMoE operator and introduces a fast path for supported block-wise shapes using MLAS LUT GEMM. It also tightens CPU-side validation, expands test coverage for non-trivial 2-bit behavior, and adds implementation notes for the CPU QMoE kernel.

Summary of Changes

CPU QMoE Kernel

| File | Change |
| --- | --- |
| `onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc` | Adds CPU 2-bit dequant support, 2-bit LUT GEMM eligibility checks, LUT prepack/cache support, and LUT execution for FC1/FC2 on supported block-wise shapes. Refactors the compute flow so the 2-bit LUT path is isolated while routing and accumulation remain shared. |
| `onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.h` | Adds CPU-side state for LUT prepacked buffers and shared compute inputs. |
| `onnxruntime/contrib_ops/cpu/moe/moe_helper.h` | Tightens shape validation, including `hidden_size % pack_size == 0` and inferred `inter_size` divisibility checks. |
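
The tightened shape validation can be illustrated with a minimal Python sketch. It is not the code in `moe_helper.h`: the function names are hypothetical, `pack_size` is assumed to be `8 // expert_weight_bits` (values per byte), and the `inter_size` check is assumed to mirror the `hidden_size` one.

```python
def pack_size(expert_weight_bits):
    """Quantized values stored per byte (assumption: 8 // bits)."""
    if expert_weight_bits not in (2, 4, 8):
        raise ValueError(f"unsupported expert_weight_bits: {expert_weight_bits}")
    return 8 // expert_weight_bits


def validate_packed_dims(hidden_size, inter_size, expert_weight_bits):
    """Reject sizes that cannot be packed into whole bytes; return packed sizes."""
    ps = pack_size(expert_weight_bits)
    if hidden_size % ps != 0:
        raise ValueError(f"hidden_size {hidden_size} is not divisible by pack_size {ps}")
    # Assumed to mirror the hidden_size check; the real rule may differ.
    if inter_size % ps != 0:
        raise ValueError(f"inter_size {inter_size} is not divisible by pack_size {ps}")
    return hidden_size // ps, inter_size // ps


print(validate_packed_dims(64, 128, 2))  # (16, 32)
```

With 2-bit weights four values share a byte, so a `hidden_size` of 6 would be rejected in this sketch, while the same size passes for 4-bit weights.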

Schema and Documentation

| File | Change |
| --- | --- |
| `onnxruntime/core/graph/contrib_ops/contrib_defs.cc` | Updates QMoE schema/docs to allow CPU-side 2-bit weights. |
| `docs/contrib_ops/cpu/qmoe.md` | Adds CPU QMoE implementation notes covering routing, quantization layouts, prepack behavior, LUT fast paths, fallbacks, and current limitations. |

Tests

| File | Change |
| --- | --- |
| `onnxruntime/test/contrib_ops/moe_test.cc` | Adds CPU 2-bit smoke, validation, non-zero functional, and LUT-eligible block-wise identity tests. |
| `onnxruntime/test/python/transformers/test_qmoe_cpu.py` | Extends Python-side QMoE parity coverage for 2-bit row-wise and block-wise packing paths. |

Testing

Built the provider object:
- `ninja -C build/cu128/Release CMakeFiles/onnxruntime_providers.dir/home/tlwu/git/onnxruntime/onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc.o`

Built the provider test object:
- `ninja -C build/cu128/Release CMakeFiles/onnxruntime_provider_test.dir/home/tlwu/git/onnxruntime/onnxruntime/test/contrib_ops/moe_test.cc.o`

Added CPU-side test coverage for:
- 2-bit validation failures
- non-trivial non-zero 2-bit outputs
- LUT-eligible 2-bit block-wise identity behavior

Full end-to-end provider gtest execution was not run from this checkout because the available top-level test binary does not expose the MoETest suite here.

Motivation and Context

This work addresses CPU-provider support for QMoE 2-bit expert weights, matching the issue request for QMoE 2 bits on CPU. The PR also aligns the CPU implementation with how MLAS currently exposes optimized 2-bit execution: block-wise 2-bit shapes can use LUT GEMM, while unsupported shapes continue to use dequantize-plus-GEMM fallback paths.
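
As a rough illustration of the dequantize-plus-GEMM fallback idea, the following Python sketch unpacks 2-bit values, applies one scale per block, and runs a plain dense product. It is not the kernel's code: the helper names are hypothetical, the bit order (lowest bits first) and the symmetric zero point of 2 are assumptions, and the LUT GEMM path itself is not modeled.

```python
def unpack_2bit(packed):
    """Unpack bytes into 2-bit values, lowest bits first (assumed order)."""
    values = []
    for byte in packed:
        for shift in (0, 2, 4, 6):
            values.append((byte >> shift) & 0x3)
    return values


def dequantize_row(packed, scales, block_size, zero_point=2):
    """Dequantize one packed row with one scale per block of block_size values."""
    quantized = unpack_2bit(packed)
    return [(q - zero_point) * scales[i // block_size] for i, q in enumerate(quantized)]


def matvec(weight_rows, x):
    """Plain dense product standing in for the GEMM call."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight_rows]


row = dequantize_row(bytes([0b11100100]), scales=[0.5], block_size=4)
print(row)                                  # [-1.0, -0.5, 0.0, 0.5]
print(matvec([row], [1.0, 1.0, 1.0, 1.0]))  # [-1.0]
```

The fallback materializes floating-point weights before multiplying, which is the cost a fast path working directly on packed weights would aim to avoid.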

Checklist

- Tests added/updated
- Documentation updated
- No breaking changes
- CI passes


Copilot

Pull request overview

This PR extends the CPU QMoE implementation to handle 2-bit expert weights and adds a new MLAS LUT-GEMM fast path for supported block-wise layouts. It also updates schema/docs and broadens CPU-focused test coverage so the new low-bit execution path fits into the existing quantized MoE stack.

Changes:

- Adds CPU-side 2-bit QMoE execution support, including LUT-GEMM packing/cache plumbing and fallback dequantize+GEMM handling.
- Tightens CPU QMoE input validation and updates the public QMoE schema/documentation to describe 2-bit support.
- Expands C++ and Python tests to cover 2-bit row-wise/block-wise behavior, validation failures, and parity scenarios.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.


| File | Description |
| --- | --- |
| `onnxruntime/test/python/transformers/test_qmoe_cpu.py` | Generalizes Python-side test quantization helpers from 4/8-bit to 2/4/8-bit and adds 2-bit parity cases. |
| `onnxruntime/test/contrib_ops/moe_test.cc` | Updates packed-dimension handling and adds CPU-specific 2-bit functional/validation/LUT-path tests. |
| `onnxruntime/core/graph/contrib_ops/contrib_defs.cc` | Updates QMoE schema/docs to advertise 2-bit weight support and default zero-point behavior. |
| `onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.h` | Adds shared compute input struct and new members for LUT prepacked buffers. |
| `onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc` | Implements 2-bit CPU execution changes, LUT-GEMM helpers, prepack/cache handling, and compute-path refactoring. |
| `onnxruntime/contrib_ops/cpu/moe/moe_helper.h` | Adds stricter shape/packing validation for hidden and inferred intermediate sizes. |
| `docs/contrib_ops/cpu/qmoe.md` | Adds a new CPU QMoE implementation note covering execution flow, layouts, fast paths, and limitations. |


- Fix null dereference when LUT cache is prepacked but packed_fc1_/fc2_ is null (check packed_fc1_lut_cache_ in weights_data assignment)
- Fix per-expert indexing into prepacked LUT cache (index by expert_idx * packed_size_per_expert at call sites)
- Fix symmetric zero-encoding in test helpers (all-zero tensors use sym_zp_offset packed byte, not 0x00)
- Fix block-wise weight shapes in documentation (block_size only affects scale tensor shape, not weight tensor shape)
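
Two of the fixes above, the symmetric zero encoding and the per-expert cache indexing, can be sketched in Python. This is only an illustration: the function names are hypothetical, `sym_zp_offset` is assumed to be the midpoint `1 << (bits - 1)`, and the cache is modeled as one flat byte buffer with a fixed `packed_size_per_expert`.

```python
def symmetric_zero_byte(bits):
    """Packed byte whose quantized values all decode to zero under symmetric quantization."""
    sym_zp_offset = 1 << (bits - 1)  # assumed midpoint: 2 for 2-bit, 8 for 4-bit
    byte = 0
    for i in range(8 // bits):
        byte |= sym_zp_offset << (i * bits)
    return byte


def expert_slice(cache, expert_idx, packed_size_per_expert):
    """Return the packed bytes belonging to a single expert in a flat cache."""
    start = expert_idx * packed_size_per_expert
    return cache[start:start + packed_size_per_expert]


print(hex(symmetric_zero_byte(2)))                 # 0xaa
print(list(expert_slice(bytes(range(12)), 2, 4)))  # [8, 9, 10, 11]
```

Under these assumptions an all-zero 2-bit tensor packs to `0xAA` bytes, whereas `0x00` would decode to the most negative value in every position.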

github-actions

You can commit the suggested changes from lintrunner.

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.


Co-authored-by: Copilot <copilot@github.com>

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.


tianleiwu added 3 commits April 22, 2026 08:27

Add 2 bit QMoE

b7e9fd8

Add LuT GEMM for 2 bits

ae555a3

Add doc

16ca1f9

tianleiwu mentioned this pull request Apr 22, 2026

[Feature Request] QMoE: support 2-bits quantized expert Weights #28163

Open

tianleiwu marked this pull request as draft April 22, 2026 16:39

tianleiwu added 6 commits April 22, 2026 14:40

Add doc gen

620d057

upload docs

c459709

Merge remote-tracking branch 'origin/main' into tlwu/win_gpu_doc_gen_…

c059653

…workflow

Merge remote-tracking branch 'origin/main' into tlwu/qmoe_2bit_cpu

e1f2436

Merge remote-tracking branch 'origin/tlwu/win_gpu_doc_gen_workflow' i…

9eeb277

…nto tlwu/qmoe_2bit_cpu

Merge remote-tracking branch 'origin/main' into tlwu/qmoe_2bit_cpu

1825452

tianleiwu requested review from apsonawane and Copilot May 3, 2026 15:48

Copilot started reviewing on behalf of tianleiwu May 3, 2026 15:49

Copilot AI reviewed May 3, 2026


tianleiwu mentioned this pull request May 3, 2026

feat(qmoe): support 2-bit expert weights in CPU kernel #28336

Open

github-actions Bot reviewed May 3, 2026


Comment thread onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc Outdated

Comment thread onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc Outdated

lintrunner

e7f814f

tianleiwu requested a review from Copilot May 3, 2026 16:11

Copilot started reviewing on behalf of tianleiwu May 3, 2026 16:12

Copilot AI reviewed May 3, 2026


Comment thread onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc

Comment thread onnxruntime/test/contrib_ops/moe_test.cc

Comment thread onnxruntime/test/contrib_ops/moe_test.cc

Comment thread docs/contrib_ops/cpu/qmoe.md Outdated

address feedbacks

07d4d43

Co-authored-by: Copilot <copilot@github.com>

tianleiwu requested a review from Copilot May 3, 2026 16:45

Copilot started reviewing on behalf of tianleiwu May 3, 2026 16:46

Copilot AI reviewed May 3, 2026


update comment

ecc98f0

tianleiwu marked this pull request as ready for review May 3, 2026 16:53

tianleiwu wants to merge 13 commits into main from tlwu/qmoe_2bit_cpu
