Add CPU QMoE 2-bit support and LUT GEMM fast path #28185
…nto tlwu/qmoe_2bit_cpu
Pull request overview
This PR extends the CPU QMoE implementation to handle 2-bit expert weights and adds a new MLAS LUT-GEMM fast path for supported block-wise layouts. It also updates schema/docs and broadens CPU-focused test coverage so the new low-bit execution path fits into the existing quantized MoE stack.
Changes:
- Adds CPU-side 2-bit QMoE execution support, including LUT-GEMM packing/cache plumbing and fallback dequantize+GEMM handling.
- Tightens CPU QMoE input validation and updates the public QMoE schema/documentation to describe 2-bit support.
- Expands C++ and Python tests to cover 2-bit row-wise/block-wise behavior, validation failures, and parity scenarios.
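To make the 2-bit storage and the dequantize+GEMM fallback concrete, here is a minimal Python sketch. It is an illustration only, not the kernel's code: the low-bits-first packing order, the symmetric zero point of 2, and all function names are assumptions made for this example.

```python
from typing import List

BITS = 2
PACK_SIZE = 8 // BITS          # four 2-bit codes fit in one byte
ZERO_POINT = 1 << (BITS - 1)   # assumed symmetric default zero point (2)


def pack_2bit(codes: List[int]) -> bytes:
    """Pack codes in [0, 3] four per byte, lowest bits first (assumed order)."""
    if len(codes) % PACK_SIZE != 0:
        raise ValueError("number of codes must be a multiple of PACK_SIZE")
    out = bytearray()
    for i in range(0, len(codes), PACK_SIZE):
        byte = 0
        for j, code in enumerate(codes[i:i + PACK_SIZE]):
            if not 0 <= code < (1 << BITS):
                raise ValueError("code out of range for 2-bit storage")
            byte |= code << (j * BITS)
        out.append(byte)
    return bytes(out)


def unpack_2bit(packed: bytes) -> List[int]:
    """Inverse of pack_2bit: recover the 2-bit codes from each byte."""
    mask = (1 << BITS) - 1
    return [(b >> (j * BITS)) & mask for b in packed for j in range(PACK_SIZE)]


def dequantize_row(packed: bytes, scale: float,
                   zero_point: int = ZERO_POINT) -> List[float]:
    """Fallback step 1: expand packed codes to floats as scale * (code - zp)."""
    return [scale * (code - zero_point) for code in unpack_2bit(packed)]


def fallback_dot(packed: bytes, scale: float, activations: List[float]) -> float:
    """Fallback step 2: ordinary dot product on the dequantized weights."""
    weights = dequantize_row(packed, scale)
    if len(weights) != len(activations):
        raise ValueError("activation length must match unpacked weight length")
    return sum(w * a for w, a in zip(weights, activations))
```

A full GEMM repeats `fallback_dot` for every output row of an expert's weight matrix; row-wise and block-wise quantization differ only in how many weights share one `scale`.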
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| onnxruntime/test/python/transformers/test_qmoe_cpu.py | Generalizes Python-side test quantization helpers from 4/8-bit to 2/4/8-bit and adds 2-bit parity cases. |
| onnxruntime/test/contrib_ops/moe_test.cc | Updates packed-dimension handling and adds CPU-specific 2-bit functional/validation/LUT-path tests. |
| onnxruntime/core/graph/contrib_ops/contrib_defs.cc | Updates QMoE schema/docs to advertise 2-bit weight support and default zero-point behavior. |
| onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.h | Adds shared compute input struct and new members for LUT prepacked buffers. |
| onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc | Implements 2-bit CPU execution changes, LUT-GEMM helpers, prepack/cache handling, and compute-path refactoring. |
| onnxruntime/contrib_ops/cpu/moe/moe_helper.h | Adds stricter shape/packing validation for hidden and inferred intermediate sizes. |
| docs/contrib_ops/cpu/qmoe.md | Adds a new CPU QMoE implementation note covering execution flow, layouts, fast paths, and limitations. |
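The stricter shape/packing validation can be pictured with a small sketch. The exact rules in `moe_helper.h` are not reproduced on this page, so the helper below is hypothetical: its name, its parameters, and the block-size checks are assumptions. Only the requirement that packed dimensions divide evenly by the pack size comes from the PR text.

```python
def validate_qmoe_sizes(hidden_size: int, inter_size: int,
                        bits: int, block_size: int = 0) -> int:
    """Hypothetical validator; returns the pack size (codes per byte)."""
    if bits not in (2, 4, 8):
        raise ValueError("expert_weight_bits must be 2, 4, or 8")
    pack_size = 8 // bits
    # A packed dimension must hold a whole number of bytes.
    if hidden_size % pack_size != 0:
        raise ValueError("hidden_size must be divisible by pack_size")
    if inter_size % pack_size != 0:
        raise ValueError("inter_size must be divisible by pack_size")
    # Assumed extra rule for block-wise quantization: whole blocks only.
    if block_size > 0:
        if hidden_size % block_size != 0 or inter_size % block_size != 0:
            raise ValueError("sizes must be divisible by block_size")
    return pack_size
```

Rejecting such shapes up front keeps the packed-buffer arithmetic in the compute path free of partial-byte cases.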
- Fix null dereference when LUT cache is prepacked but packed_fc1_/fc2_ is null (check packed_fc1_lut_cache_ in weights_data assignment)
- Fix per-expert indexing into prepacked LUT cache (index by expert_idx * packed_size_per_expert at call sites)
- Fix symmetric zero-encoding in test helpers (all-zero tensors use sym_zp_offset packed byte, not 0x00)
- Fix block-wise weight shapes in documentation (block_size only affects scale tensor shape, not weight tensor shape)
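Two of these fixes are easy to illustrate. The sketch below assumes a symmetric zero-point offset of `2^(bits-1)` and low-bits-first packing. The function names are hypothetical and the cache is modeled as a flat byte buffer, which may differ from the real prepacked layout.

```python
def packed_zero_byte(bits: int) -> int:
    """Byte that encodes an all-zero weight group under symmetric quantization.

    Each field stores the zero-point offset (assumed 2**(bits - 1)),
    so the packed byte is generally not 0x00.
    """
    sym_zp_offset = 1 << (bits - 1)
    byte = 0
    for j in range(8 // bits):
        byte |= sym_zp_offset << (j * bits)
    return byte


def expert_slice(cache: bytes, expert_idx: int,
                 packed_size_per_expert: int) -> bytes:
    """Select one expert's region from a flat prepacked buffer."""
    start = expert_idx * packed_size_per_expert
    end = start + packed_size_per_expert
    if expert_idx < 0 or end > len(cache):
        raise IndexError("expert index out of range for cache")
    return cache[start:end]
```

Under these assumptions a 2-bit all-zero group packs to `0xAA`. Writing `0x00` instead would decode every weight as `-2 * scale`, which is the kind of mismatch the test-helper fix targets.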
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Co-authored-by: Copilot <copilot@github.com>
Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.
Description
This PR adds `expert_weight_bits=2` support to the CPU QMoE operator and introduces a fast path for supported block-wise shapes using MLAS LUT GEMM. It also tightens CPU-side validation, expands test coverage for non-trivial 2-bit behavior, and adds implementation notes for the CPU QMoE kernel.
Summary of Changes
CPU QMoE Kernel
- onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc
- onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.h
- onnxruntime/contrib_ops/cpu/moe/moe_helper.h
  - … `hidden_size % pack_size == 0` and inferred `inter_size` divisibility checks.
Schema and Documentation
- onnxruntime/core/graph/contrib_ops/contrib_defs.cc
- docs/contrib_ops/cpu/qmoe.md
Tests
- onnxruntime/test/contrib_ops/moe_test.cc
- onnxruntime/test/python/transformers/test_qmoe_cpu.py
Testing
- `ninja -C build/cu128/Release CMakeFiles/onnxruntime_providers.dir/home/tlwu/git/onnxruntime/onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc.o`
- `ninja -C build/cu128/Release CMakeFiles/onnxruntime_provider_test.dir/home/tlwu/git/onnxruntime/onnxruntime/test/contrib_ops/moe_test.cc.o`
- … `MoETest` suite here.
Motivation and Context
This work addresses CPU-provider support for QMoE 2-bit expert weights, matching the issue request for QMoE 2 bits on CPU. The PR also aligns the CPU implementation with how MLAS currently exposes optimized 2-bit execution: block-wise 2-bit shapes can use LUT GEMM, while unsupported shapes continue to use dequantize-plus-GEMM fallback paths.
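The difference between the two paths can be sketched with a generic lookup-table dot product. This is an illustration of the general LUT idea under assumed conventions (low-bits-first packing, zero point 2, one scale per row). It is not the MLAS kernel, and none of these function names exist in MLAS.

```python
from typing import List

BITS = 2
PACK_SIZE = 8 // BITS   # four 2-bit codes per packed byte
ZERO_POINT = 2          # assumed symmetric zero point


def build_luts(activations: List[float]) -> List[List[float]]:
    """For each group of 4 activations, tabulate the partial dot product
    for every possible packed weight byte (256 entries per group)."""
    if len(activations) % PACK_SIZE != 0:
        raise ValueError("activation length must be a multiple of PACK_SIZE")
    luts = []
    for g in range(0, len(activations), PACK_SIZE):
        group = activations[g:g + PACK_SIZE]
        table = []
        for byte in range(256):
            total = 0.0
            for j in range(PACK_SIZE):
                code = (byte >> (j * BITS)) & 0x3
                total += group[j] * (code - ZERO_POINT)
            table.append(total)
        luts.append(table)
    return luts


def lut_dot(packed_row: bytes, scale: float, luts: List[List[float]]) -> float:
    """LUT path: one table lookup per packed byte, no per-weight unpacking."""
    if len(packed_row) != len(luts):
        raise ValueError("one lookup table is needed per packed byte")
    return scale * sum(luts[g][byte] for g, byte in enumerate(packed_row))


def dequant_dot(packed_row: bytes, scale: float,
                activations: List[float]) -> float:
    """Fallback path: unpack every code, dequantize, then multiply."""
    acc = 0.0
    for g, byte in enumerate(packed_row):
        for j in range(PACK_SIZE):
            code = (byte >> (j * BITS)) & 0x3
            acc += scale * (code - ZERO_POINT) * activations[g * PACK_SIZE + j]
    return acc
```

The tables depend only on the activations, so they are built once per input vector and reused for every weight row. That reuse is where a LUT approach saves work compared with dequantizing each row, and both functions should agree on the result up to floating-point rounding.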