Add CPU QMoE 2-bit support and LUT GEMM fast path #28185

Open
tianleiwu wants to merge 13 commits into main from tlwu/qmoe_2bit_cpu

@tianleiwu tianleiwu commented Apr 22, 2026

Description

This PR adds expert_weight_bits=2 support to the CPU QMoE operator and introduces a fast path for supported block-wise shapes using MLAS LUT GEMM. It also tightens CPU-side validation, expands test coverage for non-trivial 2-bit behavior, and adds implementation notes for the CPU QMoE kernel.

Summary of Changes

CPU QMoE Kernel

| File | Change |
| --- | --- |
| onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc | Adds CPU 2-bit dequant support, 2-bit LUT GEMM eligibility checks, LUT prepack/cache support, and LUT execution for FC1/FC2 on supported block-wise shapes. Refactors the compute flow so the 2-bit LUT path is isolated while routing and accumulation remain shared. |
| onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.h | Adds CPU-side state for LUT prepacked buffers and shared compute inputs. |
| onnxruntime/contrib_ops/cpu/moe/moe_helper.h | Tightens shape validation, including hidden_size % pack_size == 0 and inferred inter_size divisibility checks. |
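
The tightened checks in moe_helper.h can be pictured roughly as follows. This is a minimal Python sketch, not the C++ code from the PR. It assumes that pack_size is the number of quantized values stored per byte (8 / expert_weight_bits, so 4 for 2-bit) and that inter_size is checked against the same pack_size; the function names are hypothetical.

```python
def pack_size_for_bits(bits: int) -> int:
    """Quantized values per byte (assumed meaning of pack_size)."""
    if bits not in (2, 4, 8):
        raise ValueError(f"unsupported expert_weight_bits: {bits}")
    return 8 // bits


def check_packed_shapes(hidden_size: int, inter_size: int, bits: int) -> None:
    """Reject sizes whose packed rows would not fill whole bytes.

    Hypothetical helper: the real checks live in moe_helper.h and may
    test inter_size against a different divisor.
    """
    pack_size = pack_size_for_bits(bits)
    if hidden_size % pack_size != 0:
        raise ValueError(
            f"hidden_size={hidden_size} is not a multiple of pack_size={pack_size}"
        )
    if inter_size % pack_size != 0:
        raise ValueError(
            f"inter_size={inter_size} is not a multiple of pack_size={pack_size}"
        )
```

With 2-bit weights, a hidden size of 64 passes, while 66 is rejected because it leaves a partially filled byte.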

Schema and Documentation

| File | Change |
| --- | --- |
| onnxruntime/core/graph/contrib_ops/contrib_defs.cc | Updates QMoE schema/docs to allow CPU-side 2-bit weights. |
| docs/contrib_ops/cpu/qmoe.md | Adds CPU QMoE implementation notes covering routing, quantization layouts, prepack behavior, LUT fast paths, fallbacks, and current limitations. |

Tests

| File | Change |
| --- | --- |
| onnxruntime/test/contrib_ops/moe_test.cc | Adds CPU 2-bit smoke, validation, non-zero functional, and LUT-eligible block-wise identity tests. |
| onnxruntime/test/python/transformers/test_qmoe_cpu.py | Extends Python-side QMoE parity coverage for 2-bit row-wise and block-wise packing paths. |

Testing

  • Built the provider object:
    • ninja -C build/cu128/Release CMakeFiles/onnxruntime_providers.dir/home/tlwu/git/onnxruntime/onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc.o
  • Built the provider test object:
    • ninja -C build/cu128/Release CMakeFiles/onnxruntime_provider_test.dir/home/tlwu/git/onnxruntime/onnxruntime/test/contrib_ops/moe_test.cc.o
  • Added CPU-side test coverage for:
    • 2-bit validation failures
    • non-trivial non-zero 2-bit outputs
    • LUT-eligible 2-bit block-wise identity behavior
  • Full end-to-end provider gtest execution was not run from this checkout because the available top-level test binary does not expose the MoETest suite here.

Motivation and Context

This work adds CPU-provider support for QMoE 2-bit expert weights, as requested in the issue asking for 2-bit QMoE on CPU. The PR also aligns the CPU implementation with how MLAS currently exposes optimized 2-bit execution: block-wise 2-bit shapes can use LUT GEMM, while unsupported shapes continue to use dequantize-plus-GEMM fallback paths.
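
To make the fallback concrete, here is a minimal pure-Python sketch of dequantize-plus-GEMM for block-wise 2-bit weights. It illustrates the idea under stated assumptions and is not the MLAS or kernel code: values are assumed to be packed four per byte with the lowest bits first, the symmetric zero point is assumed to be 2, and all function names are hypothetical.

```python
def unpack_2bit(packed: bytes) -> list:
    """Expand each byte into four 2-bit values (low bits first, assumed order)."""
    values = []
    for byte in packed:
        for shift in (0, 2, 4, 6):
            values.append((byte >> shift) & 0x3)
    return values


def dequantize_row(packed: bytes, scales: list, block_size: int,
                   zero_point: int = 2) -> list:
    """Block-wise dequantization: one scale per block_size values along the row."""
    quantized = unpack_2bit(packed)
    return [(q - zero_point) * scales[i // block_size]
            for i, q in enumerate(quantized)]


def fallback_matvec(x: list, packed_rows: list, scales_rows: list,
                    block_size: int) -> list:
    """Dequantize each weight row, then take its dot product with x."""
    out = []
    for packed, scales in zip(packed_rows, scales_rows):
        weights = dequantize_row(packed, scales, block_size)
        out.append(sum(a * w for a, w in zip(x, weights)))
    return out
```

A LUT GEMM path avoids materializing the dequantized weights, which is why the fallback above is reserved for shapes the fast path does not support.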

Checklist

  • Tests added/updated
  • Documentation updated
  • No breaking changes
  • CI passes

Copilot AI left a comment

Pull request overview

This PR extends the CPU QMoE implementation to handle 2-bit expert weights and adds a new MLAS LUT-GEMM fast path for supported block-wise layouts. It also updates schema/docs and broadens CPU-focused test coverage so the new low-bit execution path fits into the existing quantized MoE stack.

Changes:

  • Adds CPU-side 2-bit QMoE execution support, including LUT-GEMM packing/cache plumbing and fallback dequantize+GEMM handling.
  • Tightens CPU QMoE input validation and updates the public QMoE schema/documentation to describe 2-bit support.
  • Expands C++ and Python tests to cover 2-bit row-wise/block-wise behavior, validation failures, and parity scenarios.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| onnxruntime/test/python/transformers/test_qmoe_cpu.py | Generalizes Python-side test quantization helpers from 4/8-bit to 2/4/8-bit and adds 2-bit parity cases. |
| onnxruntime/test/contrib_ops/moe_test.cc | Updates packed-dimension handling and adds CPU-specific 2-bit functional/validation/LUT-path tests. |
| onnxruntime/core/graph/contrib_ops/contrib_defs.cc | Updates QMoE schema/docs to advertise 2-bit weight support and default zero-point behavior. |
| onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.h | Adds shared compute input struct and new members for LUT prepacked buffers. |
| onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc | Implements 2-bit CPU execution changes, LUT-GEMM helpers, prepack/cache handling, and compute-path refactoring. |
| onnxruntime/contrib_ops/cpu/moe/moe_helper.h | Adds stricter shape/packing validation for hidden and inferred intermediate sizes. |
| docs/contrib_ops/cpu/qmoe.md | Adds a new CPU QMoE implementation note covering execution flow, layouts, fast paths, and limitations. |

Comment thread docs/contrib_ops/cpu/qmoe.md Outdated
Comment thread onnxruntime/test/python/transformers/test_qmoe_cpu.py Outdated
Comment thread onnxruntime/test/python/transformers/test_qmoe_cpu.py Outdated
Comment thread onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc Outdated
Comment thread onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc
- Fix null dereference when LUT cache is prepacked but packed_fc1_/fc2_
  is null (check packed_fc1_lut_cache_ in weights_data assignment)
- Fix per-expert indexing into prepacked LUT cache (index by
  expert_idx * packed_size_per_expert at call sites)
- Fix symmetric zero-encoding in test helpers (all-zero tensors use
  sym_zp_offset packed byte, not 0x00)
- Fix block-wise weight shapes in documentation (block_size only
  affects scale tensor shape, not weight tensor shape)
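
Two of the fixes above, the symmetric zero encoding and the per-expert cache indexing, are easy to get wrong, so a small Python sketch may help. It is illustrative only: the zero-point offset of 2 for 2-bit symmetric quantization and the low-bits-first packing order are assumptions, and the helper names are hypothetical rather than taken from the test helpers or the kernel.

```python
SYM_ZP_OFFSET = 2  # assumed midpoint used as the symmetric zero point for 2-bit


def pack_2bit(values: list) -> bytes:
    """Pack 2-bit values four per byte, lowest bits first (assumed order)."""
    if len(values) % 4 != 0:
        raise ValueError("value count must be a multiple of 4")
    packed = bytearray()
    for i in range(0, len(values), 4):
        byte = 0
        for j, v in enumerate(values[i:i + 4]):
            byte |= (v & 0x3) << (2 * j)
        packed.append(byte)
    return bytes(packed)


def zero_weight_bytes(count: int) -> bytes:
    """Encode `count` real-valued zeros: each quantizes to SYM_ZP_OFFSET, not 0."""
    return pack_2bit([SYM_ZP_OFFSET] * count)


def expert_slice(cache: bytes, expert_idx: int,
                 packed_size_per_expert: int) -> bytes:
    """Return one expert's region of a contiguous prepacked cache."""
    start = expert_idx * packed_size_per_expert
    return cache[start:start + packed_size_per_expert]
```

Under these assumptions an all-zero weight tensor packs to bytes of 0xAA; a 0x00 byte would instead decode to the most negative value in every slot.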

@github-actions github-actions Bot left a comment

You can commit the suggested changes from lintrunner.

Comment thread onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc Outdated
Comment thread onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc Outdated

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.


Comment thread onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc
Comment thread onnxruntime/test/contrib_ops/moe_test.cc
Comment thread onnxruntime/test/contrib_ops/moe_test.cc
Comment thread docs/contrib_ops/cpu/qmoe.md Outdated
Co-authored-by: Copilot <copilot@github.com>

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.


@tianleiwu tianleiwu marked this pull request as ready for review May 3, 2026 16:53