Extend DQ→MatMulNBits fusion to support 2/8-bit weights and Cast(fp16→fp32) patterns #27614
Conversation
This reverts commit 5f7df04.
Pull request overview
Extends the QDQ selector/action-based fusion that rewrites DequantizeLinear -> MatMul patterns into com.microsoft.MatMulNBits, adding support for additional quantized weight bit-widths (2/8) and a cast-aware FP16 pattern.
Changes:
- Generalize the DQ->MatMul fusion to derive `MatMulNBits.bits` from the DQ weight element type (2/4/8-bit).
- Add a new selector/action to fuse `DQ(fp16) -> Cast(fp16->fp32) -> MatMul` into `MatMulNBits` with inserted casts for type alignment.
- Extend the MLAS blockwise transpose/pack implementation to support 8-bit (and add template instantiations), plus add new optimizer tests for 8-bit and cast-aware fusion.
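The bit-width derivation in the first bullet can be sketched as follows. This is a minimal stand-alone sketch: the `WeightType` enum is a hypothetical stand-in, since the actual selector inspects ONNX `TensorProto` element types on the DQ weight initializer.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-ins for the ONNX element types involved; the real
// code reads the DQ weight initializer's TensorProto data type.
enum class WeightType { Int2, UInt2, Int4, UInt4, Int8, UInt8, Float32 };

// Derive the MatMulNBits `bits` attribute from the DQ weight element type.
std::optional<int64_t> WeightBits(WeightType t) {
  switch (t) {
    case WeightType::Int2:
    case WeightType::UInt2: return 2;
    case WeightType::Int4:
    case WeightType::UInt4: return 4;
    case WeightType::Int8:
    case WeightType::UInt8: return 8;
    default: return std::nullopt;  // not a supported quantized weight type
  }
}

// Signedness selects the signed vs. unsigned transpose/pack path.
bool WeightIsSigned(WeightType t) {
  return t == WeightType::Int2 || t == WeightType::Int4 || t == WeightType::Int8;
}
```

A selector built this way rejects any weight type for which `WeightBits` returns `std::nullopt`, so the fusion only fires for 2/4/8-bit weights.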
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| onnxruntime/test/optimizer/qdq_matmulnbits_transformer_test.cc | Adds 8-bit fusion tests and cast-aware (fp16 DQ + Cast + MatMul) fusion tests; updates type-mismatch expectations. |
| onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.h | Declares DQCastMatMulToMatMulNBitsSelector for the cast-aware fusion pattern. |
| onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.cc | Broadens supported weight types to 2/4/8-bit and implements the cast-aware selector. |
| onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selector_action_transformer.cc | Registers the new cast-aware selector/action rule alongside the existing DQ->MatMul rule. |
| onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_actions.h | Declares DQCastMatMulToMatMulNBitsAction. |
| onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_actions.cc | Derives bits from weight type, updates packing size math for 2/8-bit, and implements the new cast-aware fusion action. |
| onnxruntime/core/mlas/lib/q4_dq.cpp | Adds MLAS transpose/pack support and explicit instantiations for 8-bit (and fp16 2-bit) paths used by fusion. |
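The q4_dq.cpp row above extends element access and packing to 8-bit. The sub-byte convention can be illustrated generically; note this is an illustrative sketch, not the actual MLAS template code, and it assumes elements are packed little-endian within each byte (lower-index element in the low-order bits).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Read the idx-th `bits`-wide element from a packed byte buffer.
// For bits == 8 this degenerates to a plain byte load (the new 8-bit path).
uint8_t GetElem(const uint8_t* data, size_t idx, unsigned bits) {
  unsigned per_byte = 8 / bits;  // 4 elems/byte at 2-bit, 2 at 4-bit, 1 at 8-bit
  unsigned shift = (idx % per_byte) * bits;
  uint8_t mask = static_cast<uint8_t>((1u << bits) - 1u);
  return static_cast<uint8_t>((data[idx / per_byte] >> shift) & mask);
}

// Write the idx-th `bits`-wide element, preserving its neighbors in the byte.
void SetElem(uint8_t* data, size_t idx, unsigned bits, uint8_t v) {
  unsigned per_byte = 8 / bits;
  unsigned shift = (idx % per_byte) * bits;
  uint8_t mask = static_cast<uint8_t>((1u << bits) - 1u);
  uint8_t b = data[idx / per_byte];
  data[idx / per_byte] =
      static_cast<uint8_t>((b & ~(mask << shift)) | ((v & mask) << shift));
}

// Bytes per quantized block: block_size * bits / 8. This is the "packing
// size math" that changes when moving from 4-bit to 2-bit or 8-bit.
size_t BlockBlobBytes(size_t block_size, size_t bits) {
  return (block_size * bits + 7) / 8;
}
```

For a typical block_size of 32, a block occupies 8 bytes at 2-bit, 16 at 4-bit, and 32 at 8-bit, which is why the packed-buffer size computation had to be generalized along with the element accessors.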
Description
Extends the QDQ selector-action fusion that rewrites `DQ → MatMul` into `MatMulNBits` in two ways:

**1. Support 2-bit and 8-bit quantized weights**

The existing fusion only handled 4-bit (`Int4x2`/`UInt4x2`) DQ weights. This PR broadens it to also support 2-bit (`Int2x4`/`UInt2x4`) and 8-bit (`int8`/`uint8`) quantized weights.
- Added `Is2BitIntType`, `Is8BitIntType`, and `IsNBitsIntType` helpers, and updated `DQMatMulNodeGroupSelector::Check` to accept 2/4/8-bit weight types.
- Added `DQWeightBits` and `IsDQWeightSigned` helpers to dispatch the correct bit-width and signedness for the MLAS transpose and the MatMulNBits attributes.
- `q4_dq.cpp` (MLAS): added 8-bit `GetElem`/`SetElem` specializations and an 8-bit `TransposeColumnWiseQuantized` path, plus 6 new template instantiations for 2-bit (signed/unsigned, float/float16) and 8-bit (signed/unsigned, float/float16).

**2. Handle `Cast(fp16→fp32)` between DQ and MatMul (FP16 model fusion)**

FP16 models often have `DQ(int4→fp16) → Cast(fp16→fp32) → MatMul(fp32)` patterns that the existing selector couldn't match. This PR adds a new `DQCastMatMulToMatMulNBitsSelector`/`DQCastMatMulToMatMulNBitsAction` pair that:
- Matches the `DQ → Cast(fp16→fp32) → MatMul` pattern on input B.
- Fuses it into a `MatMulNBits` node operating in the DQ scale dtype (fp16).
- Inserts a `Cast` on input A (to the DQ dtype) and a `Cast` on the output (DQ dtype to MatMul output dtype), relying on ORT's existing `CastElimination` optimizer to remove redundant back-to-back casts in subsequent passes.

**Motivation and Context**

- The existing `DQ → MatMulNBits` fusion only supported 4-bit weights, leaving 2-bit and 8-bit quantized models unoptimized.
- FP16 models produce `DQ(→fp16) → Cast(fp16→fp32) → MatMul` patterns because the DQ output type matches the scale type (fp16), but the MatMul operates in fp32. Without handling the intermediate Cast, the fusion was blocked entirely for these models.
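The inserted-cast strategy in part 2 leans on a later pass cleaning up. A toy sketch of the back-to-back cast cancellation it relies on follows; this is illustrative only, since ORT's actual `CastElimination` operates on graph nodes and must check that dropping a pair is value-preserving (fp16→fp32 is exact, so an fp16→fp32→fp16 round-trip is the safe case this PR depends on).

```cpp
#include <vector>

enum class DType { FP16, FP32 };

struct CastOp {
  DType from;
  DType to;
};

// Collapse adjacent casts that undo each other, e.g. fp16->fp32 followed
// by fp32->fp16 applied to an fp16-origin value. This sketch assumes the
// lossless fp16-origin case; a real pass must verify precision direction.
std::vector<CastOp> CancelBackToBackCasts(const std::vector<CastOp>& chain) {
  std::vector<CastOp> out;
  for (const CastOp& c : chain) {
    if (!out.empty() && out.back().to == c.from && out.back().from == c.to) {
      out.pop_back();  // the pair cancels to an identity
    } else {
      out.push_back(c);
    }
  }
  return out;
}
```

In the fused graph, an upstream `Cast(fp16→fp32)` feeding the newly inserted `Cast(fp32→fp16)` on input A is exactly such a cancelling pair, which is why the action can insert casts freely and let the optimizer tidy up.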