
Extend DQ→MatMulNBits fusion to support 2/8-bit weights and Cast(fp16→fp32) patterns#27614

Merged

jambayk merged 7 commits into main from jambayk/mnb-rules on Mar 11, 2026

Conversation


@jambayk jambayk commented Mar 10, 2026

Description

Extends the QDQ selector/action fusion that rewrites DQ → MatMul into MatMulNBits in two ways:

1. Support 2-bit and 8-bit quantized weights

The existing fusion only handled 4-bit (Int4x2/UInt4x2) DQ weights. This PR broadens it to also support 2-bit (Int2x4/UInt2x4) and 8-bit (int8/uint8) quantized weights.

  • qdq_selectors.cc: Added Is2BitIntType, Is8BitIntType, and IsNBitsIntType helpers. Updated DQMatMulNodeGroupSelector::Check to accept 2/4/8-bit weight types.
  • qdq_actions.cc: Added DQWeightBits and IsDQWeightSigned helpers to dispatch the correct bit-width and signedness for MLAS transpose and MatMulNBits attributes.
  • q4_dq.cpp (MLAS): Added 8-bit GetElem/SetElem specializations and an 8-bit TransposeColumnWiseQuantized path. Added 6 new template instantiations for 2-bit (signed/unsigned, float/float16) and 8-bit (signed/unsigned, float/float16).
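The GetElem/SetElem idea can be illustrated with a minimal Python sketch. This is not the ORT/MLAS C++ code (those are templates in q4_dq.cpp); the function names and the low-bits-first packing order here are assumptions for illustration. It shows why 2-, 4-, and 8-bit elements can share one access scheme: a byte holds 8/bits elements.

```python
def get_elem(data: bytes, idx: int, bits: int) -> int:
    """Read the idx-th unsigned element from a buffer packed at `bits` per element."""
    per_byte = 8 // bits                       # 4, 2, or 1 elements per byte for 2/4/8 bits
    shift = (idx % per_byte) * bits            # low bits first (an assumed packing order)
    return (data[idx // per_byte] >> shift) & ((1 << bits) - 1)

def set_elem(data: bytearray, idx: int, bits: int, value: int) -> None:
    """Write `value` into the idx-th slot of a packed buffer, preserving neighbors."""
    per_byte = 8 // bits
    shift = (idx % per_byte) * bits
    mask = ((1 << bits) - 1) << shift
    data[idx // per_byte] = (data[idx // per_byte] & ~mask) | ((value << shift) & mask)

# Pack four 2-bit values into a single byte, then read them back.
buf = bytearray(1)
for i, v in enumerate([1, 2, 3, 0]):
    set_elem(buf, i, 2, v)
assert [get_elem(buf, i, 2) for i in range(4)] == [1, 2, 3, 0]
```

The 8-bit case degenerates to plain byte access (per_byte == 1), which is why the PR can add an 8-bit path alongside the existing sub-byte ones.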

2. Handle Cast(fp16→fp32) between DQ and MatMul (FP16 model fusion)

FP16 models often have DQ(int4→fp16) → Cast(fp16→fp32) → MatMul(fp32) patterns that the existing selector couldn't match. This PR adds a new DQCastMatMulToMatMulNBitsSelector / DQCastMatMulToMatMulNBitsAction pair that:

  • Matches the DQ → Cast(fp16→fp32) → MatMul pattern on input B.
  • Creates a MatMulNBits node operating in the DQ scale dtype (fp16).
  • Always inserts Cast on input A (to DQ dtype) and Cast on output (DQ dtype to MatMul output dtype), relying on ORT's existing CastElimination optimizer to remove redundant back-to-back casts in subsequent passes.
  • Removes the original DQ, Cast (on B), and MatMul nodes.
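The shape of the pattern match can be sketched in Python. This is a hypothetical toy graph representation, not the actual ORT selector API; the Node class, helper name, and tensor names are invented for illustration of the DQ → Cast(fp16→fp32) → MatMul match on input B.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str
    inputs: list = field(default_factory=list)   # input tensor names
    outputs: list = field(default_factory=list)  # output tensor names
    attrs: dict = field(default_factory=dict)

def match_dq_cast_matmul(nodes):
    """Return (dq, cast, matmul) triples where a Cast to fp32 sits between DQ and MatMul's input B."""
    by_output = {o: n for n in nodes for o in n.outputs}
    matches = []
    for mm in nodes:
        if mm.op != "MatMul" or len(mm.inputs) != 2:
            continue
        cast = by_output.get(mm.inputs[1])       # producer of input B
        if cast is None or cast.op != "Cast" or cast.attrs.get("to") != "float32":
            continue
        dq = by_output.get(cast.inputs[0])
        if dq is not None and dq.op == "DequantizeLinear":
            matches.append((dq, cast, mm))
    return matches

nodes = [
    Node("DequantizeLinear", ["w_q", "scale_fp16"], ["w_fp16"]),
    Node("Cast", ["w_fp16"], ["w_fp32"], {"to": "float32"}),
    Node("MatMul", ["a_fp32", "w_fp32"], ["y"]),
]
assert len(match_dq_cast_matmul(nodes)) == 1
```

On a match, the action described above would replace the triple with a fp16 MatMulNBits plus a Cast on input A and a Cast on the output, and leave the cleanup of redundant back-to-back casts to CastElimination.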

Motivation and Context

  • Many quantized models (e.g., from Olive, AutoAWQ) use 2-bit or 8-bit quantization, but the DQ → MatMulNBits fusion only supported 4-bit weights, leaving these models unoptimized.
  • FP16 models produce DQ(→fp16) → Cast(fp16→fp32) → MatMul patterns because the DQ output type matches the scale type (fp16), but the MatMul operates in fp32. Without handling the intermediate Cast, the fusion was blocked entirely for these models.

@jambayk jambayk requested a review from Copilot March 10, 2026 21:56
@jambayk jambayk changed the title from "Extend DQMatMulNBits QDQ transformer to 2,8 bits and fp16" to "Extend DQ→MatMulNBits fusion to support 2/8-bit weights and Cast(fp16→fp32) patterns" Mar 10, 2026
Copilot AI left a comment

Pull request overview

Extends the QDQ selector/action-based fusion that rewrites DequantizeLinear -> MatMul patterns into com.microsoft.MatMulNBits, adding support for additional quantized weight bit-widths (2/8) and a cast-aware FP16 pattern.

Changes:

  • Generalize the DQ->MatMul fusion to derive MatMulNBits.bits from the DQ weight element type (2/4/8-bit).
  • Add a new selector/action to fuse DQ(fp16) -> Cast(fp16->fp32) -> MatMul into MatMulNBits with inserted casts for type alignment.
  • Extend MLAS blockwise transpose/pack implementation to support 8-bit (and add template instantiations), plus add new optimizer tests for 8-bit and cast-aware fusion.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

File summary:

  • onnxruntime/test/optimizer/qdq_matmulnbits_transformer_test.cc: Adds 8-bit fusion tests and cast-aware (fp16 DQ + Cast + MatMul) fusion tests; updates type-mismatch expectations.
  • onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.h: Declares DQCastMatMulToMatMulNBitsSelector for the cast-aware fusion pattern.
  • onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.cc: Broadens supported weight types to 2/4/8-bit and implements the cast-aware selector.
  • onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selector_action_transformer.cc: Registers the new cast-aware selector/action rule alongside the existing DQ->MatMul rule.
  • onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_actions.h: Declares DQCastMatMulToMatMulNBitsAction.
  • onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_actions.cc: Derives bits from weight type, updates packing size math for 2/8-bit, and implements the new cast-aware fusion action.
  • onnxruntime/core/mlas/lib/q4_dq.cpp: Adds MLAS transpose/pack support and explicit instantiations for 8-bit (and fp16 2-bit) paths used by fusion.


@jambayk jambayk marked this pull request as ready for review March 10, 2026 23:47
@tianleiwu tianleiwu requested a review from Copilot March 11, 2026 19:45

@hariharans29 hariharans29 left a comment

LGTM

@jambayk jambayk merged commit e91a5c3 into main Mar 11, 2026
99 checks passed
@jambayk jambayk deleted the jambayk/mnb-rules branch March 11, 2026 22:07
