Extend DQ→MatMulNBits fusion to support 2/8-bit weights and Cast(fp16→fp32) patterns #27614
Conversation
This reverts commit 5f7df04.
Pull request overview
Extends the QDQ selector/action-based fusion that rewrites DequantizeLinear -> MatMul patterns into com.microsoft.MatMulNBits, adding support for additional quantized weight bit-widths (2/8) and a cast-aware FP16 pattern.
Changes:
- Generalize the DQ->MatMul fusion to derive `MatMulNBits.bits` from the DQ weight element type (2/4/8-bit).
- Add a new selector/action to fuse `DQ(fp16) -> Cast(fp16->fp32) -> MatMul` into `MatMulNBits` with inserted casts for type alignment.
- Extend the MLAS blockwise transpose/pack implementation to support 8-bit (and add template instantiations), plus add new optimizer tests for 8-bit and cast-aware fusion.
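The bit-width derivation in the first bullet can be sketched as follows. This is a minimal stand-alone sketch: the `WeightType` enum is a hypothetical stand-in, since the actual selector inspects ONNX `TensorProto` element types on the DQ weight initializer.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-ins for the ONNX element types involved; the real
// code reads the DQ weight initializer's TensorProto data type.
enum class WeightType { Int2, UInt2, Int4, UInt4, Int8, UInt8, Float32 };

// Derive the MatMulNBits `bits` attribute from the DQ weight element type.
std::optional<int64_t> WeightBits(WeightType t) {
  switch (t) {
    case WeightType::Int2:
    case WeightType::UInt2: return 2;
    case WeightType::Int4:
    case WeightType::UInt4: return 4;
    case WeightType::Int8:
    case WeightType::UInt8: return 8;
    default: return std::nullopt;  // not a supported quantized weight type
  }
}

// Signedness selects the signed vs. unsigned transpose/pack path.
bool WeightIsSigned(WeightType t) {
  return t == WeightType::Int2 || t == WeightType::Int4 || t == WeightType::Int8;
}
```

A selector built this way rejects any weight type for which `WeightBits` returns `std::nullopt`, so the fusion only fires for 2/4/8-bit weights.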
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| onnxruntime/test/optimizer/qdq_matmulnbits_transformer_test.cc | Adds 8-bit fusion tests and cast-aware (fp16 DQ + Cast + MatMul) fusion tests; updates type-mismatch expectations. |
| onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.h | Declares DQCastMatMulToMatMulNBitsSelector for the cast-aware fusion pattern. |
| onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.cc | Broadens supported weight types to 2/4/8-bit and implements the cast-aware selector. |
| onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selector_action_transformer.cc | Registers the new cast-aware selector/action rule alongside the existing DQ->MatMul rule. |
| onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_actions.h | Declares DQCastMatMulToMatMulNBitsAction. |
| onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_actions.cc | Derives bits from weight type, updates packing size math for 2/8-bit, and implements the new cast-aware fusion action. |
| onnxruntime/core/mlas/lib/q4_dq.cpp | Adds MLAS transpose/pack support and explicit instantiations for 8-bit (and fp16 2-bit) paths used by fusion. |
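The q4_dq.cpp row above extends element access and packing to 8-bit. The sub-byte convention can be illustrated generically; note this is an illustrative sketch, not the actual MLAS template code, and it assumes elements are packed little-endian within each byte (lower-index element in the low-order bits).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Read the idx-th `bits`-wide element from a packed byte buffer.
// For bits == 8 this degenerates to a plain byte load (the new 8-bit path).
uint8_t GetElem(const uint8_t* data, size_t idx, unsigned bits) {
  unsigned per_byte = 8 / bits;  // 4 elems/byte at 2-bit, 2 at 4-bit, 1 at 8-bit
  unsigned shift = (idx % per_byte) * bits;
  uint8_t mask = static_cast<uint8_t>((1u << bits) - 1u);
  return static_cast<uint8_t>((data[idx / per_byte] >> shift) & mask);
}

// Write the idx-th `bits`-wide element, preserving its neighbors in the byte.
void SetElem(uint8_t* data, size_t idx, unsigned bits, uint8_t v) {
  unsigned per_byte = 8 / bits;
  unsigned shift = (idx % per_byte) * bits;
  uint8_t mask = static_cast<uint8_t>((1u << bits) - 1u);
  uint8_t b = data[idx / per_byte];
  data[idx / per_byte] =
      static_cast<uint8_t>((b & ~(mask << shift)) | ((v & mask) << shift));
}

// Bytes per quantized block: block_size * bits / 8. This is the "packing
// size math" that changes when moving from 4-bit to 2-bit or 8-bit.
size_t BlockBlobBytes(size_t block_size, size_t bits) {
  return (block_size * bits + 7) / 8;
}
```

For a typical block_size of 32, a block occupies 8 bytes at 2-bit, 16 at 4-bit, and 32 at 8-bit, which is why the packed-buffer size computation had to be generalized along with the element accessors.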
Description
Extends the QDQ selector-action fusion that rewrites `DQ → MatMul` into `MatMulNBits` in two ways:

**1. Support 2-bit and 8-bit quantized weights**

The existing fusion only handled 4-bit (`Int4x2`/`UInt4x2`) DQ weights. This PR broadens it to also support 2-bit (`Int2x4`/`UInt2x4`) and 8-bit (`int8`/`uint8`) quantized weights.
- Added `Is2BitIntType`, `Is8BitIntType`, and `IsNBitsIntType` helpers, and updated `DQMatMulNodeGroupSelector::Check` to accept 2/4/8-bit weight types.
- Added `DQWeightBits` and `IsDQWeightSigned` helpers to dispatch the correct bit-width and signedness for the MLAS transpose and the MatMulNBits attributes.
- `q4_dq.cpp` (MLAS): added 8-bit `GetElem`/`SetElem` specializations and an 8-bit `TransposeColumnWiseQuantized` path, plus 6 new template instantiations for 2-bit (signed/unsigned, float/float16) and 8-bit (signed/unsigned, float/float16).

**2. Handle `Cast(fp16→fp32)` between DQ and MatMul (FP16 model fusion)**

FP16 models often have `DQ(int4→fp16) → Cast(fp16→fp32) → MatMul(fp32)` patterns that the existing selector couldn't match. This PR adds a new `DQCastMatMulToMatMulNBitsSelector`/`DQCastMatMulToMatMulNBitsAction` pair that:
- Matches the `DQ → Cast(fp16→fp32) → MatMul` pattern on input B.
- Fuses it into a `MatMulNBits` node operating in the DQ scale dtype (fp16).
- Inserts a `Cast` on input A (to the DQ dtype) and a `Cast` on the output (DQ dtype to MatMul output dtype), relying on ORT's existing `CastElimination` optimizer to remove redundant back-to-back casts in subsequent passes.

**Motivation and Context**

- The existing `DQ → MatMulNBits` fusion only supported 4-bit weights, leaving 2-bit and 8-bit quantized models unoptimized.
- FP16 models produce `DQ(→fp16) → Cast(fp16→fp32) → MatMul` patterns because the DQ output type matches the scale type (fp16), but the MatMul operates in fp32. Without handling the intermediate Cast, the fusion was blocked entirely for these models.
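The inserted-cast strategy in part 2 leans on a later pass cleaning up. A toy sketch of the back-to-back cast cancellation it relies on follows; this is illustrative only, since ORT's actual `CastElimination` operates on graph nodes and must check that dropping a pair is value-preserving (fp16→fp32 is exact, so an fp16→fp32→fp16 round-trip is the safe case this PR depends on).

```cpp
#include <vector>

enum class DType { FP16, FP32 };

struct CastOp {
  DType from;
  DType to;
};

// Collapse adjacent casts that undo each other, e.g. fp16->fp32 followed
// by fp32->fp16 applied to an fp16-origin value. This sketch assumes the
// lossless fp16-origin case; a real pass must verify precision direction.
std::vector<CastOp> CancelBackToBackCasts(const std::vector<CastOp>& chain) {
  std::vector<CastOp> out;
  for (const CastOp& c : chain) {
    if (!out.empty() && out.back().to == c.from && out.back().from == c.to) {
      out.pop_back();  // the pair cancels to an identity
    } else {
      out.push_back(c);
    }
  }
  return out;
}
```

In the fused graph, an upstream `Cast(fp16→fp32)` feeding the newly inserted `Cast(fp32→fp16)` on input A is exactly such a cancelling pair, which is why the action can insert casts freely and let the optimizer tidy up.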