Fix DQ→MatMulNBits fusion for FP16 models on CPU EP #27640
Merged
Conversation
Contributor
Pull request overview
Fixes DQ→MatMulNBits fusion for FP16 models on CPU EP by allowing the selector to match MatMul nodes with unassigned execution providers, and removes the now-unnecessary DQCastMatMulToMatMulNBits selector/action that relied on fragile optimization loop ordering.
Changes:
- Adds `""` (empty EP) to the compatible providers for `DQMatMulToMatMulNBitsSelector` and `QDQSelectorActionTransformer` so unassigned FP16 MatMul nodes can be matched
- Removes `DQCastMatMulToMatMulNBitsSelector` and `DQCastMatMulToMatMulNBitsAction` (selector, action, and registration)
- Replaces the Cast-aware test with a direct FP16 DQ→MatMul test
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| qdq_selector_action_transformer.cc | Adds "" to transformer and selector provider lists; removes Cast-aware registration |
| qdq_selectors.h | Removes DQCastMatMulToMatMulNBitsSelector class |
| qdq_selectors.cc | Removes DQCastMatMulToMatMulNBitsSelector::Select implementation |
| qdq_actions.h | Removes DQCastMatMulToMatMulNBitsAction declaration |
| qdq_actions.cc | Removes DQCastMatMulToMatMulNBitsAction implementation; updates comment |
| qdq_matmulnbits_transformer_test.cc | Replaces Cast-aware test with direct FP16 DQ→MatMul test |
hariharans29 approved these changes · Mar 13, 2026
Description
For FP16 models with block-quantized weights (`DQ(int4/int2/int8, fp16_scale) → MatMul(fp16)`), the `DQMatMulToMatMulNBitsSelector` failed to match on CPU EP because FP16 MatMul nodes are not claimed by CPU EP during graph partitioning, leaving their execution provider unassigned (empty string `""`). The selector's EP compatibility check rejected these nodes.

This PR:
- Adds `""` (empty/unassigned EP) to the compatible providers list for `DQMatMulToMatMulNBitsSelector` so it can match FP16 MatMul nodes not yet assigned to an EP. The resulting `MatMulNBits` node is assigned to `kCpuExecutionProvider` by the action (which has both `float` and `MLFloat16` CPU kernels).
- Adds `""` to the `QDQSelectorActionTransformer` transformer-level compatible EPs so unassigned nodes reach individual selectors (other selectors are unaffected since their own provider lists don't include `""`).
- Removes `DQCastMatMulToMatMulNBitsSelector` and `DQCastMatMulToMatMulNBitsAction`, which handled a `DQ → Cast(fp16→fp32) → MatMul` pattern that only existed after `InsertCastTransformer` ran. That fusion only worked incidentally when `FuseInitializersTransformer` (Level 4) triggered an optimization loop repeat, giving Level 2 QDQ fusions a second pass, a behavior that didn't occur in all builds (e.g., minimal/extended-minimal builds without `FuseInitializersTransformer`).
- Replaces the `DQCastMatMulConvertedToMatMulNBits` test with `DQMatMulFP16ConvertedToMatMulNBits`, which tests the actual scenario: `DQ(int4, fp16_scale) → MatMul(fp16)` on CPU EP.

Motivation and Context
FP16 models with block-quantized weights were not getting `DQ → MatMulNBits` fusion when running on CPU EP in certain ORT builds. The fusion worked on x64 full builds by luck: `InsertCastTransformer` created `DQ→Cast→MatMul` patterns, then `FuseInitializersTransformer` (Level 4) modified FP16 initializers, causing the optimization loop to repeat and giving Level 2 QDQ fusions a second pass in which the Cast-aware selector matched. In builds without `FuseInitializersTransformer` (e.g., minimal builds, arm packages), the loop didn't repeat and the fusion never applied.

The root cause is that CPU EP has no FP16 MatMul kernel, so it doesn't claim FP16 MatMul nodes during partitioning. These nodes have an empty EP string, which the `QDQSelectorActionTransformer` and `BaseSelector` both rejected. The fix allows the `DQMatMulToMatMulNBits` selector to match unassigned nodes directly on the first Level 2 pass, before `InsertCastTransformer` runs, eliminating the dependency on the optimization loop repeat.