
Fix DQ→MatMulNBits fusion for FP16 models on CPU EP #27640

Merged
jambayk merged 3 commits into main from jambayk/qdq-mnb-arm on Mar 14, 2026


jambayk commented Mar 13, 2026

Description

For FP16 models with block-quantized weights (DQ(int4/int2/int8, fp16_scale) → MatMul(fp16)), the DQMatMulToMatMulNBitsSelector failed to match on CPU EP because FP16 MatMul nodes are not claimed by CPU EP during graph partitioning, leaving their execution provider unassigned (empty string ""). The selector's EP compatibility check rejected these nodes.
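To make the DQ → MatMul pattern concrete, here is a minimal pure-Python sketch of blockwise dequantization followed by a matrix multiply. It is an illustration only: the helper names (`dequantize_blockwise`, `matmul`), the K×N weight layout, blocking along the K axis, and the zero-valued zero point are assumptions chosen for the example, not ONNX Runtime code.

```python
def dequantize_blockwise(q, scales, block_size, zero_point=0):
    """Dequantize a K x N integer matrix with one scale per (block, column).

    q      : K x N nested list of small integers (e.g. int4 values in [-8, 7])
    scales : (K // block_size) x N nested list of floats
    Rows r*block_size .. (r+1)*block_size - 1 of a column share scales[r][col].
    """
    k = len(q)
    if k % block_size != 0:
        raise ValueError("K must be a multiple of block_size")
    n = len(q[0])
    return [
        [(q[row][col] - zero_point) * scales[row // block_size][col] for col in range(n)]
        for row in range(k)
    ]


def matmul(a, b):
    """Plain M x K by K x N matrix product on nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Hypothetical 4 x 2 int4 weight with block_size = 2 (two blocks per column).
q = [[1, -2], [3, 4], [-8, 7], [0, 5]]
scales = [[0.5, 0.25], [2.0, 1.0]]

w = dequantize_blockwise(q, scales, block_size=2)  # the "DQ" step
x = [[1.0, 1.0, 1.0, 1.0]]                          # 1 x 4 activation
y = matmul(x, w)                                    # the "MatMul" step
print(y)
```

A fused operator such as MatMulNBits produces the same result from the quantized weights and scales directly, without materializing `w` as a separate tensor.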

This PR:

  • Adds "" (empty/unassigned EP) to the compatible providers list for DQMatMulToMatMulNBitsSelector so it can match FP16 MatMul nodes not yet assigned to an EP. The action assigns the resulting MatMulNBits node to kCpuExecutionProvider, where MatMulNBits has both float and MLFloat16 CPU kernels.
  • Adds "" to the QDQSelectorActionTransformer transformer-level compatible EPs so unassigned nodes reach individual selectors (other selectors are unaffected since their own provider lists don't include "").
  • Removes the DQCastMatMulToMatMulNBitsSelector and DQCastMatMulToMatMulNBitsAction, which handled a DQ → Cast(fp16→fp32) → MatMul pattern that only existed after InsertCastTransformer ran. That fusion worked only incidentally: when FuseInitializersTransformer (Level 4) triggered an optimization loop repeat, Level 2 QDQ fusions got a second pass. This did not happen in all builds (e.g., minimal/extended-minimal builds without FuseInitializersTransformer).
  • Replaces the DQCastMatMulConvertedToMatMulNBits test with DQMatMulFP16ConvertedToMatMulNBits that tests the actual scenario: DQ(int4, fp16_scale) → MatMul(fp16) on CPU EP.
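The two provider checks described in the list above can be modeled as a pair of set-membership tests. The following is a toy Python model, not the ONNX Runtime implementation: the function `node_reaches_selector` and the provider sets are hypothetical, and only the strings `"CPUExecutionProvider"` (the value of kCpuExecutionProvider) and `""` (unassigned) come from the behavior described in this PR.

```python
CPU_EP = "CPUExecutionProvider"  # value of kCpuExecutionProvider
UNASSIGNED = ""                  # node not claimed by any EP during partitioning


def node_reaches_selector(node_ep, transformer_eps, selector_eps):
    """A node is matched only if it passes both the transformer-level
    check and the individual selector's own provider check."""
    return node_ep in transformer_eps and node_ep in selector_eps


# Before the fix (assumed lists): neither level accepts the empty string.
old_transformer_eps = {CPU_EP}
old_dq_matmul_eps = {CPU_EP}

# After the fix: "" is added at both levels for the DQ->MatMulNBits selector.
new_transformer_eps = {CPU_EP, UNASSIGNED}
new_dq_matmul_eps = {CPU_EP, UNASSIGNED}

# A hypothetical other selector whose provider list was left unchanged.
other_selector_eps = {CPU_EP}

fp16_matmul_ep = UNASSIGNED  # CPU EP has no FP16 MatMul kernel, so it stays ""

print(node_reaches_selector(fp16_matmul_ep, old_transformer_eps, old_dq_matmul_eps))
print(node_reaches_selector(fp16_matmul_ep, new_transformer_eps, new_dq_matmul_eps))
print(node_reaches_selector(fp16_matmul_ep, new_transformer_eps, other_selector_eps))
```

The third call shows why other selectors are unaffected: the node passes the transformer-level check but is still rejected by a selector whose own list does not contain "".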

Motivation and Context

FP16 models with block-quantized weights were not getting DQ → MatMulNBits fusion when running on CPU EP in certain ORT builds. The fusion worked on x64 full builds only by accident: InsertCastTransformer created DQ→Cast→MatMul patterns, then FuseInitializersTransformer (Level 4) modified FP16 initializers, causing the optimization loop to repeat and giving Level 2 QDQ fusions a second pass in which the Cast-aware selector matched. In builds without FuseInitializersTransformer (e.g., minimal builds, Arm packages), the loop didn't repeat and the fusion was never applied.

The root cause is that CPU EP has no FP16 MatMul kernel, so it doesn't claim FP16 MatMul nodes during partitioning. These nodes have an empty EP string, which the QDQSelectorActionTransformer and BaseSelector both rejected. The fix allows the DQMatMulToMatMulNBits selector to match unassigned nodes directly on the first Level 2 pass, before InsertCastTransformer runs, eliminating the dependency on the optimization loop repeat.
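The ordering dependency can be illustrated with a small simulation. This is a deliberately simplified model built only from the description in this PR, not from the ONNX Runtime optimizer source: the function `run_pipeline`, its flags, and the string-based pattern representation are all hypothetical.

```python
def run_pipeline(selector_accepts_unassigned, has_cast_selector, has_fuse_initializers):
    """Simulate the optimization loop for an FP16 DQ->MatMul whose MatMul EP is "".

    Returns (final_pattern, number_of_loop_passes).
    """
    pattern = "DQ->MatMul"
    passes = 0
    while True:
        passes += 1

        # Level 2: QDQ selectors run.
        if pattern == "DQ->MatMul" and selector_accepts_unassigned:
            return "MatMulNBits", passes
        if pattern == "DQ->Cast->MatMul" and has_cast_selector:
            return "MatMulNBits", passes

        # InsertCastTransformer: wraps the unclaimed FP16 MatMul in casts.
        if pattern == "DQ->MatMul":
            pattern = "DQ->Cast->MatMul"

        # Level 4: FuseInitializersTransformer modifies FP16 initializers once,
        # which (in this model) is the only thing that makes the loop repeat.
        graph_modified = has_fuse_initializers and passes == 1
        if not graph_modified:
            return pattern, passes


# Old behavior, full build: fused, but only on the second pass.
print(run_pipeline(False, True, True))
# Old behavior, build without FuseInitializersTransformer: never fused.
print(run_pipeline(False, True, False))
# With this PR: fused on the first Level 2 pass, no Cast-aware selector needed.
print(run_pipeline(True, False, False))
```

In this model the third configuration reaches MatMulNBits regardless of whether a later transformer happens to trigger a repeat, which is the dependency the fix removes.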

Copilot AI left a comment

Pull request overview

Fixes DQ→MatMulNBits fusion for FP16 models on CPU EP by allowing the selector to match MatMul nodes with unassigned execution providers, and removes the now-unnecessary DQCastMatMulToMatMulNBits selector/action that relied on fragile optimization loop ordering.

Changes:

  • Adds "" (empty EP) to compatible providers for DQMatMulToMatMulNBitsSelector and QDQSelectorActionTransformer so unassigned FP16 MatMul nodes can be matched
  • Removes DQCastMatMulToMatMulNBitsSelector and DQCastMatMulToMatMulNBitsAction (selector, action, and registration)
  • Replaces the Cast-aware test with a direct FP16 DQ→MatMul test

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| qdq_selector_action_transformer.cc | Adds "" to transformer and selector provider lists; removes Cast-aware registration |
| qdq_selectors.h | Removes DQCastMatMulToMatMulNBitsSelector class |
| qdq_selectors.cc | Removes DQCastMatMulToMatMulNBitsSelector::Select implementation |
| qdq_actions.h | Removes DQCastMatMulToMatMulNBitsAction declaration |
| qdq_actions.cc | Removes DQCastMatMulToMatMulNBitsAction implementation; updates comment |
| qdq_matmulnbits_transformer_test.cc | Replaces Cast-aware test with direct FP16 DQ→MatMul test |

jambayk marked this pull request as ready for review March 13, 2026 18:02
jambayk force-pushed the jambayk/qdq-mnb-arm branch from 65ab432 to 3e64aa2 March 13, 2026 23:10
jambayk enabled auto-merge (squash) March 13, 2026 23:12
jambayk merged commit 09b5695 into main Mar 14, 2026
91 checks passed
jambayk deleted the jambayk/qdq-mnb-arm branch March 14, 2026 05:40