Fix DQ→MatMulNBits fusion for FP16 models on CPU EP #27640
Merged
Conversation
Contributor
Pull request overview
Fixes DQ→MatMulNBits fusion for FP16 models on CPU EP by allowing the selector to match MatMul nodes with unassigned execution providers, and removes the now-unnecessary DQCastMatMulToMatMulNBits selector/action that relied on fragile optimization loop ordering.
Changes:
- Adds `""` (empty EP) to the compatible providers for `DQMatMulToMatMulNBitsSelector` and `QDQSelectorActionTransformer` so unassigned FP16 MatMul nodes can be matched
- Removes `DQCastMatMulToMatMulNBitsSelector` and `DQCastMatMulToMatMulNBitsAction` (selector, action, and registration)
- Replaces the Cast-aware test with a direct FP16 DQ→MatMul test
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| qdq_selector_action_transformer.cc | Adds "" to transformer and selector provider lists; removes Cast-aware registration |
| qdq_selectors.h | Removes DQCastMatMulToMatMulNBitsSelector class |
| qdq_selectors.cc | Removes DQCastMatMulToMatMulNBitsSelector::Select implementation |
| qdq_actions.h | Removes DQCastMatMulToMatMulNBitsAction declaration |
| qdq_actions.cc | Removes DQCastMatMulToMatMulNBitsAction implementation; updates comment |
| qdq_matmulnbits_transformer_test.cc | Replaces Cast-aware test with direct FP16 DQ→MatMul test |
hariharans29 approved these changes · Mar 13, 2026
Description
For FP16 models with block-quantized weights (`DQ(int4/int2/int8, fp16_scale) → MatMul(fp16)`), the `DQMatMulToMatMulNBitsSelector` failed to match on CPU EP because FP16 MatMul nodes are not claimed by CPU EP during graph partitioning, leaving their execution provider unassigned (empty string `""`). The selector's EP compatibility check rejected these nodes.

This PR:
- Adds `""` (empty/unassigned EP) to the compatible providers list for `DQMatMulToMatMulNBitsSelector` so it can match FP16 MatMul nodes not yet assigned to an EP. The resulting `MatMulNBits` node is assigned to `kCpuExecutionProvider` by the action (which has both `float` and `MLFloat16` CPU kernels).
- Adds `""` to the `QDQSelectorActionTransformer` transformer-level compatible EPs so unassigned nodes reach individual selectors (other selectors are unaffected since their own provider lists don't include `""`).
- Removes `DQCastMatMulToMatMulNBitsSelector` and `DQCastMatMulToMatMulNBitsAction`, which handled a `DQ → Cast(fp16→fp32) → MatMul` pattern that only existed after `InsertCastTransformer` ran. That fusion only worked incidentally when `FuseInitializersTransformer` (Level 4) triggered an optimization loop repeat, giving Level 2 QDQ fusions a second pass, a behavior that didn't occur in all builds (e.g., minimal/extended-minimal builds without `FuseInitializersTransformer`).
- Replaces the `DQCastMatMulConvertedToMatMulNBits` test with `DQMatMulFP16ConvertedToMatMulNBits`, which tests the actual scenario: `DQ(int4, fp16_scale) → MatMul(fp16)` on CPU EP.

Motivation and Context
FP16 models with block-quantized weights were not getting `DQ → MatMulNBits` fusion when running on CPU EP in certain ORT builds. The fusion worked on x64 full builds by luck: `InsertCastTransformer` created `DQ→Cast→MatMul` patterns, then `FuseInitializersTransformer` (Level 4) modified FP16 initializers, causing the optimization loop to repeat and giving Level 2 QDQ fusions a second pass in which the Cast-aware selector matched. In builds without `FuseInitializersTransformer` (e.g., minimal builds, arm packages), the loop didn't repeat and the fusion never applied.

The root cause is that CPU EP has no FP16 MatMul kernel, so it doesn't claim FP16 MatMul nodes during partitioning. These nodes have an empty EP string, which the `QDQSelectorActionTransformer` and `BaseSelector` both rejected. The fix allows the `DQMatMulToMatMulNBits` selector to match unassigned nodes directly on the first Level 2 pass, before `InsertCastTransformer` runs, eliminating the dependency on the optimization loop repeat.