[QNN] MatMulAddFusion and Reshape Related Fusion by Lafi7e · Pull Request #22494 · microsoft/onnxruntime

Lafi7e · 2024-10-18T06:18:21Z

QNN EP relies on the Gemm op to map to the QNN FullyConnected op, which is much faster than MatMul+Add. This PR fuses MatMul+Add into Gemm when MatMul's second input is a 2D initializer, regardless of the rank of the first input. If the first input is not a 2D tensor, Reshape nodes are added.
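The equivalence behind this fusion can be checked numerically. The following is a minimal NumPy sketch, not the code in `matmul_add_fusion.cc`; the function names are hypothetical, and it assumes the bias is a 1D tensor of length N that broadcasts over the rows of the Gemm output.

```python
import numpy as np


def matmul_add(x, w, b):
    """Reference: the unfused MatMul followed by Add."""
    return np.matmul(x, w) + b


def gemm(a, b, c):
    """Stand-in for Gemm with alpha = beta = 1 and no transposes (2D inputs only)."""
    assert a.ndim == 2 and b.ndim == 2
    return a @ b + c


def reshape_gemm_reshape(x, w, b):
    """Fused form: flatten leading dims, run Gemm, restore leading dims."""
    k, n = w.shape
    leading = x.shape[:-1]
    x2d = x.reshape(-1, k)         # Reshape to [M, K]; only needed when x is not 2D
    y2d = gemm(x2d, w, b)          # [M, N]
    return y2d.reshape(*leading, n)  # Reshape back to [..., N]


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 4))   # rank-3 first input
w = rng.standard_normal((4, 5))      # 2D "initializer"
b = rng.standard_normal(5)           # assumed 1D bias

print(np.allclose(matmul_add(x, w, b), reshape_gemm_reshape(x, w, b)))
```

The two Reshape steps are what the description refers to when the first input is not 2D; for a 2D input they would be unnecessary.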

On QNN EP, memory is allocated for each activation tensor, so Reshape/Squeeze/Unsqueeze is not a no-op. This PR also adds fusions that try to remove redundant Reshape nodes. For some QNN AI Hub models on specific devices, the graph cannot be finalized at execution time unless the Reshape nodes are removed; it works well once they are removed.
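To illustrate what "redundant" can mean here, the sketch below collapses a chain of back-to-back Reshape nodes. It is an illustration only, not the logic in `reshape_fusion.cc`: the helper names are hypothetical, shapes are assumed fully static (no `0` or `-1` entries), and the intermediate tensors are assumed to have no other consumers.

```python
import numpy as np


def collapse_reshape_chain(input_shape, targets):
    """Return the Reshape targets that remain after simplification.

    A chain of Reshapes only depends on its last target shape, and it
    disappears entirely when that shape equals the input shape.
    """
    if not targets:
        return []
    final = tuple(targets[-1])
    if final == tuple(input_shape):
        return []          # the chain is an identity: drop every Reshape
    return [final]         # otherwise a single Reshape is enough


def apply_chain(x, targets):
    """Apply each Reshape in order, as the unoptimized graph would."""
    for shape in targets:
        x = x.reshape(shape)
    return x


x = np.arange(24).reshape(2, 3, 4)
chain = [(6, 4), (2, 3, 4)]            # flatten, then restore
kept = collapse_reshape_chain(x.shape, chain)
print(kept)
print(np.array_equal(apply_chain(x, chain), apply_chain(x, kept)))
```

Because each remaining Reshape costs an activation buffer on QNN EP, removing the identity chain saves both the allocation and the copy.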

Average inference time cost for the models below, run with and without the change:

| Model | With change (ms) | Without change (ms) |
| --- | --- | --- |
| swin_tiny | 12.8077 | 23.956 |
| swin_base | 27.0639 | 57.6608 |
| convnext_tiny | 3.42956 | 16.1848 |
| openai_clip_CLIPTextEncoder | 5.96104 | 220.406 |
| openai_clip_CLIPImageEncoder | 41.8206 | 919.712 |
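The relative speedup implied by these figures can be computed directly. This small sketch assumes the first number on each line is the time with the change and the second is the time without it, following the order in which the description lists them; the variable and function names are illustrative.

```python
# (model, time with the change in ms, time without the change in ms)
# Column order is an assumption based on "with and without the change".
timings_ms = [
    ("swin_tiny", 12.8077, 23.956),
    ("swin_base", 27.0639, 57.6608),
    ("convnext_tiny", 3.42956, 16.1848),
    ("openai_clip_CLIPTextEncoder", 5.96104, 220.406),
    ("openai_clip_CLIPImageEncoder", 41.8206, 919.712),
]


def speedup(with_ms, without_ms):
    """How many times faster the run with the change is."""
    return without_ms / with_ms


speedups = {name: speedup(w, wo) for name, w, wo in timings_ms}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
```

Under that reading, the vision models improve by roughly 2x to 5x, while the two CLIP encoders improve by more than 20x.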

NOTE: the current change skips the Attention pattern, because otherwise AttentionFusion would no longer work. Ideally AttentionFusion should be adjusted to support the Gemm pattern, but that requires big changes. We may do this in the future, for example when we want to run transformer models on QNN: since QNN has no Attention op, we would still want to fuse MatMul+Add in the Attention pattern so that FullyConnected is used on the QNN side.

onnxruntime/core/optimizer/matmul_add_fusion.cc

adrianlizarraga · 2024-11-06T17:40:25Z

@centwang Thank you for the PR. It looks like many unit tests and pipelines are still not passing. Could you please address those issues first?

onnxruntime/core/optimizer/reshape_fusion.cc

onnxruntime/test/optimizer/graph_transform_test.cc

onnxruntime/test/providers/qnn/gemm_op_test.cc

onnxruntime/core/providers/qnn/builder/qnn_node_group/reshape_gemm_fusion.cc

onnxruntime/test/providers/qnn/qnn_basic_test.cc

adrianlizarraga · 2025-02-12T04:02:51Z

@HectorSVC could you please take a look at this PR?

adrianlizarraga · 2025-02-18T21:18:22Z

Hi @skottmckay, I think there are some unresolved comments. Would you be able to take another look?

Co-authored-by: adrianlizarraga <adlizarraga@microsoft.com>

Lafi7e force-pushed the weicwang/matmul_add_fusion branch from 7d3d515 to 0a05430 on October 21, 2024 03:34

snnn previously approved these changes Oct 21, 2024

Lafi7e requested review from adrianlizarraga and skottmckay October 22, 2024 02:08

skottmckay reviewed Oct 22, 2024

onnxruntime/core/optimizer/matmul_add_fusion.cc (outdated, resolved)

onnxruntime/core/optimizer/matmul_add_fusion.cc (resolved)

Lafi7e dismissed snnn’s stale review via ca59611 October 29, 2024 11:51

Lafi7e force-pushed the weicwang/matmul_add_fusion branch from a8388b7 to ca59611 on October 29, 2024 11:51

Lafi7e changed the title ~~Add More Cases to MatMulAddFusion~~ [QNN] MatMulAddFusion and Reshape Related Fusion Oct 29, 2024

Lafi7e requested review from cloudhan and jywu-msft October 29, 2024 11:52

skottmckay reviewed Nov 8, 2024

Lafi7e added 6 commits November 22, 2024 11:45

matmul add fusion

9405087

fix ut failure

c685f39

fix compile error

e5146ce

fix attn pattern

787b4fb

reshape related fusion

8606668

resolve comments

47d4755

Lafi7e force-pushed the weicwang/matmul_add_fusion branch from ca59611 to 47d4755 on November 25, 2024 05:49

Lafi7e added 2 commits November 25, 2024 14:16

fix build error

d063df5

fix test failure

8d75e0a

skottmckay reviewed Nov 28, 2024

Lafi7e added 7 commits December 5, 2024 10:37

Merge branch 'main' into weicwang/matmul_add_fusion

dff068b

resolve comments

81d3fe9

Merge branch 'main' into weicwang/matmul_add_fusion

e1c77da

fix merge error

9b09618

use constant

91258aa

Merge branch 'main' into weicwang/matmul_add_fusion

582eb3f

enforce constant initializer for qnn

4cf44f6

skottmckay previously approved these changes Jan 17, 2025

adrianlizarraga reviewed Jan 27, 2025

onnxruntime/core/providers/qnn/builder/qnn_node_group/reshape_gemm_fusion.cc (outdated, resolved)

Merge main and fix conflicts

d8260b8

adrianlizarraga dismissed skottmckay’s stale review via d8260b8 February 7, 2025 00:44

adrianlizarraga added 3 commits February 6, 2025 16:52

lintrunner fix

1e3dda9

signed comparison fix for qnn built as a shared lib

0ab6369

Merge main and fix conflicts

dd08cdc

adrianlizarraga previously approved these changes Feb 12, 2025

adrianlizarraga requested review from HectorSVC and removed request for cloudhan February 12, 2025 04:02

adrianlizarraga added 2 commits February 13, 2025 17:32

Merge in main branch and fix conflicts

47259bc

Add include to fix optional GitHub actions linter suggestion

0f9700d

adrianlizarraga dismissed their stale review via 0f9700d February 14, 2025 01:33

adrianlizarraga approved these changes Feb 14, 2025

HectorSVC approved these changes Feb 14, 2025

adrianlizarraga added the ep:QNN (issues related to QNN execution provider) label Feb 14, 2025

jywu-msft merged commit 03c6c2e into main Feb 18, 2025
96 of 98 checks passed

jywu-msft deleted the weicwang/matmul_add_fusion branch February 18, 2025 21:22
