[WebNN EP] Support MultiHeadAttention(MHA) #24079
Conversation
Force-pushed from 5e67f9b to 09de359 (Compare)
Honry
left a comment
Thanks @peishenyan, some comments; please also add this new op to the webnn-operators.md file.
onnxruntime/core/providers/webnn/builders/impl/attention_helper.h (resolved)
onnxruntime/core/providers/webnn/builders/impl/mha_op_builder.cc (6 resolved threads)
/azp run ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,Linux Android Emulator QNN CI Pipeline,Windows GPU WebGPU CI Pipeline,Windows OpenVINO CI Pipeline

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI

/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models

Azure Pipelines successfully started running 1 pipeline(s).

Azure Pipelines successfully started running 2 pipeline(s).

Azure Pipelines successfully started running 3 pipeline(s).

Azure Pipelines successfully started running 4 pipeline(s).
/azp run ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,Linux Android Emulator QNN CI Pipeline,Windows GPU WebGPU CI Pipeline,Windows OpenVINO CI Pipeline

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI

/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models

Azure Pipelines successfully started running 1 pipeline(s).

Azure Pipelines successfully started running 2 pipeline(s).

Azure Pipelines successfully started running 3 pipeline(s).
Oh, my fault... I forgot to format.
/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI

/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models

Azure Pipelines successfully started running 3 pipeline(s).

Azure Pipelines successfully started running 2 pipeline(s).

Azure Pipelines successfully started running 3 pipeline(s).
Will await @Honry's re-review.
@peishenyan, you forgot to add the op info to the webnn-operators.md file; otherwise LGTM.
/azp run ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,Linux Android Emulator QNN CI Pipeline,Windows GPU WebGPU CI Pipeline,Windows OpenVINO CI Pipeline

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI

/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models

Azure Pipelines successfully started running 1 pipeline(s).

Azure Pipelines successfully started running 2 pipeline(s).

Azure Pipelines successfully started running 3 pipeline(s).
That's so weird. ONNX Runtime CUDA Builds / Windows GPU CUDA CI Pipeline (pull_request) failed... This test had passed every time before this commit, but I only changed the doc file in this commit.
Hi @fdwr, is it possible to re-trigger the test to get a passing result?
@peishenyan, you may need to rebase the code onto the latest main.
/azp run ONNX Runtime CUDA Builds / Windows GPU CUDA CI Pipeline (pull_request)
No pipelines are associated with this pull request.
I'll retry the 2 required ones again (Linux CI / Build Linux x64 Release / build_test_pipeline (pull_request) ...). If they don't pass today, you'll need to try re-merging with main.
Amazing... they finally passed 😂

Merging, as the 5 remaining failing tests are unrelated, pervasive, and persistent infrastructure issues.
### Description
<!-- Describe your changes. -->
Adds support for MultiHeadAttention via WebNN matmul, transpose, reshape, and other operations, following the logic of the MHA subgraph below:
```
Abbreviations: B is batch_size, S is sequence_length, W is hidden_size, P is past_sequence_length,
N is the number of attention heads, H is head_size, W = N*H, and h = Sqrt(H).
Notes: If the data type of the inputs (qkv and past kv) is float16, we cast them to float32 to preserve precision.
query key value
| | |
   q_Reshape    k_Reshape   v_Reshape   (shape=B,S,N,H)
| | |
q_Transpose k_Transpose v_Transpose (perm=0,2,1,3)
\ / |
\ / |
present_key<---\----Concat <---------|----past_key
| | |
| opt_k_transpose |
\ (0,1,3,2) |
\ / | past_value
qk_MatMul | /
| scale | /
| / | /
qk_Div Concat------> present_value
| |
| /
Add <----------/---------------attention_bias
| /
Softmax /
\ /
\ /
qkv_MatMul
|
Transpose (perm=0,2,1,3)
|
                 Reshape---(shape=B,S,W)
|
output
```
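For reference, the subgraph above can be sketched in NumPy. This is a hand-written illustration of the decomposition, not the EP code: the function name `mha_reference`, its signature, and the layout assumptions (query/key/value as `(B, S, W)`, past KV as `(B, N, P, H)`) are illustrative, and the float16-to-float32 cast noted above is omitted.

```python
import numpy as np

def mha_reference(query, key, value, num_heads,
                  past_key=None, past_value=None, attention_bias=None):
    """Illustrative NumPy sketch of the MHA decomposition diagrammed above.

    query/key/value: (B, S, W) with W = num_heads * head_size.
    past_key/past_value: (B, N, P, H) when present.
    Returns (output, present_key, present_value).
    """
    B, S, W = query.shape
    N = num_heads
    H = W // N

    def split_heads(x):
        # Reshape (B, S, W) -> (B, S, N, H), then Transpose(perm=0,2,1,3) -> (B, N, S, H)
        return x.reshape(B, -1, N, H).transpose(0, 2, 1, 3)

    q, k, v = split_heads(query), split_heads(key), split_heads(value)

    # Concat past KV along the sequence axis to form present_key / present_value.
    present_key = k if past_key is None else np.concatenate([past_key, k], axis=2)
    present_value = v if past_value is None else np.concatenate([past_value, v], axis=2)

    # qk_MatMul against the transposed key (perm=0,1,3,2), then qk_Div by h = Sqrt(H).
    scores = q @ present_key.transpose(0, 1, 3, 2) / np.sqrt(H)
    if attention_bias is not None:
        scores = scores + attention_bias  # the Add node in the diagram

    # Softmax over the key dimension.
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)

    # qkv_MatMul, then Transpose(perm=0,2,1,3) and Reshape back to (B, S, W).
    out = (probs @ present_value).transpose(0, 2, 1, 3).reshape(B, S, W)
    return out, present_key, present_value
```

The same chain of reshape/transpose/matmul/softmax ops is what the op builder emits as WebNN graph nodes.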
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Signed-off-by: bfilipek <bartlomiej.filipek@intel.com>