
[DML EP] Attention Kernel #13371

Merged
merged 14 commits into main from user/sumita/attention on Oct 24, 2022

Conversation

@sumitsays (Contributor) commented Oct 19, 2022

Description

DML EP kernel for the com.microsoft.attention operator. It has been implemented via DML_Graph; a short sketch of the underlying computation follows the references. References for this implementation:

  1. Hugging Face Attention for BERT (https://github.com/huggingface/transformers/blob/310340d0d01929715b30863ee6f633974d75da16/src/transformers/models/bert/modeling_bert.py#L245-L284)
  2. Chapter 3 of the O'Reilly book Natural Language Processing with Transformers, Revised Edition
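
The kernel itself is built as a fused DirectML graph in C++; purely as an illustration of the math it expresses (scaled dot-product multi-head attention, following the Hugging Face BERT reference above), here is a minimal NumPy sketch. The function name, shapes, packed-QKV weight layout, and additive-mask handling are assumptions made for this sketch, not the kernel's actual code.

```python
# Minimal NumPy sketch of the computation behind com.microsoft.attention
# (multi-head scaled dot-product attention). Illustrative only; the DML EP
# kernel expresses this as a fused DirectML graph, not as Python.
import numpy as np

def multi_head_attention(x, w_qkv, b_qkv, num_heads, mask=None):
    """x: (batch, seq, hidden); w_qkv: (hidden, 3*hidden); b_qkv: (3*hidden,)."""
    batch, seq, hidden = x.shape
    head_dim = hidden // num_heads

    # Single fused projection producing Q, K, V (packed weights assumed here).
    qkv = x @ w_qkv + b_qkv                      # (batch, seq, 3*hidden)
    q, k, v = np.split(qkv, 3, axis=-1)

    def split_heads(t):
        # (batch, seq, hidden) -> (batch, num_heads, seq, head_dim)
        return t.reshape(batch, seq, num_heads, head_dim).transpose(0, 2, 1, 3)

    q, k, v = map(split_heads, (q, k, v))

    # Scaled dot-product attention scores: (batch, num_heads, seq, seq).
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_dim)
    if mask is not None:
        scores = scores + mask                   # additive mask, e.g. large negative for padding

    # Softmax over the key dimension.
    scores = scores - scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)

    # Weighted sum of values, then merge heads back to (batch, seq, hidden).
    context = probs @ v
    return context.transpose(0, 2, 1, 3).reshape(batch, seq, hidden)
```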

This PR also

  • includes a small fix for the QLinearSigmoid kernel, storing a temporary object in a named variable.
  • enables 4 L2 transformer operators: LayerNorm, Gelu, MatMulScale, and Attention.

Motivation and Context

  • Why is this change required? What problem does it solve?
    Attention is one of the main operators used in Transformer-based models. It contributes to the overall performance of the DML EP for Transformer models.
  • If it fixes an open issue, please link to the issue here. N/A

@sumitsays sumitsays marked this pull request as ready for review October 20, 2022 00:55
@fdwr (Contributor) left a comment

:shipit:

@sumitsays sumitsays merged commit 24818cf into main Oct 24, 2022
@sumitsays sumitsays deleted the user/sumita/attention branch October 24, 2022 21:32
linnealovespie pushed a commit that referenced this pull request Oct 28, 2022