Add Attention op for multi-head self attention in BERT by tianleiwu · Pull Request #1984 · microsoft/onnxruntime

tianleiwu · 2019-10-03T00:15:31Z

Description: Add an Attention op for multi-head self-attention in the BERT model.

Motivation and Context

BERT is an important model for natural language processing, but its inference is slow. Here, we create a fused op that could significantly improve inference performance. The implementation is based on the QkvToContext plugin in the NVIDIA TensorRT BERT demo. We made two changes: (1) scaling is moved from the masked softmax into the GEMM to simplify the code; (2) the fully connected layer is merged into this fused op to make it a more meaningful op.
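To illustrate change (1), here is a minimal CPU sketch of a single attention head in which the 1/sqrt(head_size) scale is applied as the `alpha` factor of the score GEMM, so the masked softmax does no scaling of its own. This is not the code from this PR: the names `GemmABt`, `MaskedSoftmax`, and `AttentionHead` are hypothetical, the mask layout (one 0/1 flag per key position) is an assumption, and the merged fully connected input projection from change (2) is omitted for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// Hypothetical helper: C = alpha * A * B^T. The attention scale is folded in here.
Matrix GemmABt(const Matrix& a, const Matrix& b, float alpha) {
  Matrix c(a.size(), std::vector<float>(b.size(), 0.0f));
  for (std::size_t i = 0; i < a.size(); ++i) {
    for (std::size_t j = 0; j < b.size(); ++j) {
      float sum = 0.0f;
      for (std::size_t k = 0; k < a[i].size(); ++k) sum += a[i][k] * b[j][k];
      c[i][j] = alpha * sum;
    }
  }
  return c;
}

// Hypothetical helper: row-wise softmax with no scaling.
// Key positions with mask[j] == 0 receive zero probability (assumed mask layout).
void MaskedSoftmax(Matrix& scores, const std::vector<int>& mask) {
  for (auto& row : scores) {
    float max_val = std::numeric_limits<float>::lowest();
    for (std::size_t j = 0; j < row.size(); ++j)
      if (mask[j]) max_val = std::max(max_val, row[j]);
    float total = 0.0f;
    for (std::size_t j = 0; j < row.size(); ++j) {
      row[j] = mask[j] ? std::exp(row[j] - max_val) : 0.0f;
      total += row[j];
    }
    if (total > 0.0f)
      for (float& p : row) p /= total;
  }
}

// Single head: context = softmax(alpha * Q K^T) V, with alpha = 1 / sqrt(head_size).
Matrix AttentionHead(const Matrix& q, const Matrix& k, const Matrix& v,
                     const std::vector<int>& mask) {
  const float alpha = 1.0f / std::sqrt(static_cast<float>(q[0].size()));
  Matrix probs = GemmABt(q, k, alpha);  // scaling happens in the GEMM
  MaskedSoftmax(probs, mask);           // softmax only masks and normalizes
  Matrix context(q.size(), std::vector<float>(v[0].size(), 0.0f));
  for (std::size_t i = 0; i < probs.size(); ++i)
    for (std::size_t j = 0; j < v.size(); ++j)
      for (std::size_t d = 0; d < v[j].size(); ++d)
        context[i][d] += probs[i][j] * v[j][d];
  return context;
}
```

On a GPU the same idea would typically map to passing the scale as the `alpha` argument of a batched GEMM call, which avoids a separate elementwise multiply inside the softmax kernel.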

TMVector · 2019-10-03T09:29:29Z

+  ONNX_CONTRIB_OPERATOR_SCHEMA(Attention)
+      .SetDomain(kOnnxDomain)
+      .SinceVersion(1)
+      .SetSupportLevel(OpSchema::SupportType::EXPERIMENTAL)


Maybe I'm missing something, but it seems to me that Attention is not an ONNX operator supported since version 1. Shouldn't this be in a different domain to avoid conflict with possible future ONNX operators?

@TMVector, thanks for the suggestion. I'll move it to kMSDomain.

Limit test to run by CUDA provider only.

kkaehler · 2019-10-07T21:25:54Z

Very cool! @tianleiwu any expectations of the inference latency? I'd be curious to see any numbers you get.

tianleiwu · 2019-10-09T05:59:44Z

@kkaehler, the inference latency varies with model size and GPU. For the BERT-base model, I see 2 ms to 10 ms on Azure Virtual Machines with different GPUs. You are welcome to test it in your own environment to see the latency.

kkaehler · 2019-10-11T19:22:38Z

@tianleiwu that is very fast! I haven't built it for GPU yet, but in the Docker build I was seeing times closer to 1 second. Were you by any chance using an off-the-shelf BERT ONNX model I could benchmark with?

Add Attention op for multi head self attention in BERT

79bbc9e

tianleiwu requested review from ybrnathan and yufenglee October 3, 2019 00:15

tianleiwu requested a review from a team as a code owner October 3, 2019 00:15

Add test cases

9b2b92d

TMVector reviewed Oct 3, 2019

tianleiwu added 7 commits October 3, 2019 11:58

Move op from kOnnxDomain to kMSDomain.

0b27234

Limit test to run by CUDA provider only.

merge master

811d483

fix test

340901b

Merge remote-tracking branch 'origin/master' into tlwu/attention

3849288

Merge remote-tracking branch 'origin/master' into tlwu/attention

4308ad5

Add float16 test

d39b9e4

fix cpu build error

b51d970

yufenglee previously approved these changes Oct 6, 2019

handle cuda error

55096ac

tianleiwu dismissed yufenglee’s stale review via 55096ac October 6, 2019 23:02

tianleiwu requested a review from yufenglee October 6, 2019 23:03

get last cuda error when failed

d512503

yufenglee approved these changes Oct 7, 2019

tianleiwu closed this Oct 7, 2019

tianleiwu reopened this Oct 7, 2019

tianleiwu merged commit 7b39f50 into master Oct 7, 2019

tianleiwu deleted the tlwu/attention branch October 7, 2019 19:23

tianleiwu merged 11 commits into master from tlwu/attention

4 participants
