Add Attention op for multi-head self attention in BERT #1984

Merged: tianleiwu merged 11 commits into master from tlwu/attention on Oct 7, 2019
@tianleiwu
Contributor

Description: Add an Attention op for multi-head self-attention in the BERT model

Motivation and Context

BERT is an important model for natural language processing, but its inference is slow. Here, we create a fused op that could significantly improve inference performance. The implementation is based on the QkvToContext plugin in the NVIDIA TensorRT BERT demo. We applied two changes: (1) scaling is moved from the masked softmax into the GEMM to simplify the code; (2) the fully connected layer is merged into this fused op to make it a more meaningful op.
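Change (1) can be illustrated with a minimal CPU sketch. It is not the CUDA kernel from this PR: the function names are hypothetical, it covers a single head, and it assumes the scale is the usual 1/sqrt(head_size) factor of scaled dot-product attention. It computes the attention probabilities two ways, once with the scale applied inside the masked softmax and once with the scale passed as the GEMM multiplier, so the two placements can be compared directly.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// Computes C = alpha * A * B^T. The alpha argument plays the role of the
// scalar multiplier that BLAS-style GEMM routines accept.
Matrix GemmABt(const Matrix& a, const Matrix& b, float alpha) {
  Matrix c(a.size(), std::vector<float>(b.size(), 0.0f));
  for (std::size_t i = 0; i < a.size(); ++i) {
    for (std::size_t j = 0; j < b.size(); ++j) {
      assert(a[i].size() == b[j].size());
      float sum = 0.0f;
      for (std::size_t k = 0; k < a[i].size(); ++k) sum += a[i][k] * b[j][k];
      c[i][j] = alpha * sum;
    }
  }
  return c;
}

// Row-wise softmax of (scale * scores). Key positions with mask[j] == 0
// receive probability 0 and do not contribute to the normalizer.
Matrix MaskedSoftmax(const Matrix& scores, const std::vector<int>& mask, float scale) {
  Matrix out(scores.size(), std::vector<float>(mask.size(), 0.0f));
  for (std::size_t i = 0; i < scores.size(); ++i) {
    assert(scores[i].size() == mask.size());
    // Subtract the row maximum before exponentiating for numerical stability.
    float max_val = -std::numeric_limits<float>::infinity();
    for (std::size_t j = 0; j < mask.size(); ++j) {
      if (mask[j]) max_val = std::max(max_val, scale * scores[i][j]);
    }
    float denom = 0.0f;
    for (std::size_t j = 0; j < mask.size(); ++j) {
      if (mask[j]) {
        out[i][j] = std::exp(scale * scores[i][j] - max_val);
        denom += out[i][j];
      }
    }
    if (denom > 0.0f) {
      for (std::size_t j = 0; j < mask.size(); ++j) out[i][j] /= denom;
    }
  }
  return out;
}

// Variant A (hypothetical name): unscaled GEMM, scale applied in the softmax.
Matrix AttentionProbsScaleInSoftmax(const Matrix& q, const Matrix& k,
                                    const std::vector<int>& mask) {
  assert(!q.empty() && !q[0].empty());
  const float scale = 1.0f / std::sqrt(static_cast<float>(q[0].size()));
  return MaskedSoftmax(GemmABt(q, k, 1.0f), mask, scale);
}

// Variant B (hypothetical name): scale folded into the GEMM multiplier,
// so the softmax no longer needs a scale parameter.
Matrix AttentionProbsScaleInGemm(const Matrix& q, const Matrix& k,
                                 const std::vector<int>& mask) {
  assert(!q.empty() && !q[0].empty());
  const float scale = 1.0f / std::sqrt(static_cast<float>(q[0].size()));
  return MaskedSoftmax(GemmABt(q, k, scale), mask, 1.0f);
}
```

Both variants feed the same scaled scores into the softmax, so they should agree up to floating-point rounding. Folding the scale into the GEMM multiplier lets the softmax drop one parameter, which is one plausible reading of "simplifying code" in the description.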

@tianleiwu tianleiwu requested a review from a team as a code owner October 3, 2019 00:15
Comment on lines +196 to +199
ONNX_CONTRIB_OPERATOR_SCHEMA(Attention)
.SetDomain(kOnnxDomain)
.SinceVersion(1)
.SetSupportLevel(OpSchema::SupportType::EXPERIMENTAL)
Contributor

Maybe I'm missing something, but it seems to me that Attention is not an ONNX operator supported since version 1. Shouldn't this be in a different domain to avoid conflict with possible future ONNX operators?

Contributor Author

@TMVector, thanks for the suggestion. I'll move it to kMSDomain.
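The concern behind this exchange can be sketched with a toy registry. This is not the ONNX or ONNX Runtime schema registry: `ToySchemaRegistry` is a hypothetical class, and the domain strings "ai.onnx" and "com.microsoft" are used here only as illustrative labels for the standard domain and a vendor domain. The sketch assumes schemas are identified by the combination of domain, op name, and since-version, so the same op name can coexist in two domains but cannot be registered twice in one.

```cpp
#include <set>
#include <string>
#include <tuple>

// Hypothetical toy registry keyed by (domain, op name, since_version).
class ToySchemaRegistry {
 public:
  // Returns true if the schema was added, false if the same key
  // (domain, name, since_version) was already registered.
  bool Register(const std::string& domain, const std::string& name, int since_version) {
    return schemas_.insert(std::make_tuple(domain, name, since_version)).second;
  }

  // Returns true if a schema with exactly this key exists.
  bool Contains(const std::string& domain, const std::string& name, int since_version) const {
    return schemas_.count(std::make_tuple(domain, name, since_version)) > 0;
  }

 private:
  std::set<std::tuple<std::string, std::string, int>> schemas_;
};
```

With this model, a custom `Attention` registered under the vendor label leaves the standard domain's `Attention` slot free, whereas registering it under the standard label would make any later registration with the same name and version fail.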

yufenglee previously approved these changes Oct 6, 2019
@tianleiwu tianleiwu closed this Oct 7, 2019
@tianleiwu tianleiwu reopened this Oct 7, 2019
@tianleiwu tianleiwu merged commit 7b39f50 into master Oct 7, 2019
@tianleiwu tianleiwu deleted the tlwu/attention branch October 7, 2019 19:23
@kkaehler

kkaehler commented Oct 7, 2019

Very cool! @tianleiwu any expectations of the inference latency? I'd be curious to see any numbers you get.

@tianleiwu
Contributor Author

@kkaehler, the inference latency varies with model size and GPU. For the BERT-base model, I see 2 ms to 10 ms on Azure Virtual Machines with different GPUs. You are welcome to test it in your own environment to see the latency.
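For anyone measuring latency in their own environment, a minimal timing helper might look like the sketch below. `AverageLatencyMs` is a hypothetical name and not part of ONNX Runtime. It assumes `run_once` performs one full inference and returns only after the result is ready, which matters on GPUs where work may be launched asynchronously. Warm-up runs are excluded because the first calls often include one-time setup cost.

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper: returns the average wall-clock latency in milliseconds
// of run_once over timed_runs calls, after warmup_runs untimed calls.
double AverageLatencyMs(const std::function<void()>& run_once,
                        int warmup_runs, int timed_runs) {
  if (timed_runs <= 0) return 0.0;

  // Untimed warm-up calls.
  for (int i = 0; i < warmup_runs; ++i) run_once();

  // steady_clock is monotonic, so it is not affected by system clock changes.
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < timed_runs; ++i) run_once();
  const auto end = std::chrono::steady_clock::now();

  const std::chrono::duration<double, std::milli> elapsed = end - start;
  return elapsed.count() / timed_runs;
}
```

A caller would wrap its own inference call in a lambda, for example `AverageLatencyMs([&] { /* run the model once */ }, 10, 100)`, and compare the result across builds or devices.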

@kkaehler

@tianleiwu that is very fast! I haven't built it for GPU yet, but in the Docker build I was seeing times closer to 1 second. Were you by any chance using an off-the-shelf BERT ONNX model I could benchmark with?

4 participants