Add Attention op for multi-head self attention in BERT#1984
ONNX_CONTRIB_OPERATOR_SCHEMA(Attention)
    .SetDomain(kOnnxDomain)
    .SinceVersion(1)
    .SetSupportLevel(OpSchema::SupportType::EXPERIMENTAL)
Maybe I'm missing something, but it seems to me that Attention is not an ONNX operator supported since version 1. Shouldn't this be in a different domain to avoid conflict with possible future ONNX operators?
@TMVector, thanks for the suggestion. I'll move it to kMSDomain.
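To illustrate the concern raised above, here is a toy C++ sketch of a registry that identifies an operator by its (domain, name) pair. It is not the ONNX Runtime registration API: `ToySchemaRegistry`, `kToyOnnxDomain`, and `kToyMsDomain` are hypothetical names, the domain strings are assumptions, and operator versioning is ignored for simplicity.

```cpp
#include <set>
#include <string>
#include <utility>

// Assumed domain strings, for illustration only.
const std::string kToyOnnxDomain = "";              // default ONNX domain
const std::string kToyMsDomain = "com.microsoft";   // vendor-specific domain

// Hypothetical stand-in for a schema registry; versions are ignored.
class ToySchemaRegistry {
 public:
  // Returns true if (domain, name) was free, false if it was already taken.
  bool Register(const std::string& domain, const std::string& name) {
    return schemas_.insert(std::make_pair(domain, name)).second;
  }

 private:
  std::set<std::pair<std::string, std::string>> schemas_;
};
```

In this sketch, a contrib `Attention` registered under the default ONNX domain would occupy the same identity a future standard `Attention` would need, while registering it under a separate domain leaves the standard name free.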
Limit test to run by CUDA provider only.
Very cool! @tianleiwu any expectations of the inference latency? I'd be curious to see any numbers you get.
@kkaehler, the inference latency varies with model size and GPU. For the BERT-base model, I see 2 ms to 10 ms on Azure Virtual Machines with different GPUs. You are welcome to test it in your own environment to see the latency.
@tianleiwu that is very fast! I haven't built it for GPU yet, but in the docker build I was seeing times closer to 1 second. Were you by any chance using an off-the-shelf BERT ONNX model I could benchmark with?
Description: Add Attention op for multi-head self attention for BERT model
Motivation and Context
The BERT model is important for natural language processing, but its inference is slow. Here we add a fused op that could significantly improve inference performance. The implementation is based on the QkvToContext plugin in the NVIDIA TensorRT BERT demo, with two changes: (1) scaling is moved from the masked softmax to the GEMM to simplify the code; (2) the fully connected layer is merged into the fused op to make the op more meaningful.
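Change (1) works because the scale factor can be applied either inside the softmax or as the alpha factor of the GEMM that produces the attention scores; both give the same probabilities. The following is a minimal single-head CPU sketch in C++ of that equivalence, not the CUDA kernel from this PR. All function names here (`GemmABt`, `MaskedSoftmax`, `ScaleInSoftmax`, `ScaleInGemm`, `MaxAbsDiff`) are hypothetical, and the mask is assumed to be a per-key 0/1 vector.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// C = alpha * A * B^T, mirroring how a GEMM call accepts an alpha factor.
Matrix GemmABt(const Matrix& a, const Matrix& b, float alpha) {
  Matrix c(a.size(), std::vector<float>(b.size(), 0.0f));
  for (std::size_t i = 0; i < a.size(); ++i) {
    for (std::size_t j = 0; j < b.size(); ++j) {
      float sum = 0.0f;
      for (std::size_t k = 0; k < a[i].size(); ++k) sum += a[i][k] * b[j][k];
      c[i][j] = alpha * sum;
    }
  }
  return c;
}

// Row-wise softmax over scale * scores. Columns with mask[j] == 0 get probability 0.
Matrix MaskedSoftmax(const Matrix& scores, const std::vector<int>& mask, float scale) {
  Matrix p(scores.size(), std::vector<float>(mask.size(), 0.0f));
  for (std::size_t i = 0; i < scores.size(); ++i) {
    // Subtract the row maximum before exponentiating, for numerical stability.
    float max_v = std::numeric_limits<float>::lowest();
    for (std::size_t j = 0; j < mask.size(); ++j)
      if (mask[j]) max_v = std::max(max_v, scale * scores[i][j]);
    float denom = 0.0f;
    for (std::size_t j = 0; j < mask.size(); ++j) {
      if (!mask[j]) continue;
      p[i][j] = std::exp(scale * scores[i][j] - max_v);
      denom += p[i][j];
    }
    if (denom > 0.0f)
      for (std::size_t j = 0; j < mask.size(); ++j) p[i][j] /= denom;
  }
  return p;
}

// Variant A: unscaled GEMM, scale applied inside the masked softmax.
Matrix ScaleInSoftmax(const Matrix& q, const Matrix& k,
                      const std::vector<int>& mask, float scale) {
  return MaskedSoftmax(GemmABt(q, k, 1.0f), mask, scale);
}

// Variant B: scale folded into the GEMM as alpha, plain masked softmax afterwards.
Matrix ScaleInGemm(const Matrix& q, const Matrix& k,
                   const std::vector<int>& mask, float scale) {
  return MaskedSoftmax(GemmABt(q, k, scale), mask, 1.0f);
}

// Largest element-wise absolute difference between two equally shaped matrices.
float MaxAbsDiff(const Matrix& x, const Matrix& y) {
  float d = 0.0f;
  for (std::size_t i = 0; i < x.size(); ++i)
    for (std::size_t j = 0; j < x[i].size(); ++j)
      d = std::max(d, std::fabs(x[i][j] - y[i][j]));
  return d;
}
```

With a scale of 1/sqrt(head_size), calling both variants on the same Q, K, and mask should give matching probabilities, which is why the softmax code no longer needs a scale parameter once the GEMM carries it.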