
Why is gpt2-xl (based on transformer-xl) ONNX slower than the original PyTorch? #11293

Open
lileilai opened this issue Apr 21, 2022 · 7 comments

Labels
core runtime (issues related to core runtime), model:transformer (issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc.)

Comments

@lileilai

Describe the bug
I have a Transformer-XL-based GPT-2-XL model (41 layers; Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context), with the code implemented by myself. After converting it to ONNX and optimizing it with gpt2_optimizer (LayerNormalization kernel fusion, FastGelu kernel fusion), and even with IOBinding, the inference time is still slower than the original PyTorch model.
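
For reference, a minimal sketch of the flow described above (export with torch.onnx.export, then run the GPT-2 graph optimizer). Names such as my_transformer_xl, dummy_ids, NUM_HEADS and HIDDEN are placeholders, not taken from the issue:

```python
import torch
from onnxruntime.transformers import optimizer

# Export the custom Transformer-XL / GPT-2-XL model to ONNX.
# my_transformer_xl and dummy_ids are hypothetical stand-ins for the author's model and inputs.
torch.onnx.export(
    my_transformer_xl,
    (dummy_ids,),
    "model.onnx",
    opset_version=12,
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"}},
)

# Apply the GPT-2 graph optimizations (LayerNormalization / FastGelu fusion, etc.).
opt_model = optimizer.optimize_model(
    "model.onnx",
    model_type="gpt2",
    num_heads=NUM_HEADS,   # placeholder: number of attention heads
    hidden_size=HIDDEN,    # placeholder: hidden dimension
)
opt_model.save_model_to_file("model_opt.onnx")
```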

Urgency
If there are particular important use cases blocked by this or strict project-related timelines, please share more information and dates. If there are no hard deadlines, please specify none.

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
  • ONNX Runtime installed from (source or binary):
  • ONNX Runtime version: 1.8.1
  • Python version: 3.8
  • Visual Studio version (if applicable):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: 11.3
  • GPU model and memory: 40G

To Reproduce

  • Describe steps/code to reproduce the behavior.
  • Attach the ONNX model to the issue (where applicable) to expedite investigation.

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Additional context
Add any other context about the problem here. If the issue is about a particular model, please share the model details as well to facilitate debugging.

(screenshot attached)

@pranavsharma
Contributor

Looks like you're using ver 1.8.1. Have you tried with the latest ORT ver? Also, attach the model and the repro code.
cc @tianleiwu

@tianleiwu
Contributor

@lileilai, to get the model fully optimized, you will need a custom Attention operator. The current Attention operator only applies to the self-attention in BERT and GPT-2; it cannot be applied to Transformer-XL.

See our guide if you would like to create a custom operator and fusion: https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/transformers/Dev_Guide.md
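
One quick way to confirm whether attention fusion actually happened is to count op types in the optimized graph. This is just a sketch, assuming the optimized model was saved as "model_opt.onnx":

```python
import onnx
from collections import Counter

# Count operator types in the optimized graph. If no fused "Attention" nodes
# appear, the Transformer-XL relative-position attention is still executed as
# many small ops, which can explain why ORT is not faster than PyTorch here.
m = onnx.load("model_opt.onnx")
print(Counter(node.op_type for node in m.graph.node))
```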

@lileilai
Author

Looks like you're using ver 1.8.1. Have you tried with the latest ORT ver? Also, attach the model and the repro code. cc @tianleiwu

Thanks for your reply; I will try the latest ORT version later.

@lileilai
Author

I have tried ORT 1.11, but I got the same numbers. My confusion is that when I use torch.onnx.export(opset_version=12), the exported ONNX model has slower inference performance than the original PyTorch model. Even compared with the baseline ONNX model without the additional kernel fusions (LayerNorm, Attention, FastGelu), this is abnormal.
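
A rough latency comparison along these lines might look like the sketch below (my_transformer_xl and dummy_ids are the same placeholders as earlier, and the measurement assumes a CUDA device):

```python
import time
import torch
import onnxruntime as ort

sess = ort.InferenceSession("model_opt.onnx", providers=["CUDAExecutionProvider"])
feeds = {"input_ids": dummy_ids.cpu().numpy()}  # placeholder input name/tensor

def bench(fn, warmup=10, iters=100):
    # Warm up, then time the average per-iteration latency in milliseconds.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1000.0

with torch.no_grad():
    print("pytorch ms:    ", bench(lambda: my_transformer_xl(dummy_ids)))
print("onnxruntime ms:", bench(lambda: sess.run(None, feeds)))
```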

@sophies927 sophies927 added model:transformer issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc. core runtime issues related to core runtime and removed model:GPT2 labels Aug 12, 2022
@elephantpanda

Hi, could you share how you used IOBinding for this model, and did it give a speed-up? I am trying to implement something similar myself.
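
For reference, the usual IOBinding pattern looks roughly like this (a sketch, not the author's code; the input/output names, shapes, and dtypes are assumptions):

```python
import numpy as np
import torch
import onnxruntime as ort

sess = ort.InferenceSession("model_opt.onnx", providers=["CUDAExecutionProvider"])

# Pre-allocate GPU tensors so ORT reads and writes device memory directly,
# avoiding host<->device copies on every call.
input_ids = torch.randint(0, 50257, (1, 128), dtype=torch.int64, device="cuda")
logits = torch.empty((1, 128, 50257), dtype=torch.float32, device="cuda")

binding = sess.io_binding()
binding.bind_input(
    name="input_ids", device_type="cuda", device_id=0,
    element_type=np.int64, shape=tuple(input_ids.shape),
    buffer_ptr=input_ids.data_ptr(),
)
binding.bind_output(
    name="logits", device_type="cuda", device_id=0,
    element_type=np.float32, shape=tuple(logits.shape),
    buffer_ptr=logits.data_ptr(),
)
sess.run_with_iobinding(binding)  # results land in the pre-allocated logits tensor
```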
