
onnx speed is even slower #414

Closed · 2 of 4 tasks
chaodreaming opened this issue Oct 5, 2022 · 14 comments

Labels: inference (Related to Inference), onnxruntime (Related to ONNX Runtime)

@chaodreaming

System Info

Windows 10
Python 3.8.4
PyTorch 1.12.1 (CPU)
transformers 4.22.2
optimum 1.4.0
onnxruntime 1.12.1

Who can help?

@Narsil
@patil-suraj

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM
import warnings
import time

warnings.filterwarnings("ignore")

text = "Vehicle detection technology is of great significance for realizing automatic monitoring and AI-assisted driving systems. The state-of-the-art object detection method, namely, a class of YOLOv5, has often been used to detect vehicles."
textlists = [text, text, text, text, text]

# ONNX Runtime model exported via Optimum
model_checkpoint = "Helsinki-NLP/opus-mt-en-zh"
model = ORTModelForSeq2SeqLM.from_pretrained(model_checkpoint, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

model.save_pretrained("onnx")
tokenizer.save_pretrained("onnx")

onnx_translation = pipeline("translation_en_to_zh", model=model, tokenizer=tokenizer)
t1 = time.time()
result = onnx_translation(textlists)
print(result, time.time() - t1)

# Plain PyTorch MarianMT baseline
from transformers import MarianTokenizer, MarianMTModel

modchoice = "Helsinki-NLP/opus-mt-en-zh"
tokenizer = MarianTokenizer.from_pretrained(modchoice)
model = MarianMTModel.from_pretrained(modchoice)

t1 = time.time()
encoded = tokenizer.prepare_seq2seq_batch(
    textlists,
    truncation=True,
    padding="longest",
    return_tensors="pt",
)

device = "cpu"  # `device` was undefined in the original snippet; CPU matches the system info above
encoded = encoded.to(device)

translated = model.generate(**encoded)

tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
print(tgt_text, time.time() - t1)
```

With ONNX Runtime, batch processing is much slower than with plain PyTorch, and single-sentence processing is only slightly faster.

Expected behavior

Faster batch processing


Narsil commented Oct 5, 2022

Hi, I am going to forward this issue to Optimum, since I think that is where it belongs.

Also, you can use triple backticks ``` to display code better. Cheers!

Narsil transferred this issue from huggingface/transformers on Oct 5, 2022

Narsil commented Oct 5, 2022

Tagging @mfuntowicz @michaelbenayoun for visibility. Sorry if the transfer wasn't correct.

@chaodreaming (Author)

```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM
import warnings
import time

warnings.filterwarnings("ignore")

text = "Vehicle detection technology is of great significance for realizing automatic monitoring and AI-assisted driving systems. The state-of-the-art object detection method, namely, a class of YOLOv5, has often been used to detect vehicles."
textlists = [text, text, text, text, text]

model_checkpoint = "Helsinki-NLP/opus-mt-en-zh"
model = ORTModelForSeq2SeqLM.from_pretrained(model_checkpoint, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
onnx_translation = pipeline("translation_en_to_zh", model=model, tokenizer=tokenizer)
t1 = time.time()
result = onnx_translation(textlists)
print(result, time.time() - t1)

from transformers import MarianTokenizer, MarianMTModel

modchoice = "Helsinki-NLP/opus-mt-en-zh"
tokenizer = MarianTokenizer.from_pretrained(modchoice)
model = MarianMTModel.from_pretrained(modchoice)

t1 = time.time()
encoded = tokenizer.prepare_seq2seq_batch(
    textlists,
    truncation=True,
    padding="longest",
    return_tensors="pt",
)
translated = model.generate(**encoded)

tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
print(tgt_text, time.time() - t1)
```

@chaodreaming (Author)

I removed the comments, so it can be run directly to reproduce the issue.


Narsil commented Oct 5, 2022

@CatchDr this code is not usable; you NEED to use ``` to make it a code block and show the proper indentation.

@chaodreaming (Author)

```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM
import warnings
import time

warnings.filterwarnings("ignore")

text = "Vehicle detection technology is of great significance for realizing automatic monitoring and AI-assisted driving systems. The state-of-the-art object detection method, namely, a class of YOLOv5, has often been used to detect vehicles."
textlists = [text, text, text, text, text]

model_checkpoint = "Helsinki-NLP/opus-mt-en-zh"
model = ORTModelForSeq2SeqLM.from_pretrained(model_checkpoint, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
onnx_translation = pipeline("translation_en_to_zh", model=model, tokenizer=tokenizer)
t1 = time.time()
result = onnx_translation(textlists)
print(result, time.time() - t1)

from transformers import MarianTokenizer, MarianMTModel

modchoice = "Helsinki-NLP/opus-mt-en-zh"
tokenizer = MarianTokenizer.from_pretrained(modchoice)
model = MarianMTModel.from_pretrained(modchoice)

t1 = time.time()
encoded = tokenizer.prepare_seq2seq_batch(
    textlists,
    truncation=True,
    padding="longest",
    return_tensors="pt",
)
translated = model.generate(**encoded)

tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
print(tgt_text, time.time() - t1)
```

@chaodreaming (Author)

test

@chaodreaming (Author)

```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM
import warnings
import time

warnings.filterwarnings("ignore")

text = "Vehicle detection technology is of great significance for realizing automatic monitoring and AI-assisted driving systems. The state-of-the-art object detection method, namely, a class of YOLOv5, has often been used to detect vehicles."
textlists = [text, text, text, text, text]

model_checkpoint = "Helsinki-NLP/opus-mt-en-zh"
model = ORTModelForSeq2SeqLM.from_pretrained(model_checkpoint, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
onnx_translation = pipeline("translation_en_to_zh", model=model, tokenizer=tokenizer)
t1 = time.time()
result = onnx_translation(textlists)
print(result, time.time() - t1)

from transformers import MarianTokenizer, MarianMTModel

modchoice = "Helsinki-NLP/opus-mt-en-zh"
tokenizer = MarianTokenizer.from_pretrained(modchoice)
model = MarianMTModel.from_pretrained(modchoice)

t1 = time.time()
encoded = tokenizer.prepare_seq2seq_batch(
    textlists,
    truncation=True,
    padding="longest",
    return_tensors="pt",
)
translated = model.generate(**encoded)

tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
print(tgt_text, time.time() - t1)
```

@chaodreaming (Author)

There are no functions and no indentation, so it should paste cleanly.

@chaodreaming (Author)

onnxruntime-gpu also does not speed things up. Does the pipeline need some setting to use the GPU?


Narsil commented Oct 5, 2022

Use `pipeline(..., device=0)` to use GPU 0, for instance.
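For concreteness, a minimal sketch of that suggestion applied to the snippet above. This assumes onnxruntime-gpu is installed; whether the underlying ONNX Runtime session actually runs on CUDA also depends on the Optimum version's device handling, so treat it as an illustration rather than a guaranteed recipe:

```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM

model_checkpoint = "Helsinki-NLP/opus-mt-en-zh"
model = ORTModelForSeq2SeqLM.from_pretrained(model_checkpoint, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# device=0 asks the pipeline to place tensors on GPU 0; onnxruntime-gpu must be
# installed for the ONNX Runtime session to use the CUDA execution provider.
onnx_translation = pipeline(
    "translation_en_to_zh", model=model, tokenizer=tokenizer, device=0
)
print(onnx_translation("Vehicle detection technology is of great significance."))
```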

@chaodreaming (Author)

It seems that the ONNX pipeline approach does not speed up batch processing, while the Marian (PyTorch) model's speed improves a lot with batching.

michaelbenayoun added the inference (Related to Inference) and onnxruntime (Related to ONNX Runtime) labels on Oct 14, 2022
JingyaHuang self-assigned this on Oct 14, 2022
JingyaHuang (Collaborator) commented Nov 2, 2022

Hi @CatchDr, to speed up inference on CPU, you can optimize the vanilla exported ONNX model with ORTOptimizer, and also try ORTQuantizer to compress the model for a further speedup.
On GPU, we now apply IO Binding on the device, and for some models inference can be faster than PyTorch even without optimization/quantization.
Let me know if you have any other questions or need extra help.
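For reference, a minimal sketch of the ORTOptimizer path on CPU, assuming the refactored optimizer API shipped around optimum 1.4; the optimization level, save directory, and reload step are illustrative and may need adjusting for your version:

```python
from optimum.onnxruntime import ORTModelForSeq2SeqLM, ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

model_checkpoint = "Helsinki-NLP/opus-mt-en-zh"

# Export the vanilla ONNX model (encoder + decoder) from the PyTorch checkpoint.
model = ORTModelForSeq2SeqLM.from_pretrained(model_checkpoint, from_transformers=True)

# Apply ONNX Runtime graph optimizations (node fusions, etc.) and save the result.
optimizer = ORTOptimizer.from_pretrained(model)
optimization_config = OptimizationConfig(optimization_level=2)
optimizer.optimize(save_dir="onnx_optimized", optimization_config=optimization_config)

# Reload the optimized model; depending on the Optimum version you may need to pass
# the optimized file names explicitly instead of relying on the defaults.
optimized_model = ORTModelForSeq2SeqLM.from_pretrained("onnx_optimized")
```

ORTQuantizer follows a similar from_pretrained/quantize pattern with a quantization config (for example dynamic quantization), though as far as I understand, for seq2seq models the encoder and decoder files are quantized individually.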

@JingyaHuang (Collaborator)

Closing the issue, as the previous latency issue has been addressed in #421. And here are some performance results from the community: mt5 / m2m100.

Feel free to re-open the issue if you still see performance issues with the latest Optimum.
