
onnx speed is even slower #414

Closed · 2 of 4 tasks
chaodreaming opened this issue Oct 5, 2022 · 14 comments

Labels: inference (Related to Inference), onnxruntime (Related to ONNX Runtime)

@chaodreaming

System Info

Windows 10
Python 3.8.4
PyTorch 1.12.1 (CPU)
transformers 4.22.2
optimum 1.4.0
onnxruntime 1.12.1

Who can help?

@Narsil
@patil-suraj

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM
import warnings
import time

warnings.filterwarnings("ignore")

text = "Vehicle detection technology is of great significance for realizing automatic monitoring and AI-assisted driving systems. The state-of-the-art object detection method, namely, a class of YOLOv5, has often been used to detect vehicles."
textlists = [text, text, text, text, text]

# ONNX Runtime model exported via Optimum
model_checkpoint = "Helsinki-NLP/opus-mt-en-zh"
model = ORTModelForSeq2SeqLM.from_pretrained(model_checkpoint, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

model.save_pretrained("onnx")
tokenizer.save_pretrained("onnx")

onnx_translation = pipeline("translation_en_to_zh", model=model, tokenizer=tokenizer)
t1 = time.time()
result = onnx_translation(textlists)
print(result, time.time() - t1)

# Plain PyTorch MarianMT baseline
from transformers import MarianTokenizer, MarianMTModel

modchoice = "Helsinki-NLP/opus-mt-en-zh"
tokenizer = MarianTokenizer.from_pretrained(modchoice)
model = MarianMTModel.from_pretrained(modchoice)

t1 = time.time()
encoded = tokenizer.prepare_seq2seq_batch(
    textlists,
    truncation=True,
    padding="longest",
    return_tensors="pt",
)

device = "cpu"  # `device` was undefined in the original snippet; CPU matches the system info above
encoded = encoded.to(device)

translated = model.generate(**encoded)

tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
print(tgt_text, time.time() - t1)
```

With ONNX Runtime, batch processing is much slower than with plain PyTorch, and single-sentence processing is only slightly faster.

Expected behavior

Faster batch processing


Narsil commented Oct 5, 2022

Hi, I am going to forward this issue to Optimum, since I think that is where it belongs.

Also, you can use triple backticks ``` to display code better. Cheers!

Narsil transferred this issue from huggingface/transformers on Oct 5, 2022

Narsil commented Oct 5, 2022

Tagging @mfuntowicz @michaelbenayoun for visibility. Sorry if the transfer wasn't correct.

@chaodreaming (Author)

```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM
import warnings
import time

warnings.filterwarnings("ignore")

text = "Vehicle detection technology is of great significance for realizing automatic monitoring and AI-assisted driving systems. The state-of-the-art object detection method, namely, a class of YOLOv5, has often been used to detect vehicles."
textlists = [text, text, text, text, text]

model_checkpoint = "Helsinki-NLP/opus-mt-en-zh"
model = ORTModelForSeq2SeqLM.from_pretrained(model_checkpoint, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
onnx_translation = pipeline("translation_en_to_zh", model=model, tokenizer=tokenizer)
t1 = time.time()
result = onnx_translation(textlists)
print(result, time.time() - t1)

from transformers import MarianTokenizer, MarianMTModel

modchoice = "Helsinki-NLP/opus-mt-en-zh"
tokenizer = MarianTokenizer.from_pretrained(modchoice)
model = MarianMTModel.from_pretrained(modchoice)

t1 = time.time()
encoded = tokenizer.prepare_seq2seq_batch(
    textlists,
    truncation=True,
    padding="longest",
    return_tensors="pt",
)
translated = model.generate(**encoded)

tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
print(tgt_text, time.time() - t1)
```

@chaodreaming (Author)

I removed the comments, so it can be run directly to reproduce the issue.


Narsil commented Oct 5, 2022

@CatchDr this code is not usable; you NEED to use ``` to make it a code block and show the proper indentation.

@chaodreaming (Author)

```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM
import warnings
import time

warnings.filterwarnings("ignore")

text = "Vehicle detection technology is of great significance for realizing automatic monitoring and AI-assisted driving systems. The state-of-the-art object detection method, namely, a class of YOLOv5, has often been used to detect vehicles."
textlists = [text, text, text, text, text]

model_checkpoint = "Helsinki-NLP/opus-mt-en-zh"
model = ORTModelForSeq2SeqLM.from_pretrained(model_checkpoint, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
onnx_translation = pipeline("translation_en_to_zh", model=model, tokenizer=tokenizer)
t1 = time.time()
result = onnx_translation(textlists)
print(result, time.time() - t1)

from transformers import MarianTokenizer, MarianMTModel

modchoice = "Helsinki-NLP/opus-mt-en-zh"
tokenizer = MarianTokenizer.from_pretrained(modchoice)
model = MarianMTModel.from_pretrained(modchoice)

t1 = time.time()
encoded = tokenizer.prepare_seq2seq_batch(
    textlists,
    truncation=True,
    padding="longest",
    return_tensors="pt",
)
translated = model.generate(**encoded)

tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
print(tgt_text, time.time() - t1)
```

@chaodreaming (Author)

test

@chaodreaming (Author)

```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM
import warnings
import time

warnings.filterwarnings("ignore")

text = "Vehicle detection technology is of great significance for realizing automatic monitoring and AI-assisted driving systems. The state-of-the-art object detection method, namely, a class of YOLOv5, has often been used to detect vehicles."
textlists = [text, text, text, text, text]

model_checkpoint = "Helsinki-NLP/opus-mt-en-zh"
model = ORTModelForSeq2SeqLM.from_pretrained(model_checkpoint, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
onnx_translation = pipeline("translation_en_to_zh", model=model, tokenizer=tokenizer)
t1 = time.time()
result = onnx_translation(textlists)
print(result, time.time() - t1)

from transformers import MarianTokenizer, MarianMTModel

modchoice = "Helsinki-NLP/opus-mt-en-zh"
tokenizer = MarianTokenizer.from_pretrained(modchoice)
model = MarianMTModel.from_pretrained(modchoice)

t1 = time.time()
encoded = tokenizer.prepare_seq2seq_batch(
    textlists,
    truncation=True,
    padding="longest",
    return_tensors="pt",
)
translated = model.generate(**encoded)

tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
print(tgt_text, time.time() - t1)
```

@chaodreaming (Author)

There are no functions and no indentation, so it should paste cleanly.

@chaodreaming (Author)

onnxruntime-gpu also does not speed things up. Does the pipeline need some setting to use the GPU?


Narsil commented Oct 5, 2022

Use `pipeline(..., device=0)` to use GPU 0, for instance.
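For concreteness, a minimal sketch of that suggestion applied to the snippet above. This assumes onnxruntime-gpu is installed; whether the underlying ONNX Runtime session actually runs on CUDA also depends on the Optimum version's device handling, so treat it as an illustration rather than a guaranteed recipe:

```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM

model_checkpoint = "Helsinki-NLP/opus-mt-en-zh"
model = ORTModelForSeq2SeqLM.from_pretrained(model_checkpoint, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# device=0 asks the pipeline to place tensors on GPU 0; onnxruntime-gpu must be
# installed for the ONNX Runtime session to use the CUDA execution provider.
onnx_translation = pipeline(
    "translation_en_to_zh", model=model, tokenizer=tokenizer, device=0
)
print(onnx_translation("Vehicle detection technology is of great significance."))
```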

@chaodreaming (Author)

It seems that the ONNX pipeline approach does not speed up batch processing, while the Marian (PyTorch) model's speed improves a lot with batching.

michaelbenayoun added the inference (Related to Inference) and onnxruntime (Related to ONNX Runtime) labels on Oct 14, 2022
JingyaHuang self-assigned this on Oct 14, 2022
JingyaHuang (Collaborator) commented Nov 2, 2022

Hi @CatchDr, to speed up inference on CPU, you can optimize the vanilla exported ONNX model with ORTOptimizer, and also try ORTQuantizer to compress the model for a further speedup.
On GPU, we now apply IO Binding on the device, and for some models inference can be faster than PyTorch even without optimization/quantization.
Let me know if you have any other questions or need extra help.
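For reference, a minimal sketch of the ORTOptimizer path on CPU, assuming the refactored optimizer API shipped around optimum 1.4; the optimization level, save directory, and reload step are illustrative and may need adjusting for your version:

```python
from optimum.onnxruntime import ORTModelForSeq2SeqLM, ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

model_checkpoint = "Helsinki-NLP/opus-mt-en-zh"

# Export the vanilla ONNX model (encoder + decoder) from the PyTorch checkpoint.
model = ORTModelForSeq2SeqLM.from_pretrained(model_checkpoint, from_transformers=True)

# Apply ONNX Runtime graph optimizations (node fusions, etc.) and save the result.
optimizer = ORTOptimizer.from_pretrained(model)
optimization_config = OptimizationConfig(optimization_level=2)
optimizer.optimize(save_dir="onnx_optimized", optimization_config=optimization_config)

# Reload the optimized model; depending on the Optimum version you may need to pass
# the optimized file names explicitly instead of relying on the defaults.
optimized_model = ORTModelForSeq2SeqLM.from_pretrained("onnx_optimized")
```

ORTQuantizer follows a similar from_pretrained/quantize pattern with a quantization config (for example dynamic quantization), though as far as I understand, for seq2seq models the encoder and decoder files are quantized individually.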

@JingyaHuang (Collaborator)

Closing the issue, as the previous latency issue has been addressed in #421. And here are some performance results from the community: mt5 / m2m100.

Feel free to re-open the issue if you still see performance issues with the latest Optimum.
