Inference worse with onnxruntime-gpu than native pytorch for seq2seq model #404
Comments
Hi @Matthieu-Tinycoaching, thank you for the report! Check my answer to your question 2 in the linked issue in the onnxruntime repo. Could you post the report of …? On my end, I run … Maybe try … with the right path for CUDA. For question 1, I am currently looking into it, stay tuned.
Answering your question in the issue in the onnxruntime repo: yes, passing … Given that you can load the model with CUDAExecutionProvider with my code snippet, I am not sure what goes wrong in yours. It's likely not an issue with the CUDA/onnxruntime install. How were your ONNX models exported?
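One way to sanity-check the onnxruntime-gpu install independently of any model is a sketch like the following (assuming the onnxruntime-gpu package is installed):

```python
import onnxruntime as ort

# "GPU" here means the installed onnxruntime build was compiled with CUDA support.
print(ort.get_device())
# "CUDAExecutionProvider" must appear in this list for GPU inference to be possible.
print(ort.get_available_providers())
```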
Hi @fxmarty, thanks for helping. For question 2, it seems that my GPU is working properly. Please find below the steps I used to export and optimize the ONNX models:
Please find below the output of the following commands:
1/ Did I do something wrong? Could you reproduce my results using FastAPI?
2/ Is there a way to get the prediction output by combining an ONNX InferenceSession with the 3 optimized ONNX models, instead of using Optimum? That way, I could check with a deployment framework other than FastAPI.
3/ How long does it take to export the ONNX model from torch? I tried to convert the model from my first code block above by adding …
Concerning 3/: the … 2/: not sure what you mean. Do you mean, could you use the …? In the first place, you no longer get the warning …?
Hi @fxmarty, OK, I regenerated all the ONNX models based on this script:
And I got both triplets (…). Then, I tried to compare load-test performance between the native torch model and the ONNX models (with/without optimizations) using FastAPI:
1/ The best results are the last ones, for the native PyTorch model. Could you reproduce this underperformance of the ONNX models (with or without optimization) relative to the native model?
2/ Do you mean, could you use encoder_model_optimized.onnx and decoder_model_optimized.onnx independently of Optimum? Yes: is it possible to sequentially call the triplet of ONNX models (see the sketch below)?
3/ I'll test tomorrow with the Docker image, but since performance is still worse with ONNX, I suppose it's still there...
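As an illustration of 2/, the exported triplet can in principle be called directly with onnxruntime, without Optimum. Below is a minimal greedy-decoding sketch that skips decoder_with_past_model.onnx (i.e. no KV cache); the file paths and the input/output names ("input_ids", "attention_mask", "encoder_hidden_states", "encoder_attention_mask") are assumptions about the exported graphs, to be verified with sess.get_inputs() / sess.get_outputs():

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
encoder = ort.InferenceSession("encoder_model.onnx", providers=providers)  # placeholder path
decoder = ort.InferenceSession("decoder_model.onnx", providers=providers)  # placeholder path
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-fr-en")

enc = tokenizer("Bonjour le monde", return_tensors="np")
# Run the encoder once; its first output is assumed to be last_hidden_state.
encoder_hidden_states = encoder.run(
    None, {"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]}
)[0]

# Marian models use the pad token as decoder_start_token_id.
ids = np.array([[tokenizer.pad_token_id]], dtype=np.int64)
for _ in range(64):
    # The decoder's first output is assumed to be the logits.
    logits = decoder.run(
        None,
        {
            "input_ids": ids,
            "encoder_attention_mask": enc["attention_mask"],
            "encoder_hidden_states": encoder_hidden_states,
        },
    )[0]
    next_id = int(logits[0, -1].argmax())  # greedy pick of the next token
    ids = np.concatenate([ids, [[next_id]]], axis=1)
    if next_id == tokenizer.eos_token_id:
        break

print(tokenizer.decode(ids[0], skip_special_tokens=True))
```

The *_optimized.onnx triplet can be swapped in the same way.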
For your concern about the inference speed, it could be caused by the overhead of data copying between CPU and GPU, as ONNX Runtime puts inputs and outputs on the CPU by default. I just merged the PR which adds IO binding support to avoid this issue. Do you want to test with our main branch (no code change needed, as …)? You can build Optimum from source with …
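For illustration, installing from source and enabling the feature could look like the sketch below; the pip URL is an assumption, and the model arguments are taken from elsewhere in this thread:

```python
# Assumed install command (build Optimum from source):
#   pip install git+https://github.com/huggingface/optimum.git
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer

# use_io_binding keeps inputs/outputs on the GPU, avoiding repeated
# CPU<->GPU copies during generation.
model = ORTModelForSeq2SeqLM.from_pretrained(
    "Helsinki-NLP/opus-mt-fr-en",
    from_transformers=True,
    provider="CUDAExecutionProvider",
    use_io_binding=True,
).to("cuda:0")
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-fr-en")

inputs = tokenizer("Bonjour le monde", return_tensors="pt").to("cuda:0")
output_ids = model.generate(**inputs, max_length=256, num_beams=5)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
```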
In my case, on a T5 seq2seq model, … When I test with the code below:

```python
import torch
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import MT5ForConditionalGeneration, T5Tokenizer


def load_pytorch(saved_path, device="cuda:0"):
    tokenizer = T5Tokenizer.from_pretrained("google/mt5-base")
    net = MT5ForConditionalGeneration.from_pretrained("google/mt5-base").to(device)
    net.eval()
    state_dict = torch.load(saved_path, map_location=device)["model_state_dict"]
    net.load_state_dict(state_dict)
    return tokenizer, net


def load_onnx(saved_path, device="cuda:0"):
    tokenizer = T5Tokenizer.from_pretrained(saved_path)
    net = ORTModelForSeq2SeqLM.from_pretrained(
        saved_path,
        encoder_file_name="encoder_model.onnx",
        decoder_file_name="decoder_model.onnx",
        decoder_with_past_file_name="decoder_with_past_model.onnx",
        use_io_binding=False,
        provider="CUDAExecutionProvider",
    ).to(device)
    return tokenizer, net


def inference(tokenizer, net, sentence, device="cuda:0"):
    input_ids = tokenizer.encode(sentence)
    input_ids = torch.tensor(input_ids).to(dtype=torch.int64, device=device)
    input_ids = torch.unsqueeze(input_ids, dim=0)
    output_ids = net.generate(input_ids, max_length=256, num_beams=5)
    translated = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return translated
```
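For completeness, a hypothetical driver for the loaders above (the directory path is a placeholder):

```python
# Hypothetical usage of the loaders defined above; the path is a placeholder.
tokenizer, net = load_onnx("path/to/exported_onnx_dir")
print(inference(tokenizer, net, "机器学习是人工智能的一个分支。"))
```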
Hi @soocheolnoh, thanks for testing the IO binding feature and sharing your results! I have done a quick test with beam search:

```python
# -*- coding: utf-8 -*-
import logging
import time

import torch
from transformers import AutoTokenizer
from tqdm import tqdm

logging.basicConfig(level=logging.INFO)

model_checkpoint = "facebook/m2m100_418M"
loop = 100
chinese_text = "机器学习是人工智能的一个分支。人工智能的研究历史有着一条从以“推理”为重点,到以“知识”为重点,再到以“学习”为重点的自然、清晰的脉络。显然,机器学习是实现人工智能的一个途径,即以机器学习为手段解决人工智能中的问题。机器学习在近30多年已发展为一门多领域交叉学科,涉及概率论、统计学、逼近论、凸分析(英语:Convex analysis)、计算复杂性理论等多门学科。机器学习理论主要是设计和分析一些让计算机可以自动“学习”的算法。机器学习算法是一类从数据中自动分析获得规律,并利用规律对未知数据进行预测的算法。因为学习算法中涉及了大量的统计学理论,机器学习与推断统计学联系尤为密切,也被称为统计学习理论。算法设计方面,机器学习理论关注可以实现的,行之有效的学习算法。很多推论问题属于无程序可循难度,所以部分的机器学习研究是开发容易处理的近似算法。机器学习已广泛应用于数据挖掘、计算机视觉、自然语言处理、生物特征识别、搜索引擎、医学诊断、检测信用卡欺诈、证券市场分析、DNA序列测序、语音和手写识别、战略游戏和机器人等领域"
logging.info(f"chinese_text is {chinese_text}")
logging.info(f"chinese_text length is {len(chinese_text)}")

device = torch.device("cuda:0")
logging.info(f"This test will use device: {device}")


def get_transformer_model():
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    model = M2M100ForConditionalGeneration.from_pretrained(model_checkpoint).to(device)
    tokenizer = M2M100Tokenizer.from_pretrained(model_checkpoint)
    return (model, tokenizer)


def get_optimum_onnx_model():
    from optimum.onnxruntime import ORTModelForSeq2SeqLM

    model = ORTModelForSeq2SeqLM.from_pretrained(model_checkpoint, from_transformers=True).to(device)
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
    return (model, tokenizer)


def translate():
    # (model, tokenizer) = get_optimum_onnx_model()
    (model, tokenizer) = get_transformer_model()

    # Warm-up
    for i in range(10):
        encoded_zh = tokenizer(chinese_text, return_tensors="pt").to(device)
        generated_tokens = model.generate(
            **encoded_zh,
            forced_bos_token_id=tokenizer.get_lang_id("en"),
            max_length=256,
            num_beams=5,
        )
        result = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
        logging.debug(f"#{i}: {result}")

    start = time.time()
    for i in tqdm(range(loop)):
        encoded_zh = tokenizer(chinese_text, return_tensors="pt").to(device)
        generated_tokens = model.generate(
            **encoded_zh,
            forced_bos_token_id=tokenizer.get_lang_id("en"),
            max_length=256,
            num_beams=5,
        )
    end = time.time()

    total_time = end - start
    logging.info(f"total: {total_time}")
    logging.info(f"loop: {loop}")
    logging.info(f"avg(s): {total_time / loop}")
    logging.info(f"throughput(translation/s): {loop / total_time}")


if __name__ == "__main__":
    translate()
```

Here are the results that I got with a T4, ORTModel with IO binding vs. PyTorch:
Can you test the snippet to see if you can get something similar on your end, or share your entire script so that I can try to reproduce your experiment?
@JingyaHuang Thanks for your response! I tried your script and got similar results (pytorch: 4.074 s, optimum: 2.7825 s average time). Then I ran the same comparison with my own model:

```python
# -*- coding: utf-8 -*-
import logging
import time

import torch
from tqdm import tqdm
from transformers import AutoTokenizer

logging.basicConfig(level=logging.INFO)

# model_checkpoint = "facebook/m2m100_418M"
model_checkpoint = "K024/mt5-zh-ja-en-trimmed"
loop = 100
chinese_text = "zh2en: 机器学习是人工智能的一个分支。人工智能的研究历史有着一条从以“推理”为重点,到以“知识”为重点,再到以“学习”为重点的自然、清晰的脉络。显然,机器学习是实现人工智能的一个途径,即以机器学习为手段解决人工智能中的问题。机器学习在近30多年已发展为一门多领域交叉学科,涉及概率论、统计学、逼近论、凸分析(英语:Convex analysis)、计算复杂性理论等多门学科。机器学习理论主要是设计和分析一些让计算机可以自动“学习”的算法。机器学习算法是一类从数据中自动分析获得规律,并利用规律对未知数据进行预测的算法。因为学习算法中涉及了大量的统计学理论,机器学习与推断统计学联系尤为密切,也被称为统计学习理论。算法设计方面,机器学习理论关注可以实现的,行之有效的学习算法。很多推论问题属于无程序可循难度,所以部分的机器学习研究是开发容易处理的近似算法。机器学习已广泛应用于数据挖掘、计算机视觉、自然语言处理、生物特征识别、搜索引擎、医学诊断、检测信用卡欺诈、证券市场分析、DNA序列测序、语音和手写识别、战略游戏和机器人等领域"
logging.info(f"chinese_text is {chinese_text}")
logging.info(f"chinese_text length is {len(chinese_text)}")

device = torch.device("cuda:0")
logging.info(f"This test will use device: {device}")


def get_transformer_model():
    # from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
    # model = M2M100ForConditionalGeneration.from_pretrained(model_checkpoint).to(device)
    # tokenizer = M2M100Tokenizer.from_pretrained(model_checkpoint)
    from transformers import MT5ForConditionalGeneration, T5Tokenizer

    model = MT5ForConditionalGeneration.from_pretrained(model_checkpoint).to(device)
    tokenizer = T5Tokenizer.from_pretrained(model_checkpoint)
    return (model, tokenizer)


def get_optimum_onnx_model():
    from optimum.onnxruntime import ORTModelForSeq2SeqLM
    from transformers import T5Tokenizer

    model = ORTModelForSeq2SeqLM.from_pretrained(model_checkpoint, from_transformers=True).to(device)
    tokenizer = T5Tokenizer.from_pretrained(model_checkpoint)
    return (model, tokenizer)


def translate():
    (model, tokenizer) = get_optimum_onnx_model()
    # (model, tokenizer) = get_transformer_model()

    # Warm-up
    for i in range(10):
        encoded_zh = tokenizer(chinese_text, return_tensors="pt").to(device)
        generated_tokens = model.generate(
            **encoded_zh,
            # forced_bos_token_id=tokenizer.get_lang_id("en"),
            max_length=256,
            num_beams=5,
        )
        result = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
        logging.debug(f"#{i}: {result}")

    start = time.time()
    for i in tqdm(range(loop)):
        encoded_zh = tokenizer(chinese_text, return_tensors="pt").to(device)
        generated_tokens = model.generate(
            **encoded_zh,
            # forced_bos_token_id=tokenizer.get_lang_id("en"),
            max_length=256,
            num_beams=5,
        )
    end = time.time()

    total_time = end - start
    logging.info(f"total: {total_time}")
    logging.info(f"loop: {loop}")
    logging.info(f"avg(s): {total_time / loop}")
    logging.info(f"throughput(translation/s): {loop / total_time}")


if __name__ == "__main__":
    translate()
```

The result (run on a V100) is:
When I ran the pytorch model, I had no warnings, but in optimum I got the following warnings:
Also, the translation results are different (the optimum result seems strange):
I think the difference between the average times could simply come from the different outputs (because of the length of the generated tokens?) or from other causes. But I'm not sure why the result using optimum is strange.
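One way to control for the output-length difference mentioned above is to report tokens per second instead of translations per second; a sketch reusing generated_tokens, loop and total_time from the script above:

```python
# Sketch: normalize throughput by generated tokens, so runs whose outputs have
# different lengths stay comparable (same input => same token count every loop).
n_tokens = int(generated_tokens.numel())
logging.info(f"throughput(tokens/s): {loop * n_tokens / total_time}")
```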
Hi @soocheolnoh, thanks for testing. On my side, for the mt5 model, the generated results are different with vs. without IO binding, which is not normal: IO binding is not supposed to change the result (only where the data is placed should differ). It might be a bug in the post-processing of the outputs. I will take a closer look and fix the beam search ASAP.
Hi @soocheolnoh The fix has been done; there was a bug in the output population, thanks for pointing it out. Now you should get the same translation result with or without IO binding.
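A quick sketch to verify this, reusing model_checkpoint, tokenizer, chinese_text and device from the snippets above:

```python
from optimum.onnxruntime import ORTModelForSeq2SeqLM

# The decoded outputs should now match with and without IO binding.
with_io = ORTModelForSeq2SeqLM.from_pretrained(
    model_checkpoint, from_transformers=True, use_io_binding=True
).to(device)
without_io = ORTModelForSeq2SeqLM.from_pretrained(
    model_checkpoint, from_transformers=True, use_io_binding=False
).to(device)

encoded = tokenizer(chinese_text, return_tensors="pt").to(device)
a = with_io.generate(**encoded, max_length=256, num_beams=5)
b = without_io.generate(**encoded, max_length=256, num_beams=5)
assert tokenizer.batch_decode(a, skip_special_tokens=True) == tokenizer.batch_decode(
    b, skip_special_tokens=True
)
```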
Also, sharing some performance numbers tested with the previous snippet here:
The issue is closed, but feel free to reopen it or ping me if you have extra questions about IO binding. @soocheolnoh @Matthieu-Tinycoaching, thanks again for helping us improve Optimum.
Thanks for the quick response!! @JingyaHuang
System Info
Who can help?
@JingyaHuang @echarlaix
Information

Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
I compared inference on GPU of a native torch `Helsinki-NLP/opus-mt-fr-en` model with the ONNX model optimized thanks to the Optimum library. So, I defined a FastAPI microservice based on the two classes below, for GPU torch and optimized ONNX respectively:

Expected behavior
When load testing the model on my local computer, I was surprised by two things:
```
2022-09-28 08:20:21.214094612 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:566 CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Please reference https://onnxruntime.ai/docs/reference/execution-providers/CUDA-ExecutionProvider.html#requirements to ensure all dependencies are met.
```
Does this mean the `CUDAExecutionProvider` is not working even though I set it in …?

What could have caused that? I saw at https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html that CUDA 11.6 is not mentioned; could it be this?