Pipeline seems slower in 4.11+ #14125
Hi @Dref360, first of all, thanks for the script and benchmarks, very helpful.

Short answer: PR #13724 (batch support in pipelines) should solve your specific use case.

Long answer: your example is slightly odd, being a single token repeated 200 times, so batching yields better results than not batching. If you use longer sentences, you can get better or worse performance:

12s on 4.10.3
11s on 4.10.3

That somehow seems to average out on random strings, leading to closer performance (in our internal testing), which is why we're NOT batching by default anymore. The biggest win from batching would be on GPU (not your case), but 4.10 performance on GPU was pretty bad anyway because the pipeline API didn't allow for proper streaming. Your example is the perfect case where batching yields vastly faster results, but it might not be representative of other workloads (just a caveat for readers: always measure performance on your own models/data to find out what works best). On longer sequences the matrix multiplications get larger, and batching does not let the CPU achieve better throughput than running unbatched (GPUs do benefit from larger payloads).

For core maintainers @LysandreJik, @sgugger: with the proposed PR for batch support in pipelines, we can also take the opportunity to use it. We could also do as we did for overall inputs/outputs and have some way for pipelines to express whether they batch by default or not, so as to be as backward compatible as possible in terms of performance. I am unsure it's worth it (as mentioned in this comment, performance might have increased on other models/data), but it is definitely an option. This is definitely something that was overlooked: when I observed similar performance, I didn't look at these kinds of inputs, where it does make a difference.
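Not from the thread, but a minimal sketch of the kind of measurement recommended above: timing the same pipeline with and without the batch_size option added by #13724, on your own data. The model name, input, and batch size here are illustrative assumptions.

```python
import time
from transformers import pipeline

# Illustrative benchmark (assumed model and data, not the reporter's script).
clf = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

texts = ["this is a short example sentence"] * 1000  # swap in your real data

start = time.time()
clf(texts)  # default behaviour: no batching
print(f"no batching:   {time.time() - start:.1f}s")

start = time.time()
clf(texts, batch_size=16)  # batching knob from PR #13724 (4.12+)
print(f"batch_size=16: {time.time() - start:.1f}s")
```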
Ah! Thank you for the quick and very detailed response. My use case is mostly small sentences, so that must be why we saw such a massive slowdown. Thank you for your help! We will wait for this PR to be merged :)
@Narsil, in your comment (the long answer) you mention that GPU performance is poor for transformers pipelines in previous versions (4.10 or earlier). I'm currently using 4.12.0 and observe that the GPU isn't fully utilized. I'm using a Hugging Face sentiment-analysis model, https://huggingface.co/yiyanghkust/finbert-tone, with the following setup:

Environment:

Hardware:

Software:
I'm running the following code:
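The code block itself was not preserved in this copy of the thread; below is a hedged reconstruction of what the call presumably looked like, based on the model link above and the 1000-sentence input mentioned later. All details are assumptions.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Reconstruction (assumed details -- the original snippet was not preserved).
tokenizer = AutoTokenizer.from_pretrained("yiyanghkust/finbert-tone")
model = AutoModelForSequenceClassification.from_pretrained("yiyanghkust/finbert-tone")

nlp = pipeline(
    "sentiment-analysis",
    model=model,
    tokenizer=tokenizer,
    device=0,  # first GPU
)

# ~1000 short financial sentences (placeholder input)
sentences = ["there is a shortage of capital, and we need extra financing"] * 1000
results = nlp(sentences)
```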
Running gpustat (https://pypi.org/project/gpustat/) on the node while the above code is running (which takes about 35 seconds) reports the following:

1026-105747-9wr7sbr7-10-139-64-13 Fri Oct 29 09:32:34 2021 450.80.02

Here you see that only 39% of the GPU is used. Is there a reason why it isn't near 100%?

For comparison, the same model based on pytorch_pretrained_bert only, which can be found here: https://github.com/yya518/FinBERT/blob/master/FinBert%20Model%20Example.ipynb, performs significantly faster on the same input sentences (approx. 5 seconds), and with that approach the GPU usage is close to full capacity. Although I only tested this model, I suspect GPU inference with other transformer models will also underutilize the GPU. Will the ability to set the batch size greater than 1 via the PR help with this? I see that PR #13724 has been merged. When can we expect the next release? Thanks!
Hi @alwayscurious,

You are reading 100% GPU usage but getting much lower speed in your colab example because all your examples are padded to a fixed length, so the GPU spends most of its cycles on padding tokens. The ~50% GPU utilization of the first example is because the example+model is a bit small, so not all of the GPU is needed to run it, meaning part of the GPU is idle. However, it is still running faster than the "old" example, because it isn't wasting cycles on the padded tokens. If I remove the padding, I fall back to roughly comparable timings.

By adding a batch size, I get down to approximately 5 seconds on my GTX 970. On a T4 you might be able to push the batch size even further, but I always tell users to be careful: running on mock data and on real data is likely to differ, and by adding more to the batch you risk OOM errors on live data whose sequences might be longer. There are other knobs you can tune as well.

Did I omit anything in this analysis?
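As a concrete illustration of the batching knob discussed above (the inline snippets were lost from the comment), something along these lines; the batch size is only a starting point to tune, not a recommendation:

```python
from transformers import pipeline

# Sketch of GPU batching on the pipeline (assumed values, not the author's code).
# A larger batch_size usually raises GPU utilization on short sequences,
# but can OOM on real data with longer inputs -- measure on your own workload.
nlp = pipeline(
    "sentiment-analysis",
    model="yiyanghkust/finbert-tone",
    device=0,       # run on the first GPU
    batch_size=64,  # try 8/16/32/64 and keep the fastest that fits in memory
)

sentences = ["earnings beat expectations this quarter"] * 1000  # placeholder data
results = nlp(sentences)
```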
I don't see the issue on master. Thank you!
@Narsil thanks for your insightful reply! Indeed, I also observed the 35 seconds with transformers vs. approx. 3 minutes with the "old code" that you mention, after fixing a bug in the original "old" code (I was using a previous version of the notebook and realized it had a bug compared to the latest one in the link). I used the latest release of transformers (4.12.2).

Regarding the input setup, you are correct to add 1000; I forgot to add that to the notebook (I only included it in the code snippet). From the release notes of 4.12.2 I see that batch_size is included: v4.12.2...master. Can I use this version, or do I need to build the transformers package manually from master? I set the batch size to 64 but continue to see approx. 35 seconds for inference, compared to the approx. 5 seconds that you observe on your GTX 970. I'll set up a colab notebook with a GPU runtime to verify. Thanks again!
@Narsil, triggered by @Dref360's comment, I realized that installing directly from master (see the sketch below) gives a package built from the latest commits.
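The exact command was not preserved in this copy of the thread; presumably it was a pip install straight from the main branch using pip's git support, along these lines (a reconstruction, not the reporter's verbatim command):

```bash
pip install git+https://github.com/huggingface/transformers
```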
With that install, I verified the performance you observed with a batch size of 64 (approx. 7 s on a K80 GPU). I included a link to the notebook for reference: Huggingface FinBertTone Model performance on GPU. Thanks again for your help and the reference! :)
@alwayscurious glad to be of help. Again, the right batch_size will depend on data + model + hardware, so try to keep track of some measurements if possible (GPU utilization is the easy one, the amount of padding is another; measuring everything will slow you down, so... :) ). Enabling automated batch_size selection is something we would like to do, but it's quite tricky and maybe not worth it. At least now you are in control.
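Not from the thread, but a small sketch of the padding measurement hinted at above: estimate how much of a tokenized batch is padding, which indicates whether batching is wasting compute on your data. The tokenizer name and sample batch are assumptions.

```python
from transformers import AutoTokenizer

# Rough padding-fraction estimate for a batch (illustrative sketch).
tokenizer = AutoTokenizer.from_pretrained("yiyanghkust/finbert-tone")

batch = [
    "a very short sentence",
    "a much longer sentence that forces quite a lot of padding onto the shorter entries in the same batch",
]
enc = tokenizer(batch, padding=True, return_tensors="pt")

total_tokens = enc["attention_mask"].numel()   # all positions, padding included
real_tokens = int(enc["attention_mask"].sum()) # non-padding positions
print(f"padding fraction: {1 - real_tokens / total_tokens:.1%}")
```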
I'll close the issue now that it is merged on master. Cheers!
Hello! When I upgraded Transformers, I got a massive slowdown. Might be related to the new DataLoader used in Pipeline.
Happy to help!
Cheers,
Environment info
Environment
transformers version: 4.12.0.dev0
Who can help
Models:
Library:
Model I am using (Bert, XLNet ...): DistilBert, but I suspect this applies to all Pipelines.
The problem arises when using:
The tasks I am working on are:
To reproduce
Steps to reproduce the behavior:
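The reproduction script itself was not preserved in this copy of the issue; below is a hedged sketch of what such a benchmark presumably looked like, given the later discussion of very short, repetitive inputs. The token string, counts, and model choice are assumptions.

```python
import time
from transformers import pipeline

# Illustrative reproduction: time the default sentiment pipeline on many
# short, repetitive inputs and compare wall-clock time between a
# transformers 4.10.x install and a 4.11+/4.12 install of the same script.
clf = pipeline("sentiment-analysis")  # DistilBERT-based default, as in the report

texts = ["hello " * 200] * 500  # one token repeated 200 times, many examples (assumed shape)

start = time.time()
clf(texts)
print(f"pipeline inference took {time.time() - start:.1f}s")
```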
Expected behavior
I would expect the same performance if possible, or a way to bypass the PyTorch DataLoader.