Implement a `batch_size` parameter in the `pipeline` object #13141
Comments
@sgugger ?
Batching at inference time is something to be very cautious about, because alignment (padding) might heavily penalize the speed of inference. See #11251 and https://gist.github.com/Narsil/ee5c09875e74fa6f018dc6d014f6c06c for more information. The CUDA OOM errors are most likely due to the fact that you are padding way too much, and they actually showcase the slowdown. The big refactor mentioned by @LysandreJik is ready here: https://github.com/Narsil/transformers/tree/iterable_pipelines. With said PR, you should be able to stream all your data to the GPU (like a DataLoader), leading to a massive speedup. And if you want to do batching because you know it will speed things up (please measure on real payloads; the gain is unlikely to be significant, so make sure it really is a speedup), you can do it manually, roughly as in the sketch below.
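A minimal sketch of what manual batching could look like, assuming the inputs are plain strings; the task, device, and batch size below are illustrative, not taken from the thread:

```python
from transformers import pipeline

# Illustrative task and device; any pipeline that accepts a list of strings works.
pipe = pipeline("sentiment-analysis", device=0)  # device=0 -> first GPU

sentences = ["some example sentence"] * 10_000   # placeholder data
batch_size = 32                                  # tune against real payloads

results = []
for start in range(0, len(sentences), batch_size):
    chunk = sentences[start:start + batch_size]
    # Each call pads at most within the chunk, which keeps padding overhead
    # and GPU memory usage bounded.
    results.extend(pipe(chunk))
```

Measuring a few batch sizes on representative payloads, as the comment suggests, is the only reliable way to know whether batching helps at all.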
Great (useful) work @Narsil, thanks a lot. Is it planned to be released in the next version?
I don't think it will make it in time; it's a pretty massive change, and we're pulling in stuff bit by bit to make sure we're not breaking anything (we're in a phase where we're strengthening the tests first).
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@xegulon the modifications have landed in master; can you confirm it speeds up inference without the need for a `batch_size`?
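A rough sketch of how one might verify this, assuming the post-refactor streaming API that recent transformers releases document (the task, data, and batch size here are illustrative):

```python
from transformers import pipeline

pipe = pipeline("text-classification", device=0)  # illustrative task

def data():
    # Yield inputs one at a time; the pipeline streams them to the GPU
    # instead of materializing (and padding) everything up front.
    for i in range(10_000):
        yield f"example sentence number {i}"

# Outputs are produced lazily as the inputs are consumed.
for prediction in pipe(data()):
    pass  # replace with real handling of each prediction

# If measurements show batching actually helps, a batch size can be passed
# at call time, e.g. pipe(data(), batch_size=8).
```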
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
🚀 Feature request
Implement a `batch_size` parameter in the `pipeline` object, so that when we call it, it computes the predictions by batches of sentences and does not get CUDA Out of Memory errors. Ideally, this optional argument would have a good default, computed from the tokenizer's parameters and the hardware the code is running on.
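A hypothetical sketch of the requested usage; the `batch_size` argument and its automatic default are exactly what is being proposed here, not something the pipeline already exposed at the time:

```python
from transformers import pipeline

pipe = pipeline("sentiment-analysis", device=0)

sentences = ["a long list of sentences ..."] * 100_000  # placeholder data

# Proposed behaviour: the pipeline splits the inputs into batches internally
# (with a sensible default derived from the tokenizer and the hardware) so a
# single call does not blow up GPU memory.
predictions = pipe(sentences, batch_size=64)
```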
References to this need in the forum:
https://discuss.huggingface.co/t/how-to-make-pipeline-automatically-scale/7432/3
https://discuss.huggingface.co/t/how-to-change-the-batch-size-in-a-pipeline/8738
Motivation
When making inference on a very long list of sentences using the `pipeline` object, I often get CUDA OOM errors.
Your contribution
I could try :)