
Implement a batch_size parameter in the pipeline object #13141

Closed
xegulon opened this issue Aug 16, 2021 · 8 comments

Comments


xegulon commented Aug 16, 2021

🚀 Feature request

Implement a batch_size parameter in the pipeline object, so that when we call it, it computes the predictions in batches of sentences and does not run into CUDA out-of-memory errors.

Ideally, this optional argument would have a good default, computed from the tokenizer's parameters and the hardware the code is running on.

References to this need in the forum:
https://discuss.huggingface.co/t/how-to-make-pipeline-automatically-scale/7432/3
https://discuss.huggingface.co/t/how-to-change-the-batch-size-in-a-pipeline/8738
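
For illustration, a minimal sketch of how the requested argument might be used from the caller's side (the `batch_size` name and its behavior here are assumptions taken from this request, not the API as released at the time):

```python
# Hypothetical usage of the requested argument; `batch_size` and its behavior
# are assumptions from this feature request, not the released pipeline API.
from transformers import pipeline

classifier = pipeline("text-classification", device=0)  # assumes a GPU at index 0

sentences = ["I love this.", "This is terrible."] * 1000

# The idea: the pipeline splits the inputs into batches internally, instead of
# the caller chunking the list by hand to avoid running out of GPU memory.
predictions = classifier(sentences, batch_size=32)
```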

Motivation

When running inference on a very long list of sentences using the pipeline object, I often get CUDA OOM errors.

Your contribution

I could try :)


xegulon commented Aug 23, 2021

@sgugger ?

LysandreJik commented

Hello @xegulon, this is in line with some work currently underway by @Narsil.


Narsil commented Aug 23, 2021

@xegulon,

Batching at inference time is something to be very cautious about, because alignment (padding inputs to a common length) might heavily penalize the speed of inference.

See #11251 and https://gist.github.com/Narsil/ee5c09875e74fa6f018dc6d014f6c06c for more information.

CUDA OOM errors are most likely due to the fact that you are padding far too much, which also illustrates the slowdown.
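
As a concrete illustration of that padding cost (a minimal sketch; the tokenizer and inputs are examples, not taken from the thread):

```python
# Small illustration of how padding inflates a mixed batch (example is mine).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = ["short", "a " + "very " * 400 + "long sentence"]

# Batched together, the short input is padded up to the long one's length, so
# the model does (and stores) far more work than the short text requires.
encoded = tok(batch, padding=True, return_tensors="pt")
print(encoded["input_ids"].shape)  # roughly (2, ~406): the first row is almost all padding
```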

The big refactor mentioned by @LysandreJik is ready here: https://github.com/Narsil/transformers/tree/iterable_pipelines

With that PR, you should be able to actually stream all your data to the GPU, leading to a massive speedup (like DataLoader). If you want to batch because you know it will bring a speedup (please measure on real payloads; the gain is unlikely to be significant, so make sure it really is a speedup), you can do it manually using DataLoader, preprocess, forward, and postprocess.
The proposed PR will also use DataLoader (for pt) by default if you send lists, and you can pass Datasets directly.
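
A minimal sketch of the streaming pattern described above, assuming the refactored pipeline accepts iterable inputs as in the linked branch (the file name and task are placeholders):

```python
from transformers import pipeline

pipe = pipeline("text-classification", device=0)

def data():
    # Yield inputs one at a time so nothing forces the whole list to be
    # materialized and padded as one huge batch.
    with open("sentences.txt", encoding="utf-8") as f:
        for line in f:
            yield line.strip()

# With an iterable input the pipeline can keep the GPU fed continuously
# (DataLoader-style) instead of waiting on a single large, heavily padded batch.
for prediction in pipe(data()):
    print(prediction)
```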


xegulon commented Aug 26, 2021

Great (useful) work @Narsil, thanks a lot. Is it planned to be released in v4.10.0?


Narsil commented Aug 26, 2021

I don't think it will make it in time; it's a pretty massive change, and we're pulling things in bit by bit to make sure we're not breaking anything (we're currently in a phase of strengthening the tests first).

github-actions bot commented

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.


Narsil commented Sep 20, 2021

@xegulon, the modifications have landed in master. Can you confirm they speed up inference without the need for batch_size?
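
One way to check that on master might look like the following (a sketch; the timing harness and inputs are mine, not from the thread):

```python
import time
from transformers import pipeline

pipe = pipeline("text-classification", device=0)
sentences = ["An example sentence of moderate length."] * 2000

start = time.perf_counter()
# On master the pipeline is expected to stream a plain list through a
# DataLoader under the hood, so no explicit batch_size should be needed.
results = list(pipe(sentences))
print(f"{len(results)} predictions in {time.perf_counter() - start:.1f}s")
```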


github-actions bot commented Nov 8, 2021

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Narsil closed this as completed Nov 8, 2021