[Whisper] Computing features on GPU in batch mode for whisper feature extractor. #29900
Conversation
Very nice addition!
Looks good. Let's add a test that uses `require_torch_gpu` to make sure this runs on GPU!
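A minimal sketch of what such a test could look like (the test class, test name, and tolerance below are assumptions for illustration, not part of this PR):

```python
import unittest

import numpy as np

from transformers import WhisperFeatureExtractor
from transformers.testing_utils import require_torch_gpu


class WhisperFeatureExtractorGPUTest(unittest.TestCase):
    @require_torch_gpu
    def test_torch_extract_fbank_features_on_gpu(self):
        # Hypothetical test: the GPU path should match the default (CPU) path numerically.
        feature_extractor = WhisperFeatureExtractor()
        speech = [np.random.rand(16_000).astype(np.float32) for _ in range(2)]

        cpu_features = feature_extractor(speech, sampling_rate=16_000, return_tensors="np").input_features
        gpu_features = feature_extractor(
            speech, sampling_rate=16_000, return_tensors="np", device="cuda"
        ).input_features

        # Small tolerance to allow for float32 differences between CPU and CUDA kernels.
        self.assertTrue(np.allclose(cpu_features, gpu_features, atol=1e-4))
```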
LGTM, let's ask for @sanchit-gandhi's approval as well here!
Thanks for this PR @vaibhavagg303 - performance improvements to Whisper are always most welcome! In general, I'm in favour of the changes towards faster feature extractors. Do you have any numbers on how much faster this CUDA variant is? My only reluctance in merging this PR would be if the performance gains are negligible but we add extra complexity due to the potential new device argument. To check this, you can adapt and update the toy benchmark that was proposed as part of the first torch stft addition: #26119 (comment). Otherwise, I've left some small suggestions below.
cc @kamilakesbi for viz
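For reference, a toy benchmark along the lines suggested above might look like this (batch size, clip length, and repetition count are illustrative choices; the `device` keyword is the new argument from this PR):

```python
import time

import numpy as np

from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor()
# Eight random 30-second clips at 16 kHz, mimicking a typical training batch.
batch = [np.random.rand(30 * 16_000).astype(np.float32) for _ in range(8)]

for device in ("cpu", "cuda"):
    # Warm-up call so CUDA initialisation does not skew the first measurement.
    feature_extractor(batch, sampling_rate=16_000, return_tensors="pt", device=device)
    start = time.perf_counter()
    for _ in range(10):
        feature_extractor(batch, sampling_rate=16_000, return_tensors="pt", device=device)
    print(f"{device}: {(time.perf_counter() - start) / 10:.3f} s per batch of 8")
```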
#29901
Thanks for the benchmark link @vaibhavagg303, that's most helpful! One follow-up question on your benchmark: you mention that torch gpu stft "cuts the average time for batches of 8 from 1.5 seconds to 0.25 seconds". How does the compute time change for bsz=1, since this is also a common use-case for computing features in short-form mode (<30s)?
As a follow-up PR: we'll need to update the ASR pipeline class to compute batched input features to leverage this speed-up, since it currently always uses bsz=1 for feature extraction.
This is the average computation time for a batch of 8 audio samples.
Thanks @yashjogi. To clarify my question above, what's the speed-up for bsz=1?
It's around …
Also, we found out that this is a huge bottleneck for training Whisper -- this simple change reduced training time by almost 9x, depending on the CPU resources of the machine we're training on. This means it would significantly decrease the time to train distil-whisper as well! https://github.com/huggingface/distil-whisper/blob/main/training/run_distillation.py
Thanks for the updates @vaibhavagg303 - just one small suggestion about the docstring, otherwise LGTM!
Hi @ArthurZucker, can you please review and approve these changes?
Thanks for getting through this and bringing fast whisper processing! 🤗
Thanks, @sanchit-gandhi @ArthurZucker
Thanks for your contribution, @vaibhavagg303!
What does this PR do?
This pull request adds support for computing audio features for the Whisper model in batch mode on the GPU. This substantially reduces feature-extraction latency and improves overall performance.
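As a rough usage sketch (checkpoint name and shapes are illustrative; the `device` keyword is the argument introduced here):

```python
import numpy as np

from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
audio_batch = [np.random.rand(30 * 16_000).astype(np.float32) for _ in range(8)]

# device="cuda" computes the log-mel spectrograms with torch.stft on the GPU instead of numpy on the CPU.
inputs = feature_extractor(audio_batch, sampling_rate=16_000, return_tensors="pt", device="cuda")
print(inputs.input_features.shape)  # e.g. [8, 80, 3000] for 30-second inputs
```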
cc: @yashjogi
@sanchit-gandhi @ArthurZucker @hollance Please take a look.
fixes #29901