[Whisper] Computing features on GPU in batch mode for whisper feature extractor. #29900
Conversation
Very nice addition!
Looks good. Let's add a test that uses `require_torch_gpu` to make sure this runs on GPU!
LGTM, let's ask for @sanchit-gandhi 's approval as well here!
Thanks for this PR @vaibhavagg303 - performance improvements to Whisper are always most welcome! In general, I'm in favour of the changes towards faster feature extractors. Do you have any numbers on how much faster this CUDA variant is? My only reluctance in merging this PR would be if the performance gains are negligible, but we add extra complexity due to the potential new `device` argument. To check this, you can adapt and update the toy benchmark that was proposed as part of the first torch stft addition: #26119 (comment). Otherwise, I've left some small suggestions below.
cc @kamilakesbi for viz
Thanks for the benchmark link @vaibhavagg303, that's most helpful! One follow-up question on your benchmark: you mention that torch gpu stft "cuts the average time for batches of 8 from 1.5 seconds to 0.25 seconds". How does the compute time change for bsz=1, since this is also a common use-case for computing features in short-form mode (<30s)?
As a follow-up PR: we'll need to update the ASR pipeline class to compute batched input features to leverage this speed-up, since it currently always uses bsz=1 for the feature extraction.
This is the average computation time for batches of 8 audio clips.
Thanks @yashjogi. To clarify my question above, what's the speed-up of […]
It's around […]
Also, we found out that this is a huge bottleneck for training Whisper -- this simple change reduced the training time by almost 9 times, depending on the CPU resources of the machine we're training on. This means that it would significantly decrease the time to train distil-whisper as well! https://github.com/huggingface/distil-whisper/blob/main/training/run_distillation.py
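For anyone wanting to reproduce a rough version of this comparison, here is a toy benchmark sketch (not the actual benchmark from #26119, and not the PR's code): it times one `torch.stft` call over a whole batch against a per-clip loop that mimics bsz=1 extraction, with an optional GPU run. The clip length (10 s) and batch size (8) are arbitrary toy choices.

```python
import time
import torch

N_FFT, HOP = 400, 160  # Whisper's STFT settings (25 ms window, 10 ms hop at 16 kHz)

def batched(x, w):
    # One STFT call over the whole (batch, samples) tensor.
    return torch.stft(x, N_FFT, HOP, window=w, return_complex=True).abs() ** 2

def looped(x, w):
    # One STFT per clip, mimicking bsz=1 feature extraction.
    return [torch.stft(c, N_FFT, HOP, window=w, return_complex=True).abs() ** 2 for c in x]

def time_fn(fn, *args, repeats=3):
    fn(*args)  # warm-up (first call pays allocation / kernel-launch costs)
    t0 = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    if args[0].is_cuda:
        torch.cuda.synchronize()  # wait for queued GPU kernels before reading the clock
    return (time.perf_counter() - t0) / repeats

if __name__ == "__main__":
    wav = torch.randn(8, 10 * 16000)  # batch of 8 ten-second clips (toy sizes)
    window = torch.hann_window(N_FFT)
    print(f"cpu batched: {time_fn(batched, wav, window):.4f}s")
    print(f"cpu looped:  {time_fn(looped, wav, window):.4f}s")
    if torch.cuda.is_available():
        print(f"gpu batched: {time_fn(batched, wav.cuda(), window.cuda()):.4f}s")
```

Note the `torch.cuda.synchronize()` before reading the clock: CUDA kernel launches are asynchronous, so without it a GPU run can appear misleadingly fast.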
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates @vaibhavagg303 - just one small suggestion about the docstring, otherwise LGTM!
Hi @ArthurZucker, can you please review and approve these changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for getting through this and bringing fast whisper processing! 🤗
Thanks, @sanchit-gandhi @ArthurZucker
Thanks for your contribution @vaibhavagg303!
… extractor. (huggingface#29900)

* add _torch_extract_fbank_features_batch function in feature_extractor_whisper
* reformat feature_extraction_whisper.py file
* handle batching in single function
* add gpu test & doc
* add batch test & device in each __call__
* add device arg in doc string

---------

Co-authored-by: vaibhav.aggarwal <vaibhav.aggarwal@sprinklr.com>
What does this PR do?
This PR adds support for computing audio features for the Whisper model in batch mode on the GPU. This substantially reduces feature-extraction latency and thereby improves overall performance.
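To illustrate what the batched torch path computes, here is a minimal self-contained sketch -- not the PR's actual implementation (Whisper pads every input to 30 s and ships its own precomputed mel filterbank; this sketch builds a crude triangular filterbank instead). The `device` parameter here mirrors the argument this PR exposes, but the function names are illustrative only.

```python
import math
import torch

def hz_to_mel(f: float) -> float:
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_filterbank(sr: int, n_fft: int, n_mels: int) -> torch.Tensor:
    # Triangular filters spaced evenly on the mel scale (simplified; Whisper
    # uses a precomputed Slaney-style filterbank).
    fft_freqs = torch.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    mel_pts = torch.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)  # mel -> Hz
    fb = torch.zeros(n_mels, n_fft // 2 + 1)
    for i in range(n_mels):
        lo, center, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        up = (fft_freqs - lo) / (center - lo)
        down = (hi - fft_freqs) / (hi - center)
        fb[i] = torch.clamp(torch.minimum(up, down), min=0.0)
    return fb

def batched_log_mel(wav, sr=16000, n_fft=400, hop=160, n_mels=80, device="cpu"):
    # wav: (batch, samples). The key point of the PR: a single torch.stft call
    # handles the whole batch, and everything runs on `device`.
    wav = wav.to(device)
    window = torch.hann_window(n_fft, device=device)
    stft = torch.stft(wav, n_fft, hop, window=window, return_complex=True)
    power = stft.abs() ** 2                      # (batch, n_fft//2+1, frames)
    fb = mel_filterbank(sr, n_fft, n_mels).to(device)
    mel = fb @ power                             # broadcasts to (batch, n_mels, frames)
    return torch.clamp(mel, min=1e-10).log10()
```

With the merged PR itself, the same effect should be reachable through the feature extractor's `device` argument, e.g. `feature_extractor(raw_speech, sampling_rate=16000, return_tensors="pt", device="cuda")`.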
cc: @yashjogi
@sanchit-gandhi @ArthurZucker @hollance Please take a look.
Fixes #29901