Feature request
An option should be added to the WhisperFeatureExtractor class to compute audio features on the GPU in batch mode, in order to reduce the time spent on feature extraction.
Motivation
Currently, the Whisper feature extractor processes audio on the CPU one example at a time rather than as a batch, which slows down preprocessing. It also does not take advantage of the GPU, even though it already uses torch functions.
We (together with @vaibhavagg303) modified the code to enable batched feature extraction on CPU, cutting the average time for batches of 8 from 1.5 seconds to 0.25 seconds, depending on available CPU resources. We then moved the tensors used in _torch_extract_fbank_features (the mel filterbank and the Hann window) to the GPU, further reducing the average time to 0.02 seconds, down from the original 1.5 seconds.
Whisper models, which already have high latency, are slowed down further by computing input features on the CPU without batching. With batched feature extraction on the GPU, the total inference time for Whisper-Small drops by a whopping 25%.
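For illustration, here is a minimal sketch of what batched, device-aware log-mel extraction can look like. This is not the exact code from the PR; it assumes the mel filterbank is already available as a torch tensor and that the waveforms are padded to a common length:

```python
import torch


def batched_log_mel(waveforms: torch.Tensor, mel_filters: torch.Tensor,
                    n_fft: int = 400, hop_length: int = 160) -> torch.Tensor:
    """Whisper-style log-mel features for a whole batch in one pass.

    waveforms:   (batch, num_samples), padded/truncated, on the target device
    mel_filters: (n_mels, n_fft // 2 + 1) mel filterbank matrix
    """
    device = waveforms.device
    window = torch.hann_window(n_fft, device=device)

    # torch.stft accepts batched input, so the whole batch is transformed in one call
    stft = torch.stft(waveforms, n_fft, hop_length, window=window, return_complex=True)
    magnitudes = stft[..., :-1].abs() ** 2

    # (n_mels, n_freq) @ (batch, n_freq, n_frames) -> (batch, n_mels, n_frames)
    mel_spec = mel_filters.to(device=device, dtype=magnitudes.dtype) @ magnitudes

    log_spec = torch.clamp(mel_spec, min=1e-10).log10()
    # clamp to 8 below each example's own maximum, then rescale, as in the unbatched code
    log_spec = torch.maximum(log_spec, log_spec.amax(dim=(-2, -1), keepdim=True) - 8.0)
    log_spec = (log_spec + 4.0) / 4.0
    return log_spec
```

Because torch.stft handles batched input natively, the whole batch is transformed in a single call, and keeping the window and filterbank on the GPU avoids repeated host-to-device copies.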
@sanchit-gandhi
The time taken was measured with a small script along the following lines (a minimal sketch; we timed real audio rather than the random clips shown here and averaged over several runs):
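```python
import time

import numpy as np
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")

# 8 random 30-second clips at 16 kHz stand in for the real audio batch
batch = [np.random.randn(16000 * 30).astype(np.float32) for _ in range(8)]

start = time.perf_counter()
features = feature_extractor(batch, sampling_rate=16000, return_tensors="pt")
print(f"feature extraction took {time.perf_counter() - start:.3f} s")
```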
Your contribution
This is the PR by @vaibhavagg303.
#29900