Feature request
An option should be added to the WhisperFeatureExtractor class to compute audio features on the GPU in batch mode, in order to reduce the time spent on feature extraction.
Motivation
Currently, the Whisper feature extractor processes audio on the CPU one example at a time rather than as a batch, which slows down preprocessing. It also does not take advantage of the GPU, even though it already uses torch functions.
We (together with @vaibhavagg303) modified the code to enable batched feature extraction on CPU, cutting the average time for batches of 8 from 1.5 seconds to 0.25 seconds, depending on available CPU resources. We then moved the tensors used in _torch_extract_fbank_features (the mel filterbank and the Hann window) to the GPU, further reducing the average time to 0.02 seconds, down from the original 1.5 seconds.
Whisper models, which already have high latency, are slowed down further by computing input features on the CPU without batching. With batched feature extraction on the GPU, the total inference time for Whisper-Small drops by a whopping 25%.
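For illustration, here is a minimal sketch of what batched, device-aware log-mel extraction can look like. This is not the exact code from the PR; it assumes the mel filterbank is already available as a torch tensor and that the waveforms are padded to a common length:

```python
import torch


def batched_log_mel(waveforms: torch.Tensor, mel_filters: torch.Tensor,
                    n_fft: int = 400, hop_length: int = 160) -> torch.Tensor:
    """Whisper-style log-mel features for a whole batch in one pass.

    waveforms:   (batch, num_samples), padded/truncated, on the target device
    mel_filters: (n_mels, n_fft // 2 + 1) mel filterbank matrix
    """
    device = waveforms.device
    window = torch.hann_window(n_fft, device=device)

    # torch.stft accepts batched input, so the whole batch is transformed in one call
    stft = torch.stft(waveforms, n_fft, hop_length, window=window, return_complex=True)
    magnitudes = stft[..., :-1].abs() ** 2

    # (n_mels, n_freq) @ (batch, n_freq, n_frames) -> (batch, n_mels, n_frames)
    mel_spec = mel_filters.to(device=device, dtype=magnitudes.dtype) @ magnitudes

    log_spec = torch.clamp(mel_spec, min=1e-10).log10()
    # clamp to 8 below each example's own maximum, then rescale, as in the unbatched code
    log_spec = torch.maximum(log_spec, log_spec.amax(dim=(-2, -1), keepdim=True) - 8.0)
    log_spec = (log_spec + 4.0) / 4.0
    return log_spec
```

Because torch.stft handles batched input natively, the whole batch is transformed in a single call, and keeping the window and filterbank on the GPU avoids repeated host-to-device copies.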
@sanchit-gandhi
The time taken was measured with a small script along the following lines (a minimal sketch; we timed real audio rather than the random clips shown here and averaged over several runs):
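```python
import time

import numpy as np
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")

# 8 random 30-second clips at 16 kHz stand in for the real audio batch
batch = [np.random.randn(16000 * 30).astype(np.float32) for _ in range(8)]

start = time.perf_counter()
features = feature_extractor(batch, sampling_rate=16000, return_tensors="pt")
print(f"feature extraction took {time.perf_counter() - start:.3f} s")
```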
Your contribution
This is the PR by @vaibhavagg303.
#29900