[Whisper] Computing features on CPU slows down WhisperFeatureExtractor #29901

Closed · yashjogi opened this issue Mar 27, 2024 · 0 comments · Fixed by #29900

yashjogi commented Mar 27, 2024

Feature request

An option should be added to the WhisperFeatureExtractor class to compute audio features on the GPU in batch mode, reducing the time feature extraction takes.

Motivation

Currently, the Whisper Feature Extractor processes the audio samples in a batch one at a time on the CPU, which slows down the pipeline. It also never moves the computation to the GPU, even though it already uses torch functions.

We (together with @vaibhavagg303) modified the code to extract features for a whole batch at once on the CPU, cutting the average time for a batch of 8 from 1.5 seconds to 0.25 seconds, depending on available CPU resources. We then moved all the tensors used in _torch_extract_fbank_features (the mel filters and the Hann window) to the GPU, which further reduced the average time to 0.02 seconds, a large drop from the original 1.5 seconds.
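For illustration, here is a minimal sketch of what batched GPU feature extraction can look like. It only mirrors the idea (window, mel filterbank and STFT all kept on the GPU, batch processed in one call); the function and argument names are ours and are not the actual WhisperFeatureExtractor internals.

import torch

def batched_gpu_log_mel(waveforms, mel_filters, n_fft=400, hop_length=160, device="cuda"):
    # waveforms: (batch, n_samples) float tensor of equal-length audio
    # mel_filters: (n_mels, n_fft // 2 + 1) filterbank matrix
    waveforms = waveforms.to(device)
    window = torch.hann_window(n_fft, device=device)
    mel_filters = mel_filters.to(device)

    # torch.stft accepts a batched input, so the whole batch is transformed at once
    stft = torch.stft(waveforms, n_fft, hop_length, window=window, return_complex=True)
    magnitudes = stft[..., :-1].abs() ** 2          # (batch, n_freq, n_frames)

    mel_spec = mel_filters @ magnitudes             # broadcasts to (batch, n_mels, n_frames)
    log_spec = torch.clamp(mel_spec, min=1e-10).log10()
    # Whisper-style dynamic range compression; a global max is used here for brevity,
    # a per-example max would match the unbatched behaviour more closely
    log_spec = torch.maximum(log_spec, log_spec.max() - 8.0)
    log_spec = (log_spec + 4.0) / 4.0
    return log_spec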

Whisper models already have high latency, and processing input features on the CPU without batching makes it worse. With batch-wise feature extraction on the GPU, the total inference time for Whisper-Small drops by a whopping 25%.

@sanchit-gandhi

The code used to measure the time taken is as follows.

from transformers import WhisperProcessor, WhisperForConditionalGeneration, WhisperFeatureExtractor
import librosa
import time
import torch

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language='english')
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.config.forced_decoder_ids = None
model = model.to('cuda')

total_time = 0
batch_size = 8

# df is a DataFrame whose 'file' column holds the paths to the audio files
for i in range(0, len(df), batch_size):
  sample = df[i:i + batch_size]
  files = sample['file'].tolist()
  sample_array = []
  for file in files:
    data, sample_rate = librosa.load(file, sr=16000)
    sample_array.append(data)

  # only feature extraction + generation is timed
  st = time.time()
  input_features = feature_extractor(sample_array, sampling_rate=16000).input_features
  input_features = torch.tensor(input_features).to('cuda')

  with torch.no_grad():
    predicted_ids = model.generate(input_features, language='english')
  transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

  total_time += time.time() - st
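
With the change proposed in the linked PR, the timed section above could presumably hand the whole batch to the extractor on the GPU directly. The return_tensors and device arguments below are assumptions based on #29900, not a confirmed API.

  # assumed usage after #29900: compute features for the whole batch on GPU;
  # the `device` keyword is taken from the linked PR and may differ in the final API
  input_features = feature_extractor(
      sample_array, sampling_rate=16000, return_tensors="pt", device="cuda"
  ).input_features
  with torch.no_grad():
    predicted_ids = model.generate(input_features, language='english')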

Your contribution

This is the PR by @vaibhavagg303.
#29900
