
ONNX converted Whisper model takes more than twice the VRAM of the torch version #1869

Closed
2 of 4 tasks
bruno-hays opened this issue May 22, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@bruno-hays

System Info

optimum==1.19.2
torch==2.1.2
transformers==4.39.3
onnxruntime-gpu==1.17.1
CUDA Version: 12.2
GPU: L4

Who can help?

@michaelbenayoun

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

I wrote an easily reproducible example using either ORT or torch:

With torch:

import torch
from datasets import load_dataset, Audio
from transformers import WhisperForConditionalGeneration, WhisperProcessor, WhisperTokenizer, pipeline
model_path = "openai/whisper-large-v3"
model = WhisperForConditionalGeneration.from_pretrained(model_path)
tokenizer = WhisperTokenizer.from_pretrained(model_path)
processor = WhisperProcessor.from_pretrained(model_path)

asr_pipeline = pipeline('automatic-speech-recognition',
                        model=model, tokenizer=tokenizer, feature_extractor=processor.feature_extractor, device="cuda:0")

# mem_get_info() returns (free_bytes, total_bytes); total - free is the VRAM currently in use on the device
memory_used = (torch.cuda.mem_get_info()[1] - torch.cuda.mem_get_info()[0]) / 1024**3
print("model_size", memory_used)

ds = load_dataset("PolyAI/minds14", name="fr-FR", split="train", streaming=True)
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
audio = next(iter(ds))["audio"]["array"]

a = asr_pipeline(audio,
                 return_timestamps=False,
                 chunk_length_s=30,
                 batch_size=1)
memory_used = (torch.cuda.mem_get_info()[1] - torch.cuda.mem_get_info()[0]) / 1024**3
print("max_size", memory_used)

I get model_size: 6.23 GB, max_size: 6.91 GB
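
For reference, the weights-only footprint can be checked independently of the CUDA context and runtime buffers. This is a minimal sketch (not part of the repro above), assuming a transformers release where PreTrainedModel.get_memory_footprint is available:

# Weights-only footprint of the loaded model, in GB. This excludes the CUDA
# context, activations and the KV-cache, so it is a lower bound on the
# "model_size" number printed above (roughly 6 GB for whisper-large-v3 in float32).
weights_gb = model.get_memory_footprint() / 1024**3
print("weights_only", weights_gb)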

With ONNX:

from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from datasets import load_dataset, Audio
import torch
from transformers import WhisperProcessor, WhisperTokenizer
from optimum.pipelines import pipeline

model_path = "openai/whisper-large-v3"
model = ORTModelForSpeechSeq2Seq.from_pretrained(model_path, provider="CUDAExecutionProvider", export=True)
tokenizer = WhisperTokenizer.from_pretrained(model_path)
processor = WhisperProcessor.from_pretrained(model_path)

asr_pipeline = pipeline('automatic-speech-recognition',
                        model=model, tokenizer=tokenizer, feature_extractor=processor.feature_extractor,
                        device="cuda:0", accelerator="ort")

memory_used = (torch.cuda.mem_get_info()[1] - torch.cuda.mem_get_info()[0]) / 1024**3
print("model_size", memory_used)

ds = load_dataset("PolyAI/minds14", name="fr-FR", split="train", streaming=True)
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
audio = next(iter(ds))["audio"]["array"]

a = asr_pipeline(audio,
                 return_timestamps=False,
                 chunk_length_s=30,
                 batch_size=1)

memory_used = (torch.cuda.mem_get_info()[1] - torch.cuda.mem_get_info()[0]) / 1024**3
print("max_size", memory_used)

I get model_size: 11.26 GB, max_size: 11.99 GB
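
For completeness: part of the measured number can also come from ONNX Runtime's CUDA memory arena, which by default rounds allocations up to the next power of two. This is only a sketch of a diagnostic knob (not a fix), assuming optimum forwards provider_options to the CUDAExecutionProvider as in recent releases:

# Sketch: ask the CUDA execution provider to grow its arena only as much as
# requested instead of rounding each allocation up to the next power of two.
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    model_path,
    export=True,
    provider="CUDAExecutionProvider",
    provider_options={"arena_extend_strategy": "kSameAsRequested"},
)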

Expected behavior

The VRAM requirements using either ORT or torch should be comparable.

bruno-hays added the bug (Something isn't working) label on May 22, 2024
bruno-hays changed the title from "ONNX converted Whisper model takes more than twice the VRAM than the torch version" to "ONNX converted Whisper model takes more than twice the VRAM of the torch version" on May 22, 2024
@bruno-hays
Author

I found the answer to my question:

The increased memory requirement is not a bug: memory for the KV-cache is allocated up front for the maximum sequence length. Separately, the Whisper implementation in the Transformers library is now compatible with torch.compile thanks to the availability of a static KV-cache; in my experiments this gave better performance than the optimum optimizations, although it still carries a large memory overhead because the cache is statically allocated.
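
For anyone landing here, a minimal sketch of that torch.compile path, assuming a transformers release recent enough to support the static KV-cache for Whisper (the exact speed/memory trade-off depends on the version, the dtype and max_new_tokens):

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_path = "openai/whisper-large-v3"
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path, torch_dtype=torch.float16).to("cuda:0")

# Pre-allocate the decoder KV-cache at its maximum length so shapes stay fixed,
# which is what allows the forward pass to be compiled.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

# "audio" is the 16 kHz array from the snippets above; the first generate calls
# are slow while compilation warms up.
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt").to("cuda:0", torch.float16)
generated_ids = model.generate(**inputs)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])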
