
ONNX converted Whisper model takes more than twice the VRAM of the torch version #1869

Closed
2 of 4 tasks
bruno-hays opened this issue May 22, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@bruno-hays

System Info

optimum==1.19.2
torch==2.1.2
transformers==4.39.3
onnxruntime-gpu==1.17.1
CUDA Version: 12.2
GPU: L4

Who can help?

@michaelbenayoun

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

I wrote an easily reproducible example using either ORT or torch:

With torch:

import torch
from datasets import load_dataset, Audio
from transformers import WhisperForConditionalGeneration, WhisperProcessor, WhisperTokenizer, pipeline
model_path = "openai/whisper-large-v3"
model = WhisperForConditionalGeneration.from_pretrained(model_path)
tokenizer = WhisperTokenizer.from_pretrained(model_path)
processor = WhisperProcessor.from_pretrained(model_path)

asr_pipeline = pipeline('automatic-speech-recognition',
                        model=model, tokenizer=tokenizer, feature_extractor=processor.feature_extractor, device="cuda:0")

# mem_get_info() returns (free_bytes, total_bytes); total - free is the VRAM currently in use on the device
memory_used = (torch.cuda.mem_get_info()[1] - torch.cuda.mem_get_info()[0]) / 1024**3
print("model_size", memory_used)

ds = load_dataset("PolyAI/minds14", name="fr-FR", split="train", streaming=True)
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
audio = next(iter(ds))["audio"]["array"]

a = asr_pipeline(audio,
                 return_timestamps=False,
                 chunk_length_s=30,
                 batch_size=1)
memory_used = (torch.cuda.mem_get_info()[1] - torch.cuda.mem_get_info()[0]) / 1024**3
print("max_size", memory_used)

I get model_size: 6.23 GB, max_size: 6.91 GB
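
For reference, the weights-only footprint can be checked independently of the CUDA context and runtime buffers. This is a minimal sketch (not part of the repro above), assuming a transformers release where PreTrainedModel.get_memory_footprint is available:

# Weights-only footprint of the loaded model, in GB. This excludes the CUDA
# context, activations and the KV-cache, so it is a lower bound on the
# "model_size" number printed above (roughly 6 GB for whisper-large-v3 in float32).
weights_gb = model.get_memory_footprint() / 1024**3
print("weights_only", weights_gb)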

With ONNX:

from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from datasets import load_dataset, Audio
import torch
from transformers import WhisperProcessor, WhisperTokenizer
from optimum.pipelines import pipeline

model_path = "openai/whisper-large-v3"
model = ORTModelForSpeechSeq2Seq.from_pretrained(model_path, provider="CUDAExecutionProvider", export=True)
tokenizer = WhisperTokenizer.from_pretrained(model_path)
processor = WhisperProcessor.from_pretrained(model_path)

asr_pipeline = pipeline('automatic-speech-recognition',
                        model=model, tokenizer=tokenizer, feature_extractor=processor.feature_extractor,
                        device="cuda:0", accelerator="ort")

memory_used = (torch.cuda.mem_get_info()[1] - torch.cuda.mem_get_info()[0]) / 1024**3
print("model_size", memory_used)

ds = load_dataset("PolyAI/minds14", name="fr-FR", split="train", streaming=True)
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
audio = next(iter(ds))["audio"]["array"]

a = asr_pipeline(audio,
                 return_timestamps=False,
                 chunk_length_s=30,
                 batch_size=1)

memory_used = (torch.cuda.mem_get_info()[1] - torch.cuda.mem_get_info()[0]) / 1024**3
print("max_size", memory_used)

I get model_size: 11.26 GB, max_size: 11.99 GB
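
For completeness: part of the measured number can also come from ONNX Runtime's CUDA memory arena, which by default rounds allocations up to the next power of two. This is only a sketch of a diagnostic knob (not a fix), assuming optimum forwards provider_options to the CUDAExecutionProvider as in recent releases:

# Sketch: ask the CUDA execution provider to grow its arena only as much as
# requested instead of rounding each allocation up to the next power of two.
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    model_path,
    export=True,
    provider="CUDAExecutionProvider",
    provider_options={"arena_extend_strategy": "kSameAsRequested"},
)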

Expected behavior

The VRAM requirements using either ORT or torch should be comparable.

bruno-hays added the bug (Something isn't working) label on May 22, 2024
bruno-hays changed the title from "ONNX converted Whisper model takes more than twice the VRAM than the torch version" to "ONNX converted Whisper model takes more than twice the VRAM of the torch version" on May 22, 2024
@bruno-hays
Author

I found the answer to my question:

The increased memory requirement is not a bug: memory for the KV-cache is allocated up front for the maximum sequence length. Separately, the Whisper implementation in the Transformers library is now compatible with torch.compile thanks to the availability of a static KV-cache; in my experiments this gave better performance than the optimum optimizations, although it still carries a large memory overhead because the cache is statically allocated.
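
For anyone landing here, a minimal sketch of that torch.compile path, assuming a transformers release recent enough to support the static KV-cache for Whisper (the exact speed/memory trade-off depends on the version, the dtype and max_new_tokens):

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_path = "openai/whisper-large-v3"
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path, torch_dtype=torch.float16).to("cuda:0")

# Pre-allocate the decoder KV-cache at its maximum length so shapes stay fixed,
# which is what allows the forward pass to be compiled.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

# "audio" is the 16 kHz array from the snippets above; the first generate calls
# are slow while compilation warms up.
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt").to("cuda:0", torch.float16)
generated_ids = model.generate(**inputs)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])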
