[Docs] Document the VRAM-vs-audio-duration relationship — RTX 4090 (24GB) OOMs on >30min audio at default sdpa attention #367

@oldmanstillcan

Description

Hi! First — thanks for shipping VibeVoice-ASR. The single-pass transcription + diarization + timestamps + multilingual story is a significant step beyond stitching whisper + pyannote together, and we've been excited to evaluate it.

This is a documentation suggestion based on running the model on a 24GB GPU (RTX 4090 — RunPod community-cloud pod). Sharing the empirical numbers in case they're useful, and proposing a docs addition.

Context

I tested VibeVoice-ASR-7B on four real-world Twitter/X Spaces audio captures of varying length (30 min to ~107 min). Setup followed the repo's installation instructions:

  • Hardware: RTX 4090 (24 GB VRAM), 12 vCPU, 62 GB RAM
  • Image: runpod/pytorch:1.0.3-cu1281-torch291-ubuntu2404 (not the recommended NVIDIA DLC, but a comparable PyTorch base)
  • Install: git clone https://github.com/microsoft/VibeVoice && cd VibeVoice && python3 -m venv .venv && source .venv/bin/activate && pip install -e .
  • Versions: transformers 4.57.6, torch 2.11.0+cu130, Python 3.12
  • Inference: python demo/vibevoice_asr_inference_from_file.py --model_path microsoft/VibeVoice-ASR --audio_files <audio.m4a> --device cuda --attn_implementation sdpa

Observation

| Audio duration | Result | Notes |
| --- | --- | --- |
| 30 min | ✅ Success — 5m 45s wall-clock (~5.2× realtime), peak VRAM ~22 GB | Default sdpa |
| 50 min | torch.OutOfMemoryError: tried to allocate 6.50 GiB, 4.34 GiB free | Default sdpa |
| 92 min | torch.OutOfMemoryError: tried to allocate 1.47 GiB, 1.46 GiB free | Default sdpa |
| 25 min (chunked) | ✅ Success — ~3 min wall-clock per chunk | Default sdpa, 5 separate Python invocations |

The ~30-min ceiling on a 24GB card is a hard wall at default settings. Setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True did not change the outcome.
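To make the "ballpark numbers" idea concrete, here is an illustrative sketch of the kind of rule of thumb a docs table could encode. The function name and both anchor points are assumptions: the 24 GB → ~30 min point comes from our runs above, and the 80 GB → 60 min point assumes the README's claim holds on 80 GB-class cards. Everything in between is interpolated, not measured.

```python
# Illustrative only: piecewise-linear guess at the max single-pass audio
# duration per VRAM size. Anchored on two data points -- the ~30 min
# ceiling we hit on a 24 GB RTX 4090 (default sdpa) and the README's
# 60 min claim, which we assume applies on 80 GB-class hardware.
def estimated_max_minutes(vram_gb: float) -> float:
    x0, y0 = 24.0, 30.0   # measured: 24 GB card OOMs past ~30 min
    x1, y1 = 80.0, 60.0   # assumed: 80 GB card handles the full 60 min
    if vram_gb <= x0:
        return y0 * vram_gb / x0              # naive scaling below 24 GB
    if vram_gb >= x1:
        return y1                             # model caps at 60 min anyway
    return y0 + (vram_gb - x0) * (y1 - y0) / (x1 - x0)
```

A real docs table should of course replace the interpolated middle (e.g. the 40 GB row) with measured numbers; this only shows the shape of the relationship worth documenting.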

Suggestion

The README states:

🕒 60-minute Single-Pass Processing: VibeVoice ASR accepts up to 60 minutes of continuous audio input within 64K token length.

The "60-minute single-pass" capability is real, but conditional on having enough VRAM — implicitly H100 / A100 80G class (the recommended container nvcr.io/nvidia/pytorch:25.12-py3 makes this assumption). On consumer/prosumer 24GB cards, the practical limit is closer to 30 min at default sdpa attention.

Two suggestions for the docs/vibevoice-asr.md page:

  1. Add a "Hardware Requirements" section that documents the VRAM-vs-audio-duration relationship empirically. Even ballpark numbers (e.g. "24 GB → ~30 min, 40 GB → ~60 min, 80 GB → 60 min comfortably") would set expectations correctly. The current README implies the 60-min capability is universal.

  2. Recommend flash_attention_2 as the path for tighter GPUs. The README mentions installing flash-attn manually, but doesn't connect it to "this is what makes 60 min work on 24 GB." If flash-attn does in fact bring 60-min audio into 24GB territory (we haven't validated this ourselves yet), that's worth surfacing.

  3. Optional: an example showing chunking on smaller GPUs, with a note about the cross-chunk speaker-continuity caveat (each chunk gets independent diarization IDs unless you stitch them post-hoc).
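For suggestion 3, the chunk-boundary math is the only non-obvious part, so a docs example could be as small as this sketch. The helper name, the 25-minute chunk size (what fit on our 24 GB card), and the 30 s overlap are all assumptions; the actual audio splitting (e.g. via ffmpeg) and the per-chunk inference invocations are left out.

```python
# Hypothetical helper for running VibeVoice-ASR on smaller GPUs by
# splitting long audio into ~25 min chunks with a small overlap.
# The overlap gives a post-hoc stitching pass a shared window in which
# to match chunk-local speaker IDs across chunk boundaries.
def chunk_bounds(total_s: float, chunk_s: float = 25 * 60,
                 overlap_s: float = 30.0) -> list[tuple[float, float]]:
    """Return (start, end) offsets in seconds covering the whole file."""
    bounds, start = [], 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        bounds.append((start, end))
        if end >= total_s:
            break
        start = end - overlap_s  # back up so adjacent chunks overlap
    return bounds
```

Each (start, end) pair would then be cut out of the source file and transcribed with its own invocation of demo/vibevoice_asr_inference_from_file.py. Since diarization IDs are independent per invocation, the docs note about stitching (e.g. matching speakers active inside the overlap window) is the important caveat to surface.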

Why this matters in practice

For self-hosted deployments outside Microsoft Research, most teams will start with 24 GB consumer cards before committing to 80 GB hardware. (We're building a transcription tool for Twitter Spaces; open-source voice AI is exactly the kind of foundation we want to build on.) Knowing the VRAM ceiling up front saves wasted pod time discovering it.

Happy to add more data points if useful — we have the audio files (publicly accessible Twitter Spaces) and inference logs from all four runs.

Thanks again for the model and the open-source release.
