Hi! First — thanks for shipping VibeVoice-ASR. The single-pass transcription + diarization + timestamps + multilingual story is a significant step beyond stitching whisper + pyannote together, and we've been excited to evaluate it.
This is a documentation suggestion based on running the model on a 24GB GPU (RTX 4090 — RunPod community-cloud pod). Sharing the empirical numbers in case they're useful, and proposing a docs addition.
Context
I tested VibeVoice-ASR-7B on four real-world Twitter/X Spaces audio captures of varying length (30 min to ~107 min). Setup followed the repo's installation instructions:
- Hardware: RTX 4090 (24 GB VRAM), 12 vCPU, 62 GB RAM
- Image: `runpod/pytorch:1.0.3-cu1281-torch291-ubuntu2404` (not the recommended NVIDIA DLC, but a comparable PyTorch base)
- Install: `git clone https://github.com/microsoft/VibeVoice && cd VibeVoice && python3 -m venv .venv && source .venv/bin/activate && pip install -e .`
- Versions: `transformers 4.57.6`, `torch 2.11.0+cu130`, Python 3.12
- Inference: `python demo/vibevoice_asr_inference_from_file.py --model_path microsoft/VibeVoice-ASR --audio_files <audio.m4a> --device cuda --attn_implementation sdpa`
Observation
| Audio duration | Result | Notes |
| --- | --- | --- |
| 30 min | ✅ Success — 5m 45s wall-clock (~5.2× realtime), peak VRAM ~22 GB | Default `sdpa` |
| 50 min | ❌ `torch.OutOfMemoryError`: tried to allocate 6.50 GiB, 4.34 GiB free | Default `sdpa` |
| 92 min | ❌ `torch.OutOfMemoryError`: tried to allocate 1.47 GiB, 1.46 GiB free | Default `sdpa` |
| 25 min (chunked) | ✅ Success — ~3 min wall-clock per chunk | Default `sdpa`, 5 separate Python invocations |
The ~30-min ceiling on a 24 GB card is a hard wall at default settings. Setting `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` did not change the outcome.
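For completeness, this is how the allocator flag was applied on the retry. A minimal sketch using the same demo-script flags as above; `space_50min.m4a` is a placeholder for the actual capture filename:

```bash
# Same invocation as in the setup section, with the allocator flag set for this run only.
# space_50min.m4a is a placeholder for the 50-minute capture.
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python demo/vibevoice_asr_inference_from_file.py \
  --model_path microsoft/VibeVoice-ASR \
  --audio_files space_50min.m4a \
  --device cuda \
  --attn_implementation sdpa
```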
Suggestions
The README states:
> 🕒 60-minute Single-Pass Processing: VibeVoice ASR accepts up to 60 minutes of continuous audio input within 64K token length.
The "60-minute single-pass" capability is real, but conditional on having enough VRAM — implicitly H100 / A100 80G class (the recommended container nvcr.io/nvidia/pytorch:25.12-py3 makes this assumption). On consumer/prosumer 24GB cards, the practical limit is closer to 30 min at default sdpa attention.
Three suggestions (the last one optional) for the `docs/vibevoice-asr.md` page:
- Add a "Hardware Requirements" section that documents the VRAM-vs-audio-duration relationship empirically. Even ballpark numbers (e.g. "24 GB → ~30 min, 40 GB → ~60 min, 80 GB → 60 min comfortably") would set expectations correctly. The current README implies the 60-min capability is universal.
- Recommend `flash_attention_2` as the path for tighter GPUs. The README mentions installing flash-attn manually, but doesn't connect it to "this is what makes 60 min work on 24 GB." If flash-attn does in fact bring 60-min audio into 24 GB territory (we haven't validated this ourselves yet), that's worth surfacing.
- Optional: an example showing chunking on smaller GPUs, with a note about the cross-chunk speaker-continuity caveat (each chunk gets independent diarization IDs unless you stitch them post-hoc). A rough sketch of the chunked fallback we ran is below.
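For reference, this is roughly what our chunked runs looked like. It's a minimal sketch, assuming `ffmpeg` is on PATH and reusing the demo-script flags from the setup above; the 1500-second chunk length and filenames are illustrative, and nothing here reconciles speaker IDs across chunks:

```bash
# Split the capture into ~25-minute chunks (audio stream copied, no re-encode).
ffmpeg -i space.m4a -f segment -segment_time 1500 -c copy chunk_%03d.m4a

# Transcribe each chunk in a separate invocation.
# Caveat: diarization/speaker IDs are assigned per chunk and are NOT consistent
# across chunks unless stitched together post-hoc.
for f in chunk_*.m4a; do
  python demo/vibevoice_asr_inference_from_file.py \
    --model_path microsoft/VibeVoice-ASR \
    --audio_files "$f" \
    --device cuda \
    --attn_implementation sdpa
done
```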
Why this matters in practice
For self-hosted deployments outside Microsoft Research (we're building a transcription tool for Twitter Spaces — open-source voice AI is exactly the kind of foundation we want to build on), most teams will start on 24 GB consumer cards before committing to 80 GB hardware. Knowing the VRAM ceiling up front saves the pod time otherwise spent discovering it by trial and error.
Happy to add more data points if useful — we have the audio files (publicly-accessible Twitter Spaces) and inference logs from the four runs.
Thanks again for the model and the open-source release.