Hi! First — thanks for shipping VibeVoice-ASR. The single-pass transcription + diarization + timestamps + multilingual story is a significant step beyond stitching whisper + pyannote together, and we've been excited to evaluate it.
This is a documentation suggestion based on running the model on a 24GB GPU (RTX 4090 — RunPod community-cloud pod). Sharing the empirical numbers in case they're useful, and proposing a docs addition.
Context
I tested VibeVoice-ASR-7B on four real-world Twitter/X Spaces audio captures of varying length (30 min to ~107 min). Setup followed the repo's installation instructions:
- Hardware: RTX 4090 (24 GB VRAM), 12 vCPU, 62 GB RAM
- Image: `runpod/pytorch:1.0.3-cu1281-torch291-ubuntu2404` (not the recommended NVIDIA DLC, but a comparable PyTorch base)
- Install: `git clone https://github.com/microsoft/VibeVoice && cd VibeVoice && python3 -m venv .venv && source .venv/bin/activate && pip install -e .`
- Versions: `transformers 4.57.6`, `torch 2.11.0+cu130`, Python 3.12
- Inference: `python demo/vibevoice_asr_inference_from_file.py --model_path microsoft/VibeVoice-ASR --audio_files <audio.m4a> --device cuda --attn_implementation sdpa`
Observation
| Audio duration | Result | Notes |
| --- | --- | --- |
| 30 min | ✅ Success — 5m 45s wall-clock (~5.2× realtime), peak VRAM ~22 GB | Default `sdpa` |
| 50 min | ❌ `torch.OutOfMemoryError`: tried to allocate 6.50 GiB, 4.34 GiB free | Default `sdpa` |
| 92 min | ❌ `torch.OutOfMemoryError`: tried to allocate 1.47 GiB, 1.46 GiB free | Default `sdpa` |
| 25 min (chunked) | ✅ Success — ~3 min wall-clock per chunk | Default `sdpa`, 5 separate Python invocations |
The ~30-min ceiling on a 24 GB card is a hard wall at default settings. Setting `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` did not change the outcome.
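For completeness, this is how the allocator flag was applied on the retry. A minimal sketch using the same demo-script flags as above; `space_50min.m4a` is a placeholder for the actual capture filename:

```bash
# Same invocation as in the setup section, with the allocator flag set for this run only.
# space_50min.m4a is a placeholder for the 50-minute capture.
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python demo/vibevoice_asr_inference_from_file.py \
  --model_path microsoft/VibeVoice-ASR \
  --audio_files space_50min.m4a \
  --device cuda \
  --attn_implementation sdpa
```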
Suggestions
The README states:
> 🕒 60-minute Single-Pass Processing: VibeVoice ASR accepts up to 60 minutes of continuous audio input within 64K token length.
The "60-minute single-pass" capability is real, but conditional on having enough VRAM — implicitly H100 / A100 80G class (the recommended container nvcr.io/nvidia/pytorch:25.12-py3 makes this assumption). On consumer/prosumer 24GB cards, the practical limit is closer to 30 min at default sdpa attention.
Three suggestions (the last one optional) for the `docs/vibevoice-asr.md` page:
- Add a "Hardware Requirements" section that documents the VRAM-vs-audio-duration relationship empirically. Even ballpark numbers (e.g. "24 GB → ~30 min, 40 GB → ~60 min, 80 GB → 60 min comfortably") would set expectations correctly. The current README implies the 60-min capability is universal.
- Recommend `flash_attention_2` as the path for tighter GPUs. The README mentions installing flash-attn manually, but doesn't connect it to "this is what makes 60 min work on 24 GB." If flash-attn does in fact bring 60-min audio into 24 GB territory (we haven't validated this ourselves yet), that's worth surfacing.
- Optional: an example showing chunking on smaller GPUs, with a note about the cross-chunk speaker-continuity caveat (each chunk gets independent diarization IDs unless you stitch them post-hoc). A rough sketch of the chunked fallback we ran is below.
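For reference, this is roughly what our chunked runs looked like. It's a minimal sketch, assuming `ffmpeg` is on PATH and reusing the demo-script flags from the setup above; the 1500-second chunk length and filenames are illustrative, and nothing here reconciles speaker IDs across chunks:

```bash
# Split the capture into ~25-minute chunks (audio stream copied, no re-encode).
ffmpeg -i space.m4a -f segment -segment_time 1500 -c copy chunk_%03d.m4a

# Transcribe each chunk in a separate invocation.
# Caveat: diarization/speaker IDs are assigned per chunk and are NOT consistent
# across chunks unless stitched together post-hoc.
for f in chunk_*.m4a; do
  python demo/vibevoice_asr_inference_from_file.py \
    --model_path microsoft/VibeVoice-ASR \
    --audio_files "$f" \
    --device cuda \
    --attn_implementation sdpa
done
```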
Why this matters in practice
For self-hosted deployments outside Microsoft Research (we're building a transcription tool for Twitter Spaces — open-source voice AI is exactly the kind of foundation we want to build on), most teams will start on 24 GB consumer cards before committing to 80 GB hardware. Knowing the VRAM ceiling up front saves the pod time otherwise spent discovering it by trial and error.
Happy to add more data points if useful — we have the audio files (publicly-accessible Twitter Spaces) and inference logs from the four runs.
Thanks again for the model and the open-source release.