Fast inference engine for DACVAE, a neural audio codec that compresses and reconstructs audio using a convolutional encoder-decoder with a VAE bottleneck. This library accelerates DACVAE inference by up to 11.2x on NVIDIA GPUs through graph-level optimizations, with no custom kernels, no quality loss at FP32, and no changes to model weights.
Benchmark setup: NVIDIA H100 PCIe | facebook/dacvae-watermarked (107.7M params) | 100 s of audio @ 48 kHz
**FP32:**

| Method | Latency | Speedup | Real-time Factor (RTF) |
|---|---|---|---|
| PyTorch FP32 | 1,047 ms | 1.0x | 96x |
| + channels_last + wn_off | 549 ms | 1.9x | 182x |
| + torch.compile + graph | 209 ms | 5.0x | 478x |

**FP16 / BF16:**

| Method | Latency | Speedup | RTF | SNR vs FP32 |
|---|---|---|---|---|
| PyTorch FP16 | 775 ms | 1.4x | 129x | 40.4 dB |
| + channels_last + wn_off | 307 ms | 3.4x | 326x | 40.2 dB |
| + torch.compile + graph (FP16) | 93 ms | 11.2x | 1,071x | 40.2 dB |
| + torch.compile + graph (BF16) | 100 ms | 10.5x | 1,004x | 29.8 dB |
```bash
pip install git+https://github.com/kadirnar/fast-dacvae.git
```

```python
from dacvae import DACVAE
from dacvae.optimize import optimize_dacvae
import torch

# 100 s of dummy audio at 48 kHz (4,800,000 samples)
model = DACVAE.load("facebook/dacvae-watermarked").cuda().eval()
audio = torch.randn(1, 1, 4800000, device="cuda")

# FP32 — zero quality loss, ~209 ms
replay = optimize_dacvae(model, audio, dtype="fp32")
output = replay()

# FP16 — fastest, ~93 ms
replay = optimize_dacvae(model, audio, dtype="fp16")
output = replay()

# BF16 — ~100 ms
replay = optimize_dacvae(model, audio, dtype="bf16")
output = replay()
```

Requirements:

- PyTorch 2.9+
- NVIDIA GPU (Hopper/Ampere)
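
The SNR figures in the tables above compare reduced-precision output against the FP32 reconstruction. A hypothetical helper (not part of this library) showing one way to compute such a figure:

```python
import torch

def snr_db(reference: torch.Tensor, test: torch.Tensor) -> float:
    """Signal-to-noise ratio of `test` against an FP32 `reference`, in dB."""
    noise = reference - test.to(reference.dtype)
    return (10 * torch.log10(reference.pow(2).sum() / noise.pow(2).sum())).item()
```

Usage would look like `snr_db(fp32_output, fp16_output)`; around 40 dB, as in the FP16 rows, the precision error sits roughly four orders of magnitude below the signal power.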
License: Apache 2.0