A production-ready FastAPI server that exposes the VibeVoice TTS model as an OpenAI-compatible API, with Docker support and comprehensive voice management.
- Docker-First Deployment: Production-ready Docker setup with GPU support (recommended)
- OpenAI-Compatible API: Drop-in replacement for OpenAI's TTS API (`/v1/audio/speech`)
- Unlimited Custom Voices: Automatically load any voice from a directory - just drop in audio files and restart
- Multi-Format Support: MP3, WAV, FLAC, AAC, M4A, Opus, PCM
- Streaming Support: Real-time audio streaming for long-form content
- Voice Management: Dynamic voice loading, OpenAI voice mapping, and custom voice presets
- Production Ready: Health checks, error handling, CORS support, and comprehensive logging
Docker is the recommended deployment method - it handles all dependencies, ensures consistent environments, and is production-ready.
```bash
# Clone the repository
git clone https://github.com/ncoder-ai/VibeVoice-FastAPI.git
cd VibeVoice-FastAPI
# Copy and configure environment
cp docker-env.example .env
# Edit .env - set VOICES_DIR to your voice files path
# Build and run
docker-compose up -d
# Check logs
docker-compose logs -f
```

The API will be available at http://localhost:8001
See DOCKER_QUICKSTART.md for detailed Docker instructions.
For development or if you prefer bare-metal installation:
```bash
# Clone the repository
git clone https://github.com/ncoder-ai/VibeVoice-FastAPI.git
cd VibeVoice-FastAPI
# Run setup script
./setup.sh
# Configure environment
cp env.example .env
# Edit .env with your settings
# Start server
./start.sh
```

- API README - Complete API documentation with examples, voice management, and troubleshooting
- Docker Quickstart - Docker deployment quickstart guide
- `POST /v1/audio/speech` - Generate speech from text (OpenAI-compatible)
- `GET /v1/audio/voices` - List all available voices
- `POST /v1/vibevoice/generate` - Advanced generation with multi-speaker support
- `GET /v1/vibevoice/voices` - List all voices with detailed info
- `GET /v1/vibevoice/health` - Detailed health check
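Because `/v1/audio/speech` mirrors OpenAI's endpoint, the official `openai` Python package can be pointed at this server. A minimal sketch, assuming the server runs on `localhost:8001` and accepts any API key string (the server ships without authentication):

```python
from openai import OpenAI

# Point the OpenAI client at the local VibeVoice server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8001/v1", api_key="not-needed")

response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hello, this is a test of the VibeVoice API",
    response_format="mp3",
)
response.write_to_file("speech.mp3")
```

The equivalent raw HTTP request with curl: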
```bash
curl -X POST http://localhost:8001/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "tts-1",
"input": "Hello, this is a test of the VibeVoice API",
"voice": "alloy",
"response_format": "mp3"
}' \
--output speech.mp3
```
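For long-form content, the same endpoint can be consumed as a stream. A sketch using `requests` (whether bytes arrive incrementally depends on the server's streaming implementation; `stream=True` simply avoids buffering the whole response in memory):

```python
import requests

payload = {
    "model": "tts-1",
    "input": "A much longer passage of text ...",
    "voice": "alloy",
    "response_format": "mp3",
}

# stream=True lets us write audio chunks to disk as they arrive.
with requests.post(
    "http://localhost:8001/v1/audio/speech", json=payload, stream=True, timeout=600
) as resp:
    resp.raise_for_status()
    with open("long_speech.mp3", "wb") as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)
```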
```bash
# List all available voices
curl http://localhost:8001/v1/audio/voices

# Pretty-print the voice list with jq
curl http://localhost:8001/v1/audio/voices | jq
```

- VibeVoice-Large: HuggingFace [rsxdalv/VibeVoice-Large](https://huggingface.co/rsxdalv/VibeVoice-Large)
- VibeVoice-1.5B: HuggingFace [microsoft/VibeVoice-1.5B](https://huggingface.co/microsoft/VibeVoice-1.5B)
The API includes 6 OpenAI-compatible voice mappings:
`alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`
Simply place audio files (`.wav`, `.mp3`, `.flac`, `.m4a`, etc.) in your `VOICES_DIR` and restart the server. All files are automatically loaded as voice presets!
```bash
# Add a custom voice
cp my_voice.wav /path/to/voices/custom_voice.wav
# Restart the server - the voice is now available!
```

You can use any voice name directly in API requests:
```bash
curl -X POST http://localhost:8001/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "tts-1",
"input": "Testing custom voice",
"voice": "custom_voice",
"response_format": "wav"
}'
```

Key environment variables (see `env.example` for the full list):
```bash
# Model Configuration
VIBEVOICE_MODEL_PATH=microsoft/VibeVoice-1.5B # or local path
VIBEVOICE_DEVICE=cuda # cuda, cpu, or mps
VIBEVOICE_INFERENCE_STEPS=10 # 5-50, higher = better quality
# Voice Configuration
VOICES_DIR=demo/voices # Directory with voice files
# API Configuration
API_PORT=8001
API_CORS_ORIGINS=*
# Generation Defaults
DEFAULT_CFG_SCALE=1.3 # 1.0-3.0
DEFAULT_RESPONSE_FORMAT=mp3
```

Docker is the recommended deployment method. It provides:
- ✅ Consistent environment across all systems
- ✅ No dependency conflicts
- ✅ Easy GPU configuration
- ✅ Production-ready setup
- ✅ Simplified updates and maintenance
- Docker and Docker Compose
- NVIDIA Container Toolkit (for GPU support)
- NVIDIA GPU with 8GB+ VRAM (for 1.5B model) or 16GB+ (for Large model)
```bash
# Copy and configure environment
cp docker-env.example .env
# Edit .env - set VOICES_DIR to your voice files path
# Build and run
docker-compose up -d
# Check logs
docker-compose logs -f
```

See DOCKER_QUICKSTART.md for the complete Docker deployment guide.
```bash
# Create virtual environment
python3.12 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -e .
pip install -r requirements-api.txt
# Install PyTorch (CUDA)
pip install torch --index-url https://download.pytorch.org/whl/cu128
# Optional: Install flash-attn for faster inference
# See setup.sh for pre-built wheel installation
```

```bash
# Start server
./start.sh
# Test API
curl http://localhost:8001/health
curl http://localhost:8001/v1/audio/voices
```

| Model | Size | Context | Max Length | VRAM Required |
|---|---|---|---|---|
| VibeVoice-1.5B | 1.5B | 64K | ~90 min | 8GB+ |
| VibeVoice-Large | 7B | 32K | ~45 min | 16GB+ |
Models are automatically downloaded from HuggingFace on first use.
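If you would rather not pay the download cost on the first request, the weights can be pre-fetched into the local HuggingFace cache. A sketch using `huggingface_hub`:

```python
from huggingface_hub import snapshot_download

# Downloads (or resumes) the model into the local HuggingFace cache,
# so the server's first request does not block on a multi-GB download.
snapshot_download(repo_id="microsoft/VibeVoice-1.5B")
```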
For Docker Deployment (Recommended):
- Docker: Docker and Docker Compose installed
- GPU: NVIDIA GPU with 8GB+ VRAM (for 1.5B model) or 16GB+ (for Large model)
- NVIDIA Container Toolkit: Required for GPU support
- RAM: 16GB minimum, 32GB recommended
- Storage: 10GB minimum, 50GB recommended (with model cache)
- OS: Linux (recommended), macOS, or Windows (with WSL2)
For Local Installation:
- Python: 3.12
- GPU: NVIDIA GPU with 8GB+ VRAM
- RAM: 16GB minimum, 32GB recommended
- Storage: 10GB minimum, 50GB recommended
- OS: Linux, macOS, or Windows (with WSL2)
- The API does not include authentication by default. For production use, add authentication middleware or deploy behind a reverse proxy with authentication (see the sketch after this list).
- Voice files are loaded from the configured directory - ensure proper file permissions.
- Model weights are downloaded from HuggingFace - verify model integrity in production.
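One possible shape for such middleware, shown against a bare FastAPI app rather than this project's actual application object (the header name and key handling here are illustrative assumptions):

```python
import os

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()
API_KEY = os.environ.get("API_KEY", "change-me")  # illustrative; keep real keys out of code

@app.middleware("http")
async def require_api_key(request: Request, call_next):
    # Reject any request that does not carry the expected bearer token.
    if request.headers.get("Authorization") != f"Bearer {API_KEY}":
        return JSONResponse(status_code=401, content={"error": "unauthorized"})
    return await call_next(request)
```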
This project maintains the original VibeVoice model codebase. Please refer to the original VibeVoice license for model usage terms.
Contributions are welcome! Please feel free to submit a Pull Request.
- VibeVoice Team at Microsoft for the original model
- shijincai for maintaining a backup of the original VibeVoice codebase
- FastAPI for the excellent web framework
- HuggingFace for model hosting and transformers library
- Language Support: Primarily English and Chinese. Other languages may produce unexpected results.
- Non-Speech Audio: The model focuses on speech synthesis and may generate background music or sounds spontaneously.
- Commercial Use: This model is intended for research and development. Test thoroughly before production use.
- Check GPU availability: `nvidia-smi`
- Verify Python version: `python3 --version` (should be 3.12)
- Check dependencies: `pip list | grep torch`
- Use a smaller model: `VIBEVOICE_MODEL_PATH=microsoft/VibeVoice-1.5B`
- Reduce inference steps: `VIBEVOICE_INFERENCE_STEPS=5`
- Use CPU mode: `VIBEVOICE_DEVICE=cpu` (much slower)
- Verify the `VOICES_DIR` path in `.env`
- Check file permissions
- Ensure audio files are in supported formats
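As a quick GPU sanity check from Python (assumes PyTorch is already installed):

```python
import torch

# Confirm PyTorch can see the GPU the server will use.
print(torch.__version__)
print(torch.cuda.is_available())  # should print True on a working CUDA setup
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name, f"{props.total_memory / 1024**3:.1f} GiB VRAM")
```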
For more help, see the API README or open an issue.
Made with ❤️ for the VibeVoice community