Skip to content

ncoder-ai/VibeVoice-FastAPI

Repository files navigation

VibeVoice FastAPI Server

A production-ready FastAPI server that exposes the VibeVoice TTS model as an OpenAI-compatible API, with Docker support and comprehensive voice management.

FastAPI Python Docker OpenAI Compatible

πŸš€ Features

  • Docker-First Deployment: Production-ready Docker setup with GPU support (recommended)
  • OpenAI-Compatible API: Drop-in replacement for OpenAI's TTS API (/v1/audio/speech)
  • Unlimited Custom Voices: Automatically load any voice from a directory - just drop audio files and restart
  • Multi-Format Support: MP3, WAV, FLAC, AAC, M4A, Opus, PCM
  • Streaming Support: Real-time audio streaming for long-form content
  • Voice Management: Dynamic voice loading, OpenAI voice mapping, and custom voice presets
  • Production Ready: Health checks, error handling, CORS support, and comprehensive logging

πŸ“‹ Quick Start

Docker is the recommended deployment method - it handles all dependencies, ensures consistent environments, and is production-ready.

Docker Deployment (Recommended)

# Clone the repository
git clone https://github.com/ncoder-ai/VibeVoice-FastAPI.git
cd VibeVoice-FastAPI

# Copy and configure environment
cp docker-env.example .env
# Edit .env - set VOICES_DIR to your voice files path

# Build and run
docker-compose up -d

# Check logs
docker-compose logs -f

The API will be available at http://localhost:8001

See DOCKER_QUICKSTART.md for detailed Docker instructions.

Local Installation (Alternative)

For development or if you prefer bare-metal installation:

# Clone the repository
git clone https://github.com/ncoder-ai/VibeVoice-FastAPI.git
cd VibeVoice-FastAPI

# Run setup script
./setup.sh

# Configure environment
cp env.example .env
# Edit .env with your settings

# Start server
./start.sh

πŸ“– Documentation

  • API README - Complete API documentation with examples, voice management, and troubleshooting
  • Docker Quickstart - Docker deployment quickstart guide

🎯 API Endpoints

OpenAI-Compatible Endpoints

  • POST /v1/audio/speech - Generate speech from text (OpenAI-compatible)
  • GET /v1/audio/voices - List all available voices

VibeVoice-Specific Endpoints

  • POST /v1/vibevoice/generate - Advanced generation with multi-speaker support
  • GET /v1/vibevoice/voices - List all voices with detailed info
  • GET /v1/vibevoice/health - Detailed health check

Example: Generate Speech

curl -X POST http://localhost:8001/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Hello, this is a test of the VibeVoice API",
    "voice": "alloy",
    "response_format": "mp3"
  }' \
  --output speech.mp3

Example: List Voices

# List all available voices
curl http://localhost:8001/v1/audio/voices

# List with OpenAI format
curl http://localhost:8001/v1/audio/voices | jq

MODEL MANAGEMENT

VibeeVoice Large: Huggingface: rsxdalv/VibeVoice-Large [https://huggingface.co/rsxdalv/VibeVoice-Large]

VibeVoice 1.5B: Huggingface microsoft/VibeVoice-1.5B [https://huggingface.co/microsoft/VibeVoice-1.5B]

🎀 Voice Management

Using OpenAI-Compatible Voices

The API includes 6 OpenAI-compatible voice mappings:

  • alloy, echo, fable, onyx, nova, shimmer

Using Custom Voices

Simply place audio files (.wav, .mp3, .flac, .m4a, etc.) in your VOICES_DIR and restart the server. All files are automatically loaded as voice presets!

# Add a custom voice
cp my_voice.wav /path/to/voices/custom_voice.wav
# Restart server - voice is now available!

Direct Voice Usage

You can use any voice name directly in API requests:

curl -X POST http://localhost:8001/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Testing custom voice",
    "voice": "custom_voice",
    "response_format": "wav"
  }'

βš™οΈ Configuration

Key environment variables (see env.example for full list):

# Model Configuration
VIBEVOICE_MODEL_PATH=microsoft/VibeVoice-1.5B  # or local path
VIBEVOICE_DEVICE=cuda                           # cuda, cpu, or mps
VIBEVOICE_INFERENCE_STEPS=10                    # 5-50, higher = better quality

# Voice Configuration
VOICES_DIR=demo/voices                           # Directory with voice files

# API Configuration
API_PORT=8001
API_CORS_ORIGINS=*

# Generation Defaults
DEFAULT_CFG_SCALE=1.3                            # 1.0-3.0
DEFAULT_RESPONSE_FORMAT=mp3

🐳 Docker Deployment

Docker is the recommended and preferred deployment method. It provides:

  • βœ… Consistent environment across all systems
  • βœ… No dependency conflicts
  • βœ… Easy GPU configuration
  • βœ… Production-ready setup
  • βœ… Simplified updates and maintenance

Requirements

  • Docker and Docker Compose
  • NVIDIA Container Toolkit (for GPU support)
  • NVIDIA GPU with 8GB+ VRAM (for 1.5B model) or 16GB+ (for Large model)

Quick Start

# Copy and configure environment
cp docker-env.example .env
# Edit .env - set VOICES_DIR to your voice files path

# Build and run
docker-compose up -d

# Check logs
docker-compose logs -f

See DOCKER_QUICKSTART.md for complete Docker deployment guide.

πŸ”§ Development

Setup Development Environment

# Create virtual environment
python3.12 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -e .
pip install -r requirements-api.txt

# Install PyTorch (CUDA)
pip install torch --index-url https://download.pytorch.org/whl/cu128

# Optional: Install flash-attn for faster inference
# See setup.sh for pre-built wheel installation

Running Tests

# Start server
./start.sh

# Test API
curl http://localhost:8001/health
curl http://localhost:8001/v1/audio/voices

πŸ“Š Supported Models

Model Size Context Max Length VRAM Required
VibeVoice-1.5B 1.5B 64K ~90 min 8GB+
VibeVoice-Large 7B 32K ~45 min 16GB+

Models are automatically downloaded from HuggingFace on first use.

πŸ› οΈ System Requirements

For Docker Deployment (Recommended):

  • Docker: Docker and Docker Compose installed
  • GPU: NVIDIA GPU with 8GB+ VRAM (for 1.5B model) or 16GB+ (for Large model)
  • NVIDIA Container Toolkit: Required for GPU support
  • RAM: 16GB minimum, 32GB recommended
  • Storage: 10GB minimum, 50GB recommended (with model cache)
  • OS: Linux (recommended), macOS, or Windows (with WSL2)

For Local Installation:

  • Python: 3.12
  • GPU: NVIDIA GPU with 8GB+ VRAM
  • RAM: 16GB minimum, 32GB recommended
  • Storage: 10GB minimum, 50GB recommended
  • OS: Linux, macOS, or Windows (with WSL2)

πŸ” Security Notes

  • The API does not include authentication by default. For production use, add authentication middleware or deploy behind a reverse proxy with authentication.
  • Voice files are loaded from the configured directory - ensure proper file permissions.
  • Model weights are downloaded from HuggingFace - verify model integrity in production.

πŸ“ License

This project maintains the original VibeVoice model codebase. Please refer to the original VibeVoice license for model usage terms.

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

πŸ™ Acknowledgments

  • VibeVoice Team at Microsoft for the original model
  • shijincai for maintaining a backup of the original VibeVoice codebase
  • FastAPI for the excellent web framework
  • HuggingFace for model hosting and transformers library

πŸ“š Additional Resources

⚠️ Limitations

  • Language Support: Primarily English and Chinese. Other languages may produce unexpected results.
  • Non-Speech Audio: The model focuses on speech synthesis and may generate background music or sounds spontaneously.
  • Commercial Use: This model is intended for research and development. Test thoroughly before production use.

πŸ†˜ Troubleshooting

Server won't start

  • Check GPU availability: nvidia-smi
  • Verify Python version: python3 --version (should be 3.12)
  • Check dependencies: pip list | grep torch

Out of memory errors

  • Use smaller model: VIBEVOICE_MODEL_PATH=microsoft/VibeVoice-1.5B
  • Reduce inference steps: VIBEVOICE_INFERENCE_STEPS=5
  • Use CPU mode: VIBEVOICE_DEVICE=cpu (much slower)

Voice not found errors

  • Verify VOICES_DIR path in .env
  • Check file permissions
  • Ensure audio files are in supported formats

For more help, see the API README or open an issue.


Made with ❀️ for the VibeVoice community

About

FastAPI wrapper around original Vibevoice 1.5B and 7B models

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 2

  •  
  •