Generic Speech AI Platform - Ollama for Voice Models
Vocal is an API-first speech AI platform with automatic OpenAPI spec generation, auto-generated SDK, and Ollama-style model management. Built with a generic registry pattern supporting multiple providers.
# 1. Run with uvx (no installation needed!)
uvx --from vocal-ai vocal serve
# 2. Visit interactive docs
# Open: http://localhost:8000/docs
# 3. Pull a model and transcribe
uvx --from vocal-ai vocal models pull Systran/faster-whisper-tiny
uvx --from vocal-ai vocal run your_audio.mp3

That's it! Models auto-download on first use.
Pro tip: For development, clone the repo and run make help to see all available commands.
- API-First Architecture: FastAPI with auto-generated OpenAPI spec
- Interactive Docs: Swagger UI at the `/docs` endpoint
- Auto-Generated SDK: Python SDK generated from the OpenAPI spec
- Ollama-Style: Model registry with pull/list/delete commands
- Fast Inference: faster-whisper (4x faster than OpenAI Whisper)
- GPU Acceleration: Automatic CUDA detection with VRAM optimization
- 99+ Languages: Support for multilingual transcription
- Extensible: Generic provider pattern (HuggingFace, local, custom)
- OpenAI Compatible: `/v1/audio/transcriptions` and `/v1/audio/speech` endpoints
- Neural TTS: Kokoro-82M, Qwen3-TTS (0.6B / 1.7B), Piper, or system voices
- Streaming TTS: Chunked audio delivery via `"stream": true`; first bytes arrive immediately
- CLI Tool: Typer-based CLI with rich console output
- Cross-Platform: Full support for Windows, macOS, and Linux
- Production Ready: 36/36 tests passing with real audio assets
- Python 3.10+
- ffmpeg: required for audio format conversion (mp3, opus, aac, flac, pcm). WAV output works without it.
# macOS
brew install ffmpeg

# Ubuntu / Debian
sudo apt install ffmpeg

# Windows
choco install ffmpeg
# Run directly with uvx (no installation needed)
uvx --from vocal-ai vocal serve
# Or install with pip
pip install vocal-ai
vocal serve
# Optional backends (install what you need)
pip install vocal-ai[kokoro] # Kokoro-82M neural TTS (CPU/GPU)
pip install vocal-ai[qwen3-tts]  # Qwen3-TTS 0.6B / 1.7B (CUDA required)

git clone https://github.com/niradler/vocal
cd vocal
make install
make serve

The API will be available at:
- API: http://localhost:8000
- Interactive Docs: http://localhost:8000/docs
- OpenAPI Spec: http://localhost:8000/openapi.json
- Health: http://localhost:8000/health
# Production
uvx --from vocal-ai vocal serve
# Development with auto-reload (from source)
make serve-dev

from vocal import VocalSDK
# Initialize client
client = VocalSDK(base_url="http://localhost:8000")
# List models (Ollama-style)
models = client.models.list()
for model in models['models']:
print(f"{model['id']}: {model['status']}")
# Download model if needed (Ollama-style pull)
client.models.download("Systran/faster-whisper-tiny")
# Transcribe audio (OpenAI-compatible)
result = client.audio.transcribe(
file="audio.mp3",
model="Systran/faster-whisper-tiny"
)
print(result['text'])
# Text-to-Speech (default: mp3)
audio = client.audio.text_to_speech(
text="Hello, world!",
model="pyttsx3"
)
with open("output.mp3", "wb") as f:
f.write(audio)
# TTS with specific format and voice
audio = client.audio.text_to_speech(
text="Hello!",
response_format="wav", # mp3, wav, opus, aac, flac, pcm
voice="Samantha"
)

# Start server
uvx --from vocal-ai vocal serve
# Transcribe audio
uvx --from vocal-ai vocal run audio.mp3
# List models
uvx --from vocal-ai vocal models list
# Download model
uvx --from vocal-ai vocal models pull Systran/faster-whisper-tiny
# Delete model
uvx --from vocal-ai vocal models delete Systran/faster-whisper-tiny

Transcribe Audio:
curl -X POST "http://localhost:8000/v1/audio/transcriptions" \
-F "file=@audio.mp3" \
-F "model=Systran/faster-whisper-tiny"

Text-to-Speech:
# Default format is mp3
curl -X POST "http://localhost:8000/v1/audio/speech" \
-H "Content-Type: application/json" \
-d '{"model":"pyttsx3","input":"Hello world"}' \
--output speech.mp3
# Request specific format (mp3, wav, opus, aac, flac, pcm)
curl -X POST "http://localhost:8000/v1/audio/speech" \
-H "Content-Type: application/json" \
-d '{"model":"pyttsx3","input":"Hello world","response_format":"wav"}' \
--output speech.wav
# Streaming: receive audio chunks as they are generated
curl -X POST "http://localhost:8000/v1/audio/speech" \
-H "Content-Type: application/json" \
-d '{"model":"kokoro","input":"Hello world","response_format":"pcm","stream":true}' \
--output - | play -r 24000 -e signed -b 16 -c 1 -t raw -
# Kokoro neural TTS (requires: pip install vocal-ai[kokoro])
curl -X POST "http://localhost:8000/v1/audio/speech" \
-H "Content-Type: application/json" \
-d '{"model":"hexgrad/Kokoro-82M","input":"Hello world","voice":"af_heart"}' \
--output speech.mp3
# Qwen3-TTS (requires: pip install vocal-ai[qwen3-tts] + CUDA GPU)
curl -X POST "http://localhost:8000/v1/audio/speech" \
-H "Content-Type: application/json" \
-d '{"model":"qwen3-tts-1.7b-custom","input":"Hello world","voice":"aiden"}' \
--output speech.mp3

# Basic usage
docker compose up
# With GPU support
docker compose --profile gpu up
# Custom port
docker run -p 9000:8000 niradler/vocal-api

Port already in use:
# Windows
netstat -ano | findstr :8000
taskkill /F /PID <PID>
# Linux/Mac
lsof -ti:8000 | xargs kill

GPU not detected:
# Check CUDA
python -c "import torch; print(torch.cuda.is_available())"
# Check device info
curl http://localhost:8000/v1/system/device

Vocal runs on Windows, macOS, and Linux out of the box. The TTS engine automatically selects the best available backend per platform:
| Platform | TTS Backend | Notes |
|---|---|---|
| macOS | `say` (NSSpeechSynthesizer) | 170+ built-in voices |
| Linux | `espeak` / `espeak-ng` | Install via `apt install espeak` |
| Windows | SAPI5 (via pyttsx3) | Uses system voices |
Audio output is normalized through ffmpeg, supporting all formats (mp3, wav, opus, aac, flac, pcm) regardless of platform. Requires ffmpeg for non-WAV output formats.
List all available models
Query params:
- `status`: Filter by status (`available`, `downloading`, `not_downloaded`)
- `task`: Filter by task (`stt`, `tts`)
Get model information
Download a model (Ollama-style "pull")
Check download progress
Delete a downloaded model
Transcribe audio to text.
Parameters:
- `file` (required): Audio file (mp3, wav, m4a, etc.)
- `model` (required): Model ID (e.g., `Systran/faster-whisper-tiny`)
- `language` (optional): 2-letter language code (e.g., `en`, `es`)
- `response_format` (optional): `json` (default), `text`, `srt`, `vtt`
- `temperature` (optional): Sampling temperature (0.0-1.0, default: 0.0)
Response:
{
"text": "Hello, how are you today?",
"language": "en",
"duration": 2.5,
"segments": [
{
"id": 0,
"start": 0.0,
"end": 2.5,
"text": "Hello, how are you today?"
}
]
}

Translate audio to English text.
Convert text to speech.
Parameters:
- `model` (required): TTS model to use (e.g., `pyttsx3`, `hexgrad/Kokoro-82M`, `qwen3-tts-1.7b-custom`)
- `input` (required): Text to synthesize
- `voice` (optional): Voice ID to use (see `GET /v1/audio/voices`)
- `speed` (optional): Speech speed multiplier (0.25-4.0, default: 1.0)
- `response_format` (optional): `mp3` (default), `wav`, `opus`, `aac`, `flac`, `pcm`
- `stream` (optional): `false` (default); set to `true` to receive audio as a chunked transfer. With Kokoro, `wav` and `pcm` formats yield real per-sentence chunks; other formats fall back to a single chunk after generation.
Response (`stream: false`): Returns the audio file in the specified format with headers:

- `X-Duration`: Audio duration in seconds
- `X-Sample-Rate`: Audio sample rate
Response (`stream: true`): Returns a `Transfer-Encoding: chunked` streaming response. Audio bytes arrive as they are generated; the first chunk is delivered before full synthesis completes.
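For raw `pcm` output these headers fully determine the payload size. Assuming 16-bit signed mono samples (the framing used by the `play` command in the streaming curl example), duration, sample rate, and byte count relate as follows:

```python
def pcm_bytes(duration_s: float, sample_rate: int,
              sample_width: int = 2, channels: int = 1) -> int:
    """Expected raw-PCM payload size: samples x bytes-per-sample x channels."""
    return round(duration_s * sample_rate) * sample_width * channels

# 2.5 s at 24 kHz, 16-bit mono:
print(pcm_bytes(2.5, 24000))  # 120000 bytes
```

This is useful as a sanity check when consuming a stream: compare the accumulated byte count against `X-Duration` and `X-Sample-Rate`.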
List available TTS voices.
Response:
{
"voices": [
{
"id": "default",
"name": "Default Voice",
"language": "en",
"gender": null
}
],
"total": 1
}

Health check endpoint
Interactive Swagger UI for API testing
OpenAPI specification (auto-generated)
| Model ID | Size | Parameters | VRAM | Speed | Backend |
|---|---|---|---|---|---|
| `Systran/faster-whisper-tiny` | ~75MB | 39M | 1GB+ | Fastest | CTranslate2 |
| `Systran/faster-whisper-base` | ~145MB | 74M | 1GB+ | Fast | CTranslate2 |
| `Systran/faster-whisper-small` | ~488MB | 244M | 2GB+ | Good | CTranslate2 |
| `Systran/faster-whisper-medium` | ~1.5GB | 769M | 5GB+ | Better | CTranslate2 |
| `Systran/faster-whisper-large-v3` | ~3.1GB | 1.5B | 10GB+ | Best | CTranslate2 |
All STT models support 99+ languages. Use the aliases `whisper-tiny`, `whisper-base`, etc. for short names.
| Alias / Model ID | Size | Parameters | VRAM | Languages | Install extra |
|---|---|---|---|---|---|
| `pyttsx3` | - | - | None | System voices | built-in |
| `kokoro` / `hexgrad/Kokoro-82M` | ~347MB | 82M | 4GB+ | en (30+ voices) | `[kokoro]` |
| `kokoro-onnx` / `onnx-community/Kokoro-82M-ONNX` | ~1.3GB | 82M | 6GB+ | en | `[kokoro]` |
| `qwen3-tts-0.6b` / `Qwen/Qwen3-TTS-...-0.6B-Base` | ~2.3GB | 915M | 8GB+ | zh/en/ja/ko/de/fr/ru/pt/es/it | `[qwen3-tts]` |
| `qwen3-tts-1.7b` / `Qwen/Qwen3-TTS-...-1.7B-Base` | ~4.2GB | 1.9B | 8GB+ | zh/en/ja/ko/de/fr/ru/pt/es/it | `[qwen3-tts]` |
| `qwen3-tts-0.6b-custom` | ~2.3GB | 906M | 8GB+ | zh/en/ja/ko/de/fr/ru/pt/es/it | `[qwen3-tts]` |
| `qwen3-tts-1.7b-custom` | ~4.2GB | 1.9B | 8GB+ | zh/en/ja/ko | `[qwen3-tts]` |
Kokoro runs on CPU or GPU and supports real per-sentence streaming. Qwen3-TTS requires an NVIDIA CUDA GPU.
# Install optional backends
pip install vocal-ai[kokoro] # Kokoro neural TTS
pip install vocal-ai[qwen3-tts] # Qwen3-TTS (CUDA required)
pip install vocal-ai[kokoro,qwen3-tts]  # Both

Note (Kokoro): The `kokoro` package uses the spaCy `en_core_web_sm` model for English text processing. PyPI does not allow packages to declare direct URL dependencies, so it is not listed in the install extras. If Kokoro raises an error about a missing spaCy model, install it manually: `python -m spacy download en_core_web_sm`
Vocal automatically detects and optimizes for your hardware:
When NVIDIA GPU is available:
- Automatic Detection: GPU is detected and used automatically
- Optimal Compute Types:
  - 8GB+ VRAM: `float16` (best quality)
  - 4-8GB VRAM: `int8_float16` (balanced)
  - <4GB VRAM: `int8` (most efficient)
- 4x-10x Faster: GPU inference is significantly faster than CPU
- Memory Management: Automatic GPU cache clearing
When GPU is not available:
- Multi-threading: Uses optimal CPU threads based on core count
- Quantization: `int8` quantization for faster CPU inference
- VAD Filtering: Voice Activity Detection for improved performance
# View device info via API
curl http://localhost:8000/v1/system/device
# Or via SDK
from vocal_sdk import VocalSDK
client = VocalSDK()
info = client._request('GET', '/v1/system/device')
print(info)

Example output:
{
"platform": "Windows",
"cpu_count": 16,
"cuda_available": true,
"gpu_count": 1,
"gpu_devices": [
{
"name": "NVIDIA GeForce RTX 4090",
"vram_gb": 24.0,
"compute_capability": "8.9"
}
]
}

- GPU Usage: Models automatically use GPU when available
- Model Selection:
  - `tiny`/`base` models: Work well on CPU
  - `small`/`medium`: Best on GPU with 4GB+ VRAM
  - `large`: Requires GPU with 8GB+ VRAM
- Batch Processing: Load model once, transcribe multiple files
- VAD Filter: Enabled by default for better performance
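The batch-processing tip looks roughly like this with the SDK. The sketch takes `transcribe` as a callable so it also runs with a stub; file names are hypothetical:

```python
def transcribe_batch(paths, transcribe, model="Systran/faster-whisper-tiny"):
    """Transcribe many files against one warm model. The server's keep-alive
    caching means only the first call pays the model-load cost; `transcribe`
    is any callable with the SDK's signature, e.g. client.audio.transcribe."""
    return {path: transcribe(file=path, model=model)["text"] for path in paths}

# With a running server: transcribe_batch(files, client.audio.transcribe)
stub = lambda file, model: {"text": f"stub text for {file}"}
print(transcribe_batch(["a.mp3", "b.mp3"], stub))
```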
The CLI provides an intuitive command-line interface for common tasks.
# Transcribe audio file
vocal run audio.mp3
# Specify model
vocal run audio.mp3 --model Systran/faster-whisper-base
# Specify language
vocal run audio.mp3 --language en
# Output formats
vocal run audio.mp3 --format text
vocal run audio.mp3 --format json
vocal run audio.mp3 --format srt
vocal run audio.mp3 --format vtt

# List all models
vocal models list
# Filter by status
vocal models list --status available
vocal models list --status not_downloaded
# Download a model
vocal models pull Systran/faster-whisper-tiny
# Delete a model
vocal models delete Systran/faster-whisper-tiny
vocal models delete Systran/faster-whisper-tiny --force

# Start API server (default: http://0.0.0.0:8000)
vocal serve
# Custom host and port
vocal serve --host localhost --port 9000
# Enable auto-reload for development
vocal serve --reload

The project uses a uv workspace with multiple packages:

- `packages/core`: Core model registry and adapters (no dependencies on API)
- `packages/api`: FastAPI server (depends on core)
- `packages/sdk`: Auto-generated SDK (generated from the API OpenAPI spec)
- `packages/cli`: CLI tool (uses SDK)
All tests use real audio assets from test_assets/audio/ with validated transcriptions.
# Using Makefile
make test-quick
# Or directly
uv run python scripts/validate.py

# Using Makefile
make test
# With verbose output
make test-verbose
# Or using pytest directly
uv run python -m pytest tests/test_e2e.py -v

Current Status: 36/36 tests passing
Test coverage includes:
- API health and device information (GPU detection)
- Model management (list, download, status, delete)
- Audio transcription with real M4A and MP3 files
- Text-to-Speech synthesis in all formats (mp3, wav, opus, aac, flac, pcm)
- TTS voice selection and speed control
- Audio format validation and Content-Type headers
- Error handling for invalid models, files, and formats
- Performance and model reuse optimization
make gpu-check

# Using Makefile
make lint # Check code quality
make format # Format code
make check # Lint + format check
# Or using ruff directly
uv run ruff format .
uv run ruff check .

Vocal includes a comprehensive Makefile for common tasks:
make help # Show all available commands
# Setup
make install # Install dependencies
make sync # Sync dependencies
# Testing
make test # Run full test suite
make test-quick # Quick validation
make test-verbose # Verbose test output
make gpu-check # Check GPU detection
# Development
make serve # Start API server
make serve-dev # Start with auto-reload
make cli # Show CLI help
make docs # Open API docs in browser
# Code Quality
make lint # Run linter
make format # Format code
make check # Lint + format check
# Cleanup
make clean # Remove cache files
make clean-models # Remove downloaded models
# Quick aliases
make t # Alias for test
make s # Alias for serve
make l # Alias for lint
make f # Alias for format

Create a .env file:
APP_NAME=Vocal API
VERSION=0.1.0
DEBUG=true
CORS_ORIGINS=["*"]
MAX_UPLOAD_SIZE=26214400
# TTS Configuration
VOCAL_TTS_SAMPLE_RATE=16000  # Output sample rate in Hz (default: 16000)

TTS Configuration:

- `VOCAL_TTS_SAMPLE_RATE`: Output sample rate for all TTS audio (default: `16000` Hz / 16 kHz)
  - Common values: `8000` (phone quality), `16000` (wideband), `22050` (half CD rate), `44100` (CD quality), `48000` (professional)
  - All TTS output is resampled to this rate via ffmpeg
Models are cached at: ~/.cache/vocal/models/
We welcome contributions!
# Fork and clone
git clone <your-fork-url>
cd vocal
# Setup
uv venv && uv sync
# Run tests
make test
# Submit PR
git checkout -b feature/your-feature
# Make changes, commit, and push

- Core model registry with provider pattern
- Model management API (list, download, delete)
- SDK generation from OpenAPI spec
- Interactive Swagger UI docs
- CLI tool (Typer-based)
- Text-to-Speech (TTS) support
- Keep-alive model caching (5min default)
- GPU acceleration with CUDA
- OpenAI-compatible endpoints
- Published to PyPI as `vocal-ai`
- Qwen3-TTS 0.6B / 1.7B adapter (CUDA, 10 languages, custom-voice variants)
- Streaming TTS via `"stream": true`: chunked transfer, first bytes before full generation
- `faster-qwen3-tts` as optional install extra (`pip install vocal-ai[qwen3-tts]`)
1. Fix Model Metadata
- Why: Models currently show `0` size and missing info, which looks unfinished
- How: Fetch actual sizes from HuggingFace and populate all fields in the registry
2. Model Show Command
- Why: Users need to inspect models before downloading (like `ollama show`)
- How: `vocal models show whisper-tiny` displays params, size, languages, VRAM
3. Model Aliases
- Why: Typing full paths is tedious (`Systran/faster-whisper-tiny`)
- How: Use short names: `vocal run audio.mp3 -m whisper-tiny`
4. OpenAI Realtime API Compatible Endpoint (/v1/realtime)
- Why: The OpenAI Realtime API is becoming the standard protocol for low-latency voice agents. A self-hosted, local-first drop-in would make Vocal the go-to backend for any voice agent that currently points at OpenAI.
- How: WebSocket endpoint implementing the OpenAI Realtime event protocol (session lifecycle, `input_audio_buffer.append`, `response.audio.delta`). Audio input is piped through Vocal's STT adapters, the text response comes from a configurable external LLM URL (Ollama, vLLM, or any OpenAI-compatible chat endpoint), and audio output streams back via Vocal's TTS adapters. Vocal stays focused on voice; the LLM is pluggable.
/v1/realtime (WebSocket)
mic audio → Vocal STT → text → [your LLM] → text → Vocal TTS → audio chunks
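On the wire, each leg of that pipeline is a JSON event. A minimal sketch of the client-side `input_audio_buffer.append` message (field coverage is intentionally minimal; see the public OpenAI Realtime protocol for the full event set):

```python
import base64
import json

def append_audio_event(pcm_chunk: bytes) -> str:
    """Wrap a raw audio chunk in an input_audio_buffer.append event,
    base64-encoding the bytes as the Realtime protocol requires."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })

event = json.loads(append_audio_event(b"\x00\x01\x02"))
print(event["type"])  # input_audio_buffer.append
```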
5. Voice Registry System
- Why: Voices should be managed like models, not just system TTS
- How: `vocal voices list/pull/show` with downloadable voice models
6. Voice Cloning (XTTS-v2)
- Why: Custom voices are the killer feature for TTS
- How: `vocal voices clone my-voice --sample recording.wav`
7. Voice Preview
- Why: Users want to test voices before using them
- How: `vocal voices sample kokoro-en "Hello world"` generates a quick sample
Built with:
- FastAPI - Web framework
- faster-whisper - STT engine
- HuggingFace Hub - Model distribution
- uv - Python package manager