Give your AI coding agent a voice. Uses Mistral's Voxtral TTS to speak responses aloud — turning your terminal into a conversational AI assistant.
Works with any CLI agent that can run shell commands: Claude Code, Cursor, Copilot CLI, Aider, Gemini CLI, or even plain curl. Runs anywhere: Docker, macOS, Linux, Windows, WSL.
Includes a zero-dependency local TTS fallback that needs no API key at all.
Voxtral TTS: natural-sounding voice with emotional tones via Mistral's API. Requires an API key; the free tier works for development and testing.
Local system TTS: every major OS has built-in text-to-speech. No server, no API key, no dependencies. It sounds robotic but works instantly:
# macOS
say "Hello world"
# Windows / WSL
powershell.exe -Command "Add-Type -AssemblyName System.Speech; (New-Object System.Speech.Synthesis.SpeechSynthesizer).Speak('Hello world')"
# Linux (install espeak if not present)
espeak "Hello world"

You can use both: Voxtral for quality, and local TTS as a fallback when the server is down or you don't want to burn API credits.
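The fallback idea can be sketched in Python. This is an illustrative sketch, not the code in this repo: `local_tts_command` and `speak_with_fallback` are hypothetical names, `remote_speak` stands in for whatever function calls the Voxtral server, and the sketch assumes connection failures surface as `OSError` (as they do with `urllib`).

```python
import platform
import subprocess


def local_tts_command(text, system=None):
    """Return the argv for the OS's built-in TTS (hypothetical helper)."""
    system = system or platform.system()
    if system == "Darwin":
        return ["say", text]
    if system == "Windows":
        # Double single quotes so the text is safe inside a PowerShell string.
        escaped = text.replace("'", "''")
        script = (
            "Add-Type -AssemblyName System.Speech; "
            "(New-Object System.Speech.Synthesis.SpeechSynthesizer)"
            f".Speak('{escaped}')"
        )
        return ["powershell.exe", "-Command", script]
    # Assumption: espeak is installed on Linux.
    return ["espeak", text]


def speak_with_fallback(text, remote_speak, run=subprocess.run):
    """Try the remote TTS callable first; use local TTS if it raises OSError."""
    try:
        remote_speak(text)
        return "remote"
    except OSError:
        run(local_tts_command(text), check=False)
        return "local"
```

Passing `run` as a parameter keeps the sketch easy to test without producing sound. Under WSL, `platform.system()` reports Linux, so a caller would pass `system="Windows"` to reach the PowerShell path.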
git clone https://github.com/kshitizshankar/cli-agents-voice-interface.git
cd cli-agents-voice-interface
# Add your keys
cp .env.example .env
# Edit .env with your Mistral API key(s)
# Build and run
docker build -t cli-voice .
docker run -d -p 8765:8765 --env-file .env --name cli-voice cli-voice

With Docker, use /tts (returns audio bytes). Your client handles playback, since the container can't access your speakers.
# Test — fetch WAV and play locally
curl -o test.wav 'http://localhost:8765/tts?tone=cheerful&text=Hello+world'
afplay test.wav # macOS
aplay test.wav # Linux
# Windows: powershell.exe -Command "(New-Object Media.SoundPlayer 'test.wav').PlaySync()"

git clone https://github.com/kshitizshankar/cli-agents-voice-interface.git
cd cli-agents-voice-interface
python3 -m venv .venv
source .venv/bin/activate
pip install httpx
cp .env.example .env
# Edit .env with your Mistral API key(s)
python3 server.py

# Test: the server plays audio directly through your speakers
curl 'http://localhost:8765/speak?tone=cheerful&text=Hello+world'

Get API keys from console.mistral.ai. Add them to .env:
MISTRAL_API_KEYS=key1,key2,key3

Multiple keys enable automatic rotation when one hits rate limits.
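One way key rotation like this could work is sketched below. It is not the logic in server.py: `KeyRotator` and `call_with_rotation` are hypothetical names, `request` is a stand-in for the function that calls the API, and the sketch assumes a rate-limited key is signalled by HTTP status 429.

```python
import os


class KeyRotator:
    """Cycle through API keys, moving on when one is rate limited."""

    def __init__(self, keys):
        if not keys:
            raise ValueError("at least one API key is required")
        self.keys = list(keys)
        self.index = 0

    @classmethod
    def from_env(cls, value=None):
        # Parse a comma-separated list such as MISTRAL_API_KEYS=key1,key2,key3.
        raw = value if value is not None else os.environ.get("MISTRAL_API_KEYS", "")
        return cls([k.strip() for k in raw.split(",") if k.strip()])

    @property
    def current(self):
        return self.keys[self.index]

    def rotate(self):
        self.index = (self.index + 1) % len(self.keys)
        return self.current


def call_with_rotation(rotator, request):
    """Call request(key); on 429 rotate and retry, trying each key at most once."""
    for _ in range(len(rotator.keys)):
        status, body = request(rotator.current)
        if status != 429:
            return status, body
        rotator.rotate()
    # Every key was rate limited; return the last response.
    return status, body
```

The rotator keeps its position between calls, so a key that just hit its limit is not retried first on the next request.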
The /speak endpoint is best for native (non-Docker) setups where the server can access your speakers.
| Parameter | Default | Description |
|---|---|---|
| text | (required) | The text to speak |
| voice | paul | Voice character: paul, oliver, or jane |
| tone | neutral | Emotional tone (see below) |
| bg | 0 | Set to 1 for fire-and-forget (returns instantly, plays in background) |
For multi-sentence text, the server uses streaming playback: it plays each sentence as it arrives while prefetching the next in parallel.
The /tts endpoint is best for Docker, remote deployments, or when the client handles playback.
| Parameter | Default | Description |
|---|---|---|
| text | (required) | The text to speak |
| voice | paul | Voice character: paul, oliver, or jane |
| tone | neutral | Emotional tone (see below) |
Returns audio/wav content. Zero temp files created server-side.
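A client for the byte-returning endpoint might look like the following sketch. It uses only the standard library; `build_tts_url` and `fetch_wav` are hypothetical helper names, and the base URL assumes the server runs on localhost:8765 as in the curl examples.

```python
import urllib.parse
import urllib.request


def build_tts_url(text, voice="paul", tone="neutral", base="http://localhost:8765"):
    """Build a /tts URL with the query parameters from the table above."""
    query = urllib.parse.urlencode({"text": text, "voice": voice, "tone": tone})
    return f"{base}/tts?{query}"


def fetch_wav(text, path, **kwargs):
    """Download the WAV bytes for `text` and write them to `path`."""
    with urllib.request.urlopen(build_tts_url(text, **kwargs), timeout=30) as resp:
        data = resp.read()
    with open(path, "wb") as f:
        f.write(data)
    return len(data)
```

With the server running, `fetch_wav("Hello world", "test.wav", tone="cheerful")` would save a file you can play with afplay, aplay, or PowerShell.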
Returns JSON with all voice characters and their available tones.
Hot-reload .env without restarting the server. Use after adding or rotating API keys.
| Voice | Gender | Accent | Available tones |
|---|---|---|---|
| paul (default) | Male | US English | neutral, happy, cheerful, confident, excited, sad, frustrated, angry |
| oliver | Male | British English | neutral |
| jane | Female | British English | sarcasm |
If you request a tone that doesn't exist for a voice (e.g. oliver + excited), it gracefully falls back to that voice's default tone.
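The fallback rule can be sketched as a small lookup. This is an illustration rather than the server's code: `VOICES` and `resolve` are hypothetical names, and the sketch assumes each voice's default tone is the first one listed for it and that an unknown voice falls back to paul.

```python
# Voices and tones as listed in the table above.
VOICES = {
    "paul": ["neutral", "happy", "cheerful", "confident",
             "excited", "sad", "frustrated", "angry"],
    "oliver": ["neutral"],
    "jane": ["sarcasm"],
}
DEFAULT_VOICE = "paul"


def resolve(voice, tone):
    """Return a (voice, tone) pair that is guaranteed to exist."""
    if voice not in VOICES:
        voice = DEFAULT_VOICE  # assumption: unknown voices use the default voice
    tones = VOICES[voice]
    if tone not in tones:
        tone = tones[0]  # assumption: first listed tone is the voice's default
    return voice, tone
```

Under these assumptions, `resolve("oliver", "excited")` yields oliver with the neutral tone instead of an error.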
Want a different voice? Mistral's Voxtral TTS supports voice cloning from a few seconds of audio — any accent, any language. See the Voxtral TTS docs for details. You can also browse all preset voices with the included list_voices.py script.
Give each AI agent a distinct voice so you can tell them apart:
Claude Code → voice=paul (US male, 8 emotional tones)
Gemini CLI → voice=oliver (British male)
Third agent → voice=jane (British female, sarcastic)
For multi-sentence text (via /speak), the server:
- Splits text into sentences
- Fetches audio for sentence 1 from the Mistral API
- Starts playing sentence 1 as soon as it arrives (~2 seconds)
- While sentence 1 plays, prefetches sentence 2 in parallel
- Chains through all sentences seamlessly
You hear the first words within ~2 seconds, regardless of total text length.
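The steps above can be sketched with a single background worker that fetches the next sentence while the current one plays. This is a minimal sketch, not the code in server.py: `split_sentences` and `stream_speak` are hypothetical names, `fetch` and `play` are placeholders for the API call and the audio player, and the sentence splitter is a naive regex.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def split_sentences(text):
    """Naive split on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def stream_speak(text, fetch, play):
    """Play sentences in order, prefetching the next one during playback."""
    sentences = split_sentences(text)
    if not sentences:
        return 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, sentences[0])
        for upcoming in sentences[1:] + [None]:
            audio = pending.result()  # wait for the current sentence's audio
            if upcoming is not None:
                # Start fetching the next sentence before playback blocks.
                pending = pool.submit(fetch, upcoming)
            play(audio)
    return len(sentences)
```

Because only one sentence is fetched ahead, time to first audio depends on the first sentence alone, which matches the behaviour described above.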
This works with any CLI AI agent that can execute shell commands — Claude Code, Cursor, Copilot CLI, Aider, Gemini CLI, etc.
Docker:
docker run -d -p 8765:8765 --env-file .env --restart unless-stopped --name cli-voice cli-voice

Native:
cd /path/to/cli-agents-voice-interface
source .venv/bin/activate
nohup python3 server.py > /tmp/tts-server.log 2>&1 &

Add the prompt below to your agent's instructions: CLAUDE.md for Claude Code, the system prompt for Gemini CLI, custom instructions for Cursor, and so on.
Copy everything inside the code block into your agent's instructions. Replace /path/to/speak.sh with the actual path where you cloned the repo.
## Voice Output
You can speak responses aloud using a helper script. The script handles everything:
Voxtral TTS if the server is running, automatic fallback to local system TTS if not,
and cross-platform audio playback (macOS, Linux, Windows, WSL).
### How to speak
Run this command to speak:
/path/to/speak.sh "Your text here" TONE VOICE
Examples:
/path/to/speak.sh "Hello! Let me take a look at this code." cheerful paul
/path/to/speak.sh "Interesting approach here." confident oliver
/path/to/speak.sh "Oh wonderful, another singleton pattern." sarcasm jane
Parameters:
- Text (required): what to say
- Tone (optional, default: neutral): neutral, happy, cheerful, confident, excited, sad, frustrated, angry (paul); sarcasm (jane)
- Voice (optional, default: paul): paul (US male), oliver (British male), jane (British female)
IMPORTANT: This command blocks for 5-10 seconds while audio plays through the speakers.
That is normal — it is fetching audio from an API and playing it. Do not cancel it.
If the TTS server is not running, it automatically falls back to local system TTS (faster, robotic).
### How it works (so you understand the architecture)
In WSL, you are running Linux but audio hardware belongs to Windows. Linux audio tools
like aplay cannot reach Windows speakers. The script handles this by:
1. Fetching a WAV file from the TTS server (Docker or native)
2. Converting the Linux file path to a Windows path using wslpath
3. Playing audio via powershell.exe which runs on the Windows side and can access speakers
4. Cleaning up the temp file
On macOS it uses afplay, on native Linux it uses aplay. The script auto-detects.
### Rules
- Voice and text are DIFFERENT channels. Never duplicate content across both.
- Voice is for: reactions, confirmations, questions, encouragement, high-level summaries.
For discussions and back-and-forth, longer conversational voice is great.
- Text is for: code, file paths, commands, technical details, lists, anything to read or copy.
- NEVER say file paths, code, URLs, or technical details aloud. That is what text is for.
- Think: "would a human colleague say this out loud?" If not, it is text-only.
### Good examples
- "Done, the server is updated and running."
- "Found the bug — it was a null check. Fix is in."
- "Hey, quick question — do you want me to refactor this or just patch it?"
### Bad examples (never do this)
- "I updated slash home slash user slash server dot py with the new config."
- "The error was on line 47 of src utils parser ts where the optional chaining..."
- Reading out file paths, URLs, or code aloud.

Docker: use --restart unless-stopped (shown above).
Native: add to your shell profile (~/.bashrc, ~/.zshrc, etc.):
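# Start the TTS server if it is not already running.
# The subshell keeps the cd and venv activation out of your interactive shell.
if ! pgrep -f "python3 server.py" > /dev/null; then
  (
    cd /path/to/cli-agents-voice-interface || exit
    source .venv/bin/activate
    nohup python3 server.py > /tmp/tts-server.log 2>&1 &
  )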
fi

| File | Purpose |
|---|---|
| server.py | TTS HTTP server with streaming playback, voice selection, and key rotation |
| speak.sh | One-command voice for agents: handles platform detection, playback, and fallback |
| Dockerfile | Container image: Python 3.12-slim + httpx |
| .env.example | Template for API keys |
| speak.py | Standalone Python CLI script |
| LICENSE | MIT license |
| list_voices.py | List available Voxtral voices |
| list_all_voices.py | List all voices with details |
For Voxtral TTS:
- Python 3.10+ with httpx, or just Docker
- Mistral API key: free tier works for development/testing, paid plans for production (console.mistral.ai)
- Self-hosted option: Voxtral TTS can run locally if you have a capable GPU. Zero API calls, zero rate limits, full privacy. Just swap the API endpoint in server.py to point to your local instance.
- For /speak (server-side playback): auto-detects afplay (macOS), aplay (Linux), PowerShell (Windows/WSL)
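Player auto-detection of this kind could be sketched as follows. This is an assumption-laden illustration, not the repo's detection logic: `is_wsl` and `playback_command` are hypothetical names, and WSL is detected with the common heuristic of looking for "microsoft" in /proc/version.

```python
import platform


def is_wsl(version_text=None):
    """Heuristic: WSL kernels mention 'microsoft' in /proc/version."""
    if version_text is None:
        try:
            with open("/proc/version") as f:
                version_text = f.read()
        except OSError:
            return False
    return "microsoft" in version_text.lower()


def playback_command(wav_path, system=None, wsl=None):
    """Return the argv that plays a WAV file on the current platform."""
    system = system or platform.system()
    if system == "Darwin":
        return ["afplay", wav_path]
    if wsl is None:
        wsl = system == "Linux" and is_wsl()
    if system == "Windows" or wsl:
        # Under WSL, wav_path must already be a Windows path
        # (for example, converted with `wslpath -w`).
        script = f"(New-Object Media.SoundPlayer '{wav_path}').PlaySync()"
        return ["powershell.exe", "-Command", script]
    return ["aplay", wav_path]
```

The returned list can be handed to `subprocess.run`. Checking for WSL before falling through to aplay matters because WSL reports itself as Linux but cannot reach the Windows speakers through Linux audio tools.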
For local system TTS:
- Nothing. Built into macOS, Windows, and most Linux distros.
MIT