CLI AI Agents' Voice Interface

Give your AI coding agent a voice. Uses Mistral's Voxtral TTS to speak responses aloud — turning your terminal into a conversational AI assistant.

Works with any CLI agent that can run shell commands: Claude Code, Cursor, Copilot CLI, Aider, Gemini CLI, or even plain curl. Runs anywhere: Docker, macOS, Linux, Windows, WSL.

Includes a zero-dependency local TTS fallback that needs no API key at all.

Two ways to get voice

1. Voxtral TTS (high quality, needs API key)

Natural-sounding voice with emotional tones via Mistral's API. The free tier works for development and testing. Requires an API key.

2. Local system TTS (instant, zero setup)

Every major OS has built-in text-to-speech. No server, no API key, no dependencies. Sounds robotic but works instantly:

# macOS
say "Hello world"

# Windows / WSL
powershell.exe -Command "Add-Type -AssemblyName System.Speech; (New-Object System.Speech.Synthesis.SpeechSynthesizer).Speak('Hello world')"

# Linux (install espeak if not present)
espeak "Hello world"

You can use both — Voxtral for quality, local TTS as a fallback when the server is down or you don't want to burn API credits.

Quick start (Voxtral)

Docker (recommended)

git clone https://github.com/kshitizshankar/cli-agents-voice-interface.git
cd cli-agents-voice-interface

# Add your keys
cp .env.example .env
# Edit .env with your Mistral API key(s)

# Build and run
docker build -t cli-voice .
docker run -d -p 8765:8765 --env-file .env --name cli-voice cli-voice

With Docker, use /tts (returns audio bytes) — your client handles playback since the container can't access speakers.

# Test — fetch WAV and play locally
curl -o test.wav 'http://localhost:8765/tts?tone=cheerful&text=Hello+world'
afplay test.wav        # macOS
aplay test.wav         # Linux
# Windows: powershell.exe -Command "(New-Object Media.SoundPlayer 'test.wav').PlaySync()"

Native Python

git clone https://github.com/kshitizshankar/cli-agents-voice-interface.git
cd cli-agents-voice-interface
python3 -m venv .venv
source .venv/bin/activate
pip install httpx

cp .env.example .env
# Edit .env with your Mistral API key(s)

python3 server.py

# Test — server plays audio directly through your speakers
curl 'http://localhost:8765/speak?tone=cheerful&text=Hello+world'

Get your API keys

Get API keys from console.mistral.ai. Add them to .env:

MISTRAL_API_KEYS=key1,key2,key3

Multiple keys enable automatic rotation when one hits rate limits.

API

`GET /speak` — server plays audio

Best for native (non-Docker) setups where the server can access your speakers.

Parameter	Default	Description
`text`	(required)	The text to speak
`voice`	`paul`	Voice character: `paul`, `oliver`, or `jane`
`tone`	`neutral`	Emotional tone (see below)
`bg`	`0`	Set to `1` for fire-and-forget (returns instantly, plays in background)

For multi-sentence text, uses streaming playback — plays each sentence as it arrives while prefetching the next in parallel.

`GET /tts` — returns WAV audio bytes

Best for Docker, remote deployments, or when the client handles playback.

Parameter	Default	Description
`text`	(required)	The text to speak
`voice`	`paul`	Voice character: `paul`, `oliver`, or `jane`
`tone`	`neutral`	Emotional tone (see below)

Returns audio/wav content. Zero temp files created server-side.

`GET /voices` — list available voices

Returns JSON with all voice characters and their available tones.

`GET /reload`

Hot-reload .env without restarting the server. Use after adding or rotating API keys.

Voice characters

Voice	Gender	Accent	Available tones
`paul` (default)	Male	US English	neutral, happy, cheerful, confident, excited, sad, frustrated, angry
`oliver`	Male	British English	neutral
`jane`	Female	British English	sarcasm

If you request a tone that doesn't exist for a voice (e.g. oliver + excited), it gracefully falls back to that voice's default tone.

Want a different voice? Mistral's Voxtral TTS supports voice cloning from a few seconds of audio — any accent, any language. See the Voxtral TTS docs for details. You can also browse all preset voices with the included list_voices.py script.

Multi-agent setup

Give each AI agent a distinct voice so you can tell them apart:

Claude Code  → voice=paul    (US male, 8 emotional tones)
Gemini CLI   → voice=oliver  (British male)
Third agent  → voice=jane    (British female, sarcastic)

Streaming playback

For multi-sentence text (via /speak), the server:

Splits text into sentences
Fetches audio for sentence 1 from the Mistral API
Starts playing sentence 1 as soon as it arrives (~2 seconds)
While sentence 1 plays, prefetches sentence 2 in parallel
Chains through all sentences seamlessly

You hear the first words within ~2 seconds, regardless of total text length.

Using with your AI agent

This works with any CLI AI agent that can execute shell commands — Claude Code, Cursor, Copilot CLI, Aider, Gemini CLI, etc.

Step 1: Start the server (optional — local TTS needs no server)

Docker:

docker run -d -p 8765:8765 --env-file .env --restart unless-stopped --name cli-voice cli-voice

Native:

cd /path/to/cli-agents-voice-interface
source .venv/bin/activate
nohup python3 server.py > /tmp/tts-server.log 2>&1 &

Step 2: Add the voice prompt to your agent

Add the prompt below to your agent's instructions — CLAUDE.md for Claude Code, system prompt for Gemini CLI, custom instructions for Cursor, etc.

Voice Prompt for AI Agents

Copy everything inside the code block into your agent's instructions. Replace /path/to/speak.sh with the actual path where you cloned the repo.

## Voice Output

You can speak responses aloud using a helper script. The script handles everything:
Voxtral TTS if the server is running, automatic fallback to local system TTS if not,
and cross-platform audio playback (macOS, Linux, Windows, WSL).

### How to speak

Run this command to speak:
  /path/to/speak.sh "Your text here" TONE VOICE

Examples:
  /path/to/speak.sh "Hello! Let me take a look at this code." cheerful paul
  /path/to/speak.sh "Interesting approach here." confident oliver
  /path/to/speak.sh "Oh wonderful, another singleton pattern." sarcasm jane

Parameters:
  - Text (required): what to say
  - Tone (optional, default: neutral): neutral, happy, cheerful, confident, excited, sad, frustrated, angry
  - Voice (optional, default: paul): paul (US male), oliver (British male), jane (British female)

IMPORTANT: This command blocks for 5-10 seconds while audio plays through the speakers.
That is normal — it is fetching audio from an API and playing it. Do not cancel it.
If the TTS server is not running, it automatically falls back to local system TTS (faster, robotic).

### How it works (so you understand the architecture)

In WSL, you are running Linux but audio hardware belongs to Windows. Linux audio tools
like aplay cannot reach Windows speakers. The script handles this by:
1. Fetching a WAV file from the TTS server (Docker or native)
2. Converting the Linux file path to a Windows path using wslpath
3. Playing audio via powershell.exe which runs on the Windows side and can access speakers
4. Cleaning up the temp file

On macOS it uses afplay, on native Linux it uses aplay. The script auto-detects.

### Rules
- Voice and text are DIFFERENT channels. Never duplicate content across both.
- Voice is for: reactions, confirmations, questions, encouragement, high-level summaries.
  For discussions and back-and-forth, longer conversational voice is great.
- Text is for: code, file paths, commands, technical details, lists, anything to read or copy.
- NEVER say file paths, code, URLs, or technical details aloud. That is what text is for.
- Think: "would a human colleague say this out loud?" If not, it is text-only.

### Good examples
- "Done, the server is updated and running."
- "Found the bug — it was a null check. Fix is in."
- "Hey, quick question — do you want me to refactor this or just patch it?"

### Bad examples (never do this)
- "I updated slash home slash user slash server dot py with the new config."
- "The error was on line 47 of src utils parser ts where the optional chaining..."
- Reading out file paths, URLs, or code aloud.

Auto-start

Docker — use --restart unless-stopped (shown above).

Native — add to your shell profile (~/.bashrc, ~/.zshrc, etc.):

if ! pgrep -f "python3 server.py" > /dev/null; then
    cd /path/to/cli-agents-voice-interface
    source .venv/bin/activate
    nohup python3 server.py > /tmp/tts-server.log 2>&1 &
fi

Files

File	Purpose
`server.py`	TTS HTTP server with streaming playback, voice selection, and key rotation
`speak.sh`	One-command voice for agents — handles platform detection, playback, and fallback
`Dockerfile`	Container image — Python 3.12-slim + httpx
`.env.example`	Template for API keys
`speak.py`	Standalone Python CLI script
`LICENSE`	MIT license
`list_voices.py`	List available Voxtral voices
`list_all_voices.py`	List all voices with details

Requirements

For Voxtral TTS:

Python 3.10+ with httpx — or just Docker
Mistral API key — free tier works for development/testing, paid plans for production (console.mistral.ai)
Self-hosted option — Voxtral TTS can run locally if you have a capable GPU. Zero API calls, zero rate limits, full privacy. Just swap the API endpoint in server.py to point to your local instance.
For /speak (server-side playback): auto-detects afplay (macOS), aplay (Linux), PowerShell (Windows/WSL)

For local system TTS:

Nothing. Built into macOS, Windows, and most Linux distros.

License

MIT

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CLI AI Agents' Voice Interface

Two ways to get voice

1. Voxtral TTS (high quality, needs API key)

2. Local system TTS (instant, zero setup)

Quick start (Voxtral)

Docker (recommended)

Native Python

Get your API keys

API

`GET /speak` — server plays audio

`GET /tts` — returns WAV audio bytes

`GET /voices` — list available voices

`GET /reload`

Voice characters

Multi-agent setup

Streaming playback

Using with your AI agent

Step 1: Start the server (optional — local TTS needs no server)

Step 2: Add the voice prompt to your agent

Voice Prompt for AI Agents

Auto-start

Files

Requirements

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
list_all_voices.py		list_all_voices.py
list_voices.py		list_voices.py
server.py		server.py
speak.py		speak.py
speak.sh		speak.sh

Folders and files

Latest commit

History

Repository files navigation

CLI AI Agents' Voice Interface

Two ways to get voice

1. Voxtral TTS (high quality, needs API key)

2. Local system TTS (instant, zero setup)

Quick start (Voxtral)

Docker (recommended)

Native Python

Get your API keys

API

GET /speak — server plays audio

GET /tts — returns WAV audio bytes

GET /voices — list available voices

GET /reload

Voice characters

Multi-agent setup

Streaming playback

Using with your AI agent

Step 1: Start the server (optional — local TTS needs no server)

Step 2: Add the voice prompt to your agent

Voice Prompt for AI Agents

Auto-start

Files

Requirements

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`GET /speak` — server plays audio

`GET /tts` — returns WAV audio bytes

`GET /voices` — list available voices

`GET /reload`

Packages