Python 3.12 CLI for audio transcription and text-to-speech through OpenRouter.
Old transcription behavior is now available through the audio transcribe subcommand:
audio transcribe /path/to/fileText-to-speech is available through audio tts and uses google/gemini-3.1-flash-tts-preview by default:
audio tts "Hello from Gemini TTS" --out speech.oggInstall dependencies with uv:
uv syncPut your OpenRouter key into .env:
OPENROUTER_API_KEY=sk-or-v1-...TTS conversion requires ffmpeg, because Gemini TTS output is requested as raw PCM and then converted locally to ogg or mp3.
For local development, run through uv:
uv run audio transcribe /path/to/audio.wav
uv run audio tts "Hello" --out speech.oggTo expose audio as a regular command in your current Python environment:
uv tool install .audio transcribe [OPTIONS] MEDIA_PATHOptions:
--model TEXT OpenRouter model to use. Default: google/gemini-3.1-flash-lite
--prompt TEXT Instruction sent with the audio. Default: Generate a transcript of the speech.
--out TEXT stdout or output file path. Default: stdout
--timeout FLOAT HTTP timeout in seconds. Default: 120
Examples:
audio transcribe ./voice-message.mp3
audio transcribe ./voice-message.mp3 --out transcript.txt
audio transcribe ./voice-message.mp3 > transcript.txt
audio transcribe ./voice-message.mp3 --prompt "Transcribe this speech verbatim. Keep the original language."For OpenClaw, use:
["audio", "transcribe", "{{MediaPath}}"]Supported transcription file extensions: aac, aiff, flac, m4a, mp3, ogg, opus, pcm16, pcm24, wav, webm.
audio tts [OPTIONS] [TEXT ...]If TEXT is omitted, audio tts reads text from stdin.
Options:
--voices Print supported voices and exit
--models Print available OpenRouter TTS models and exit
--model TEXT OpenRouter TTS model. Default: google/gemini-3.1-flash-tts-preview
--voice TEXT Voice to use. Default: Zephyr for Gemini, alloy for OpenAI TTS
--out TEXT Output audio path. Default: speech.ogg
--format FORMAT Output format: ogg or mp3. Inferred from --out when omitted;
when set, --out extension is adjusted to match
--timeout FLOAT HTTP timeout in seconds. Default: 120
Examples:
audio tts "Привет, это голосовой ответ" --out answer.ogg
audio tts "Hello" --voice Puck --out answer.mp3
audio tts --model openai/gpt-4o-mini-tts-2025-12-15 "Hello" --voice nova --out answer.ogg
printf "Long text" | audio tts --out narration.ogg
audio tts --voices
audio tts --modelsaudio tts --models queries OpenRouter for models with output_modalities=speech and prints all currently available TTS model IDs.
openai/gpt-4o-audio-preview is not a TTS model on OpenRouter's /audio/speech endpoint. For OpenAI TTS, use openai/gpt-4o-mini-tts-2025-12-15.
Voices:
Zephyr
Puck
Charon
Kore
Fenrir
Leda
Orus
Aoede
Callirrhoe
Autonoe
Enceladus
Iapetus
Umbriel
Algieba
Despina
Erinome
Algenib
TTS requests pcm from OpenRouter first. The CLI then converts the 24 kHz, 16-bit, mono PCM stream to ogg by default, or to mp3 when requested. If a provider rejects PCM and only supports MP3, the CLI automatically retries with response_format=mp3 and converts that file when needed.
The CLI uses the OpenAI Python SDK pointed at https://openrouter.ai/api/v1.
Transcription sends base64-encoded local audio to the chat completions API using an input_audio message part, which keeps custom transcription prompts available.