Go implementation of KittenTTS — an ultra-lightweight ONNX-based text-to-speech engine. Self-contained binaries with no Python dependency.
Try it now: the Colab notebook builds the project and synthesizes speech in three clicks — no local setup, no GPU.
It produces two binaries:
- kitten-tts: a CLI tool for one-off speech generation, well suited to AI agent skills.
- kitten-tts-server: an OpenAI-compatible API server with SSE streaming support.
Adapted from: KittenML/KittenTTS (Apache-2.0). All model weights are from the original project.
- Ultra-lightweight: 15M to 80M parameter models; the smallest is just 25 MB (int8)
- CPU-optimized: ONNX-based inference runs efficiently without a GPU
- 8 built-in voices: Bella, Jasper, Luna, Bruno, Rosie, Hugo, Kiki, and Leo
- Adjustable speech speed: control playback rate via -speed
- Text normalization: built-in pipeline handles numbers and currencies
- 24 kHz output: high-quality audio at a standard sample rate
- Multiple audio formats: MP3, FLAC, WAV, and PCM (pure Go) plus OGG Opus (via libopus)
This port relies on three system dependencies:
Go inference uses yalue/onnxruntime_go, which loads the ONNX Runtime shared library dynamically at runtime.
# macOS
brew install onnxruntime
# Ubuntu/Debian — download a release from
# https://github.com/microsoft/onnxruntime/releases and place
# libonnxruntime.so on your library path
The library is auto-detected at common locations (/usr/local/lib, /opt/homebrew/lib, /usr/lib, …). To point at a specific file, set the ONNXRUNTIME_LIB_PATH environment variable:
export ONNXRUNTIME_LIB_PATH=/path/to/libonnxruntime.dylib
espeak-ng provides phonemization:
# macOS
brew install espeak-ng
# Ubuntu/Debian
sudo apt-get install -y espeak-ng
Opus output is encoded with libopus via cgo. The hraban/opus binding links both libopus and libopusfile (#cgo pkg-config: opus opusfile), so both are build-time dependencies and must also be on the library path at runtime. Building therefore requires a C compiler and CGO_ENABLED=1 (the default for native builds).
# macOS
brew install opus opusfile pkg-config
# Ubuntu/Debian
sudo apt-get install -y libopus-dev libopusfile-dev pkg-config
The libopus, libopusfile, and pkg-config packages above are only needed when building from source. The released binaries statically link libopus and libopusfile, so the target machine needs only the ONNX Runtime shared library (dlopen'd at runtime) and espeak-ng; no Opus libraries are required.
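The ONNX Runtime lookup described above (an explicit ONNXRUNTIME_LIB_PATH first, then common library directories) can be sketched roughly as follows. This is an illustration only: resolveLibrary, libraryName, and the exact directory list are hypothetical and are not taken from this project's source.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"runtime"
)

// libraryName returns the shared-library file name for the given GOOS.
// Only the two platforms covered by this README are handled.
func libraryName(goos string) string {
	if goos == "darwin" {
		return "libonnxruntime.dylib"
	}
	return "libonnxruntime.so"
}

// resolveLibrary returns override when it is set; otherwise it returns the
// first directory in dirs that contains name. exists is injected so the
// lookup can be exercised without touching the filesystem.
func resolveLibrary(override, name string, dirs []string, exists func(string) bool) (string, error) {
	if override != "" {
		return override, nil
	}
	for _, dir := range dirs {
		candidate := filepath.Join(dir, name)
		if exists(candidate) {
			return candidate, nil
		}
	}
	return "", fmt.Errorf("%s not found in %v; set ONNXRUNTIME_LIB_PATH", name, dirs)
}

// fileExists reports whether path can be stat'ed.
func fileExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	// Assumed search list, based on the locations named in this README.
	dirs := []string{"/usr/local/lib", "/opt/homebrew/lib", "/usr/lib"}
	path, err := resolveLibrary(os.Getenv("ONNXRUNTIME_LIB_PATH"), libraryName(runtime.GOOS), dirs, fileExists)
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("would load:", path)
}
```

Passing the existence check as a function keeps the search order testable on machines where ONNX Runtime is not installed.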
| Model | Parameters | Size | Download |
|---|---|---|---|
| kitten-tts-mini | 80M | 80 MB | KittenML/kitten-tts-mini-0.8 |
| kitten-tts-micro | 40M | 41 MB | KittenML/kitten-tts-micro-0.8 |
| kitten-tts-nano | 15M | 56 MB | KittenML/kitten-tts-nano-0.8 |
| kitten-tts-nano (int8) | 15M | 25 MB | KittenML/kitten-tts-nano-0.8-int8 |
Models are not vendored in this repository. Fetch one into ./models with the
helper script:
scripts/fetch_model.sh nano-int8 # also: nano, micro, mini (default: nano-int8)
It reads the ONNX and voices filenames from the model's config.json, so it works for every model above. (./models is git-ignored.) Equivalent manual download:
mkdir -p models/kitten-tts-nano-int8
for FILE in config.json kitten_tts_nano_v0_8.onnx voices.npz; do
curl -L -o "models/kitten-tts-nano-int8/$FILE" \
"https://huggingface.co/KittenML/kitten-tts-nano-0.8-int8/resolve/main/$FILE"
done
Build both binaries:
go build -o bin/ ./...
# Binaries at: bin/kitten-tts and bin/kitten-tts-server
Building requires a C compiler and libopus/libopusfile (see Dependencies), since Opus encoding is compiled in via cgo.
Pushing a v* tag triggers .github/workflows/release.yml,
which builds the binaries with go build on GitHub runners and publishes a
GitHub Release with one .tar.gz per platform plus checksums.txt:
git tag v0.1.1
git push origin v0.1.1
libopus and libopusfile are statically linked from source, so each target is built on its own native runner (Linux amd64/arm64 and macOS arm64), except darwin/amd64, which is cross-compiled on the Apple Silicon runner (clang is a universal toolchain). To build locally:
go build -o bin/ ./...
scripts/smoke_test.sh builds the binaries and exercises
every audio format plus SSE streaming, fully offline, against a local model:
# Pass a model dir (or set KITTEN_MODEL_DIR, or place one at models/kitten-tts-nano-int8)
scripts/smoke_test.sh /path/to/models/kitten-tts-nano-int8
# Test prebuilt/release binaries instead of building:
KITTEN_BIN_DIR=dist/... scripts/smoke_test.sh /path/to/model
It checks the CLI (wav/mp3/flac/opus/pcm + -list-voices) and the server
(/health, /v1/models, all formats, streaming, and 400 validation), printing a
pass/fail summary and exiting non-zero on failure. Needs espeak-ng and the ONNX
Runtime library; uses ffprobe for codec checks when available.
Following Go convention, flags come before the positional arguments
(<model_dir> <text> [voice]):
# Basic usage (outputs output.wav)
./bin/kitten-tts ./models/kitten-tts-nano-int8 'Hello, world!' Bruno
# Specify voice, speed, and output (flags first)
./bin/kitten-tts -voice Luna -speed 1.2 -output hello.wav ./models/kitten-tts-nano-int8 'Hello, world!'
# Encode directly to another format with -format
./bin/kitten-tts -format mp3 -output hello.mp3 ./models/kitten-tts-nano-int8 'Hello, world!'
# List available voices
./bin/kitten-tts -list-voices ./models/kitten-tts-nano-int8
CLI flags (single or double dash both work):
| Flag | Default | Description |
|---|---|---|
| -voice, -v | Bruno | Voice name (overrides the positional voice) |
| -speed, -s | 1.0 | Speech speed multiplier |
| -output, -o | output.wav | Output file path |
| -format | wav | Output format: wav, mp3, flac, opus, pcm |
| -no-clean | | Disable text normalization (numbers, currency) |
| -list-voices | | List available voices and exit |
Because the CLI uses Go's standard flag package, parsing stops at the first positional argument, so any flag placed after it is treated as a positional. Put flags first.
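The flags-first rule is easy to reproduce with the standard library alone. The sketch below uses a throwaway FlagSet with a single -voice flag; parseArgs is a hypothetical helper, not the CLI's real argument handling.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// parseArgs parses args with one -voice flag (default "Bruno") and returns
// the flag value plus whatever the flag package left as positional arguments.
func parseArgs(args []string) (voice string, positional []string, err error) {
	fs := flag.NewFlagSet("demo", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the demo output
	fs.StringVar(&voice, "voice", "Bruno", "voice name")
	if err = fs.Parse(args); err != nil {
		return "", nil, err
	}
	return voice, fs.Args(), nil
}

func main() {
	// Flag first: -voice is consumed as a flag.
	voice, rest, _ := parseArgs([]string{"-voice", "Luna", "./models/demo", "Hello"})
	fmt.Println(voice, rest)

	// Flag last: parsing stops at "./models/demo", so -voice and Luna
	// stay in the positional list and the default voice is kept.
	voice, rest, _ = parseArgs([]string{"./models/demo", "Hello", "-voice", "Luna"})
	fmt.Println(voice, rest)
}
```

In the second call the voice remains Bruno and the positional list has four entries, which is why a trailing flag appears to be silently ignored.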
Flags first, then the model directory:
./bin/kitten-tts-server -host 0.0.0.0 -port 8080 ./models/kitten-tts-nano-int8
The server exposes an OpenAI-compatible /v1/audio/speech endpoint:
curl -X POST http://localhost:8080/v1/audio/speech \
-H 'Content-Type: application/json' \
-d '{
"model": "kitten-tts",
"input": "Hello, world! This is KittenTTS running as an API server.",
"voice": "alloy"
}' \
--output speech.mp3
Request body:
| Field | Type | Default | Description |
|---|---|---|---|
| input | string | (required) | Text to synthesize |
| voice | string | (required) | Voice name (OpenAI or KittenTTS names) |
| model | string | "" | Accepted for compatibility; ignored |
| response_format | string | "mp3" | Output audio format (see below) |
| speed | float | 1.0 | Speech speed multiplier (0.25–4.0) |
| stream | bool | false | Enable SSE streaming (requires "pcm" format) |
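For client authors, the rules in the request-body table can be written down as a small validator. This is a sketch under assumptions: SpeechRequest and parseSpeechRequest are hypothetical names, and the server's real types and error messages may differ.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// SpeechRequest mirrors the fields in the request-body table.
// Speed is a pointer so an omitted value can be told apart from 0.
type SpeechRequest struct {
	Model          string   `json:"model"`
	Input          string   `json:"input"`
	Voice          string   `json:"voice"`
	ResponseFormat string   `json:"response_format"`
	Speed          *float64 `json:"speed"`
	Stream         bool     `json:"stream"`
}

// parseSpeechRequest decodes body, fills in the documented defaults, and
// rejects requests that break the documented constraints.
func parseSpeechRequest(body []byte) (SpeechRequest, error) {
	var req SpeechRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return req, err
	}
	if req.Input == "" {
		return req, errors.New("input is required")
	}
	if req.Voice == "" {
		return req, errors.New("voice is required")
	}
	if req.ResponseFormat == "" {
		req.ResponseFormat = "mp3"
	}
	if req.Speed == nil {
		def := 1.0
		req.Speed = &def
	}
	if *req.Speed < 0.25 || *req.Speed > 4.0 {
		return req, fmt.Errorf("speed %.2f is outside 0.25 to 4.0", *req.Speed)
	}
	if req.Stream && req.ResponseFormat != "pcm" {
		return req, errors.New(`stream requires response_format "pcm"`)
	}
	return req, nil
}

func main() {
	req, err := parseSpeechRequest([]byte(`{"input":"Hello","voice":"alloy"}`))
	fmt.Println(req.ResponseFormat, *req.Speed, err)
}
```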
Supported audio formats:
| Format | Content-Type | Description |
|---|---|---|
| mp3 | audio/mpeg | MP3 (resampled to 44.1 kHz, pure-Go shine encoder) |
| flac | audio/flac | FLAC lossless (24 kHz native, pure-Go mewkiz/flac) |
| wav | audio/wav | WAV 16-bit PCM (24 kHz native) |
| pcm | audio/pcm | Raw 16-bit signed little-endian PCM (24 kHz) |
| opus | audio/ogg | Opus in OGG container (resampled to 48 kHz) |
| aac | n/a | Not supported (returns error) |
Note on Opus: there is no pure-Go Opus encoder, so Opus uses libopus via cgo (hraban/opus) plus a hand-written RFC 7845 OGG writer. libopus and libopusfile must be installed at build time (see Dependencies).
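To give a sense of what a hand-written RFC 7845 writer involves, here is a sketch of the 19-byte OpusHead identification packet that opens an Ogg Opus stream. The field layout follows RFC 7845 section 5.1; the function name opusHead and the pre-skip value of 312 are assumptions for illustration, and the project's own writer may be organized differently.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// opusHead builds the identification header for channel mapping family 0
// (mono or stereo). All multi-byte fields are little-endian.
func opusHead(channels uint8, preSkip uint16, inputRate uint32) []byte {
	head := make([]byte, 19)
	copy(head[0:8], "OpusHead")                           // magic signature
	head[8] = 1                                           // version
	head[9] = channels                                    // output channel count
	binary.LittleEndian.PutUint16(head[10:12], preSkip)   // samples to discard at 48 kHz
	binary.LittleEndian.PutUint32(head[12:16], inputRate) // original rate, informational only
	binary.LittleEndian.PutUint16(head[16:18], 0)         // output gain, 0 dB
	head[18] = 0                                          // channel mapping family 0
	return head
}

func main() {
	// 312 is an assumed pre-skip; a real encoder should report its own lookahead.
	head := opusHead(1, 312, 24000)
	fmt.Printf("% x\n", head)
}
```

The input sample rate field records the source rate (24 kHz here) even though the encoded stream itself runs at 48 kHz.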
API endpoints:
| Method | Path | Description |
|---|---|---|
| POST | /v1/audio/speech | Generate speech from text |
| GET | /v1/models | List the loaded model |
| GET | /health | Health check |
Voice mapping (OpenAI → KittenTTS):
| OpenAI | KittenTTS | Gender |
|---|---|---|
| alloy | Bella | Female |
| echo | Jasper | Male |
| fable | Luna | Female |
| onyx | Bruno | Male |
| nova | Rosie | Female |
| shimmer | Hugo | Male |
All 8 KittenTTS voices (Bella, Jasper, Luna, Bruno, Rosie, Hugo, Kiki, Leo) can also be used directly by name.
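The mapping table plus the direct-name rule amounts to a two-step lookup. The sketch below is one possible shape for it; resolveVoice is a hypothetical helper, and matching names case-insensitively is an assumption rather than documented server behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// openAIVoices maps OpenAI voice names to KittenTTS voices, per the table.
var openAIVoices = map[string]string{
	"alloy":   "Bella",
	"echo":    "Jasper",
	"fable":   "Luna",
	"onyx":    "Bruno",
	"nova":    "Rosie",
	"shimmer": "Hugo",
}

// kittenVoices lists the eight built-in voices.
var kittenVoices = []string{"Bella", "Jasper", "Luna", "Bruno", "Rosie", "Hugo", "Kiki", "Leo"}

// resolveVoice returns the KittenTTS voice for an OpenAI alias or a direct
// voice name. Case-insensitive matching is an assumption of this sketch.
func resolveVoice(name string) (string, error) {
	if voice, ok := openAIVoices[strings.ToLower(name)]; ok {
		return voice, nil
	}
	for _, voice := range kittenVoices {
		if strings.EqualFold(voice, name) {
			return voice, nil
		}
	}
	return "", fmt.Errorf("unknown voice %q", name)
}

func main() {
	for _, name := range []string{"alloy", "Kiki", "nobody"} {
		voice, err := resolveVoice(name)
		fmt.Println(name, voice, err)
	}
}
```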
For lower time-to-first-audio on longer texts, set "stream": true with "response_format": "pcm". The server returns Server-Sent Events with base64-encoded PCM audio chunks, compatible with the OpenAI streaming TTS format:
curl -N -X POST http://localhost:8080/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "kitten-tts",
"input": "Hello, this is a streaming test. Each sentence is sent as a separate audio chunk.",
"voice": "alloy",
"response_format": "pcm",
"stream": true
}'
Each event is a JSON object on a data: line:
data: {"type":"speech.audio.delta","delta":"<base64-encoded-pcm>"}
data: {"type":"speech.audio.delta","delta":"<base64-encoded-pcm>"}
data: {"type":"speech.audio.done"}
The delta field contains 16-bit signed little-endian PCM at 24 kHz, base64-encoded. The first chunk is split at the earliest clause boundary for fast initial playback.
The tts package is a self-contained engine. New returns a *tts.Model;
encoders live in the audio package behind a small interface.
package main

import (
	"os"

	"github.com/itamaker/kitten-tts-go/audio"
	"github.com/itamaker/kitten-tts-go/tts"
)

func main() {
	// Load the model directory.
	model, err := tts.New("models/kitten-tts-nano-int8")
	if err != nil {
		panic(err)
	}
	defer model.Close()

	// Synthesize with the Bruno voice at normal speed.
	samples, err := model.Generate("Hello, world!", "Bruno", 1.0, true)
	if err != nil {
		panic(err)
	}

	// Look up an encoder by name: mp3, wav, flac, opus, or pcm.
	enc, err := audio.NewEncoder("mp3")
	if err != nil {
		panic(err)
	}
	data, err := enc.Encode(samples)
	if err != nil {
		panic(err)
	}
	if err := os.WriteFile("hello.mp3", data, 0o644); err != nil {
		panic(err)
	}
}
The phonemizer is an interface, so you can swap espeak-ng for your own (handy in tests) via a functional option:
model, err := tts.New(dir, tts.WithPhonemizer(myPhonemizer))
// myPhonemizer implements tts.Phonemizer: Phonemize(string) (string, error)
Project layout:
kitten-tts-go/
├── tts/ # Core TTS engine (importable library)
│ ├── tts.go # Model type, Generate/GenerateChunk, options
│ ├── load.go # New: read config.json and load a model directory
│ ├── onnx.go # ONNX session, isolated behind one file
│ ├── phonemes.go # Phonemizer interface, espeak-ng impl, token IDs
│ ├── normalize.go # Number/currency/whitespace normalization
│ ├── chunk.go # Sentence/streaming text chunking
│ └── voices.go # NPZ voice embedding loader
├── audio/ # Encoder interface + one file per format
│ ├── audio.go # Encoder interface, registry, NewEncoder, resampling
│ ├── wav.go # WAV + PCM
│ ├── mp3.go # MP3 (shine)
│ ├── flac.go # FLAC (mewkiz)
│ └── opus.go # OGG Opus (libopus) + hand-written OGG container
└── cmd/
    ├── kitten-tts/ # CLI (stdlib flag)
    └── kitten-tts-server/ # OpenAI-compatible API server (net/http)
- Normalization: expands numbers ("42" → "forty-two") and currencies ("$10.50" → "ten dollars and fifty cents"), and collapses whitespace
- Phonemization: converts English text to IPA phonemes via a Phonemizer (espeak-ng by default)
- Token encoding: maps IPA phonemes to integer token IDs using a symbol table matching the original Python implementation
- Voice selection: loads style embeddings from the NPZ voice file
- ONNX inference: runs the model with input tokens, voice style, and speed parameters
- Audio encoding: an audio.Encoder produces MP3 (the server default), FLAC, WAV (the CLI default), Opus, or raw PCM
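The normalization step is the easiest one to illustrate in isolation. The sketch below spells out integers from 0 to 999 and whole dollar-and-cent amounts, reproducing the two examples in the list above. The names words and currency are hypothetical, and the real normalize.go likely covers far more cases (larger numbers, singular units, other currencies).

```go
package main

import "fmt"

var ones = []string{
	"zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
	"ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
	"seventeen", "eighteen", "nineteen",
}

var tens = []string{"", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"}

// words spells out n for 0 <= n <= 999. Values outside that range are
// returned as digits, since this sketch does not handle them.
func words(n int) string {
	switch {
	case n < 0 || n > 999:
		return fmt.Sprint(n)
	case n < 20:
		return ones[n]
	case n < 100:
		if n%10 == 0 {
			return tens[n/10]
		}
		return tens[n/10] + "-" + ones[n%10]
	default:
		head := ones[n/100] + " hundred"
		if n%100 == 0 {
			return head
		}
		return head + " " + words(n%100)
	}
}

// currency spells out a dollar amount. It always uses plural units,
// a simplification a full normalizer would not make.
func currency(dollars, cents int) string {
	return fmt.Sprintf("%s dollars and %s cents", words(dollars), words(cents))
}

func main() {
	fmt.Println(words(42))
	fmt.Println(currency(10, 50))
}
```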
A few idiomatic-Go choices worth knowing:
- tts.Model is the core type, constructed with tts.New.
- Audio formats are an Encoder interface resolved by name from a registry (audio.NewEncoder), one file per format, not a single switch statement.
- Phonemization is the Phonemizer interface; the default is espeak-ng and tts.WithPhonemizer lets you swap it (e.g. in tests).
- The CLI uses the standard-library flag package (flags before positionals); the server is plain net/http.
- ONNX inference uses yalue/onnxruntime_go, which dlopens the runtime; there is no cgo link to onnxruntime.
- Encoders: MP3 (shine-mp3) and FLAC (mewkiz/flac) are pure Go; Opus uses hraban/opus (libopus via cgo). AAC is not supported.
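As an example of the Phonemizer swap mentioned above, a test double can be a simple lookup table. The sketch is self-contained, so it declares a local Phonemizer interface with the same method signature this README gives for tts.Phonemizer; fixedPhonemizer and its canned IPA string are hypothetical.

```go
package main

import "fmt"

// Phonemizer mirrors the method set documented for tts.Phonemizer.
type Phonemizer interface {
	Phonemize(text string) (string, error)
}

// fixedPhonemizer returns canned IPA strings, so tests that use it do not
// need espeak-ng installed.
type fixedPhonemizer map[string]string

// Phonemize looks text up in the table and fails on anything unknown.
func (f fixedPhonemizer) Phonemize(text string) (string, error) {
	ipa, ok := f[text]
	if !ok {
		return "", fmt.Errorf("no canned phonemes for %q", text)
	}
	return ipa, nil
}

func main() {
	var p Phonemizer = fixedPhonemizer{"hello": "həlˈoʊ"}
	ipa, err := p.Phonemize("hello")
	fmt.Println(ipa, err)
}
```

With the real package, a value like this would be passed as tts.New(dir, tts.WithPhonemizer(p)), provided its method matches tts.Phonemizer.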
MIT — see LICENSE.
This is an independent Go implementation. The upstream KittenTTS project and its model weights are Apache-2.0 (see Acknowledgments); those weights are downloaded separately and are not part of this repository.
- KittenML for the original KittenTTS models and Python library
- yalue/onnxruntime_go for the Go ONNX Runtime bindings
- braheezy/shine-mp3 and mewkiz/flac for pure-Go audio encoding
- hraban/opus and libopus for Opus encoding
- espeak-ng for phonemization