Releases: psyb0t/docker-talkies

v0.9.0

09 Jun 07:37

docker-talkies v0.9.0 — Nemotron-3.5-ASR (parakeet.cpp) + GPU drain b…

v0.8.0

31 May 16:34

docker-talkies v0.8.0 — Qwen3-TTS CustomVoice + VoiceDesign + 1.7B Ba…

v0.7.0

31 May 10:17

docker-talkies v0.7.0 — Qwen3-TTS PCM streaming + supply-chain
bump-on-mutation Makefile workflow.

Minor release. Two user-visible threads.

1. PCM streaming for Qwen3-TTS. response_format="pcm" against a
   qwen3_tts model now streams the raw PCM body via HTTP/1.1
   chunked transfer-encoding instead of buffering the full
   utterance. First-audio latency drops from ~3-8 s (synthesise +
   buffer) to ~200-700 ms (time to first audio, measured at the
   first decoded chunk). Marked WIP in the original development
   commit: the endpoint is live, but edge cases are still being
   soak-tested. Other formats and the Kokoro backends are
   unchanged. New env var TALKIES_QWEN3_STREAM_CHUNK_SIZE (default
   8) controls codec-steps-per-chunk.

2. pkg-* Makefile workflow. New make targets (pkg-lock / pkg-add /
   pkg-update / pkg-upgrade / pkg-remove) call
   scripts/bump_exclude_newer.sh before any uv operation so the
   [tool.uv] exclude-newer age gate is always anchored to the
   moment of the mutation. Closes the "silent drift forward" hole.
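
The age-gate bump can be pictured with a small Python sketch. The
real logic lives in scripts/bump_exclude_newer.sh, whose contents are
not shown in these notes; the function name, the regex, and the choice
to stamp the exact current UTC time (rather than, say, now minus a
cooldown window) are assumptions made for illustration only.

```python
import re
from datetime import datetime, timezone

# Matches a line such as:  exclude-newer = "2025-01-01T00:00:00Z"
_EXCLUDE_NEWER = re.compile(r'^(\s*exclude-newer\s*=\s*)"[^"]*"', re.MULTILINE)


def bump_exclude_newer(pyproject_text: str, now: datetime) -> str:
    """Return pyproject text with exclude-newer set to `now` (UTC, RFC 3339).

    Hypothetical stand-in for scripts/bump_exclude_newer.sh.
    """
    stamp = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    new_text, count = _EXCLUDE_NEWER.subn(
        lambda m: f'{m.group(1)}"{stamp}"', pyproject_text
    )
    if count != 1:
        # Refuse to guess if the key is missing or appears more than once.
        raise ValueError(f"expected exactly one exclude-newer line, found {count}")
    return new_text
```

Running the bump before every uv mutation means the timestamp never
lags behind the lockfile it is supposed to gate.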

Plus housekeeping: .gitattributes enforces LF on shell scripts,
Dockerfile.cuda strips CRLF defensively, and the qwen3-tts xvec_only
kwarg fix is included (a parallel patch with the same content as
v0.6.1's fix).

Caller code that assumed Content-Length on /v1/audio/speech needs
to adapt for the qwen3_tts + response_format=pcm case. Every
other code path is wire-compatible with v0.6.1.
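
A minimal sketch of a caller that no longer relies on Content-Length.
It assumes the OpenAI-style request fields (model, input, voice,
response_format), the qwen3-tts-0.6b slug, the alloy voice, and 16-bit
PCM samples; none of these are confirmed for this release by the notes
above, and the helper names are hypothetical. Network chunks can end
in the middle of a sample, so the helper re-aligns them first.

```python
from typing import Iterable, Iterator


def iter_whole_samples(chunks: Iterable[bytes], sample_bytes: int = 2) -> Iterator[bytes]:
    """Re-chunk a byte stream so every yielded piece holds whole samples."""
    pending = b""
    for chunk in chunks:
        pending += chunk
        usable = len(pending) - (len(pending) % sample_bytes)
        if usable:
            yield pending[:usable]
            pending = pending[usable:]
    # A trailing partial sample (if any) is dropped.


def stream_speech(base_url: str, text: str) -> Iterator[bytes]:
    """Hypothetical client: yield PCM as the server produces it."""
    import requests  # third-party; imported lazily so the helper above has no deps

    payload = {
        "model": "qwen3-tts-0.6b",  # assumed slug
        "input": text,
        "voice": "alloy",           # assumed voice
        "response_format": "pcm",
    }
    with requests.post(
        f"{base_url}/v1/audio/speech", json=payload, stream=True, timeout=60
    ) as resp:
        resp.raise_for_status()
        # chunk_size=None hands over data as it arrives instead of buffering.
        yield from iter_whole_samples(resp.iter_content(chunk_size=None))
```

Feeding each yielded piece straight to an audio sink is what lets
playback start before synthesis finishes.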

v0.6.2 was a local-only tag (never published) — this is the next
public release.

v0.6.1

30 May 16:10

docker-talkies v0.6.1 — fix qwen3-tts kwarg regression from v0.6.0.

Patch release. v0.6.0 shipped PR #1's Qwen3-TTS instructions
wiring + x-vector fallback with a wrong kwarg name on
`model.generate_voice_clone(...)`, so every Qwen3 synth request
returned HTTP 500 with a TypeError.

Fix: `x_vector_only_mode=` → `xvec_only=` (the correct name on
faster_qwen3_tts==0.2.6's higher-level voice-clone API).
`instruct=` was already right.

New tests guard the instructions field, the x-vector fallback,
and the Kokoro protocol-bump compatibility path.
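
One cheap way to guard against this class of regression is to check
keyword names against a callable's signature without running the
model. The sketch below uses a stand-in function, not the real
faster_qwen3_tts API: its positional parameters are invented, and only
the two keyword names come from these notes. It is not necessarily how
the project's own tests work.

```python
import inspect


def fake_generate_voice_clone(text, ref_audio, *, xvec_only=False, instruct=None):
    """Stand-in with a hypothetical signature; returns empty audio."""
    return b""


def kwargs_accepted(func, **kwargs) -> bool:
    """True if `func` would accept these keyword names at call time."""
    try:
        inspect.signature(func).bind_partial(**kwargs)
    except TypeError:
        return False
    return True
```

In a real test the stand-in would be replaced by the library callable,
so a renamed kwarg fails in CI instead of at request time.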

Kokoro slugs (kokoro-82m, kokoro-82m-nvidia) were unaffected by
the v0.6.0 regression.

No breaking change. No new dependency.

v0.6.0

30 May 15:45

docker-talkies v0.6.0 — kokoro-82m-nvidia ONNX backend, qwen3-tts
instructions wiring, self-spawning integration test harness.

Minor bump. Three additive threads, no breaking change.

1. New TTS slug `kokoro-82m-nvidia` (nvidia/kokoro-82M-onnx-opt,
   Apache-2.0). Same Kokoro-82M weights as `kokoro-82m`, same
   40-voice catalog, same wire shape, served via ONNXRuntime
   against NVIDIA's TensorRT-friendly ONNX export. No PyTorch on
   the inference hot path. G2P via espeak-ng. Pick this slug for
   ORT execution; pick `kokoro-82m` for misaki-driven G2P quality.

2. PR #1 (martincohen): qwen3-tts now honours the `instructions`
   request field — passed through to faster-qwen3-tts as the
   `instruct` parameter. Voices without a sibling `.txt` transcript
   now fall back to x-vector-only mode (with a warning log)
   instead of returning 400. Kokoro continues to accept and ignore
   `instructions` for OpenAI wire-shape parity.

3. Integration test harness refactor. Every test_*.sh / e2e_*.sh
   self-spawns its own --rm --gpus all container on an ephemeral
   port, runs its checks, tears the container down on EXIT trap.
   `bash tests/integration/<file>` does the whole lifecycle
   without an external orchestrator. `run.sh` is now a dispatcher
   that runs each file as a subprocess.
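
The self-spawning lifecycle from item 3 can be sketched in Python. The
actual harness is written in bash; the image name, the container port
(8000), and every function name here are assumptions, and try/finally
plays the role of the EXIT trap.

```python
import socket
import subprocess
from typing import Callable, List


def free_port() -> int:
    """Ask the OS for an unused TCP port on loopback."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def docker_run_argv(image: str, host_port: int, container_port: int = 8000) -> List[str]:
    """Build the argv for a throwaway GPU container."""
    return [
        "docker", "run", "--rm", "--detach", "--gpus", "all",
        "--publish", f"127.0.0.1:{host_port}:{container_port}",
        image,
    ]


def with_container(image: str, check: Callable[[str], None]) -> None:
    """Start a container, run `check(base_url)`, always tear down."""
    port = free_port()
    started = subprocess.run(
        docker_run_argv(image, port), check=True, capture_output=True, text=True
    )
    container_id = started.stdout.strip()
    try:
        check(f"http://127.0.0.1:{port}")
    finally:
        # --rm removes the container once it stops.
        subprocess.run(["docker", "stop", container_id], check=False)
```

A real check would first poll the server until it is ready, which this
sketch leaves out.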

Round-trip verified: kokoro-82m-nvidia synth →
whisper-large-v3-turbo transcribes to the expected phrase,
proving the ONNX backend produces intelligible English, not just
well-formed bytes. test_speech.sh 15/15, test_endpoints.sh 7/7,
e2e_kokoro_nvidia.sh 7/7, 11 unit tests green.
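
The "expected phrase" assertion in a round-trip test has to tolerate
ASR punctuation and casing. A minimal sketch of such a comparison,
with hypothetical helper names (the project's scripts may compare
differently):

```python
import re
from typing import List


def normalize(text: str) -> List[str]:
    """Lower-case and split into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def transcript_contains(transcript: str, expected: str) -> bool:
    """True if the expected words appear, in order and adjacent, in the transcript."""
    hay, needle = normalize(transcript), normalize(expected)
    n = len(needle)
    if n == 0:
        return False
    return any(hay[i:i + n] == needle for i in range(len(hay) - n + 1))
```

Matching on normalized word sequences keeps the test strict about
intelligibility while ignoring formatting differences between ASR
models.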

No breaking change. New slug is additive; every other slug
behaves identically (with Qwen3 `instructions` now honoured
instead of dropped — behaviour upgrade, not wire-shape change).

v0.5.0

28 May 18:23

docker-talkies v0.5.0 — drop distil-whisper-large-v3.

Minor bump (breaking pre-1.0). distil-whisper-large-v3 was English-
only and lived alongside the multilingual whisper-large-v3 (OG, max
accuracy) and whisper-large-v3-turbo (multilingual, 8× faster) —
redundant for the value it provided. Removing it.

CUDA registry now: 6 ASR (whisper×2, parakeet, canary×3) + 2 TTS
(kokoro, qwen3) = 8 models.
CPU registry now: 3 ASR (whisper×2, canary-180m) + 1 TTS (kokoro)
= 4 models.

Migration: TALKIES_ENABLED_MODELS=...distil-whisper-large-v3 → drop
the slug or replace with whisper-large-v3-turbo (multilingual) or
whisper-large-v3 (max accuracy). No API or wire-format change.
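
For deployments that template TALKIES_ENABLED_MODELS, the migration
can be scripted. This sketch assumes the variable is a comma-separated
slug list, which the notes above do not spell out; the function name
is hypothetical.

```python
from typing import List, Optional

REMOVED_SLUG = "distil-whisper-large-v3"


def migrate_enabled_models(
    value: str, replacement: Optional[str] = "whisper-large-v3-turbo"
) -> str:
    """Drop the removed slug, or swap it for `replacement`, keeping order."""
    result: List[str] = []
    for slug in (part.strip() for part in value.split(",")):
        if not slug:
            continue
        if slug == REMOVED_SLUG:
            if replacement is None:
                continue  # plain removal
            slug = replacement
        if slug not in result:  # avoid listing a model twice
            result.append(slug)
    return ",".join(result)
```

Pass replacement=None to drop the slug outright, or
replacement="whisper-large-v3" to favour accuracy over speed.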

v0.4.1

28 May 17:30

docker-talkies v0.4.1 — README rewrite for above-the-fold conversion.

Patch release. Pure docs refresh + a tiny .gitignore tweak. No
behavior change, no API change, no new models or endpoints.

README highlights:
- One-sentence tagline + Python drop-in snippet in the first 25 lines
  (was buried in 4 paragraphs of prose).
- 7 single-line feature bullets: ASR / TTS / voice cloning / hot
  swap / MCP / diarization / CPU+CUDA.
- Quick Start trimmed to 1 docker run + 1 curl above the fold; full
  examples + TOC folded into <details>.
- Dense "how it works" prose moved below the fold, unchanged.

.gitignore: about.txt added to local-tooling section.

v0.4.0

28 May 17:08

docker-talkies v0.4.0 — Qwen3-TTS voice cloning + custom voices.

Second TTS engine (qwen3-tts-0.6b, CUDA-only) alongside Kokoro, with a
/data/custom-voices/ user-mount convention for voice cloning. Renames
the local host cache dir ~/.talkies-models → ~/.talkies-data.

Highlights:
- faster-qwen3-tts 0.2.6 backend, bfloat16 + SDPA. First synth captures
  CUDA graphs (~30-60s); subsequent calls sub-second.
- 3 builtin Qwen3 voices bundled (alloy/echo/fable as cloned samples)
  plus user-mountable /data/custom-voices/. Nested subdirs preserved
  in the voice name. Sibling <name>.txt (ref text) and <name>.lang
  (language) honored.
- Path-traversal guard on voice resolution.
- /v1/audio/voices now reports origin: "builtin" | "custom".
- Qwen3 CUDA check deferred to load time so the server boots on CPU
  hosts when qwen3-tts-0.6b is excluded via TALKIES_ENABLED_MODELS.
- Integration suite: 7 new qwen3 tests; transcribe loop skips TTS
  slugs via a /v1/models-derived ASR-only list.
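
The custom-voice convention and the traversal guard can be illustrated
as follows. This is a sketch, not the shipped resolver: the .wav
extension, the VoiceRef type, and the function name are assumptions;
only the /data/custom-voices/ layout and the sibling .txt / .lang
files come from the notes above.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class VoiceRef:
    audio: Path
    ref_text: Optional[str]   # contents of <name>.txt, if present
    language: Optional[str]   # contents of <name>.lang, if present


def resolve_custom_voice(root: Path, name: str, suffix: str = ".wav") -> VoiceRef:
    """Map a voice name like 'team/alice' to files under `root`."""
    root = root.resolve()
    candidate = (root / (name + suffix)).resolve()
    # Guard: after resolving '..' and symlinks the file must still sit under root.
    if root not in candidate.parents:
        raise ValueError(f"voice name escapes the voices directory: {name!r}")
    if not candidate.is_file():
        raise FileNotFoundError(name)

    def read_sibling(ext: str) -> Optional[str]:
        path = candidate.with_suffix(ext)
        return path.read_text(encoding="utf-8").strip() if path.is_file() else None

    return VoiceRef(candidate, read_sibling(".txt"), read_sibling(".lang"))
```

A missing ref_text is the case where a caller would have to decide
between rejecting the request and cloning from the audio alone.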

Backwards-compatible: existing /v1/audio/speech against kokoro-82m,
/v1/audio/transcriptions, the MCP tool surface, and all model slugs
work identically.

v0.3.0

28 May 14:32

docker-talkies v0.3.0 — Kokoro TTS.

Adds OpenAI-compatible /v1/audio/speech with mp3/opus/aac/flac/wav/pcm
output, /v1/audio/voices discovery, kokoro-82m in both CPU and CUDA
images. New backend protocol split (BackendBase / ASRBackend /
TTSBackend). Cross-modality eviction shares one VRAM pool between ASR
and TTS; idle TTL sweeper applies to both.
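
To make "one VRAM pool, one sweeper" concrete, here is a toy model of
the idea. It assumes least-recently-used eviction and made-up model
sizes; the class and method names are hypothetical, and the real
backend manager may choose victims differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Loaded:
    kind: str         # "asr" or "tts"
    size_mb: int
    last_used: float  # seconds on any monotonic clock


@dataclass
class SharedPool:
    budget_mb: int
    idle_ttl_s: float
    models: Dict[str, Loaded] = field(default_factory=dict)

    def used_mb(self) -> int:
        return sum(m.size_mb for m in self.models.values())

    def acquire(self, slug: str, kind: str, size_mb: int, now: float) -> List[str]:
        """Load or touch `slug`; return the slugs evicted to make room."""
        if slug in self.models:
            self.models[slug].last_used = now
            return []
        if size_mb > self.budget_mb:
            raise ValueError(f"{slug} cannot fit in the pool at all")
        evicted: List[str] = []
        while self.used_mb() + size_mb > self.budget_mb:
            # The victim is chosen by age only, never by modality.
            victim = min(self.models, key=lambda s: self.models[s].last_used)
            del self.models[victim]
            evicted.append(victim)
        self.models[slug] = Loaded(kind, size_mb, now)
        return evicted

    def sweep(self, now: float) -> List[str]:
        """Unload every model, ASR or TTS, idle for at least the TTL."""
        stale = [s for s, m in self.models.items() if now - m.last_used >= self.idle_ttl_s]
        for slug in stale:
            del self.models[slug]
        return stale
```

Because eviction ignores `kind`, a TTS request can push out an idle
ASR model and vice versa, which is the point of sharing the pool.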

Both runtime images now bundle en_core_web_sm so Kokoro's English G2P
never tries to pip-download at first call (runtime has no pip).

Integration suite gains a cross-modality round-trip test (Kokoro synth
→ fast ASR → assert expected words) plus CPU/memory caps on the test
container to keep the host responsive while inference is running.

Backwards-compatible: all existing ASR endpoints, model slugs, MCP
tools, and response shapes work identically.

v0.2.1

28 May 11:58

docker-talkies v0.2.1 — agent skill scaffolding + speaches credit.

Docs-only release. Adds .agents/.skills/talkies/ (SKILL.md +
references/setup.md + scripts/bulk_transcribe.sh) so AI agents can
discover and use the talkies API without re-reading the full README.
README gains a Credits section linking speaches as the inspiration
project. No runtime, API, config, or wire-format changes.