Releases: psyb0t/docker-talkies
v0.9.0
v0.8.0
docker-talkies v0.8.0 — Qwen3-TTS CustomVoice + VoiceDesign + 1.7B Ba…
v0.7.0
docker-talkies v0.7.0: Qwen3-TTS PCM streaming + supply-chain bump-on-mutation Makefile workflow.

Minor release. Two user-visible threads.

1. PCM streaming for Qwen3-TTS. `response_format="pcm"` against a qwen3_tts model now streams the raw PCM body via HTTP/1.1 chunked transfer-encoding instead of buffering the full utterance. First-audio latency drops from ~3-8 s (synthesise + buffer) to ~200-700 ms (TTFA on first decoded chunk). Marked WIP in the original development commit: the surface is live, edge cases are still soaking. Other formats and the Kokoro backends are unchanged. New env var `TALKIES_QWEN3_STREAM_CHUNK_SIZE` (default 8) controls codec-steps-per-chunk.
2. pkg-* Makefile workflow. New make targets (pkg-lock / pkg-add / pkg-update / pkg-upgrade / pkg-remove) call scripts/bump_exclude_newer.sh before any uv operation, so the [tool.uv] exclude-newer age gate is always anchored to the moment of the mutation. Closes the "silent drift forward" hole.

Plus housekeeping: .gitattributes enforces LF on shell scripts, Dockerfile.cuda strips CRLF defensively, and the qwen3-tts xvec_only kwarg fix landed (parallel patch, same content as v0.6.1's fix).

v0.6.2 was a local-only tag (never published); this is the next public release.

Caller code that assumed Content-Length on /v1/audio/speech needs to adapt for the qwen3_tts + response_format=pcm case. Every other code path is wire-compatible with v0.6.1.
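A minimal sketch of how a caller might consume the streamed PCM body without relying on Content-Length. Several things here are assumptions rather than facts from these notes: the samples are taken to be 16-bit signed little-endian, the JSON field names `input` and `voice` follow OpenAI's speech API (which the project describes itself as compatible with), and `base_url`, `model` and `voice` are caller-supplied placeholders. Because chunk boundaries need not align with sample boundaries, `iter_samples` carries any leftover byte into the next chunk.

```python
import json
import struct
import urllib.request
from typing import Iterable, Iterator


def iter_samples(chunks: Iterable[bytes]) -> Iterator[int]:
    """Yield int16 samples from byte chunks of arbitrary size.

    Assumes 16-bit signed little-endian PCM (not stated in the release
    notes). A chunk may end mid-sample, so the leftover byte is carried
    over to the next chunk.
    """
    carry = b""
    for chunk in chunks:
        data = carry + chunk
        usable = len(data) - (len(data) % 2)
        for (sample,) in struct.iter_unpack("<h", data[:usable]):
            yield sample
        carry = data[usable:]


def stream_speech(base_url: str, model: str, voice: str, text: str,
                  read_size: int = 4096) -> Iterator[bytes]:
    """POST to /v1/audio/speech and yield the body in blocks until EOF.

    Reads until the stream ends instead of trusting Content-Length,
    which is absent on a chunked response.
    """
    payload = json.dumps({
        "model": model,          # placeholder: a qwen3_tts slug
        "voice": voice,          # placeholder: any available voice
        "input": text,           # field name assumed from OpenAI's API
        "response_format": "pcm",
    }).encode()
    req = urllib.request.Request(
        base_url.rstrip("/") + "/v1/audio/speech",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        while True:
            block = resp.read(read_size)
            if not block:
                break
            yield block
```

The two pieces compose as `iter_samples(stream_speech(...))`, so playback or further processing can start on the first decoded block rather than after the full utterance.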
v0.6.1
docker-talkies v0.6.1: fix qwen3-tts kwarg regression from v0.6.0.

Patch release. v0.6.0 shipped PR #1's Qwen3-TTS instructions wiring + x-vector fallback with a wrong kwarg name on `model.generate_voice_clone(...)`, so every Qwen3 synth request 500'd with TypeError.

Fix: `x_vector_only_mode=` → `xvec_only=` (the correct name on faster_qwen3_tts==0.2.6's higher-level voice-clone API). `instruct=` was already right.

New tests guard the instructions field, the x-vector fallback, and the Kokoro protocol-bump compatibility path. Kokoro slugs (kokoro-82m, kokoro-82m-nvidia) were unaffected by the v0.6.0 regression.

No breaking change. No new dependency.
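A sketch of a regression guard for this class of bug. `generate_voice_clone_stub` is a hypothetical stand-in that only carries the two keyword names mentioned above; it is not the real faster_qwen3_tts signature, and the project's actual tests may look quite different. `inspect.signature(...).bind` checks keyword names without running any model.

```python
import inspect


def generate_voice_clone_stub(text, *, instruct=None, xvec_only=False):
    """Hypothetical stand-in; only the keyword names come from the notes."""
    return {"text": text, "instruct": instruct, "xvec_only": xvec_only}


def accepts_kwargs(func, **kwargs) -> bool:
    """Return True if func accepts one positional argument plus kwargs."""
    try:
        inspect.signature(func).bind("placeholder text", **kwargs)
    except TypeError:
        # Same exception class a real call with a bad kwarg would raise.
        return False
    return True
```

Pointing `accepts_kwargs` at the real callable in a unit test would catch a renamed keyword at test time instead of on the first synth request.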
v0.6.0
docker-talkies v0.6.0: kokoro-82m-nvidia ONNX backend, qwen3-tts instructions wiring, self-spawning integration test harness.

Minor bump. Three additive threads, no breaking change.

1. New TTS slug `kokoro-82m-nvidia` (nvidia/kokoro-82M-onnx-opt, Apache-2.0). Same Kokoro-82M weights as `kokoro-82m`, same 40-voice catalog, same wire shape, served via ONNXRuntime against NVIDIA's TensorRT-friendly ONNX export. No PyTorch on the inference hot path. G2P via espeak-ng. Pick this slug for ORT execution; pick `kokoro-82m` for misaki-driven G2P quality.
2. PR #1 (martincohen): qwen3-tts now honours the `instructions` request field, passed through to faster-qwen3-tts as the `instruct` parameter. Voices without a sibling `.txt` transcript now fall back to x-vector-only mode (with a warning log) instead of returning 400. Kokoro continues to accept and ignore `instructions` for OpenAI wire-shape parity.
3. Integration test harness refactor. Every test_*.sh / e2e_*.sh self-spawns its own --rm --gpus all container on an ephemeral port, runs its checks, and tears the container down on an EXIT trap. `bash tests/integration/<file>` does the whole lifecycle without an external orchestrator. `run.sh` is now a dispatcher that runs each file as a subprocess.

Round-trip verified: kokoro-82m-nvidia synth → whisper-large-v3-turbo transcribes to the expected phrase, proving the ONNX backend produces intelligible English, not just well-formed bytes. test_speech.sh 15/15, test_endpoints.sh 7/7, e2e_kokoro_nvidia.sh 7/7, 11 unit tests green.

No breaking change. New slug is additive; every other slug behaves identically (with Qwen3 `instructions` now honoured instead of dropped: a behaviour upgrade, not a wire-shape change).
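The self-spawning lifecycle of the integration harness (ephemeral port, `--rm --gpus all` container, teardown on exit) can be sketched in Python as follows. The real harness is a set of shell scripts; the image name, container name and container port below are hypothetical parameters, and only standard `docker run` / `docker stop` flags are used.

```python
import socket
import subprocess
from contextlib import contextmanager
from typing import List


def free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def docker_run_cmd(image: str, name: str, host_port: int,
                   container_port: int) -> List[str]:
    """Build the docker run argv; --rm and --gpus all mirror the notes."""
    return [
        "docker", "run", "--rm", "--detach", "--gpus", "all",
        "--name", name,
        "--publish", f"127.0.0.1:{host_port}:{container_port}",
        image,
    ]


@contextmanager
def spawned_container(image: str, name: str, container_port: int):
    """Start a container, yield its host port, and always tear it down."""
    # The port is released before docker binds it, so a rare race with
    # another process is possible; acceptable for a test sketch.
    port = free_port()
    subprocess.run(docker_run_cmd(image, name, port, container_port),
                   check=True)
    try:
        yield port
    finally:
        # Counterpart of the shell EXIT trap: stop even if checks failed.
        subprocess.run(["docker", "stop", name], check=False)
```

Each test file would wrap its checks in `with spawned_container(...) as port:`, which keeps every file runnable on its own, the same property the shell version gets from its EXIT trap.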
v0.5.0
docker-talkies v0.5.0: drop distil-whisper-large-v3.

Minor bump (breaking pre-1.0). distil-whisper-large-v3 was English-only and lived alongside the multilingual whisper-large-v3 (OG, max accuracy) and whisper-large-v3-turbo (multilingual, 8× faster), which made it redundant for the value it provided. Removing it.

CUDA registry now: 6 ASR (whisper×2, parakeet, canary×3) + 2 TTS (kokoro, qwen3) = 8 models.
CPU registry now: 3 ASR (whisper×2, canary-180m) + 1 TTS (kokoro) = 4 models.

Migration: TALKIES_ENABLED_MODELS=...distil-whisper-large-v3 → drop the slug or replace with whisper-large-v3-turbo (multilingual) or whisper-large-v3 (max accuracy).

No API or wire-format change.
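The migration can be applied mechanically. The helper below is a hypothetical sketch that assumes TALKIES_ENABLED_MODELS is a comma-separated list of slugs (the notes elide the exact format); it either drops the removed slug or swaps in a replacement, without duplicating a slug that is already present.

```python
from typing import Optional

REMOVED_SLUG = "distil-whisper-large-v3"


def migrate_enabled_models(value: str,
                           replacement: Optional[str] = None) -> str:
    """Drop the removed slug, or swap it for replacement, keeping order.

    Assumes a comma-separated list. Duplicates created by the swap are
    collapsed so each slug appears once.
    """
    result = []
    for slug in (part.strip() for part in value.split(",")):
        if slug == REMOVED_SLUG:
            slug = replacement or ""
        if slug and slug not in result:
            result.append(slug)
    return ",".join(result)
```

For example, `migrate_enabled_models(os.environ["TALKIES_ENABLED_MODELS"], "whisper-large-v3-turbo")` would produce the value to pass to the v0.5.0 container.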
v0.4.1
docker-talkies v0.4.1: README rewrite for above-the-fold conversion.

Patch release. Pure docs refresh + a tiny .gitignore tweak. No behavior change, no API change, no new models or endpoints.

README highlights:
- One-sentence tagline + Python drop-in snippet in the first 25 lines (was buried in 4 paragraphs of prose).
- 7 single-line feature bullets: ASR / TTS / voice cloning / hot swap / MCP / diarization / CPU+CUDA.
- Quick Start trimmed to 1 docker run + 1 curl above the fold; full examples + TOC folded into <details>.
- Dense "how it works" prose moved below the fold, unchanged.

.gitignore: about.txt added to local-tooling section.
v0.4.0
docker-talkies v0.4.0: Qwen3-TTS voice cloning + custom voices.

Second TTS engine (qwen3-tts-0.6b, CUDA-only) alongside Kokoro, with a /data/custom-voices/ user-mount convention for voice cloning. Renames the local host cache dir ~/.talkies-models → ~/.talkies-data.

Highlights:
- faster-qwen3-tts 0.2.6 backend, bfloat16 + SDPA. First synth captures CUDA graphs (~30-60s); subsequent calls sub-second.
- 3 builtin Qwen3 voices bundled (alloy/echo/fable as cloned samples) plus user-mountable /data/custom-voices/. Nested subdirs preserved in the voice name. Sibling <name>.txt (ref text) and <name>.lang (language) honored.
- Path-traversal guard on voice resolution.
- /v1/audio/voices now reports origin: "builtin" | "custom".
- Qwen3 CUDA check deferred to load time so the server boots on CPU hosts when qwen3-tts-0.6b is excluded via TALKIES_ENABLED_MODELS.
- Integration suite: 7 new qwen3 tests; transcribe loop skips TTS slugs via a /v1/models-derived ASR-only list.

Backwards-compatible: existing /v1/audio/speech against kokoro-82m, /v1/audio/transcriptions, the MCP tool surface, and all model slugs work identically.
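To illustrate the custom-voice convention and the path-traversal guard, here is a minimal sketch, not the project's actual resolver. It assumes the reference audio uses a `.wav` extension and that a voice name is the file's path relative to the mount without its extension; both are assumptions, as is the `resolve_voice` name itself.

```python
from pathlib import Path
from typing import Optional


def resolve_voice(root: Path, name: str,
                  audio_ext: str = ".wav") -> Optional[dict]:
    """Map a voice name such as "team/alice" to files under root.

    Returns None when the name escapes root or no audio file exists.
    The .wav extension is an assumption; the .txt and .lang siblings
    are optional, as described in the notes.
    """
    root = root.resolve()
    audio = (root / (name + audio_ext)).resolve()
    # Path-traversal guard: the resolved file must stay inside root,
    # which rejects "../" segments and absolute names alike.
    if root not in audio.parents:
        return None
    if not audio.is_file():
        return None
    ref_text = audio.with_suffix(".txt")
    lang = audio.with_suffix(".lang")
    return {
        "name": name,  # nested subdirs stay part of the voice name
        "audio": audio,
        "ref_text": ref_text.read_text().strip() if ref_text.is_file() else None,
        "language": lang.read_text().strip() if lang.is_file() else None,
    }
```

Resolving both paths before comparing them is what makes the guard robust: a check on the raw string would miss symlinks and redundant `..` segments.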
v0.3.0
docker-talkies v0.3.0: Kokoro TTS.

Adds OpenAI-compatible /v1/audio/speech with mp3/opus/aac/flac/wav/pcm output, /v1/audio/voices discovery, kokoro-82m in both CPU and CUDA images.

New backend protocol split (BackendBase / ASRBackend / TTSBackend). Cross-modality eviction shares one VRAM pool between ASR and TTS; idle TTL sweeper applies to both.

Both runtime images now bundle en_core_web_sm so Kokoro's English G2P never tries to pip-download at first call (runtime has no pip).

Integration suite gains a cross-modality round-trip test (Kokoro synth → fast ASR → assert expected words) plus CPU/memory caps on the test container to keep the host responsive while inference is running.

Backwards-compatible: all existing ASR endpoints, model slugs, MCP tools, and response shapes work identically.
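The idea of a single pool shared across modalities, with an idle TTL sweeper on top, can be shown with a toy model. This is an illustrative sketch only: the class name, the least-recently-used eviction policy and the megabyte accounting are assumptions, and the notes do not describe how the project actually picks eviction victims.

```python
import time
from collections import OrderedDict
from typing import Callable, List


class SharedModelPool:
    """Toy LRU pool with one memory budget for ASR and TTS models alike."""

    def __init__(self, capacity_mb: int, idle_ttl_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity_mb = capacity_mb
        self.idle_ttl_s = idle_ttl_s
        self.clock = clock
        # slug -> (size_mb, last_used); insertion order tracks recency.
        self._loaded: "OrderedDict[str, tuple]" = OrderedDict()

    def used_mb(self) -> int:
        return sum(size for size, _ in self._loaded.values())

    def acquire(self, slug: str, size_mb: int) -> List[str]:
        """Mark slug as used, evicting least recently used models to fit.

        Returns the evicted slugs. Modality plays no role in the choice.
        """
        if size_mb > self.capacity_mb:
            raise ValueError(f"{slug} does not fit in the pool")
        self._loaded.pop(slug, None)  # re-inserted below as most recent
        evicted = []
        while self.used_mb() + size_mb > self.capacity_mb:
            victim, _ = self._loaded.popitem(last=False)
            evicted.append(victim)
        self._loaded[slug] = (size_mb, self.clock())
        return evicted

    def sweep(self) -> List[str]:
        """Unload every model idle for longer than idle_ttl_s."""
        now = self.clock()
        stale = [slug for slug, (_, used) in self._loaded.items()
                 if now - used > self.idle_ttl_s]
        for slug in stale:
            del self._loaded[slug]
        return stale
```

The injectable `clock` keeps the sweeper testable without sleeping; a real server would call `sweep()` from a background timer.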
v0.2.1
docker-talkies v0.2.1: agent skill scaffolding + speaches credit.

Docs-only release. Adds .agents/.skills/talkies/ (SKILL.md + references/setup.md + scripts/bulk_transcribe.sh) so AI agents can discover and use the talkies API without re-reading the full README. README gains a Credits section linking speaches as the inspiration project.

No runtime, API, config, or wire-format changes.