Environment
- Voicebox v0.5.0 (macOS .app bundle, PyInstaller-packaged)
- macOS Darwin 25.4.0, arm64 (Apple Silicon, 128 GB RAM)
- Python 3.12.10 (bundled)
- mlx==0.31.2, mlx_audio==0.4.1, qwen_tts==0.1.1
- Model: mlx-community/Qwen3-TTS-12Hz-1.7B-Base-bf16 (downloaded, 4.2 GB)
- /health reports: backend_type: mlx, backend_variant: cpu, gpu_type: MPS (Apple Silicon), model_loaded: true, gpu_available: true
Symptom
Every generation fails immediately with:
There is no Stream(gpu, 1) in current thread.
The UI stays on "Loading model..." but the SQLite row in voicebox.db is already status=failed. No Python traceback is written to the server log — only the message is captured in the generations.error column.
Reproduction (100% repeatable)
After a clean app start (Ready logged, model loaded), call the API directly:
curl -X POST http://127.0.0.1:17493/generate \
  -H 'Content-Type: application/json' \
  -d '{"profile_id":"<any-existing-profile>","text":"hello","language":"en","engine":"qwen","model_size":"1.7B"}'
Polling /generate/{id}/status flips to failed within ~1 second, every time. Confirmed across multiple profiles, languages (zh/en), and short/long inputs.
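For anyone who wants to script the reproduction, the submit-and-poll loop can be sketched roughly as below. The endpoints and port come from the curl command above. The response shapes are assumptions: a JSON body with a "status" key on the status endpoint, and "completed" as the success value, are guesses. Only "failed" is confirmed by this report.

```python
import json
import time
import urllib.request

BASE_URL = "http://127.0.0.1:17493"  # port taken from the curl command above


def http_json(method, path, payload=None):
    """Send one JSON request to the local server and decode the JSON reply."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = urllib.request.Request(
        BASE_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_terminal_status(generation_id, fetch=http_json, timeout_s=10.0, interval_s=0.2):
    """Poll /generate/{id}/status until the job reaches a terminal state.

    ASSUMPTION: the body is a JSON object with a "status" key, and
    "failed" / "completed" are the terminal values. Only "failed" is
    confirmed by the report.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        body = fetch("GET", f"/generate/{generation_id}/status")
        if body.get("status") in ("failed", "completed"):
            return body
        if time.monotonic() >= deadline:
            raise TimeoutError(f"generation {generation_id} still {body.get('status')!r}")
        time.sleep(interval_s)
```

To use it, POST the same payload as the curl command with http_json("POST", "/generate", {...}) and pass the returned generation id to wait_for_terminal_status. The name of the id field in the POST response is not known from this report, so check the actual response first.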
Likely root cause
mlx_audio/stt/generate.py:224 (and almost certainly similar code on the TTS path):
generation_stream = mx.new_stream(mx.default_device())
The stream is created on the main/import thread, but inference runs in a uvicorn worker thread (run_in_executor / asyncio.to_thread). MLX streams are thread-local, so the worker thread sees no Stream(gpu, 1) and raises.
Fix direction: either recreate the stream inside the worker thread, or use mlx.core.new_thread_local_stream (added in MLX 0.31).
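The first fix direction (create the stream in the thread that uses it) can be sketched with a plain threading.local cache. This is only an illustration of the pattern, not the mlx_audio code: get_thread_stream, run_inference and demo are hypothetical names, and the stream factory is a stand-in so the sketch runs without MLX. In the real code the factory would presumably be the call quoted above, mx.new_stream(mx.default_device()). Whether that alone resolves the error is untested.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_per_thread = threading.local()  # each thread sees its own attributes


def get_thread_stream(make_stream):
    """Return this thread's stream, creating it on first use in that thread."""
    stream = getattr(_per_thread, "stream", None)
    if stream is None:
        # In the real code this would presumably be:
        #   mx.new_stream(mx.default_device())
        stream = make_stream()
        _per_thread.stream = stream
    return stream


def run_inference(make_stream, generate):
    """Hypothetical worker entry point: resolve the stream inside the worker."""
    stream = get_thread_stream(make_stream)
    return generate(stream)


def demo():
    """Return True if the main thread and a pool worker share one stream."""
    make_stream = object  # stand-in factory: every call returns a new object
    main_stream = get_thread_stream(make_stream)
    with ThreadPoolExecutor(max_workers=1) as pool:
        worker_stream = pool.submit(run_inference, make_stream, lambda s: s).result()
    return main_stream is worker_stream
```

demo() returns False: the worker never sees the object created on the main thread and builds its own on first use. A module-level stream created at import time cannot behave this way, because it is created once, on whichever thread happens to import the module.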
Suggested follow-ups
- Move mx.new_stream into the worker that actually runs inference, or switch to new_thread_local_stream.
- Let the exception traceback through to the server log on generation failure — currently the message is silently captured in SQLite only, which makes the issue look like "model still loading" in the UI.
- The backend_variant: cpu while backend_type: mlx and gpu_type: MPS reported by /health looks inconsistent and is worth a sanity check.
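For the traceback follow-up, a minimal sketch of the intended behavior using the standard logging module: keep storing the message, but also log the full traceback. All names here are hypothetical. run_generation stands for whatever wrapper currently catches the exception, save_failure for whatever writes generations.error, and the logger name is made up.

```python
import logging

logger = logging.getLogger("voicebox.generate")  # hypothetical logger name


def run_generation(generation_id, generate, save_failure):
    """Run one job; on failure, log the traceback and record the message."""
    try:
        return generate()
    except Exception as exc:
        # logger.exception logs at ERROR level and appends the full traceback,
        # so the failure shows up in the server log and not only in SQLite.
        logger.exception("generation %s failed", generation_id)
        # save_failure is a stand-in for the existing write to generations.error.
        save_failure(generation_id, str(exc))
        return None
```

With this shape the database row is unchanged, and the server log gains the stack frames needed to see which call raised.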
Workaround
None found from the user side — backend cannot be switched away from MLX in the bundled build, and the error is raised before any inference happens.