Qwen TTS generation fails 100% with `Stream(gpu, 1) not in current thread` on Apple Silicon (v0.5.0)

## Environment

- Voicebox **v0.5.0** (macOS .app bundle, PyInstaller-packaged)
- macOS Darwin 25.4.0, arm64 (Apple Silicon, 128 GB RAM)
- Python 3.12.10 (bundled)
- `mlx==0.31.2`, `mlx_audio==0.4.1`, `qwen_tts==0.1.1`
- Model: `mlx-community/Qwen3-TTS-12Hz-1.7B-Base-bf16` (downloaded, 4.2 GB)
- `/health` reports: `backend_type: mlx`, `backend_variant: cpu`, `gpu_type: MPS (Apple Silicon)`, `model_loaded: true`, `gpu_available: true`

## Symptom

Every generation fails **immediately** with:

```
There is no Stream(gpu, 1) in current thread.
```

The UI stays on "Loading model..." even though the corresponding row in the `generations` table of `voicebox.db` already has `status=failed`. No Python traceback is written to the server log; only the message is captured, in the `generations.error` column.
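
To confirm the failure independently of the UI, the database can be read directly. This is a minimal sketch: the `generations` table and its `status` / `error` columns come from the observations above, while the `id` column, the use of `rowid` for ordering, and the helper name `recent_failures` are assumptions for illustration.

```python
import sqlite3


def recent_failures(conn: sqlite3.Connection, limit: int = 5) -> list[tuple]:
    """Return (id, status, error) for the most recently inserted failed rows.

    Assumes an `id` column and an ordinary rowid table; adjust to the real schema.
    """
    query = (
        "SELECT id, status, error FROM generations "
        "WHERE status = 'failed' ORDER BY rowid DESC LIMIT ?"
    )
    return conn.execute(query, (limit,)).fetchall()


# Usage (path to voicebox.db depends on the install):
#   conn = sqlite3.connect("/path/to/voicebox.db")
#   for row in recent_failures(conn):
#       print(row)
```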

## Reproduction (100% repeatable)

After a clean app start (`Ready` logged, model loaded), call the API directly:

```bash
curl -X POST http://127.0.0.1:17493/generate \
  -H 'Content-Type: application/json' \
  -d '{"profile_id":"<any-existing-profile>","text":"hello","language":"en","engine":"qwen","model_size":"1.7B"}'
```

Polling `/generate/{id}/status` flips to `failed` within ~1 second, every time. Confirmed across multiple profiles, languages (zh/en), and short/long inputs.
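
The polling step can be scripted as below. This is a sketch under assumptions: the status endpoint is taken to return JSON with a `status` field, and `completed` is a guess at the success state name (only `failed` is confirmed above). The helper names are hypothetical.

```python
import json
import time
import urllib.request
from typing import Callable

# "failed" is observed; "completed" is an assumed name for the success state.
TERMINAL_STATES = {"completed", "failed"}


def http_status_fetcher(base_url: str, generation_id: str) -> Callable[[], str]:
    """Build a callable that fetches the current status string over HTTP."""
    url = f"{base_url}/generate/{generation_id}/status"

    def fetch() -> str:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return json.load(resp)["status"]  # assumed response field

    return fetch


def wait_for_terminal(
    fetch_status: Callable[[], str],
    timeout_s: float = 30.0,
    interval_s: float = 0.25,
) -> str:
    """Poll until the status is terminal; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_s}s")
        time.sleep(interval_s)


# Usage:
#   fetch = http_status_fetcher("http://127.0.0.1:17493", "<generation-id>")
#   print(wait_for_terminal(fetch))
```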

## Likely root cause

`mlx_audio/stt/generate.py:224` (the TTS path almost certainly contains similar code):

```python
generation_stream = mx.new_stream(mx.default_device())
```

The stream is created on the main thread at import time, but inference runs in a worker thread (via `run_in_executor` / `asyncio.to_thread`). MLX streams are **thread-local**, so the worker thread has no `Stream(gpu, 1)` and MLX raises.

Fix direction: either recreate the stream inside the worker thread, or use [`mlx.core.new_thread_local_stream`](https://ml-explore.github.io/mlx/build/html/python/_autosummary/mlx.core.new_thread_local_stream.html) (added in MLX 0.31).
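
The first option (recreate the stream inside the worker thread) could look roughly like this. It is a sketch, not a tested patch against `mlx_audio`: `thread_stream` is a hypothetical helper, and the factory is passed in so the pattern can be exercised without MLX installed. Any module-level `generation_stream` in the library would presumably still need to be replaced by a call like this.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")

# One slot per thread; each thread lazily creates and then reuses its own stream.
_local = threading.local()


def thread_stream(make_stream: Callable[[], T]) -> T:
    """Return the calling thread's stream, creating it on first use."""
    stream = getattr(_local, "stream", None)
    if stream is None:
        stream = make_stream()
        _local.stream = stream
    return stream


# Intended use inside the function that runs on the worker thread:
#   import mlx.core as mx
#   stream = thread_stream(lambda: mx.new_stream(mx.default_device()))
#   with mx.stream(stream):
#       ...  # run generation here
```

Because the stream is created by whichever thread first calls `thread_stream`, nothing created on the import thread is ever handed to the worker.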

## Suggested follow-ups

1. Move `mx.new_stream` into the worker that actually runs inference, or switch to `new_thread_local_stream`.
2. Let the exception traceback through to the server log on generation failure. Currently the message is silently captured in SQLite only, which makes the issue look like "model still loading" in the UI.
3. The `backend_variant: cpu` while `backend_type: mlx` and `gpu_type: MPS` reported by `/health` looks inconsistent and is worth a sanity check.
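
For follow-up 2, a sketch of what the failure handler could do. The logger name, `run_and_record`, and the `record_failure` callback are hypothetical stand-ins for whatever Voicebox uses internally; the point is only that `logger.exception` writes the full traceback in addition to the stored message.

```python
import logging
from typing import Callable

logger = logging.getLogger("voicebox.generation")  # hypothetical logger name


def run_and_record(
    generation_id: str,
    generate: Callable[[], None],
    record_failure: Callable[[str, str], None],
) -> bool:
    """Run a generation job; on failure, log the traceback and store the message."""
    try:
        generate()
    except Exception as exc:
        # logger.exception logs at ERROR level and appends the current traceback.
        logger.exception("Generation %s failed", generation_id)
        record_failure(generation_id, str(exc))
        return False
    return True
```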

## Workaround

None found from the user side: the backend cannot be switched away from MLX in the bundled build, and the error is raised before any inference happens.

