Local voice agent daemon for StackChan.
See docs/README.md for the architecture and local verification flow.
- Rust agent core
- litert-lm CLI as the multimodal LLM worker
- mlx-audio stdio workers for local Qwen3-ASR / Qwen3-TTS
- StackChan bridge over WebSocket or Serial
- SQLite memory layer, to be added next
Install the local Python audio dependencies:
mise run audio:sync
By default, config/gateway.toml runs Qwen3-ASR and Qwen3-TTS
as stdio workers. mlx_audio.server is still available through
mise run audio:server for HTTP fallback testing.
Runtime configs are split by target memory:
- config/gateway.toml: default, aligned with the 8GB profile
- config/gateway-8gb.toml: E2B + Qwen3-ASR 0.6B 4bit + Qwen3-TTS Base 0.6B 4bit
- config/gateway-8gb-tts-prewarm.toml: 8GB experiment, exclusive mode with TTS prewarm during the LLM phase
- config/gateway-8gb-tts-hot.toml: 8GB experiment, TTS kept hot with ASR reload
- config/gateway-8gb-all-hot.toml: 8GB full-residency experiment
- config/gateway-16gb.toml: E4B + Qwen3-ASR 1.7B 8bit + Qwen3-TTS Base 1.7B 8bit
- config/client-pc.toml: PC microphone client settings
- config/firmware.example.toml: template for StackChan firmware settings
- config/firmware.toml: local firmware settings, ignored by git
Silero VAD is used for server-side utterance detection. It is included in the model download tasks:
mise run models:download-8gb
mise run models:path-vad
Run use-case demos:
mise run models:download-8gb
mise run demo:tts
mise run demo:stt
mise run demo:voice-roundtrip
mise run demo:voice-roundtrip-and-play
SILERO_MODEL_PATH=/path/to/silero_vad.onnx mise run demo:voice-stream-roundtrip
SILERO_MODEL_PATH=/path/to/silero_vad.onnx mise run bench:voice-stream-turn-e2e
Roll the voice gacha with Qwen3-TTS VoiceDesign and save the ref-audio/ref-text for StackChan's custom voice:
uv run scripts/stackchan_voice_gacha.py
The voice gacha is controlled by the arguments of scripts/stackchan_voice_gacha.py,
not by the gateway runtime config. The default model is mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16.
To check saved gacha/ref wav files by playing them back:
uv run scripts/play_stackchan_voice.py
To save an existing audio file and its text in the custom voice reference format (the quoted argument is the text actually read in the audio):
uv run scripts/import_stackchan_voice_ref.py \
~/Downloads/sample.mp3 \
"この音声で実際に読まれているテキストです。" \
--name my_voice
After saving, switch tts_args in config/gateway.toml to the CustomVoice model and
add the generated ref-audio/ref-text. StackChan then speaks in that voice
through the daemon:
tts_args = [
"--model", "mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit",
"--ref-audio", "/Users/yykt/.stackchan-ghost/voices/stackchan_ref.wav",
"--ref-text-file", "/Users/yykt/.stackchan-ghost/voices/stackchan_ref.txt",
"--streaming-interval", "0.2",
"--lang-code", "japanese",
]
To check only a single utterance wav with the saved ref-audio:
uv run scripts/custom_voice_tts.py \
--ref-audio ~/.stackchan-ghost/voices/stackchan_ref.wav \
--ref-text-file ~/.stackchan-ghost/voices/stackchan_ref.txt \
--output target/custom-voice.wav \
"こんにちは。スタックチャンのカスタムボイスです。"
Run the daemon shell:
mise run agent:serve
Run the StackChan bridge as a WebSocket server:
mise run agent:server
Explicit profile tasks are also available:
mise run models:download-8gb
mise run agent:server-8gb
mise run agent:server-8gb-tts-prewarm
mise run agent:server-8gb-tts-hot
mise run agent:server-8gb-all-hot
mise run agent:server-16gb
The WebSocket voice path uses Silero VAD. models:download-8gb and
models:download-16gb include it. To fetch only the VAD model:
mise run models:download-vad
mise run models:path-vad
By default the server listens on ws://0.0.0.0:8787. StackChan can send a plain
text line/message for debugging:
{"type":"utterance","text":"こんにちは"}
For the firmware voice path, authenticate with hello, start a continuous Opus
stream, and keep sending 20 ms binary Opus frames. Server-side Silero VAD cuts
utterances and starts STT -> LLM -> TTS when speech ends.
{"type":"hello","device_id":"stackchan-cores3","token":"dev-stackchan-ghost","protocol_version":1,"audio_codec":"opus","sample_rate":16000,"channels":1,"frame_ms":20}
{"type":"audio_stream_start","format":"opus","framing":"binary","sample_rate":16000,"channels":1,"frame_ms":20}
WebSocket binary Opus frames are prefixed with an 8-byte little-endian header:
u32 sequence, u32 timestamp_ms, followed by the Opus payload.
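The header layout above can be sketched with Python's struct module. This is a minimal illustration, not the gateway's code: the function names are hypothetical, and the payload bytes in the demo are dummy data rather than a real Opus packet.

```python
import struct

# Layout from the description above: u32 sequence, u32 timestamp_ms,
# both little-endian, 8 bytes in total.
WS_OPUS_HEADER = struct.Struct("<II")


def pack_ws_opus_frame(sequence: int, timestamp_ms: int, opus_payload: bytes) -> bytes:
    """Prefix one Opus packet with the 8-byte header (hypothetical helper)."""
    return WS_OPUS_HEADER.pack(sequence, timestamp_ms) + opus_payload


def unpack_ws_opus_frame(frame: bytes) -> tuple[int, int, bytes]:
    """Split a binary WebSocket message into (sequence, timestamp_ms, payload)."""
    if len(frame) < WS_OPUS_HEADER.size:
        raise ValueError("frame is shorter than the 8-byte header")
    sequence, timestamp_ms = WS_OPUS_HEADER.unpack_from(frame)
    return sequence, timestamp_ms, frame[WS_OPUS_HEADER.size:]


if __name__ == "__main__":
    # Fourth 20 ms frame of a stream: sequence 3, timestamp 60 ms.
    frame = pack_ws_opus_frame(3, 60, b"\x01\x02\x03")
    print(frame.hex())
```

With frame_ms set to 20, a sender would typically advance sequence by 1 and timestamp_ms by 20 per frame, although the text does not say whether the server requires that.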
Serial Opus frames use binary framing: magic/version 0x1e 'S' 'G' 1,
frame_type=1, flags, header_len=20, payload_len, sequence,
timestamp_ms, Opus payload, then CRC16-CCITT. JSON control/config/log messages
remain newline-delimited text. The server returns TTS Opus audio over serial
with the same binary framing, not base64 JSON chunks.
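The serial framing can be sketched in the same way, but the text lists only the field order, so several details below are assumptions rather than the actual wire format: the field widths (u8 frame_type, u8 flags, u16 header_len, u32 payload_len, u32 sequence, u32 timestamp_ms, chosen because they add up to header_len=20), little-endian byte order, the CRC16-CCITT variant (polynomial 0x1021, seed 0xFFFF), the CRC covering header plus payload, and the CRC being stored as a little-endian u16. Check the firmware source before relying on any of them.

```python
import struct

MAGIC = bytes([0x1E, ord("S"), ord("G"), 1])  # magic/version bytes from the text
FRAME_TYPE_OPUS = 1
# ASSUMPTION: widths and byte order; 4 + 1 + 1 + 2 + 4 + 4 + 4 = 20 bytes.
HEADER = struct.Struct("<4sBBHIII")


def crc16_ccitt(data: bytes, crc: int = 0xFFFF) -> int:
    """Bitwise CRC16, polynomial 0x1021. ASSUMPTION: 0xFFFF seed, no final XOR."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc


def encode_serial_frame(sequence: int, timestamp_ms: int, payload: bytes, flags: int = 0) -> bytes:
    """Build header + payload + CRC (hypothetical helper)."""
    header = HEADER.pack(MAGIC, FRAME_TYPE_OPUS, flags, HEADER.size,
                         len(payload), sequence, timestamp_ms)
    body = header + payload
    # ASSUMPTION: CRC covers header and payload, appended as little-endian u16.
    return body + struct.pack("<H", crc16_ccitt(body))


def decode_serial_frame(frame: bytes) -> tuple[int, int, int, bytes]:
    """Validate one frame and return (frame_type, sequence, timestamp_ms, payload)."""
    if len(frame) < HEADER.size + 2:
        raise ValueError("frame too short")
    magic, frame_type, _flags, header_len, payload_len, sequence, timestamp_ms = (
        HEADER.unpack_from(frame)
    )
    if magic != MAGIC:
        raise ValueError("bad magic/version")
    end = header_len + payload_len
    if header_len < HEADER.size or len(frame) != end + 2:
        raise ValueError("length fields do not match the frame size")
    (expected_crc,) = struct.unpack_from("<H", frame, end)
    if crc16_ccitt(frame[:end]) != expected_crc:
        raise ValueError("CRC mismatch")
    return frame_type, sequence, timestamp_ms, frame[header_len:end]
```

Carrying header_len in the frame lets a reader skip header fields it does not know, which is why the decoder slices the payload at header_len rather than at a fixed offset.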
The legacy/debug PCM path is still available. Start a continuous PCM stream and keep sending 20 ms binary PCM16LE frames:
{"type":"audio_stream_start","format":"pcm16le","sample_rate":16000,"channels":1,"frame_ms":20}
Each binary frame is 640 bytes: 16000 Hz * 20 ms * 1 ch * 2 bytes. Serial can use
base64 JSON chunks instead:
{"type":"audio_chunk","encoding":"base64","data":"..."}
The legacy PCM response is sent as JSON messages over WebSocket or as newline-delimited JSON over serial. TTS audio is decoded on the server and returned as chunked PCM16LE:
{"type":"reply","transcription":"こんにちは","text":"こんにちは、元気です。","stt_complete_ms":420,"llm_complete_ms":180,"tts_complete_ms":760,"turn_ms":1360}
{"type":"audio_stream_start","format":"pcm16le","sample_rate":16000,"channels":1,"frame_ms":20}
{"type":"audio_chunk","encoding":"base64","data":"..."}
{"type":"audio_stream_end"}
Device clients should keep their own microphone gated while they are playing the reply audio. The PC microphone example does this locally by waiting for the output sink to finish playback before it resumes sending microphone frames.
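For the legacy PCM path, the 640-byte frame size and the base64 audio_chunk message can be checked with a short sketch. The helper names are hypothetical, and sending exactly one 20 ms frame per audio_chunk is an assumption; the text does not fix the chunk size for the base64 form.

```python
import base64
import json

SAMPLE_RATE = 16000
CHANNELS = 1
FRAME_MS = 20
BYTES_PER_SAMPLE = 2  # PCM16LE

# 16000 Hz * 20 ms = 320 samples; 320 * 1 channel * 2 bytes = 640 bytes.
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * CHANNELS * BYTES_PER_SAMPLE


def split_frames(pcm: bytes) -> list[bytes]:
    """Cut a PCM16LE buffer into whole 20 ms frames; a trailing partial frame is dropped."""
    usable = len(pcm) - len(pcm) % FRAME_BYTES
    return [pcm[i:i + FRAME_BYTES] for i in range(0, usable, FRAME_BYTES)]


def to_audio_chunk(frame: bytes) -> str:
    """Wrap raw PCM bytes in the base64 JSON message shown above."""
    return json.dumps({
        "type": "audio_chunk",
        "encoding": "base64",
        "data": base64.b64encode(frame).decode("ascii"),
    })


def from_audio_chunk(line: str) -> bytes:
    """Recover raw PCM bytes from one audio_chunk JSON line."""
    message = json.loads(line)
    if message.get("type") != "audio_chunk" or message.get("encoding") != "base64":
        raise ValueError("not a base64 audio_chunk message")
    return base64.b64decode(message["data"])


if __name__ == "__main__":
    silence = bytes(FRAME_BYTES * 2 + 100)  # two full frames plus a partial one
    frames = split_frames(silence)
    print(FRAME_BYTES, len(frames))
```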
Run the StackChan bridge over serial:
mise run agent:serial
The default serial port and baud rate come from config/gateway.toml; override
them with cargo run -- serial --port /dev/tty.usbserial --baud-rate 115200.
For STT -> LLM -> TTS bottleneck checks, write structured logs and summarize them:
mise run bench:voice-turn-json
This writes logs/voice-turn.jsonl and prints per-phase count, average, p50,
p95, and max latency. The important fields are stt_complete_ms,
llm_complete_ms, tts_complete_ms, and turn_ms.
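To make the summary concrete, here is a rough sketch of how such per-phase statistics could be computed from a JSONL log. It is not scripts/analyze_voice_log.py: it assumes each log record is a JSON object carrying the four fields at the top level, and it uses the nearest-rank percentile definition, which may differ from what the real script does.

```python
import json
import math
from collections import defaultdict

PHASES = ("stt_complete_ms", "llm_complete_ms", "tts_complete_ms", "turn_ms")


def percentile(sorted_values: list, pct: float):
    """Nearest-rank percentile (ASSUMPTION: the real script may interpolate)."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(lines) -> dict:
    """Collect the phase fields from JSONL lines and summarize each phase."""
    samples = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON lines mixed into the log
        if not isinstance(record, dict):
            continue
        for phase in PHASES:
            value = record.get(phase)
            if isinstance(value, (int, float)):
                samples[phase].append(value)
    summary = {}
    for phase, values in samples.items():
        values.sort()
        summary[phase] = {
            "count": len(values),
            "avg": sum(values) / len(values),
            "p50": percentile(values, 50),
            "p95": percentile(values, 95),
            "max": values[-1],
        }
    return summary


if __name__ == "__main__":
    with open("logs/voice-turn.jsonl", encoding="utf-8") as log:
        for phase, stats in summarize(log).items():
            print(phase, stats)
```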
For a long-running server, use the same JSON log switch and analyze the saved stderr later:
mise run agent:server
python3 scripts/analyze_voice_log.py logs/server.jsonl
To talk through the same WebSocket audio path from a PC microphone, start the bridge in one terminal:
mise run models:download-8gb
mise run agent:server-8gb
Then run the example client:
mise run pc:voice-loop
The PC microphone client reads its WebSocket URL and token from
config/client-pc.toml. Override them with --url or --token when needed.
It captures the default microphone, sends 20 ms pcm16le 16000Hz mono frames to
the bridge, receives StackChan-style reply PCM chunks, and plays them through the
default speaker. Pick a specific microphone with
cargo run --example pc_mic_voice_loop -- --input-device "Device Name".
The gateway saves the generated reply audio it sends to clients as WAV files in
logs/captures/ by default. Use these files to distinguish TTS generation
quality from PC-side playback issues. Override the directory with
STACKCHAN_GHOST_CAPTURE_TTS_DIR=/path/to/captures when needed.
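When inspecting captures from a script, the directory can be resolved the same way the text describes. This is a small sketch with hypothetical helper names. It assumes the captures use a .wav extension and that an empty environment variable means "use the default"; how the gateway treats an empty value is not stated.

```python
import os
import wave
from pathlib import Path

DEFAULT_CAPTURE_DIR = Path("logs/captures")


def capture_dir(env=None) -> Path:
    """Return the capture directory, honoring STACKCHAN_GHOST_CAPTURE_TTS_DIR."""
    env = os.environ if env is None else env
    override = env.get("STACKCHAN_GHOST_CAPTURE_TTS_DIR")
    return Path(override) if override else DEFAULT_CAPTURE_DIR


def wav_seconds(path: Path) -> float:
    """Duration of one WAV capture in seconds, read from its header."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


if __name__ == "__main__":
    # ASSUMPTION: captures are stored with a .wav extension.
    for capture in sorted(capture_dir().glob("*.wav")):
        print(f"{capture.name}: {wav_seconds(capture):.2f} s")
```

A capture that sounds right here but wrong through the PC client would point at playback rather than at TTS generation.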