katya4oyu/stackchan-ghost

stackchan-ghost

Local voice agent daemon for StackChan.

See docs/README.md for the architecture and local verification flow.

MVP shape

  • Rust agent core
  • litert-lm CLI as the multimodal LLM worker
  • mlx-audio stdio workers for local Qwen3-ASR / Qwen3-TTS
  • StackChan bridge over WebSocket or Serial
  • SQLite memory layer, to be added next

mlx-audio workers

Install the local Python audio dependencies:

mise run audio:sync

By default, config/gateway.toml runs Qwen3-ASR and Qwen3-TTS as stdio workers. mlx_audio.server is still available through mise run audio:server for HTTP fallback testing.

Runtime configs are split by target memory:

config/gateway.toml                  default, aligned with the 8GB profile
config/gateway-8gb.toml              E2B + Qwen3-ASR 0.6B 4bit + Qwen3-TTS Base 0.6B 4bit
config/gateway-8gb-tts-prewarm.toml  8GB experiment, exclusive with TTS prewarm during LLM
config/gateway-8gb-tts-hot.toml      8GB experiment, TTS hot with ASR reload
config/gateway-8gb-all-hot.toml      8GB full-residency experiment
config/gateway-16gb.toml             E4B + Qwen3-ASR 1.7B 8bit + Qwen3-TTS Base 1.7B 8bit
config/client-pc.toml                PC microphone client settings
config/firmware.example.toml         template for StackChan firmware settings
config/firmware.toml                 local firmware settings, ignored by git

Silero VAD is used for server-side utterance detection. It is included in the model download tasks:

mise run models:download-8gb
mise run models:path-vad

Run use-case demos:

mise run models:download-8gb
mise run demo:tts
mise run demo:stt
mise run demo:voice-roundtrip
mise run demo:voice-roundtrip-and-play
SILERO_MODEL_PATH=/path/to/silero_vad.onnx mise run demo:voice-stream-roundtrip
SILERO_MODEL_PATH=/path/to/silero_vad.onnx mise run bench:voice-stream-turn-e2e

StackChan voice gacha

Roll the voice gacha with Qwen3-TTS VoiceDesign and save the ref-audio/ref-text for StackChan's custom voice:

uv run scripts/stackchan_voice_gacha.py

The voice gacha is controlled by the arguments of scripts/stackchan_voice_gacha.py, not by the gateway runtime config. The default model is mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16.

To check saved gacha/ref wav files by playing them back:

uv run scripts/play_stackchan_voice.py

To save an existing audio file and its transcript in the custom voice reference format:

uv run scripts/import_stackchan_voice_ref.py \
  ~/Downloads/sample.mp3 \
  "この音声で実際に読まれているテキストです。" \
  --name my_voice

After saving, switch tts_args in config/gateway.toml to a CustomVoice model and add the ref-audio/ref-text paths that the script printed. StackChan then speaks in that voice through the daemon:

tts_args = [
  "--model", "mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit",
  "--ref-audio", "/Users/yykt/.stackchan-ghost/voices/stackchan_ref.wav",
  "--ref-text-file", "/Users/yykt/.stackchan-ghost/voices/stackchan_ref.txt",
  "--streaming-interval", "0.2",
  "--lang-code", "japanese",
]

To check only a standalone utterance wav generated with the saved ref-audio:

uv run scripts/custom_voice_tts.py \
  --ref-audio ~/.stackchan-ghost/voices/stackchan_ref.wav \
  --ref-text-file ~/.stackchan-ghost/voices/stackchan_ref.txt \
  --output target/custom-voice.wav \
  "こんにちは。スタックチャンのカスタムボイスです。"

Run the daemon shell:

mise run agent:serve

Run the StackChan bridge as a WebSocket server:

mise run agent:server

Explicit profile tasks are also available:

mise run models:download-8gb
mise run agent:server-8gb
mise run agent:server-8gb-tts-prewarm
mise run agent:server-8gb-tts-hot
mise run agent:server-8gb-all-hot
mise run agent:server-16gb

The WebSocket voice path uses Silero VAD. models:download-8gb and models:download-16gb include it. To fetch only the VAD model:

mise run models:download-vad
mise run models:path-vad

By default the WebSocket server listens on ws://0.0.0.0:8787. For debugging, StackChan can send an utterance as a plain text message:

{"type":"utterance","text":"こんにちは"}

For the firmware voice path, authenticate with hello, start a continuous Opus stream, and keep sending 20 ms binary Opus frames. Server-side Silero VAD cuts utterances and starts STT -> LLM -> TTS when speech ends.

{"type":"hello","device_id":"stackchan-cores3","token":"dev-stackchan-ghost","protocol_version":1,"audio_codec":"opus","sample_rate":16000,"channels":1,"frame_ms":20}
{"type":"audio_stream_start","format":"opus","framing":"binary","sample_rate":16000,"channels":1,"frame_ms":20}
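The two control messages above can be built programmatically. The following is a minimal sketch that only reproduces the fields shown in this README; the helper names hello_message and opus_stream_start are hypothetical, and sending the strings over a WebSocket is left to the client library of your choice.

```python
import json


def hello_message(device_id: str, token: str) -> str:
    """Build the hello message as one JSON text message (fields as in the README)."""
    return json.dumps({
        "type": "hello",
        "device_id": device_id,
        "token": token,
        "protocol_version": 1,
        "audio_codec": "opus",
        "sample_rate": 16000,
        "channels": 1,
        "frame_ms": 20,
    })


def opus_stream_start() -> str:
    """Build the audio_stream_start message for the binary Opus path."""
    return json.dumps({
        "type": "audio_stream_start",
        "format": "opus",
        "framing": "binary",
        "sample_rate": 16000,
        "channels": 1,
        "frame_ms": 20,
    })
```

A client would send hello_message(...) first, then opus_stream_start(), and only then start sending binary frames.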

WebSocket binary Opus frames start with an 8-byte little-endian header (u32 sequence, then u32 timestamp_ms), followed by the Opus payload.
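As an illustration of that header layout, here is a small sketch that packs and unpacks one WebSocket frame with Python's struct module. The function names are hypothetical, and masking the counters to 32 bits (so they wrap instead of raising) is an assumption, not something the README specifies.

```python
import struct

# Layout from the README: u32 sequence, u32 timestamp_ms, little-endian.
WS_HEADER = struct.Struct("<II")


def pack_ws_opus_frame(sequence: int, timestamp_ms: int, opus_payload: bytes) -> bytes:
    """Prefix one encoded Opus packet with the 8-byte header."""
    # ASSUMPTION: counters wrap at 2**32 rather than being rejected.
    header = WS_HEADER.pack(sequence & 0xFFFFFFFF, timestamp_ms & 0xFFFFFFFF)
    return header + opus_payload


def unpack_ws_opus_frame(frame: bytes) -> tuple[int, int, bytes]:
    """Split a binary WebSocket message into (sequence, timestamp_ms, payload)."""
    if len(frame) < WS_HEADER.size:
        raise ValueError("frame shorter than the 8-byte header")
    sequence, timestamp_ms = WS_HEADER.unpack_from(frame)
    return sequence, timestamp_ms, frame[WS_HEADER.size:]
```

With 20 ms frames, a sender would typically increase sequence by 1 and timestamp_ms by 20 per frame.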

Serial Opus frames use binary framing: magic/version 0x1e 'S' 'G' 1, frame_type=1, flags, header_len=20, payload_len, sequence, timestamp_ms, Opus payload, then CRC16-CCITT. JSON control/config/log messages remain newline-delimited text. The server returns TTS Opus audio over serial with the same binary framing, not base64 JSON chunks.
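The serial framing can be sketched in the same way, but the README does not state field widths, byte order, the CRC variant, or which bytes the CRC covers. Everything marked ASSUMPTION below is a guess, chosen so that the header comes out at the documented 20 bytes; check it against the firmware and gateway sources before relying on it.

```python
import struct

MAGIC = b"\x1eSG\x01"   # 0x1e 'S' 'G' 1, from the README
FRAME_TYPE_OPUS = 1     # frame_type=1, from the README

# ASSUMPTION: little-endian fields with these widths:
# magic/version (4), frame_type u8, flags u8, header_len u16,
# payload_len u32, sequence u32, timestamp_ms u32 = 20 bytes.
HEADER = struct.Struct("<4sBBHIII")


def crc16_ccitt(data: bytes, crc: int = 0xFFFF) -> int:
    """CRC16 with polynomial 0x1021. ASSUMPTION: initial value 0xFFFF."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc


def pack_serial_frame(sequence: int, timestamp_ms: int, payload: bytes, flags: int = 0) -> bytes:
    """Build header + payload + CRC for one Opus packet."""
    header = HEADER.pack(MAGIC, FRAME_TYPE_OPUS, flags, HEADER.size,
                         len(payload), sequence, timestamp_ms)
    body = header + payload
    # ASSUMPTION: the CRC covers header and payload and is sent little-endian.
    return body + struct.pack("<H", crc16_ccitt(body))


def unpack_serial_frame(frame: bytes) -> tuple[int, int, int, bytes]:
    """Validate one frame and return (flags, sequence, timestamp_ms, payload)."""
    if len(frame) < HEADER.size + 2:
        raise ValueError("frame too short")
    magic, frame_type, flags, header_len, payload_len, sequence, timestamp_ms = \
        HEADER.unpack_from(frame)
    if magic != MAGIC or frame_type != FRAME_TYPE_OPUS or header_len != HEADER.size:
        raise ValueError("unexpected header")
    end = header_len + payload_len
    if len(frame) != end + 2:
        raise ValueError("length mismatch")
    (crc,) = struct.unpack_from("<H", frame, end)
    if crc != crc16_ccitt(frame[:end]):
        raise ValueError("CRC mismatch")
    return flags, sequence, timestamp_ms, frame[header_len:end]
```

Because JSON control messages share the same serial link as newline-delimited text, a real reader also has to resynchronize on the magic bytes; that part is omitted here.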

The legacy/debug PCM path is still available. Start a continuous PCM stream and keep sending 20 ms binary PCM16LE frames:

{"type":"audio_stream_start","format":"pcm16le","sample_rate":16000,"channels":1,"frame_ms":20}

Each binary frame is 640 bytes: 16000Hz * 20ms * 1ch * 2 bytes. Serial can use base64 JSON chunks instead:

{"type":"audio_chunk","encoding":"base64","data":"..."}
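The frame size arithmetic and the base64 wrapping can be checked with a short sketch. The helper names are hypothetical; zero-padding the final partial frame and carrying exactly one 20 ms frame per audio_chunk are assumptions, since the README does not say how chunks are sized.

```python
import base64
import json

SAMPLE_RATE = 16000
CHANNELS = 1
FRAME_MS = 20
BYTES_PER_SAMPLE = 2  # PCM16LE

# 16000 samples/s * 20 ms = 320 samples; 320 * 1 channel * 2 bytes = 640 bytes.
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * CHANNELS * BYTES_PER_SAMPLE


def split_pcm_frames(pcm: bytes) -> list[bytes]:
    """Split PCM16LE audio into 20 ms frames of FRAME_BYTES each."""
    frames = []
    for start in range(0, len(pcm), FRAME_BYTES):
        frame = pcm[start:start + FRAME_BYTES]
        # ASSUMPTION: pad the last partial frame with silence.
        frames.append(frame.ljust(FRAME_BYTES, b"\x00"))
    return frames


def to_audio_chunk(frame: bytes) -> str:
    """Wrap PCM bytes in the base64 JSON chunk used on the serial path."""
    return json.dumps({
        "type": "audio_chunk",
        "encoding": "base64",
        "data": base64.b64encode(frame).decode("ascii"),
    })
```

Over WebSocket the frames from split_pcm_frames would be sent as binary messages directly; over serial each one would be written as a line produced by to_audio_chunk.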

The legacy PCM response is JSON messages over WebSocket or newline-delimited JSON over serial. TTS audio is decoded on the server and returned as chunked PCM16LE:

{"type":"reply","transcription":"こんにちは","text":"こんにちは、元気です。","stt_complete_ms":420,"llm_complete_ms":180,"tts_complete_ms":760,"turn_ms":1360}
{"type":"audio_stream_start","format":"pcm16le","sample_rate":16000,"channels":1,"frame_ms":20}
{"type":"audio_chunk","encoding":"base64","data":"..."}
{"type":"audio_stream_end"}
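On the receiving side, a client needs to turn that message sequence back into playable audio. The sketch below assumes the messages arrive in the order shown above and that every audio_chunk uses base64 encoding; collect_reply is a hypothetical name.

```python
import base64
import json
from typing import Iterable, Optional


def collect_reply(messages: Iterable[str]) -> tuple[Optional[str], bytes]:
    """Return (reply_text, pcm16le_bytes) for one legacy PCM turn."""
    reply_text = None
    pcm = bytearray()
    for raw in messages:
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "reply":
            reply_text = msg.get("text")
        elif kind == "audio_chunk" and msg.get("encoding") == "base64":
            pcm += base64.b64decode(msg["data"])
        elif kind == "audio_stream_end":
            break  # the turn's audio is complete
    return reply_text, bytes(pcm)
```

A real client would usually start playback as chunks arrive rather than buffering the whole reply; buffering keeps the example short.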

Device clients should keep their own microphone gated while they are playing the reply audio. The PC microphone example does this locally by waiting for the output sink to finish playback before it resumes sending microphone frames.
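One simple way to gate the microphone is to estimate when the queued reply audio will finish from its byte length. This is a different technique from the PC example, which waits for the output sink itself; the MicGate class below is a hypothetical sketch that assumes 16 kHz mono PCM16LE replies and gapless playback.

```python
class MicGate:
    """Tracks when queued reply audio is expected to finish playing."""

    def __init__(self, sample_rate: int = 16000, channels: int = 1,
                 bytes_per_sample: int = 2) -> None:
        self.bytes_per_second = sample_rate * channels * bytes_per_sample
        self.playback_ends_at = 0.0

    def on_reply_chunk(self, pcm: bytes, now: float) -> None:
        """Extend the expected playback end by the duration of this chunk."""
        start = max(now, self.playback_ends_at)
        self.playback_ends_at = start + len(pcm) / self.bytes_per_second

    def mic_open(self, now: float) -> bool:
        """True when microphone frames may be sent again."""
        return now >= self.playback_ends_at
```

In a capture loop, a client would pass time.monotonic() as now and drop microphone frames while mic_open returns False. A small extra margin may be needed to cover output buffering.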

Run the StackChan bridge over serial:

mise run agent:serial

The default serial port and baud rate come from config/gateway.toml; override them with cargo run -- serial --port /dev/tty.usbserial --baud-rate 115200.

Latency analysis

For STT -> LLM -> TTS bottleneck checks, write structured logs and summarize them:

mise run bench:voice-turn-json

This writes logs/voice-turn.jsonl and prints per-phase count, average, p50, p95, and max latency. The important fields are stt_complete_ms, llm_complete_ms, tts_complete_ms, and turn_ms.
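For ad-hoc checks, the same kind of summary can be approximated in a few lines. This sketch is not the repository's analysis script: it assumes each JSONL line is an object with the four fields at the top level, and it uses nearest-rank percentiles, which may differ from what the bench task prints.

```python
import json
import math
import statistics

PHASES = ("stt_complete_ms", "llm_complete_ms", "tts_complete_ms", "turn_ms")


def percentile(sorted_values: list, pct: float):
    """Nearest-rank percentile of an ascending, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(lines) -> dict:
    """Return {phase: {count, avg, p50, p95, max}} for the phases found."""
    samples = {phase: [] for phase in PHASES}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON log lines
        if not isinstance(record, dict):
            continue
        for phase in PHASES:
            value = record.get(phase)
            if isinstance(value, (int, float)):
                samples[phase].append(value)
    summary = {}
    for phase, values in samples.items():
        if not values:
            continue
        values.sort()
        summary[phase] = {
            "count": len(values),
            "avg": statistics.fmean(values),
            "p50": percentile(values, 50),
            "p95": percentile(values, 95),
            "max": values[-1],
        }
    return summary
```

Usage would look like summarize(open("logs/voice-turn.jsonl", encoding="utf-8")). Comparing the p95 of each phase against turn_ms shows which stage dominates slow turns.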

For a long-running server, use the same JSON log switch and analyze the saved stderr later:

mise run agent:server
python3 scripts/analyze_voice_log.py logs/server.jsonl

PC microphone example

To talk through the same WebSocket audio path from a PC microphone, start the bridge in one terminal:

mise run models:download-8gb
mise run agent:server-8gb

Then run the example client:

mise run pc:voice-loop

The PC microphone client reads its WebSocket URL and token from config/client-pc.toml. Override them with --url or --token when needed.

It captures the default microphone, sends 20 ms pcm16le 16000Hz mono frames to the bridge, receives StackChan-style reply PCM chunks, and plays them through the default speaker. Pick a specific microphone with cargo run --example pc_mic_voice_loop -- --input-device "Device Name".

The gateway saves the generated reply audio it sends to clients as WAV files in logs/captures/ by default. Use these files to distinguish TTS generation quality from PC-side playback issues. Override the directory with STACKCHAN_GHOST_CAPTURE_TTS_DIR=/path/to/captures when needed.
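To inspect the captured files quickly, a short script can list their sample rate, channel count, and duration. This sketch assumes the captures are plain PCM WAV files with a .wav extension that Python's wave module can read; the function names are hypothetical, and reading STACKCHAN_GHOST_CAPTURE_TTS_DIR here only mirrors the gateway's override for convenience.

```python
import os
import wave
from pathlib import Path
from typing import Optional


def capture_dir() -> Path:
    """Directory to inspect: the env override if set, else logs/captures."""
    return Path(os.environ.get("STACKCHAN_GHOST_CAPTURE_TTS_DIR", "logs/captures"))


def wav_summary(path: Path) -> dict:
    """Read basic format information from one WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "seconds": wav.getnframes() / rate,
        }


def list_captures(directory: Optional[Path] = None) -> dict:
    """Map each capture file name to its summary, sorted by name."""
    directory = directory or capture_dir()
    return {p.name: wav_summary(p) for p in sorted(directory.glob("*.wav"))}
```

If a capture sounds fine and has the expected duration but the PC playback does not, the problem is more likely on the client side than in TTS generation.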

About

Local-first persona and memory definition for running StackChan conversations through pi-coding-agent sessions.
