Text to Speech
Ava supports two TTS playback modes for voice assistant replies: Standard TTS (URL-based) and Streaming TTS (PCM-based). The mode is selected in Settings → Voice → Voice Replies.
Standard TTS waits until Home Assistant generates the full voice reply, then downloads and plays it from a URL.
Flow:
- HA sends `TTS_START` with the reply text (displayed as subtitle)
- HA sends `TTS_END` with the TTS audio URL
- Ava downloads the audio file and plays it via ExoPlayer
- On playback completion, Ava sends `announce_finished` to HA
Characteristics:
- Better compatibility — works with any HA TTS provider
- Supports floating subtitle overlay (word-by-word or full text)
- Supports wake sound and stop sound
- Higher latency — first audio is heard only after the full reply is generated and downloaded
- HTTP streaming playback with configurable connect/read timeout (30s)
Early TTS URL: HA may include the TTS URL in RUN_START before the conversation even begins. Ava caches this URL and uses it when TTS_END arrives, reducing latency.
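The early-URL behaviour can be sketched as a small session object. This is a minimal illustration in Python, not Ava's actual code: the class name `StandardTtsSession` and its method names are hypothetical, and it assumes the cached URL from `RUN_START` takes precedence over the URL delivered with `TTS_END`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StandardTtsSession:
    """Tracks one Standard TTS reply (hypothetical helper, not Ava's code)."""

    early_url: Optional[str] = None  # URL announced in RUN_START, if any
    subtitle: Optional[str] = None   # reply text from TTS_START

    def on_run_start(self, tts_url: Optional[str]) -> None:
        # HA may already include the TTS URL here; remember it for later.
        self.early_url = tts_url

    def on_tts_start(self, text: str) -> None:
        # The reply text is what gets shown as the subtitle.
        self.subtitle = text

    def on_tts_end(self, tts_url: Optional[str]) -> Optional[str]:
        # Assumption: the cached early URL wins; otherwise use TTS_END's URL.
        return self.early_url or tts_url
```

With a cached URL, `on_tts_end` returns it without waiting on the URL in `TTS_END`; without one, it falls back to the `TTS_END` URL.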
Streaming TTS starts playing audio while the reply is still being generated on the HA side. HA sends PCM audio chunks in real time.
Flow:
- HA sends `TTS_START`; Ava enters the Responding state
- HA sends `TTS_STREAM_START`; Ava opens an `AudioTrack` (16 kHz, 16-bit, mono PCM)
- HA sends PCM audio chunks via `VoiceAssistantAudio` messages; Ava writes them to `AudioTrack` in real time
- HA sends `TTS_STREAM_END`; Ava drains remaining audio, then completes
Characteristics:
- Faster first response — audio starts playing as soon as the first PCM chunk arrives
- No floating subtitle overlay (subtitles are suppressed in streaming mode because text arrives in fragments)
- Requires HA server with streaming TTS output support (HA `SPEAKER` feature flag)
- PCM audio is played via raw `AudioTrack`, not ExoPlayer
- Volume scaling is applied per-sample on PCM data before writing to the track
- Buffered chunk handling: if PCM chunks arrive before `TTS_STREAM_START`, they are buffered and flushed when the stream opens (max 256 frames)
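The per-sample volume scaling and early-chunk buffering described above might look roughly like the following Python sketch. It is illustrative only: `StreamingTtsPlayer` and `scale_pcm16` are hypothetical names, a plain list stands in for `AudioTrack`, the "256 frames" limit is interpreted as 256 buffered chunks, and dropping the oldest chunk on overflow is an assumption.

```python
import struct
from collections import deque

MAX_PENDING_CHUNKS = 256  # interpretation of "max 256 frames" in the text


def scale_pcm16(chunk: bytes, gain: float) -> bytes:
    """Scale little-endian 16-bit PCM samples by gain, clamping to int16."""
    count = len(chunk) // 2
    samples = struct.unpack(f"<{count}h", chunk[: count * 2])
    scaled = (max(-32768, min(32767, int(s * gain))) for s in samples)
    return struct.pack(f"<{count}h", *scaled)


class StreamingTtsPlayer:
    """Stand-in for the AudioTrack-based player; `written` replaces the track."""

    def __init__(self, gain: float = 1.0) -> None:
        self.gain = gain
        self.is_open = False
        # Assumption: when the buffer is full, the oldest chunk is dropped.
        self.pending = deque(maxlen=MAX_PENDING_CHUNKS)
        self.written = []

    def on_audio(self, chunk: bytes) -> None:
        if self.is_open:
            self.written.append(scale_pcm16(chunk, self.gain))
        else:
            self.pending.append(chunk)  # arrived before TTS_STREAM_START

    def on_stream_start(self) -> None:
        self.is_open = True
        while self.pending:  # flush early chunks in arrival order
            self.written.append(scale_pcm16(self.pending.popleft(), self.gain))

    def on_stream_end(self) -> None:
        self.is_open = False
```

Clamping matters when the gain exceeds 1.0: without it, a scaled sample could overflow the 16-bit range and wrap around into audible distortion.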
Feature flag negotiation: When streaming TTS is enabled in settings, Ava sets the SPEAKER feature flag in the VoiceAssistantConfigurationRequest. HA uses this to decide whether to send PCM streams.
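As a rough illustration of the negotiation, feature flags of this kind are typically combined as a bitmask. The sketch below is an assumption-laden example: the bit positions and the `build_feature_flags` helper are hypothetical and are not taken from this page or from Ava's source.

```python
from enum import IntFlag


class VoiceAssistantFeature(IntFlag):
    # Bit positions are illustrative assumptions, not documented values.
    VOICE_ASSISTANT = 1 << 0
    SPEAKER = 1 << 1


def build_feature_flags(streaming_tts_enabled: bool) -> VoiceAssistantFeature:
    """Advertise SPEAKER only when Streaming TTS is enabled in settings."""
    flags = VoiceAssistantFeature.VOICE_ASSISTANT
    if streaming_tts_enabled:
        flags |= VoiceAssistantFeature.SPEAKER
    return flags
```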
| Feature | Standard TTS | Streaming TTS |
|---|---|---|
| First audio latency | Higher (wait for full reply) | Lower (play while generating) |
| Floating subtitles | Yes | No |
| Wake sound | Yes | Yes |
| Stop sound | Yes | Yes |
| Server requirement | Any HA TTS provider | HA with streaming output support |
| Playback engine | ExoPlayer (HTTP) | AudioTrack (raw PCM) |
| Audio format | Any (URL-based) | 16 kHz 16-bit mono PCM |
| Whisper response | Yes | Yes |
Ava supports HA's VoiceAssistantAnnounceRequest for proactive announcements (e.g., timer finished, ask_question). This is separate from the conversation TTS flow.
Flow:
- HA sends `AnnounceRequest` with `media_id` (and optional `preannounce_media_id`)
- Ava ducks media playback, plays the announcement audio
- On completion, Ava sends `announce_finished` to HA
- If `start_conversation=true`, Ava starts a new voice pipeline to listen for the user's response (10s timeout)
Preannounce: If a preannounce_media_id is provided, Ava plays it first (e.g., "Attention:"), then plays the main media. A 3-second load timeout skips the preannounce if it fails to start.
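The announcement sequence, including the preannounce timeout, can be summarized as an ordered plan. This Python sketch is only a model of the flow described above: `plan_announcement`, its parameters, and the step strings are hypothetical, and the load time is passed in as a number rather than measured from a real player.

```python
from typing import List, Optional

PREANNOUNCE_LOAD_TIMEOUT_S = 3  # load timeout quoted in the text
CONVERSATION_TIMEOUT_S = 10     # listen timeout quoted in the text


def plan_announcement(
    media_id: str,
    preannounce_media_id: Optional[str] = None,
    start_conversation: bool = False,
    preannounce_load_time_s: float = 0.0,
) -> List[str]:
    """Return the ordered steps for one announcement (hypothetical helper)."""
    steps = ["duck_media"]
    if preannounce_media_id:
        if preannounce_load_time_s <= PREANNOUNCE_LOAD_TIMEOUT_S:
            steps.append(f"play:{preannounce_media_id}")
        else:
            # Preannounce failed to start in time: skip it, keep the main media.
            steps.append("skip_preannounce")
    steps.append(f"play:{media_id}")
    steps.append("send:announce_finished")
    if start_conversation:
        steps.append(f"start_pipeline:timeout={CONVERSATION_TIMEOUT_S}s")
    return steps
```

Note that a slow preannounce never blocks the main announcement; only the optional lead-in sound is dropped.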
Ava can temporarily lower the playback volume for wake sounds and TTS replies, useful for bedside or quiet environments.
Settings:
| Setting | Description | Default |
|---|---|---|
| Voice playback volume | Enable whisper mode | Off |
| Adaptive | Lower volume only when speech/ambient is quiet | On |
| Quiet volume | Volume level for whisper mode (15%–100%) | 30% |
Adaptive mode:
- Wake sound: Measures ambient mic level before wake detection. If below threshold (0.04 RMS), lowers wake sound volume.
- TTS reply: Measures user's speech peak during the session. If below threshold, lowers TTS playback volume.
- After TTS finishes, the original system volume is restored.
Fixed mode: Always uses the configured quiet volume for wake sound and TTS, regardless of ambient/speech level.
If the quiet volume is set above 30%, a toast warns that whisper mode may no longer be quiet enough to avoid disturbing others.
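The choice between adaptive and fixed mode for the wake sound can be sketched as follows. This is a hedged Python illustration: `rms` and `wake_sound_volume` are hypothetical helpers, volumes are expressed as fractions (0.30 for 30%), and the mic samples are assumed to be normalized to the range [-1.0, 1.0], which the text does not state explicitly.

```python
import math
from typing import Sequence

QUIET_RMS_THRESHOLD = 0.04  # ambient threshold quoted in the text


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square of samples assumed normalized to [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def wake_sound_volume(
    ambient: Sequence[float],
    normal_volume: float,
    quiet_volume: float = 0.30,
    adaptive: bool = True,
) -> float:
    """Pick the wake-sound volume for whisper mode (hypothetical helper)."""
    if not adaptive:
        return quiet_volume  # fixed mode: always use the quiet volume
    if rms(ambient) < QUIET_RMS_THRESHOLD:
        return quiet_volume  # quiet room: lower the wake sound
    return normal_volume     # noisy room: keep the normal volume
```

The TTS-reply decision would follow the same shape, using the user's speech peak in place of the ambient level.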
Ava displays localized toast messages for voice pipeline errors. Error messages are determined by the error code from HA and displayed in the user's locale.
Supported languages: English (default), Chinese (zh), German (de), Russian (ru), Portuguese (pt), Vietnamese (vi).
Locale detection: Ava uses LocaleUtils to detect the system locale at runtime. If the locale is not one of the supported languages, English is used as fallback.
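The fallback rule can be expressed in a few lines. This Python sketch is illustrative and does not reproduce `LocaleUtils`; the `resolve_language` helper and the assumption that only the language part of the locale tag is compared are hypothetical.

```python
SUPPORTED_LANGUAGES = {"en", "zh", "de", "ru", "pt", "vi"}


def resolve_language(locale_tag: str) -> str:
    """Map a locale tag such as 'de-DE' or 'pt_BR' to a supported language."""
    # Keep only the language subtag; region and script are ignored here.
    language = locale_tag.replace("_", "-").split("-")[0].lower()
    return language if language in SUPPORTED_LANGUAGES else "en"
```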
| Error Code Pattern | String Key | English Message |
|---|---|---|
| `stt-no-text` | `pipeline_error_no_speech` | Speech detected, no text |
| `timeout` / `timed-out` | `pipeline_error_no_response` | Response timed out, possibly network delay or busy service |
| `stt-*` / `intent-*` / `tts*` | `pipeline_error_config` | Voice processing paused, config may need a look |
| `cloud-auth` | `pipeline_error_cloud_auth` | Cloud auth issue, re-login |
| `wake` / `duplicate` | `pipeline_error_wake` | Wake signal anomaly, retry |
| (other) | `pipeline_error_unknown` | Temporary issue, will retry shortly |
| (HA disconnected) | `pipeline_error_ha_disconnected` | Server offline, check network |
| (HA start failed) | `pipeline_error_ha_start_failed` | Pipeline init issue, check config |
| (TTS playback) | `pipeline_error_tts_playback` | Voice reply playback failed: TTS address timed out, check the address |
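One way to read the pattern table is as a list of rules checked from top to bottom, so that the specific `stt-no-text` code is matched before the broader `stt-*` prefix. The Python sketch below makes that reading explicit. It is an assumption, not Ava's implementation: `error_string_key` is a hypothetical helper, `timeout`, `timed-out`, `wake`, and `duplicate` are treated as substring matches, and the three non-code rows (HA disconnected, HA start failed, TTS playback) are left out because they are not triggered by an HA error code.

```python
def error_string_key(code: str) -> str:
    """Map an HA pipeline error code to a string key, in table order."""
    code = code.lower()
    if code == "stt-no-text":
        return "pipeline_error_no_speech"
    if "timeout" in code or "timed-out" in code:
        return "pipeline_error_no_response"
    if code.startswith(("stt-", "intent-", "tts")):
        return "pipeline_error_config"
    if code == "cloud-auth":
        return "pipeline_error_cloud_auth"
    if "wake" in code or "duplicate" in code:
        return "pipeline_error_wake"
    return "pipeline_error_unknown"
```

For instance, a made-up code such as `stt-example` falls through to `pipeline_error_config`, while anything unrecognized ends up as `pipeline_error_unknown`.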
| Language | `pipeline_error_no_speech` | `pipeline_error_tts_playback` |
|---|---|---|
| English | Speech detected, no text | Voice reply playback failed: TTS address timed out, check the address |
| Chinese | 检测到语音,但无文本 | 语音回复播放失败:TTS 地址超时请检查地址 |
| German | Sprache erkannt, kein Text | Sprachantwort-Wiedergabe fehlgeschlagen: TTS-Adresse timed out, Adresse prüfen |
| Russian | Речь есть, текста нет | Ошибка воспроизведения голосового ответа: тайм-аут адреса TTS, проверьте адрес |
| Portuguese | Voz detectada, sem texto | Falha na reprodução da resposta por voz: endereço TTS expirou, verifique o endereço |
| Vietnamese | Có giọng nói, không có văn bản | Phát phản hồi giọng nói thất bại: địa chỉ TTS hết thời gian, kiểm tra địa chỉ |
- TTS playback uses `USAGE_MEDIA` with `AUDIOFOCUS_GAIN_TRANSIENT_MAY_DUCK`
- When TTS starts, Ava requests audio focus, which may duck other media players
- On TTS completion, audio focus is released, allowing other media to resume
- The TTS player volume follows the system media volume
- Whisper mode can temporarily override the volume during TTS playback
All TTS-related settings are in Settings → Voice → Voice Replies.
| Setting | Description | Default |
|---|---|---|
| TTS mode | Standard or Streaming | Standard |
| Floating subtitle overlay | Show subtitles in floating window during conversations | Off |
| Voice playback volume (Whisper) | Lower volume for quiet speech | Off |
| Adaptive whisper | Only whisper when speech is quiet | On |
| Quiet volume | Volume for whisper mode | 30% |
AI Hub TTS is a Home Assistant add-on based on the Kokoro model. It is the recommended TTS engine for Ava.
Features:
- Fully offline — no internet required after model download
- 26 high-quality voices
- CPU inference, generation < 1 second
- Wyoming protocol with auto-discovery — HA picks it up automatically
- Memory usage < 500 MB
Installation:
- In Home Assistant: Settings → Add-ons → Add-on Store
- Click ⋮ → Repositories → Add `https://github.com/truemanshum/ai-hub-tts`
- Refresh and install AI Hub TTS
- First start auto-downloads the model (~500 MB)
- Settings → Voice Assistants → select AI Hub TTS as the TTS engine
Configuration:

    voice: af_heart      # Voice selection
    sample_rate: 24000   # Sample rate
    debug: false         # Debug mode

With Ava: Works in both Standard and Streaming TTS modes. For Streaming TTS, ensure the HA voice assistant pipeline has streaming output enabled.