Wake Word Engines
Ava has two on-device wake word engines. Both run entirely locally — no audio leaves the device for wake detection.
Compatible with Android 5-16.
| Dimension | Value |
|---|---|
| Architecture | TFLite binary classification |
| Model size | 50-80KB, uint8 quantized |
| Inference | 10ms frame, stride 3 |
| Frontend | microfeatures |
| Decision | 5-frame sliding window mean > threshold |
| Output | Scalar 0-1 probability |
| Wake-word swap | Full retraining required |
| CPU / memory | Minimal |
| False wake defense | Threshold only (zero-sum trade-off) |
| Interpretability | None |
| Best for | Low-end Android 5+ persistent background |
microWakeWord is the default engine. It uses TensorFlow Lite with tiny quantized models. Each wake word is a separate .tflite model file paired with a .json config. The detector runs a sliding window average over the last 5 frames and triggers when the average probability exceeds the cutoff.
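The sliding-window decision described above can be sketched in a few lines. This is a minimal illustration, not Ava's implementation: the class name `SlidingWindowDetector` and the choice to wait for a full window before triggering are assumptions.

```python
from collections import deque


class SlidingWindowDetector:
    """Triggers when the mean of the last `window` probabilities exceeds `cutoff`."""

    def __init__(self, cutoff: float, window: int = 5):
        self.cutoff = cutoff
        self.probs = deque(maxlen=window)  # oldest probability drops off automatically

    def update(self, prob: float) -> bool:
        self.probs.append(prob)
        # Assumption: require a full window so one early spike cannot trigger.
        if len(self.probs) < self.probs.maxlen:
            return False
        return sum(self.probs) / len(self.probs) > self.cutoff


detector = SlidingWindowDetector(cutoff=0.9)
results = [detector.update(p) for p in [0.1, 0.99, 0.99, 0.99, 0.99, 0.99]]
```

Because the decision is a single mean compared against a single cutoff, lowering the cutoff to catch quieter speech also admits more false wakes, which is the zero-sum trade-off noted in the table.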
Built-in micro models (9):
| Model ID | Wake Word | Author |
|---|---|---|
| `hey_jarvis` | Hey Jarvis | Kevin Ahrendt |
| `alexa` | Alexa | Kevin Ahrendt |
| `hey_home_assistant` | Hey Home Assistant | Michael Hansen |
| `hey_mycroft` | Hey Mycroft | Kevin Ahrendt |
| `hey_luna` | Hey Luna | adamlonsdale |
| `hey_peppa_pig` | Hey Peppa Pig | Michael Hansen |
| `okay_computer` | Okay Computer | Michael Hansen |
| `okay_nabu` | OK Nabu | Kevin Ahrendt |
| `choo_choo_homie` | Choo Choo Homie | Michael Hansen |
Stop word: stop (Stop) by Kevin Ahrendt.
| Dimension | Value |
|---|---|
| Architecture | ONNX CTC phoneme decoding + edit distance |
| Model size | ~500KB |
| Inference | 80ms cycle, 1300ms window, 128×40 feature map |
| Frontend | Log-Mel + energy gate + incremental feature extraction |
| Decision | Energy gate → CTC phoneme decode → edit distance ≤2 → counter/borderline confirm → per-keyword cooldown |
| Output | Phoneme sequence with traceability |
| Wake-word swap | Manifest JSON hot-swap |
| CPU / memory | Significantly higher |
| False wake defense | Multi-layer independent gates |
| Interpretability | Phoneme-level debugging |
| Best for | Noise-sensitive, explainability-required deployments |
vsWakeWord uses ONNX Runtime with CTC (Connectionist Temporal Classification) decoding. Instead of a binary "is this the wake word?" classifier, it decodes the audio into a phoneme sequence and matches against target phonemes with edit distance. This makes it more robust to noise and accents, at the cost of higher CPU usage.
Built-in vs models (3):
| Model ID | Wake Word | Type |
|---|---|---|
| `hey_jarvis` | Hey Jarvis | Wake word |
| `ok_nabu` | OK Nabu | Wake word |
| `ok_stop` | Ok Stop | Stop classifier |
The engine uses a multi-layer gating architecture. Each layer independently filters out false wakes before the next layer runs:
Layer 1 — Energy Gate (Sleep/Wake)
A lightweight RMS energy gate runs before any model inference. When the room is quiet, the ONNX model is completely asleep — no tensor computation, no phoneme decoding. The gate uses dual thresholds: a sleep threshold and a wake threshold. Once audio energy exceeds the wake threshold, the engine transitions to active mode. After 2.4 seconds of continuous silence, it returns to sleep.
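A dual-threshold gate of this kind is a small hysteresis state machine. The sketch below is an illustration under assumptions: the class name `EnergyGate`, the RMS threshold values, and the 80 ms chunk length are hypothetical; only the 2.4 second silence timeout comes from the description above.

```python
import math


class EnergyGate:
    """Hysteresis gate: opens above wake_rms, closes after sustained quiet."""

    def __init__(self, wake_rms=0.02, sleep_rms=0.01, silence_ms=2400):
        self.wake_rms = wake_rms      # hypothetical value
        self.sleep_rms = sleep_rms    # hypothetical value
        self.silence_ms = silence_ms  # 2.4 s, as described in the text
        self.active = False
        self.quiet_ms = 0

    def update(self, samples, chunk_ms=80):
        rms = math.sqrt(sum(s * s for s in samples) / len(samples))
        if not self.active:
            if rms > self.wake_rms:
                self.active = True
                self.quiet_ms = 0
        elif rms < self.sleep_rms:
            # Count continuous quiet time; close once it reaches the timeout.
            self.quiet_ms += chunk_ms
            if self.quiet_ms >= self.silence_ms:
                self.active = False
        else:
            self.quiet_ms = 0  # any non-quiet chunk resets the silence timer
        return self.active


gate = EnergyGate()
quiet = [0.001] * 1280  # 80 ms at 16 kHz
loud = [0.1] * 1280
```

Model inference would run only while `update` returns `True`. Using two thresholds instead of one keeps the gate from flapping when the energy hovers near a single cutoff.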
Layer 2 — Buffered Audio Replay
While the energy gate was sleeping, audio was being buffered. When the gate opens (someone starts speaking), the buffered audio is replayed through the ONNX model in a single batch. This means the beginning of the wake word — which happened while the gate was still deciding — is not lost. The user experiences instant wake detection without waiting for the model to warm up from silence.
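The replay step can be sketched with a bounded queue that keeps only the most recent chunks while the gate is closed. The function name `feed`, the `run_model` callback, and the backlog size are assumptions made for illustration.

```python
from collections import deque


def feed(chunk, gate_open, backlog, run_model):
    """Buffer while asleep; on an open gate, replay the backlog as one batch."""
    backlog.append(chunk)
    if not gate_open:
        return  # asleep: keep buffering, no inference
    batch = list(backlog)  # oldest chunk first, newest last
    backlog.clear()
    run_model(batch)


# Assumption: 12 chunks of 80 ms keeps roughly one second of history.
backlog = deque(maxlen=12)
batches = []
for chunk, is_open in [("a", False), ("b", False), ("c", True), ("d", True)]:
    feed(chunk, is_open, backlog, batches.append)
```

In this run the first model call receives `["a", "b", "c"]`, so audio captured before the gate opened still reaches the model, and later calls receive only the new chunk.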
Layer 3 — Incremental Feature Extraction
vsWakeWord extracts log-Mel features from audio using a ring buffer. Instead of recomputing the full feature window every chunk, the engine shifts the existing feature buffer and only computes the new frames. On a 1300ms window with 80ms chunks, this means computing approximately 6 new frames instead of 128 every cycle — a 20x reduction in FFT work.
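The incremental idea can be sketched as follows, assuming the framing parameters from the example manifest (16 kHz audio, 25 ms frames, 10 ms hop, 128-frame window). The class name `IncrementalFeatures` is hypothetical, and `frame_feature` is a stand-in that returns mean energy rather than a real log-Mel column.

```python
from collections import deque

SAMPLE_RATE = 16000
FRAME = SAMPLE_RATE * 25 // 1000  # 400 samples per 25 ms frame
HOP = SAMPLE_RATE * 10 // 1000    # 160 samples per 10 ms hop
MAX_FRAMES = 128                  # frames covering the 1300 ms window


def frame_feature(frame):
    """Stand-in for one log-Mel column (assumption: mean energy only)."""
    return sum(s * s for s in frame) / len(frame)


class IncrementalFeatures:
    def __init__(self):
        self.features = deque(maxlen=MAX_FRAMES)  # old frames fall off the left
        self.tail = []  # samples not yet fully consumed by framing

    def push(self, samples):
        """Append audio and compute only the frames that became available."""
        self.tail.extend(samples)
        new_frames = 0
        while len(self.tail) >= FRAME:
            self.features.append(frame_feature(self.tail[:FRAME]))
            del self.tail[:HOP]  # advance by one hop, keep the overlap
            new_frames += 1
        return new_frames


extractor = IncrementalFeatures()
chunk = [0.0] * (SAMPLE_RATE * 80 // 1000)  # one 80 ms chunk
counts = [extractor.push(chunk) for _ in range(20)]
```

With these assumed parameters, every chunk after the first adds 8 frames while the window stays capped at 128, so only a small fraction of the window is recomputed per cycle. The exact per-cycle count in Ava depends on its actual hop and chunk sizes.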
Layer 4 — CTC Phoneme Decode
The ONNX model outputs frame-level phoneme log-probabilities. A CTC decoder matches the output sequence against target phoneme sequences for each registered wake word. The decoder supports multiple pronunciations per wake word and uses edit distance to tolerate minor phoneme variations.
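A simplified version of this matching step is greedy CTC decoding followed by a Levenshtein comparison. The sketch assumes the per-frame best phoneme IDs have already been taken from the model output and that the whole decoded sequence is compared against each target; the function names are hypothetical, and the real decoder may score or align differently.

```python
def ctc_greedy_decode(frame_ids, blank_id):
    """Collapse consecutive repeats, then drop blanks (greedy CTC rule)."""
    decoded, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank_id:
            decoded.append(i)
        prev = i
    return decoded


def edit_distance(a, b):
    """Levenshtein distance: substitutions, insertions, deletions cost 1."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def matches(frame_ids, targets, blank_id, max_edit_distance):
    """True if any target pronunciation is within the edit-distance budget."""
    decoded = ctc_greedy_decode(frame_ids, blank_id)
    return any(edit_distance(decoded, t) <= max_edit_distance for t in targets)


frames = [1, 27, 27, 1, 9, 1, 1, 15, 15]  # blank_id = 1
decoded = ctc_greedy_decode(frames, blank_id=1)
```

Listing several target sequences per wake word is how multiple pronunciations would be supported in this sketch: a match against any one of them counts.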
Layer 5 — Counter / Borderline Confirmation
Two confirmation strategies prevent single-window false triggers:
- Counter mode: Requires consecutive matches (typically 2) before triggering. A soft hold mechanism tolerates brief dips between syllables when speaking quickly. High-confidence detections can bypass the counter entirely.
- Borderline mode: Detections near the threshold require a second confirming window within a short time frame. High-confidence detections trigger immediately.
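Counter mode with a high-confidence bypass can be sketched as below. The class name `HitCounter` and the score values are hypothetical and use a single made-up score scale; the soft hold mechanism and borderline mode are deliberately left out to keep the example short.

```python
class HitCounter:
    """Requires consecutive above-threshold windows unless the score is very high."""

    def __init__(self, threshold, required_hits=2, bypass=None):
        self.threshold = threshold
        self.required_hits = required_hits
        self.bypass = bypass
        self.hits = 0

    def update(self, score):
        if self.bypass is not None and score >= self.bypass:
            self.hits = 0
            return True  # high confidence: skip the counter
        if score >= self.threshold:
            self.hits += 1
            if self.hits >= self.required_hits:
                self.hits = 0
                return True
        else:
            self.hits = 0  # a miss breaks the consecutive run
        return False


counter = HitCounter(threshold=0.6, required_hits=2, bypass=0.95)
fired = [counter.update(s) for s in [0.7, 0.3, 0.7, 0.7, 0.99]]
```

In this run the isolated first hit is discarded, the pair of consecutive hits triggers, and the final score triggers immediately through the bypass.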
Layer 6 — Per-Keyword Cooldown
After a keyword triggers, it enters a cooldown period (default 2000ms, configurable per keyword in the manifest). This prevents the same wake word from firing repeatedly during a single utterance.
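A per-keyword cooldown amounts to remembering the last trigger time for each keyword. In this sketch the class name `KeywordCooldown` and the millisecond timestamps passed in by the caller are assumptions; the 2000 ms default comes from the text above.

```python
class KeywordCooldown:
    def __init__(self, default_ms=2000):
        self.default_ms = default_ms
        self.per_keyword_ms = {}  # optional overrides, e.g. from a manifest
        self.last_fire_ms = {}

    def allow(self, keyword, now_ms):
        """Return True and record the trigger if the keyword is not cooling down."""
        cooldown = self.per_keyword_ms.get(keyword, self.default_ms)
        last = self.last_fire_ms.get(keyword)
        if last is not None and now_ms - last < cooldown:
            return False
        self.last_fire_ms[keyword] = now_ms
        return True


cooldown = KeywordCooldown()
first = cooldown.allow("hey_jarvis", now_ms=0)
repeat = cooldown.allow("hey_jarvis", now_ms=1500)   # still cooling down
other = cooldown.allow("ok_nabu", now_ms=1500)       # tracked independently
later = cooldown.allow("hey_jarvis", now_ms=2000)
```

Tracking each keyword separately means a second wake word can still fire while the first one is cooling down.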
Layer 7 — Adaptive Gain Normalization
A smooth adaptive gain normalizes voice volume before inference. This ensures the model sees consistently-leveled audio regardless of distance or microphone sensitivity. Consistent input means CTC confidence scores are more stable, which means the confirmation gates reach their threshold in fewer attempts — faster triggering with fewer false rejects.
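One common way to build such a normalizer is to track chunk RMS and move the gain a fraction of the way toward the value that would reach a target level. The sketch below assumes that approach; the class name `AdaptiveGain`, the target level, the smoothing factor, and the gain cap are all hypothetical and not taken from Ava.

```python
import math


class AdaptiveGain:
    def __init__(self, target_rms=0.1, smoothing=0.2, max_gain=10.0):
        self.target_rms = target_rms
        self.smoothing = smoothing  # fraction of the gap closed per chunk
        self.max_gain = max_gain    # cap so near-silence is not blown up
        self.gain = 1.0

    def process(self, samples):
        rms = math.sqrt(sum(s * s for s in samples) / len(samples))
        if rms > 1e-6:  # leave the gain unchanged on digital silence
            desired = min(self.target_rms / rms, self.max_gain)
            self.gain += self.smoothing * (desired - self.gain)
        # Apply the gain and clip to the valid [-1, 1] sample range.
        return [max(-1.0, min(1.0, s * self.gain)) for s in samples]


agc = AdaptiveGain()
quiet_voice = [0.01, -0.01] * 100
for _ in range(50):
    leveled = agc.process(quiet_voice)
```

Smoothing the gain instead of jumping to the desired value avoids abrupt level changes between chunks, which would otherwise distort the features the model sees.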
Each vs model is a pair: id.json (manifest) + id.ort (ONNX model). The manifest defines:
```json
{
  "name": "hey_jarvis",
  "format": "vs-wake-word-ctc-v1",
  "recommended_threshold": 0.61,
  "input": { "shape": [1, 128, 40], "feature": "log_mel" },
  "output": { "shape": [1, 49, 52], "meaning": "frame_level_phoneme_log_probabilities" },
  "feature_config": {
    "sample_rate": 16000, "window_ms": 1300, "frame_ms": 25, "hop_ms": 10,
    "n_fft": 512, "n_mels": 40, "f_min": 80.0, "f_max": 7600.0
  },
  "ctc": {
    "vocab_size": 52, "blank_id": 1, "max_edit_distance": 1,
    "wake_word_targets": [[27, 9, 15, 2, 24, 44, 5, 3, 36, 41, 15, 38]],
    "wake_word_target_phonemes": [["h", "e", "ɪ", " ", "d", "ʒ", "ɑ", "ː", "ɹ", "v", "ɪ", "s"]]
  },
  "runtime": {
    "required_hits": 2, "hit_mode": "consecutive",
    "cooldown_ms": 2000, "high_confidence_bypass": 6.8
  },
  "stop_classifier": false
}
```

Key fields:
- `wake_word_targets`: phoneme ID sequences to match (multiple pronunciations allowed)
- `wake_word_target_phonemes`: human-readable phoneme sequences for debugging
- `max_edit_distance`: how many phoneme substitutions/insertions/deletions are tolerated
- `runtime.required_hits`: how many consecutive matches are needed to trigger (2 = double confirm)
- `runtime.high_confidence_bypass`: skip the hit counter if confidence is very high
- `stop_classifier`: `true` for stop-word models (different gating logic)
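When writing a custom manifest, a quick consistency check can catch mismatched phoneme lists before the file is imported. The sketch below is a hypothetical helper, not part of Ava: the function name `load_manifest` and the specific checks (matching lengths, IDs inside the vocabulary, no blank ID inside a target) are assumptions about what a sane manifest looks like.

```python
import json


def load_manifest(text):
    """Parse a manifest and run basic sanity checks on the CTC targets."""
    manifest = json.loads(text)
    ctc = manifest["ctc"]
    targets = ctc["wake_word_targets"]
    phonemes = ctc["wake_word_target_phonemes"]
    if len(targets) != len(phonemes):
        raise ValueError("each target needs a matching phoneme list")
    for ids, names in zip(targets, phonemes):
        if len(ids) != len(names):
            raise ValueError("phoneme IDs and names differ in length")
        for i in ids:
            # Assumption: targets never contain the blank or out-of-range IDs.
            if not 0 <= i < ctc["vocab_size"] or i == ctc["blank_id"]:
                raise ValueError(f"invalid phoneme ID {i}")
    return manifest


EXAMPLE = """
{
  "name": "hey_jarvis",
  "ctc": {
    "vocab_size": 52, "blank_id": 1, "max_edit_distance": 1,
    "wake_word_targets": [[27, 9, 15, 2, 24, 44, 5, 3, 36, 41, 15, 38]],
    "wake_word_target_phonemes": [["h", "e", "ɪ", " ", "d", "ʒ", "ɑ", "ː", "ɹ", "v", "ɪ", "s"]]
  }
}
"""
manifest = load_manifest(EXAMPLE)
```

The example input is a trimmed copy of the `hey_jarvis` manifest shown above, keeping only the fields the checks read.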
| Dimension | microWakeWord | vsWakeWord |
|---|---|---|
| Architecture | TFLite binary classification | ONNX CTC phoneme decoding + edit distance |
| Model size | 50-80KB, uint8 quantized | ~500KB |
| Inference | 10ms frame, stride 3 | 80ms cycle, 1300ms window, 128×40 feature map |
| Frontend | microfeatures | Log-Mel + energy gate + incremental extraction |
| Decision | 5-frame sliding window mean | Energy gate → CTC decode → edit distance → counter/borderline confirm → cooldown |
| Output | Scalar 0-1 probability | Phoneme sequence with traceability |
| Wake-word swap | Full retraining required | Manifest JSON hot-swap |
| CPU / memory | Minimal | Significantly higher |
| False wake defense | Threshold only (zero-sum trade-off) | Multi-layer independent gates |
| Interpretability | None | Phoneme-level debugging |
| Best for | Low-end Android 5+ persistent background | Noise-sensitive, explainability-required deployments |
Each engine stores its wake word selection independently (microWakeWords / vsWakeWords). Switching engines automatically restores the last selection made for that engine, so selections are not lost and the detector does not fail silently.
Cross-engine ID mapping: micro's okay_nabu automatically maps to VS's ok_nabu. HA-configured wake words also resolve correctly across engines.
The service restarts automatically on an engine switch, keeping the detector in sync with settings.
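The cross-engine ID mapping can be pictured as a small alias table consulted when the engine changes. This sketch is illustrative only: the `resolve` function, the reverse mapping from vs to micro, and the behaviour of returning `None` when the target engine has no matching model are assumptions. Only the `okay_nabu` to `ok_nabu` pair comes from the text.

```python
MICRO_TO_VS = {"okay_nabu": "ok_nabu"}
# Assumption: the mapping is applied in both directions.
VS_TO_MICRO = {vs_id: micro_id for micro_id, vs_id in MICRO_TO_VS.items()}


def resolve(model_id, target_engine, available):
    """Translate a model ID for the target engine; None if it has no such model."""
    table = MICRO_TO_VS if target_engine == "vs" else VS_TO_MICRO
    candidate = table.get(model_id, model_id)  # IDs without an alias pass through
    return candidate if candidate in available else None


VS_MODELS = {"hey_jarvis", "ok_nabu", "ok_stop"}
mapped = resolve("okay_nabu", "vs", VS_MODELS)
same = resolve("hey_jarvis", "vs", VS_MODELS)
missing = resolve("alexa", "vs", VS_MODELS)
```

Here `okay_nabu` resolves to `ok_nabu`, `hey_jarvis` keeps its ID because both engines use the same one, and `alexa` has no vs counterpart among the built-in models.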
- Open Ava app
- Go to Settings -> Voice Config
- Find Wake Word Engine and choose microWakeWord or vsWakeWord
- Find Wake Word 1 option
- Select your preferred wake word from the list
- Optionally configure Wake Word 2 for dual wake word mode
- The new wake word takes effect after the service restarts, which happens automatically
Adjust sensitivity to control how easily the wake word triggers:
- Higher sensitivity = easier to trigger, but more false positives
- Lower sensitivity = fewer false positives, but may miss quiet speech
Each wake word can have its own wake sound:
- Wake Word 1 Sound: Played when Wake Word 1 is detected
- Wake Word 2 Sound: Played when Wake Word 2 is detected
- Default Sound: Used if no custom sound is set
- None: Silent recording start
Ava provides clear visual feedback during wake and conversation:
Wake Instant:
- Colorful ripple expanding from screen center
- Android 13+: RuntimeShader with distorted halo + star particles
- Android 7: Soft circular diffusion
- Compatibility paths for other versions
Conversation (when Floating Subtitle is disabled):
- Full-screen edge glow that changes with state:
- Listening: Edge light breathes with microphone volume
- Processing: Slow breathing animation
- Speaking: Pulsates with TTS energy
Dual Wake-Word Color Coding:
- Wake Word 1 = green (default)
- Wake Word 2 = blue (default)
- Ripple and edge light match the triggered wake word
- Custom colors available in Settings → Extensions → Interface → Voice feedback colors
- 7 rainbow presets (red through purple) also available
Technical Notes:
- Edge glow uses pre-rendered Gaussian blur bitmaps for performance
- Ripple animation driven by system uptime (prevents Kiosk devices with "animation duration = 0" from killing the effect)
- Android 7.0/7.1 optimized to clean circular diffusion without Shader dependency
Stop words interrupt the current conversation or stop Ava's response (e.g., timer alarm).
| Stop Word | Engine | Model ID | Description |
|---|---|---|---|
| Stop | microWakeWord | `stop` | Default stop word for micro engine |
| Ok Stop | vsWakeWord | `ok_stop` | Stop word for vs engine (`stop_classifier: true`) |
Stop-word detection only runs when actually needed (since 0.5.2):
- A timer alarm is actively ringing
- A voice session is in progress (Listening, Processing, Responding)
During idle standby with no alarm, the stop model is skipped — cutting CPU load and heat.
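The activation rule above reduces to a single predicate. The sketch below uses hypothetical names (`VoiceState`, `should_run_stop_model`); the three active states mirror the ones listed in the text.

```python
from enum import Enum


class VoiceState(Enum):
    IDLE = "idle"
    LISTENING = "listening"
    PROCESSING = "processing"
    RESPONDING = "responding"


ACTIVE_STATES = {VoiceState.LISTENING, VoiceState.PROCESSING, VoiceState.RESPONDING}


def should_run_stop_model(state, alarm_ringing):
    """Run stop-word inference only during an alarm or an active voice session."""
    return alarm_ringing or state in ACTIVE_STATES


idle_standby = should_run_stop_model(VoiceState.IDLE, alarm_ringing=False)
alarm_only = should_run_stop_model(VoiceState.IDLE, alarm_ringing=True)
in_session = should_run_stop_model(VoiceState.RESPONDING, alarm_ringing=False)
```

Checking this predicate before each audio chunk is enough to skip the stop model entirely during idle standby.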
Ava Pro is based on brownard/Ava. This section covers only the wake word engine differences.
| Dimension | brownard/Ava (original) | Ava Pro (knoop7/Ava) |
|---|---|---|
| Engine count | 1 (microWakeWord) | 2 (microWakeWord + vsWakeWord) |
| microWakeWord engine | TFLite binary classification, sliding window threshold | Same engine, same 9 built-in models |
| vsWakeWord engine | Not available | ONNX CTC phoneme decoding + edit distance + multi-layer false wake gates |
| Built-in models | 9 micro (.tflite) | 9 micro (.tflite) + 3 vs (.ort) |
| Model format (micro) | .tflite + .json | .tflite + .json (identical, V2/V3 compatible) |
| Model format (vs) | N/A | .ort (ONNX) + .json manifest with CTC phoneme targets |
| Custom model loading | DocumentTreeWakeWordProvider (SAF folder picker in Settings) | In-app import (Wake Word Library) + APK assets injection — see Custom Wake Words |
| False wake defense | Threshold only (single layer) | Threshold (micro) or multi-layer gates (vs) |
| Inference runtime | TensorFlow Lite | TensorFlow Lite + ONNX Runtime (reduced build, CPU EP only) |
| CPU / memory footprint | Minimal | Minimal (micro) or significantly higher (vs) |
The microWakeWord engine itself is identical between both apps — same models, same inference path. The speed difference comes from what happens around the engine, not inside it.
1. Stop word detection is no longer always-on.
The original runs the stop-word model continuously alongside the wake-word model: two models run inference on every audio chunk, around the clock, even when nothing is happening. Ava Pro (since 0.5.2) activates the stop-word model only when it is useful: while a timer alarm is ringing or a voice session is in progress (Listening / Processing / Responding). During idle standby the stop model is skipped entirely, which removes one of the two continuously running models for the vast majority of the time the device spends waiting.
2. vsWakeWord skips inference during silence.
When using the vsWakeWord engine, the energy gate completely bypasses ONNX inference when the room is quiet. The ONNX model only wakes up when the gate detects voice-like audio. On a quiet device sitting in a hallway, this means the ONNX engine effectively sleeps most of the time, while still catching the wake word the moment someone speaks.
3. Buffered audio replay on gate open.
When the energy gate transitions from closed to open (someone starts speaking), Ava Pro replays the last ~1 second of buffered audio through the ONNX model in a single batch. This means the beginning of the wake word — which happened while the gate was still deciding — is not lost. The user experiences instant wake detection without waiting for the model to warm up from silence.
4. Incremental feature extraction.
vsWakeWord extracts log-Mel features from audio. Instead of recomputing the full feature window every chunk, Ava Pro shifts the existing feature buffer and only computes the new frames. On a 1300ms window with 80ms chunks, this means computing ~6 new frames instead of ~128 every cycle — a 20x reduction in FFT work.
5. Adaptive gain normalization.
vsWakeWord applies a smooth adaptive gain to normalize voice volume before inference. This means the model sees consistently-leveled audio regardless of distance or microphone sensitivity. Consistent input means the CTC confidence scores are more stable, which means the confirmation gates reach their threshold in fewer attempts — faster triggering with fewer false rejects.
6. ONNX Runtime is a stripped build.
The ONNX Runtime shipped with Ava Pro is a custom reduced build — only the CPU execution provider, no GPU/NNAPI delegates, no training APIs. This makes model loading and session creation faster, and the native library is smaller to load into memory. The tradeoff is no hardware acceleration, but for wake-word-sized models the CPU path is already fast enough and avoids the latency and compatibility issues of GPU/NNAPI on fragmented Android devices.
Net effect: On a typical device in idle standby, Ava Pro's wake-word CPU usage is lower than the original because stop-word inference is skipped. When someone speaks, vsWakeWord's energy gate + buffered replay + incremental features make detection feel instant despite the heavier model. The microWakeWord engine path matches the original's speed; the vsWakeWord path trades higher peak CPU for smarter gating and faster perceived response.