v0.6.0 — Audio multimodal (Gemma 4 E2B speech understanding)
🎤 Audio multimodal — Gemma 4 E2B can hear
Record on the phone, send, get a spoken-language answer. Full pipeline runs
on-device.
PCM (16 kHz) → Swift mel spectrogram → audio.mlmodelc (12-layer Conformer, ANE)
→ 1536-dim audio tokens → feature-injected into Gemma 4 decoder
→ streaming text output
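To make the front of the pipeline concrete, here is a minimal sketch of how a fixed mel-frame budget maps to recording time. Only the 16 kHz sample rate comes from the notes above. The 160-sample (10 ms) hop and the `MelFraming` type are assumptions for illustration, not the library's API.

```swift
import Foundation

/// Hypothetical framing parameters for the mel front end.
struct MelFraming {
    var sampleRate: Double = 16_000   // from the pipeline description
    var hopSamples: Int = 160         // ASSUMPTION: 10 ms hop, not stated in the notes

    /// Number of whole hops that fit in `sampleCount` PCM samples.
    func frameCount(forSampleCount sampleCount: Int) -> Int {
        return sampleCount / hopSamples
    }

    /// Longest recording, in seconds, that fits in a fixed mel-frame budget.
    func maxDuration(forMelFrames melFrames: Int) -> Double {
        return Double(melFrames * hopSamples) / sampleRate
    }
}

let framing = MelFraming()
// With the assumed hop, a 1000-frame encoder input covers this many seconds.
print(framing.maxDuration(forMelFrames: 1000))
```

If the real hop differs, only `hopSamples` changes; the relationship between frame budget and maximum duration stays the same.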
- Encoder: Conformer 12 layers, INT4-palettized, ANE-resident.
- Projection: `output_proj` + RMSNorm + `embed_proj` are now fused into the
encoder graph, so there is no Swift-side gemm at inference. Older 1024-dim
encoders still work via the Swift projection fallback.
- Recorder: `AudioRecorder` (AVAudioEngine, mono 16 kHz, max duration
synced from the model's `mel_frames`).
- UI: mic button in `CoreMLLLMChat`, gated on `supportsAudio`. Empty
prompt + audio auto-fills "What do you hear in this audio?".
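The projection bullet describes a width-based dispatch: 1536-dim encoder output goes straight to the decoder, anything else takes the Swift fallback. A rough sketch of that decision follows. `ProjectionPath`, `projectionPath`, and `project` are hypothetical names, and the fallback is reduced to a single naive linear layer, so the RMSNorm and second projection mentioned above are omitted.

```swift
import Foundation

/// Which projection path to take for one encoder output (hypothetical type).
enum ProjectionPath: Equatable {
    case fusedInGraph    // encoder already emits decoder-width features
    case swiftFallback   // older encoder; project on the Swift side
}

let decoderWidth = 1536  // from the notes

func projectionPath(encoderWidth: Int) -> ProjectionPath {
    return encoderWidth == decoderWidth ? .fusedInGraph : .swiftFallback
}

/// Naive linear layer y = W x + b, with W stored row-major as [outDim x inDim].
/// A stand-in for the fallback gemm; a real implementation would likely use Accelerate.
func project(_ x: [Float], weight: [Float], bias: [Float]) -> [Float] {
    let inDim = x.count
    let outDim = bias.count
    precondition(weight.count == inDim * outDim, "weight shape mismatch")
    var y = bias
    for o in 0..<outDim {
        var acc: Float = 0
        for i in 0..<inDim {
            acc += weight[o * inDim + i] * x[i]
        }
        y[o] += acc
    }
    return y
}

print(projectionPath(encoderWidth: 1024))
```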
Structure
- `ModelDownloader` is now in the `CoreMLLLM` package (was in the example
app). Single source of truth: other apps linking the library get it too.
- 4xx handling: hard-fails on required files (weights/model.mil/
coremldata.bin); tolerates `metadata.json` and `analytics/coremldata.bin`
404s so a slightly incomplete HF upload doesn't abort the download.
- Parallel downloads (4 connections) for large model files.
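The 4xx policy above can be sketched as a small pure function. This is an illustration under assumptions, not the `ModelDownloader` implementation: `DownloadDecision`, `decide`, and the suffix-matching rule are hypothetical, and only the two tolerated file names come from the notes.

```swift
import Foundation

/// Outcome for a single file request (hypothetical type).
enum DownloadDecision: Equatable {
    case proceed   // file fetched, or safe to skip
    case abort     // required file unavailable; fail the whole download
}

/// Files the notes say may 404 without aborting the download.
let optionalSuffixes = ["metadata.json", "analytics/coremldata.bin"]

func decide(path: String, statusCode: Int) -> DownloadDecision {
    if (200..<300).contains(statusCode) { return .proceed }
    // ASSUMPTION: optional files are recognised by path suffix.
    let isOptional = optionalSuffixes.contains { path.hasSuffix($0) }
    if statusCode == 404 && isOptional { return .proceed }
    return .abort
}

print(decide(path: "audio.mlmodelc/metadata.json", statusCode: 404))
```

Keeping the decision separate from the networking code makes the tolerated cases easy to unit-test without a server.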
Other additions (experimental, not wired to the UI)
These ship as scaffolds in the library so branches can build on them.
Nothing is on by default.
- `MirrorSpeculativeLoop.swift`: parallel NPU+GPU speculative decoding
(Apple MLR 2026 paper).
- `SpeculativeLoop.swift`: EAGLE-3 draft / fusion / verify wiring.
- `PrefixKVCache.swift`: persistent prefix KV cache for fast TTFT on
cached system prompts.
- `ComputePreferenceLoader.swift`: sidecar loader for dual ANE/GPU
mlpackage variants.
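For readers new to speculative decoding, the verify step that these scaffolds build toward can be illustrated with the simplest greedy variant. This is a generic sketch, not the contents of `SpeculativeLoop.swift`: `acceptDraft` is a hypothetical function, and EAGLE-3 style drafting is more elaborate than a single linear draft.

```swift
/// Greedy verification of one linear draft.
/// - draft: tokens proposed by the draft model.
/// - target: the target model's greedy choice at each draft position,
///   plus one extra position, so `target.count == draft.count + 1`.
/// Returns the tokens to commit after this verify pass.
func acceptDraft(draft: [Int], target: [Int]) -> [Int] {
    precondition(target.count == draft.count + 1, "target needs one extra position")
    var committed: [Int] = []
    for (i, token) in draft.enumerated() {
        if token == target[i] {
            committed.append(token)          // draft agrees with target: keep it
        } else {
            committed.append(target[i])      // first mismatch: take the target's token and stop
            return committed
        }
    }
    committed.append(target[draft.count])    // whole draft accepted: one bonus token
    return committed
}

print(acceptDraft(draft: [5, 7, 9], target: [5, 7, 2, 4]))
```

Every pass commits at least one token, so the loop never does worse than plain decoding in tokens per target-model call.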
Docs
- `docs/AUDIO.md`: audio pipeline design + conversion notes.
- `docs/SPEED_8K.md`: 8K-context roadmap (W8A8, DuoAttention, TriForce,
Mirror speculative).
- `docs/UNEXPLORED_APPROACHES.md`: six unexplored directions (A/B/C/D/E/F)
with effort estimates.
- `docs/ALTERNATIVE_APPROACHES.md`: outside-Gemma-4 options (distillation,
Turbo SKU) and why they're out of current scope.
- `docs/EAGLE3_INTEGRATION_STATE.md`: EAGLE-3 Phase 2A/2B/3 status,
including on-device bench findings.
- `docs/POST_BENCH_PRIORITIES.md`: performance-first priority ordering
after the on-device EAGLE-3 bench.
Bench status (iPhone 17 Pro, Gemma 4 E2B)
| Path | Result @ 2K |
|---|---|
| Baseline decode | 28–31 tok/s |
| Prefill (estimated) | ~154 tok/s |
| TTFT, 2K prompt | ~13 s (target next release) |
Audio adds a one-shot encoder inference of ~100 ms (1000 mel frames) on top
of the existing decode budget.
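The TTFT figure in the table is consistent with the prefill estimate, as a quick back-of-the-envelope check shows. The sketch assumes "2K" means 2048 prompt tokens and treats the audio encoder as a fixed additive cost; `estimatedTTFT` is a hypothetical helper, and the extra prompt length contributed by audio tokens is ignored.

```swift
import Foundation

/// Rough time-to-first-token: prefill time plus any one-shot encoder cost.
func estimatedTTFT(promptTokens: Int,
                   prefillTokensPerSecond: Double,
                   audioEncoderSeconds: Double = 0) -> Double {
    precondition(prefillTokensPerSecond > 0, "throughput must be positive")
    return Double(promptTokens) / prefillTokensPerSecond + audioEncoderSeconds
}

// ASSUMPTION: 2K prompt = 2048 tokens; 154 tok/s and 100 ms come from the notes.
let textOnly = estimatedTTFT(promptTokens: 2048, prefillTokensPerSecond: 154)
let withAudio = estimatedTTFT(promptTokens: 2048,
                              prefillTokensPerSecond: 154,
                              audioEncoderSeconds: 0.1)
print(String(format: "%.1f s text-only, %.1f s with audio", textOnly, withAudio))
```

Under these assumptions the encoder cost is under 1% of TTFT at 2K context, so prefill throughput dominates.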
Upgrade notes
If you have an existing Gemma 4 E2B model downloaded and the audio button
doesn't appear after upgrading:
- The old downloader didn't fetch `mel_filterbank.bin`,
`output_proj_weight.npy`, `output_proj_bias.npy`, `embed_proj_weight.npy`.
- Delete the model from the app and re-download, or use
`devicectl device copy` to push the four files into `Documents/Models/gemma4-e2b/`.
The `output_proj_*` / `embed_proj_*` `.npy` files are optional on the shipped
encoder (the projection is fused inside the graph), but `mel_filterbank.bin`
is required.
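When debugging a missing mic button, it can help to check which of these files are actually present in the model directory. The following is a minimal sketch: the file names come from the notes above, while `missingFiles` and the way the directory URL is built are assumptions, not part of the library.

```swift
import Foundation

let requiredAudioFiles = ["mel_filterbank.bin"]
let optionalAudioFiles = ["output_proj_weight.npy",
                          "output_proj_bias.npy",
                          "embed_proj_weight.npy"]

/// Returns the subset of `names` that does not exist inside `directory`.
func missingFiles(_ names: [String], in directory: URL) -> [String] {
    return names.filter { name in
        let path = directory.appendingPathComponent(name).path
        return !FileManager.default.fileExists(atPath: path)
    }
}

// ASSUMPTION: the model lives under Documents/Models/gemma4-e2b/ as in the notes.
let documents = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask).first
    ?? FileManager.default.temporaryDirectory
let modelDir = documents.appendingPathComponent("Models/gemma4-e2b")
print("missing required:", missingFiles(requiredAudioFiles, in: modelDir))
print("missing optional:", missingFiles(optionalAudioFiles, in: modelDir))
```

An empty "missing required" list means the filterbank is in place; missing optional files matter only for older encoders that rely on the Swift projection fallback.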
Commits since v0.5.1
60 commits. Highlights:
- `16dc23e`: audio projection skips the Swift gemm when the model already
emits 1536-dim features
- `b803de1`: ModelDownloader tolerates 404 on optional mlmodelc files
- `3a0c9a8`: hub-manifest fix
- `1fe6890` / `75fb268`: EAGLE-3 scaffolding
- `5410de7`: audio support initial commit
- `a9c9b7d`: MQA + QLoRA recovery
- `b0779c7`: 8K speed research doc
Full list: v0.5.1...v0.6.0