Release v0.6.0 — Audio multimodal (Gemma 4 E2B speech understanding) · john-rocky/CoreML-LLM

🎤 Audio multimodal — Gemma 4 E2B can hear

Record on the phone, send, and get a natural-language text answer. The
full pipeline runs on-device.

PCM (16 kHz) → Swift mel spectrogram → audio.mlmodelc (12-layer Conformer, ANE)
            → 1536-dim audio tokens → feature-injected into Gemma 4 decoder
            → streaming text output

Encoder: Conformer 12 layers, INT4-palettized, ANE-resident.
Projection: output_proj + RMSNorm + embed_proj are now fused into the
encoder graph, so no Swift-side GEMM runs at inference. Older 1024-dim
encoders still work via the Swift projection fallback.
Recorder: AudioRecorder (AVAudioEngine, mono 16 kHz, max duration
synced from the model's mel_frames).
UI: mic button in CoreMLLLMChat gated on supportsAudio. Empty
prompt + audio auto-fills "What do you hear in this audio?".
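
As a rough illustration of how the recorder limit and the projection fallback could follow from the model's shapes, here is a minimal Python sketch. The 10 ms hop (160 samples at 16 kHz) and both function names are assumptions made for illustration and are not taken from the library; only the 16 kHz sample rate and the 1536 / 1024 feature widths come from these notes.

```python
SAMPLE_RATE_HZ = 16_000      # PCM rate from the pipeline above
DECODER_AUDIO_DIM = 1536     # audio token width injected into the decoder
LEGACY_ENCODER_DIM = 1024    # width emitted by older encoders
ASSUMED_HOP_SAMPLES = 160    # ASSUMPTION: 10 ms hop, not stated in these notes


def max_record_seconds(mel_frames: int, hop_samples: int = ASSUMED_HOP_SAMPLES) -> float:
    """Longest recording that fits the encoder's fixed mel-frame window."""
    return mel_frames * hop_samples / SAMPLE_RATE_HZ


def needs_projection_fallback(encoder_out_dim: int) -> bool:
    """True when the encoder output still has to be projected on the host."""
    if encoder_out_dim == DECODER_AUDIO_DIM:
        return False  # projection is already fused into the encoder graph
    if encoder_out_dim == LEGACY_ENCODER_DIM:
        return True   # older encoder: apply the host-side projection
    raise ValueError(f"unexpected encoder output width: {encoder_out_dim}")


print(max_record_seconds(1000))  # 10.0 under the assumed hop
```

Under that assumed hop, a 1000-frame encoder window would correspond to roughly 10 s of audio; a different hop length changes the limit proportionally.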

Structure

ModelDownloader is now in the CoreMLLLM package (was in the example
app). Single source of truth — other apps linking the library get it too.
4xx handling: hard-fails on required files (weights, model.mil,
coremldata.bin) but tolerates 404s on metadata.json and
analytics/coremldata.bin, so a slightly incomplete HF upload doesn't
abort the download.
Parallel downloads (4 connections) for large model files.
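
The 4xx policy above can be sketched as a small classification function. This is a language-agnostic illustration in Python, not the Swift ModelDownloader code: the `Action` enum, the function names, and the example paths are hypothetical, and the sketch treats every non-2xx response on a required file as fatal (retries are not modeled).

```python
from enum import Enum


class Action(Enum):
    OK = "ok"      # file fetched
    SKIP = "skip"  # optional file missing upstream, carry on
    FAIL = "fail"  # abort the whole model download


# Per the notes, a 404 on these two files is tolerated.
OPTIONAL_SUFFIXES = ("metadata.json", "analytics/coremldata.bin")


def is_optional(remote_path: str) -> bool:
    """Match on the path suffix so any .mlmodelc folder is covered."""
    return remote_path.endswith(OPTIONAL_SUFFIXES)


def classify(remote_path: str, status: int) -> Action:
    """Decide what a download loop should do with one HTTP response."""
    if 200 <= status < 300:
        return Action.OK
    if status == 404 and is_optional(remote_path):
        return Action.SKIP
    return Action.FAIL


# Hypothetical paths, for illustration only.
print(classify("audio.mlmodelc/metadata.json", 404).value)  # skip
print(classify("audio.mlmodelc/model.mil", 404).value)      # fail
```

Matching on `analytics/coremldata.bin` rather than the bare file name matters here, because the top-level `coremldata.bin` is one of the required files.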

Other additions (experimental, not wired to the UI)

These ship as scaffolds in the library so branches can build on them.
Nothing is on by default.

MirrorSpeculativeLoop.swift — parallel NPU+GPU speculative decoding
(Apple MLR 2026 paper).
SpeculativeLoop.swift — EAGLE-3 draft / fusion / verify wiring.
PrefixKVCache.swift — persistent prefix KV cache for fast TTFT on
cached system prompts.
ComputePreferenceLoader.swift — sidecar loader for dual ANE/GPU
mlpackage variants.
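
For readers unfamiliar with the prefix KV cache idea, the following Python sketch shows the general technique: key a saved KV snapshot by a digest of the token prefix, then reuse the longest cached prefix of a new prompt so that only the remaining tokens need prefill. It is a generic illustration under assumed names (`PrefixCache`, `prefix_key`), keeps everything in memory, and does not describe how PrefixKVCache.swift is implemented.

```python
import hashlib
from typing import Dict, Optional, Sequence, Set, Tuple


def prefix_key(token_ids: Sequence[int]) -> str:
    """Stable digest of a token prefix, usable as a dict key or file name."""
    data = ",".join(str(t) for t in token_ids).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


class PrefixCache:
    """Maps a token prefix to an opaque KV snapshot (plain bytes here)."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}
        self._lengths: Set[int] = set()

    def put(self, token_ids: Sequence[int], kv_snapshot: bytes) -> None:
        self._store[prefix_key(token_ids)] = kv_snapshot
        self._lengths.add(len(token_ids))

    def longest_match(self, prompt: Sequence[int]) -> Tuple[int, Optional[bytes]]:
        """Return (matched_length, snapshot) for the longest cached prefix."""
        for n in sorted(self._lengths, reverse=True):
            if n <= len(prompt):
                hit = self._store.get(prefix_key(prompt[:n]))
                if hit is not None:
                    return n, hit
        return 0, None


cache = PrefixCache()
system_prompt = [2, 106, 9731]           # hypothetical token ids
cache.put(system_prompt, b"kv-bytes")    # placeholder snapshot
matched, _ = cache.longest_match(system_prompt + [17, 42])
print(matched)  # 3: only the last two tokens would still need prefill
```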

Docs

docs/AUDIO.md — audio pipeline design + conversion notes.
docs/SPEED_8K.md — 8K-context roadmap (W8A8, DuoAttention, TriForce,
Mirror speculative).
docs/UNEXPLORED_APPROACHES.md — six unexplored directions (A/B/C/D/E/F)
with effort estimates.
docs/ALTERNATIVE_APPROACHES.md — outside-Gemma-4 options (distillation,
Turbo SKU) and why they're out of current scope.
docs/EAGLE3_INTEGRATION_STATE.md — EAGLE-3 Phase 2A/2B/3 status,
including on-device bench findings.
docs/POST_BENCH_PRIORITIES.md — performance-first priority ordering
after the on-device EAGLE-3 bench.

Bench status (iPhone 17 Pro, Gemma 4 E2B)

Metric	Value @ 2K context
Baseline decode	28–31 tok/s
Prefill (estimated)	~154 tok/s
TTFT, 2K prompt	~13 s (improvement targeted for next release)

Audio adds a one-shot encoder inference of ~100 ms (1000 mel frames) on
top of the existing decode budget.
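
The figures above are mutually consistent, which a quick back-of-the-envelope calculation shows. The sketch below assumes "2K" means 2048 tokens and that TTFT is dominated by prefill; both are assumptions, and the function names are illustrative only.

```python
PROMPT_TOKENS = 2048         # ASSUMPTION: "2K" taken as 2048 tokens
PREFILL_TOK_PER_S = 154      # estimated prefill rate from the table
AUDIO_ENCODER_S = 0.100      # one-shot audio encoder cost (~100 ms)


def ttft_seconds(prompt_tokens: int, prefill_rate: float, audio: bool = False) -> float:
    """Time to first token, assuming prefill dominates."""
    t = prompt_tokens / prefill_rate
    return t + (AUDIO_ENCODER_S if audio else 0.0)


def decode_seconds(new_tokens: int, decode_rate: float) -> float:
    """Time to stream new_tokens at a steady decode rate."""
    return new_tokens / decode_rate


print(round(ttft_seconds(PROMPT_TOKENS, PREFILL_TOK_PER_S), 1))              # 13.3
print(round(ttft_seconds(PROMPT_TOKENS, PREFILL_TOK_PER_S, audio=True), 1))  # 13.4
print(round(decode_seconds(100, 28), 2))                                     # 3.57
```

On these numbers the audio encoder adds well under 1% to the TTFT of a 2K prompt, so prefill remains the cost to attack.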

Upgrade notes

If you have an existing Gemma 4 E2B model downloaded and the audio button
doesn't appear after upgrading:

The old downloader didn't fetch mel_filterbank.bin,
output_proj_weight.npy, output_proj_bias.npy, or embed_proj_weight.npy.
Delete the model from the app and re-download, or use devicectl device copy
to push the four files into Documents/Models/gemma4-e2b/.

The output_proj_* / embed_proj_* .npy files are optional on the shipped
encoder (the projection is fused inside the graph), but mel_filterbank.bin
is required.
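
To check a local copy of the model folder before pushing files to the device, something like the following Python sketch could be used. The function name and the report layout are hypothetical, and it assumes the four files sit directly in the model directory, as the path above suggests.

```python
from pathlib import Path
from typing import Dict, List

# Required for audio per the notes above.
REQUIRED = ["mel_filterbank.bin"]
# Only needed by older encoders that rely on the projection fallback.
OPTIONAL = [
    "output_proj_weight.npy",
    "output_proj_bias.npy",
    "embed_proj_weight.npy",
]


def missing_audio_files(model_dir: Path) -> Dict[str, List[str]]:
    """List the audio sidecar files that are absent from model_dir."""
    return {
        "required": [f for f in REQUIRED if not (model_dir / f).is_file()],
        "optional": [f for f in OPTIONAL if not (model_dir / f).is_file()],
    }


report = missing_audio_files(Path("Documents/Models/gemma4-e2b"))
if report["required"]:
    print("re-download or push:", report["required"])
```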

Commits since v0.5.1

60 commits. Highlights:

16dc23e — audio projection: skip Swift gemm when model already emits
1536-dim features
b803de1 — ModelDownloader tolerates 404 on optional mlmodelc files
3a0c9a8 — hub-manifest fix
1fe6890 / 75fb268 — EAGLE-3 scaffolding
5410de7 — audio support initial commit
a9c9b7d — MQA + QLoRA recovery
b0779c7 — 8K speed research doc

Full list: v0.5.1...v0.6.0
