v0.6.0 — Audio multimodal (Gemma 4 E2B speech understanding)

@john-rocky john-rocky released this 12 Apr 21:23
· 280 commits to main since this release

🎤 Audio multimodal — Gemma 4 E2B can hear

Record on the phone, send, and get a streamed text answer about what the
model heard. The full pipeline runs on-device.

PCM (16 kHz) → Swift mel spectrogram → audio.mlmodelc (12-layer Conformer, ANE)
            → 1536-dim audio tokens → feature-injected into Gemma 4 decoder
            → streaming text output
  • Encoder: Conformer 12 layers, INT4-palettized, ANE-resident.
  • Projection: output_proj + RMSNorm + embed_proj are now fused into the
    encoder graph, so inference needs no Swift-side GEMM. Older 1024-dim
    encoders still work via the Swift projection fallback.
  • Recorder: AudioRecorder (AVAudioEngine, mono 16 kHz, max duration
    synced from the model's mel_frames).
  • UI: mic button in CoreMLLLMChat, gated on supportsAudio. Sending audio
    with an empty prompt auto-fills "What do you hear in this audio?".
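
The bullets above describe three small decisions: which projection path to take, how long a recording may be, and what prompt to send with audio only. A minimal sketch of that logic follows. The type and function names are hypothetical and not part of the CoreMLLLM API. The 160-sample hop (10 ms at 16 kHz) is an assumption, since the notes only say the limit comes from mel_frames, and trimming whitespace before the empty-prompt check is also an assumption.

```swift
import Foundation

// Feature widths quoted in the release notes.
let fusedFeatureDim = 1536   // encoder with the projection fused into the graph
let legacyFeatureDim = 1024  // older encoder that still needs the Swift projection

// Hypothetical type for illustration only.
enum AudioProjectionPath: Equatable {
    case fusedInGraph   // encoder output can be injected as-is
    case swiftFallback  // apply output_proj / RMSNorm / embed_proj in Swift
}

// Pick the projection path from the encoder's output width.
// Returns nil for a width this sketch does not know about.
func projectionPath(forEncoderOutputDim dim: Int) -> AudioProjectionPath? {
    if dim == fusedFeatureDim { return .fusedInGraph }
    if dim == legacyFeatureDim { return .swiftFallback }
    return nil
}

// Longest recording that fits the encoder input.
// ASSUMPTION: one mel frame per 160 samples (10 ms hop at 16 kHz).
func maxRecordingSeconds(melFrames: Int,
                         hopSamples: Int = 160,
                         sampleRate: Int = 16_000) -> Double {
    return Double(melFrames * hopSamples) / Double(sampleRate)
}

// Substitute the default question when audio is sent without any text.
// ASSUMPTION: a whitespace-only prompt counts as empty.
func effectivePrompt(_ text: String, hasAudio: Bool) -> String {
    let isEmpty = text.trimmingCharacters(in: .whitespacesAndNewlines).isEmpty
    return (isEmpty && hasAudio) ? "What do you hear in this audio?" : text
}
```

Under the assumed hop, the 1000 mel frames mentioned in the bench section would correspond to a 10-second recording limit. A different hop length in the shipped model would change that figure.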

Structure

  • ModelDownloader is now in the CoreMLLLM package (was in the example
    app). Single source of truth — other apps linking the library get it too.
  • 4xx handling: hard-fails on required files (weights/model.mil/
    coremldata.bin); tolerates metadata.json and analytics/coremldata.bin
    404s so a slightly incomplete HF upload doesn't abort the download.
  • Parallel downloads (4 connections) for large model files.
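
The 4xx policy above can be sketched as a small classifier. This is an illustration under stated assumptions, not the actual ModelDownloader code: the names are hypothetical, the function is assumed to be called only for 4xx responses, and any file not on the tolerated list is treated as required. The tolerated check has to look at the analytics/ path rather than the bare file name, because the required coremldata.bin shares that name.

```swift
import Foundation

// Hypothetical result type for illustration only.
enum DownloadFailureAction: Equatable {
    case abort  // required file missing: fail the whole download
    case skip   // optional file missing: carry on without it
}

// True for the two files the notes say may 404 without aborting.
func isTolerated404(_ relativePath: String) -> Bool {
    let fileName = relativePath.split(separator: "/").last.map(String.init) ?? ""
    return fileName == "metadata.json"
        || relativePath.hasSuffix("analytics/coremldata.bin")
}

// Decide what to do after a 4xx response for one file.
// ASSUMPTION: only a 404 is tolerated; other 4xx codes always abort.
func action(forPath relativePath: String, httpStatus: Int) -> DownloadFailureAction {
    if httpStatus == 404 && isTolerated404(relativePath) {
        return .skip
    }
    return .abort
}
```

For example, `action(forPath: "audio.mlmodelc/analytics/coremldata.bin", httpStatus: 404)` yields `.skip`, while the same status for `audio.mlmodelc/coremldata.bin` yields `.abort`.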

Other additions (experimental, not wired to the UI)

These ship as scaffolds in the library so branches can build on them.
Nothing is on by default.

  • MirrorSpeculativeLoop.swift — parallel NPU+GPU speculative decoding
    (Apple MLR 2026 paper).
  • SpeculativeLoop.swift — EAGLE-3 draft / fusion / verify wiring.
  • PrefixKVCache.swift — persistent prefix KV cache for fast TTFT on
    cached system prompts.
  • ComputePreferenceLoader.swift — sidecar loader for dual ANE/GPU
    mlpackage variants.

Docs

  • docs/AUDIO.md — audio pipeline design + conversion notes.
  • docs/SPEED_8K.md — 8K-context roadmap (W8A8, DuoAttention, TriForce,
    Mirror speculative).
  • docs/UNEXPLORED_APPROACHES.md — six unexplored directions (A/B/C/D/E/F)
    with effort estimates.
  • docs/ALTERNATIVE_APPROACHES.md — outside-Gemma-4 options (distillation,
    Turbo SKU) and why they're out of current scope.
  • docs/EAGLE3_INTEGRATION_STATE.md — EAGLE-3 Phase 2A/2B/3 status,
    including on-device bench findings.
  • docs/POST_BENCH_PRIORITIES.md — performance-first priority ordering
    after the on-device EAGLE-3 bench.

Bench status (iPhone 17 Pro, Gemma 4 E2B)

Path                 tok/s @ 2K
Baseline decode      28–31
Prefill (estimated)  ~154
TTFT, 2K prompt      ~13 s (target next release)

Audio adds a one-shot encoder inference of ~100 ms (1000 mel frames) on top
of the existing decode budget.
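
The prefill and TTFT figures look consistent with each other, as a quick back-of-the-envelope check suggests. The sketch assumes that "2K" means 2048 tokens and that time to first token is dominated by prefill. Neither assumption is stated in the notes.

```swift
// Figures from the bench table; the token count is an ASSUMPTION (2K = 2048).
let promptTokens = 2048.0
let prefillTokensPerSecond = 154.0   // "Prefill (estimated)" row
let audioEncoderSeconds = 0.100      // one-shot encoder cost quoted above

// Prefill-dominated estimate of time to first token, about 13.3 s.
let estimatedTTFT = promptTokens / prefillTokensPerSecond

// With audio attached, the encoder adds roughly a tenth of a second on top.
let estimatedTTFTWithAudio = estimatedTTFT + audioEncoderSeconds
```

On these numbers the encoder accounts for under 1% of the time to first token on a 2K prompt, so prefill speed is the larger lever.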

Upgrade notes

If you have an existing Gemma 4 E2B model downloaded and the audio button
doesn't appear after upgrading:

  • The old downloader didn't fetch mel_filterbank.bin,
    output_proj_weight.npy, output_proj_bias.npy, embed_proj_weight.npy.
  • Delete the model from the app and re-download, or use
    devicectl device copy to push the four files into
    Documents/Models/gemma4-e2b/.

The output_proj_* / embed_proj_* .npy files are optional with the shipped
encoder (the projection is fused inside the graph), but mel_filterbank.bin
is required.
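
To see which of these files an existing install lacks, a check along the following lines could be run against the model directory. It is a sketch: the helper name is hypothetical, and the notes do not say that the app decides whether to show the audio button this way.

```swift
import Foundation

// File names taken from the upgrade notes above.
let requiredAudioFiles = ["mel_filterbank.bin"]
let optionalAudioFiles = ["output_proj_weight.npy",
                          "output_proj_bias.npy",
                          "embed_proj_weight.npy"]

// Return the names from `names` that do not exist inside `modelDir`.
func missingFiles(_ names: [String], in modelDir: URL) -> [String] {
    return names.filter { name in
        let path = modelDir.appendingPathComponent(name).path
        return !FileManager.default.fileExists(atPath: path)
    }
}
```

Calling it with a URL for Documents/Models/gemma4-e2b/ and `requiredAudioFiles` returns an empty array when mel_filterbank.bin is present. A non-empty result for `optionalAudioFiles` only matters for the older 1024-dim encoders.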

Commits since v0.5.1

60 commits. Highlights:

  • 16dc23e — audio projection: skip Swift gemm when model already emits
    1536-dim features
  • b803de1 — ModelDownloader tolerates 404 on optional mlmodelc files
  • 3a0c9a8 — hub-manifest fix
  • 1fe6890 / 75fb268 — EAGLE-3 scaffolding
  • 5410de7 — audio support initial commit
  • a9c9b7d — MQA + QLoRA recovery
  • b0779c7 — 8K speed research doc

Full list: v0.5.1...v0.6.0