feat(asr): add Apple Speech provider (macOS 26+) #40

Merged: missuo merged 5 commits into missuo:main from erning:feature/apple-speech on Apr 6, 2026
@erning commented Mar 31, 2026

Collaborator

Summary

Add Apple Speech as a new on-device ASR provider using Apple's SpeechAnalyzer and SpeechTranscriber APIs (macOS 26+). Zero-config, zero-download speech recognition with system-managed language assets.

  • New KoeAppleSpeech Swift package bridges Apple's Speech framework to Rust via C FFI
  • Audio flows as PCM16 → AsyncStream<AnalyzerInput> → SpeechAnalyzer → progressive transcription
  • Results accumulated using Apple's official result.isFinal model: finalized segments build a stable prefix, volatile segments show in-progress recognition — no string-overlap heuristics
  • Dictionary entries passed as contextualStrings for vocabulary bias
  • Speech assets managed by macOS via AssetInventory (download/release/status through FFI)
  • Setup Wizard: language picker, asset status indicator, download/release buttons
  • Speech Recognition permission: requested at startup, checked defensively at session start, shown in menu bar

Implementation

| Layer | Detail |
| --- | --- |
| Swift (KoeAppleSpeech) | AppleSpeechManager: session lifecycle, audio bridging, finalizedTranscript + volatileTranscript accumulation. CBridge: @_cdecl FFI for session control and asset management |
| Rust (koe-asr) | AppleSpeechProvider: FFI wrapper, PCM routing, tokio mpsc event channel |
| Rust (koe-core) | Provider creation from asr.provider = "apple-speech" config, locale and dictionary wiring |
| Obj-C (KoeApp) | Setup Wizard UI (locale popup, asset status, download/release), permission flow, status bar permission item |
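
As a rough illustration of the koe-core row, the sketch below shows how a provider name, a locale, and the dictionary entries might be wired together. Only the "apple-speech" provider string, the AppleSpeechAsrConfig name, its locale field, and the zh-Hans default come from this PR; ProviderSelection, select_provider, and the overall shape are assumptions made for the example.

```rust
/// Hypothetical mirror of the PR's AppleSpeechAsrConfig. Only the locale
/// field and its zh-Hans default are taken from the PR description.
#[derive(Debug, Clone, PartialEq)]
pub struct AppleSpeechAsrConfig {
    pub locale: String,
}

impl Default for AppleSpeechAsrConfig {
    fn default() -> Self {
        Self { locale: "zh-Hans".to_owned() }
    }
}

/// Hypothetical result of provider dispatch (not the real koe-core type).
#[derive(Debug, PartialEq)]
pub enum ProviderSelection {
    AppleSpeech { locale: String, contextual_strings: Vec<String> },
    Other(String),
}

/// Picks a provider from the configured name and wires in the locale and
/// the user dictionary (passed on as contextual strings).
pub fn select_provider(
    provider: &str,
    apple: Option<AppleSpeechAsrConfig>,
    dictionary: &[String],
) -> ProviderSelection {
    match provider {
        "apple-speech" => {
            // Fall back to the default locale when no section is configured.
            let cfg = apple.unwrap_or_default();
            ProviderSelection::AppleSpeech {
                locale: cfg.locale,
                contextual_strings: dictionary.to_vec(),
            }
        }
        other => ProviderSelection::Other(other.to_owned()),
    }
}
```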

Runtime requirements

  • Minimum deployment target: macOS 14.0 (unchanged)
  • Apple Speech requires: macOS 26.0+ — all code paths gated with @available(macOS 26.0, *); invisible on older systems
  • New permission: NSSpeechRecognitionUsageDescription in Info.plist (only needed for Apple Speech)
  • Feature flag: apple-speech (enabled by default, excludable with --no-default-features)
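
The compile-time half of this gating can be sketched in Rust as follows. This is an assumption-laden illustration, not the koe-asr source: create_provider and its return type are hypothetical, and the runtime macOS 26 check (done with @available on the Swift side) is not modelled here.

```rust
/// Hypothetical provider factory. The "apple-speech" arm only exists when
/// the crate is built with that Cargo feature, so a build made with
/// --no-default-features falls through to the error arm.
pub fn create_provider(name: &str) -> Result<&'static str, String> {
    match name {
        #[cfg(feature = "apple-speech")]
        "apple-speech" => Ok("apple-speech"),
        other => Err(format!("provider '{other}' is not available in this build")),
    }
}
```

With the feature enabled, `create_provider("apple-speech")` succeeds; without it, the same call returns the "not available" error instead of failing to link against the Swift bridge.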

App bundle changes

  • Size increase: ~100–200 KB (compiled Swift, no embedded models)
  • No bundled models: speech assets are system-managed, downloaded on-demand
  • New framework: Speech.framework (system framework, always present)
  • Zero new third-party dependencies

Test plan

  • Build succeeds on macOS 26 (Apple Silicon and Intel targets)
  • Build succeeds on macOS < 26 (code compiles, runtime-gated)
  • Setup Wizard: Apple Speech option only appears on macOS 26+
  • Setup Wizard: language picker populates, saved locale restores correctly
  • Setup Wizard: asset download/release work, status updates
  • Speech Recognition permission requested at startup when apple-speech configured
  • Menu bar shows Speech Recognition permission status when applicable
  • Short utterance recognized and pasted correctly
  • Long utterance (30s+): finalized segments accumulate, no truncation
  • Mixed Chinese/English transcription works
  • Provider switching (doubao → apple-speech → mlx) works without restart
  • Other providers (Doubao, Qwen, MLX, sherpa-onnx) unaffected

erning added 4 commits April 1, 2026 12:47

  1. Add KoeAppleSpeech Swift package with SpeechAnalyzer + SpeechTranscriber
     (macOS 26+) for zero-config on-device speech recognition. Audio fed as
     Int16 PCM via AsyncStream bridge. Dictionary entries passed as contextual
     strings. Asset status check and auto-install before session start.

     Rust side: AppleSpeechProvider implements AsrProvider trait with FFI
     bridge following the same pattern as KoeMLX. Feature-gated behind
     apple-speech flag in koe-asr.

     Swift FFI includes session management (start/feed/stop/cancel) and
     asset management (is_available/asset_status/install_asset/release_asset/
     supported_locales) for Setup Wizard integration.

  2. Add apple-speech feature flag, AppleSpeechAsrConfig (locale, default
     zh-Hans), provider dispatch match arm, and DEFAULT_CONFIG_YAML section.
     Dictionary entries passed as contextual strings for vocabulary bias.

  3. Add KoeAppleSpeech package reference and Speech framework to both Koe
     and Koe-x86 targets. Add NSSpeechRecognitionUsageDescription to
     Info.plist. Add speech recognition permission check/request methods
     to SPPermissionManager.

  4. Add Apple Speech (On-Device) to ASR provider popup with @available
     guard. Dynamic locale list from SpeechTranscriber.supportedLocales,
     sorted by localized display name. Asset status display with download
     button (auto-download on Save if not installed). Locale picker replaces
     model UI when selected. Status shows "Installed — managed by macOS"
     with secondary color hint.

@erning force-pushed the feature/apple-speech branch from bc767fa to 5a53176 on April 1, 2026 04:47

DESIGN.md: add section 30.5 (architecture, audio flow, availability,
locale handling), update provider lists, feature flags, setup wizard,
permissions, and summary.

README.md: add to provider list, config example, permissions table,
architecture diagram, ASR pipeline, and Local ASR section.

@erning force-pushed the feature/apple-speech branch from 6cda2b3 to c94c598 on April 1, 2026 08:44

@missuo left a comment

Owner

Review: PR #40 — feat(asr): add Apple Speech provider (macOS 26+)

Overall: Well-structured, follows existing patterns (KoeMLX). A few concerns:

Architecture

  • Good: Follows the same FFI pattern as KoeMLX (Swift → C → Rust), generation-based session management, callback locking
  • Good: @available(macOS 26.0, *) gating throughout, invisible on older systems
  • Good: Asset management FFI for Setup Wizard integration

Concerns

  1. Semaphore + Task pattern in CBridge.swift: _supportedLocalesImpl, _assetStatusImpl, and _releaseAssetImpl all use DispatchSemaphore.wait() to block the calling thread while spawning a Task. If called from the main thread, this deadlocks whenever the async work needs main-thread access. Consider documenting that these must be called from a background thread, or use a different synchronization pattern.

  2. SFSpeechRecognizer.requestAuthorization with semaphore in koeAppleSpeechStartSession — same deadlock risk if called from main thread.

  3. Memory safety of event_tx_ptr — The leaked Box<Sender> pattern works but is fragile. If connect() is called twice without close(), the first sender leaks. The reclaim_sender in close() and Drop helps, but connect() should call reclaim_sender() at the top to handle re-connection.

  4. audioFormat force-unwrap: AVAudioFormat(...)! in AppleSpeechManager will crash if the format is unsupported. Unlikely for 16kHz mono Int16, but a guard with an error callback would be safer.

  5. Missing Definite events — The provider only emits Interim (type 0) and Final (type 2). The result.isFinal is used to accumulate finalized segments internally, but no Definite (type 1) events are sent upstream. This means the core's TranscriptAggregator won't receive definite confirmations. Is this intentional?
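
To make concern 3 concrete, here is a minimal sketch of the suggested fix: calling reclaim_sender() at the top of connect() so that a second connect() without close() frees the earlier boxed sender. The names event_tx_ptr, connect, close, and reclaim_sender come from the review; everything else (the Provider struct, String as the event type, std::sync::mpsc standing in for the tokio channel used in the PR) is an assumption for the sake of a self-contained example.

```rust
use std::ptr;
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical stand-in for the provider that hands a raw sender pointer
/// to the FFI callback context.
pub struct Provider {
    event_tx_ptr: *mut Sender<String>,
}

impl Provider {
    pub fn new() -> Self {
        Self { event_tx_ptr: ptr::null_mut() }
    }

    /// Takes back ownership of the leaked sender, if any, and drops it.
    fn reclaim_sender(&mut self) {
        if !self.event_tx_ptr.is_null() {
            // SAFETY: the pointer was produced by Box::into_raw in connect()
            // and is nulled immediately afterwards, so it is freed exactly
            // once. Real code must also ensure the foreign side has stopped
            // using the pointer before this runs.
            unsafe { drop(Box::from_raw(self.event_tx_ptr)) };
            self.event_tx_ptr = ptr::null_mut();
        }
    }

    pub fn connect(&mut self) -> Receiver<String> {
        // Handle re-connection: free the sender from a previous connect()
        // that was never followed by close().
        self.reclaim_sender();
        let (tx, rx) = channel();
        self.event_tx_ptr = Box::into_raw(Box::new(tx));
        rx
    }

    pub fn close(&mut self) {
        self.reclaim_sender();
    }
}

impl Drop for Provider {
    fn drop(&mut self) {
        self.reclaim_sender();
    }
}
```

In this sketch a second connect() disconnects the first receiver instead of leaking its sender, which is the behaviour the review asks for.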

Minor

  • .build/ gitignore addition is good (SPM build directory)
  • apple-speech feature flag name is consistent with sherpa-onnx convention

@missuo merged commit f9e3a86 into missuo:main on Apr 6, 2026
@erning deleted the feature/apple-speech branch on April 7, 2026 02:06