feat(asr): add Apple Speech provider (macOS 26+) #40
Add KoeAppleSpeech Swift package with SpeechAnalyzer + SpeechTranscriber (macOS 26+) for zero-config on-device speech recognition. Audio fed as Int16 PCM via AsyncStream bridge. Dictionary entries passed as contextual strings. Asset status check and auto-install before session start. Rust side: AppleSpeechProvider implements AsrProvider trait with FFI bridge following the same pattern as KoeMLX. Feature-gated behind apple-speech flag in koe-asr. Swift FFI includes session management (start/feed/stop/cancel) and asset management (is_available/asset_status/install_asset/release_asset/supported_locales) for Setup Wizard integration.
Add apple-speech feature flag, AppleSpeechAsrConfig (locale, default zh-Hans), provider dispatch match arm, and DEFAULT_CONFIG_YAML section. Dictionary entries passed as contextual strings for vocabulary bias.
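A minimal sketch of what the config default and the dispatch arm described in this commit could look like. Only `AppleSpeechAsrConfig`, the `locale` field, the `zh-Hans` default, and the `apple-speech` provider name come from the commit message; `ProviderKind`, `dispatch`, and the `contextual_strings` field are hypothetical names used for illustration, and the real arm would presumably sit behind the `apple-speech` feature flag.

```rust
/// Settings for the Apple Speech provider.
#[derive(Debug, Clone, PartialEq)]
pub struct AppleSpeechAsrConfig {
    /// Locale identifier handed to the recognizer.
    pub locale: String,
}

impl Default for AppleSpeechAsrConfig {
    fn default() -> Self {
        // The commit message states that the default locale is zh-Hans.
        Self { locale: "zh-Hans".to_string() }
    }
}

/// Hypothetical result of resolving a provider name (illustrative subset).
#[derive(Debug, PartialEq)]
pub enum ProviderKind {
    AppleSpeech { locale: String, contextual_strings: Vec<String> },
    Unsupported(String),
}

/// Hypothetical dispatch: maps the configured provider name to a backend and
/// forwards dictionary entries as contextual strings for vocabulary bias.
pub fn dispatch(name: &str, cfg: &AppleSpeechAsrConfig, dictionary: &[String]) -> ProviderKind {
    match name {
        "apple-speech" => ProviderKind::AppleSpeech {
            locale: cfg.locale.clone(),
            contextual_strings: dictionary.to_vec(),
        },
        other => ProviderKind::Unsupported(other.to_string()),
    }
}
```

Keeping the default in a `Default` impl means a config file that omits the locale still resolves to a usable value.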
Add KoeAppleSpeech package reference and Speech framework to both Koe and Koe-x86 targets. Add NSSpeechRecognitionUsageDescription to Info.plist. Add speech recognition permission check/request methods to SPPermissionManager.
Add Apple Speech (On-Device) to ASR provider popup with @available guard. Dynamic locale list from SpeechTranscriber.supportedLocales, sorted by localized display name. Asset status display with download button (auto-download on Save if not installed). Locale picker replaces model UI when selected. Status shows "Installed — managed by macOS" with secondary color hint.
DESIGN.md: add section 30.5 (architecture, audio flow, availability, locale handling), update provider lists, feature flags, setup wizard, permissions, and summary. README.md: add to provider list, config example, permissions table, architecture diagram, ASR pipeline, and Local ASR section.
missuo left a comment
Review: PR #40 — feat(asr): add Apple Speech provider (macOS 26+)
Overall: Well-structured, follows existing patterns (KoeMLX). A few concerns:
Architecture
- Good: Follows the same FFI pattern as KoeMLX (Swift → C → Rust), generation-based session management, callback locking
- Good: `@available(macOS 26.0, *)` gating throughout, invisible on older systems
- Good: Asset management FFI for Setup Wizard integration
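For readers unfamiliar with the generation-based session management and callback locking mentioned above, here is a minimal sketch of the general technique. It is not the PR's code: `SessionGate`, `start_session`, and `on_event` are hypothetical names, and the real provider's state will differ.

```rust
use std::sync::Mutex;

/// State guarded by a single lock so that the generation check and the
/// event delivery happen atomically with respect to session restarts.
struct State {
    generation: u64,
    events: Vec<String>,
}

/// Hypothetical gate that drops callbacks belonging to an older session.
pub struct SessionGate {
    state: Mutex<State>,
}

impl SessionGate {
    pub fn new() -> Self {
        Self { state: Mutex::new(State { generation: 0, events: Vec::new() }) }
    }

    /// Start a new session and return its generation number.
    /// Callbacks tagged with any earlier generation become stale.
    pub fn start_session(&self) -> u64 {
        let mut s = self.state.lock().unwrap();
        s.generation += 1;
        s.events.clear();
        s.generation
    }

    /// Called from a (possibly late) callback. Returns true if the event
    /// belonged to the current session and was delivered.
    pub fn on_event(&self, generation: u64, text: &str) -> bool {
        let mut s = self.state.lock().unwrap();
        if generation != s.generation {
            return false; // stale callback from a previous session
        }
        s.events.push(text.to_string());
        true
    }

    /// Snapshot of the events delivered in the current session.
    pub fn events(&self) -> Vec<String> {
        self.state.lock().unwrap().events.clone()
    }
}
```

The point of the pattern is that a callback arriving after `stop` or `cancel` cannot leak text into the next session, because its generation no longer matches.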
Concerns
- Semaphore + Task pattern in CBridge.swift: `_supportedLocalesImpl`, `_assetStatusImpl`, and `_releaseAssetImpl` all use `DispatchSemaphore.wait()` to block the calling thread while spawning a `Task`. If called from the main thread, this deadlocks if the async work needs main-thread access. Consider documenting that these must be called from a background thread, or use a different synchronization pattern.
- `SFSpeechRecognizer.requestAuthorization` with semaphore in `koeAppleSpeechStartSession`: same deadlock risk if called from the main thread.
- Memory safety of `event_tx_ptr`: The leaked `Box<Sender>` pattern works but is fragile. If `connect()` is called twice without `close()`, the first sender leaks. The `reclaim_sender` call in `close()` and `Drop` helps, but `connect()` should call `reclaim_sender()` at the top to handle re-connection.
- `audioFormat` force-unwrap: `AVAudioFormat(...)!` in `AppleSpeechManager` will crash if the format is unsupported. Unlikely for 16kHz mono Int16, but a guard with an error callback would be safer.
- Missing `Definite` events: The provider only emits `Interim` (type 0) and `Final` (type 2). `result.isFinal` is used to accumulate finalized segments internally, but no `Definite` (type 1) events are sent upstream. This means the core's `TranscriptAggregator` won't receive definite confirmations. Is this intentional?
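To make the `event_tx_ptr` suggestion concrete, the following is a hedged sketch of reclaiming a leaked sender at the top of `connect()`. It assumes a simplified `Provider` type and uses `std::sync::mpsc` in place of the tokio channel so the example stays self-contained; only the names `event_tx_ptr`, `reclaim_sender`, `connect`, and `close` come from the review, and the actual struct layout is an assumption.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical provider that hands a leaked Box<Sender> to FFI callbacks
/// as an opaque context pointer.
pub struct Provider {
    event_tx_ptr: AtomicPtr<Sender<String>>,
}

impl Provider {
    pub fn new() -> Self {
        Self { event_tx_ptr: AtomicPtr::new(ptr::null_mut()) }
    }

    /// Take back ownership of a previously leaked sender, if any, and drop it.
    /// Returns true if a sender was reclaimed.
    fn reclaim_sender(&self) -> bool {
        let old = self.event_tx_ptr.swap(ptr::null_mut(), Ordering::AcqRel);
        if old.is_null() {
            return false;
        }
        // SAFETY: `old` was produced by Box::into_raw in connect(), and the
        // swap above guarantees each pointer is reclaimed at most once.
        // In real code the FFI session must already be stopped so that no
        // callback can still dereference the pointer.
        drop(unsafe { Box::from_raw(old) });
        true
    }

    pub fn connect(&self) -> Receiver<String> {
        // Reclaim first, so a second connect() without close() does not leak.
        self.reclaim_sender();
        let (tx, rx) = channel();
        self.event_tx_ptr.store(Box::into_raw(Box::new(tx)), Ordering::Release);
        rx
    }

    pub fn close(&self) {
        self.reclaim_sender();
    }
}

impl Drop for Provider {
    fn drop(&mut self) {
        self.reclaim_sender();
    }
}
```

With this shape, the receiver returned by an earlier `connect()` observes a disconnected channel as soon as the provider reconnects, which is an easy property to assert in a unit test.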
Minor
- `.build/` gitignore addition is good (SPM build directory)
- `apple-speech` feature flag name is consistent with the `sherpa-onnx` convention
Summary
Add Apple Speech as a new on-device ASR provider using Apple's `SpeechAnalyzer` and `SpeechTranscriber` APIs (macOS 26+). Zero-config, zero-download speech recognition with system-managed language assets.
- `AsyncStream<AnalyzerInput>` → `SpeechAnalyzer` → progressive transcription
- `result.isFinal` model: finalized segments build a stable prefix, volatile segments show in-progress recognition; no string-overlap heuristics
- `contextualStrings` for vocabulary bias
- `AssetInventory` (download/release/status through FFI)
Implementation
- `KoeAppleSpeech`: `AppleSpeechManager` (session lifecycle, audio bridging, `finalizedTranscript + volatileTranscript` accumulation); `CBridge` (`@_cdecl` FFI for session control and asset management)
- `koe-asr`: `AppleSpeechProvider` (FFI wrapper, PCM routing, tokio mpsc event channel)
- `koe-core`: `asr.provider = "apple-speech"` config, locale and dictionary wiring
- `KoeApp`
Runtime requirements
- `@available(macOS 26.0, *)`; invisible on older systems
- `NSSpeechRecognitionUsageDescription` in Info.plist (only needed for Apple Speech)
- `apple-speech` feature flag (enabled by default, excludable with `--no-default-features`)
App bundle changes
- `Speech.framework` (system framework, always present)
Test plan