Fully offline, on-device Speech-to-Text and Text-to-Speech for React Native, powered by sherpa-onnx and Nitro Modules.
- All inference runs on-device — no network calls, no cloud dependency
- Models are not bundled — consumers download and manage their own model files
- New Architecture only (Nitro Modules)
- iOS 15.5+, Android API 29+
| Feature | Description |
|---|---|
| STT Streaming | Real-time transcription with partial + final results. Best with transducer/Zipformer models. |
| STT VAD-gated | VAD detects end-of-speech, then runs batch inference. Best with Whisper models for conversational AI. |
| TTS Streaming | Generate speech from text with streaming PCM output. Supports VITS, Kokoro, Matcha models. |
| VAD Standalone | Voice Activity Detection as a standalone utility for custom pipelines. |
| Mic Capture | Built-in microphone capture (16kHz mono). Also supports external audio via feedAudio(). |
npm install react-native-nitro-voice react-native-nitro-modules

Add the following line to your app's ios/Podfile inside the target block, before calling use_react_native!:

pod 'sherpa-onnx-ios', :path => '../node_modules/react-native-nitro-voice'

Then run:

cd ios && pod install

CocoaPods will download the sherpa-onnx XCFrameworks (~370 MB) from the upstream GitHub release automatically on first install. No manual framework management is required.
On Android, sherpa-onnx is included automatically as a Gradle dependency.
Add JitPack to your project-level build.gradle if not already present:
allprojects {
  repositories {
    maven { url 'https://jitpack.io' }
  }
}

Models are not bundled with the library. Download models from the sherpa-onnx model zoo and place them in a directory on the device that your app can read.
| STT Model Type | Required Files | Best For |
|---|---|---|
| `whisper` | `encoder.onnx`, `decoder.onnx`, `tokens.txt` | VAD-gated batch mode, high accuracy |
| `transducer` | `encoder.onnx`, `decoder.onnx`, `joiner.onnx`, `tokens.txt` | Streaming mode, real-time captions |
| `paraformer` | `model.onnx`, `tokens.txt` | Streaming or batch, balanced |
| `nemo_ctc` | `model.onnx`, `tokens.txt` | Streaming mode, fast inference |
| `sense_voice` | `model.onnx`, `tokens.txt` | Batch mode, multilingual |
| TTS Model Type | Required Files |
|---|---|
| `vits` | `model.onnx`, `tokens.txt`; optional: `lexicon.txt`, `data/` |
| `kokoro` | `model.onnx`, `voices.bin`, `tokens.txt`, `data/` |
| `matcha` | `acoustic_model.onnx`, `vocoder.onnx`, `tokens.txt`; optional: `data/` |
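A missing model file is easier to diagnose before an engine is created than after. The sketch below transcribes the required (non-optional) files from the two tables above into a lookup and reports which ones are absent from a directory listing. `REQUIRED_FILES` and `missingModelFiles` are hypothetical names, not part of react-native-nitro-voice, and the sketch assumes the files on disk are named exactly as listed in the tables; the directory listing itself would come from whatever file-system library your app uses.

```typescript
type STTModelType = 'whisper' | 'transducer' | 'paraformer' | 'nemo_ctc' | 'sense_voice';
type TTSModelType = 'vits' | 'kokoro' | 'matcha';

// Required files per model type, copied from the tables above (optional files omitted).
const REQUIRED_FILES: Record<STTModelType | TTSModelType, string[]> = {
  whisper: ['encoder.onnx', 'decoder.onnx', 'tokens.txt'],
  transducer: ['encoder.onnx', 'decoder.onnx', 'joiner.onnx', 'tokens.txt'],
  paraformer: ['model.onnx', 'tokens.txt'],
  nemo_ctc: ['model.onnx', 'tokens.txt'],
  sense_voice: ['model.onnx', 'tokens.txt'],
  vits: ['model.onnx', 'tokens.txt'],
  kokoro: ['model.onnx', 'voices.bin', 'tokens.txt', 'data/'],
  matcha: ['acoustic_model.onnx', 'vocoder.onnx', 'tokens.txt'],
};

// Strip trailing slashes so 'data' and 'data/' compare equal.
function normalize(name: string): string {
  return name.replace(/\/+$/, '');
}

// Hypothetical helper: returns the required entries that are not in `entries`,
// where `entries` is the list of file and directory names found in modelDir.
function missingModelFiles(
  type: STTModelType | TTSModelType,
  entries: string[],
): string[] {
  const present = entries.map(normalize);
  return REQUIRED_FILES[type].filter(
    (required) => present.indexOf(normalize(required)) === -1,
  );
}
```

An empty result means every required file was found; anything else can be shown to the user or logged before calling `create()`.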
The VAD model is a single file, silero_vad.onnx, available from the silero-vad releases.
Example: download a small Whisper model and Silero VAD for quick testing.
# Whisper tiny.en (quantized, ~40 MB)
curl -SL -o sherpa-onnx-whisper-tiny.en.tar.bz2 \
https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-whisper-tiny.en.tar.bz2
tar xjf sherpa-onnx-whisper-tiny.en.tar.bz2
# Silero VAD
curl -SL -o silero_vad.onnx \
https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx

Copy the resulting files to a device-accessible directory (e.g. via react-native-fs or Expo FileSystem) before passing paths to the library.
On iOS, add a microphone usage description to your Info.plist:

<key>NSMicrophoneUsageDescription</key>
<string>Used for speech recognition</string>

On Android, add the RECORD_AUDIO permission to your AndroidManifest.xml:

<uses-permission android:name="android.permission.RECORD_AUDIO" />

You must also request the permission at runtime before calling startMic() or using the default mic-enabled mode. Use PermissionsAndroid from React Native or a library like react-native-permissions.
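The runtime request could be wrapped roughly as follows. This is a sketch, not part of the library: `ensureMicPermission` and `AndroidPermissionApi` are hypothetical names, and the interface is a narrowed structural view of React Native's `PermissionsAndroid` (its `PERMISSIONS.RECORD_AUDIO`, `RESULTS.GRANTED` and `request()` members), declared locally so the helper can be exercised without a device.

```typescript
// Narrow structural view of React Native's PermissionsAndroid (assumed compatible).
interface AndroidPermissionApi {
  PERMISSIONS: { RECORD_AUDIO: string };
  RESULTS: { GRANTED: string };
  request(
    permission: string,
    rationale?: { title: string; message: string; buttonPositive: string },
  ): Promise<string>;
}

// Hypothetical helper: resolves to true when the microphone may be used.
async function ensureMicPermission(
  platformOS: string,
  api: AndroidPermissionApi,
): Promise<boolean> {
  // On iOS the system prompt is driven by NSMicrophoneUsageDescription,
  // so there is nothing to request here.
  if (platformOS !== 'android') {
    return true;
  }
  const status = await api.request(api.PERMISSIONS.RECORD_AUDIO, {
    title: 'Microphone access',
    message: 'The microphone is used for on-device speech recognition.',
    buttonPositive: 'OK',
  });
  return status === api.RESULTS.GRANTED;
}
```

In an app this would be called as `await ensureMicPermission(Platform.OS, PermissionsAndroid)`, with both values imported from `react-native`, before starting any mic-enabled STT mode.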
import { NitroSTT } from 'react-native-nitro-voice';

const stt = await NitroSTT.create({
  modelDir: '/path/to/whisper-model',
  type: 'whisper',
  language: 'en',
});

// Start VAD-gated batch recognition (mic starts automatically)
await stt.startVADGated('/path/to/silero_vad.onnx', {
  onTranscript: (text) => {
    console.log('Transcript:', text);
  },
});

// ... user speaks, pauses → clean transcript per utterance

// Stop (mic stops automatically)
await stt.stop();
await stt.destroy();

import { NitroSTT } from 'react-native-nitro-voice';
const stt = await NitroSTT.create({
  modelDir: '/path/to/transducer-model',
  type: 'transducer',
});

await stt.startStreaming({
  onPartial: (text) => console.log('Partial:', text),
  onFinal: (text) => console.log('Final:', text),
});

// Mic starts automatically; stop with:
await stt.stop();

const stt = await NitroSTT.create(config);
// Disable the automatic mic and feed audio manually
await stt.startStreaming(callbacks, { mic: false });

// Feed pre-recorded or streamed audio
// Accepts any sample rate (resampled to 16kHz internally)
stt.feedAudio(pcmArrayBuffer, 44100);

import { NitroTTS } from 'react-native-nitro-voice';
const tts = await NitroTTS.create({
  modelDir: '/path/to/kokoro-model',
  type: 'kokoro',
  speed: 1.0,
  speakerId: 0,
});

console.log(`Sample rate: ${tts.sampleRate}, Speakers: ${tts.numSpeakers}`);

await tts.speak('Hello, world!', {
  onAudioChunk: (samples, sampleRate) => {
    // Feed PCM Float32 to your audio player
    // e.g. expo-av, react-native-audio-api
  },
  onComplete: () => {
    console.log('Done speaking');
  },
});

await tts.destroy();

import { NitroVAD } from 'react-native-nitro-voice';
const vad = await NitroVAD.create({
  modelPath: '/path/to/silero_vad.onnx',
  threshold: 0.5,
  minSilenceDuration: 0.5,
  minSpeechDuration: 0.25,
});

const cleanup = vad.start({
  onSpeechStart: () => console.log('Speech started'),
  onSpeechEnd: (audio) => {
    console.log(`Speech ended, ${audio.byteLength} bytes of audio`);
  },
});

// Feed 16kHz mono Float32 PCM chunks
vad.processChunk(audioChunk);

// Stop
cleanup();
vad.destroy();

| Use Case | Mode | Model Type | Why |
|---|---|---|---|
| Conversational AI | VAD-gated | Whisper | Clean utterance boundaries, high accuracy |
| Live captions | Streaming | Transducer/Zipformer | Low latency, partial results |
| Voice commands | VAD-gated | Paraformer | Fast batch inference |
| Dictation | Streaming | Transducer | Real-time feedback |
| Multilingual | VAD-gated | SenseVoice | Multi-language support |
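If an app needs to pick a configuration at runtime, the table above can be carried as a small typed lookup. This is only an illustrative sketch: `UseCase`, `RECOMMENDED` and `recommendSetup` are hypothetical names, and "Transducer/Zipformer" is mapped to the `transducer` model type.

```typescript
type STTModelType = 'whisper' | 'transducer' | 'paraformer' | 'nemo_ctc' | 'sense_voice';
type RecognitionMode = 'vad_gated' | 'streaming';
type UseCase =
  | 'conversational_ai'
  | 'live_captions'
  | 'voice_commands'
  | 'dictation'
  | 'multilingual';

interface Recommendation {
  mode: RecognitionMode;
  modelType: STTModelType;
}

// One entry per row of the table above.
const RECOMMENDED: Record<UseCase, Recommendation> = {
  conversational_ai: { mode: 'vad_gated', modelType: 'whisper' },
  live_captions: { mode: 'streaming', modelType: 'transducer' },
  voice_commands: { mode: 'vad_gated', modelType: 'paraformer' },
  dictation: { mode: 'streaming', modelType: 'transducer' },
  multilingual: { mode: 'vad_gated', modelType: 'sense_voice' },
};

// Hypothetical helper: look up the suggested mode and model type.
function recommendSetup(useCase: UseCase): Recommendation {
  return RECOMMENDED[useCase];
}
```

The `mode` field then decides whether to call `startVADGated()` or `startStreaming()`, and `modelType` is the value to pass as `type` in the STT config.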
| Method | Description |
|---|---|
| `NitroSTT.create(config: STTConfig)` | Factory: creates and initializes the STT engine |
| `startStreaming(callbacks, options?)` | Start streaming recognition with `onPartial`/`onFinal`. Starts the mic by default. |
| `startVADGated(vadModelPath, callbacks, options?)` | Start VAD-gated batch recognition with `onTranscript`. Starts the mic by default. |
| `feedAudio(samples, sampleRate)` | Feed external audio (any sample rate, resampled internally) |
| `startMic()` | Manually start the device microphone (for advanced use) |
| `stopMic()` | Manually stop microphone capture |
| `stop()` | Stop the current recognition session (stops the mic if active) |
| `destroy()` | Release all native resources |
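Many recorders and decoders produce 16-bit integer PCM rather than floats. Assuming `feedAudio()` expects Float32 samples in the range [-1, 1] (this README states the Float32 format explicitly only for the VAD's `processChunk()`), a conversion step might look like the sketch below. `int16ToFloat32` is a hypothetical helper, not a library function.

```typescript
// Convert signed 16-bit PCM to Float32 samples in [-1, 1).
// Dividing by 32768 maps -32768 to exactly -1 and 32767 to just under 1.
function int16ToFloat32(pcm: Int16Array): Float32Array {
  const out = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) {
    out[i] = pcm[i] / 32768;
  }
  return out;
}
```

The result's `.buffer` could then be passed along with the recording's sample rate, for example `stt.feedAudio(int16ToFloat32(chunk).buffer, 44100)`, if `feedAudio()` takes an ArrayBuffer as the variable name in the example above suggests.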
| Method / Property | Description |
|---|---|
| `NitroTTS.create(config: TTSConfig)` | Factory: creates and initializes the TTS engine |
| `speak(text, callbacks)` | Generate speech with streaming `onAudioChunk`/`onComplete` |
| `stop()` | Cancel in-progress generation |
| `destroy()` | Release all native resources |
| `sampleRate` | Output sample rate of the loaded model |
| `numSpeakers` | Number of speakers in the loaded model |
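When the audio player in use wants one contiguous buffer instead of a stream, the chunks delivered to `onAudioChunk` can be accumulated first. The sketch below assumes each chunk arrives as a `Float32Array` (the README only says "PCM Float32", so the exact type is an assumption); `PcmCollector` is a hypothetical helper, not part of the library.

```typescript
// Accumulates streamed Float32 PCM chunks and joins them on demand.
class PcmCollector {
  private chunks: Float32Array[] = [];
  private totalSamples = 0;

  push(samples: Float32Array): void {
    // Copy the chunk in case the caller reuses the underlying buffer (an assumption).
    this.chunks.push(samples.slice());
    this.totalSamples += samples.length;
  }

  // Length of the collected audio in seconds at the given sample rate.
  durationSeconds(sampleRate: number): number {
    return this.totalSamples / sampleRate;
  }

  // Concatenate all chunks into a single Float32Array.
  toFloat32Array(): Float32Array {
    const out = new Float32Array(this.totalSamples);
    let offset = 0;
    for (let i = 0; i < this.chunks.length; i++) {
      out.set(this.chunks[i], offset);
      offset += this.chunks[i].length;
    }
    return out;
  }
}
```

A collector would be filled from `onAudioChunk: (samples) => collector.push(samples)` and read in `onComplete`, using `tts.sampleRate` for the duration calculation.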
| Method | Description |
|---|---|
| `NitroVAD.create(config: VADConfig)` | Factory: creates and initializes the VAD |
| `start(callbacks)` | Register `onSpeechStart`/`onSpeechEnd` callbacks. Returns a cleanup function. |
| `processChunk(samples)` | Feed 16kHz mono Float32 PCM audio |
| `reset()` | Clear accumulated audio state |
| `destroy()` | Release all native resources |
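To run the standalone VAD over a pre-recorded buffer, the audio has to be fed to `processChunk()` piece by piece. The sketch below splits a 16kHz mono Float32 buffer into fixed-size chunks. `splitIntoChunks` is a hypothetical helper, and the README does not state a required chunk size, so any value used with it (for example 512 samples, which is 32 ms at 16kHz) is an illustrative assumption.

```typescript
// Split a sample buffer into consecutive chunks of at most `chunkSize` samples.
// The chunks are views onto the original buffer (no copying); the last one may be shorter.
function splitIntoChunks(samples: Float32Array, chunkSize: number): Float32Array[] {
  if (chunkSize <= 0 || Math.floor(chunkSize) !== chunkSize) {
    throw new RangeError('chunkSize must be a positive integer');
  }
  const chunks: Float32Array[] = [];
  for (let start = 0; start < samples.length; start += chunkSize) {
    // subarray clamps the end index to the buffer length.
    chunks.push(samples.subarray(start, start + chunkSize));
  }
  return chunks;
}
```

Each chunk would then be passed to `vad.processChunk(chunk)` in order, with `onSpeechStart`/`onSpeechEnd` firing as speech boundaries are detected.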
type STTModelType = 'whisper' | 'transducer' | 'paraformer' | 'nemo_ctc' | 'sense_voice'

interface STTConfig {
  modelDir: string   // Path to directory containing model files
  type: STTModelType
  language?: string  // e.g. 'en', 'fr', 'zh' (required for Whisper)
}

type TTSModelType = 'vits' | 'kokoro' | 'matcha'

interface TTSConfig {
  modelDir: string    // Path to directory containing model files
  type: TTSModelType
  speakerId?: number  // Speaker index for multi-speaker models (default: 0)
  speed?: number      // Playback speed multiplier (default: 1.0)
}

interface VADConfig {
  modelPath: string            // Path to silero_vad.onnx
  threshold?: number           // Speech detection threshold (default: 0.5)
  minSilenceDuration?: number  // Seconds of silence to end speech (default: 0.5)
  minSpeechDuration?: number   // Minimum seconds to count as speech (default: 0.25)
}

interface STTOptions {
  mic?: boolean  // Start microphone automatically (default: true)
}

The example/ directory contains a demo app showing:
- VAD-gated Whisper STT with microphone input
- Kokoro TTS with text input
The example app downloads its models from an R2 bucket. Before running, copy example/.env.sample to example/.env and set the bucket base URL:
R2_BASE_URL=https://pub-28d1fdcf7fc645feb5a92306699262f7.r2.dev
To run:
# Install deps
npm install
cd example
npm run ios
# or
npm run android

License: MIT