OmniVAD


Cross-platform toolkit for FireRedVAD: state-of-the-art voice activity detection (VAD) and audio event detection (AED).

Three models, one toolkit, runs everywhere:

Model	What it does	Output
VAD	Speech detection (non-stream)	Speech timestamps
Stream-VAD	Real-time speech detection (frame-by-frame)	Per-frame speech probability
AED	Audio event detection (non-stream)	Speech / Singing / Music timestamps

All three models use the DFSMN architecture, are ~2.2MB each (~588K parameters), and support 100+ languages.

Packages

Python (`omnivad/`)

PyPI package with native C bindings (ncnn backend). Models are bundled in the wheel.

pip install omnivad

CLI:

omnivad audio.wav                        # VAD + AED → audio.TextGrid
omnivad audio.wav -o out.json            # Output as JSON
omnivad audio.wav -o out.srt             # Output as SRT
omnivad audio.wav -o out.vtt             # Output as WebVTT
omnivad audio.wav -f srt                 # Format flag (textgrid/json/srt/vtt)
omnivad audio.wav -m vad                 # VAD only
omnivad audio.wav -m aed                 # AED only (speech/singing/music)
omnivad long.wav --chunk 600 --overlap 2 # Chunked processing for large audio
python -m omnivad audio.wav              # Also works

Python API:

from omnivad import OmniVAD, OmniStreamVAD, OmniAED
import numpy as np

vad = OmniVAD()

# File path — auto-loads as float32 [-1,1]
result = vad.detect("audio.wav")
# {'duration': 2.24, 'timestamps': [(0.26, 1.82)]}

# Float32 array [-1.0, 1.0] — from soundfile, torchaudio, librosa
result = vad.detect(float32_array)

# Int16 array — from raw WAV, microphone PCM
result = vad.detect(np.array([...], dtype=np.int16))

# Large audio — chunked processing with overlap
# overlap_seconds must be smaller than chunk_seconds
result = vad.detect("long.wav", chunk_seconds=600, overlap_seconds=2)

# Stream VAD: real-time, feed 160 samples (10 ms at 16 kHz) per call
# Accepts float32 in [-1, 1] (Web Audio, soundfile, torch) or int16 PCM
svad = OmniStreamVAD()
frame = None
while frame is None:               # process() returns None until enough audio is buffered
    frame = svad.process(pcm_160)  # pcm_160: the next 160-sample np.float32 or np.int16 block
# StreamResult(time=0.420s, confidence=0.95, is_speech=True)

# FastClone — share model weights, minimal memory per stream
clone = svad.clone()  # instant, ~0 memory overhead
clone.process(pcm_160)  # fully independent state

# AED — speech + singing + music
aed = OmniAED()
events = aed.detect("audio.wav")
# {'duration': 22.0, 'events': {'speech': [...], 'singing': [...], 'music': [...]}}
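
The chunk_seconds / overlap_seconds arguments are easiest to picture as a sliding window over the file. The sketch below only illustrates that idea and is not OmniVAD's actual chunking code: chunk_windows is a hypothetical helper, and how the library reconciles detections inside the overlap is not shown. It does show why overlap_seconds must be smaller than chunk_seconds: otherwise the window would never advance.

```python
def chunk_windows(duration, chunk_seconds, overlap_seconds):
    """Return (start, end) windows covering [0, duration] with the given overlap.

    Hypothetical helper for illustration; not part of the omnivad API.
    """
    if not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("overlap_seconds must be >= 0 and smaller than chunk_seconds")
    step = chunk_seconds - overlap_seconds  # how far each window advances
    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_seconds, duration)
        windows.append((start, end))
        if end >= duration:
            break
        start += step
    return windows


# A 25-minute file with the settings used above (600 s chunks, 2 s overlap).
for window in chunk_windows(1500, 600, 2):
    print(window)
```

Each window starts 2 seconds before the previous one ends, so a speech segment that straddles a boundary is seen whole by at least one window as long as it is shorter than the overlap.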

Platforms: macOS (arm64/x86_64), Linux (x86_64/aarch64), Windows (x86_64)

C/C++ Native Library (`native/`)

Unified C API with ncnn backend. Single header, single library.

#include "omnivad.h"

int err = OMNI_OK;

// VAD — whole audio to speech segments
OmniVadHandle vad = omni_vad_create("vad.omnivad", &err);
omni_vad_detect_int16(vad, pcm, num_samples, &config, &segments, &count);
// segments[0] = { start: 0.44, end: 1.82 }

// Stream VAD — real-time, 10ms per frame
// Two entries: omni_stream_vad_process (float [-1,1]), _int16 (int16 PCM)
OmniStreamVadHandle svad = omni_stream_vad_create("stream-vad.omnivad", 0.5f, &err);
omni_stream_vad_process(svad, float_160_samples, 160, &result);   // FP32
omni_stream_vad_process_int16(svad, pcm_160_samples, 160, &result); // int16

// FastClone — share model weights across streams
OmniStreamVadHandle clone = omni_stream_vad_clone(svad, &err);
omni_stream_vad_process_int16(clone, other_pcm, 160, &result);  // independent state

// AED — speech + singing + music detection
OmniAedHandle aed = omni_aed_create("aed.omnivad", &err);
omni_aed_detect_int16(aed, pcm, num_samples, &config, &segments, &count);
// segments[0] = { start: 0.09, end: 12.32, cls: OMNI_AED_MUSIC }

Build:

# Prerequisites: cmake, ncnn (brew install ncnn)
cd native
cmake -B build && cmake --build build -j$(nproc 2>/dev/null || sysctl -n hw.ncpu)

# Test
./build/test_all ../models/ audio.wav

Platforms: macOS (arm64/x86_64), Linux (x86_64/aarch64), Windows (x86_64), Android (armeabi-v7a/arm64-v8a)

TypeScript/JavaScript (`packages/omnivad/`)

Works in both browser and Node.js via ncnn WebAssembly. Zero dependencies, models bundled.

import { OmniVAD, OmniStreamVAD, OmniAED } from 'omnivad';

// Non-stream VAD — models loaded automatically from bundled WASM
const vad = await OmniVAD.create();
const result = vad.detect(audioFloat32Array);  // Float32Array [-1.0, 1.0]
// { duration: 2.32, timestamps: [[0.44, 1.82]] }

// Also accepts Int16Array (raw PCM)
const result2 = vad.detect(pcmInt16Array);

// Stream VAD — frame-by-frame or full-audio batch mode
const svad = await OmniStreamVAD.create();
// processFrame() accepts Float32Array [-1, 1] or Int16Array — dispatch by dtype
const frame = svad.processFrame(float32_160);  // null until enough audio is buffered
const full = svad.detectFull(audioFloat32Array);
// { probabilities: Float32Array(...), numFrames: 98, duration: 1.0 }

// AED — speech + singing + music
const aed = await OmniAED.create();
const events = aed.detect(audioFloat32Array);
// { duration: 22.0, events: { speech: [...], singing: [...], music: [...] }, ratios: { ... } }

Build:

cd packages/omnivad
pnpm install && pnpm build
# Output: dist/index.js + dist/index.cjs + dist/index.d.ts + dist/wasm/*

Thread Safety

Component	Shared handle	Independent handles	Notes
OmniVAD	Safe	Safe	`ncnn::Net` is read-only; each call creates a local `Fbank` and `Extractor`
OmniAED	Safe	Safe	Same architecture as VAD
OmniStreamVAD	Unsafe	Safe	Mutable internal state (`audio_buffer`, `cache`, `frame_offset`)

Guidelines:

OmniVAD and OmniAED instances can be safely shared across threads for concurrent inference. The Python workers parameter in detect(..., workers=N) already uses this pattern.
OmniStreamVAD instances must not be shared across threads. Create one instance per thread for parallel streaming.
Handle creation (omni_*_create) should be done sequentially — ncnn's model loading is not designed for highly concurrent initialization.
Never call close() / destroy() on a handle while another thread is using it.

Running thread-safety tests:

# Python
pytest tests/test_thread_safety.py -v

# C++ (requires ncnn)
./native/build/test_thread_safety models/ tests/data/hello_en.wav [threads] [repeats]

Audio Input

High-level APIs accept 16kHz mono audio only. Two sample formats are supported, with the same convention across all three model types and all three layers (C / Python / TypeScript):

float32 / Float32Array in [-1, 1] (Web Audio, soundfile, torch)
int16 / Int16Array PCM (WAV, microphone)

Wrappers dispatch by dtype to the matching C entry — never scale or convert in Python/JS. All scaling lives in the C library: the f32 entry multiplies by 32768.0f, the _int16 entry casts to float.

Method	FP32 entry	int16 entry
`OmniVAD.detect / detect_probs`	`omni_vad_detect[_probs]`	`omni_vad_detect[_probs]_int16`
`OmniAED.detect / detect_probs`	`omni_aed_detect[_probs]`	`omni_aed_detect[_probs]_int16`
`OmniStreamVAD.process`	`omni_stream_vad_process`	`omni_stream_vad_process_int16`
`OmniStreamVAD.detect_full`	`omni_stream_vad_detect_full`	`omni_stream_vad_detect_full_int16`

For exact contracts see native/include/omnivad.h.
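
To make the scaling rule concrete, here is a small pure-Python sketch of what the two C entries are described as doing. It is an illustration only: to_model_scale is a hypothetical name, and the real conversion happens inside the C library, never in the wrappers. The point is that both input formats end up on the same int16-range scale.

```python
from array import array


def to_model_scale(samples):
    """Mimic the described C-side scaling for the two accepted sample formats.

    Hypothetical helper: 'f' arrays stand in for float32 in [-1, 1],
    'h' arrays stand in for int16 PCM.
    """
    if samples.typecode == "f":
        # FP32 entry: multiply by 32768.0
        return [x * 32768.0 for x in samples]
    if samples.typecode == "h":
        # int16 entry: plain cast to float
        return [float(x) for x in samples]
    raise TypeError("expected float32 ('f') or int16 ('h') samples")


as_float = array("f", [0.5, -1.0, 0.0])
as_int16 = array("h", [16384, -32768, 0])
print(to_model_scale(as_float) == to_model_scale(as_int16))
```

This is also why pre-scaling a float32 array by 32768 before calling detect would be a mistake under the stated convention: the library would scale it a second time.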

Audio Pipeline

16kHz PCM → Fbank (80-dim, 25ms window, 10ms shift) → CMVN → DFSMN → Sigmoid → Post-processing → Segments
                     Povey window                        μ/σ    ~2.2MB   [0,1]    4-state machine
                     pre-emphasis 0.97                                            merge/split/extend
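
As a rough guide to how many frames the front end yields, the sketch below assumes Kaldi-style framing without edge padding. That is an assumption; the source only states the 25 ms window and 10 ms shift. Under that assumption one second of audio gives 98 frames, which agrees with the numFrames: 98 shown for a 1.0 s clip in the TypeScript example.

```python
SAMPLE_RATE = 16_000
WINDOW = 400  # 25 ms at 16 kHz
SHIFT = 160   # 10 ms at 16 kHz


def num_frames(num_samples: int) -> int:
    """Frames produced when only complete windows are kept (assumed framing)."""
    if num_samples < WINDOW:
        return 0
    return 1 + (num_samples - WINDOW) // SHIFT


print(num_frames(SAMPLE_RATE))  # one second of audio
```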

Streaming VAD — `OmniStreamVAD`

For long audio (live streams, hours-long recordings, real-time captioning), OmniStreamVAD processes audio frame by frame and emits each segment-boundary event on the same call that confirms the boundary. Results are bit-identical to upstream FireRedVAD's FireRedStreamVad.

Each successful process() call returns a result with both per-frame probabilities AND segment-boundary flags:

Field	Meaning
`confidence`	raw model probability `[0, 1]`
`smoothed_prob`	causal moving-average over `smooth_window_size` frames
`is_speech`	`smoothed_prob >= threshold`
`is_speech_start`	`True` on the frame that confirms a new SPEECH segment
`is_speech_end`	`True` on the frame that confirms a SPEECH segment end
`frame_idx`	1-based frame index (multiply by 0.01 for seconds)
`speech_start_frame`	1-based segment start (when `is_speech_start`)
`speech_end_frame`	1-based segment end (when `is_speech_end`)

Configuration (defaults match upstream FireRedVAD)

Parameter	Default	Meaning
`threshold`	`0.5`	Speech activation threshold
`smooth_window_size`	`5`	Causal moving-average window (frames)
`pad_start_frame`	`5`	Extend confirmed segment START backward by N frames
`min_speech_frame`	`8`	Min continuous speech frames to confirm START (~80ms)
`max_speech_frame`	`2000`	Force-split when SPEECH-state count hits this (~20s)
`min_silence_frame`	`20`	Min continuous silence frames to confirm END (~200ms)
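
To show how these parameters interact, here is a deliberately simplified sketch of a smoothing-plus-confirmation loop. It is not the upstream state machine: ToySegmenter and FrameEvent are hypothetical names, max_speech_frame force-splitting is left out, and the exact frame reported as speech_end_frame is an assumption (here, the last frame that was still above threshold).

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class FrameEvent:
    frame_idx: int  # 1-based, as in the field table
    smoothed_prob: float
    is_speech: bool
    is_speech_start: bool = False
    is_speech_end: bool = False
    speech_start_frame: int = 0
    speech_end_frame: int = 0


class ToySegmenter:
    """Illustrative only; not the FireRedVAD implementation."""

    def __init__(self, threshold=0.5, smooth_window_size=5, pad_start_frame=5,
                 min_speech_frame=8, min_silence_frame=20):
        self.threshold = threshold
        self.pad_start_frame = pad_start_frame
        self.min_speech_frame = min_speech_frame
        self.min_silence_frame = min_silence_frame
        self.window = deque(maxlen=smooth_window_size)  # causal moving average
        self.frame_idx = 0
        self.in_speech = False
        self.speech_run = 0
        self.silence_run = 0

    def push(self, prob):
        self.frame_idx += 1
        self.window.append(prob)
        smoothed = sum(self.window) / len(self.window)
        is_speech = smoothed >= self.threshold
        event = FrameEvent(self.frame_idx, smoothed, is_speech)
        if is_speech:
            self.speech_run += 1
            self.silence_run = 0
        else:
            self.silence_run += 1
            self.speech_run = 0
        if not self.in_speech and self.speech_run >= self.min_speech_frame:
            # START is confirmed late, then moved back to the run start minus padding.
            self.in_speech = True
            event.is_speech_start = True
            first = self.frame_idx - self.speech_run + 1
            event.speech_start_frame = max(1, first - self.pad_start_frame)
        elif self.in_speech and self.silence_run >= self.min_silence_frame:
            self.in_speech = False
            event.is_speech_end = True
            event.speech_end_frame = self.frame_idx - self.silence_run
        return event


# 10 silent frames, 30 speech frames, 30 silent frames.
probs = [0.0] * 10 + [1.0] * 30 + [0.0] * 30
seg = ToySegmenter()
events = [seg.push(p) for p in probs]
starts = [(e.frame_idx, e.speech_start_frame) for e in events if e.is_speech_start]
ends = [(e.frame_idx, e.speech_end_frame) for e in events if e.is_speech_end]
print(starts, ends)
```

In this toy run the start is confirmed on frame 20 but reported as frame 8, and the end is confirmed on frame 62 but reported as frame 42. That gap between confirmation and reported boundary is the latency that min_speech_frame and min_silence_frame trade against robustness.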

Python

from omnivad import OmniStreamVAD
import numpy as np

vad = OmniStreamVAD()                              # upstream defaults
pcm = np.fromfile("speech.pcm", dtype=np.int16)

for i in range(0, len(pcm) - 159, 160):            # full 10ms (160-sample) chunks only
    result = vad.process(pcm[i : i + 160])
    if result is None:
        continue
    if result.is_speech_start:
        print(f"START @ {result.speech_start_frame * 0.01:.2f}s")
    if result.is_speech_end:
        print(f"END   @ {result.speech_end_frame * 0.01:.2f}s")

# Or get [(start_sec, end_sec), ...] in one call:
segments = OmniStreamVAD().detect_segments("speech.wav")

TypeScript

import { OmniStreamVAD } from "omnivad";

const vad = await OmniStreamVAD.create();
for (let i = 0; i + 160 <= pcm.length; i += 160) {
    const result = vad.processFrame(pcm.subarray(i, i + 160));
    if (!result) continue;
    if (result.isSpeechStart) {
        console.log(`START @ ${(result.speechStartFrame * 0.01).toFixed(2)}s`);
    }
    if (result.isSpeechEnd) {
        console.log(`END   @ ${(result.speechEndFrame * 0.01).toFixed(2)}s`);
    }
}

Pairing with `merge_chunks`

OmniStreamVAD emits raw VAD segments. To pack them into Whisper-sized 30s chunks for downstream ASR, feed the emitted [start, end] pairs to merge_chunks (see next section).
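
A minimal sketch of that hand-off, assuming results carry the fields from the table above (is_speech_start, speech_start_frame, is_speech_end, speech_end_frame). collect_segments is a hypothetical helper, and the SimpleNamespace objects stand in for real stream results so the example runs without the library. Handling an end before a start on the same frame is an assumption about how a forced split would be reported.

```python
from types import SimpleNamespace


def collect_segments(results, frame_shift=0.01):
    """Turn per-frame boundary flags into [(start_secs, end_secs), ...]."""
    segments = []
    start = None
    for r in results:
        if r is None:  # not enough audio buffered yet
            continue
        if r.is_speech_end and start is not None:
            segments.append((start, round(r.speech_end_frame * frame_shift, 2)))
            start = None
        if r.is_speech_start:
            start = round(r.speech_start_frame * frame_shift, 2)
    return segments


def fake(start=0, end=0):
    # Stand-in for a stream result; only the four boundary fields are modeled.
    return SimpleNamespace(is_speech_start=bool(start), speech_start_frame=start,
                           is_speech_end=bool(end), speech_end_frame=end)


results = [None, fake(), fake(start=8), fake(), fake(end=42), fake(start=100), fake(end=180)]
timestamps = collect_segments(results)
print(timestamps)
```

The resulting list has the same shape as the timestamps passed to merge_chunks in the next section.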

Chunking — `merge_chunks` / `mergeChunks`

After VAD produces a list of speech (start, end) segments, the chunking utility groups them into duration-bounded chunks suitable for downstream ASR / forced alignment / TTS. It is a pure function with no model dependency — Python uses ctypes, TypeScript uses Emscripten WASM, and C calls the native function directly. All three bindings share a single C implementation in native/src/chunking.cpp.

from omnivad import merge_chunks
chunks = merge_chunks(timestamps, max_chunk_secs=30.0, mode="greedy")

import { mergeChunks } from "omnivad";
const chunks = await mergeChunks(timestamps, { maxChunkSecs: 30.0, mode: "longest_gap" });

Pipeline (5 steps; Steps 1–2 and 4–5 are shared by both modes)

input (sorted segments)
  │
  ├─ Step 1: drop segments with duration < min_speech_secs
  │
  ├─ Step 2: pre-merge consecutive segments with gap < min_silence_secs
  │          (cascades; takes max(end) on overlap)
  │
  ├─ Step 3: pack into chunks  ─┬─ mode = "greedy"
  │                              │     sequential append; split when next
  │                              │     would exceed max_chunk_secs OR gap > max_gap_secs
  │                              │
  │                              └─ mode = "longest_gap"
  │                                    recursive split at the longest gap
  │                                    until every chunk's span ≤ max_chunk_secs
  │
  ├─ Step 4: equal hard-split any chunk still longer than max_chunk_secs
  │          (only triggers when a single segment alone exceeds max_chunk_secs)
  │
  └─ Step 5: apply pad_onset_secs (clamped to ≥ 0) and pad_offset_secs
             output chunks: (start, end, seg_start_idx, seg_count)
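
The greedy path through this pipeline can be sketched in plain Python. This is a reading of the diagram, not the C implementation in native/src/chunking.cpp: merge_chunks_greedy is a hypothetical name, Step 4 (hard-split of a single over-long segment) is omitted, and the zero defaults mirror the Python convenience defaults mentioned under Defaults.

```python
def merge_chunks_greedy(segments, max_chunk_secs=30.0, max_gap_secs=float("inf"),
                        min_speech_secs=0.0, min_silence_secs=0.0,
                        pad_onset_secs=0.0, pad_offset_secs=0.0):
    """Steps 1, 2, 3 (greedy) and 5 of the diagram; input must be sorted."""
    # Step 1: drop segments shorter than min_speech_secs.
    kept = [(s, e) for s, e in segments if e - s >= min_speech_secs]

    # Step 2: pre-merge neighbours separated by less than min_silence_secs.
    merged = []
    for s, e in kept:
        if merged and s - merged[-1][1] < min_silence_secs:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))

    # Step 3: append sequentially; start a new chunk on overflow or a large gap.
    chunks = []  # each entry: [start, end, seg_start_idx, seg_count]
    for idx, (s, e) in enumerate(merged):
        if chunks:
            cur = chunks[-1]
            fits = e - cur[0] <= max_chunk_secs
            close = s - cur[1] <= max_gap_secs
            if fits and close:
                cur[1] = max(cur[1], e)
                cur[3] += 1
                continue
        chunks.append([s, e, idx, 1])

    # Step 5: pad, clamping the onset at zero.
    return [(max(0.0, s - pad_onset_secs), e + pad_offset_secs, i, n)
            for s, e, i, n in chunks]


print(merge_chunks_greedy([(0, 5), (8, 10), (20, 25)], max_chunk_secs=20))
```

Note that seg_start_idx counts positions in the list left after Steps 1 and 2, which is why the enumerate call runs over merged rather than over the raw input.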

Mode comparison

Property	`greedy` (default)	`longest_gap`
Strategy	Sequential append until next overflow	Recursive split at longest internal gap until each chunk fits `max_chunk_secs`
Honors `max_chunk_secs`	Yes — hard upper bound	Yes — recursion stops when chunk span ≤ `max_chunk_secs`
Boundary location	First overflow point	Longest pause inside the over-long span
Honors `max_gap_secs`	Yes — split at first `gap > max_gap_secs`	Yes — recursion continues until no internal gap exceeds `max_gap_secs`
Single seg > `max_chunk_secs`	Step 4 equal hard-split	Same — Step 4 fallback
Determinism	Deterministic	Deterministic; leftmost wins on tie
Recommended for	Whisper / whisperX-style ASR (fixed-length input, padded to 30s)	Variable-length-input models — forced alignment, TTS, encoder-style ASR. Splits at natural pauses; no fixed-length padding required.

Example with the same input, both modes (max_chunk_secs=20):

Input (max_chunk_secs = 20):
  seg 0 = (0, 5)
  seg 1 = (8, 10)     gap from seg 0 = 3
  seg 2 = (20, 25)    gap from seg 1 = 10   ← longer

greedy
  start cur = (0, 5)
  accept seg 1            → cur = (0, 10)   [length 10 ≤ 20 ✓]
  next seg 2 would_exceed:  25 - 0 = 25 > 20  → SPLIT
  chunks: [(0, 10, 0, 2), (20, 25, 2, 1)]

longest_gap
  span = 25 > 20            → must split
  longest gap = 10 at idx 1 → cut between seg 1 and seg 2
    left  = [seg 0, seg 1]  span = 10 ≤ 20 ✓ → keep
    right = [seg 2]         span = 5  ≤ 20 ✓ → keep
  chunks: [(0, 10, 0, 2), (20, 25, 2, 1)]

(In this minimal example both modes happen to agree. They diverge whenever the longest gap is not the first overflow point.)
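
For a case where the two modes diverge, the recursive split can be sketched as follows. This is a simplified illustration, not the library's code: split_longest_gap is a hypothetical name, it assumes sorted, non-overlapping segments (the state after Steps 1 and 2), and it ignores max_gap_secs, Step 4, and padding.

```python
def split_longest_gap(segments, max_chunk_secs):
    """Recursively cut at the longest gap until every chunk span fits."""

    def recurse(lo, hi):  # half-open index range [lo, hi)
        start, end = segments[lo][0], segments[hi - 1][1]
        if end - start <= max_chunk_secs or hi - lo == 1:
            return [(start, end, lo, hi - lo)]
        gaps = [segments[i + 1][0] - segments[i][1] for i in range(lo, hi - 1)]
        cut = lo + gaps.index(max(gaps)) + 1  # index() picks the leftmost on ties
        return recurse(lo, cut) + recurse(cut, hi)

    return recurse(0, len(segments)) if segments else []


# Same input as the worked example above.
print(split_longest_gap([(0, 5), (8, 10), (20, 25)], 20))

# Here the longest gap (10 s) comes before the first overflow point.
print(split_longest_gap([(0, 5), (15, 18), (19, 24)], 20))
```

For the second input, following the greedy trace by hand gives (0, 18) and (19, 24), because seg 1 still fits and only seg 2 overflows. The longest-gap sketch instead cuts at the 10 s pause and yields (0, 5) and (15, 24).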

`seg_start_idx` / `seg_count` semantics

These index into the post-Step-1+Step-2 view of the input — segments dropped by min_speech_secs and pre-merged by min_silence_secs are NOT in the indexing space. Both modes follow this convention.

Defaults

omni_chunk_config_default() in C (default_chunk_config() in Python, DEFAULT_CHUNK_CONFIG in TypeScript) returns:

field	default	source
`max_chunk_secs`	`30.0`	seconds; matches Whisper's 30s input window
`max_gap_secs`	`INFINITY`	disabled
`pad_onset_secs` / `pad_offset_secs`	`0.04` / `0.04`
`min_speech_secs`	`0.0`	pairs with VAD `min_speech_frames`
`min_silence_secs`	`0.20`	matches VAD `min_silence_frames=20` @ 10ms shift
`mode`	`OMNI_CHUNK_GREEDY`	backward-compatible

Heads-up — Python convenience defaults differ. The Python kwargs of merge_chunks(...) use zeros for pad_onset_secs, pad_offset_secs, min_silence_secs (so the simplest call gives raw output). To match the canonical defaults, use the values returned by default_chunk_config(). See tests/test_chunking.py::test_python_convenience_defaults_differ_from_canonical.

Whisper / WhisperX-style ASR pipeline

OmniVAD (whole-audio, batch) + merge_chunks(mode="greedy") is the 1:1 equivalent of WhisperX's Binarize(max_duration=chunk_size) + greedy packing. Use this recipe when feeding chunks into Whisper-family ASR models that expect a fixed 30s input window:

from omnivad import OmniVAD, merge_chunks

vad = OmniVAD()                              # threshold=0.4 default — safer for Whisper
result = vad.detect("long-audio.wav")        # whole-audio batch VAD

chunks = merge_chunks(
    timestamps=result["timestamps"],
    max_chunk_secs=30.0,                     # Whisper's input window
    mode="greedy",                           # WhisperX behavior
    pad_onset_secs=0.04,
    pad_offset_secs=0.04,
    min_silence_secs=0.20,                   # matches VAD min_silence_frames=20
)
# Each chunk: { start, end, seg_start_idx, seg_count }
# Slice the audio at [start, end] and feed each slice to Whisper.

Notes:

Keep the default threshold=0.4. Whisper tolerates extra padding silence but is sensitive to clipped word edges (raising to 0.5 risks dropping weak word-initial/final consonants and triggering hallucinations).
Do not use mode="longest_gap" here — that mode targets variable-length-input models (forced alignment, TTS), not WhisperX.
For very long audio (>1 hour), pass chunk_seconds=600, overlap_seconds=2 to vad.detect(...) to limit peak memory.
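
The final step of the recipe, slicing the audio at [start, end], comes down to converting seconds to sample indices at 16 kHz. A minimal sketch, assuming the audio is held as any indexable sequence of samples; slice_chunk is a hypothetical helper, not part of the omnivad API.

```python
def slice_chunk(samples, start_secs, end_secs, sample_rate=16_000):
    """Return the samples covering [start_secs, end_secs], clamped to the audio."""
    first = max(0, int(round(start_secs * sample_rate)))
    last = min(len(samples), int(round(end_secs * sample_rate)))
    return samples[first:last]


# Two seconds of placeholder audio; a real pipeline would pass decoded PCM here.
audio = list(range(32_000))
piece = slice_chunk(audio, 0.5, 1.0)
print(len(piece), piece[0])
```

Clamping matters because pad_offset_secs can push the last chunk's end slightly past the end of the file.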

Model Files

Prebuilt .omnivad bundles used by the Python package, TypeScript package, and local examples are already included in this repo under models/.

You only need to download upstream FireRedVAD checkpoints if you want to re-export ONNX or regenerate the native assets yourself.

# Download upstream PyTorch models + export to ONNX
pip install fireredvad
python -m fireredvad.bin.export_onnx --all

# Or download pre-exported ONNX models directly
# fireredvad_vad.onnx              — Non-stream VAD (2.3MB)
# fireredvad_aed.onnx              — Non-stream AED (2.3MB)
# fireredvad_stream_vad_with_cache.onnx — Stream VAD (2.2MB)

# For C/ncnn: convert ONNX → ncnn with pnnx
pip install pnnx
pnnx fireredvad_vad.onnx "inputshape=[1,100,80]"

Local Development

This section covers building OmniVAD from source and consuming the in-tree build from another project on the same machine — the loop you want when hacking on the C/C++ core, the Python wrapper, or the TS bindings.

Prerequisites

Target	Required	Notes
Python wheel	Python 3.10+, CMake 3.15+, a C++14 toolchain	`pip install -e .` runs scikit-build-core, which fetches ncnn automatically via CMake `FetchContent`.
Standalone C/C++ library	CMake 3.15+, a pre-installed ncnn (`brew install ncnn` or build from source)	`native/CMakeLists.txt` does not fetch ncnn — set `-DNCNN_ROOT=...` if it isn't on the default search path.
TypeScript bundle	Node 18+, pnpm	Builds `dist/index.{js,cjs,d.ts}` only — does not rebuild the WASM.
WASM module	emsdk (any recent version)	Required only when you change C/C++ code and need a fresh `dist/wasm/omnivad.wasm`.

Build the Python package (editable install)

pip install -e ".[dev]"

What this produces:

omnivad/libomnivad.{dylib,so,dll} — the shared library actually loaded at runtime by omnivad/_binding.py.
omnivad/models/*.omnivad — bundled model files (copied by CMake install(...)).
An editable entry in your environment's site-packages pointing back at the source tree.

When you change C/C++ code in native/, re-run pip install -e . to relink the dylib. (CMake's incremental build means this is fast.) Pure Python edits don't need a reinstall.

Build the TypeScript package

cd packages/omnivad
pnpm install
pnpm build          # tsup → dist/index.{js,cjs,d.ts}
pnpm typecheck      # tsc --noEmit

This step does not rebuild the WASM — it consumes whatever's already in dist/wasm/. If you only edited TS, you're done.

Build the WASM module (when you change C/C++)

EMSDK=/path/to/emsdk packages/omnivad/wasm/build.sh

The script writes omnivad.{js,cjs,wasm} directly into packages/omnivad/dist/wasm/. After this, re-run pnpm build only if you also changed TS.

The EMSDK env var must point at your emsdk root (the directory that contains emsdk_env.sh and upstream/emscripten/). The script aborts with a clear error if it's missing.

Consume the in-tree build from another repo

Python — `pip install -e <path>`

# In the target project's venv:
pip install -e /abs/path/to/OmniVAD-Kit          # editable, picks up your edits
# or, isolated wheel:
pip install /abs/path/to/OmniVAD-Kit             # builds and installs a fresh wheel

pip install -e is what you want for the dev loop — re-running it after a C/C++ edit relinks the dylib in place; pure Python edits are picked up without reinstalling.

TypeScript — three options, pick by use case

Option	Command	When to use
A. Tarball (closest to npm)	`cd packages/omnivad && pnpm pack` then in target: `pnpm add /abs/path/omnivad-0.2.8.tgz`	Verifying what real consumers will install. Clean, no symlink quirks.
B. `file:` protocol	In target `package.json`: `"omnivad": "file:../OmniVAD-Kit/packages/omnivad"`	In-tree monorepo-style consumption. Re-run `pnpm install` to pick up rebuilds.
C. Global link	`cd packages/omnivad && pnpm link --global` then in target: `pnpm link --global omnivad`	Fast iteration across many projects. Watch for peer/hoist quirks.

For all three, rebuild before testing:

cd packages/omnivad
pnpm build                                       # if only TS changed
EMSDK=/path/to/emsdk wasm/build.sh && pnpm build # if C/C++ changed

Full rebuild after a C/C++ change (cheat sheet)

# From the repo root:
pip install -e .                                       # Python dylib
EMSDK=/path/to/emsdk packages/omnivad/wasm/build.sh    # WASM (.wasm + glue)
( cd packages/omnivad && pnpm build )                  # TS bundle

Standalone C/C++ build (for native tests / embedding)

cd native
cmake -B build -DNCNN_ROOT=/path/to/ncnn   # only if ncnn isn't auto-discovered
cmake --build build -j$(nproc 2>/dev/null || sysctl -n hw.ncpu)
./build/test_all ../models ../tests/data/hello_en.wav

This is independent from the Python wheel build — the wheel uses CMake FetchContent to pull a pinned ncnn, while native/ expects a pre-installed one.

Lint / format

ruff check --fix . && ruff format .                    # Python (line-length 120)
( cd packages/omnivad && pnpm typecheck )              # TypeScript

Testing

# Run the full Python test suite
pip install -e ".[dev]"
pytest tests -v

# Utility scripts (not pytest — require external FireRedVAD models)
python tests/generate_reference.py            # Generate Python reference data
python tests/check_timestamp_accuracy.py      # Strict C vs Python comparison
python tests/vad_to_textgrid.py audio.wav     # Audio → TextGrid + RTF benchmark

Accuracy (C/ncnn vs Python, 5 audio files × 3 models):

Model	Timestamp Δ	Probability Δ	Status
VAD	≤ 0.020s	≤ 0.001	Exact match
AED (singing/music)	≤ 0.010s	≤ 0.013	Exact match
AED (speech)	≤ 0.030s	≤ 0.015	Match (ncnn fp16 edge cases on `event.wav`)
Stream-VAD (detect_full)	≤ 0.010s	≤ 0.001	Exact match

Project Structure

omnivad/
├── omnivad/                         # Python PyPI package
│   ├── __init__.py                  #   Public API: OmniVAD, OmniStreamVAD, OmniAED
│   ├── cli.py                       #   CLI entry point (omnivad command)
│   ├── _binding.py                  #   ctypes bindings to libomnivad
│   ├── vad.py                       #   OmniVAD (non-stream)
│   ├── stream_vad.py                #   OmniStreamVAD (real-time)
│   └── aed.py                       #   OmniAED (3-class)
├── native/                          # C/C++ library (ncnn backend)
│   ├── include/omnivad.h            #   Unified C API header
│   ├── src/omnivad.cpp              #   Core implementation
│   ├── frontend/                    #   Fbank/FFT/WAV (from FireRedVAD)
│   ├── test/                        #   4 test programs
│   └── CMakeLists.txt
├── packages/omnivad/                # TypeScript npm package
│   ├── src/
│   │   ├── vad.ts                   #   OmniVAD (non-stream)
│   │   ├── stream-vad.ts            #   OmniStreamVAD (real-time)
│   │   ├── aed.ts                   #   OmniAED (3-class)
│   │   ├── wasm-binding.ts          #   Emscripten/WASM bindings
│   │   ├── types.ts                 #   Public TypeScript types
│   │   ├── index.ts                 #   Package exports
│   │   └── wasm.d.ts                #   WASM module declarations
│   ├── package.json
│   └── tsconfig.json
└── tests/                           # Test suite
    ├── test_c_vs_python.py          #   Accuracy: omnivad vs Python reference
    ├── test_determinism.py          #   Repeated-run determinism
    ├── test_edge_cases.py           #   Edge cases: tiny/empty/silence inputs
    ├── smoke_test.py                #   CI smoke test (import + detect)
    ├── test_memory.sh               #   Native memory/leak checks
    ├── check_timestamp_accuracy.py  #   Strict C vs Python comparison (manual)
    ├── check_native.py              #   Native C binary validation (manual)
    ├── generate_reference.py        #   Generate Python reference data
    ├── vad_to_textgrid.py           #   Audio → TextGrid + RTF benchmark
    └── data/                        #   5 test audio files + reference JSON

Performance

RTF (Real-Time Factor) on Apple M-series, lower = faster:

Model	RTF	Speed
VAD	~0.003	~330x real-time
Stream-VAD	~0.002	~500x real-time
AED	~0.002	~500x real-time
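
The Speed column is simply the reciprocal of RTF. A small sketch of the arithmetic, with hypothetical helper names and made-up timing values used only to show the calculation:

```python
def rtf(processing_secs: float, audio_secs: float) -> float:
    """Real-time factor: time spent processing per second of audio."""
    return processing_secs / audio_secs


def speedup(rtf_value: float) -> float:
    """How many times faster than real time a given RTF is."""
    return 1.0 / rtf_value


# Example: 0.18 s of compute for 60 s of audio (illustrative numbers only).
example = rtf(0.18, 60.0)
print(round(example, 4), round(speedup(example)))
```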

Origin & Attribution

OmniVAD is a cross-platform deployment toolkit built on top of FireRedVAD, developed by Xiaohongshu (小红书). FireRedVAD provides high-quality Voice Activity Detection models and a lightweight Audio Event Detection model that can distinguish speech, singing, and music.

Original paper: FireRedVAD (arXiv:2603.10420)

What FireRedVAD provides: DFSMN-based models (~2.2MB each), Python inference code, PyTorch training, strong VAD benchmark results (FLEURS-VAD-102 F1: 97.57%).

What OmniVAD adds: Unified C API (ncnn backend) for native deployment, TypeScript/JavaScript npm package (ncnn WebAssembly) for browser and Node.js, cross-platform build system, comprehensive test suite with accuracy validation.

License

Apache-2.0 — same as the upstream FireRedVAD.

Credits

FireRedVAD — Kaituo Xu, Wenpeng Li, Kai Huang, Kun Liu (Xiaohongshu)
ncnn — Tencent
Emscripten — WebAssembly toolchain
