Summary
It would be great to support FunASR and SenseVoice as speech-to-text backends in LiveKit Agents.
Why
- FunASR is an open-source, industrial-grade speech recognition toolkit with real-time streaming support and built-in punctuation restoration and inverse text normalization (ITN). It supports 50+ languages and can run fully offline.
- SenseVoice is a speech foundation model that excels at multilingual speech recognition, emotion recognition, and audio event detection, with very low latency (~70ms for 10s of audio with SenseVoice-Small).
- Both are Apache 2.0 licensed and available on HuggingFace.
Comparison with Whisper
| Feature | FunASR/SenseVoice | Whisper |
| --- | --- | --- |
| Real-time streaming | ✅ Native | ❌ Chunked |
| Punctuation & ITN | ✅ Built-in | ❌ Requires post-processing |
| Latency | ~70ms (SenseVoice-Small) | Higher |
| Languages | 50+ | 99 |
| License | Apache 2.0 | MIT |
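To make the "Chunked" entry in the streaming row concrete, here is a rough sketch of what chunked recognition usually looks like when a model has no native streaming mode: audio is buffered into fixed windows and the recognizer runs once per window, so no text can appear until a full window has arrived. Everything here is hypothetical and for illustration only: `chunked_transcribe`, the `Recognizer` type, and the 1-second window are assumptions, not part of LiveKit Agents, FunASR, SenseVoice, or Whisper.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical type: any function that turns raw 16-bit mono PCM bytes into text.
Recognizer = Callable[[bytes], str]


def chunked_transcribe(
    frames: Iterable[bytes],
    recognize: Recognizer,
    sample_rate: int = 16000,
    window_s: float = 1.0,
) -> Iterator[str]:
    """Buffer incoming audio frames and run the recognizer once per full window."""
    # 16-bit samples take 2 bytes each.
    window_bytes = int(sample_rate * window_s) * 2
    buf = bytearray()
    for frame in frames:
        buf.extend(frame)
        # Emit a transcript only when a whole window has been collected.
        while len(buf) >= window_bytes:
            chunk = bytes(buf[:window_bytes])
            del buf[:window_bytes]
            yield recognize(chunk)
    # Flush whatever audio is left when the input ends.
    if buf:
        yield recognize(bytes(buf))


if __name__ == "__main__":
    # Stand-in recognizer that just reports how many bytes it was given.
    fake_recognizer = lambda pcm: f"<{len(pcm)} bytes>"
    silence_20ms = bytes(640)  # 320 samples at 16 kHz, 16-bit mono
    for text in chunked_transcribe([silence_20ms] * 120, fake_recognizer):
        print(text)
```

With this approach the minimum delay before any text appears is the window length plus inference time. A backend with native streaming would not need to wait for a full window.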
References
Would be happy to help with integration if there is interest!