SenseVoice.cpp

「简体中文」|「English」

SenseVoice是具有音频理解能力的音频基础模型，包括语音识别（ASR）、语种识别（LID）、语音情感识别（SER）和声学事件分类（AEC）或声学事件检测（AED）。当前SenseVoice-small支持中、粤、英、日、韩语的多语言语音识别，情感识别和事件检测能力，具有极低的推理延迟。

本项目基于ggml推理框架。

1. 特性

基于ggml，不依赖其他第三方库, 致力于端侧部署
特征提取参考kaldi-native-fbank库，支持多线程特征提取。
可以使用flash attention解码（速度没有明显提升🤔不知道为啥）
支持Q3, Q4, Q5, Q6, Q8量化

1.1 未来计划

支持更多backend , 理论上来说，ggml支持以下后端，后续将会慢慢适配，欢迎贡献。

后端	平台	是否支持
CPU	All	✅
Metal	Apple Silicon	（im2col算子还有bug）
BLAS	All	✅
BLIS	All
SYCL	Intel and Nvidia GPU
MUSA	Moore Threads GPU
CUDA	Nvidia GPU	适配中
hipBLAS	AMD GPU
Vulkan	GPU
Cann	Ascend NPU

提升性能
修bug

2. 使用

直接下载模型或转换模型

可以直接从下面链接下载模型

huggingface modelscope

git lfs install
git clone https://huggingface.co/lovemefan/sense-voice-gguf.git
# 或从modelscope下载
git clone https://www.modelscope.cn/models/lovemefan/SenseVoiceGGUF.git

或许自行下载官方模型转换

# 下载官方模型
git lfs install
git clone https://www.modelscope.cn/iic/SenseVoiceSmall.git
# 转换模型
python scripts/convert-pt-to-gguf.py \
--model SenseVoiceSmall \
--output /path/to/export/gguf-fp32-sense-voice-small.bin \
--out_type f32

RUN

git clone https://github.com/lovemefan/SenseVoice.cpp
cd SenseVoice.cpp
git submodule sync && git submodule update --init --recursive

mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release .. && make -j 8

# -t means thread num， -t 指定线程数
./bin/sense-voice-main -m /path/gguf-fp16-sense-voice-small.bin /path/asr_example_zh.wav  -t 4 -ng

输出

当前使用sense-voice-f16模型输出

$./bin/sense-voice-main -m /data/code/SenseVoice.cpp/scripts/resources/gguf-fp16-sense-voice.bin /data/code/SenseVoice.cpp/scripts/resources/SenseVoiceSmall/example/asr_example_zh.wav  -t 4

sense_voice_small_init_from_file_with_params_no_state: loading model from '/data/code/SenseVoice.cpp/scripts/resources/gguf-fp16-sense-voice-small.bin'     
sense_voice_model_load: version:      3                                                                                                                     
sense_voice_model_load: alignment:   32 
sense_voice_model_load: data offset: 444480                                                                                                     
sense_voice_model_load: loading model                                                                                                                       
sense_voice_model_load: n_vocab = 25055                                                                                                                     
sense_voice_model_load: n_encoder_hidden_state = 512                                                                                                        
sense_voice_model_load: n_encoder_linear_units = 2048                                                                                                       
sense_voice_model_load: n_encoder_attention_heads  = 4                                                                                                      
sense_voice_model_load: n_encoder_layers = 50                                                                                                               
sense_voice_model_load: n_mels  = 80                                                                                                                        
sense_voice_model_load: ftype  = 1                                                                                                                          
sense_voice_model_load: vocab[25055] loaded 
sense_voice_model_load: CPU total size =   468.98 MB
sense_voice_model_load: n_tensors: 1197
sense_voice_model_load: load SenseVoiceSmall takes 0.213000 second 
sense_voice_init_state: compute buffer (encoder)   =   50.40 MB
sense_voice_init_state: compute buffer (decoder)   =   13.72 MB

system_info: n_threads = 4 / 256 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0

main: processing audio (88747 samples, 5.54669 sec) , 4 threads, 1 processors, lang = auto...

sense_voice_pcm_to_feature_with_state: calculate fbank and cmvn takes 7.207 ms
<|zh|><|NEUTRAL|><|Speech|><|withitn|>欢迎大家来体验达摩院推出的语音识别模型。
sense_voice_full_with_state: decoder audio use 1.011289 s, rtf is 0.182323.

感谢以下项目

本项目借用并模仿来自whisper.cpp 的大部分c++代码
参考来自funasr的paraformer模型结构以及前向计算 FunASR
本项目参考并借用 kaldi-native-fbank中的fbank特征提取算法。 FunASR 中的lrf + cmvn 算法
借用了大量的前期工作paraformer.cpp, paraformer.cpp项目后续将继续更新

Name		Name	Last commit message	Last commit date
Latest commit History 80 Commits
.github/workflows		.github/workflows
cmake		cmake
examples		examples
scripts		scripts
sense-voice		sense-voice
.clang-format		.clang-format
.gitignore		.gitignore
.gitmodules		.gitmodules
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README-EN.md		README-EN.md
README.md		README.md
sense-voice-small-ggml-graph.svg		sense-voice-small-ggml-graph.svg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SenseVoice.cpp

1. 特性

1.1 未来计划

2. 使用

直接下载模型或转换模型

RUN

输出

感谢以下项目

About

Releases

Packages

Languages

License

lovemefan/SenseVoice.cpp

Folders and files

Latest commit

History

Repository files navigation

SenseVoice.cpp

1. 特性

1.1 未来计划

2. 使用

直接下载模型或转换模型

RUN

输出

感谢以下项目

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages