A high-performance speaker recognition system implemented in modern C++17, providing speaker registration, identification, and verification. This standalone implementation keeps dependencies minimal while maintaining production-grade performance for real-time speaker recognition tasks.
- 🎯 Speaker Identification: identify unknown speakers from audio samples
- ✅ Speaker Verification: verify whether audio matches a specific known speaker
- 📊 Multi-Sample Enrollment: register a speaker with multiple audio samples for improved accuracy
- 💾 Binary Database: efficient storage and retrieval of speaker embeddings
- 🚀 Real-Time Processing: stream-based architecture for live audio
- 🔧 Cross-Platform: runs on Linux, macOS, Windows, and RISC-V architectures
- 🎛️ Configurable Thresholds: adjustable similarity thresholds for different scenarios
- 🎚️ Clean Audio Path: preserves the unclipped, un-limited raw waveform for voiceprint inference
- 🔄 Adaptive Resampling: built-in FIR anti-aliasing filter with automatic mono/stereo downmixing
- 📦 Minimal Dependencies: requires only ONNX Runtime and the standard library
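The on-disk layout of the binary speaker database is not documented here. As a purely hypothetical illustration of the idea (a compact, length-prefixed record per speaker; the real `speakers.db` format may differ), one record could be serialized like this:

```cpp
#include <cassert>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical flat record: [u32 name length][name bytes][u32 dim][float32 x dim].
// Only illustrates a compact binary store for named embeddings; not the
// project's actual speakers.db format.
void WriteRecord(std::ostream& out, const std::string& name,
                 const std::vector<float>& emb) {
    const uint32_t name_len = static_cast<uint32_t>(name.size());
    const uint32_t dim = static_cast<uint32_t>(emb.size());
    out.write(reinterpret_cast<const char*>(&name_len), sizeof(name_len));
    out.write(name.data(), name_len);
    out.write(reinterpret_cast<const char*>(&dim), sizeof(dim));
    out.write(reinterpret_cast<const char*>(emb.data()),
              static_cast<std::streamsize>(dim * sizeof(float)));
}
```

Fixed-width length prefixes make records self-describing, so the whole file can be read back sequentially without an index.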
- Mel-Frequency Spectral Analysis: complete implementation of the FFT, mel filterbank, and log compression
- ONNX Runtime Integration: uses ONNX models to produce state-of-the-art speaker embeddings
- Cosine Similarity Matching: efficient speaker matching on normalized embeddings
- Optimized Performance: multi-threading support, vectorized operations, and a memory-efficient design
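Cosine similarity on L2-normalized embeddings reduces to a plain dot product. A minimal sketch of the matching metric (illustrative only, not the project's actual implementation):

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Cosine similarity between two equal-length embeddings. For vectors that are
// already L2-normalized this is just the dot product; the general form below
// also handles unnormalized input.
float CosineSimilarity(const std::vector<float>& a, const std::vector<float>& b) {
    const float dot = std::inner_product(a.begin(), a.end(), b.begin(), 0.0f);
    const float na = std::sqrt(std::inner_product(a.begin(), a.end(), a.begin(), 0.0f));
    const float nb = std::sqrt(std::inner_product(b.begin(), b.end(), b.begin(), 0.0f));
    return (na > 0.0f && nb > 0.0f) ? dot / (na * nb) : 0.0f;
}
```

The result lies in [-1, 1]; identical directions score 1, orthogonal embeddings score 0.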
- C++17-compatible compiler (GCC 7+, Clang 5+, MSVC 2017+)
- CMake 3.14 or newer
- CURL (for automatic model download)
- PortAudio (for microphone enrollment)
# Clone the repository
git clone https://github.com/muggle-stack/speaker_recognition.git
cd speaker_recognition
# Build the project (macOS example)
mkdir build && cd build
cmake -DPORTAUDIO_ROOT=/opt/homebrew ..
cmake --build .

# Register with a single audio file
./build/bin/register_speaker -n 张三 zhangsan_voice.wav
# Register with multiple samples (recommended)
./build/bin/register_speaker -n 张三 sample1.wav sample2.wav sample3.wav
# Register via live recording (automatically records 3 takes of 4 s each)
./build/bin/register_speaker -n 张三
# Use a custom database file
./build/bin/register_speaker -d company.db -n 员工001 voice.wav

# Identify a speaker from audio
./build/bin/identify_speaker unknown_voice.wav
# Show the top 5 matches with threshold 0.7
./build/bin/identify_speaker -n 5 -s 0.7 test.wav
# Verify a specific speaker
./build/bin/identify_speaker -v 张三 test.wav
# List all registered speakers
./build/bin/identify_speaker -l

| Parameter | Specification |
|---|---|
| Format | WAV (PCM or IEEE float); live recording supported |
| Sample Rate | Auto-resampled to 16 kHz (FIR anti-aliasing) |
| Channels | Mono/stereo, automatically downmixed to mono |
| Minimum Duration | 0.3 s |
| Recommended Duration | 1-3 s |
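The mono/stereo downmix in the table above is, conceptually, just a per-frame average of the two interleaved channels. A sketch of the idea (`wave_reader.cpp` has its own implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fold interleaved stereo samples [L0, R0, L1, R1, ...] into mono by averaging
// each left/right pair. Illustrative only.
std::vector<float> StereoToMono(const std::vector<float>& interleaved) {
    std::vector<float> mono(interleaved.size() / 2);
    for (std::size_t i = 0; i < mono.size(); ++i) {
        mono[i] = 0.5f * (interleaved[2 * i] + interleaved[2 * i + 1]);
    }
    return mono;
}
```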
| Threshold | Use Case | Notes |
|---|---|---|
| 0.5-0.6 | Permissive | Higher recall, more false positives |
| 0.6-0.7 | Balanced | Default setting (0.6) |
| 0.7-0.8 | Strict | Higher precision, more false negatives |
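How a threshold from the table turns ranked similarity scores into a decision can be sketched as follows (hypothetical helper for illustration, not part of the library API):

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Return the best-scoring speaker only if the score clears the threshold;
// otherwise report "unknown". Lowering the threshold accepts more matches
// (higher recall), raising it rejects more (higher precision).
std::string Decide(const std::vector<std::pair<std::string, float>>& scored,
                   float threshold = 0.6f) {
    const auto best = std::max_element(
        scored.begin(), scored.end(),
        [](const auto& a, const auto& b) { return a.second < b.second; });
    if (best == scored.end() || best->second < threshold) return "unknown";
    return best->first;
}
```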
Default model: 3dspeaker_speech_campplus_sv_zh-cn_16k-common.onnx
- Architecture: CAM++ (context-aware masking)
- Input: mel spectrogram (80 mel bands)
- Output: 192-dimensional speaker embedding
- Language: optimized for Chinese; works for other languages as well
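The 80-band mel filterbank places its filter edges uniformly on the mel scale. A common (HTK-style) conversion is shown below as an assumption; the project's mel_filterbank implementation may use a different variant:

```cpp
#include <cassert>
#include <cmath>

// HTK-style mel scale: mel = 2595 * log10(1 + hz / 700). Filterbank edges are
// spaced uniformly in mel and mapped back to Hz.
float HzToMel(float hz)  { return 2595.0f * std::log10(1.0f + hz / 700.0f); }
float MelToHz(float mel) { return 700.0f * (std::pow(10.0f, mel / 2595.0f) - 1.0f); }
```

The mel scale compresses high frequencies, so filters are dense at low frequencies where speech carries most speaker-discriminative detail.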
#include "speaker_recognition.hpp"
// Configure and create the embedder
EmbedderConfig config{
.model_path = "~/.cache/speaker-recognition/3dspeaker_speech_campplus_sv_zh-cn_16k-common.onnx",
.num_threads = 4,
.debug = false,
.provider = "cpu"
};
auto embedder = SpeakerEmbedder::Create(config);
// Create the speaker manager
auto manager = SpeakerManager::Create(embedder->GetEmbeddingDimension());
// Load an existing database
manager->LoadDatabase("speakers.db");
// Register a new speaker with multiple samples
std::vector<std::vector<float>> embeddings;
for (const auto& file : {"sample1.wav", "sample2.wav", "sample3.wav"}) {
auto embedding = embedder->ComputeEmbeddingFromFile(file);
if (!embedding.empty()) {
embeddings.push_back(embedding);
}
}
manager->RegisterSpeaker("Alice", embeddings);
// Identify an unknown speaker
auto test_embedding = embedder->ComputeEmbeddingFromFile("unknown.wav");
auto matches = manager->GetBestMatches(test_embedding, 0.6, 3);
for (const auto& match : matches) {
std::cout << "Speaker: " << match.name
<< ", Similarity: " << match.score << std::endl;
}
// Save the database
manager->SaveDatabase("speakers.db");

// Create a stream for real-time processing
auto stream = embedder->CreateStream();
// Feed audio samples (e.g., from a microphone)
while (capturing) {
std::vector<float> samples = GetAudioChunk();
stream->AcceptWaveform(16000, samples);
if (embedder->IsStreamReady(stream.get())) {
float rtf = 0.0f;
auto embedding = embedder->ComputeEmbedding(stream.get(), rtf);
auto matches = manager->GetBestMatches(embedding, 0.6f, 1);
// Handle results...
stream = embedder->CreateStream();
}
}

| Operation | Time (ms) | RTF |
|---|---|---|
| Mel feature extraction | ~10 | 0.001 |
| Model inference | ~233 | 0.023 |
| Full pipeline | ~250 | 0.025 |
RTF = real-time factor (processing time / audio duration)
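The RTF definition in code form, mostly to make the units explicit (trivial helper, not part of the library):

```cpp
#include <cassert>
#include <cmath>

// Real-time factor: processing time divided by audio duration (same units).
// RTF < 1 means the pipeline keeps up with live audio; e.g. 250 ms of work on
// 10 s of audio gives RTF 0.025.
float RealTimeFactor(float processing_ms, float audio_ms) {
    return processing_ms / audio_ms;
}
```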
- 2025-10-01:
  - The recording path now preserves the raw, unclipped waveform and uses it exclusively for voiceprint embedding inference, fixing the feature distortion caused by aggressive gain.
  - wave_reader.cpp adds FIR anti-aliasing resampling and automatic mono/stereo downmixing, so WAV files are converted to 16 kHz mono input automatically.
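As a rough illustration of FIR anti-aliasing before downsampling, the toy below low-pass filters with a windowed-sinc FIR at the new Nyquist frequency, then keeps every `factor`-th sample (an integer-factor decimator, e.g. 48 kHz → 16 kHz is a factor of 3). This is an assumption-laden sketch, not the actual wave_reader.cpp resampler:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Toy integer-factor decimator: windowed-sinc FIR low-pass at the new Nyquist,
// then keep every `factor`-th sample. Shows why anti-alias filtering must
// precede downsampling; not the project's actual resampler.
std::vector<float> Decimate(const std::vector<float>& in, int factor, int taps = 63) {
    const float kPi = 3.14159265358979f;
    const float fc = 0.5f / static_cast<float>(factor);  // normalized cutoff
    const int mid = taps / 2;
    std::vector<float> h(taps);
    float sum = 0.0f;
    for (int n = 0; n < taps; ++n) {
        const float x = static_cast<float>(n - mid);
        const float sinc = (n == mid) ? 2.0f * fc
                                      : std::sin(2.0f * kPi * fc * x) / (kPi * x);
        const float window = 0.54f - 0.46f * std::cos(2.0f * kPi * n / (taps - 1));  // Hamming
        h[n] = sinc * window;
        sum += h[n];
    }
    for (float& c : h) c /= sum;  // normalize to unity DC gain
    std::vector<float> out;
    out.reserve(in.size() / factor + 1);
    for (std::size_t i = 0; i < in.size(); i += static_cast<std::size_t>(factor)) {
        float acc = 0.0f;
        for (int n = 0; n < taps; ++n) {
            const long j = static_cast<long>(i) - mid + n;  // zero-pad at edges
            if (j >= 0 && j < static_cast<long>(in.size()))
                acc += h[n] * in[static_cast<std::size_t>(j)];
        }
        out.push_back(acc);
    }
    return out;
}
```

Without the low-pass step, any content above the new Nyquist (8 kHz for 16 kHz output) would fold back into the band as aliasing and corrupt the mel features.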
speaker-recognition/
├── include/                        # Header files
│   ├── speaker_recognition.hpp     # Main API
│   ├── mel_filterbank.hpp          # Audio features
│   ├── model_downloader.hpp        # Model caching helper
│   └── onnx_utils.hpp              # ONNX integration
├── src/                            # Implementation & CLI tools
│   ├── audio_recorder.cpp          # PortAudio-based recorder
│   ├── register_speaker.cpp        # CLI: enrollment (file/recording)
│   ├── identify_speaker.cpp        # CLI: identification/verification
│   ├── speaker_embedder.cpp        # Embedding generation
│   ├── speaker_manager.cpp         # Database management
│   ├── mel_filterbank.cpp          # Feature extraction
│   ├── mel_filterbank_optimized.cpp
│   ├── model_downloader.cpp        # Automatic model download
│   └── wave_reader.cpp             # Audio I/O
├── CMakeLists.txt                  # Build configuration
├── compile.sh                      # Convenience build script
├── README.md                       # Documentation
├── test/                           # Sample WAV files
└── build/                          # Generated binaries (after build)
This project is released under the Apache 2.0 License. See the LICENSE file for details.
- ONNX Runtime team for the inference engine
- 3D-Speaker project for the pre-trained models
- The original sherpa-onnx project for inspiration