
Speaker Recognition System


Overview

A high-performance speaker recognition system implemented in modern C++17, providing speaker registration, identification, and verification. This standalone implementation keeps dependencies minimal while maintaining production-grade performance, making it suitable for real-time speaker recognition tasks.

Key Features

  • 🎯 Speaker Identification: identify unknown speakers from audio samples
  • Speaker Verification: verify whether audio matches a specific known speaker
  • 📊 Multi-Sample Enrollment: register speakers with multiple audio samples for higher accuracy
  • 💾 Binary Database: efficient storage and retrieval of speaker embeddings
  • 🚀 Real-Time Processing: stream-based architecture for live audio processing
  • 🔧 Cross-Platform: supports Linux, macOS, Windows, and RISC-V architectures
  • 🎛️ Configurable Thresholds: adjustable similarity thresholds for different scenarios
  • 🎚️ Clean Audio Path: preserves the untrimmed, unclipped raw waveform for embedding inference
  • 🔄 Adaptive Resampling: built-in FIR anti-aliasing filtering with automatic mono/stereo downmixing
  • 📦 Minimal Dependencies: only requires ONNX Runtime and the standard library

Technical Highlights

  • Mel-Frequency Spectral Analysis: complete implementation of FFT, mel filterbank, and log compression
  • ONNX Runtime Integration: uses ONNX models to produce state-of-the-art speaker embeddings
  • Cosine Similarity Matching: efficient speaker matching with normalized embeddings
  • Optimized Performance: multi-threading support, vectorized operations, and a memory-efficient design
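The cosine-similarity matching mentioned above can be sketched as follows. `Normalize` and `CosineSimilarity` are illustrative helpers, not the project's actual API; the point is that once embeddings are L2-normalized, cosine similarity reduces to a plain dot product:

```cpp
#include <cmath>
#include <vector>

// L2-normalize an embedding in place so that cosine similarity
// between two embeddings becomes a simple dot product.
void Normalize(std::vector<float>& v) {
    float norm = 0.0f;
    for (float x : v) norm += x * x;
    norm = std::sqrt(norm);
    if (norm > 0.0f) {
        for (float& x : v) x /= norm;
    }
}

// Cosine similarity of two already-normalized embeddings of equal length.
float CosineSimilarity(const std::vector<float>& a, const std::vector<float>& b) {
    float dot = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) dot += a[i] * b[i];
    return dot;  // in [-1, 1]; higher means more similar
}
```

Normalizing once at enrollment time means each lookup costs only a dot product per registered speaker.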

Quick Start

Prerequisites

  • C++17-compatible compiler (GCC 7+, Clang 5+, MSVC 2017+)
  • CMake 3.14 or newer
  • CURL (for automatic model downloading)
  • PortAudio (for microphone enrollment)

安装步骤

# 克隆仓库
git clone https://github.com/muggle-stack/speaker_recognition.git
cd speaker-recognition

# 构建项目(macOS 示例)
mkdir build && cd build
cmake -DPORTAUDIO_ROOT=/opt/homebrew ..
cmake --build .

Basic Usage

1. Register a Speaker

# Register with a single audio file
./build/bin/register_speaker -n zhangsan zhangsan_voice.wav

# Register with multiple samples (recommended)
./build/bin/register_speaker -n zhangsan sample1.wav sample2.wav sample3.wav

# Register via live recording (automatically records 3 takes of 4 seconds each)
./build/bin/register_speaker -n zhangsan

# Use a custom database file
./build/bin/register_speaker -d company.db -n employee001 voice.wav

2. Identify a Speaker

# Identify the speaker in an audio file
./build/bin/identify_speaker unknown_voice.wav

# Show the top 5 matches with a 0.7 threshold
./build/bin/identify_speaker -n 5 -s 0.7 test.wav

# Verify a specific speaker
./build/bin/identify_speaker -v zhangsan test.wav

# List all registered speakers
./build/bin/identify_speaker -l

Configuration

Audio Requirements

Parameter             Specification
Format                WAV (PCM or IEEE float)
Sample Rate           Auto-resampled to 16 kHz (FIR anti-aliasing)
Channels              Mono/stereo automatically downmixed to mono
Minimum Duration      0.3 seconds
Recommended Duration  1-3 seconds
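The channel handling in the table can be sketched like this. `DownmixToMono` is a hypothetical helper, not the project's `wave_reader.cpp` code; it averages interleaved stereo samples pairwise into a single mono channel:

```cpp
#include <cmath>
#include <vector>

// Average interleaved samples (L, R, L, R, ...) into mono.
// Mono input is returned unchanged.
std::vector<float> DownmixToMono(const std::vector<float>& samples, int channels) {
    if (channels == 1) return samples;
    std::vector<float> mono;
    mono.reserve(samples.size() / channels);
    for (size_t i = 0; i + channels <= samples.size(); i += channels) {
        float sum = 0.0f;
        for (int c = 0; c < channels; ++c) sum += samples[i + c];
        mono.push_back(sum / channels);  // equal-weight average of all channels
    }
    return mono;
}
```

Averaging (rather than taking one channel) keeps energy from both channels and halves any uncorrelated noise.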

Similarity Thresholds

Threshold  Use Case    Notes
0.5-0.6    Permissive  Higher recall, more false positives
0.6-0.7    Balanced    Default setting (0.6)
0.7-0.8    Strict      Higher precision, more false negatives

Model Configuration

Default model: 3dspeaker_speech_campplus_sv_zh-cn_16k-common.onnx

  • Architecture: CAM++ (context-aware masking)
  • Input: mel spectrogram (80 mel bins)
  • Output: 192-dimensional speaker embedding
  • Language: optimized for Chinese; works for other languages

API Usage

C++ API Example

#include "speaker_recognition.hpp"

// Configure and create the embedder
EmbedderConfig config{
    .model_path = "~/.cache/speaker-recognition/3dspeaker_speech_campplus_sv_zh-cn_16k-common.onnx",
    .num_threads = 4,
    .debug = false,
    .provider = "cpu"
};
auto embedder = SpeakerEmbedder::Create(config);

// Create the speaker manager
auto manager = SpeakerManager::Create(embedder->GetEmbeddingDimension());

// Load an existing database
manager->LoadDatabase("speakers.db");

// Register a new speaker with multiple samples
std::vector<std::vector<float>> embeddings;
for (const auto& file : {"sample1.wav", "sample2.wav", "sample3.wav"}) {
    auto embedding = embedder->ComputeEmbeddingFromFile(file);
    if (!embedding.empty()) {
        embeddings.push_back(embedding);
    }
}
manager->RegisterSpeaker("Alice", embeddings);

// Identify an unknown speaker
auto test_embedding = embedder->ComputeEmbeddingFromFile("unknown.wav");
auto matches = manager->GetBestMatches(test_embedding, 0.6, 3);

for (const auto& match : matches) {
    std::cout << "Speaker: " << match.name
              << ", Similarity: " << match.score << std::endl;
}

// Save the database
manager->SaveDatabase("speakers.db");

Stream Processing

// Create a stream for real-time processing
auto stream = embedder->CreateStream();

// Feed audio samples (e.g., from a microphone)
while (capturing) {
    std::vector<float> samples = GetAudioChunk();
    stream->AcceptWaveform(16000, samples);

    if (embedder->IsStreamReady(stream.get())) {
        float rtf = 0.0f;
        auto embedding = embedder->ComputeEmbedding(stream.get(), rtf);
        auto matches = manager->GetBestMatches(embedding, 0.6f, 1);
        // Handle the results...
        stream = embedder->CreateStream();
    }
}

Performance

Operation               Time (ms)  RTF
Mel feature extraction  ~18        0.001
Model inference         ~233       0.023
Full pipeline           ~250       0.025

RTF = real-time factor (processing time / audio duration)
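As a sanity check on the table, the real-time factor is just processing time divided by audio duration, both in the same unit; the 10-second clip length below is an assumption for illustration:

```cpp
#include <cmath>

// Real-time factor: processing time divided by audio duration,
// both in milliseconds. Values below 1.0 mean faster than real time.
float RealTimeFactor(float processing_ms, float audio_ms) {
    return processing_ms / audio_ms;
}
```

A ~250 ms full pipeline on a 10-second clip gives an RTF of 0.025, i.e. roughly 40x faster than real time.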

Changelog

  • 2025-10-01
    • The recording path now preserves the raw, untrimmed waveform and uses it exclusively for embedding inference, fixing feature distortion caused by aggressive gain.
    • wave_reader.cpp gained FIR anti-aliasing resampling and automatic mono/stereo downmixing, so WAV files are automatically converted to 16 kHz mono input.
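The FIR anti-aliasing step mentioned above can be sketched as a Hamming-windowed sinc design. `DesignLowPass` is illustrative only, not the actual `wave_reader.cpp` implementation; it builds a symmetric low-pass filter whose taps are normalized for unity gain at DC:

```cpp
#include <cmath>
#include <vector>

// Design a low-pass FIR filter with a Hamming-windowed sinc.
// cutoff is normalized to the input sample rate, e.g. roughly
// 0.5 * 16000 / 44100 ≈ 0.18 when downsampling 44.1 kHz audio to 16 kHz.
std::vector<double> DesignLowPass(double cutoff, int num_taps) {
    const double kPi = 3.14159265358979323846;
    std::vector<double> taps(num_taps);
    const double M = num_taps - 1;
    double sum = 0.0;
    for (int n = 0; n < num_taps; ++n) {
        double x = n - M / 2.0;  // sample offset from the filter center
        double sinc = (x == 0.0) ? 2.0 * cutoff
                                 : std::sin(2.0 * kPi * cutoff * x) / (kPi * x);
        double window = 0.54 - 0.46 * std::cos(2.0 * kPi * n / M);  // Hamming
        taps[n] = sinc * window;
        sum += taps[n];
    }
    for (double& t : taps) t /= sum;  // normalize for unity gain at DC
    return taps;
}
```

Convolving with these taps before decimation removes frequencies above the new Nyquist limit, which is what prevents aliasing.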

Project Structure

speaker-recognition/
├── include/                    # Header files
│   ├── speaker_recognition.hpp # Main API
│   ├── mel_filterbank.hpp      # Audio features
│   ├── model_downloader.hpp    # Model caching helper
│   └── onnx_utils.hpp          # ONNX integration
├── src/                        # Implementation & CLI tools
│   ├── audio_recorder.cpp      # PortAudio-based recorder
│   ├── register_speaker.cpp    # CLI: enrollment (file/recording)
│   ├── identify_speaker.cpp    # CLI: identification/verification
│   ├── speaker_embedder.cpp    # Embedding generation
│   ├── speaker_manager.cpp     # Database management
│   ├── mel_filterbank.cpp      # Feature extraction
│   ├── mel_filterbank_optimized.cpp
│   ├── model_downloader.cpp    # Auto-download models
│   └── wave_reader.cpp         # Audio I/O
├── CMakeLists.txt              # Build configuration
├── compile.sh                  # Convenience build script
├── README.md                   # Documentation
├── test/                       # Sample wav files
└── build/                      # Generated binaries (after build)

License

This project is released under the Apache License 2.0. See the LICENSE file for details.

Acknowledgments

  • ONNX Runtime team for the inference engine
  • 3D-Speaker project for the pre-trained models
  • The original sherpa-onnx project for inspiration

