This is the official implementation of the paper:
PhiNet: Speaker Verification with Phonetic Interpretability
Yi Ma, Shuai Wang, Tianchi Liu, Haizhou Li
Automatic speaker verification (ASV) systems typically lack the transparency required for high-accountability applications. Inspired by how human experts perform forensic speaker comparison (FSC), we propose PhiNet, a speaker verification network with phonetic interpretability, designed to enhance both local and global interpretability by leveraging phonetic evidence in decision-making.
- Local Interpretability: PhiNet provides detailed phonetic-level comparisons for each trial, revealing each phoneme's contribution to the verification decision, enabling manual inspection of speaker-specific features.
- Global Interpretability: PhiNet ranks phonemes based on their distinctiveness for speaker identification, helping researchers understand potential system biases.
- First self-interpretable speaker verification network that explains its decision-making process
- Dual interpretability through phoneme distinctiveness (local trial-level + global pattern-level)
- Training scheme that simulates the verification process, ensuring consistency between training and inference
- Achieves performance comparable to black-box ASV models (e.g., ECAPA-TDNN) while providing meaningful explanations
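The local-interpretability idea above can be illustrated with a toy sketch: given per-phoneme speaker representations for the two sides of a trial, score each shared phoneme separately and aggregate, so every phoneme's contribution to the decision is visible. This is an illustration of the concept, not PhiNet's actual scoring code; the embeddings and the mean-of-cosines aggregation are stand-ins.

```python
import math

def cosine(a, b):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def phoneme_level_score(emb_a, emb_b):
    """Score a trial from per-phoneme embeddings.

    emb_a, emb_b: dicts mapping phoneme label -> embedding vector
    (illustrative stand-ins for the model's internal representations).
    Returns the overall score and a per-phoneme breakdown, so each
    phoneme's contribution can be inspected manually.
    """
    shared = sorted(set(emb_a) & set(emb_b))
    contributions = {p: cosine(emb_a[p], emb_b[p]) for p in shared}
    score = sum(contributions.values()) / len(shared)
    return score, contributions

# Toy trial: the "AA" renderings match, the "IY" renderings are orthogonal.
emb_a = {"AA": [1.0, 0.0], "IY": [0.0, 1.0]}
emb_b = {"AA": [1.0, 0.0], "IY": [1.0, 0.0]}
score, contrib = phoneme_level_score(emb_a, emb_b)
print(contrib)  # per-phoneme evidence
print(score)
```

The per-phoneme breakdown is what makes the decision auditable: a human inspector can see which phonemes drove the accept/reject outcome.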
- Python 3.8+
- PyTorch 1.10+
- Other dependencies are the same as [WeSpeaker]
git clone https://github.com/mmmmayi/PhiNet.git
cd PhiNet
pip install -r requirements.txt

PhiNet is trained and evaluated on the following datasets:
| Dataset | Usage | Description |
|---|---|---|
| VoxCeleb1 | Training / Test | Celebrity speech from YouTube interviews |
| VoxCeleb2 | Training | Extended celebrity speech dataset |
| SITW | Test | Speakers in the Wild |
| LibriSpeech | Test | Read English speech from audiobooks |
| MUSAN | Augmentation | Music, speech, and noise corpus |
| RIR Noises | Augmentation | Room impulse responses |
Stage 1: Prepare VoxCeleb Data
Run stage 1 in examples/voxceleb/v2/run.sh. Modify the VoxCeleb2 data path in local/prepare_data.sh (stage 4) to your local directory:
# Edit local/prepare_data.sh, change the VoxCeleb2 path to yours
bash examples/voxceleb/v2/run.sh --stage 1

Stage 2: Extract Features
Run stage 2 with raw data type:
bash examples/voxceleb/v2/run.sh --stage 2 --data_type raw

Stage 3: Prepare Augmentation Data
Generate file lists for RIRS_NOISES and MUSAN datasets. Each file should list the paths to all audio samples (one per line). Refer to the existing rirs_list and musan_list files in the data directory for the format.
Stage 4: Configure Training Parameters
Modify stage 3 parameters in examples/voxceleb/v2/run.sh:
- `--reverb_data`: path to your RIR reverb data list
- `--noise_data`: path to your MUSAN noise data list
- `--pho_path`: path to phoneme alignment files (no change needed if using the default)
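Putting these flags together, a training launch might look like the following; the two list paths are placeholders for the files you prepared in Stage 3, and `--pho_path` can be dropped if the default applies:

```shell
bash examples/voxceleb/v2/run.sh --stage 3 \
  --reverb_data data/rirs_list \
  --noise_data data/musan_list
```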
Stage 5: Configure Phoneme Path
Update the phoneme file path at line 231 in wespeaker/dataset/processor.py to point to each sample's phoneme alignment file.
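How a sample maps to its alignment file depends on how you store the alignments. A hypothetical resolver is sketched below, assuming one alignment file per utterance under a single root; both `PHO_ROOT` and the `.ali` suffix are placeholders, not PhiNet defaults, so adjust the pattern to your layout before editing `processor.py`.

```python
import os

# Hypothetical layout: alignments stored flat under PHO_ROOT as "<utt_id>.ali".
PHO_ROOT = "data/phoneme_align"

def phoneme_path(utt_id, root=PHO_ROOT, suffix=".ali"):
    """Resolve an utterance id to its phoneme alignment file path."""
    return os.path.join(root, utt_id + suffix)

print(phoneme_path("id10001-xyz-00001"))
```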
Stage 6: Start Training
bash examples/voxceleb/v2/run.sh --stage 3

If you find this work useful, please cite:
@article{ma2025phinet,
title={PhiNet: Speaker Verification with Phonetic Interpretability},
author={Ma, Yi and Wang, Shuai and Liu, Tianchi and Li, Haizhou},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
year={2025}
}

- ExPO: Explainable Phonetic Trait-Oriented Network for Speaker Verification (our prior work)
- WeSpeaker: The speaker embedding learning toolkit this project is built upon
This project builds upon and is inspired by several open-source repositories. We are grateful to their authors and contributors for open-sourcing their code!