(In progress)
- (~2024.02) Develop an improved version of OS-KDFT
- (~2024.02) Check on other tasks (speech recognition, keyword spotting, etc.)
(Done)
- (2024.01.15) Upload evaluation scripts
This repository offers the source code for the following paper:
- Title : One-Step Knowledge Distillation and Fine-Tuning in Using Large Pre-Trained Self-Supervised Learning Models for Speaker Verification (Accepted at Interspeech2023)
- Authors : Jungwoo Heo, Chan-yeong Lim, Ju-ho Kim, Hyun-seo Shin, Ha-Jin Yu
We provide experiment scripts, trained models, and training logs.
The application of speech self-supervised learning (SSL) models has achieved remarkable performance in speaker verification (SV). However, there is a computational cost hurdle in employing them, which makes development and deployment difficult. Several studies have simply compressed SSL models through knowledge distillation (KD) without considering the target task. Consequently, these methods could not extract SV-tailored features. This paper suggests One-Step Knowledge Distillation and Fine-Tuning (OS-KDFT), which incorporates KD and fine-tuning (FT). We optimize a student model for SV during KD training to avert the distillation of information inappropriate for SV. OS-KDFT reduces the size of a Wav2Vec 2.0-based ECAPA-TDNN by approximately 76.2% and cuts the SSL model's inference time by 79%, while achieving an EER of 0.98%. The proposed OS-KDFT is validated on the VoxCeleb1 and VoxCeleb2 datasets and the W2V2 and HuBERT SSL models.
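The core idea can be sketched as a single training step that jointly optimizes a distillation loss and a speaker-verification loss. The snippet below is a rough illustration only, not the repository's actual training code; teacher, student, speaker_head, the L1 distance, and the weighting factor alpha are hypothetical placeholders.

import torch
import torch.nn.functional as F

def os_kdft_step(teacher, student, speaker_head, wav, spk_label, alpha=1.0):
    # Teacher: large pre-trained SSL model (e.g., Wav2Vec 2.0), kept frozen.
    with torch.no_grad():
        t_feat = teacher(wav)
    # Student: compressed SSL encoder trained to mimic the teacher.
    s_feat = student(wav)
    # Distillation term (the exact distance used in the paper may differ).
    kd_loss = F.l1_loss(s_feat, t_feat)
    # SV fine-tuning term: speaker classification through an ECAPA-TDNN-style back-end.
    logits = speaker_head(s_feat)
    sv_loss = F.cross_entropy(logits, spk_label)
    # One step: distillation and SV fine-tuning are optimized jointly.
    return kd_loss + alpha * sv_loss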
You can access the experimental code via the hyperlinks below.
Note that we provide our trained model weights and training logs (such as loss and validation results) for re-implementation. You can find these in the 'params' folder stored in each 'only evaluation' folder.
- HuBERT compression in speaker verification, EER 4.75% in VoxCeleb1 (train & evaluation, only evaluation)
- WavLM compression in speaker verification, EER 4.25% in VoxCeleb1 (train & evaluation, only evaluation)
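For reference, the equal error rate (EER) reported above is the operating point where the false acceptance and false rejection rates are equal. The snippet below is a minimal illustrative sketch of that computation (not the repository's evaluation script); it assumes trial scores and binary target/non-target labels given as NumPy arrays.

import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(scores, labels):
    # labels: 1 for target (same-speaker) trials, 0 for non-target trials
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # point where FAR and FRR are closest
    return (fpr[idx] + fnr[idx]) / 2       # EER as the average of the two rates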
Docker file summary
- Docker image: nvcr.io/nvidia/pytorch:23.08-py3
- Python: 3.8.12
- PyTorch: 2.1.0a0+29c30b1
- Torchaudio: 2.0.1
(We conducted experiments using 2 or 4 NVIDIA RTX A5000 GPUs)
Depending on the task you want to perform, you'll need the following datasets.
- Speaker verification: (VoxCeleb1) or (VoxCeleb2, MUSAN, RIR reverberation)
- Keyword spotting: To be updated
You can get the experimental code via the hyperlinks in the "What can I do in this repository?" section.
Set experimental arguments in the arguments.py file. Here is a list of system arguments to set (an illustrative example follows this list).
1. 'usable_gpu': {GPU_IDS} # ex) '0,1,2,3'
   'usable_gpu' selects which GPUs are used for training.
   input type is str
2. 'path_log': {YOUR_PATH}
   'path_log' is the path where experiment results are saved.
   input type is str
3. 'path_...': {YOUR_PATH}
   'path_...' is the path where the corresponding dataset is stored.
   input type is str
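For example, the relevant entries might look like the following. This is an illustrative sketch only; the dataset key names (e.g., 'path_VoxCeleb1', 'path_MUSAN') are hypothetical placeholders following the 'path_...' pattern, and the actual keys are defined in arguments.py.

# Illustrative values only; set the actual keys in arguments.py.
'usable_gpu'    : '0,1,2,3',                  # comma-separated GPU indices (str)
'path_log'      : '/path/to/experiment/logs', # where experiment results are saved (str)
'path_VoxCeleb1': '/path/to/VoxCeleb1',       # hypothetical dataset keys following 'path_...' (str)
'path_MUSAN'    : '/path/to/MUSAN',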
We provide a basic logger that stores information locally. However, if you would like to use an additional online logger (wandb or neptune):
- In arguments.py
# Wandb: Add 'wandb_user' and 'wandb_token'
# Neptune: Add 'neptune_user' and 'neptune_token'
# Input these arguments in the "system_args" dictionary, for example:
'wandb_user' : 'user-name',
'wandb_token' : 'WANDB_TOKEN',
'neptune_user' : 'user-name',
'neptune_token' : 'NEPTUNE_TOKEN'
- In main.py
# Just remove the "#" from the logger you want to use
logger = LogModuleController.Builder(args['name'], args['project'],
    ).tags(args['tags']
    ).description(args['description']
    ).save_source_files(args['path_scripts']
    ).use_local(args['path_log']
    #).use_wandb(args['wandb_user'], args['wandb_token']        # <- remove the leading '#' to use wandb
    #).use_neptune(args['neptune_user'], args['neptune_token']  # <- remove the leading '#' to use neptune
    ).build()
Just run main.py in scripts!
> python main.py
Please cite this paper if you make use of the code.
@inproceedings{heo23_interspeech,
  author={Jungwoo Heo and Chan-yeong Lim and Ju-ho Kim and Hyun-seo Shin and Ha-Jin Yu},
  title={{One-Step Knowledge Distillation and Fine-Tuning in Using Large Pre-Trained Self-Supervised Learning Models for Speaker Verification}},
  year={2023},
  booktitle={Proc. INTERSPEECH 2023},
  pages={5271--5275},
  doi={10.21437/Interspeech.2023-605}
}