This is the Git repository for the official PyTorch implementation of "AV-SepFormer: Cross-Attention SepFormer for Audio-Visual Target Speaker Extraction", accepted at ICASSP 2023.
📜[Full Paper] ▶[Demo] 💿[Checkpoint]
- Linux
- python >= 3.8
- Anaconda or Miniconda
- NVIDIA GPU + CUDA cuDNN (CPU is also supported)
Install Anaconda or Miniconda, then create the environment and install the required packages:

```shell
# Create and activate the conda environment
conda create --name av_sep python=3.8
conda activate av_sep
# Install required packages
pip install -r requirements.txt
```
Clone the repository:

```shell
git clone https://github.com/lin9x/AV-Sepformer.git
cd AV-Sepformer
```
The scripts for preprocessing the VoxCeleb2 dataset are the same as those in MuSE; you can go directly to that repository to preprocess your data. The audio-visual pairs used in our experiments are listed in data_list.
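The exact layout of the data_list files is defined by the MuSE preprocessing scripts. As a sketch only, assuming each line pairs a mixture with the target speaker's audio and video (the file names and column order below are assumptions, not the repository's actual format), such a list could be read like this:

```python
import csv
import io

# Hypothetical pair-list format: mixture_path,target_audio_path,target_video_path
# The real data_list files produced by the MuSE preprocessing may differ.
sample = io.StringIO(
    "mix/001.wav,s1/001.wav,s1/001.mp4\n"
    "mix/002.wav,s2/002.wav,s2/002.mp4\n"
)

# Parse each CSV row into a (mixture, target_audio, target_video) tuple
pairs = [tuple(row) for row in csv.reader(sample)]
for mixture, target_audio, target_video in pairs:
    print(mixture, target_audio, target_video)
```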
First, modify the configurations in config/avsepformer.yaml as needed for training.
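For orientation, a training configuration of this kind typically covers optimization, data, and model settings. The keys below are illustrative assumptions, not the actual schema of config/avsepformer.yaml; consult the file in the repository for the real options:

```yaml
# Hypothetical sketch -- key names are assumptions, check config/avsepformer.yaml
train:
  batch_size: 8          # reduce if you run out of GPU memory
  num_epochs: 100
  learning_rate: 1.0e-4
dataset:
  train_list: data_list/train.csv   # assumed path to the preprocessed pair list
  sample_rate: 16000
```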
Then you can run training:

```shell
source activate av_sep
CUDA_VISIBLE_DEVICES=0,1 python3 run_avsepformer.py run config/avsepformer.yaml
```
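If you prefer to restrict GPU visibility from inside a script rather than on the command line, the same environment variable can be set before any CUDA-aware library is imported; this is standard CUDA behavior, not specific to this repository:

```python
import os

# CUDA_VISIBLE_DEVICES must be set before the first import of torch (or any
# other CUDA-aware library); once CUDA initializes, the device list is fixed.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# import torch  # torch would now see only GPUs 0 and 1
```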
If you want to train other audio-visual speech separation systems, AV-ConvTasNet and MuSE are also available in this repo. Turn to the corresponding run_system.py and config/system.yaml to train your own model.
The data preparation follows the procedure in the MuSE GitHub repository.