Singing Voice Conversion based on Whisper & neural source-filter BigVGAN

Built on technology from three major AI companies:

OpenAI's Whisper, trained on 680,000 hours of multilingual audio

NVIDIA's BigVGAN, which uses anti-aliasing for speech generation

Microsoft's adapter, for efficient fine-tuning

LoRA is not fully implemented in this project, but it can be found here: LoRA TTS & paper

Use the pre-trained model as the starting point for fine-tuning.

lora-svc-baker.mp4

Dataset preparation

Necessary pre-processing:

1 separate the vocals from the accompaniment, e.g. with UVR
2 cut the audio into clips shorter than 30 seconds (required by whisper), e.g. with slicer

Then put the dataset into the data_raw directory, following this file structure:

data_raw
├───speaker0
│   ├───000001.wav
│   ├───...
│   └───000xxx.wav
└───speaker1
    ├───000001.wav
    ├───...
    └───000xxx.wav
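
Because clips must be shorter than 30 seconds for whisper, it can help to check the dataset before preprocessing. The following is a minimal sketch and not part of this project: the helper name find_long_clips is hypothetical, and it assumes the data_raw/<speaker>/<clip>.wav layout shown above with uncompressed PCM WAV files that Python's standard wave module can read.

```python
import wave
from pathlib import Path


def find_long_clips(root, max_seconds=30.0):
    """Return (path, duration) pairs for WAV files at or above max_seconds.

    Hypothetical helper: assumes root/<speaker>/<clip>.wav and PCM WAV
    files readable by the standard wave module.
    """
    too_long = []
    for wav_path in sorted(Path(root).glob("*/*.wav")):
        with wave.open(str(wav_path), "rb") as wav_file:
            # Duration in seconds = number of frames / frames per second.
            duration = wav_file.getnframes() / wav_file.getframerate()
        if duration >= max_seconds:
            too_long.append((wav_path, duration))
    return too_long
```

Calling find_long_clips("data_raw") returns an empty list when every clip is short enough; any entries it does return are candidates for further slicing.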

Install dependencies

1 software dependency

pip install -r requirements.txt
2 download the timbre encoder, Speaker-Encoder by @mueller91, and put best_model.pth.tar into speaker_pretrain/
3 download the whisper multilingual medium model (make sure to download medium.pt) and put it into whisper_pretrain/

Tip: whisper is bundled with this project; do not install it separately, or it will conflict and raise an error
4 download the pre-trained model maxgan_pretrain_32K.pth and run a test

python svc_inference.py --config configs/maxgan.yaml --model maxgan_pretrain_32K.pth --spk ./configs/singers/singer0001.npy --wave test.wav
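
Before running the test command, the downloads above can be checked with a short script. This is only a sketch: the helper name missing_pretrained is hypothetical, and placing maxgan_pretrain_32K.pth in the repository root is an assumption inferred from the test command, which passes the file name without a directory.

```python
from pathlib import Path

# Files named in the installation steps, relative to the repository root.
# Assumption: maxgan_pretrain_32K.pth sits in the root, as the test command implies.
REQUIRED_FILES = (
    "speaker_pretrain/best_model.pth.tar",
    "whisper_pretrain/medium.pt",
    "maxgan_pretrain_32K.pth",
)


def missing_pretrained(repo_root="."):
    """Return the required files that are not present under repo_root."""
    root = Path(repo_root)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

An empty list from missing_pretrained() means all three files are where the commands in this README expect them.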

Data preprocessing

To run all of the steps below automatically, use this command:

python3 prepare/easyprocess.py

Or run them step by step, as follows:

1, re-sampling

generate audio with a sampling rate of 16000Hz

python prepare/preprocess_a.py -w ./data_raw -o ./data_svc/waves-16k -s 16000

generate audio with a sampling rate of 32000Hz

python prepare/preprocess_a.py -w ./data_raw -o ./data_svc/waves-32k -s 32000
2, use 16k audio to extract pitch

python prepare/preprocess_f0.py -w data_svc/waves-16k/ -p data_svc/pitch
3, use 16k audio to extract ppg

python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper
4, use 16k audio to extract timbre code

python prepare/preprocess_speaker.py data_svc/waves-16k/ data_svc/speaker
5, extract the singer code for inference

python prepare/preprocess_speaker_ave.py data_svc/speaker/ data_svc/singer
6, use 32k audio to generate training index

python prepare/preprocess_train.py
7, training file debugging

python prepare/preprocess_zzz.py -c configs/maxgan.yaml
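
The step-by-step commands above can also be driven from one script. This is a hedged sketch, not the project's easyprocess.py: the function names preprocessing_commands and run_all are hypothetical, and the argument lists simply repeat the commands listed in this section.

```python
import subprocess


def preprocessing_commands(config="configs/maxgan.yaml"):
    """Return the step-by-step preprocessing commands as argument lists."""
    return [
        ["python", "prepare/preprocess_a.py", "-w", "./data_raw",
         "-o", "./data_svc/waves-16k", "-s", "16000"],
        ["python", "prepare/preprocess_a.py", "-w", "./data_raw",
         "-o", "./data_svc/waves-32k", "-s", "32000"],
        ["python", "prepare/preprocess_f0.py", "-w", "data_svc/waves-16k/",
         "-p", "data_svc/pitch"],
        ["python", "prepare/preprocess_ppg.py", "-w", "data_svc/waves-16k/",
         "-p", "data_svc/whisper"],
        ["python", "prepare/preprocess_speaker.py", "data_svc/waves-16k/",
         "data_svc/speaker"],
        ["python", "prepare/preprocess_speaker_ave.py", "data_svc/speaker/",
         "data_svc/singer"],
        ["python", "prepare/preprocess_train.py"],
        ["python", "prepare/preprocess_zzz.py", "-c", config],
    ]


def run_all(commands, runner=subprocess.run):
    """Run each command in order; check=True stops at the first failure."""
    for argv in commands:
        runner(argv, check=True)
```

Running run_all(preprocessing_commands()) from the repository root executes the steps in the order given above. Stopping at the first failure matters because later steps read the files written by earlier ones.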

data_svc/
├── waves-16k
│   ├── speaker0
│   │   ├── 000001.wav
│   │   └── 000xxx.wav
│   └── speaker1
│       ├── 000001.wav
│       └── 000xxx.wav
├── waves-32k
│   ├── speaker0
│   │   ├── 000001.wav
│   │   └── 000xxx.wav
│   └── speaker1
│       ├── 000001.wav
│       └── 000xxx.wav
├── pitch
│   ├── speaker0
│   │   ├── 000001.pit.npy
│   │   └── 000xxx.pit.npy
│   └── speaker1
│       ├── 000001.pit.npy
│       └── 000xxx.pit.npy
├── whisper
│   ├── speaker0
│   │   ├── 000001.ppg.npy
│   │   └── 000xxx.ppg.npy
│   └── speaker1
│       ├── 000001.ppg.npy
│       └── 000xxx.ppg.npy
├── speaker
│   ├── speaker0
│   │   ├── 000001.spk.npy
│   │   └── 000xxx.spk.npy
│   └── speaker1
│       ├── 000001.spk.npy
│       └── 000xxx.spk.npy
└── singer
    ├── speaker0.spk.npy
    └── speaker1.spk.npy
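
After preprocessing, the tree above can be checked for gaps. The sketch below is illustrative only: missing_features is a hypothetical helper, and the folder names and file suffixes are taken from the tree rather than from the project's code.

```python
from pathlib import Path

# Feature folder -> file suffix, as shown in the data_svc tree.
FEATURES = {"pitch": ".pit.npy", "whisper": ".ppg.npy", "speaker": ".spk.npy"}


def missing_features(data_svc="data_svc"):
    """List expected feature files that are absent for any 16k clip."""
    root = Path(data_svc)
    missing = []
    for wav_path in sorted((root / "waves-16k").glob("*/*.wav")):
        speaker, stem = wav_path.parent.name, wav_path.stem
        for folder, suffix in FEATURES.items():
            expected = root / folder / speaker / (stem + suffix)
            if not expected.is_file():
                missing.append(expected)
    return missing
```

If missing_features() returns an empty list, every clip in waves-16k has its pitch, ppg and timbre files.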

Train

0, to fine-tune from the pre-trained model, first download it: maxgan_pretrain_32K.pth

set pretrain: "./maxgan_pretrain_32K.pth" in configs/maxgan.yaml, and adjust the learning rate appropriately, e.g. to 1e-5
1, start training

python svc_trainer.py -c configs/maxgan.yaml -n svc
2, resume training

python svc_trainer.py -c configs/maxgan.yaml -n svc -p chkpt/svc/***.pth
3, view logs

tensorboard --logdir logs/
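
When resuming, the checkpoint path has to be filled in by hand. One way to pick the newest checkpoint is sketched below. This is an assumption-laden example, not project code: latest_checkpoint and resume_command are hypothetical names, and the sketch accepts both .pt and .pth files because both extensions appear in this README.

```python
from pathlib import Path


def latest_checkpoint(checkpoint_dir="chkpt/svc"):
    """Return the most recently modified checkpoint path, or None if absent.

    Assumption: checkpoint files end in .pt or .pth.
    """
    folder = Path(checkpoint_dir)
    if not folder.is_dir():
        return None
    candidates = [p for p in folder.iterdir() if p.suffix in (".pt", ".pth")]
    if not candidates:
        return None
    # Newest file by modification time.
    return max(candidates, key=lambda p: p.stat().st_mtime)


def resume_command(checkpoint, config="configs/maxgan.yaml", name="svc"):
    """Build the resume-training command for a given checkpoint path."""
    return ["python", "svc_trainer.py", "-c", config, "-n", name,
            "-p", str(checkpoint)]
```

Modification time is used instead of parsing the file name, since the naming scheme of the checkpoints is not stated here.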

Inference

To use a GUI that runs all of the commands below, use this command:

python3 svc_gui.py

Or run them step by step, as follows:

1, export inference model

python svc_export.py --config configs/maxgan.yaml --checkpoint_path chkpt/svc/***.pt
2, use whisper to extract the content encoding as a separate step, instead of one-click inference, to reduce GPU memory usage

python whisper/inference.py -w test.wav -p test.ppg.npy
3, extract the F0 parameters to a CSV text file

python pitch/inference.py -w test.wav -p test.csv
4, specify parameters and infer

python svc_inference.py --config configs/maxgan.yaml --model maxgan_g.pth --spk ./data_svc/singer/your_singer.spk.npy --wave test.wav --ppg test.ppg.npy --pit test.csv

When --ppg is specified, repeated inference on the same audio does not extract the content encoding again; if it is not specified, the encoding is extracted automatically.

When --pit is specified, a manually tuned F0 file can be loaded; if it is not specified, F0 is extracted automatically.

The output file svc_out.wav is written to the current directory.

args    --config     --model     --spk    --wave      --ppg     --pit       --shift
name    config path  model path  speaker  wave input  wave ppg  wave pitch  pitch shift
5, post-process with VAD

python svc_inference_post.py --ref test.wav --svc svc_out.wav --out svc_post.wav
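
The optional arguments of svc_inference.py can be assembled programmatically. The sketch below is a hypothetical helper (build_inference_command is not part of the project) that only uses the flags listed above; passing --shift as a plain number is an assumption, since its unit is not stated here.

```python
def build_inference_command(model, spk, wave, ppg=None, pit=None, shift=None,
                            config="configs/maxgan.yaml"):
    """Build the svc_inference.py argument list from the documented flags.

    Optional arguments are left out when None; this README says the content
    encoding and F0 are then extracted automatically.
    """
    argv = ["python", "svc_inference.py", "--config", config,
            "--model", model, "--spk", spk, "--wave", wave]
    for flag, value in (("--ppg", ppg), ("--pit", pit), ("--shift", shift)):
        if value is not None:
            argv += [flag, str(value)]
    return argv
```

For example, build_inference_command("maxgan_g.pth", "your_singer.npy", "test.wav", ppg="test.ppg.npy") yields a list that can be passed to subprocess.run, reusing a previously extracted content encoding.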

Source of code and References

Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers

AdaSpeech: Adaptive Text to Speech for Custom Voice

https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/01-nsf

https://github.com/mindslab-ai/univnet [paper]

https://github.com/openai/whisper/ [paper]

https://github.com/NVIDIA/BigVGAN [paper]

PlayVoice/lora-svc