**Original Repo: https://github.com/yxlllc/DDSP-SVC**

DDSP-SVC is an SVC project with fast training, which makes it well suited to the free tier of Colab: within a single usage limit you can train a model file to a usable level. Inference places very low demands on the quality of the input source. Its requirements for training data quality fall between those of diffsvc and sovits, that is, higher than sovits.

The Colab part of the code is borrowed from sovits' Colab.

**If you have no special needs, the default options in this notebook are optimal.**

Process the data in advance.

The data must be **mono, clean vocals only**. Whispered or breathy sounds are unsuitable (F0 is difficult to extract from them) and are best removed.

Split the audio into slices of 2 s to 10 s. Slices longer than 20 s may also work, since this project uses very little memory, but slices shorter than 2 s are not allowed. **About 1000 slices is recommended as sufficient; more may not be better.**

Remember to resample to 44.1 kHz during processing. Other sampling rates will run but greatly reduce efficiency. Remember to match the loudness as well.

**Adobe Audition™'s Loudness Matching feature can resample, convert the channels, and match the loudness in one pass.**

Because of the free-tier limit, it is best to preprocess locally and to avoid data that is not 44.1 kHz. With data of average quality, a model can be trained well within a single usage limit.
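
The requirements above (mono, 44.1 kHz, at least 2 s per slice) can be checked before uploading. The following is a minimal sketch using only Python's standard `wave` module, which reads uncompressed PCM files; the function name `check_slice` and the two constants are illustrative and are not part of DDSP-SVC.

```python
import wave

TARGET_RATE = 44100   # sampling rate recommended above
MIN_SECONDS = 2.0     # slices shorter than this are not allowed


def check_slice(path):
    """Return a list of problems found in one .wav slice (empty if none)."""
    problems = []
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        seconds = wav.getnframes() / rate
    if rate != TARGET_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {TARGET_RATE} Hz")
    if channels != 1:
        problems.append(f"{channels} channels, expected mono")
    if seconds < MIN_SECONDS:
        problems.append(f"only {seconds:.2f} s long, minimum is {MIN_SECONDS} s")
    return problems
```

Running it over every file in data/train/audio and printing the non-empty results gives a quick list of slices to fix or drop. Loudness matching is not covered by this sketch.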

If your dataset is not of high quality, set 'f0_extractor' to 'crepe' in the configuration file. The crepe algorithm has the best noise immunity, at the cost of a much longer data preprocessing time.

Dataset file structure:

Place all the training set data (.wav format audio slices) in data/train/audio

Put all the validation set data (.wav format audio slices) into data/val/audio

**It is recommended to use draw.py for the validation set extraction.**
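
To illustrate what extracting a validation set involves, here is a minimal sketch that moves a random sample of slices from the training folder to the validation folder. It is not the repository's draw.py, whose exact behavior is not described here; the function name `draw_validation`, its parameters, and the fixed seed are all assumptions made for this example.

```python
import random
import shutil
from pathlib import Path


def draw_validation(train_dir, val_dir, count, seed=0):
    """Move `count` randomly chosen .wav slices from train_dir to val_dir."""
    train_dir, val_dir = Path(train_dir), Path(val_dir)
    slices = sorted(train_dir.glob("*.wav"))
    if count > len(slices):
        raise ValueError(f"asked for {count} slices but only {len(slices)} exist")
    # A seeded generator makes the draw reproducible between runs.
    chosen = random.Random(seed).sample(slices, count)
    val_dir.mkdir(parents=True, exist_ok=True)
    for src in chosen:
        shutil.move(str(src), str(val_dir / src.name))
    return [p.name for p in chosen]
```

For example, `draw_validation("data/train/audio", "data/val/audio", count=10)` would move ten slices; the count of ten is only a placeholder.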

Pack the data folder into a zip archive named data.zip and upload it to the root directory of Google Drive.

Preprocessing locally is recommended to save time against the usage limit. After preprocessing, pack and upload the data folder in the same way as above.
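
Packing can also be done from Python. The sketch below uses the standard `shutil.make_archive` and keeps `data/` as the top-level folder inside the archive; the function name `pack_data` and its parameters are illustrative, and it assumes the data folder sits directly under the project root.

```python
import shutil
from pathlib import Path


def pack_data(project_root, archive_stem="data"):
    """Zip <project_root>/data into <project_root>/<archive_stem>.zip."""
    project_root = Path(project_root).resolve()
    if not (project_root / "data").is_dir():
        raise FileNotFoundError(f"no data folder under {project_root}")
    # root_dir plus base_dir keep "data/" as the top-level entry in the zip.
    return shutil.make_archive(
        str(project_root / archive_stem),
        "zip",
        root_dir=str(project_root),
        base_dir="data",
    )
```

The function returns the path of the archive it wrote, which can then be uploaded to the Drive root directory.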

Two training methods are provided: the "combsub-based model (combsub)" and the "sinusoidal additive synthesizer-based model (sin)". The latter gives weaker overall results than the former, but it is still offered as an option.

**Version 2.0 updates combsub. If you want to continue training a model from the previous version, use the combsub-old.yaml configuration file.**

Modifying hyperparameters such as "bs" to use more GPU memory may not improve efficiency (in my tests it actually decreased). With the default parameters, training runs at about 8.2 batches/s with almost no overfitting.
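
The throughput figure gives a rough way to estimate training time. The sketch below assumes that one batch corresponds to one training step and that throughput stays constant, neither of which is guaranteed; `training_hours` is a hypothetical helper, not part of the repository.

```python
def training_hours(steps, batches_per_second):
    """Estimate wall-clock hours needed to run `steps` batches."""
    if batches_per_second <= 0:
        raise ValueError("batches_per_second must be positive")
    return steps / batches_per_second / 3600.0


# At about 8.2 batches/s, 50,000 batches take roughly 1.7 hours.
print(f"{training_hours(50_000, 8.2):.1f} h")
```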

3/6: Added code for downloading the pre-trained base model. In testing, convergence was about 40% faster and there was almost no timbre leakage at 50k, so it is worth trying.

3/8: Added multi-speaker training. The dataset structure is shown below; single-speaker training can still use the previous dataset layout without any change. Multi-speaker training is enabled by modifying the 'n_spk' option in the configuration file.

```
# Training set
# 1st speaker
data/train/audio/1/aaa.wav
data/train/audio/1/bbb.wav
...
# 2nd speaker
data/train/audio/2/ccc.wav
data/train/audio/2/ddd.wav
...

# Validation set
# 1st speaker
data/val/audio/1/eee.wav
data/val/audio/1/fff.wav
...
# 2nd speaker
data/val/audio/2/ggg.wav
data/val/audio/2/hhh.wav
...
```
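
A quick way to confirm that a multi-speaker dataset follows this layout is to count the slices in each speaker folder. This is a minimal sketch; `speaker_counts` is a hypothetical helper, and it treats every subfolder of the audio directory as one speaker.

```python
from pathlib import Path


def speaker_counts(audio_root):
    """Map each speaker folder name to its number of .wav slices."""
    counts = {}
    for folder in sorted(Path(audio_root).iterdir()):
        if folder.is_dir():
            counts[folder.name] = len(list(folder.glob("*.wav")))
    return counts
```

Calling `speaker_counts("data/train/audio")` and comparing the number of keys with the intended 'n_spk' value catches misplaced folders before preprocessing starts.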

3/21: A small random-draw script was contributed earlier via PR. If you would rather not put together a validation set by hand, you can use it.

3/22: Updated for 2.0 and added explanatory text.

In [None]:
#@title Check which GPU was assigned ~~basically a T4~~
!nvidia-smi

In [None]:
#@title Clone the GitHub repository
!git clone https://github.com/yxlllc/DDSP-SVC

In [None]:
#@title Install dependencies
%cd /content/DDSP-SVC
!pip install pyworld praat-parselmouth torchcrepe einops local_attention wave

Two vocoder options are provided: the community-contributed vocoder trained by the OpenVPI team and the vocoder trained by fishaudio. **For DDSP-SVC there is little reason to use the latter, because EAK (enhancer_adaptive_key) is available.**

In [None]:
#@title Download the necessary files
#@markdown ##Select vocoder
way = "OpenVPI" #@param ["OpenVPI","fishaudio"]

!wget -P pretrain/hubert/ https://github.com/bshall/hubert/releases/download/v0.1/hubert-soft-0d54a1f4.pt
if way == "OpenVPI":
  !wget -P pretrain/ https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-v1/nsf_hifigan_20221211.zip
  !unzip -d /content/DDSP-SVC/pretrain /content/DDSP-SVC/pretrain/nsf_hifigan_20221211.zip
if way == "fishaudio":
  !wget -P pretrain/ https://github.com/fishaudio/fish-diffusion/releases/download/v2.0.0/nsf_hifigan-stable-v1.zip
  !unzip -d /content/DDSP-SVC/pretrain /content/DDSP-SVC/pretrain/nsf_hifigan-stable-v1.zip

In [None]:
#@title Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


File structure:

Place all the training set data (.wav format audio slices) into data/train/audio

Put all the validation set data (.wav format audio slices) into data/val/audio

In [None]:
#@title Get the dataset from Google Drive

#@markdown ###Modify the path and file name here

#@markdown ___

import os

zip_file_path = "\"/content/drive/MyDrive/data.zip\"" #@param {type:"string"}
# Extract into the repository root so the archive's data/ folder lands at /content/DDSP-SVC/data
!unzip $zip_file_path -d /content/DDSP-SVC

The settings of the draw script assume about 1000 slices; please increase "SAMPLE_MIN" if your dataset is too small.

In [None]:
#@title Extract the validation set (skip if already extracted)
!python draw.py

In [None]:
#@title Preprocess the data (skip if already processed)
#@markdown ##Select training method
way = "combsub" #@param ["combsub","combsub-old","sins"]

if way == "combsub":
  !python preprocess.py -c configs/combsub.yaml
if way == "combsub-old":
  !python preprocess.py -c configs/combsub-old.yaml
if way == "sins":
  !python preprocess.py -c configs/sins.yaml

To reuse the preprocessed dataset in a later training session without reprocessing it, rename "dataset.zip" to "data.zip".

In [None]:
#@title Pack/backup datasets
# Zip with a relative path (run from /content/DDSP-SVC) so the archive's top-level folder is data/
!zip -r dataset.zip data
!cp /content/DDSP-SVC/dataset.zip /content/drive/MyDrive/

In [None]:
#@title Set up model backup
#@markdown **Whether to back up the model to Google Drive. Colab sessions can be terminated at any time, so backing up is recommended. By default, models are saved to the DDSP-SVC folder in the Drive root directory.**
Save_to_drive = True #@param {type:"boolean"}
if Save_to_drive:
  !rm -rf /content/DDSP-SVC/exp
  !mkdir -p /content/drive/MyDrive/DDSP-SVC
  !ln -s /content/drive/MyDrive/DDSP-SVC /content/DDSP-SVC/exp

**Please change the 'n_spk' parameter in the configuration file to '2' when using the pre-trained model, regardless of whether you train a multi-speaker model or not.**

**If you want to train a model with more speakers, don't download the pre-trained model.**

~~If you need to train many multi-speaker models over a long period, you could train a base model yourself.~~

In [None]:
#@title Download pre-trained model (optional)
#@markdown ## Select training method
way = "combsub" #@param ["combsub","combsub-old"]
if way == "combsub":
  !wget -P exp https://github.com/yxlllc/DDSP-SVC/releases/download/2.0/opencpop+kiritan.zip
if way == "combsub-old":
  !wget -P exp https://github.com/yxlllc/DDSP-SVC/releases/download/1.1/opencpop+kiritan.zip
!mkdir /content/DDSP-SVC/exp/combsub-test/
!mkdir /content/DDSP-SVC/exp/sin-test/
!unzip -d /content/DDSP-SVC/exp /content/DDSP-SVC/exp/opencpop+kiritan.zip
!cp /content/DDSP-SVC/exp/opencpop/model_300000.pt /content/DDSP-SVC/exp/sin-test/
!cp /content/DDSP-SVC/exp/opencpop/model_300000.pt /content/DDSP-SVC/exp/combsub-test/

**Before training, open the "/DDSP-SVC/configs" folder in the "Files" tab in the left sidebar and edit "combsub.yaml" or "sins.yaml" (depending on your training method): on line 35, change "cache_device: 'cpu'" to "cache_device: 'cuda'". Also check that the 'n_spk' parameter is correct, then save via the "File" menu in the top bar.** This will further speed up training.
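
The same edit can be scripted instead of done by hand. The sketch below rewrites the `cache_device` value with a plain text substitution, which leaves the rest of the file, including comments, untouched. It assumes the key appears exactly once on its own line in the form `cache_device: 'cpu'`; `set_cache_device` is a hypothetical helper, not part of the repository.

```python
import re

# Matches a "cache_device:" line and its (optionally quoted) value.
_CACHE_DEVICE = re.compile(
    r"^([ \t]*cache_device:[ \t]*)(['\"]?)[^'\"\s#]+\2", re.MULTILINE
)


def set_cache_device(config_text, device):
    """Return config_text with the cache_device value replaced by `device`."""
    new_text, hits = _CACHE_DEVICE.subn(
        lambda m: f"{m.group(1)}'{device}'", config_text
    )
    if hits != 1:
        raise ValueError(f"expected exactly one cache_device entry, found {hits}")
    return new_text
```

In a notebook cell this could be applied by reading configs/combsub.yaml, passing its text through `set_cache_device(text, "cuda")`, and writing the result back. The 'n_spk' parameter still needs to be checked separately.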

**When using the base model, it may be possible to increase the value of "lr" on line 39 during the early stage of training. (Untested; you will need to judge this yourself. Do not change it if you are not sure.)**

**If your dataset is extremely large, caching it on the GPU may exhaust video memory. In that case you can leave cache_device unmodified, but reducing the amount of data is the more recommended option; more data may not be better.**

In [None]:
#@title Start training
#@markdown ##Select training method
way = "combsub" #@param ["combsub","combsub-old","sins"]

%load_ext tensorboard
%tensorboard --logdir exp

%cd /content/DDSP-SVC
if way == "combsub":
  !python train.py -c configs/combsub.yaml
if way == "combsub-old":
  !python train.py -c configs/combsub-old.yaml
if way == "sins":
  !python train.py -c configs/sins.yaml

In [None]:
#@title Inference (for multi-speaker models, modify the code yourself to specify the speaker)
#@markdown **Upload the processed ".wav" input file to the Drive root directory and set the following options**

#@markdown **Name of the ".wav" file (without the extension)**
input = "input" #@param {type:"string"}
input_path = "/content/drive/MyDrive/"
input_name =  input_path + input
model_path = "/content/drive/MyDrive/DDSP-SVC"
#@markdown **pitch adjustment**
keychange = "0"  #@param {type:"string"}
# Default: enhancer_adaptive_key = 0 gives higher audio quality within the normal vocal range
# Set enhancer_adaptive_key > 0 to adapt the enhancer to a higher vocal range
#@markdown **EAK setting**
enhancer_adaptive_key = "0"  #@param {type:"string"}
!python main.py -i {input_name}.wav -m {model_path}/{way}/model_best.pt -o {input_name}_result.wav -k {keychange} -eak {enhancer_adaptive_key}
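
To get a feel for the pitch adjustment value, the following sketch converts a shift in semitones into a frequency ratio. It assumes that the key change is measured in semitones of 12-tone equal temperament, which this notebook does not state; `pitch_ratio` is a hypothetical helper and is not used by main.py.

```python
def pitch_ratio(semitones):
    """Frequency ratio for a shift of `semitones` in 12-tone equal temperament."""
    return 2.0 ** (semitones / 12.0)


# +12 semitones doubles the frequency, -12 halves it, 0 leaves it unchanged.
for shift in (-12, 0, 12):
    print(shift, pitch_ratio(shift))
```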