Support diffusion-based acoustic models #175

r9y9 · 2022-11-27T07:57:22Z

Summary

Diffusion-related

Add diffusion-based acoustic models. The denoiser is the same as DiffSinger's, but it can now be combined with our multi-stream acoustic models. For example, an autoregressive F0 model can be combined with diffusion-based MGC/BAP/MEL prediction models.
Adjust the training script to support diffusion-based acoustic models. Currently, only NPSSMDNMultistreamParametricModel and MDNMultistreamSeparateF0MelModel can be configured with diffusion-based models.
Add as many comments as possible to the configs of multi-stream models, including those with diffusion models.

Thanks to the modular design of NNSVS, no changes were needed to the synthesis scripts, and only minor changes were needed to the training script.
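
For readers new to diffusion models, here is a minimal, generic sketch of the forward (noising) step that a denoiser is trained against. It is not the code added in this PR: the linear beta schedule, its values, the step count, and all function names are illustrative assumptions, and plain Python lists stand in for tensors.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.06):
    """Return a linearly spaced noise schedule (values are illustrative)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Compute alpha_bar_t, the running product of (1 - beta_s) for s <= t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t from q(x_t | x_0) for one feature frame.

    Returns the noisy frame and the Gaussian noise that was mixed in.
    A denoiser is commonly trained to predict that noise from x_t, the
    step index t, and conditioning features.
    """
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]
    return xt, noise


betas = linear_beta_schedule(100)
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)
frame = [0.5, -1.0, 0.25]  # a toy 3-dimensional feature frame (hypothetical)
noisy, eps = q_sample(frame, t=99, alpha_bars=alpha_bars, rng=rng)
```

In a multi-stream setup such as the one described above, only the streams assigned to a diffusion model (for example MGC/BAP or mel) would go through this kind of process, while F0 can still come from a separate autoregressive model.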

Samples: https://r9y9.github.io/projects/nnsvs/#bonus-samples

Recipe configs

This PR also updates some training configs based on my recent experiments.

Add training configs specifically for diffusion-based models
Update common configs for all the recipes (not necessarily related to diffusion models though)
Update SiFiGAN configs to use discrete F0 instead of continuous F0. I empirically found that discrete F0 works better.
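
To illustrate the difference, the sketch below assumes that "discrete F0" means a contour that keeps zeros in unvoiced frames and that "continuous F0" means the same contour with unvoiced frames filled by interpolation. That reading, the helper name, and the toy values are assumptions for illustration, not taken from the recipe code.

```python
def to_continuous_f0(f0):
    """Linearly interpolate over unvoiced (zero) frames of an F0 contour."""
    voiced = [i for i, v in enumerate(f0) if v > 0]
    if not voiced:
        # Nothing to interpolate from; return a copy unchanged.
        return list(f0)
    out = list(f0)
    # Hold the first and last voiced values at the edges.
    for i in range(voiced[0]):
        out[i] = f0[voiced[0]]
    for i in range(voiced[-1] + 1, len(f0)):
        out[i] = f0[voiced[-1]]
    # Interpolate linearly inside each unvoiced gap.
    for left, right in zip(voiced, voiced[1:]):
        gap = right - left
        for i in range(left + 1, right):
            w = (i - left) / gap
            out[i] = (1.0 - w) * f0[left] + w * f0[right]
    return out


discrete_f0 = [0.0, 220.0, 0.0, 0.0, 250.0, 0.0]  # Hz, 0 marks unvoiced frames
continuous_f0 = to_continuous_f0(discrete_f0)
vuv = [1 if v > 0 else 0 for v in discrete_f0]  # voiced/unvoiced flags
```

With discrete F0 the voicing information is carried by the contour itself, whereas a continuous contour needs a separate voiced/unvoiced flag such as `vuv`.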

fixes #167

Limitations

Training is much slower than DiffSinger. There are several reasons: 1) our acoustic model uses frame-level autoregression, which is very slow, 2) our data loading mechanism is not optimized, and 3) we train several models (i.e., a multi-stream model) at the same time. It is possible to speed up training, but it is difficult for me to address these issues with my limited spare time at the moment. I leave the speed-ups for future work.

Notes on design choice

While the original DiffSinger used a self-attention-based encoder for the diffusion model, I decided to use a simpler encoder based on Sinsy's acoustic model architecture (FFConvLSTM). I found it works well with a significantly smaller memory footprint.

How to use

Please check recipes/namine_ritsu_utagoe_db/dev-48k-melf0 as an example.

config.yaml

Mel features:

acoustic_model: acoustic_nnsvs_melf0_ar_f0_diff_mel
acoustic_train: diffusion
acoustic_data: melf0_diffusion

WORLD:

acoustic_model: acoustic_nnsvs_world_multi_ar_f0_diff_mgcbap
acoustic_train: diffusion
acoustic_data: world_diffusion

You can also specify the above on the command line.
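
As a rough sketch, the two presets above can be turned into command-line overrides with a small helper. The helper and dictionary names are hypothetical; the keys and values are copied from the config.yaml settings above, and only options that appear in this description are used (the vocoder checkpoint option used in the steps below is left out for brevity).

```python
import shlex

# Preset values copied from the config.yaml settings shown above.
PRESETS = {
    "melf0": {
        "acoustic-model": "acoustic_nnsvs_melf0_ar_f0_diff_mel",
        "acoustic-train": "diffusion",
        "acoustic-data": "melf0_diffusion",
    },
    "world": {
        "acoustic-model": "acoustic_nnsvs_world_multi_ar_f0_diff_mgcbap",
        "acoustic-train": "diffusion",
        "acoustic-data": "world_diffusion",
    },
}


def build_overrides(feature_type):
    """Return run.sh style option/value pairs for the chosen feature type."""
    args = []
    for key, value in PRESETS[feature_type].items():
        args += [f"--{key}", value]
    return args


# Assemble the acoustic model training command (stage 4) for WORLD features.
cmd = ["./run.sh", "--stage", "4", "--stop-stage", "4"] + build_overrides("world")
command_line = shlex.join(cmd)
```

Printing `command_line` gives a single string that can be pasted into a shell from the recipe directory.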

Steps

Run everything up to stage 3 as usual.

SiFi-GAN training

NOTE: Training for 200k steps should be enough for testing purposes. Try 600k steps only if you want to maximize the performance.

Mel features:

CUDA_VISIBLE_DEVICES="0" ./run.sh  --stage 13 --stop-stage 13 --vocoder_model nnsvs_melf0_sifigan_sr48k

WORLD:

CUDA_VISIBLE_DEVICES="0" ./run.sh  --stage 13 --stop-stage 13 --vocoder_model nnsvs_world_sifigan_sr48k

Training diffusion-based acoustic model

Mel features:

CUDA_VISIBLE_DEVICES="0" ./run.sh  --stage 4 --stop-stage 4 --acoustic-model acoustic_nnsvs_melf0_ar_f0_diff_mel --pretrained-vocoder-checkpoint $PWD/exp/namine_ritsu/nnsvs_melf0_sifigan_sr48k/checkpoint-600000steps.pkl --acoustic-data melf0_diffusion --acoustic-train diffusion

WORLD:

CUDA_VISIBLE_DEVICES="0" ./run.sh  --stage 4 --stop-stage 4 --acoustic-model acoustic_nnsvs_world_multi_ar_f0_diff_mgcbap --pretrained-vocoder-checkpoint $PWD/exp/namine_ritsu/nnsvs_world_sifigan_sr48k/checkpoint-600000steps.pkl --acoustic-data world_diffusion --acoustic-train diffusion

Synthesize waveforms

Mel features:

CUDA_VISIBLE_DEVICES="0" ./run.sh  --stage 6 --stop-stage 6 --acoustic-model acoustic_nnsvs_melf0_ar_f0_diff_mel --vocoder-eval-checkpoint $PWD/exp/namine_ritsu/nnsvs_melf0_sifigan_sr48k/checkpoint-600000steps.pkl --synthesis melf0_gv_usfgan

WORLD:

CUDA_VISIBLE_DEVICES="0" ./run.sh  --stage 6 --stop-stage 6 --acoustic-model acoustic_nnsvs_world_multi_ar_f0_diff_mgcbap --vocoder-eval-checkpoint $PWD/exp/namine_ritsu/nnsvs_world_sifigan_sr48k/checkpoint-600000steps.pkl --synthesis world_gv_usfgan

codecov-commenter · 2022-11-27T08:06:56Z

Codecov Report

Merging #175 (1ed285c) into master (a6b6611) will increase coverage by 0.31%.
The diff coverage is 62.85%.

@@            Coverage Diff             @@
##           master     #175      +/-   ##
==========================================
+ Coverage   64.08%   64.39%   +0.31%     
==========================================
  Files          39       43       +4     
  Lines        5346     6005     +659     
==========================================
+ Hits         3426     3867     +441     
- Misses       1920     2138     +218

Impacted Files	Coverage Δ
nnsvs/acoustic_models/multistream.py	`99.28% <ø> (+1.42%)`	⬆️
nnsvs/acoustic_models/util.py	`57.53% <0.00%> (-5.16%)`	⬇️
nnsvs/diffsinger/pe.py	`0.00% <ø> (ø)`
nnsvs/train_util.py	`8.68% <0.00%> (-0.14%)`	⬇️
nnsvs/diffsinger/fs2.py	`57.60% <57.60%> (ø)`
nnsvs/diffsinger/diffusion.py	`72.67% <72.67%> (ø)`
nnsvs/diffsinger/denoiser.py	`98.59% <98.59%> (ø)`
nnsvs/base.py	`90.47% <100.00%> (+0.47%)`	⬆️
nnsvs/diffsinger/__init__.py	`100.00% <100.00%> (ø)`
... and 2 more

r9y9 · 2022-11-27T09:27:20Z

Uploaded demo samples: https://r9y9.github.io/projects/nnsvs/#bonus-samples

r9y9 · 2022-11-27T12:34:04Z

The tests are all green. Good to go.

r9y9 added 30 commits November 16, 2022 21:58

rename diffsinger_compat to diffsinger

dfdf6f8

Add diffsinger's license

818c5eb

Add denoiser module

bbf1b51

Add GaussianDiffusion

aa788b6

Add README

bc1d65a

minimal fixes to support training diffusion models

bfa854c

for now NPSSMDNMultistreamModel is supported

Add encoder parameter for GaussianDiffusion

a3fdc64

Fix eval for diffusion models

8dead86

Fix for mel model

0ec3513

Fix

324f08f

Fix warings

9d52aea

Remove default func

9d6147a

normalize and denormalize for diffusion model

aa858f1

Merge remote-tracking branch 'origin/master' into diffsinger

0b5a8a0

Merge remote-tracking branch 'origin/master' into diffsinger

a179d44

Merge remote-tracking branch 'origin/master' into diffsinger

de30be2

Make FS2's FFTBlocks usable by NNSVS

59948e9

fixup: fix for cuda

92401f8

Add workaround for plotting bug

99f9340

Merge remote-tracking branch 'origin/master' into diffsinger

bb15207

Merge remote-tracking branch 'origin/master' into diffsinger

fb07711

Merge remote-tracking branch 'origin/master' into diffsinger

edb273f

Merge remote-tracking branch 'origin/master' into diffsinger

56b19cb

add entries in init .pyt

5f039b6

Merge remote-tracking branch 'origin/master' into diffsinger

ece8c3e

Merge remote-tracking branch 'origin/master' into diffsinger

8e27144

Fixes for diffusion models

0021835

Add comments for MDN-based multi-stream models

fd48a1d

Add print for easier debugging

e2fb8cf

Add tests for diffusion models

cac54e7

r9y9 added 12 commits November 27, 2022 15:51

use eps 1e-6 for mel features following DiffSinger's settings

05b748d

sifigan: use discrete f0

006036f

rm v2

d8606e1

Add comments to model configs as many as possible

2c2e9ec

so one can easily configure the models also clarified which parts need to be adjusted when using custom hed

train_acoustic: remove finetuen.yaml as it is not used

dc83b4c

train_acoustic.train: add config for diffusion models

9c4dd7f

train_acoustic.data: update data configs

a849a03

Add configs of diffusion-based multistream models for mel and WORLD features

b41b9ef

Update all recipes

936d597

- use variance predictor mdn for time-lag - add comments - myconfig -> world

Add dev-48-melf0 recipe for namine ritsu

f82af2c

with diffusion-based acoustic model config

Fix wrong lf0_idx

b99a807

use diffusion model as default for ritsu's recipe

18c6b1e

r9y9 added the recipes, acoustic model, and new feature labels Nov 27, 2022

r9y9 added this to the v0.1.0 release milestone Nov 27, 2022

r9y9 self-assigned this Nov 27, 2022

fix typo

cc64876

r9y9 changed the title ~~Diffusion-based acoustic models~~ Support diffusion-based acoustic models Nov 27, 2022

r9y9 added 2 commits November 27, 2022 17:16

Fix dup code in tests

987a474

Fix lint

6cd5e08

expose norm_scale and add comments for it

1ed285c

This was referenced Nov 27, 2022

preprocess: Add workaround for out-liers #176

Merged

Recipe for opencpop database #105

Closed

r9y9 merged commit be6cb24 into master Nov 27, 2022

r9y9 deleted the diffsinger branch November 27, 2022 12:36

r9y9 mentioned this pull request Nov 27, 2022

Add recipes for the Opencpop corpus #177

Merged
