Support diffusion-based acoustic models #175

Merged
r9y9 merged 47 commits into master from diffsinger on Nov 27, 2022

@r9y9 r9y9 commented Nov 27, 2022

Summary

Diffusion-related

  • Add diffusion-based acoustic models. The denoiser is the same as DiffSinger's, but it can now be combined with our multi-stream acoustic models. For example, an autoregressive F0 model can be combined with diffusion-based MGC/BAP/MEL prediction models.
  • Adjust the training script to support diffusion-based acoustic models. Currently, only NPSSMDNMultistreamParametricModel and MDNMultistreamSeparateF0MelModel can be configured with diffusion-based models.
  • Add as many comments as possible to the configs of multi-stream models, including the ones with diffusion models.

Thanks to the modular design of NNSVS, no changes were needed to the synthesis scripts, and only very small changes were needed to the training script.
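
For readers unfamiliar with diffusion models, the following is a minimal sketch of the generic DDPM-style training step that a denoiser is trained on. It is not the code in nnsvs/diffsinger: the linear noise schedule, its start and end values, the mean-squared-error loss, and all function names are assumptions chosen for illustration.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (illustrative values)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0); return x_t and the Gaussian noise that was added."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return xt, noise


def noise_prediction_loss(predicted_noise, noise):
    """Mean squared error between the denoiser output and the true noise."""
    return sum((p - e) ** 2 for p, e in zip(predicted_noise, noise)) / len(noise)


# One training step on a toy 3-dimensional "feature frame".
rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(100))
t = rng.randrange(len(alpha_bars))         # random diffusion step
x_t, eps = q_sample([1.0, -2.0, 0.5], t, alpha_bars, rng)
# A real denoiser would predict eps from (x_t, t, conditioning features);
# here a trivial all-zero prediction stands in for it.
loss = noise_prediction_loss([0.0] * len(eps), eps)
```

In a multi-stream setup such as the one described above, a step like this would apply only to the streams configured as diffusion-based (for example MGC/BAP or mel), while the F0 stream keeps its own autoregressive model.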

Samples: https://r9y9.github.io/projects/nnsvs/#bonus-samples

Recipe configs

This PR also updates some training configs based on my recent experiments.

  • Add training configs specifically for diffusion-based models
  • Update common configs for all the recipes (not necessarily related to diffusion models though)
  • Update SiFiGAN configs to use discrete F0 instead of continuous F0. I empirically found discrete F0 is better.
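
The difference between the two F0 representations in the last bullet can be illustrated with a small sketch. It assumes the common convention that a continuous F0 contour is obtained by interpolating over unvoiced frames, while a discrete F0 contour keeps zeros there. The helper names are hypothetical, and this is not the SiFiGAN or NNSVS preprocessing code.

```python
def to_continuous_f0(f0):
    """Fill unvoiced frames (f0 == 0) by linear interpolation between voiced neighbours."""
    voiced = [i for i, v in enumerate(f0) if v > 0]
    if not voiced:
        return list(f0)
    out = list(f0)
    # Extend the first and last voiced values to the sequence edges.
    for i in range(voiced[0]):
        out[i] = f0[voiced[0]]
    for i in range(voiced[-1] + 1, len(f0)):
        out[i] = f0[voiced[-1]]
    # Interpolate linearly inside each gap between consecutive voiced frames.
    for left, right in zip(voiced, voiced[1:]):
        gap = right - left
        for i in range(left + 1, right):
            w = (i - left) / gap
            out[i] = (1.0 - w) * f0[left] + w * f0[right]
    return out


def to_discrete_f0(continuous_f0, vuv):
    """Zero out unvoiced frames using a voiced/unvoiced flag per frame."""
    return [f if v else 0.0 for f, v in zip(continuous_f0, vuv)]


f0 = [0.0, 100.0, 0.0, 0.0, 130.0, 0.0]   # Hz, 0 marks unvoiced frames
vuv = [v > 0 for v in f0]
continuous = to_continuous_f0(f0)          # no zeros: gaps are interpolated
discrete = to_discrete_f0(continuous, vuv)  # zeros restored in unvoiced frames
```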

fixes #167

Limitations

  • Training is much slower than DiffSinger. There are several reasons: 1) our acoustic model uses frame-level autoregression, which is very slow; 2) our data loading mechanism is not optimized; 3) we train several models (i.e., the streams of a multi-stream model) at the same time. It is possible to speed up training, but these issues are difficult to address with my limited spare time at the moment, so I leave the speed-up to future work.
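
As a rough illustration of the first reason, the sketch below shows why frame-level autoregression is hard to parallelize: each frame's output feeds the next frame, so the loop over time must run sequentially. The step function is a hypothetical stand-in for a learned network, not the NNSVS F0 model.

```python
def autoregressive_rollout(inputs, step_fn, initial_output=0.0):
    """Generate one output per frame; frame t needs the output of frame t-1."""
    outputs = []
    prev = initial_output
    for x in inputs:  # T sequential steps: cannot be batched over time
        prev = step_fn(x, prev)
        outputs.append(prev)
    return outputs


def toy_step(x, prev):
    """Hypothetical one-pole smoother standing in for a neural network step."""
    return 0.5 * x + 0.5 * prev


result = autoregressive_rollout([1.0, 1.0, 1.0], toy_step)
print(result)  # [0.5, 0.75, 0.875]
```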

Notes on design choice

While the original DiffSinger used a self-attention-based encoder for the diffusion model, I decided to use a simpler encoder based on Sinsy's acoustic model architecture (FFConvLSTM). I found it works well with a significantly smaller memory footprint.

How to use

Please check recipes/namine_ritsu_utagoe_db/dev-48k-melf0 as an example.

config.yaml

Mel features:

acoustic_model: acoustic_nnsvs_melf0_ar_f0_diff_mel
acoustic_train: diffusion
acoustic_data: melf0_diffusion

WORLD:

acoustic_model: acoustic_nnsvs_world_multi_ar_f0_diff_mgcbap
acoustic_train: diffusion
acoustic_data: world_diffusion

You can also specify the above settings on the command line.
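
As a sketch of how the config keys map to command-line flags, the hypothetical helper below turns a dictionary of settings into a run.sh invocation. It assumes each key is passed as a flag of the same name with underscores replaced by hyphens, as in the commands used in this PR description; the helper itself is not part of NNSVS.

```python
import shlex


def build_run_command(stage, settings, script="./run.sh"):
    """Build an argument list that runs a single stage with config overrides."""
    args = [script, "--stage", str(stage), "--stop-stage", str(stage)]
    for key, value in settings.items():
        # acoustic_model -> --acoustic-model (assumed flag naming)
        args += ["--" + key.replace("_", "-"), str(value)]
    return args


mel_settings = {
    "acoustic_model": "acoustic_nnsvs_melf0_ar_f0_diff_mel",
    "acoustic_train": "diffusion",
    "acoustic_data": "melf0_diffusion",
}
command = build_run_command(4, mel_settings)
print(shlex.join(command))
```

Only the three settings listed above are covered here; other flags, such as a vocoder checkpoint path, would be added to the dictionary in the same way.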

Steps

Run the stages up to and including stage 3 as usual.

SiFi-GAN training

NOTE: Training for 200k steps is enough for testing purposes. Train for 600k steps only if you want to maximize performance.

Mel features:

CUDA_VISIBLE_DEVICES="0" ./run.sh  --stage 13 --stop-stage 13 --vocoder_model nnsvs_melf0_sifigan_sr48k

WORLD

CUDA_VISIBLE_DEVICES="0" ./run.sh  --stage 13 --stop-stage 13 --vocoder_model nnsvs_world_sifigan_sr48k

Training diffusion-based acoustic model

Mel features

CUDA_VISIBLE_DEVICES="0" ./run.sh  --stage 4 --stop-stage 4 --acoustic-model acoustic_nnsvs_melf0_ar_f0_diff_mel --pretrained-vocoder-checkpoint $PWD/exp/namine_ritsu/nnsvs_melf0_sifigan_sr48k/checkpoint-600000steps.pkl --acoustic-data melf0_diffusion --acoustic-train diffusion

WORLD:

CUDA_VISIBLE_DEVICES="0" ./run.sh  --stage 4 --stop-stage 4 --acoustic-model acoustic_nnsvs_world_multi_ar_f0_diff_mgcbap --pretrained-vocoder-checkpoint $PWD/exp/namine_ritsu/nnsvs_world_sifigan_sr48k/checkpoint-600000steps.pkl --acoustic-data world_diffusion --acoustic-train diffusion

Synthesize waveforms

Mel features

CUDA_VISIBLE_DEVICES="0" ./run.sh  --stage 6 --stop-stage 6 --acoustic-model acoustic_nnsvs_melf0_ar_f0_diff_mel --vocoder-eval-checkpoint $PWD/exp/namine_ritsu/nnsvs_melf0_sifigan_sr48k/checkpoint-600000steps.pkl --synthesis melf0_gv_usfgan

WORLD

CUDA_VISIBLE_DEVICES="0" ./run.sh  --stage 6 --stop-stage 6 --acoustic-model acoustic_nnsvs_world_multi_ar_f0_diff_mgcbap --vocoder-eval-checkpoint $PWD/exp/namine_ritsu/nnsvs_world_sifigan_sr48k/checkpoint-600000steps.pkl --synthesis world_gv_usfgan

@r9y9 r9y9 added this to the v0.1.0 release milestone Nov 27, 2022
@r9y9 r9y9 self-assigned this Nov 27, 2022
@r9y9 r9y9 changed the title from "Diffusion-based acoustic models" to "Support diffusion-based acoustic models" Nov 27, 2022
codecov-commenter commented Nov 27, 2022

Codecov Report

Merging #175 (1ed285c) into master (a6b6611) will increase coverage by 0.31%.
The diff coverage is 62.85%.

@@            Coverage Diff             @@
##           master     #175      +/-   ##
==========================================
+ Coverage   64.08%   64.39%   +0.31%     
==========================================
  Files          39       43       +4     
  Lines        5346     6005     +659     
==========================================
+ Hits         3426     3867     +441     
- Misses       1920     2138     +218     
Impacted Files Coverage Δ
nnsvs/acoustic_models/multistream.py 99.28% <ø> (+1.42%) ⬆️
nnsvs/acoustic_models/util.py 57.53% <0.00%> (-5.16%) ⬇️
nnsvs/diffsinger/pe.py 0.00% <ø> (ø)
nnsvs/train_util.py 8.68% <0.00%> (-0.14%) ⬇️
nnsvs/diffsinger/fs2.py 57.60% <57.60%> (ø)
nnsvs/diffsinger/diffusion.py 72.67% <72.67%> (ø)
nnsvs/diffsinger/denoiser.py 98.59% <98.59%> (ø)
nnsvs/base.py 90.47% <100.00%> (+0.47%) ⬆️
nnsvs/diffsinger/__init__.py 100.00% <100.00%> (ø)
... and 2 more

r9y9 commented Nov 27, 2022

Uploaded demo samples: https://r9y9.github.io/projects/nnsvs/#bonus-samples

r9y9 commented Nov 27, 2022

The tests are all green. Good to go.

@r9y9 r9y9 merged commit be6cb24 into master Nov 27, 2022
@r9y9 r9y9 deleted the diffsinger branch November 27, 2022 12:36
Development

Successfully merging this pull request may close these issues.

Diffusion-based acoustic models