
Dim-wise MDN: attempt to improve MDN-based models #44

Merged · 10 commits · Nov 26, 2020
Conversation

@r9y9 (Collaborator) commented Nov 16, 2020

What

Here's my attempt to improve MDN-based models using dimension-wise 1-D GMMs.

As a bonus, I added several improvements to the MDN implementation following #39. Note that I haven't addressed the numerical stability problems yet.

Motivation

I noticed that MDN samples tend to be more over-smoothed (consonant sounds are sometimes not very clear) than those of MSE-loss based models. My hypothesis is that modeling high-dimensional data (199 dimensions in most recipes) with GMMs is difficult, especially if the number of mixtures is large, even when a diagonal covariance matrix is assumed. To alleviate this difficulty, I added an option to enable dimension-wise MDN. Instead of modeling the joint distribution with 199-dim GMMs, it uses a 1-D GMM for each feature dimension separately, which I suppose is easier to train and converges to better local optima.

In this implementation, mixture weights are predicted for each feature dimension (shape B x T x G x D_out, where B, T, G, and D_out are the batch size, number of frames, number of mixtures, and number of output dimensions), whereas the current MDN predicts them once for all features (shape B x T x G).
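The shape difference above can be illustrated with a minimal NumPy sketch (this is not the nnsvs code; all names and the random parameters are illustrative). Mixture weights are normalized over G independently for each feature dimension, and the log-likelihood is a sum of per-dimension 1-D mixture log-densities:

```python
import numpy as np

def logsumexp(x, axis):
    # Numerically stable log-sum-exp along one axis
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

B, T, G, D = 2, 5, 4, 199  # batch, frames, mixtures, output dims (199 as in most recipes)

# Dim-wise MDN: every parameter tensor has shape (B, T, G, D).
# Mixture weights are predicted per dimension, so log_pi normalizes over G for each d.
log_pi = np.log(np.random.dirichlet(np.ones(G), size=(B, T, D))).transpose(0, 1, 3, 2)
mu = np.random.randn(B, T, G, D)
log_sigma = np.random.randn(B, T, G, D) * 0.1
target = np.random.randn(B, T, D)

# Per-dimension 1-D Gaussian log-density, computed on the centered target
z = (target[:, :, None, :] - mu) / np.exp(log_sigma)
log_n = -0.5 * z**2 - log_sigma - 0.5 * np.log(2 * np.pi)

# Mix over G independently for each dimension, then sum log-probs over D
loglik = logsumexp(log_pi + log_n, axis=2).sum(axis=-1)  # shape (B, T)
print(loglik.shape)  # (2, 5)
```

In the baseline MDN, log_pi would instead have shape (B, T, G) and a single G-way mixture would cover the whole 199-dim vector.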

Note

Note that I could not find any papers that do the same thing as my approach so far, so I might be wrong. Maybe I just couldn't find good parameters for the normal MDN.

As far as I know, other implementations use the same MDN formulation as in #20 by @taroushirani, so I believe the MDN implementation (#20) is correct. This is also supported by a similar loss curve (from the paper [1]):

[Screenshot: loss curve comparison, 2020-11-16]

Note for Conv1dResnetMDN: increasing the number of epochs from 50 to 200 improved the perceptual quality.

Results

To reproduce, try the following recipe:

cd egs/nit-song070/svs-world-conv-mdn

MDN

(Assuming data preparation is finished)

CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 2 --stop-stage 6 --tag baseline --acoustic-model acoustic_baseline

Dim-wise MDN

CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 2 --stop-stage 6 --tag new-mdn --acoustic-model acoustic_mdn 

Based on my perceptual testing, dim-wise MDN is better than the normal MDN. In particular, the /s/ sound is much clearer with dim-wise MDN.

@taroushirani What do you think of the new approach? Any comments are welcome!

I confirmed that dim-wise MDN works well for the NIT-song070 dataset, but I haven't tested the performance on other datasets. It would be great if any of you could check the performance on your dataset and share the results with us.

ref #20 , #39

- Add dim_wise option; if True, features are modeled in a dimension-wise
manner by dimension-wise independent GMMs. This would probably work around the
training difficulty on high-dimensional data.
- Compute log probs using centered targets
- Use lower log_pi_min/log_sigma_min. Set to log(1e-9)
- Code formatting fixes
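The log_pi_min/log_sigma_min clipping mentioned above can be sketched as follows (a hypothetical helper for illustration; the actual names and call sites in nnsvs may differ). Clamping the predicted log-parameters from below keeps sigma strictly positive and bounds terms like (x - mu)^2 / sigma^2, which can otherwise produce nan in the backward pass (see #39):

```python
import numpy as np

# Hypothetical clipping helper mirroring the log_pi_min / log_sigma_min
# options discussed in this PR
def clip_mdn_params(log_pi, log_sigma,
                    log_pi_min=np.log(1e-9), log_sigma_min=np.log(1e-9)):
    # Clamp from below: very negative log_sigma means a near-zero variance,
    # which blows up the Gaussian log-density and its gradient.
    return np.maximum(log_pi, log_pi_min), np.maximum(log_sigma, log_sigma_min)

log_pi = np.array([-50.0, -3.0])
log_sigma = np.array([-40.0, 0.5])
clipped_pi, clipped_sigma = clip_mdn_params(log_pi, log_sigma)
print(clipped_pi, clipped_sigma)
```

As discussed later in this thread, the clip value is quality-sensitive: log(1e-9) is about -20.7, while values like -7 or -9 trade some modeling freedom for numerical stability.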
@taroushirani (Contributor) commented:
I tried to find a mathematical reason why dim-wise MDN can be superior to the conventional MDN, but I couldn't. However, the sound generated using dim-wise MDN [1] is better than that of the conventional MDN [2] (in particular, breath-related noise is reduced with dim-wise MDN).

I used the same models as those of nit-song070/svs-world-conv-mdn, but with the default settings the backpropagation of pow in torch.distributions.Normal returned nan in stage 3, the same as in #39. It was solved by changing log_pi_min and log_sigma_min of mdn_loss from -20.0 to -7.0.

  1. https://soundcloud.com/user-883019797/nnsvs_ofuton_p_utagoe_db_svs_world_conv_mdn_world_s_end_supernova_8vb_dim_wise_mdn
  2. https://soundcloud.com/user-883019797/nnsvs_ofuton_p_utagoe_db_svs_world_conv_mdn_world_s_end_supernova_8vb

I'll try other datasets.

@taroushirani (Contributor) commented:
I also tried dim-wise MDN with natsume_singing. The noise around breaths did not improve as much as I expected, but the sibilants generated using dim-wise MDN [1] seem better than those of the conventional MDN [2].

  1. https://soundcloud.com/user-883019797/nnsvs_natsume_singing_svs_world_conv_mdn_world_s_end_supernova_8vb_dim_wise_mdn
  2. https://soundcloud.com/user-883019797/nnsvs_natsume_singing_svs_world_conv_mdn_world_s_end_supernova_8vb

log sigma min is a quality-sensitive parameter in Gaussian WaveNet, so I would like to keep it not too large. Let me try -9, which I found good for waveform modeling tasks. (I used log sigma min -9 for https://arxiv.org/abs/2010.14151.)
@r9y9 (Collaborator, Author) commented Nov 17, 2020

Thank you for sharing your results! Interesting, I haven't seen this kind of noise in my experiments. Also, thank you for reporting the numerical stability issues. I changed the clipping parameters: ed2b277

@r9y9 (Collaborator, Author) commented Nov 26, 2020

I've also confirmed this works okay for the oniku_kurumi database. Let us merge the PR and continue further improvements on follow-up PRs.

@r9y9 r9y9 added the enhancement New feature or request label Nov 26, 2020
@r9y9 r9y9 merged commit a78b222 into master Nov 26, 2020
@r9y9 r9y9 deleted the mdn-improvements branch November 26, 2020 16:29