
Dim-wise MDN: attempt to improve MDN-based models #44

Merged · 10 commits · Nov 26, 2020
Conversation

@r9y9 (Collaborator) commented Nov 16, 2020

What

Here's my attempt to improve MDN-based models using dimension-wise 1-D GMMs.

As a bonus, I added several improvements to the MDN implementation following #39. Note that I haven't addressed the numerical stability problems yet.

Motivation

I noticed that MDN samples tend to be more over-smoothed (consonant sounds are sometimes not very clear) than those of MSE-loss based models. My hypothesis is that modeling high-dimensional data (199 dimensions in most recipes) with GMMs is difficult, especially if the number of mixtures is large, even when a diagonal covariance matrix is assumed. To alleviate this difficulty, I added an option to enable dimension-wise MDN. Instead of modeling the joint distribution with 199-dim GMMs, it uses a 1-D GMM for each feature dimension separately, which I suppose is easier to train and converges to better local optima.

In this implementation, mixture weights are predicted for each feature dimension (shape B x T x G x D_out, where B, T, G, and D_out are the batch size, number of frames, number of mixtures, and number of output dimensions), whereas the current MDN predicts them once for all features (shape B x T x G).
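The shape difference above can be illustrated with a minimal NumPy sketch (this is not the nnsvs code; all names and the random parameters are illustrative). Mixture weights are normalized over G independently for each feature dimension, and the log-likelihood is a sum of per-dimension 1-D mixture log-densities:

```python
import numpy as np

def logsumexp(x, axis):
    # Numerically stable log-sum-exp along one axis
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

B, T, G, D = 2, 5, 4, 199  # batch, frames, mixtures, output dims (199 as in most recipes)

# Dim-wise MDN: every parameter tensor has shape (B, T, G, D).
# Mixture weights are predicted per dimension, so log_pi normalizes over G for each d.
log_pi = np.log(np.random.dirichlet(np.ones(G), size=(B, T, D))).transpose(0, 1, 3, 2)
mu = np.random.randn(B, T, G, D)
log_sigma = np.random.randn(B, T, G, D) * 0.1
target = np.random.randn(B, T, D)

# Per-dimension 1-D Gaussian log-density, computed on the centered target
z = (target[:, :, None, :] - mu) / np.exp(log_sigma)
log_n = -0.5 * z**2 - log_sigma - 0.5 * np.log(2 * np.pi)

# Mix over G independently for each dimension, then sum log-probs over D
loglik = logsumexp(log_pi + log_n, axis=2).sum(axis=-1)  # shape (B, T)
print(loglik.shape)  # (2, 5)
```

In the baseline MDN, log_pi would instead have shape (B, T, G) and a single G-way mixture would cover the whole 199-dim vector.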

Note

Note that I could not find any papers that do the same thing as my approach so far, so I might be wrong. Maybe I just couldn't find good parameters for the normal MDN.

As far as I know, other implementations use the same MDN formulation as in #20 by @taroushirani, so I believe the MDN implementation (#20) is correct. This is also supported by a similar loss curve (from the paper [1]):

[Screenshot: loss curve comparison, 2020-11-16]

Note for Conv1dResnetMDN: increasing the number of epochs from 50 to 200 improved the perceptual quality.

Results

To reproduce, try the following recipe:

cd egs/nit-song070/svs-world-conv-mdn

MDN

(Assuming data preparation is finished)

CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 2 --stop-stage 6 --tag baseline --acoustic-model acoustic_baseline

Dim-wise MDN

CUDA_VISIBLE_DEVICES=0 ./run.sh --stage 2 --stop-stage 6 --tag new-mdn --acoustic-model acoustic_mdn 

Based on my perceptual testing, dim-wise MDN is better than the normal MDN. In particular, the /s/ sound is much clearer with dim-wise MDN.

@taroushirani What do you think of the new approach? Any comments are welcome!

I confirmed that dim-wise MDN works well for the NIT-song070 dataset, but I haven't tested the performance on other datasets. It would be great if any of you could check the performance on your dataset and share the results with us.

ref #20 , #39

- Add dim_wise option; if True, features are modeled in a dimension-wise
manner by dimension-wise independent GMMs. This would probably work around the
training difficulty on high-dimensional data.
- Compute log probs using centered targets
- Use lower log_pi_min/log_sigma_min. Set to log(1e-9)
- Code formatting fixes
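The log_pi_min/log_sigma_min clipping mentioned above can be sketched as follows (a hypothetical helper for illustration; the actual names and call sites in nnsvs may differ). Clamping the predicted log-parameters from below keeps sigma strictly positive and bounds terms like (x - mu)^2 / sigma^2, which can otherwise produce nan in the backward pass (see #39):

```python
import numpy as np

# Hypothetical clipping helper mirroring the log_pi_min / log_sigma_min
# options discussed in this PR
def clip_mdn_params(log_pi, log_sigma,
                    log_pi_min=np.log(1e-9), log_sigma_min=np.log(1e-9)):
    # Clamp from below: very negative log_sigma means a near-zero variance,
    # which blows up the Gaussian log-density and its gradient.
    return np.maximum(log_pi, log_pi_min), np.maximum(log_sigma, log_sigma_min)

log_pi = np.array([-50.0, -3.0])
log_sigma = np.array([-40.0, 0.5])
clipped_pi, clipped_sigma = clip_mdn_params(log_pi, log_sigma)
print(clipped_pi, clipped_sigma)
```

As discussed later in this thread, the clip value is quality-sensitive: log(1e-9) is about -20.7, while values like -7 or -9 trade some modeling freedom for numerical stability.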
@taroushirani (Contributor) commented:
I tried to find a mathematical reason why dim-wise MDN can be superior to the conventional MDN, but I couldn't. However, the sound generated using dim-wise MDN [1] is better than that of the conventional MDN [2] (in particular, breath-related noise is reduced with dim-wise MDN).

I used the same models as those of nit-song070/svs-world-conv-mdn, but with the default settings the backpropagation of pow in torch.distributions.Normal returned nan in stage 3, the same as in #39. It was solved by changing log_pi_min and log_sigma_min of mdn_loss from -20.0 to -7.0.

  1. https://soundcloud.com/user-883019797/nnsvs_ofuton_p_utagoe_db_svs_world_conv_mdn_world_s_end_supernova_8vb_dim_wise_mdn
  2. https://soundcloud.com/user-883019797/nnsvs_ofuton_p_utagoe_db_svs_world_conv_mdn_world_s_end_supernova_8vb

I'll try other datasets.

@taroushirani (Contributor) commented:
I also tried dim-wise MDN with natsume_singing. The noise around breaths did not improve as much as I expected, but the sibilants generated using dim-wise MDN [1] seem better than those of the conventional MDN [2].

  1. https://soundcloud.com/user-883019797/nnsvs_natsume_singing_svs_world_conv_mdn_world_s_end_supernova_8vb_dim_wise_mdn
  2. https://soundcloud.com/user-883019797/nnsvs_natsume_singing_svs_world_conv_mdn_world_s_end_supernova_8vb

log sigma min is a quality-sensitive parameter in Gaussian WaveNet, so I would like to keep it not too large. Let me try -9, which I found good for waveform modeling tasks. (I used log sigma min -9 for https://arxiv.org/abs/2010.14151.)
@r9y9 (Collaborator, Author) commented Nov 17, 2020

Thank you for sharing your results! Interesting, I haven't seen this kind of noise in my experiments. Also, thank you for reporting the numerical stability issues. I changed the clipping parameters: ed2b277

@r9y9 (Collaborator, Author) commented Nov 26, 2020

I've also confirmed this works okay for the oniku_kurumi database. Let us merge the PR and continue further improvements on follow-up PRs.

@r9y9 r9y9 added the enhancement New feature or request label Nov 26, 2020
@r9y9 r9y9 merged commit a78b222 into master Nov 26, 2020
@r9y9 r9y9 deleted the mdn-improvements branch November 26, 2020 16:29