Dim-wise MDN: attempt to improve MDN-based models #44
Conversation
- Add dim_wise option; if True, features are modeled in a dimension-wise manner by dim-wise independent GMMs. This would probably work around the training difficulty on high-dimensional data.
- Compute log probs using the centered target
- Use lower log_pi_min/log_sigma_min. Set to log(1e-9)
- Code formatting fixes
I tried to show mathematically whether dim-wise MDN can be superior to conventional MDN, but couldn't. However, the sound generated with dim-wise MDN[1] is better than that of conventional MDN[2] (in particular, breath-related noise is reduced with dim-wise MDN). I used the same model as nit-song070/svs-world-conv-mdn, but with default settings, the backpropagation of pow in torch.distributions.Normal returned NaN in stage 3, just as in #39. It was solved by changing log_pi_min and log_sigma_min of mdn_loss from -20.0 to -7.0.
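The clipping fix described above can be sketched roughly as follows. This is a minimal illustration, not nnsvs's actual `mdn_loss` implementation: the function signature, argument names, and the assumed tensor shapes (`log_pi`: B x T x G, `mu`/`log_sigma`: B x T x G x D, `target`: B x T x D) are assumptions for the sketch.

```python
import math
import torch

def mdn_loss(log_pi, log_sigma, mu, target, log_pi_min=-7.0, log_sigma_min=-7.0):
    """Hypothetical sketch of an MDN negative log-likelihood with clipping.

    Clamping log_pi and log_sigma from below avoids exp() underflow and the
    NaN in the backward pass of pow() reported above.
    """
    log_pi = torch.clamp(log_pi, min=log_pi_min)
    log_sigma = torch.clamp(log_sigma, min=log_sigma_min)
    sigma = torch.exp(log_sigma)
    # Diagonal Gaussian log-likelihood, computed from the centered target
    # (target - mu), as in the commit message above
    z = (target.unsqueeze(-2) - mu) / sigma  # B x T x G x D
    log_prob = -0.5 * z ** 2 - log_sigma - 0.5 * math.log(2 * math.pi)
    log_prob = log_prob.sum(dim=-1)  # sum over feature dims -> B x T x G
    # Log-sum-exp over mixtures, weighted by the log mixture weights
    nll = -torch.logsumexp(log_pi + log_prob, dim=-1)  # B x T
    return nll.mean()
```

With log_pi_min/log_sigma_min at -20.0, exp(log_sigma) can get small enough that the gradient of the squared term blows up; raising the floor to -7.0 (or log(1e-9) as in the commit above) bounds that.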
I'll try other datasets.
I also tried dim-wise MDN with natsume_singing. The noise around breaths did not improve as much as I expected, but the sibilants generated with dim-wise MDN[1] seem to be better than those of conventional MDN[2].
log_sigma_min is a quality-sensitive parameter in Gaussian WaveNet, so I would like to keep it not too large. Let me try -9, which I found good for waveform modeling tasks. I used log_sigma_min = -9 for https://arxiv.org/abs/2010.14151
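The floor on log(sigma) directly bounds the smallest standard deviation a mixture component can take, which is why it is quality-sensitive: too high a floor forces every Gaussian to stay wide and over-smooths the output. A stdlib-only illustration of the candidate floors discussed here:

```python
import math

# Each clamp floor on log(sigma) implies a minimum standard deviation
# of exp(floor) for every mixture component.
for log_sigma_min in (-20.0, -9.0, -7.0):
    print(f"floor {log_sigma_min:6.1f} -> min sigma = {math.exp(log_sigma_min):.3e}")
```

So -9 permits components roughly an order of magnitude sharper than -7, while still keeping sigma far from the underflow regime that -20 allows.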
Thank you for sharing your results! Interesting, I haven't seen this kind of noise in my experiments. Also, thank you for reporting the numerical stability issues. I changed the clipping parameters: ed2b277
Seems like this avoids the strange noise in silence regions.
I've also confirmed this works okay for the oniku_kurumi database. Let us merge the PR and continue further improvements on follow-up PRs.
What
Here's my attempt to improve MDN-based models using dimension-wise 1-D GMMs.
As a bonus, I added several improvements to the MDN implementation following #39. Note that I haven't addressed the numerical stability problems yet.
Motivation
I noticed that MDN samples tend to be more over-smoothed (sometimes consonant sounds are not very clear) than those of MSE-loss based models. My hypothesis is that modeling high-dimensional data (199-dim in most recipes) with GMMs is difficult, especially if the number of mixtures is large, even if a diagonal covariance matrix is assumed. To alleviate the difficulty, I added an option to enable dimension-wise MDN. Instead of modeling the joint distribution with 199-dim GMMs, it uses a 1-D GMM for each feature dimension separately, which I suppose is easier to train and converges to better local optima.
In the implementation, mixture weights are predicted for each feature dimension (shape `B x T x G x D_out`, where B, T, G, and D_out are the batch size, number of frames, number of mixtures, and number of output dimensions), whereas the current MDN predicts them once for all features (shape `B x T x G`).

Note
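The shape difference can be illustrated with a minimal PyTorch sketch. The projection layers below are hypothetical stand-ins for the model's actual mixture-weight heads, used only to show the tensor shapes involved:

```python
import torch
import torch.nn as nn

B, T, G, D_out = 2, 10, 4, 199  # batch, frames, mixtures, output dims
hidden = 256
h = torch.randn(B, T, hidden)  # hidden features from the upstream network

# Conventional MDN: one mixture weight per (frame, mixture),
# shared across all feature dimensions
to_pi_joint = nn.Linear(hidden, G)
log_pi_joint = torch.log_softmax(to_pi_joint(h), dim=-1)  # B x T x G

# Dim-wise MDN: an independent set of mixture weights
# for every feature dimension
to_pi_dim = nn.Linear(hidden, G * D_out)
log_pi_dim = torch.log_softmax(
    to_pi_dim(h).view(B, T, G, D_out), dim=2)  # B x T x G x D_out
```

In the dim-wise case the softmax normalizes over the mixture axis (dim=2) independently for each of the D_out feature dimensions, so each 1-D GMM gets its own weighting.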
I haven't found any papers that do the same thing as my approach so far, so I might be wrong. Maybe I just cannot find good parameters for the normal MDN.
As far as I know, existing work uses the same MDN formulation as in #20 (by @taroushirani). So I believe the MDN implementation (#20) is correct. This is also supported by a similar loss curve (from the paper [1]):
Note for Conv1dResnetMDN: increasing the number of epochs from 50 to 200 improved the perceptual quality.
Results
To reproduce, try the following recipe:
MDN
(Assuming data preparation is finished)
Dim-wise MDN
Based on my perceptual testing, dim-wise MDN is better than the normal MDN. In particular, the /s/ sound is much clearer with dim-wise MDN.
@taroushirani What do you think of the new approach? Any comments are welcome!
I confirmed that dim-wise MDN works well for the NIT-song070 dataset, but I haven't tested the performance on other datasets. It would be great if any of you could check the performance on your dataset and share the results with us.
ref #20 , #39