Audio quality improvements #21

Open
janvainer opened this issue Mar 28, 2021 · 6 comments

Comments

@janvainer

Hi, awesome contribution to the TTS community :) I am wondering, did you manage to train a model with higher audio quality than the pretrained checkpoint provided with this repo? The audio samples seem to have lower quality than the ones presented in the paper. Any ideas what might be missing?

I am now training the model from scratch, and the audio samples are still very noisy (approx. 12 hours on 2 GPUs, batch size 128). It is getting better, but I am curious about the upper bound on the quality achievable with the provided source code.

@ivanvovk
Owner

ivanvovk commented Mar 28, 2021

@janvainer Hey, thanks, man. Yeah, the samples are of somewhat lower quality than the ones presented on the demo page of the paper. However, the authors used their own proprietary dataset for training, where the female speaker had a much lower pitch than Linda (it is always hard to train on LJ). I also noticed that the fewer iterations you run, the less accurately the model reconstructs the higher frequencies. But I also think there might be some issues in the diffusion calculations. I suggest looking at lucidrains' code and reusing its forward and backward DDPM calculations with the improved cosine schedule (maybe this can help): https://github.com/lucidrains/denoising-diffusion-pytorch. His repo follows the paper https://arxiv.org/pdf/2102.09672.pdf. I am going to return to this WaveGrad repo and finally get the best quality out of it once all my other projects are finished, but I think that may be delayed until summer. Also, you can check Mozilla's TTS library. I remember some people there were interested in WaveGrad, and they even added it to their codebase: https://github.com/mozilla/TTS. Hope this helps.
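
For reference, the cosine noise schedule from the linked paper (arXiv:2102.09672) can be sketched roughly as below. This is a minimal pure-Python illustration, not code taken from this repo or from lucidrains' implementation. The function name `cosine_beta_schedule` and the choice of 1000 steps are assumptions made for the example; the offset `s = 0.008` and the clipping of betas at 0.999 follow the paper.

```python
import math


def cosine_beta_schedule(num_steps, s=0.008, max_beta=0.999):
    """Return per-step betas for the cosine schedule (hypothetical helper).

    alpha_bar(t) = cos^2(((t / T) + s) / (1 + s) * pi / 2), normalised by
    alpha_bar(0); the normalisation cancels in the ratio used below.
    """
    def alpha_bar(t):
        return math.cos((t / num_steps + s) / (1 + s) * math.pi / 2) ** 2

    betas = []
    for t in range(1, num_steps + 1):
        # beta_t = 1 - alpha_bar(t) / alpha_bar(t - 1), clipped near t = T
        beta = 1.0 - alpha_bar(t) / alpha_bar(t - 1)
        betas.append(min(beta, max_beta))
    return betas


# Assumed step count, for illustration only.
betas = cosine_beta_schedule(1000)

# Cumulative product of (1 - beta): the fraction of signal variance
# that survives after all forward diffusion steps.
remaining_signal = 1.0
for b in betas:
    remaining_signal *= 1.0 - b
```

The betas start small and grow toward the final step, so most of the signal is destroyed late in the forward process. Whether this schedule actually improves WaveGrad quality is untested here.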

@janvainer
Author

Thanks for the swift response :) I will check the diffusion calculations. I also tried the Mozilla version, but the quality of the synthesized audio seemed a bit lower to me, at least for the WaveGrad vocoder combined with Tacotron 2. There is this weird high-frequency noise.

On a side note, I am getting an increasing L1 test batch loss, while the L1 test spec batch loss is going down. Did you experience the same behavior?


@ivanvovk
Owner

@janvainer Yes, actually, I remember from my experiments that the L1 loss was not representative at all; the spectral loss was more informative. I think such behavior is okay, so don't pay attention to it.
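
One general reason a sample-domain L1 can disagree with a spectral L1 is phase: two signals with the same frequency content but different phase are far apart sample by sample, yet have identical magnitude spectra. The toy sketch below illustrates only that general effect. It does not reproduce this repo's metrics, and the helpers `l1` and `magnitude_spectrum`, the signal length, and the test tone are all assumptions made for the example.

```python
import cmath
import math


def l1(a, b):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def magnitude_spectrum(signal):
    """Magnitudes of a naive DFT over the non-negative frequency bins."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(signal)))
        for k in range(n // 2 + 1)
    ]


n = 64  # assumed toy signal length
# Same 4-cycle tone, second copy shifted by a quarter period.
reference = [math.sin(2 * math.pi * 4 * i / n) for i in range(n)]
shifted = [math.sin(2 * math.pi * 4 * i / n + math.pi / 2) for i in range(n)]

# Large: the samples disagree almost everywhere.
waveform_l1 = l1(reference, shifted)
# Near zero: both signals carry the same energy in the same bin.
spectral_l1 = l1(magnitude_spectrum(reference), magnitude_spectrum(shifted))

print(f"waveform L1: {waveform_l1:.3f}, spectral L1: {spectral_l1:.2e}")
```

Since the ear is largely insensitive to such phase differences, a spectral distance tends to track perceived quality more closely than a raw sample-wise distance.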

@janvainer
Author

Ok thanks! :)

@yijingshihenxiule

Hello, @janvainer! I have just started training, and the audio samples are still very noisy (approx. 12 hours, 25K epochs on a single GPU, batch size 96). Could you show me your training results? And at what point do the samples start to sound good? Thanks!

@janvainer
Author

Hi, unfortunately I do not have the results with me anymore. But I remember training on 4 GPUs for several days.

3 participants