GlowTTS with MultiBand Melgan #38

Closed
Zarbuvit opened this issue Oct 1, 2020 · 7 comments

Comments

@Zarbuvit

Zarbuvit commented Oct 1, 2020

I am trying to get GlowTTS working with Multiband Melgan but I am running into many issues with the different MB Melgan models I am trying.
I managed to get this working with normal Melgan from https://github.com/seungwonpark/melgan , but Multiband Melgan seems to be expecting different input or some normalization I can't figure out.

What I Tried

Using Mozilla-TTS Multiband Melgan and taking most of my implementation from https://colab.research.google.com/drive/1u_16ZzHjKYFn1HNVuA4Qf_i2MMFB9olY?usp=sharing#scrollTo=x8IDS6fO8uW2

  • Initially I got a lot of mechanical noise and nothing else.
  • I tried copying the normalization that is done during the melgan synthesis, which let me hear and understand the words in the synthesis, but with a lot of background noise.
  • I then came across "Cannot use WaveGAN with Glow-TTS and Nividia's Tacotron2" (kan-bayashi/ParallelWaveGAN#169), where it was mentioned that the normalization includes decompression and logs. I used @seantempesta's code with some differences (the stats file provided by MozillaTTS gives the standard deviation and not the variance, so I skipped var and imported sigma directly). This removed the background noise and it sounded like someone talking, but all the words were garbled.
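
The normalization described in the last bullet can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the linked issue: it assumes the mel spectrogram coming out of GlowTTS is natural-log compressed, that the vocoder was trained on log10 mels standardized per band, and that the stats file provides a per-band mean and standard deviation. The function name and the numbers are made up for the example.

```python
import numpy as np


def ln_mel_to_standardized_log10(mel_ln, mean, sigma):
    """Convert a natural-log mel spectrogram to a standardized log10 one.

    mel_ln: array of shape (n_mels, frames), assumed to hold ln(amplitude).
    mean, sigma: per-band statistics of shape (n_mels,), assumed to have
        been computed on log10 mels of the vocoder's training data.
    """
    mel_log10 = mel_ln / np.log(10.0)  # ln(x) / ln(10) == log10(x)
    # sigma is already a standard deviation, so no square root is taken.
    return (mel_log10 - mean[:, None]) / sigma[:, None]


# Tiny synthetic example (made-up numbers, not real statistics).
amplitude = np.array([[1.0, 10.0, 100.0],
                      [0.1, 1.0, 10.0]])
mel_ln = np.log(amplitude)
mean = np.array([1.0, 0.0])
sigma = np.array([1.0, 2.0])
out = ln_mel_to_standardized_log10(mel_ln, mean, sigma)
```

If the stats file stored the variance instead, the division would need `np.sqrt(var)` in place of `sigma`.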

I tried using https://github.com/kan-bayashi/ParallelWaveGAN Multiband Melgan but I kept running into tensor size issues during the inference and I couldn't figure out why because the tensor size is the same as what I sent to MozillaTTS as well as to the normal Melgan.

I also tried the Multiband Melgan model from https://github.com/TensorSpeech/TensorflowTTS but I ran into similar tensor size issues.
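
One thing that may be worth checking for tensor size errors like these is the axis layout each vocoder expects. The helper below is a hypothetical sketch, not taken from any of the repositories above: it assumes the mel spectrogram is either (batch, n_mels, frames) with a batch of one, (n_mels, frames), or (frames, n_mels), and rearranges it into a requested layout. Which layout a given vocoder wants is an assumption to verify against that project's inference code.

```python
import numpy as np


def to_layout(mel, n_mels, layout):
    """Rearrange a mel spectrogram into the requested axis layout.

    layout: "BCT" for (1, n_mels, frames) or "TC" for (frames, n_mels).
    If frames happens to equal n_mels, the input is assumed to already be
    channels-first, since the two cases cannot be told apart by shape.
    """
    mel = np.asarray(mel)
    if mel.ndim == 3:
        if mel.shape[0] != 1:
            raise ValueError("only a batch of one is handled in this sketch")
        mel = mel[0]
    if mel.ndim != 2:
        raise ValueError(f"unexpected number of dimensions: {mel.ndim}")
    if mel.shape[0] != n_mels:
        if mel.shape[1] != n_mels:
            raise ValueError(f"no axis of size {n_mels} in shape {mel.shape}")
        mel = mel.T  # (frames, n_mels) -> (n_mels, frames)
    if layout == "BCT":
        return mel[None]
    if layout == "TC":
        return mel.T
    raise ValueError(f"unknown layout: {layout}")


batched = np.zeros((1, 80, 123))
frames_first = to_layout(batched, 80, "TC")
channels_first = to_layout(np.zeros((123, 80)), 80, "BCT")
```

Printing the shape right before the vocoder call, and comparing it with what that vocoder's own inference script passes in, is usually enough to tell whether a transpose or a missing batch axis is the problem.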

Question

Has anyone managed to get any model of Multiband Melgan working with GlowTTS? Is there a specific repository that is better to use?
Is this really down to differences in normalization prior to sending the mel spectrogram to the Multiband Melgan? What normalization needs to be applied to the mel spectrograms that come out of GlowTTS in order for them to work with Multiband Melgan?

Please let me know if more information is needed from me (I didn't want to elaborate on every specific error I got, so as not to make this post go in too many directions at once).

Thanks in advance for your time and any help you can provide!

@seantempesta

Hey @Zarbuvit, I feel your frustration. I ended up trying all of those libraries and none of them worked well with glow-tts. Then I came across a forked version of @seungwonpark's melgan written by @rishikksh20 and it worked perfectly!

Multi-band Melgan that works with glow-tts
https://github.com/rishikksh20/melgan

I forked his project and have been re-working it so it can be used as a package for inference:
https://github.com/seantempesta/melgan-1

(Note: I may have totally broken the training aspects as I've only tested the inference parts since I repackaged it)

@echelon

echelon commented Oct 2, 2020

This is fantastic! Thanks for sharing!

@Zarbuvit
Author

Zarbuvit commented Oct 4, 2020

@seantempesta Thank you so much! I will have a look now and hopefully all goes well.

@Zarbuvit
Author

Zarbuvit commented Oct 4, 2020

@seantempesta Sadly this isn't working for me. I took the inference from glow-tts as is, removed the waveglow parts, and added the MB Melgan generator from rishikksh20 instead, along with all the inference code from rishikksh20 as well. I used his pretrained model.
I'm getting a clean voice, but it is garbled and I can't understand the words. This is similar to what I got using MozillaTTS Multiband Melgan after applying the code changes you recommended in a different issue.

Also a separate issue I am having is that the denoiser isn't working. I get the error:

IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

on the line audio = denoiser(audio, 0.1). This is strange to me because I assumed that if I got an error like that, it would happen during inference, but the inference was fine and only the denoiser crashes. If I comment out the denoiser it all runs fine, except for the result being garbled as mentioned before.
I saw that you changed the Denoiser in your repo to work with CPU. Is this just a preference, or is it because it does not work on GPU for you?

Did you run into any of these issues?
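
For readers who hit the same IndexError: the message says the tensor has a single dimension (valid indices -1 and 0) while some operation asked for dimension 1. One possible cause, which is an assumption here and not something confirmed in this thread, is that the generated audio is 1-D while the denoiser expects a (batch, samples) input. A minimal sketch of a guard, with a hypothetical helper name, could look like this. It is shown with NumPy so it runs anywhere; the same `ndim` check and `None` indexing also work on PyTorch tensors.

```python
import numpy as np


def ensure_batched(audio):
    """Return audio with a leading batch axis, adding one if it is 1-D."""
    if audio.ndim == 1:
        return audio[None, :]  # (samples,) -> (1, samples)
    return audio


wave = np.zeros(22050, dtype=np.float32)  # stand-in for vocoder output
batched = ensure_batched(wave)
```

The call in question would then be made on the batched array, for example `denoiser(ensure_batched(audio), 0.1)`. Whether this matches what a particular Denoiser class expects has to be checked against its source.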

@Zarbuvit
Author

Zarbuvit commented Oct 8, 2020

It's working!
@seantempesta thank you for that repo!
I ended up using mostly https://github.com/rishikksh20/melgan and editing the denoiser according to what @seantempesta did in his repo: https://github.com/seantempesta/melgan-1

As for the garbled words - completely my own problem! I misnamed my models and used a different method of converting to phonemes in training than in inference.
I am sorry for any time I caused you to waste on my stupidity.
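
A mismatch like this between the training and inference text front end can be caught early by comparing the relevant settings before synthesis. The sketch below is hypothetical: the key names (`text_cleaners`, `use_phonemes`, `add_blank`) are placeholders chosen for illustration and may not match the configuration of any particular GlowTTS setup.

```python
def frontend_mismatches(train_cfg, infer_cfg, keys):
    """Return {key: (train_value, infer_value)} for every key that differs."""
    return {
        k: (train_cfg.get(k), infer_cfg.get(k))
        for k in keys
        if train_cfg.get(k) != infer_cfg.get(k)
    }


# Placeholder settings; real configs would be loaded from the saved files.
train_cfg = {"text_cleaners": ["english_cleaners"],
             "use_phonemes": True,
             "add_blank": True}
infer_cfg = {"text_cleaners": ["english_cleaners"],
             "use_phonemes": False,
             "add_blank": True}

diff = frontend_mismatches(
    train_cfg, infer_cfg, ["text_cleaners", "use_phonemes", "add_blank"]
)
if diff:
    print("Text front end differs from training:", diff)
```

Saving the training configuration next to each checkpoint makes this comparison possible even when model files have been renamed.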

Thank you for your help!

Zarbuvit closed this as completed Oct 8, 2020

@v-nhandt21

> It's working!
> @seantempesta thank you for that repo!
> I ended up using mostly https://github.com/rishikksh20/melgan and editing the denoiser according to what @seantempesta did in his repo: https://github.com/seantempesta/melgan-1
>
> As for the garbled words - completely my own problem! I misnamed my models and used a different method of converting to phonemes in training than in inference.
> I am sorry for any time I caused you to waste on my stupidity.
>
> Thank you for your help!

Hello Zarbuvit, can you share some of the phonemes you are using? I am struggling with phoneme representation.

@phamkhactu

@Zarbuvit I have trained a Glow model. The model speaks very naturally, but there is a buzzing noise. Do you have any ideas? I tried another model that does not have the buzzing noise, but it does not have temperature.

Thank you
