
New Tacotron2 model release with WaveRNN vocoder. #153

Closed
erogol opened this issue Apr 12, 2019 · 17 comments
@erogol
Contributor

erogol commented Apr 12, 2019

A new TTS Tacotron2 model trained on LJSpeech is released. It should work well with the MOLD WaveRNN model.

  • The model has been trained for 260K iterations and has the best validation loss so far on LJSpeech.

  • The model was first trained with the dropout prenet, as in the original paper, then switched to the BN prenet described above. Finally, it was trained with forward attention, purely for experimental reasons.

  • At inference time you can try different attention-related parameters and pick whatever fits best: switch forward attention on or off, use the "sigmoid" or "softmax" attention norm, or try attention windowing. The default settings are given in the model's config.json.

  • I think both the WaveRNN and TTS models have more room for fine-tuning (especially WaveRNN) for better results.

You can also read more in #26.
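The attention switches mentioned above live in the model's config.json; a minimal sketch of reading and overriding them at inference time (the field names `use_forward_attn`, `attention_norm`, and `windowing` are assumptions based on this thread, not confirmed against the repo's actual config):

```python
# Hypothetical attention-related config fields; the real names in
# config.json may differ from commit to commit.
config = {
    "use_forward_attn": True,      # forward attention on/off
    "attention_norm": "sigmoid",   # "sigmoid" or "softmax"
    "windowing": False,            # attention windowing off by default
}

def attention_settings(cfg):
    """Collect the attention knobs to experiment with at inference time."""
    return {
        "forward_attention": cfg.get("use_forward_attn", False),
        "norm": cfg.get("attention_norm", "softmax"),
        "windowing": cfg.get("windowing", False),
    }

print(attention_settings(config))
```

The point is simply that these are inference-time switches: you can flip them one at a time on the same checkpoint and pick the combination that sounds best for your input.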

@erogol erogol added this to In Progress in v0.0.1 Apr 12, 2019
@erogol erogol moved this from In Progress to Done in v0.0.1 Apr 12, 2019
@erogol erogol moved this from Done to In Progress in v0.0.1 Apr 12, 2019
@erogol erogol closed this as completed Apr 12, 2019
v0.0.1 automation moved this from In Progress to Done Apr 12, 2019
@ZohaibAhmed

ZohaibAhmed commented Apr 12, 2019

@erogol - I tried to load the checkpoint with the latest code on the dev-tacotron2 branch. I get the following error:

RuntimeError: Error(s) in loading state_dict for Tacotron2:
	Missing key(s) in state_dict: "decoder.attention_layer.ta.weight", "decoder.attention_layer.ta.bias".

Solved - just make sure you use the right config.json files :)
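The missing-key error above comes down to a model built from one config being fed a checkpoint saved under another. A small pure-Python sketch of how to spot such a mismatch (the key names are taken from the error message above; actual checkpoint loading is elided):

```python
# Compare the parameter keys a freshly built model expects against the keys
# stored in a checkpoint. A non-empty difference means the config used to
# build the model does not match the one the checkpoint was trained with.
def diff_state_dict_keys(model_keys, checkpoint_keys):
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    return {
        "missing": sorted(model_keys - checkpoint_keys),      # model expects, checkpoint lacks
        "unexpected": sorted(checkpoint_keys - model_keys),   # checkpoint has, model lacks
    }

model_keys = [
    "decoder.attention_layer.ta.weight",  # from the error above
    "decoder.attention_layer.ta.bias",
    "encoder.lstm.weight",
]
ckpt_keys = ["encoder.lstm.weight"]
print(diff_state_dict_keys(model_keys, ckpt_keys))
```

A non-empty "missing" list like the one here is the signal to go back and use the config.json that shipped with the checkpoint.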

@ZohaibAhmed

@erogol - I tried to train a new WaveRNN model (from scratch and fine-tuned on top of yours) as well as using my previous implementation of WaveRNN. For each one, the output is very scrambled:

https://drive.google.com/open?id=1iHo-b3WwGrvRUc-RjhpQA_G0GgycsENW

When I point the vocoder to the MOLD model that you published, I get clearer speech (I can make out all of the words), but with noise. Any ideas?

@erogol
Contributor Author

erogol commented Apr 16, 2019

You need to train more to get cleaner output, but LJSpeech is also noisy, so to a degree that is acceptable.

@ZohaibAhmed

@erogol - thanks. Is this the case even when I'm fine-tuning? By training more, do you mean training Tacotron more, or WaveRNN? After how many steps should it generally start to get better?

I checked the alignment of what tacotron produces and it seems like the alignment is there.

@erogol
Contributor Author

erogol commented Apr 16, 2019

I meant training WaveRNN. If you train from scratch, it sounds good after 300K iterations, but it depends on the dataset.

@ZohaibAhmed

@erogol Thanks. From your experience, do you think it's possible to fine-tune WaveRNN like we can fine-tune Tacotron? My dataset is just a couple of hours, so it might not be enough to train from scratch.

I've also tried to use my own implementation of WaveRNN (very similar to yours) and after 900k steps, it works well with Rayhane's tacotron implementation but not yours.

@erogol
Contributor Author

erogol commented Apr 16, 2019

Fine-tuning WaveRNN works, but I haven't tried fine-tuning with a small dataset.

@ZohaibAhmed

@erogol - I tried fine-tuning up to 731k steps; the output still sounds scrambled: https://drive.google.com/file/d/1niGB9-IvkjW-Q7MTrgTtwa96Sp8Bu6Ub/view?usp=sharing

Any tips on what I can do to debug or see what might be wrong?

@mrgloom

mrgloom commented Apr 22, 2019

How do I use the WaveRNN model?
I downloaded mold_ljspeech_best_model from here: https://github.com/erogol/WaveRNN#released-models (https://drive.google.com/drive/folders/1wpPn3a0KQc6EYtKL0qOi4NqEmhML71Ve)
and used the suggested notebook from db7f3d3: https://github.com/mozilla/TTS/blob/db7f3d36e7768f9179d42a8f19b88c2c736d87eb/notebooks/Benchmark.ipynb
But in the config I can't see CONFIG.use_phonemes or CONFIG.embedding_size.

Update:
I fixed it; Tacotron2 and WaveRNN are separate models and should be used from their specific commits.
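One way to catch a config.json from the wrong commit early is to check that it contains the keys the notebook reads before building anything; a minimal sketch (the required-key list is taken from the comment above and is not exhaustive):

```python
import json

# Keys the Benchmark notebook reads from config.json; a config from a
# mismatched commit will be missing some of them.
REQUIRED_KEYS = ["use_phonemes", "embedding_size"]

def missing_config_keys(raw_json, required=REQUIRED_KEYS):
    """Return the required keys that are absent from a config.json blob."""
    cfg = json.loads(raw_json)
    return [k for k in required if k not in cfg]

old_config = '{"embedding_size": 256}'  # e.g. a config from an older commit
print(missing_config_keys(old_config))  # -> ["use_phonemes"]
```

An empty list doesn't guarantee the commit matches, but a non-empty one tells you immediately that the checkpoint and code are from different revisions.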

@mrgloom

mrgloom commented Apr 30, 2019

I have tried Tacotron2 + WaveRNN and found that the quality is good, but WaveRNN is too slow on CPU: about 3 s for Tacotron2 and about 30 s for WaveRNN. That makes it comparable to WaveGlow in terms of speed, though the WaveRNN model is smaller. Also, Tacotron2 processing time depends on sentence length (shorter sentences are processed faster, ~1 s), but for WaveRNN it stays high even for short sentences (~25 s). Why?

Model size:

Tacotron2:
    336 MB ljspeech-260k/checkpoint_260000.pth.tar
WaveRNN:
    49 MB mold_ljspeech_best_model/checkpoint_393000.pth.tar
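The timings above can be reproduced with a simple wall-clock harness around the two synthesis stages; a sketch with placeholder functions standing in for the actual Tacotron2 and WaveRNN calls (substitute the repo's real synthesis functions):

```python
import time

def timed(fn, *args):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    t0 = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - t0

# Placeholders only: the real calls produce a mel spectrogram and a waveform.
def run_tacotron2(text):
    return [0.0] * len(text)   # stands in for the predicted mel frames

def run_wavernn(mel):
    return [0.0] * len(mel)    # stands in for the generated audio samples

mel, t_taco = timed(run_tacotron2, "hello world")
wav, t_voc = timed(run_wavernn, mel)
print(f"tacotron2: {t_taco:.3f}s  wavernn: {t_voc:.3f}s")
```

Measuring the two stages separately makes the asymmetry in the comment above visible: Tacotron2's cost scales with sentence length, while WaveRNN's sample-by-sample autoregressive loop stays expensive even for short sentences.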

@CorentinJ

@erogol Do you have a model trained on the latest commit?

@haqkiemdaim

@erogol - tried to load the checkpoint with the latest code on the dev-tacotron2 branch. I get the following error:

RuntimeError: Error(s) in loading state_dict for Tacotron2:
	Missing key(s) in state_dict: "decoder.attention_layer.ta.weight", "decoder.attention_layer.ta.bias".

Solved - just make sure you use the right config.json files :)

May I know which config.json file solved your issue? @ZohaibAhmed

@erogol
Contributor Author

erogol commented Oct 10, 2019

@CorentinJ not yet but I'll be releasing new models soon.

@RaulButuc

RaulButuc commented Dec 29, 2019

@erogol is there any way to just run the pre-trained model with custom inputs in an "easy" way? (I don't really understand most of the code yet, as I'm still learning about ML.)

@mozilla mozilla deleted a comment from IveJ Dec 29, 2019
@reuben
Contributor

reuben commented Dec 29, 2019

@RaulButuc check out the instructions here: https://github.com/mozilla/TTS/wiki/Released-Models#simple-packaging---self-contained-package-that-runs-an-http-api-for-a-pre-trained-tts-model

@RaulButuc

RaulButuc commented Dec 30, 2019

@reuben I actually tried that yesterday, but unfortunately there is a conflict of PyTorch versions in the requirements (I had to manually download an older PyTorch .whl to be able to install the TTS-0.0.1 package, which then throws a dependency requirements error when I try to run it).

EDIT:

  • I will try a clean install, since maybe something got messed up yesterday with all the trial and error.
  • Also, I forgot to mention: I was actually interested in something similar to https://github.com/fatchord/WaveRNN (where you can just run a quick_start.py script with custom sentences), but for the 10-bit version of the model given by @erogol. I tried writing one myself based on all the samples I could find here on GH, but I'm not sure I fully understand how to correctly load the models.

@reuben
Copy link
Contributor

reuben commented Dec 30, 2019 via email
