New Tacotron2 model release with WaveRNN vocoder. #153
@erogol - tried to load the checkpoint with the latest code on the
Solved - just make sure you use the right config.json files :)
@erogol - I tried to train a new WaveRNN model (from scratch and finetuned on top of yours), as well as use my previous implementation of WaveRNN. For each one, the output is very scrambled: https://drive.google.com/open?id=1iHo-b3WwGrvRUc-RjhpQA_G0GgycsENW When I point the vocoder to the MOLD model that you published, I get clearer speech (I can make out all of the words), but with noise. Any ideas?
You need to train more to get cleaner output, but LJSpeech itself is noisy, so to a level it is acceptable.
@erogol - thanks. Is this the case even when I'm fine-tuning? By training more, do you mean training Tacotron more, or WaveRNN? After how many steps should it generally start to get better? I checked the alignment of what Tacotron produces, and it seems like the alignment is there.
I meant training WaveRNN. If you train from the start, it sounds good after 300K iterations, but it depends on the dataset.
@erogol Thanks. From your experience, do you think it's possible to fine-tune WaveRNN like we can fine-tune tacotron? My dataset is just a couple of hours so it might not be enough to train from scratch. I've also tried to use my own implementation of WaveRNN (very similar to yours) and after 900k steps, it works well with Rayhane's tacotron implementation but not yours. |
Finetuning WaveRNN works, but I haven't tried finetuning on a small dataset.
@erogol - I tried to finetune to 731k steps, and the output still sounds scrambled: https://drive.google.com/file/d/1niGB9-IvkjW-Q7MTrgTtwa96Sp8Bu6Ub/view?usp=sharing Any tips on what I can do to debug or see what might be wrong?
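One common cause of scrambled vocoder output is a mismatch in audio parameters between the Tacotron config and the WaveRNN config (sample rate, number of mel bands, hop length, etc.), since the vocoder then receives spectrograms it was never trained on. A quick sanity check might look like the sketch below; the field names under `"audio"` are illustrative and should be adjusted to the actual layout of your config.json files:

```python
import json

# Audio fields that must agree between the TTS model and the vocoder.
# (Key names are illustrative; check your own config.json files.)
KEYS = ["sample_rate", "num_mels", "fft_size", "hop_length", "win_length"]

def compare_audio_configs(tts_config_path, vocoder_config_path):
    """Return a dict of mismatched audio parameters; empty means they agree."""
    with open(tts_config_path) as f:
        tts_audio = json.load(f)["audio"]
    with open(vocoder_config_path) as f:
        voc_audio = json.load(f)["audio"]
    return {k: (tts_audio.get(k), voc_audio.get(k))
            for k in KEYS
            if tts_audio.get(k) != voc_audio.get(k)}
```

If this returns anything non-empty, the vocoder's input distribution doesn't match what Tacotron produces, and retraining with matching settings is usually the fix.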
How do I use the WaveRNN model? Update:
I have tried Tacotron 2 + WaveRNN and found that the quality is good, but WaveRNN is too slow on CPU: about 3 sec for Tacotron 2 and about 30 sec for WaveRNN. So it's comparable with the WaveGlow model in terms of speed, but the WaveRNN model size is smaller. Also, Tacotron 2 processing time depends on sentence length (i.e. shorter sentences are processed faster, ~1 sec), but WaveRNN stays slow even for short sentences, ~25 sec. Why? Model size:
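The asymmetry follows from how the two models generate output: Tacotron 2 decodes one mel frame per step (hundreds of steps per sentence), while WaveRNN decodes one raw audio sample per step, so even a short utterance requires tens of thousands of strictly sequential RNN steps on CPU. A rough back-of-envelope calculation, assuming typical LJSpeech-style settings (22050 Hz sample rate, hop length 256):

```python
# Rough sequential-step counts for a 3-second utterance, assuming a
# 22050 Hz sample rate and 256 samples per mel frame (hop length).
# These are typical LJSpeech-style values; adjust to your config.
sample_rate = 22050
hop_length = 256
duration_s = 3.0

wavernn_steps = int(duration_s * sample_rate)  # one RNN step per audio sample
tacotron_steps = wavernn_steps // hop_length   # one decoder step per mel frame

print(wavernn_steps)   # 66150 sequential steps for WaveRNN
print(tacotron_steps)  # 258 decoder steps for Tacotron 2
```

Since none of those 66k+ steps can be parallelized across time, WaveRNN's runtime is dominated by audio duration rather than sentence complexity, which is why even short sentences stay slow.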
@erogol Do you have a model trained on the latest commit? |
May I know which config.json file solved your issue? @ZohaibAhmed
@CorentinJ not yet but I'll be releasing new models soon. |
@erogol Is there any way to just run the pre-trained model with custom inputs in an "easy" way? (I don't really understand most of the code just yet, as I'm still learning about ML.)
@reuben I actually tried that yesterday, but unfortunately there is a conflict of PyTorch versions in the requirements (I had to manually download an older PyTorch .whl to be able to install the TTS-0.0.1 package, which then throws a dependency requirements error when I try to run it). EDIT:
You should be able to create and use a fresh virtualenv to avoid any conflicts.
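A minimal sketch of that setup (the package name and version are the ones mentioned in this thread; adjust as needed):

```shell
# Create an isolated environment so the package's pinned dependencies
# don't conflict with any system-wide PyTorch install.
python3 -m venv tts-env
. tts-env/bin/activate

# Then install the package inside the environment, e.g.:
# pip install TTS==0.0.1   # version mentioned in this thread
```

Everything installed while the environment is active stays inside `tts-env/`, so a conflicting global PyTorch no longer matters.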
A new TTS Tacotron2 model trained on LJSpeech is released. It should work well with the MOLD WaveRNN model.
The model has been trained for 260K iterations and has the best validation loss so far on LJSpeech.
It was first trained with a dropout prenet as in the original paper, then switched to the BN prenet described above. Finally, it was trained with "forward attention", just for experimental reasons.
At inference time you can try different attention-related parameters and pick the ones that work best for you: you can switch forward attention on/off, use "sigmoid" or "softmax" attention norm, or enable attention windowing. The default settings are given in the model's config.json.
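As an illustration, the inference-time attention options described above might appear in config.json roughly like this (the key names here are illustrative; check the released model's config.json for the exact keys and defaults):

```json
{
  "use_forward_attn": true,
  "attention_norm": "sigmoid",
  "windowing": false
}
```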
I think both the WaveRNN and TTS models have more room for finetuning (especially WaveRNN) for better results.
You can also read more here #26