
Required amount of data and iterations to train the model #12

Open · Alexey322 opened this issue Sep 6, 2022 · 5 comments

@Alexey322

Alexey322 commented Sep 6, 2022

Hi, I'm training your model from scratch on 60 voices, each with 3-15 minutes of data. Surprisingly, the model already starts to overfit at 26k iterations with batch size 12, even though the total duration of all audio files is about 7-8 hours. Unfortunately, I got unsatisfactory results: the speech of many speakers is completely unintelligible. I attach screenshots of the decoder training.

[screenshot: decoder training curves from TensorBoard]
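
As a rough sanity check on the overfitting claim (the average clip length below is an assumption, not a figure from this thread), 26k iterations at batch size 12 already corresponds to many passes over a 7-8 hour dataset:

```python
# Back-of-the-envelope estimate of how many epochs 26k iterations covers.
# total_hours and avg_clip_seconds are assumed values for illustration.
total_hours = 7.5          # reported total audio duration (~7-8 h)
avg_clip_seconds = 7.0     # assumed average utterance length
batch_size = 12
iterations = 26_000

n_clips = total_hours * 3600 / avg_clip_seconds   # ~3.9k clips
steps_per_epoch = n_clips / batch_size            # ~320 steps per epoch
epochs_seen = iterations / steps_per_epoch        # ~80 passes over the data

print(f"{n_clips:.0f} clips, {steps_per_epoch:.0f} steps/epoch, "
      f"~{epochs_seen:.0f} epochs after {iterations} iterations")
```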

@szprytny

szprytny commented Sep 6, 2022

Hi @Alexey322,
I trained from scratch for Polish - about 14 hours of data in total; about 9 hours of that is one speaker, and the other speakers' durations vary a lot.

Comparing your TensorBoard to mine, I see a higher loss_ctc - about 1.8 vs. my 1.3 - and binarization_loss values above 0.4, whereas for me it stayed between 0.25 and 0.35.

train/mel_loss headed toward -2.0, reaching it around step 200k; at step 60k it was around -1.7.
For val/mel_loss I had a peak near step 30k at -1.52, then at step 200k it was -0.75.

[screenshot: TensorBoard mel_loss curves from my run]

@Alexey322

Thank you for sharing the results, @szprytny. Why did you try to overfit the model, and what synthesis results did you get before and after overfitting?

@szprytny

szprytny commented Sep 7, 2022

I cannot answer regarding synthesis with a non-overfitted model, because I used that 600k checkpoint for training the second step of the RADTTS++ model.
I can only say that some of the speakers sound quite biased compared to the training samples, but for most of them you can still recognize who is who :D

What is important - pronunciation is very good; there is no problem understanding spoken sentences, even very long "tongue twisters".
e.g. w gąszczu.zip

The TensorBoard screenshot is from step 1 - training the decoder with config_ljs_decoder.json.
Then, in the 2nd step, I used config_ljs_dap.json to get the model for synthesis.
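
For readers unfamiliar with the two-stage setup: the second stage is typically warm-started from the stage-1 decoder checkpoint. A minimal, generic PyTorch sketch of that pattern (this is not the repo's actual loading code; the checkpoint path and `build_radtts_model` are placeholders):

```python
import torch

# Placeholder path and model constructor for illustration only; in practice the
# warm start is configured in the stage-2 JSON config rather than hand-coded.
ckpt = torch.load("outdir_decoder/model_600000.pt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)

model = build_radtts_model(config)  # hypothetical helper standing in for model setup

# strict=False restores the shared/decoder weights from stage 1 while leaving
# any newly added attribute-predictor parameters randomly initialized.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("newly initialized parameters:", missing)
```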

@unilight

Hi @szprytny, thank you for the insights! Just wondering: in your experience, what would be a sufficient number of training steps? It's not described in the original paper, and as I am still doing initial experiments with LJSpeech, the config (https://github.com/NVIDIA/radtts/blob/main/configs/config_ljs_decoder.json) sets the total number of epochs to 10,000,000, which seems to be way too much.

@szprytny

That probably depends very much on the dataset, but I can say that the model was producing intelligible utterances pretty quickly for me - after about 30k steps with 8 samples per batch.
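
To put the 10,000,000-epoch config value in perspective: it effectively means "train until you stop it", and a concrete step target can be converted into epochs from the dataset size. The clip-length value below is an assumption, not a number from this thread:

```python
# Convert a target number of optimizer steps into an approximate epoch count.
# total_hours matches the ~14 h Polish dataset mentioned above; avg_clip_seconds
# is an assumed average utterance length.
target_steps = 30_000
batch_size = 8
total_hours = 14.0
avg_clip_seconds = 7.0

n_clips = total_hours * 3600 / avg_clip_seconds   # ~7200 clips
steps_per_epoch = n_clips / batch_size            # ~900 steps per epoch
epochs_needed = target_steps / steps_per_epoch    # ~33 epochs

print(f"~{steps_per_epoch:.0f} steps/epoch -> ~{epochs_needed:.0f} epochs "
      f"for {target_steps} steps (vs. 10,000,000 epochs in the default config)")
```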

I don't train the model with pitch and energy conditioning anymore. I noticed that for my multispeaker data the results are much worse than with the basic RADTTS model.
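
If you want to see exactly which settings separate the basic decoder setup from a pitch/energy-conditioned one, a quick way is to diff the flattened JSON configs. The second filename below is a placeholder; substitute whichever conditioned config you trained with:

```python
import json

def flatten(d, prefix=""):
    """Flatten nested config dicts into dotted keys for easy comparison."""
    out = {}
    for k, v in d.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            out.update(flatten(v, key + "."))
        else:
            out[key] = v
    return out

with open("configs/config_ljs_decoder.json") as f:
    base = flatten(json.load(f))
with open("configs/config_ljs_conditioned.json") as f:  # placeholder filename
    cond = flatten(json.load(f))

for key in sorted(set(base) | set(cond)):
    if base.get(key) != cond.get(key):
        print(f"{key}: {base.get(key)!r} -> {cond.get(key)!r}")
```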
