Required amount of data and iterations to train the model #12
Hi @Alexey322, I can tell you that, looking at your tensorboard and comparing it to mine, I see a higher loss_ctc: about 1.8 vs. my 1.3. My train/mel_loss trended toward -2.0, reaching it around step 200k; at step 60k it was around -1.7.
Thank you for sharing the results, @szprytny. Why did you try to overfit the model, and what synthesis results did you get before and after overfitting?
I cannot answer regarding synthesis on a non-overfitted model, because I used that 600k checkpoint for training the second stage of the RADTTS++ model. What is important: pronunciation is very good, and there is no problem understanding the spoken sentences, even very long "tongue twisters". The tensorboard screenshot is from step 1 - training the decoder with
Hi @szprytny, thank you for the insights! Just wondering: in your experience, what would be a sufficient number of training steps? It isn't described in the original paper, and as I am still doing initial experiments with LJSpeech, the config (https://github.com/NVIDIA/radtts/blob/main/configs/config_ljs_decoder.json) sets the total number of epochs to 10,000,000, which seems way too high.
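For what it's worth, a huge epoch count like 10,000,000 usually just means "train until you stop it manually", so it may be easier to think in optimizer steps and convert. A minimal sketch of that conversion; the clip count for LJSpeech is real (~13,100 utterances), but the batch size and target step count here are assumptions for illustration:

```python
# Convert a target number of optimizer steps into a finite epoch count,
# instead of relying on an effectively-infinite epochs setting.
num_clips = 13100        # LJSpeech has ~13,100 utterances
batch_size = 8           # assumed batch size
target_steps = 600_000   # assumed target, e.g. the 600k checkpoint mentioned above

steps_per_epoch = num_clips // batch_size      # 1637 steps per epoch
epochs_needed = target_steps // steps_per_epoch + 1
print(steps_per_epoch, epochs_needed)          # 1637 367
```

So a few hundred epochs would already cover the step counts discussed in this thread.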
That probably depends very much on the dataset, but I can say the model produces intelligible utterances pretty quickly for me: around 30k steps with 8 samples per batch. I don't train the model with pitch and energy conditioning anymore; I noticed that for my multispeaker data the results are much worse than with the basic RADTTS model.
Hi, I'm training your model from scratch on 60 voices, each with 3-15 minutes of data. Surprisingly, the model starts to overfit already at 26k iterations with batch size 12, given that the total duration of all audio files is about 7-8 hours. Unfortunately, I got unsatisfactory results: the speech of many speakers is completely unintelligible. I attach screenshots of the decoder training.
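Early overfitting on a dataset this small is not too surprising; a back-of-envelope count of how many passes over the data 26k steps represents makes this concrete. The total duration and batch size come from the comment above; the average clip length is an assumption:

```python
# Estimate how many epochs 26k optimizer steps cover for a ~7.5 h dataset.
total_hours = 7.5               # midpoint of the stated 7-8 hours
avg_clip_seconds = 6.0          # assumed average utterance length
batch_size = 12
steps = 26_000

num_clips = int(total_hours * 3600 / avg_clip_seconds)  # 4500 clips
steps_per_epoch = num_clips / batch_size                # 375.0 steps per epoch
epochs_seen = steps / steps_per_epoch                   # ~69 passes over the data
print(num_clips, round(epochs_seen))                    # 4500 69
```

Roughly 70 full passes over 7-8 hours of audio is well into the regime where a model of this capacity can start memorizing the training set.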