
Is the predicted wav really better than fastspeech1? #53

Closed
Liujingxiu23 opened this issue Apr 26, 2021 · 8 comments

Comments

Liujingxiu23 commented Apr 26, 2021

I tried training on my own dataset, but the result is not as good as I expected, and it is even worse than FastSpeech 1. I used the default settings, phone-level, with HiFi-GAN as the vocoder.
How about your results?

ming024 (Owner) commented Apr 27, 2021

@Liujingxiu23 how is the quality and size of your dataset? If the quality is even worse than FastSpeech 1, maybe you should check the correctness of the pitch values given by the DIO algorithm (or any other algorithm you used).
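
For anyone who wants to sanity-check the extracted pitch before retraining, here is a minimal sketch using pyworld's DIO + StoneMask (the extractor mentioned above). The wav path, sampling rate, and hop length are placeholders; adjust them to match your preprocessing config.

```python
# Minimal sketch (not from the repo) for sanity-checking DIO pitch values.
# Assumes pyworld, librosa, and numpy are installed; "example.wav" is a placeholder.
import numpy as np
import librosa
import pyworld as pw

sr, hop_length = 22050, 256          # placeholder values; use your config
wav, _ = librosa.load("example.wav", sr=sr)
wav = wav.astype(np.float64)         # pyworld expects float64

# DIO gives a coarse F0 track; StoneMask refines it.
f0, t = pw.dio(wav, sr, frame_period=hop_length / sr * 1000)
f0 = pw.stonemask(wav, f0, t, sr)

voiced = f0[f0 > 0]
print(f"voiced frames: {len(voiced)}/{len(f0)}")
print(f"F0 range: {voiced.min():.1f}-{voiced.max():.1f} Hz, mean {voiced.mean():.1f} Hz")
# Almost no voiced frames, or values far outside a plausible speaking range
# (roughly 60-500 Hz for most speakers), usually indicate an extraction problem.
```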

Liujingxiu23 (Author) commented Apr 27, 2021

@ming024 thank you for your reply. I trained a multi-speaker model with 7 speakers, about 40,000 sentences in total. I will compare the synthesized wavs again carefully and check the extracted F0 features.
By the way, how many steps do you think is enough? How many steps did you use?
I use the default settings, batch size 16, and wait until the training and validation losses stop decreasing, at about 400k~500k steps.

Liujingxiu23 (Author) commented Apr 28, 2021

@ming024 Thank you for your help!
The following is the loss from my latest training (the previous training may have had something wrong). How does it look? Is the pitch overfitting? I use the same mel features as the HiFi-GAN vocoder.
The generated wavs are a little better than FastSpeech 1, and much better than Tacotron (r=3).
The performance of the Tacotron model is worse, maybe because I use r=3 instead of r=1, and the recordings of my target speaker are not of very high quality, which is consistent with what you said in #52.

(screenshot of training/validation loss curves attached)

There is another question I have met.
When I trained a multi-speaker Tacotron model on a dataset mixing Chinese and English (Chinese is the main language), speaker-1 has only Chinese and no English, speaker-2 has both Chinese and English, and speaker-3 has only English. With the trained model, speaker-1 can speak English decently.
But with the FastSpeech 2 model, the synthesized wavs of speaker-1 that contain English are much worse than Tacotron's.
How can I improve the English of speaker-1?

ming024 (Owner) commented May 3, 2021

@Liujingxiu23 People believe that Tacotron is better, at least according to the SOTA TTS papers. I think it is because autoregressive models can fit the datasets better. However, they are more difficult to train (and probably take more computational resources), so maybe that is why you find the results are not good on your datasets.

For the language transfer experiments, one possible solution is to use a pretrained speaker embedding instead of the embedding table. You can check my paper here.
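
In case it helps others reading this thread, a minimal sketch of what that swap could look like: a precomputed speaker vector (e.g. a d-vector from a pretrained speaker encoder) is projected to the encoder hidden size and added to the encoder output. The module name, dimensions, and `spk_dvec` input are made up for illustration and are not part of this repo or the paper.

```python
import torch
import torch.nn as nn

class ExternalSpeakerEmbedding(nn.Module):
    """Replaces the learned nn.Embedding speaker table with a projection of a
    precomputed speaker vector (e.g. a 256-dim d-vector)."""

    def __init__(self, dvec_dim=256, hidden_dim=256):
        super().__init__()
        self.proj = nn.Linear(dvec_dim, hidden_dim)

    def forward(self, encoder_output, spk_dvec):
        # encoder_output: (batch, src_len, hidden_dim)
        # spk_dvec: (batch, dvec_dim), precomputed per speaker or per utterance
        spk = self.proj(spk_dvec).unsqueeze(1)   # (batch, 1, hidden_dim)
        return encoder_output + spk              # broadcast over the time axis
```

Because the speaker representation is no longer tied to a fixed speaker-ID table, an unseen or low-resource speaker can be conditioned on an embedding computed from a few reference utterances.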

Liujingxiu23 (Author) commented

@ming024 Thank you for your reply. I will try using a pretrained speaker embedding instead of the embedding table.

bheshaj96 commented

@Liujingxiu23, how are you using speaker embeddings for multi-speaker training of FastSpeech with the hidden encoder vectors?
Are you directly adding them or concatenating?

Liujingxiu23 (Author) commented

I did not change the code; my way of using the speaker info is the same as the code on GitHub.
The speaker info is not used inside the text encoder; the speaker embedding is just added afterwards, as in:
https://github.com/ming024/FastSpeech2/blob/master/model/fastspeech2.py line 68
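
For the add-vs-concatenate question above: in that file the speaker embedding is looked up from an nn.Embedding table and added to the encoder output (broadcast over the time axis), not concatenated. A self-contained sketch of the idea, with made-up dimensions, paraphrasing the referenced line rather than copying the repo code exactly:

```python
import torch
import torch.nn as nn

n_speakers, hidden_dim = 7, 256                         # made-up dimensions
speaker_emb = nn.Embedding(n_speakers, hidden_dim)

batch, max_src_len = 2, 100
output = torch.randn(batch, max_src_len, hidden_dim)    # encoder output
speakers = torch.tensor([0, 3])                         # speaker IDs, shape (batch,)

# The speaker embedding is expanded over the source length and added,
# not concatenated along the feature dimension.
output = output + speaker_emb(speakers).unsqueeze(1).expand(-1, max_src_len, -1)
print(output.shape)  # torch.Size([2, 100, 256])
```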

ming024 closed this as completed May 26, 2021.