
Is the predicted wav really better than fastspeech1? #53

Closed
Liujingxiu23 opened this issue Apr 26, 2021 · 8 comments

Comments

Liujingxiu23 commented Apr 26, 2021

I tried training on my own dataset, but the result is not as good as I expected, and it is even worse than FastSpeech 1. I used the default settings, phone-level, with HiFi-GAN as the vocoder.
How about your results?

ming024 (Owner) commented Apr 27, 2021

@Liujingxiu23 how is the quality and size of your dataset? If the quality is even worse than FastSpeech 1, maybe you should check the correctness of the pitch values given by the DIO algorithm (or any other algorithm you used).
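
For anyone who wants to sanity-check the extracted pitch before retraining, here is a minimal sketch using pyworld's DIO + StoneMask (the extractor mentioned above). The wav path, sampling rate, and hop length are placeholders; adjust them to match your preprocessing config.

```python
# Minimal sketch (not from the repo) for sanity-checking DIO pitch values.
# Assumes pyworld, librosa, and numpy are installed; "example.wav" is a placeholder.
import numpy as np
import librosa
import pyworld as pw

sr, hop_length = 22050, 256          # placeholder values; use your config
wav, _ = librosa.load("example.wav", sr=sr)
wav = wav.astype(np.float64)         # pyworld expects float64

# DIO gives a coarse F0 track; StoneMask refines it.
f0, t = pw.dio(wav, sr, frame_period=hop_length / sr * 1000)
f0 = pw.stonemask(wav, f0, t, sr)

voiced = f0[f0 > 0]
print(f"voiced frames: {len(voiced)}/{len(f0)}")
print(f"F0 range: {voiced.min():.1f}-{voiced.max():.1f} Hz, mean {voiced.mean():.1f} Hz")
# Almost no voiced frames, or values far outside a plausible speaking range
# (roughly 60-500 Hz for most speakers), usually indicate an extraction problem.
```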

Liujingxiu23 (Author) commented Apr 27, 2021

@ming024 thank you for your reply. I trained a multi-speaker model with 7 speakers, about 40,000 sentences in total. I will compare the synthesized wavs again carefully and check the extracted F0 features.
By the way, how many steps do you think is enough? How many steps did you use?
I use the default settings, batch size 16, and wait until the training and validation losses stop decreasing, at about 400k~500k steps.

Liujingxiu23 (Author) commented Apr 28, 2021

@ming024 Thank you for your help!
The following is the loss from my latest training (the previous training may have had something wrong). How does it look? Is the pitch overfitting? I use the same mel features as the HiFi-GAN vocoder.
The generated wavs are a little better than FastSpeech 1, and much better than Tacotron (r=3).
The performance of the Tacotron model is worse, maybe because I use r=3 instead of r=1, and the recordings of my target speaker are not of very high quality, which is consistent with what you said in #52.

(screenshot of training/validation loss curves attached)

There is another question I have met.
When I trained a multi-speaker Tacotron model on a dataset mixing Chinese and English (Chinese is the main language), speaker-1 has only Chinese and no English, speaker-2 has both Chinese and English, and speaker-3 has only English. With the trained model, speaker-1 can speak English decently.
But with the FastSpeech 2 model, the synthesized wavs of speaker-1 that contain English are much worse than Tacotron's.
How can I improve the English of speaker-1?

ming024 (Owner) commented May 3, 2021

@Liujingxiu23 People believe that Tacotron is better, at least according to the SOTA TTS papers. I think it is because autoregressive models can fit the datasets better. However, they are more difficult to train (and probably take more computational resources), so maybe that is why you find the results are not good on your datasets.

For the language transfer experiments, one possible solution is to use a pretrained speaker embedding instead of the embedding table. You can check my paper here.
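
In case it helps others reading this thread, a minimal sketch of what that swap could look like: a precomputed speaker vector (e.g. a d-vector from a pretrained speaker encoder) is projected to the encoder hidden size and added to the encoder output. The module name, dimensions, and `spk_dvec` input are made up for illustration and are not part of this repo or the paper.

```python
import torch
import torch.nn as nn

class ExternalSpeakerEmbedding(nn.Module):
    """Replaces the learned nn.Embedding speaker table with a projection of a
    precomputed speaker vector (e.g. a 256-dim d-vector)."""

    def __init__(self, dvec_dim=256, hidden_dim=256):
        super().__init__()
        self.proj = nn.Linear(dvec_dim, hidden_dim)

    def forward(self, encoder_output, spk_dvec):
        # encoder_output: (batch, src_len, hidden_dim)
        # spk_dvec: (batch, dvec_dim), precomputed per speaker or per utterance
        spk = self.proj(spk_dvec).unsqueeze(1)   # (batch, 1, hidden_dim)
        return encoder_output + spk              # broadcast over the time axis
```

Because the speaker representation is no longer tied to a fixed speaker-ID table, an unseen or low-resource speaker can be conditioned on an embedding computed from a few reference utterances.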

Liujingxiu23 (Author) commented

@ming024 Thank you for your reply. I will try using a pretrained speaker embedding instead of the embedding table.

bheshaj96 commented

@Liujingxiu23, how are you using speaker embeddings for multi-speaker training of FastSpeech with the hidden encoder vectors?
Are you directly adding them or concatenating?

Liujingxiu23 (Author) commented

I did not change the code; my way of using the speaker info is the same as the code on GitHub.
The speaker info is not used inside the text encoder; the speaker embedding is just added afterwards, as in:
https://github.com/ming024/FastSpeech2/blob/master/model/fastspeech2.py line 68
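
For the add-vs-concatenate question above: in that file the speaker embedding is looked up from an nn.Embedding table and added to the encoder output (broadcast over the time axis), not concatenated. A self-contained sketch of the idea, with made-up dimensions, paraphrasing the referenced line rather than copying the repo code exactly:

```python
import torch
import torch.nn as nn

n_speakers, hidden_dim = 7, 256                         # made-up dimensions
speaker_emb = nn.Embedding(n_speakers, hidden_dim)

batch, max_src_len = 2, 100
output = torch.randn(batch, max_src_len, hidden_dim)    # encoder output
speakers = torch.tensor([0, 3])                         # speaker IDs, shape (batch,)

# The speaker embedding is expanded over the source length and added,
# not concatenated along the feature dimension.
output = output + speaker_emb(speakers).unsqueeze(1).expand(-1, max_src_len, -1)
print(output.shape)  # torch.Size([2, 100, 256])
```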

ming024 closed this as completed May 26, 2021.