Why is the wav quality with frame level much worse than with phoneme level? #52
Comments
Did you compare your result with FastSpeech 1, Tacotron, or other state-of-the-art methods? What is the result?
Thank you @WuMing757
@Liujingxiu23 @WuMing757 When I did frame-level pitch and energy prediction, the results were not so good: at testing time the model tended to predict a constant value for every frame within a phoneme, since the frame-level hidden features are copied from the same phoneme-level feature. At training time, however, the ground-truth pitch and energy values can vary within a phoneme, which differs from the testing-time case. This problem does not exist in the phoneme-level pitch and energy modeling scenario, so the model performs much better there. You may wonder how the model knows about the intra-phoneme variation given only a phoneme-level pitch/energy value. I have to say, the decoder is much more powerful than you think, and it nails it.
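For readers less familiar with the architecture, here is a minimal sketch (PyTorch, with hypothetical names; not the repo's exact code) of the mechanism described above: the length regulator copies each phoneme-level vector to every frame it covers, so a deterministic frame-level predictor receives identical inputs for all frames of a phoneme and can only output a constant value for them.

```python
import torch

def length_regulate(phoneme_hidden, durations):
    # Expand (num_phonemes, dim) features to (num_frames, dim) by
    # repeating each phoneme's vector `durations[i]` times.
    return torch.repeat_interleave(phoneme_hidden, durations, dim=0)

phoneme_hidden = torch.randn(3, 4)       # 3 phonemes, 4-dim features
durations = torch.tensor([2, 3, 1])      # frames per phoneme
frame_hidden = length_regulate(phoneme_hidden, durations)  # shape (6, 4)

# Frames 0-1 are exact copies of phoneme 0's vector, frames 2-4 of
# phoneme 1's, and so on. A deterministic pitch predictor applied to
# frame_hidden therefore outputs the same value for every frame of a
# phoneme, even though the ground-truth pitch it saw in training varied.
predictor = torch.nn.Linear(4, 1)
print(predictor(frame_hidden).squeeze(-1))  # constant within each phoneme
```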
The performance is undoubtedly better than FastSpeech 1, but I think it is slightly worse than Tacotron 2 and other autoregressive models.
Thank you for the detailed explanation. In my case, at both the phoneme level and the frame level, the pitch predictor seems to overfit: the pitch validation loss increases while the training loss and the other validation losses decrease. Even so, the performance of the model is great. Do you have any ideas about this?
@WuMing757 I think slight overfitting is not a bad thing compared with underfitting, at least for TTS models. Underfitting models may generate speech with flat tone and prosody, which annoys listeners.
closed #52
I think this might be due to CWT (continuous wavelet transform) pitch modeling missing from this implementation, with phoneme averaging making up for it by reducing the resolution of the pitch to be predicted.
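To illustrate the phoneme-averaging idea concretely, here is a minimal sketch (a hypothetical helper, not necessarily the repo's exact code): the frame-level ground-truth pitch contour is averaged over each phoneme's frames using the alignment durations, and the predictor is trained against these lower-resolution, one-value-per-phoneme targets.

```python
import torch

def average_by_duration(frame_pitch, durations):
    # frame_pitch: (num_frames,) ground-truth pitch contour
    # durations:   (num_phonemes,) frames per phoneme, summing to num_frames
    # Returns one averaged pitch value per phoneme.
    segments = torch.split(frame_pitch, durations.tolist())
    return torch.stack([seg.mean() for seg in segments])

frame_pitch = torch.tensor([110., 120., 130., 200., 210., 190.])
durations = torch.tensor([3, 3])
print(average_by_duration(frame_pitch, durations))  # tensor([120., 200.])
```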
Can anyone please explain the frame-level pitch predictor in this project to me? It seems that the output from fc_layer has the same length as the input characters. Why can it be frame-level after a frame-level masked_fill?
The frame-level implementation seems closer to the implementation in the paper. But why is the quality much worse than with the phoneme level? And with the phoneme level, the pitch is predicted before the length expansion, which results in the pitch being the same for every frame of a phoneme. I think the pitches within a phoneme sometimes change. But... the phoneme-level variant has excellent performance...
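To make the ordering difference in this thread concrete, here is a minimal sketch (hypothetical module names, assuming a simple linear pitch embedding rather than the repo's actual embedding layer) contrasting the two variants: the phoneme-level variant predicts pitch and adds its embedding before length expansion, so the added pitch information is identical for all frames of a phoneme; the frame-level variant expands first and predicts per frame.

```python
import torch

def phoneme_level(hidden, durations, pitch_predictor, pitch_embed):
    # Predict one pitch value per phoneme, embed it, then expand.
    pitch = pitch_predictor(hidden).squeeze(-1)         # (num_phonemes,)
    hidden = hidden + pitch_embed(pitch.unsqueeze(-1))  # add before expansion
    return torch.repeat_interleave(hidden, durations, dim=0)

def frame_level(hidden, durations, pitch_predictor, pitch_embed):
    # Expand first, then predict and embed pitch per frame
    # (the variant reported to perform worse in this thread).
    hidden = torch.repeat_interleave(hidden, durations, dim=0)
    pitch = pitch_predictor(hidden).squeeze(-1)         # (num_frames,)
    return hidden + pitch_embed(pitch.unsqueeze(-1))

dim = 4
hidden = torch.randn(3, dim)                 # 3 phonemes
durations = torch.tensor([2, 3, 1])          # frames per phoneme
predictor = torch.nn.Linear(dim, 1)
embed = torch.nn.Linear(1, dim)
out_p = phoneme_level(hidden, durations, predictor, embed)  # (6, 4)
out_f = frame_level(hidden, durations, predictor, embed)    # (6, 4)
```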