Why is the wav quality with frame level much worse than with phoneme level? #52
Comments
Did you compare your result with FastSpeech 1, Tacotron, or other state-of-the-art methods? What is the result?
Thank you @WuMing757
@Liujingxiu23 @WuMing757 When I did frame-level pitch and energy prediction, the results were not so good: at testing time the model tended to predict a constant value for every frame within a phoneme, since the frame-level hidden features are copied from the same phoneme-level feature. At training time, however, the ground-truth pitch and energy values can vary within a phoneme, which differs from the testing-time case. This problem does not exist in the phoneme-level pitch and energy modeling scenario, so the model performs much better there. You may wonder how the model knows about the intra-phoneme variation given only a phoneme-level pitch/energy value. I have to say, the decoder is much more powerful than you think, and it nails it.
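For readers less familiar with the architecture, here is a minimal sketch (PyTorch, with hypothetical names; not the repo's exact code) of the mechanism described above: the length regulator copies each phoneme-level vector to every frame it covers, so a deterministic frame-level predictor receives identical inputs for all frames of a phoneme and can only output a constant value for them.

```python
import torch

def length_regulate(phoneme_hidden, durations):
    # Expand (num_phonemes, dim) features to (num_frames, dim) by
    # repeating each phoneme's vector `durations[i]` times.
    return torch.repeat_interleave(phoneme_hidden, durations, dim=0)

phoneme_hidden = torch.randn(3, 4)       # 3 phonemes, 4-dim features
durations = torch.tensor([2, 3, 1])      # frames per phoneme
frame_hidden = length_regulate(phoneme_hidden, durations)  # shape (6, 4)

# Frames 0-1 are exact copies of phoneme 0's vector, frames 2-4 of
# phoneme 1's, and so on. A deterministic pitch predictor applied to
# frame_hidden therefore outputs the same value for every frame of a
# phoneme, even though the ground-truth pitch it saw in training varied.
predictor = torch.nn.Linear(4, 1)
print(predictor(frame_hidden).squeeze(-1))  # constant within each phoneme
```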
The performance is undoubtedly better than FastSpeech 1, but I think it is slightly worse than Tacotron 2 and other autoregressive models.
Thank you for the detailed explanation. In my case, at both the phoneme level and the frame level, the pitch predictor seems to overfit: the pitch validation loss increases while the training loss and the other validation losses decrease. Even so, the performance of the model is great. Do you have any ideas about this?
@WuMing757 I think slight overfitting is not a bad thing compared with underfitting, at least for TTS models. Underfitting models may generate speech with flat tone and prosody, which annoys listeners.
closed #52
I think this might be due to CWT (continuous wavelet transform) pitch modeling missing from this implementation, with phoneme averaging making up for it by reducing the resolution of the pitch to be predicted.
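To illustrate the phoneme-averaging idea concretely, here is a minimal sketch (a hypothetical helper, not necessarily the repo's exact code): the frame-level ground-truth pitch contour is averaged over each phoneme's frames using the alignment durations, and the predictor is trained against these lower-resolution, one-value-per-phoneme targets.

```python
import torch

def average_by_duration(frame_pitch, durations):
    # frame_pitch: (num_frames,) ground-truth pitch contour
    # durations:   (num_phonemes,) frames per phoneme, summing to num_frames
    # Returns one averaged pitch value per phoneme.
    segments = torch.split(frame_pitch, durations.tolist())
    return torch.stack([seg.mean() for seg in segments])

frame_pitch = torch.tensor([110., 120., 130., 200., 210., 190.])
durations = torch.tensor([3, 3])
print(average_by_duration(frame_pitch, durations))  # tensor([120., 200.])
```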
Can anyone please explain the frame-level pitch predictor in this project to me? It seems that the output from fc_layer has the same length as the input characters. Why can it be frame-level after a frame-level masked_fill?
The frame-level implementation seems closer to the implementation in the paper. But why is the quality much worse than with the phoneme level? And with the phoneme level, the pitch is predicted before the length expansion, which results in the pitch being the same for every frame of a phoneme. I think the pitches within a phoneme sometimes change. But... the phoneme-level variant has excellent performance...
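To make the ordering difference in this thread concrete, here is a minimal sketch (hypothetical module names, assuming a simple linear pitch embedding rather than the repo's actual embedding layer) contrasting the two variants: the phoneme-level variant predicts pitch and adds its embedding before length expansion, so the added pitch information is identical for all frames of a phoneme; the frame-level variant expands first and predicts per frame.

```python
import torch

def phoneme_level(hidden, durations, pitch_predictor, pitch_embed):
    # Predict one pitch value per phoneme, embed it, then expand.
    pitch = pitch_predictor(hidden).squeeze(-1)         # (num_phonemes,)
    hidden = hidden + pitch_embed(pitch.unsqueeze(-1))  # add before expansion
    return torch.repeat_interleave(hidden, durations, dim=0)

def frame_level(hidden, durations, pitch_predictor, pitch_embed):
    # Expand first, then predict and embed pitch per frame
    # (the variant reported to perform worse in this thread).
    hidden = torch.repeat_interleave(hidden, durations, dim=0)
    pitch = pitch_predictor(hidden).squeeze(-1)         # (num_frames,)
    return hidden + pitch_embed(pitch.unsqueeze(-1))

dim = 4
hidden = torch.randn(3, dim)                 # 3 phonemes
durations = torch.tensor([2, 3, 1])          # frames per phoneme
predictor = torch.nn.Linear(dim, 1)
embed = torch.nn.Linear(1, dim)
out_p = phoneme_level(hidden, durations, predictor, embed)  # (6, 4)
out_f = frame_level(hidden, durations, predictor, embed)    # (6, 4)
```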