Why is the wav quality with frame-level prediction much worse than with phoneme-level? #52

Closed
WuMing757 opened this issue Apr 25, 2021 · 10 comments

@WuMing757

The frame-level implementation seems closer to the one in the paper, so why is the quality much worse than with the phoneme level? Also, at the phoneme level, the pitch is predicted before the length expansion, which means every frame within a phoneme gets the same pitch. I think the pitch within a phoneme sometimes changes. Yet the phoneme-level version performs excellently...

@Liujingxiu23

Did you compare your results with FastSpeech 1, Tacotron, or other state-of-the-art methods? What were the results?

@WuMing757 (Author)

No, I did not try FastSpeech 1 or Tacotron. Maybe you meant to ask the repository owner, @ming024.

@Liujingxiu23

Thank you @WuMing757

@ming024 (Owner) commented Apr 27, 2021

@Liujingxiu23 @WuMing757 When I tried frame-level pitch and energy prediction, the results were not as good: at inference time the model tended to predict a constant value for every frame within a phoneme, since the frame-level hidden features are all copied from the same phoneme-level feature. At training time, however, the ground-truth pitch and energy values can vary within a phoneme, so there is a mismatch between training and inference.

This problem does not exist in the phoneme-level pitch and energy modeling scenario, so the model performs much better. You may wonder how the model can produce intra-phoneme variation given only a phoneme-level pitch/energy value. The decoder is much more powerful than you might expect, and it handles this well.
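
For illustration, here is a minimal PyTorch sketch of the two orderings being discussed. It is not the repository's actual code; `pitch_predictor`, `pitch_embedding`, and `length_regulator` are simplified stand-ins for the usual FastSpeech 2-style modules.

```python
import torch
import torch.nn as nn

hidden = 8
pitch_predictor = nn.Linear(hidden, 1)   # stand-in for the real variance predictor
pitch_embedding = nn.Linear(1, hidden)   # stand-in for the pitch embedding

def length_regulator(x, durations):
    # Repeat each phoneme vector by its duration -> frame-level sequence.
    return torch.repeat_interleave(x, durations, dim=0)

x = torch.randn(4, hidden)               # encoder output: 4 phonemes
durations = torch.tensor([3, 2, 4, 1])   # 10 frames in total

# Phoneme-level variant: predict one pitch value per phoneme, then expand.
p = pitch_predictor(x)                                        # (4, 1)
y_phone = length_regulator(x + pitch_embedding(p), durations) # (10, hidden)

# Frame-level variant: expand first, then predict per frame. All frames of a
# phoneme are identical copies, so at inference the predictor sees the same
# input for every frame of a phoneme and tends to output a constant value.
xf = length_regulator(x, durations)      # (10, hidden)
p = pitch_predictor(xf)                  # (10, 1), constant within a phoneme
y_frame = xf + pitch_embedding(p)
```

At training time the frame-level variant is supervised with ground-truth pitch that does vary within a phoneme, which is exactly the train/inference mismatch described above.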

@ming024 (Owner) commented Apr 27, 2021

The performance is undoubtedly better than FastSpeech 1, but I think it is slightly worse than Tacotron 2 and other autoregressive models.

@WuMing757 (Author)

Thank you for the detailed explanation. In my case, at both the phoneme level and the frame level, the pitch predictor seems to overfit: the pitch validation loss increases while the training loss and the other validation losses decrease. Even so, the performance of the model is great. Do you have any ideas about this?

@ming024 (Owner) commented May 3, 2021

@WuMing757 I think slight overfitting is not a bad thing compared with underfitting, at least for TTS models. Underfitting models may generate speech with flat tone and prosody, which annoys listeners.

@ming024 closed this as completed May 26, 2021

@MiniXC commented Feb 23, 2022

I think this might be due to CWT (continuous wavelet transform) pitch modeling missing from this implementation, with phoneme averaging making up for it by reducing the resolution of the pitch that has to be predicted.
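
For context, the FastSpeech 2 paper decomposes the pitch contour into a "pitch spectrogram" with a continuous wavelet transform and predicts that instead of raw per-frame F0. A hedged sketch of what that decomposition could look like using PyWavelets follows; the scale choice and wavelet here are illustrative, not taken from the paper or this repository.

```python
import numpy as np
import pywt

def pitch_to_spectrogram(f0, num_scales=10):
    # Normalize log-F0; assumes unvoiced frames were already interpolated.
    logf0 = np.log(np.maximum(f0, 1e-5))
    logf0 = (logf0 - logf0.mean()) / (logf0.std() + 1e-8)
    scales = 2.0 ** np.arange(1, num_scales + 1)   # dyadic scales (illustrative)
    coeffs, _ = pywt.cwt(logf0, scales, "mexh")    # Mexican-hat wavelet
    return coeffs                                  # (num_scales, n_frames)

f0 = 120 + 30 * np.sin(np.linspace(0, 6, 400))    # synthetic pitch contour
spec = pitch_to_spectrogram(f0)                   # target for a CWT-based predictor
```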

@JohnHerry

Can anyone please explain the frame-level pitch predictor in this project? It seems that the output of the fc_layer has the same length as the input characters, so why is it frame-level after a frame-level masked_fill?
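
One likely answer, sketched below under the assumption of a FastSpeech 2-style variance adaptor (this is not the repository's exact code): in the frame-level configuration, the hidden sequence is expanded by the length regulator before the predictor runs, so the predictor's linear layer already operates on a frame-length sequence, and the mask passed to masked_fill must therefore also be frame-level.

```python
import torch

def expand(x, durations):
    # x: (n_phones, hidden); durations: (n_phones,) integer frame counts.
    return torch.repeat_interleave(x, durations, dim=0)  # (n_frames, hidden)

x = torch.randn(4, 8)                     # 4 phonemes, hidden size 8
durations = torch.tensor([3, 2, 4, 1])    # 10 frames in total
frames = expand(x, durations)             # (10, 8): frame-level predictor input
mask = torch.zeros(10, dtype=torch.bool)  # frame-level padding mask
pitch = frames.sum(dim=1)                 # stand-in for the fc_layer output, (10,)
pitch = pitch.masked_fill(mask, 0.0)      # mask length matches n_frames, not
                                          # the number of input characters
```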
