text embedding dimension #23

Closed
Charlottecuc opened this issue Jun 22, 2020 · 6 comments
Comments

@Charlottecuc

Hi. Since I am trying to train Glow-TTS on Mandarin datasets, there are about 300 symbols in symbols.py, so it seems that I need to increase the text embedding depth. I noticed that in your paper you mention the following:
[Screenshot: hyperparameter table from the paper listing the Embedding Dimension]
Does the Embedding Dimension here stand for the "text embedding dimension"?
If so, which parameter should I modify here, hidden_channels or hidden_channels_enc?
[Screenshot: model hyperparameters from the config file]

Thank you very much!
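For context, here is a minimal sketch of how the text embedding dimension typically relates to hidden_channels in a Glow-TTS-style text encoder; the class name and exact signature are illustrative assumptions, not necessarily the code in this repository:

```python
import torch
import torch.nn as nn


class TextEncoderSketch(nn.Module):
    """Illustrative sketch: the embedding table maps each of the n_symbols
    token IDs to a vector of size hidden_channels, so the text embedding
    dimension is set by hidden_channels rather than by the symbol count."""

    def __init__(self, n_symbols: int, hidden_channels: int):
        super().__init__()
        self.emb = nn.Embedding(n_symbols, hidden_channels)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: [batch, text_len] -> [batch, text_len, hidden_channels]
        return self.emb(token_ids)


# e.g. ~300 Mandarin symbols with a 192-dimensional text embedding
enc = TextEncoderSketch(n_symbols=300, hidden_channels=192)
out = enc(torch.randint(0, 300, (2, 40)))
print(out.shape)  # torch.Size([2, 40, 192])
```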

@Charlottecuc
Author

I changed hidden_channels, hidden_channels_enc, and hidden_channels_dec to 512, but I still encountered the following problem: at inference time, some words are missing. For example, if I input:
[Screenshot: the input phoneme sequence]
I cannot hear r e4 in the synthesized voice. (The numbers here stand for tones in Mandarin.) Could you please give me some advice? Thanks!

@shahuzi

shahuzi commented Nov 8, 2020

Hi, you can try the trick of adding a blank token between any two input tokens. My experiments in Chinese show that this trick can improve pronunciation significantly.
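A minimal sketch of this blank-token trick, assuming integer token IDs; the function name and the choice of blank_id are illustrative assumptions, and this particular variant also pads a blank at both ends of the sequence:

```python
from typing import List


def intersperse_blank(token_ids: List[int], blank_id: int = 0) -> List[int]:
    """Insert a blank token between every pair of input tokens
    (this variant also adds a blank at the start and end)."""
    result = [blank_id] * (len(token_ids) * 2 + 1)
    result[1::2] = token_ids
    return result


print(intersperse_blank([5, 8, 13]))  # [0, 5, 0, 8, 0, 13, 0]
```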

@Charlottecuc
Author

Charlottecuc commented Nov 12, 2020

Hi @shahuzi, I'm very happy that you are also interested in training Glow-TTS with Mandarin datasets, and thank you very much for your suggestion. I will try it!

By the way, I encountered some alignment problems before (e.g. some words are always missing at inference time), and I'm not sure whether I gave the model the right input sequences. Could you kindly tell me what your input sequences look like? Are you also using phonemes as I mentioned above? Do you use prosodic labels (e.g. "#1 #3 #4 #5", which stand for pauses in a sentence)?
For example:
“把知识运用于实践,离不开思考” ("Applying knowledge to practice is inseparable from thinking") ⬇️
"REC-001.wav|start0 b a3 zh ix1 sh ix5 sp2 vn4 iong4 v2 sh ix2 j ian4 sp3 l i2 b u4 k ai1 s iy1 k ao3 end0"
in which "sp3" means long pause (e.g. comma) and sp2 means short pause (e.g. 换气短停顿).

@shahuzi

shahuzi commented Dec 6, 2020

@Charlottecuc Sorry for the late reply.
My English is not very good, so I'll reply in Chinese.
My inputs are a phoneme sequence, a tone sequence, and some prosody representations (#1, #3, etc.), much like yours.
Unlike yours, though, the phoneme sequence and prosody features are not combined into a single sequence; they are treated as parallel features. The three inputs are embedded separately and then concatenated along the channel dimension.
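A minimal sketch of the parallel-feature idea described above, assuming three aligned ID sequences of equal length; the embedding sizes and module names are illustrative assumptions, not taken from any actual implementation:

```python
import torch
import torch.nn as nn


class ParallelTextEmbedding(nn.Module):
    """Embed phoneme, tone, and prosody IDs separately, then concatenate
    the three embeddings along the channel dimension."""

    def __init__(self, n_phones, n_tones, n_prosody,
                 phone_dim=128, tone_dim=32, prosody_dim=32):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, phone_dim)
        self.tone_emb = nn.Embedding(n_tones, tone_dim)
        self.prosody_emb = nn.Embedding(n_prosody, prosody_dim)

    def forward(self, phones, tones, prosody):
        # Each input: [batch, text_len]
        # Output: [batch, text_len, phone_dim + tone_dim + prosody_dim]
        return torch.cat(
            [self.phone_emb(phones), self.tone_emb(tones), self.prosody_emb(prosody)],
            dim=-1,
        )


emb = ParallelTextEmbedding(n_phones=100, n_tones=6, n_prosody=5)
B, T = 2, 40
out = emb(torch.randint(0, 100, (B, T)),
          torch.randint(0, 6, (B, T)),
          torch.randint(0, 5, (B, T)))
print(out.shape)  # torch.Size([2, 40, 192])
```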

@Charlottecuc
Author

@shahuzi Hi, would you mind sharing a few audio samples you synthesized with glow-tts? Thank you very much.

@shahuzi

shahuzi commented Dec 9, 2020

Due to data security concerns, I can't provide a demo, sorry. My conclusion so far is: for broadcast-style voice corpora, synthesis works fine; for highly expressive corpora, synthesis runs into problems.
