
Result getting worse when I use ground truth duration. #9

Open · AlexanderXuan opened this issue Jun 21, 2021 · 53 comments

@AlexanderXuan commented Jun 21, 2021

Dear author, thank you for your contribution to TTS; this is a big step in E2E TTS. But when I use the ground-truth duration, aiming to train faster and get more accurate durations, the duration loss drops fast while the KL loss drops slowly. I only changed the attn matrix to use the true durations. I checked the composition of the loss but could not find the alignment-related part. Could you please give me some help with this problem?
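
For anyone trying the same thing, here is a minimal sketch (not the repo's code) of how ground-truth durations can be turned into the hard alignment matrix that replaces the MAS output. Durations are assumed to be integer frame counts per input token, and the result may need transposing to match the orientation the training step expects:

```python
import torch

def path_from_durations(durations, text_mask, mel_mask):
    # durations: [b, t_text] integer frames per token (zero-padded)
    # returns a 0/1 alignment of shape [b, t_text, t_mel]
    t_mel = mel_mask.shape[-1]
    cum = torch.cumsum(durations, dim=1)                  # end frame of each token
    start = (cum - durations).unsqueeze(-1)               # [b, t_text, 1]
    end = cum.unsqueeze(-1)
    frame = torch.arange(t_mel, device=durations.device)  # [t_mel]
    attn = ((frame >= start) & (frame < end)).float()     # token i covers [start, end)
    return attn * text_mask.unsqueeze(-1) * mel_mask.unsqueeze(1)
```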

@Liujingxiu23

@AlexanderXuan With the original duration training method, do you get good synthesized results?
I tried to train the model on my own Chinese dataset, but the training seems abnormal, and the wavs synthesized with 180k.pth are bad.

@AlexanderXuan (Author) commented Jul 6, 2021

@Liujingxiu23 Sorry for my late reply. My training went fine, but in my original training result the Chinese voice has some pitch problems; in my opinion this is caused by the VAE part.

@Liujingxiu23

@AlexanderXuan Thank you for your reply. I made some mistakes in my training; after I fixed them and retrained on my Chinese dataset, the synthesized wavs are excellent, without any pitch problem. Maybe the pitch problem is related to the speaker?

@AlexanderXuan (Author) commented Jul 7, 2021

@Liujingxiu23 I use the multi-speaker version with 6 speakers to train the model; the pitch has a slight problem at 64k.pth. Maybe my training time isn't enough, or my config has some problem. Could you give me an email address? I'd like to discuss some other problems with you.

@Liujingxiu23

@AlexanderXuan I also use train_ms.py to train a multi-speaker model, with 8 female speakers at a 16000 sample rate and all other configs left at default. I checked outputs for two speakers from checkpoint 65000.pth; the synthesized wavs are good, without any pitch problem.

But in the paper on Glow-WaveGAN, which is similar to VITS, the authors do add a pitch predictor: https://arxiv.org/abs/2106.10831?context=cs

@AlexanderXuan (Author)

@Liujingxiu23 Can you share some samples with me? My email address is xuanxiaoguang@gmail.com. Maybe I should train my model again.

@Liujingxiu23

@AlexanderXuan Sorry, I cannot. I work at a commercial company, not a research center, and our data is private.

@AlexanderXuan (Author)

@Liujingxiu23 OK, thank you.

@leminhnguyen

> @AlexanderXuan Thank you for your reply. I made some mistakes in my training; after I fixed them and retrained on my Chinese dataset, the synthesized wavs are excellent, without any pitch problem. Maybe the pitch problem is related to the speaker?

Hi @Liujingxiu23, what was wrong in your training?

@Liujingxiu23

@leminhnguyen I made mistakes in processing my Chinese text, i.e. the input symbols.

@leminhnguyen

@Liujingxiu23 Is your config the same as the default? How long did training take to reach 300k steps?

@Liujingxiu23 commented Jul 16, 2021

@leminhnguyen With the default model settings, except sample_rate=16000, it took 5 days to reach 300k steps with 2 GPUs (V100).

@leminhnguyen

@Liujingxiu23 For me, with a sample rate of 22050, it took about 8 days to reach 180k steps.

@Liujingxiu23

@leminhnguyen The training is not very fast, but it is convenient since it is end-to-end, compared to two-stage training. And training the original HiFi-GAN is also time-consuming.

@ductho9799

@leminhnguyen Hello bro, what dataset did you use for training? Vietnamese or English?

@leminhnguyen

@ductho9799 Hey bro, I've trained for Vietnamese.

@ductho9799

@leminhnguyen Which Vietnamese dataset did you train on? Can you share it with me?

@leminhnguyen

@ductho9799 Sorry, data is private so I cannot share it with you.

@ductho9799

@leminhnguyen Thank you so much! What company are you working at?

@leminhnguyen commented Aug 9, 2021

@ductho9799 Hey bro, I don't think this is the place for chatting, so please email me at leminhnguyen.bk@gmail.com; hope to hear from you soon!

@icyda17 commented Jan 4, 2022

@leminhnguyen Hello, I'm interested in your experiments on the Vietnamese task. Have you compared the quality of audio synthesized by VITS and FastSpeech2? If so, which one do you think is more natural? I'd be grateful if you shared your experience!

@leminhnguyen commented Jan 4, 2022

@icyda17 Hi, in my experiments VITS is better than FastSpeech2 in prosody and quality, but in some cases VITS suffered from mispronunciation.

@OnceJune commented Feb 9, 2022

Hi @AlexanderXuan, I'm trying to use ground-truth durations, but the added blank puzzles me. Should the blank be assigned any duration, or kept at zero?

@AlexanderXuan (Author)

@OnceJune In my opinion, the blank is used to work around the duration problem; if we use ground-truth durations, we don't need the blank. But I don't know whether this causes other problems, because my result has some issues. If you really need the blank, maybe you can set the blank duration to zero.
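
Something like this (an untested sketch, mirroring how commons.intersperse pads the text in this repo):

```python
def intersperse_with_zero_duration(phones, durations, blank_id=0):
    # mirror commons.intersperse(text, 0), keeping durations aligned:
    # a blank before every phone and one at the end, each with duration 0
    out_phones, out_durs = [blank_id], [0]
    for p, d in zip(phones, durations):
        out_phones += [p, blank_id]
        out_durs += [d, 0]
    return out_phones, out_durs

# e.g. ([12, 7], [5, 9]) -> ([0, 12, 0, 7, 0], [0, 5, 0, 9, 0]); total frames unchanged
```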

@OnceJune commented Feb 9, 2022

@AlexanderXuan Thank you, I will use zero for the blank. What's the problem in your result? Pitch or mispronunciation?

@icyda17 commented Feb 10, 2022

@leminhnguyen Thanks. Does mispronunciation in your case mean bad durations or tone issues? Btw, can I ask you more in a private email or on another chat platform?

@leminhnguyen

> Thanks. Does mispronunciation in your case mean bad durations or tone issues? Btw, can I ask you more in a private email or on another chat platform?

You can contact me via leminhnguyen.bk@gmail.com 😃

@hdmjdp commented Feb 16, 2022

> Dear author, thank you for your contribution to TTS; this is a big step in E2E TTS. But when I use the ground-truth duration, aiming to train faster and get more accurate durations, the duration loss drops fast while the KL loss drops slowly. I only changed the attn matrix to use the true durations. I checked the composition of the loss but could not find the alignment-related part. Could you please give me some help with this problem?

Have you gotten good results using the ground-truth duration?

@Liujingxiu23 commented Feb 23, 2022

@AlexanderXuan I have been trying to use ground-truth durations these days, but training fails (at 10k steps everything works well, but at about 15k the loss and grad turn to NaN; I use the normal duration model, not the stochastic one). How did you compute loss_dur? Is it the same as the original code, i.e. l_length = torch.sum((log(duration_true) - log(duration_pred))**2, [1,2]) / torch.sum(x_mask)?
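
(For what it's worth, one plausible NaN source is log(0) on zero-length tokens such as blanks or padding; if I read the stock code correctly, it guards this with log(w + 1e-6). A guarded sketch of the loss quoted above:)

```python
import torch

def duration_loss(logw_pred, dur_true, x_mask, eps=1e-6):
    # logw_pred: predictor output in the log domain, [b, 1, t_text]
    # dur_true:  ground-truth frame counts, [b, 1, t_text]
    logw_true = torch.log(dur_true.float() + eps) * x_mask  # eps avoids log(0) -> -inf/NaN
    return torch.sum((logw_pred * x_mask - logw_true) ** 2) / torch.sum(x_mask)
```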

@Liujingxiu23

Has anyone succeeded in training with ground-truth durations and then gotten precise time boundaries for phones and good waves at inference?

@OnceJune commented Mar 7, 2022

@Liujingxiu23 Have you computed a loss against the path generated by MAS? Since the duration is actually a sum over the MAS path, if you only use the ground-truth durations in the duration predictor, the MAS path cannot share the benefit (the loss is cut off in the duration predictor by x.detach()); this might lead to some problems.
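
(A toy illustration of the cut-off, assuming nothing about the rest of the model:)

```python
import torch

x = torch.randn(2, 192, 10, requires_grad=True)  # stand-in for the text encoding
w = torch.randn(1, requires_grad=True)           # stand-in for predictor weights

loss = (w * torch.detach(x) ** 2).mean()         # detach as in the duration predictor
loss.backward()
print(x.grad)  # None: the duration loss never reaches whatever produced x
print(w.grad)  # populated: only the predictor side learns from this loss
```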

@Liujingxiu23

@OnceJune I did not use MAS when I used ground-truth durations; I used the duration-predictor model and a length regulator. Everything works well except "wrong pronunciation" in the final generated waves.

I cannot fully understand what you mean. Do you mean:
attn = monotonic_align.maximum_path(neg_cent, attn_mask.squeeze(1)).unsqueeze(1).detach()
-> attn = monotonic_align.maximum_path(neg_cent, attn_mask.squeeze(1)).unsqueeze(1)

Have you done this successfully? I mean, used ground-truth durations and gotten good wavs with good pronunciation and precise phone boundaries?
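
(For reference, the length regulator I mean is the usual FastSpeech-style expansion; a minimal sketch:)

```python
import torch

def length_regulate(h, durations):
    # h: [t_text, channels]; durations: [t_text] integer frame counts
    # repeats each token's encoding by its duration -> [t_mel, channels]
    return torch.repeat_interleave(h, durations, dim=0)

# e.g. length_regulate(torch.eye(3), torch.tensor([2, 0, 1])) keeps rows 0, 0, 2
```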

@OnceJune commented Mar 7, 2022

@Liujingxiu23

> I cannot fully understand what you mean. Do you mean:
> attn = monotonic_align.maximum_path(neg_cent, attn_mask.squeeze(1)).unsqueeze(1).detach()
> -> attn = monotonic_align.maximum_path(neg_cent, attn_mask.squeeze(1)).unsqueeze(1)

No, I mean vits/models.py, line 119 in 2e561ba:

x = torch.detach(x)

I trained with MAS and ground-truth durations, and mispronunciation exists.

@Liujingxiu23

@OnceJune Thank you for your reply. My target language is also Chinese, with add_blank=0. The duration-predictor module is good. The only problem is mispronunciation, and it happens on phones without a consonant, such as "yun", "wo", "ying".

@OnceJune commented Mar 7, 2022

@Liujingxiu23 Try adding a placeholder to non-consonant syllables? Like yun => @ vn, where "@" is a placeholder with no pronunciation or duration; this works for me.
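
Roughly like this (illustrative only; the zero-initial table and split_with_lexicon are hypothetical stand-ins for your own G2P):

```python
PLACEHOLDER = "@"  # no pronunciation; its ground-truth duration is 0

ZERO_INITIAL = {"yun": "vn", "wo": "uo", "ying": "ing"}  # made-up example mappings

def to_phones(syllable):
    if syllable in ZERO_INITIAL:                  # syllable with no consonant initial
        return [PLACEHOLDER, ZERO_INITIAL[syllable]]
    return split_with_lexicon(syllable)           # hypothetical: your normal initial/final split

# e.g. to_phones("yun") -> ["@", "vn"], so "b ai vn" becomes "b ai @ vn"
# with durations [5, 13, 15] -> [5, 13, 0, 15]
```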

@Liujingxiu23 commented Mar 7, 2022

@OnceJune I tried a similar way, but gave the placeholder a duration value obtained from intervals or MFA. You say "@" is a placeholder with no pronunciation or duration; what do you mean by "no pronunciation"? Do you mean "b ai vn" with times "5 13 15" -> "b ai @ vn" with times "5 13 0 15"?

@OnceJune commented Mar 7, 2022

@Liujingxiu23 Yes. And since the result using ground-truth durations was not good, I let MAS learn the durations; the input is still "b ai @ vn", with the placeholder.

@Liujingxiu23 commented Mar 7, 2022

@OnceJune OK, I understand. What about your final result? Did it have mispronunciation? And what about the duration result: do the durations of the phones correspond precisely to the wave? For example, if the duration result says a phone spans 1.35s~1.40s, does the wave really match that?

I tried letting MAS learn the durations, with add_blank=0; the final phone durations do not correspond accurately to the audio. For example, in the wave the last phone, "sil" (silence), is about 5 frames, but the duration model predicts 13. The wave sounds good, though.

@OnceJune commented Mar 7, 2022

@Liujingxiu23 The final result is good; I didn't check the precise durations.

> the final phone durations do not correspond accurately to the audio

I think it is normal for the predicted durations to differ from the ground truth. If you need better control of the leading and trailing silence, you can try trimming the input audio so VITS will not produce any silence at the start or end; then you can prepend and append exactly N frames of zeros to the output samples.
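
(Concretely, something like this, assuming 22050 Hz audio and a hop size of 256:)

```python
import numpy as np

HOP = 256  # assumed hop length; use your config's value

def pad_silence(wav, lead_frames, tail_frames):
    # prepend/append an exact number of silent frames to the synthesized wave
    lead = np.zeros(lead_frames * HOP, dtype=wav.dtype)
    tail = np.zeros(tail_frames * HOP, dtype=wav.dtype)
    return np.concatenate([lead, wav, tail])

# e.g. pad_silence(wav, 20, 20) adds ~0.23 s of true silence on each side at 22050 Hz
```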

@Liujingxiu23

@OnceJune Not only the timing of the final sil fails to correspond accurately to the audio, but that of all the phones; I just took the last phone as an example. Thank you very much for your reply. I will do some other experiments and discuss with you if I reach any conclusions.

@candlewill commented May 9, 2022

> @OnceJune Not only the timing of the final sil fails to correspond accurately to the audio, but that of all the phones; I just took the last phone as an example. Thank you very much for your reply. I will do some other experiments and discuss with you if I reach any conclusions.

Yes, even though the speech synthesized by VITS is fine, we cannot obtain correct phoneme durations from MAS.
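
(For reference, the durations under discussion are read off the MAS alignment by summing over the frame axis, i.e. w = attn.sum(2) in the training step, if I remember the code correctly; the complaint is that these sums give good audio but don't match forced-alignment boundaries:)

```python
import torch

# toy hard alignment from maximum_path: [b=1, 1, t_spec=6, t_text=3]
attn = torch.tensor([[[[1, 0, 0],
                       [1, 0, 0],
                       [0, 1, 0],
                       [0, 1, 0],
                       [0, 1, 0],
                       [0, 0, 1]]]], dtype=torch.float)
w = attn.sum(2)  # frames per text token
print(w)         # tensor([[[2., 3., 1.]]]) -> compare against MFA boundaries
```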

@yingfenging

@Liujingxiu23

Hello. In the end, did you use MAS or ground-truth durations? And did it have mispronunciation?
Thanks.

@Liujingxiu23

> Hello. In the end, did you use MAS or ground-truth durations? And did it have mispronunciation? Thanks.

Yes, I use MAS; mispronunciation really exists after careful listening.

@tuannvhust

@leminhnguyen Hi, I'm running into some trouble with the quality of my results. Could I contact you by email?

@leminhnguyen

@tuannvhust yeah, you're welcome

@ZhaoZeqing

> Dear author, thank you for your contribution to TTS; this is a big step in E2E TTS. But when I use the ground-truth duration, aiming to train faster and get more accurate durations, the duration loss drops fast while the KL loss drops slowly. I only changed the attn matrix to use the true durations. I checked the composition of the loss but could not find the alignment-related part. Could you please give me some help with this problem?

@AlexanderXuan Hi, I got the same problem when using the true durations instead of MAS, and I set use_sdp to False. The KL loss couldn't drop to lower values. Did you make any progress? Thanks!

@weixsong commented Apr 6, 2023

> Thanks. Does mispronunciation in your case mean bad durations or tone issues? Btw, can I ask you more in a private email or on another chat platform?
>
> You can contact me via leminhnguyen.bk@gmail.com 😃

I met the same issue with mispronunciation of some phonemes in Chinese; if no blank is interspersed, the mispronunciation issue gets worse.

@leminhnguyen

@weixsong I've replied to you in the email.

@elch10 commented Jun 7, 2023

@weixsong @leminhnguyen what were your results without interspersing blanks? I also have mispronunciation and want to fix it.

@weixsong commented Jun 21, 2023

> @weixsong @leminhnguyen what were your results without interspersing blanks? I also have mispronunciation and want to fix it.

If no blank is used between phonemes, the results are bad; many phonemes get incomplete pronunciation.

I think the mispronunciation issue may be caused by the normalizing flow itself, which makes it hard to fix.

@JohnHerry

> VAE

The same problem: I have trained for over 1,000,000 steps, and sometimes some phonemes get a strange pitch. The second problem is that when the input phoneme sequence is short, the synthesis is not stable; e.g., if I put in just one or two pinyin syllables for inference, all kinds of bad speech can come out.

@phamkhactu

Hi @OnceJune @Liujingxiu23 @ZhaoZeqing,
if I want to use ground-truth durations, do I just set use_sdp = False?
Thank you
