Need some more detailed information about the training data distribution #780
Thank you for the help!
@neonbjb Hi James, I have a few questions about using the VQVAE. Could you kindly share some more information?
Hi, OnceJune. 2. As mentioned above, the "dead code" phenomenon is not good for a VQVAE; it means only part of your codebook ends up being used. It is true that constricting the VQVAE codebook dim helps to alleviate the dead-code problem, but if "improving performance" in your question means improving the quality of the VQVAE decoder's mel spectrograms, the answer is no. In my experiments, constricting the VQVAE codebook dim may hurt the VQVAE decoder's ability. In contrast, if you keep the codebook dim higher, e.g. 64 or 128, the VQVAE decoder output may be somewhat better than with a low codebook dim, but a higher codebook dim does produce more dead codes, which means worse results on unseen samples.
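If you want to quantify the dead-code problem on your own data, a quick check is to histogram which codebook entries the quantizer actually emits over a held-out set. A minimal sketch in PyTorch; `get_code_indices` and the dataloader are placeholders for your own quantizer and pipeline, not TorToiSe's actual API:

```python
# Minimal sketch: histogram which codebook entries a trained VQ-VAE
# actually uses, to count "dead codes". `vqvae.get_code_indices` and
# `dataloader` are hypothetical stand-ins for your own quantizer and
# data pipeline.
import torch

@torch.no_grad()
def codebook_usage(vqvae, dataloader, codebook_size, device="cuda"):
    counts = torch.zeros(codebook_size, dtype=torch.long)
    vqvae.eval()
    for mel in dataloader:
        # Assumed to return a LongTensor of selected codebook indices,
        # one per quantized frame.
        indices = vqvae.get_code_indices(mel.to(device))
        counts += torch.bincount(indices.flatten().cpu(),
                                 minlength=codebook_size)
    dead = int((counts == 0).sum())
    print(f"dead codes: {dead}/{codebook_size} "
          f"({100.0 * dead / codebook_size:.1f}%)")
    return counts
```

Entries with a count of zero are the dead codes; as noted above, a higher codebook dim tends to leave more of them unused.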
@JohnHerry Thank you so much for your help. I'm using diffusion as the mel vocoder, as TorToiSe does. Let me try different codebook dim sizes; hopefully I can find a number that works for both seen and unseen samples.
@neonbjb
Hi, James, I have learned a lot from your paper https://arxiv.org/pdf/2305.07243, thank you.
But I need some more detailed information that I did not find in the paper. Could you share more detail about TorToiSe?
(1) How many speakers are in the training data of the VQVAE and the AR model? E.g., how many speakers should the AR model's training dataset contain at a minimum?
(2) What is the shortest mel length for training the AR model? I see that samples should not be shorter than 2 seconds for VQVAE training, but nothing is mentioned about the AR model.
(3) Are there any requirements on the per-speaker sample distribution? I see that in your extended dataset all samples are 5-20 second clips, but are there any suggestions from a distribution point of view? E.g., if I have a lot of speakers, each with only minutes of speech, and some with only tens of seconds, can I use this kind of dataset to train the VQVAE or the AR model?
(4) When training the AR model, if the training dataset is unbalanced in its speaker distribution, e.g. some speakers have hours of speech while others have fewer than 10 samples (maybe less than 1 minute of total speech), does that mean the small-sample speakers will get worse TTS results than the large-sample speakers? How many samples per speaker are enough for the AR model to produce a good enough result? (To make the imbalance concrete, the sketch below shows how I tabulate per-speaker durations in my data.)
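Here is the kind of tabulation I run over my metadata (a minimal sketch; the (speaker, seconds) pair format is just my own, not anything from TorToiSe):

```python
# A minimal sketch for checking per-speaker duration balance before
# training. `metadata` is assumed to be an iterable of
# (speaker_id, clip_seconds) pairs, a hypothetical format.
from collections import defaultdict

def speaker_duration_stats(metadata):
    totals = defaultdict(float)
    for speaker, seconds in metadata:
        totals[speaker] += seconds
    ranked = sorted(totals.items(), key=lambda kv: kv[1])
    print(f"speakers: {len(ranked)}")
    print(f"least data: {ranked[0][0]} ({ranked[0][1]:.0f}s)")
    print(f"most data:  {ranked[-1][0]} ({ranked[-1][1]:.0f}s)")
    print(f"speakers under 60s total: {sum(1 for _, s in ranked if s < 60)}")
    return totals
```

On my data this shows a long tail of speakers with only tens of seconds of total speech, which is what prompted questions (3) and (4).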