
need some more detailed information about the training data distribution #780

Closed
JohnHerry opened this issue May 20, 2024 · 4 comments

@JohnHerry

@neonbjb
Hi James, I have learned a lot from your paper https://arxiv.org/pdf/2305.07243, thank you.
But I need some more detailed information that I did not find in the paper. Could you share more details about Tortoise?

(1) How many speakers are in the training data of the VQVAE and the AR model? E.g., how many speakers should the AR model's training dataset contain at a minimum?
(2) What is the shortest mel length used to train the AR model? I see that samples should not be shorter than 2 seconds for VQVAE training, but nothing is mentioned for the AR model.
(3) Are there any requirements on the per-speaker sample distribution? I see that in your Extended Dataset all samples are 5-20 second clips, but are there any suggestions regarding the distribution? E.g., if I have a lot of speakers, each with only minutes of speech, and some with only tens of seconds, can I use this kind of dataset to train the VQVAE or the AR model?
(4) When training the AR model, what if the training dataset is unbalanced in its speaker distribution? E.g., some speakers may have hours of speech while others may have fewer than 10 samples [maybe less than 1 minute of total speech duration]. Does that mean the low-sample speakers will get worse TTS results than the high-sample speakers? How many samples per speaker are enough for the AR model to produce a good result?

@neonbjb
Owner

neonbjb commented May 20, 2024

  1. I don't really know, but I'd estimate O(100k) unique speakers for the AR and diffusion model datasets. The VQVAE was trained on LibriTTS.
  2. Sorry, I don't remember the exact minimum. It was somewhere around 5s. This is a tricky design detail for Tortoise, as you really need to regularize the model to handle arbitrary text and audio lengths, but if you allow input lengths that are too small, you waste a lot of compute.
  3. Unless your use-case specifically calls for long text inputs, I'd consider using Whisper or similar to break up your audio (a rough segmentation sketch follows after this list). Most uses for TTS only operate at the sentence level, so training a model on very long outputs is training it to operate in a mode that it'll rarely see at inference time. It's also very expensive to train on very long sequences.
  4. Yes, the model will optimize the statistics of the dataset (like all likelihood models do!) so more-present speakers will perform better than less-present ones. This can be improved with fine-tuning or scaling or both. If you really want to get good at a single speaker, I'd train a large model and fine-tune on your single speaker.
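Since point 3 mentions Whisper, here is a minimal sketch of what sentence-level segmentation could look like. This is only an illustration assuming the openai-whisper and pydub packages, not Tortoise's actual preprocessing pipeline; the paths and the 20-second cap are hypothetical.

```python
# Sketch: split long recordings into roughly sentence-level clips using
# Whisper's segment timestamps. Paths and limits below are examples only.
import os

import whisper                   # openai-whisper
from pydub import AudioSegment   # used here just to slice the audio file

MAX_CLIP_SECONDS = 20            # hypothetical cap, mirroring 5-20s training clips

def split_audio(path: str, out_dir: str) -> None:
    os.makedirs(out_dir, exist_ok=True)
    model = whisper.load_model("base")
    result = model.transcribe(path)               # returns {"text", "segments", ...}
    audio = AudioSegment.from_file(path)
    for i, seg in enumerate(result["segments"]):  # each segment has start/end in seconds
        start_ms = int(seg["start"] * 1000)
        end_ms = int(min(seg["end"], seg["start"] + MAX_CLIP_SECONDS) * 1000)
        audio[start_ms:end_ms].export(
            os.path.join(out_dir, f"clip_{i:04d}.wav"), format="wav")
        # keep the transcript alongside the clip for TTS training
        with open(os.path.join(out_dir, f"clip_{i:04d}.txt"), "w") as f:
            f.write(seg["text"].strip())

# split_audio("long_recording.wav", "clips/")
```

Adjacent short segments could also be merged so clips stay above the roughly 2-5 second minimum discussed in point 2.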

@JohnHerry
Author

Thank you for the help!

@OnceJune

OnceJune commented Jun 19, 2024

The VQVAE was trained on LibriTTS.

@neonbjb Hi James, I have a few questions about using the VQVAE; could you kindly share some more information?

  1. The VQVAE's training data is much smaller (and possibly much better quality) than the AR model's; does this design help improve synthesis quality?
  2. I read in your paper that constricting the VQVAE codebook dim helps a lot in improving performance; could you explain this in more detail?
    Thank you in advance!

@JohnHerry
Author

Hi, OnceJune.
I would like to share something about the VQVAE; please supplement or correct me if my opinion is not consistent with your experiments.
1. For zero-shot TTS, I think the VQVAE decoder is not good enough as a mel-vocoder (especially for unseen speakers). By "mel-vocoder" I mean the module that translates quantized codes back into a mel spectrogram. @neonbjb said in his paper that he employed the diffusion model as the mel-vocoder, so the focus of the VQVAE should be more on its encoding ability. The VQVAE assumes a uniform distribution over quantized codes, so the "dead code" phenomenon is harmful when it is used as an encoder. In your first question you said "improve synth quality"; I am sorry if I misunderstand you, but I think you want to use the VQVAE decoder as the mel-vocoder.

2. As mentioned above, the "dead code" phenomenon is not good for the VQVAE; it means only part of your codebook ends up being used. It is true that constricting the VQVAE codebook dim helps to alleviate the dead-code problem, but if "improving performance" in your question means improving the quality of the mel spectrograms from the VQVAE decoder, the answer is no. In my experiments, constricting the VQVAE codebook dim may hurt the decoder's ability. In contrast, if you keep the codebook dim higher, e.g. 64 or 128, the VQVAE decoder output may be somewhat better than with a low codebook dim, but a higher codebook dim does produce more dead codes, which means worse results for unseen samples. (See the sketch below for one way to measure this.)
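To make the codebook-dim discussion concrete, below is a small PyTorch sketch (my own illustration, not the actual Tortoise VQVAE) of a quantizer that projects encoder features down to a constricted codebook dim before the nearest-neighbor lookup and tracks code usage so dead codes can be counted. All layer sizes are made up for the example.

```python
# Sketch of a vector quantizer with a constricted codebook dim plus a
# code-usage statistic; all hyperparameters here are illustrative only.
import torch
import torch.nn as nn

class LowDimVQ(nn.Module):
    def __init__(self, feature_dim=512, codebook_size=8192, codebook_dim=32):
        super().__init__()
        # project encoder features into the (small) codebook space and back
        self.down = nn.Linear(feature_dim, codebook_dim)
        self.up = nn.Linear(codebook_dim, feature_dim)
        self.codebook = nn.Embedding(codebook_size, codebook_dim)
        self.register_buffer("usage", torch.zeros(codebook_size))

    def forward(self, x):                      # x: (batch, time, feature_dim)
        z = self.down(x)                       # (batch, time, codebook_dim)
        flat = z.reshape(-1, z.shape[-1])
        # nearest codebook entry by Euclidean distance
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=-1)
        quantized = self.codebook(idx).view_as(z)
        # record which codes were selected so dead codes can be counted later
        self.usage += torch.bincount(idx, minlength=self.codebook.num_embeddings).float()
        # straight-through estimator so gradients still reach the encoder
        quantized = z + (quantized - z).detach()
        return self.up(quantized), idx.view(z.shape[:-1])

    def dead_code_fraction(self) -> float:
        return (self.usage == 0).float().mean().item()
```

Running a held-out set through the quantizer and checking dead_code_fraction() is one way to observe the trade-off described above: a small codebook dim (e.g. 8 or 32) tends to keep more codes alive, while a larger dim (e.g. 64 or 128) may reconstruct mels a bit better but leaves more codes dead.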
