
need some more detailed information about the training data distribution #780

Closed
JohnHerry opened this issue May 20, 2024 · 5 comments
Comments

@JohnHerry

@neonbjb
Hi James, I have learned a lot from your paper https://arxiv.org/pdf/2305.07243, thank you.
But I need some more detailed information which I did not find in the paper. Could you share more details about Tortoise?

(1) How many speakers are in the training data of the VQVAE and AR model? E.g., how many speakers should the AR model's training dataset contain at a minimum?
(2) What is the shortest mel length used to train the AR model? I see that samples should not be shorter than 2 seconds for VQVAE training, but nothing is mentioned for the AR model.
(3) Are there any requirements on the per-speaker sample distribution? I see that in your extended dataset all samples are 5-20 second clips, but are there any suggestions from the point of view of distribution? E.g., if I have many speakers who each have only minutes of speech, and some of them have only tens of seconds, can I use this kind of dataset to train the VQVAE or AR model?
(4) When training the AR model, what if the training dataset is unbalanced in speaker distribution? E.g., some speakers may have hours of speech while others may have fewer than 10 samples (perhaps less than 1 minute of total speech). Does that mean the speakers with few samples will get worse TTS results than the speakers with many samples? How many samples per speaker are enough for the AR model to produce a good enough result?

@neonbjb
Owner

neonbjb commented May 20, 2024

  1. Don't really know, but I'd estimate on the order of 100k unique speakers for the AR and diffusion model datasets. The VQVAE was trained on LibriTTS.
  2. Sorry, I don't remember the exact minimum. It was somewhere around 5s. This is a tricky design detail for Tortoise: you really need to regularize the model so it can handle arbitrary text and audio lengths, but if you allow inputs that are too short, you waste a lot of compute.
  3. Unless your use-case specifically calls for long text inputs, I'd consider using Whisper or similar to break up your audio (see the sketch after this list). Most uses of TTS only operate at the sentence level, so training a model on very long outputs is training it to operate in a mode that it'll rarely see at inference time. It's also very expensive to train on very long sequences.
  4. Yes, the model will optimize the statistics of the dataset (like all likelihood models do!) so more-present speakers will perform better than less-present ones. This can be improved with fine-tuning or scaling or both. If you really want to get good at a single speaker, I'd train a large model and fine-tune on your single speaker.
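A minimal sketch of the Whisper-based splitting suggested in point 3 (not part of the Tortoise repo; the file names, output directory, and the 5-20 s filter are illustrative assumptions):

```python
# Transcribe a long recording with Whisper, then cut it into sentence-level
# clips at the segment boundaries Whisper reports.
import os
import whisper          # pip install openai-whisper
import soundfile as sf  # pip install soundfile

audio_path = "long_recording.wav"           # hypothetical input file
model = whisper.load_model("base")
result = model.transcribe(audio_path)       # returns {"segments": [...], ...}

audio, sr = sf.read(audio_path)
os.makedirs("clips", exist_ok=True)
for i, seg in enumerate(result["segments"]):
    start, end = seg["start"], seg["end"]   # segment boundaries in seconds
    if not (5.0 <= end - start <= 20.0):    # keep only 5-20 s clips, matching the dataset described in the paper
        continue
    clip = audio[int(start * sr):int(end * sr)]
    sf.write(f"clips/long_recording_{i:04d}.wav", clip, sr)
```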

@JohnHerry
Author

Thank you for the help!

@OnceJune

OnceJune commented Jun 19, 2024

The VQVAE was trained on LibriTTS.

@neonbjb Hi James, I got a few questions about using VQVAE, could you kindly share some more information?

  1. The training data of the VQVAE is much smaller (though perhaps much higher quality) than that of the AR model; does this design help to improve synthesis quality?
  2. I read in your paper that constricting the VQVAE codebook dim helps a lot with performance; could you explain this in more detail?
     Thank you in advance!

@JohnHerry
Author

Hi OnceJune,
I would like to share something about the VQVAE; please supplement or correct me if my opinion is not consistent with your experiments.
1. I think that for zero-shot TTS the VQVAE decoder is not good enough as a mel-vocoder (especially for unseen speakers). By "mel-vocoder" I mean the component that translates quantized codes into a mel spectrogram. @neonbjb said in his paper that he employed the diffusion model as the mel-vocoder, so the focus of the VQVAE should be more on its encoding ability. The VQVAE assumes a uniform distribution over quantized codes, so the "dead code" phenomenon is harmful when it is used as an encoder. In your first question you said "improve synth quality"; I am sorry if I misunderstand you, but I think you mean using the VQVAE decoder as the mel-vocoder.

2. As mentioned above, the "dead code" phenomenon is bad for the VQVAE: it means only part of your codebook ends up being used. It is true that constricting the VQVAE codebook dim helps to alleviate the "dead code" problem, but if "improving performance" in your question means improving the quality of the VQVAE decoder's mel spectrograms, the answer is no. In my experiments, constricting the VQVAE codebook dim may hurt the decoder's ability. In contrast, if you keep the codebook dim higher, e.g. 64 or 128, the VQVAE decoder output may be somewhat better than with a low codebook dim, but a higher codebook dim does produce more "dead codes", which means worse results for unseen samples.
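
A minimal, generic sketch of how one could measure the "dead code" problem discussed above: run a held-out set through the VQ-VAE quantizer and count which codebook entries are never selected. `vqvae.encode(mel)` returning code indices and `vqvae.codebook_size` are assumed interfaces for illustration, not actual Tortoise APIs.

```python
import torch

@torch.no_grad()
def codebook_usage(vqvae, mel_batches):
    """Count how often each codebook entry is used and report dead codes."""
    counts = torch.zeros(vqvae.codebook_size, dtype=torch.long)
    for mel in mel_batches:                          # each: (batch, n_mels, frames)
        codes = vqvae.encode(mel)                    # assumed: long tensor of code indices
        counts += torch.bincount(codes.flatten(), minlength=vqvae.codebook_size)
    dead = int((counts == 0).sum())
    print(f"dead codes: {dead}/{vqvae.codebook_size} "
          f"({100.0 * dead / vqvae.codebook_size:.1f}%)")
    return counts
```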

@OnceJune

@JohnHerry Thank you so much for your help. I'm using diffusion as the mel-vocoder, as TorToiSe does. Let me try different codebook dim sizes; I hope I can find a number that works well for both seen and unseen samples.
