
gpt training issue #10

Closed
z592694590 opened this issue Nov 21, 2022 · 14 comments

@z592694590

Hello! I found the GptVoiceLatentInjector, and I also found several dataset classes you defined, such as fast_paired_dataset_with_phonemes, fast_paired_dataset, gpt_tts_dataset, and so on. Could you please tell me which dataset you used for training the GPT?

@neonbjb (Owner) commented Nov 25, 2022

fast_paired_dataset

@z592694590 (Author)

> fast_paired_dataset

Thanks!

@z592694590 (Author)

Hi, I have trained a GPT in another language, but I found that it only works well when the text spoken in the conditioning clip is exactly the text the model is asked to render. Using another clip of the same speaker may be a good approach, but it is difficult to implement on a large dataset. I would appreciate any suggestions for this problem. Thank you!

@neonbjb (Owner) commented Jan 3, 2023

Yeah, the model will definitely learn to cheat if you simply feed it the same clips it is expected to generate. AR models are pretty remarkable in this way. This is actually the genesis of this blog post and, I think, a really interesting, unexplored capability of AR models.

I actually have provisions for this built into the dataloader, but you need to produce some auxiliary metadata files to get it to work. The code is implemented here: https://github.com/neonbjb/DL-Art-School/blob/master/codes/data/audio/unsupervised_audio_dataset.py#L49

Basically, the process is to go through your dataset and, for each audio clip, find 1-3 other audio clips from the same speaker. You compile the paths to these similar clips into a dict, which is written to a file called similarities.pth in each top-level directory containing audio clips.
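For anyone following along, here's a minimal sketch of how such a similarities.pth file could be generated. The exact dict format the dataloader expects should be checked against unsupervised_audio_dataset.py; the {clip_path: [similar_paths]} layout and the assumption that the speaker ID is the clip's parent directory name are mine, not confirmed from the repo.

```python
# Hypothetical sketch: build a similarities.pth file for one top-level directory.
# Assumed format: {clip_path: [paths to 1-3 other clips from the same speaker]}.
import os
import random
from collections import defaultdict

import torch


def build_similarities(top_level_dir, num_similar=3):
    # Group clips by speaker; here the speaker ID is assumed to be the name
    # of the clip's parent directory (adjust to your dataset layout).
    by_speaker = defaultdict(list)
    for root, _, files in os.walk(top_level_dir):
        for f in files:
            if f.endswith(('.wav', '.mp3', '.flac')):
                by_speaker[os.path.basename(root)].append(os.path.join(root, f))

    similarities = {}
    for speaker, clips in by_speaker.items():
        for clip in clips:
            others = [c for c in clips if c != clip]
            random.shuffle(others)
            # 1-3 other clips from the same speaker, as described above.
            similarities[clip] = others[:num_similar]

    torch.save(similarities, os.path.join(top_level_dir, 'similarities.pth'))


if __name__ == '__main__':
    build_similarities('/path/to/audio/dataset')
```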

@z592694590 (Author)

> Yeah, the model will definitely learn to cheat if you simply feed it the same clips it is expected to generate. AR models are pretty remarkable in this way. This is actually the genesis of this blog post and, I think, a really interesting, unexplored capability of AR models.
>
> I actually have provisions for this built into the dataloader, but you need to produce some auxiliary metadata files to get it to work. The code is implemented here: https://github.com/neonbjb/DL-Art-School/blob/master/codes/data/audio/unsupervised_audio_dataset.py#L49
>
> Basically, the process is to go through your dataset and, for each audio clip, find 1-3 other audio clips from the same speaker. You compile the paths to these similar clips into a dict, which is written to a file called similarities.pth in each top-level directory containing audio clips.

That's very enlightening, thank you very much! Have a nice day!

@z592694590 (Author)

Hi, I noticed that you use a 256-token vocabulary to encode the text. Why not use a larger vocabulary, as NLP models do?

@neonbjb (Owner) commented Jan 6, 2023

Because merging tokens the way NLP models do is actively harmful for generative models. (I did experiments on this with tiny-scale AR models.)

The current thinking is that with generative models, each character is often important (for example, when pronouncing phonemes or rendering text). With more condensed token spaces, the models need to learn more complex relationships between tokens to work.

Example from image-gen land: https://arxiv.org/abs/2212.10562
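A toy illustration of the point (not the repo's actual tokenizer): with a character-level vocabulary, every letter the model must pronounce is a visible token, whereas merged, BPE-style tokens hide the spelling inside opaque units.

```python
# Illustrative sketch only. A character-level vocabulary (the real one described
# above has 256 entries) keeps a one-to-one mapping between characters and
# tokens, so the model sees every character it must pronounce. A BPE-style
# merged vocabulary hides that spelling inside opaque multi-character tokens.
text = "tortoise"

# Character-level: one token per character, spelling fully visible to the model.
char_vocab = {ch: i for i, ch in enumerate(sorted(set("abcdefghijklmnopqrstuvwxyz ',.?!")))}
char_tokens = [char_vocab[c] for c in text]
print(len(char_tokens))  # 8 tokens

# BPE-like merging (toy merges): fewer tokens, but each token's internal
# spelling is no longer visible, so pronunciation must be memorized per token.
merges = {"tor": 300, "toise": 301}
bpe_tokens = [merges["tor"], merges["toise"]]
print(len(bpe_tokens))  # 2 opaque tokens
```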

@z592694590 (Author)

> Because merging tokens the way NLP models do is actively harmful for generative models. (I did experiments on this with tiny-scale AR models.)
>
> The current thinking is that with generative models, each character is often important (for example, when pronouncing phonemes or rendering text). With more condensed token spaces, the models need to learn more complex relationships between tokens to work.
>
> Example from image-gen land: https://arxiv.org/abs/2212.10562

This seems very reasonable 👍👍. Thanks for your reply! I can't wait to do some experiments.

@z592694590 (Author) commented Jan 29, 2023

Hi Mr. Betker. Following your suggestion, I trained a GPT on 16k sample-rate waveforms, and it performs very well. However, when I trained on a 24k sample rate, the result was completely different. At prediction time the model seems unstable, particularly on short text inputs such as 'yes', 'no', or 'go'. Within one batch, the variation in the predicted codes is very large: the model usually predicts very long mel-code sequences and sometimes cannot even predict the stop token. The audio output sounds like a long stretched-out "uhhhhhhhhhh" or simply silence. These problems don't occur in the 16k training pipeline. I'm confused. Do you have any ideas about this? Looking forward to your reply.

@neonbjb (Owner) commented Jan 29, 2023

Possibly stupid question - did you re-train your VQ model for the higher sample rate?

@z592694590 (Author)

> Possibly stupid question - did you re-train your VQ model for the higher sample rate?

Yes, I re-trained my VQ model with a 24k sample rate, 256 hop size, and 1024 window size.
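For reference, a minimal sketch of a mel front-end matching these settings, written with torchaudio; only the 24k sample rate, 256 hop size, and 1024 window size come from the comment above, while n_mels and the 1024-point FFT are assumptions.

```python
# Sketch of a mel-spectrogram front-end with the stated 24k / 256 / 1024 settings.
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=24000,
    n_fft=1024,       # assumed equal to the window size
    win_length=1024,
    hop_length=256,
    n_mels=80,        # assumed
)

wav = torch.randn(1, 24000)   # one second of (fake) 24 kHz audio
spec = mel(wav)
print(spec.shape)             # roughly (1, 80, 24000 // 256 + 1) frames
```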

@neonbjb (Owner) commented Jan 29, 2023

This is certainly a strange finding. Honestly, it's probably a bug. Since this showed up when you changed sample rates, I'd hazard that somewhere in your code there is an assumption made about audio being 16kHz when it is really 24kHz. Or potentially there's something still assuming 22kHz (which I used), and 16kHz works with that but 24kHz doesn't.

Do the loss curves between the two runs look the same?

You probably know this already, but early in training the model does exactly what you describe: it produces a lot of "uhhhhhs" and other constant tones. This was originally the reason I built CLVP (to filter this garbage out of real speech).
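A minimal sketch of the kind of sanity check suggested above: fail loudly or resample whenever the audio on disk doesn't match the sample rate the pipeline was built for, so a lingering 16kHz/22kHz assumption can't silently corrupt a 24kHz run. TARGET_SR and the resample-on-mismatch policy are assumptions, not code from the repo.

```python
# Sketch: guard against hidden sample-rate assumptions in the data pipeline.
import torchaudio

TARGET_SR = 24000  # the rate the VQ/GPT pipeline is assumed to be trained for

def load_audio(path, target_sr=TARGET_SR):
    wav, sr = torchaudio.load(path)
    if sr != target_sr:
        # Either resample explicitly or raise; never let a 16k/22k assumption
        # slip through while the data is really 24k.
        wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=target_sr)
    return wav
```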

@z592694590 (Author)

Yes, the loss curves for the two runs look the same, and the loss of the 24k training is actually lower.

@Naminwang

> Hi Mr. Betker. Following your suggestion, I trained a GPT on 16k sample-rate waveforms, and it performs very well. However, when I trained on a 24k sample rate, the result was completely different. At prediction time the model seems unstable, particularly on short text inputs such as 'yes', 'no', or 'go'. Within one batch, the variation in the predicted codes is very large: the model usually predicts very long mel-code sequences and sometimes cannot even predict the stop token. The audio output sounds like a long stretched-out "uhhhhhhhhhh" or simply silence. These problems don't occur in the 16k training pipeline.

Hi, can you share the configuration for 16k VQVAE model training? I've been training a 16k VQVAE model, but the results are always unsatisfactory: the generated audio has a noise floor, and some words are pronounced strangely.
