
gpt training issue #10

Closed
z592694590 opened this issue Nov 21, 2022 · 14 comments

@z592694590

Hello! I found the GptVoiceLatentInjector, and I also found several dataset classes you defined, such as fast_paired_dataset_with_phonemes, fast_paired_dataset, gpt_tts_dataset, and so on. Could you please tell me which dataset you used for training the GPT?

@neonbjb (Owner) commented Nov 25, 2022

fast_paired_dataset

@z592694590 (Author)

> fast_paired_dataset

Thanks!

@z592694590 (Author)

Hi, I have trained a GPT in another language, but I found that it only works well when the text spoken in the conditioning clip is exactly the text the model is asked to render. Using another clip of the same speaker may be a good approach, but it is difficult to implement on a large dataset. I would appreciate any suggestions for this problem. Thank you!

@neonbjb (Owner) commented Jan 3, 2023

Yeah, the model will definitely learn to cheat if you simply feed it the same clips it is expected to generate. AR models are pretty remarkable in this way. This is actually the genesis of this blog post and, I think, a really interesting, unexplored capability of AR models.

I actually have provisions for this built into the dataloader, but you need to produce some auxiliary metadata files to get it to work. The code is implemented here: https://github.com/neonbjb/DL-Art-School/blob/master/codes/data/audio/unsupervised_audio_dataset.py#L49

Basically, the process is to go through your dataset and, for each audio clip, find 1-3 other audio clips from the same speaker. You compile the paths to these similar clips into a dict, which is written to a file called similarities.pth in each top-level directory containing audio clips.
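For anyone following along, here's a minimal sketch of how such a similarities.pth file could be generated. The exact dict format the dataloader expects should be checked against unsupervised_audio_dataset.py; the {clip_path: [similar_paths]} layout and the assumption that the speaker ID is the clip's parent directory name are mine, not confirmed from the repo.

```python
# Hypothetical sketch: build a similarities.pth file for one top-level directory.
# Assumed format: {clip_path: [paths to 1-3 other clips from the same speaker]}.
import os
import random
from collections import defaultdict

import torch


def build_similarities(top_level_dir, num_similar=3):
    # Group clips by speaker; here the speaker ID is assumed to be the name
    # of the clip's parent directory (adjust to your dataset layout).
    by_speaker = defaultdict(list)
    for root, _, files in os.walk(top_level_dir):
        for f in files:
            if f.endswith(('.wav', '.mp3', '.flac')):
                by_speaker[os.path.basename(root)].append(os.path.join(root, f))

    similarities = {}
    for speaker, clips in by_speaker.items():
        for clip in clips:
            others = [c for c in clips if c != clip]
            random.shuffle(others)
            # 1-3 other clips from the same speaker, as described above.
            similarities[clip] = others[:num_similar]

    torch.save(similarities, os.path.join(top_level_dir, 'similarities.pth'))


if __name__ == '__main__':
    build_similarities('/path/to/audio/dataset')
```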

@z592694590 (Author)

> Yeah, the model will definitely learn to cheat if you simply feed it the same clips it is expected to generate. AR models are pretty remarkable in this way. This is actually the genesis of this blog post and, I think, a really interesting, unexplored capability of AR models.
>
> I actually have provisions for this built into the dataloader, but you need to produce some auxiliary metadata files to get it to work. The code is implemented here: https://github.com/neonbjb/DL-Art-School/blob/master/codes/data/audio/unsupervised_audio_dataset.py#L49
>
> Basically, the process is to go through your dataset and, for each audio clip, find 1-3 other audio clips from the same speaker. You compile the paths to these similar clips into a dict, which is written to a file called similarities.pth in each top-level directory containing audio clips.

That's very enlightening, thank you very much! Have a nice day!

@z592694590 (Author)

Hi, I noticed that you use a 256-token vocabulary to encode the text. Why not use a larger vocabulary, as NLP models do?

@neonbjb (Owner) commented Jan 6, 2023

Because merging tokens the way NLP models do is actively harmful for generative models. (I did experiments on this with tiny-scale AR models.)

The current thinking is that with generative models, each character is often important (for example, when pronouncing phonemes or rendering text). With more condensed token spaces, the models need to learn more complex relationships between tokens to work.

Example from image-gen land: https://arxiv.org/abs/2212.10562
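A toy illustration of the point (not the repo's actual tokenizer): with a character-level vocabulary, every letter the model must pronounce is a visible token, whereas merged, BPE-style tokens hide the spelling inside opaque units.

```python
# Illustrative sketch only. A character-level vocabulary (the real one described
# above has 256 entries) keeps a one-to-one mapping between characters and
# tokens, so the model sees every character it must pronounce. A BPE-style
# merged vocabulary hides that spelling inside opaque multi-character tokens.
text = "tortoise"

# Character-level: one token per character, spelling fully visible to the model.
char_vocab = {ch: i for i, ch in enumerate(sorted(set("abcdefghijklmnopqrstuvwxyz ',.?!")))}
char_tokens = [char_vocab[c] for c in text]
print(len(char_tokens))  # 8 tokens

# BPE-like merging (toy merges): fewer tokens, but each token's internal
# spelling is no longer visible, so pronunciation must be memorized per token.
merges = {"tor": 300, "toise": 301}
bpe_tokens = [merges["tor"], merges["toise"]]
print(len(bpe_tokens))  # 2 opaque tokens
```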

@z592694590 (Author)

> Because merging tokens the way NLP models do is actively harmful for generative models. (I did experiments on this with tiny-scale AR models.)
>
> The current thinking is that with generative models, each character is often important (for example, when pronouncing phonemes or rendering text). With more condensed token spaces, the models need to learn more complex relationships between tokens to work.
>
> Example from image-gen land: https://arxiv.org/abs/2212.10562

This seems very reasonable 👍👍. Thanks for your reply! I can't wait to do some experiments.

@z592694590 (Author) commented Jan 29, 2023

Hi Mr. Betker. Following your suggestion, I trained a GPT on 16k sample-rate waveforms, and it performs very well. However, when I trained on a 24k sample rate, the result was completely different. At prediction time the model seems unstable, particularly on short text inputs such as 'yes', 'no', or 'go'. Within one batch, the variation in the predicted codes is very large: the model usually predicts very long mel-code sequences and sometimes cannot even predict the stop token. The audio output sounds like a long stretched-out "uhhhhhhhhhh" or simply silence. These problems don't occur in the 16k training pipeline. I'm confused. Do you have any ideas about this? Looking forward to your reply.

@neonbjb (Owner) commented Jan 29, 2023

Possibly stupid question - did you re-train your VQ model for the higher sample rate?

@z592694590 (Author)

> Possibly stupid question - did you re-train your VQ model for the higher sample rate?

Yes, I re-trained my VQ model with a 24k sample rate, 256 hop size, and 1024 window size.
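For reference, a minimal sketch of a mel front-end matching these settings, written with torchaudio; only the 24k sample rate, 256 hop size, and 1024 window size come from the comment above, while n_mels and the 1024-point FFT are assumptions.

```python
# Sketch of a mel-spectrogram front-end with the stated 24k / 256 / 1024 settings.
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=24000,
    n_fft=1024,       # assumed equal to the window size
    win_length=1024,
    hop_length=256,
    n_mels=80,        # assumed
)

wav = torch.randn(1, 24000)   # one second of (fake) 24 kHz audio
spec = mel(wav)
print(spec.shape)             # roughly (1, 80, 24000 // 256 + 1) frames
```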

@neonbjb (Owner) commented Jan 29, 2023

This is certainly a strange finding. Honestly, it's probably a bug. Since this showed up when you changed sample rates, I'd hazard that somewhere in your code there is an assumption made about audio being 16kHz when it is really 24kHz. Or potentially there's something still assuming 22kHz (which I used), and 16kHz works with that but 24kHz doesn't.

Do the loss curves between the two runs look the same?

You probably know this already, but early in training the model does exactly what you describe: it produces a lot of "uhhhhhs" and other constant tones. This was originally the reason I built CLVP (to filter this garbage out of real speech).
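A minimal sketch of the kind of sanity check suggested above: fail loudly or resample whenever the audio on disk doesn't match the sample rate the pipeline was built for, so a lingering 16kHz/22kHz assumption can't silently corrupt a 24kHz run. TARGET_SR and the resample-on-mismatch policy are assumptions, not code from the repo.

```python
# Sketch: guard against hidden sample-rate assumptions in the data pipeline.
import torchaudio

TARGET_SR = 24000  # the rate the VQ/GPT pipeline is assumed to be trained for

def load_audio(path, target_sr=TARGET_SR):
    wav, sr = torchaudio.load(path)
    if sr != target_sr:
        # Either resample explicitly or raise; never let a 16k/22k assumption
        # slip through while the data is really 24k.
        wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=target_sr)
    return wav
```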

@z592694590 (Author)

Yes, the loss curves for the two runs look the same, and the loss of the 24k training is actually lower.

@Naminwang

> Hi Mr. Betker. Following your suggestion, I trained a GPT on 16k sample-rate waveforms, and it performs very well. However, when I trained on a 24k sample rate, the result was completely different. At prediction time the model seems unstable, particularly on short text inputs such as 'yes', 'no', or 'go'. Within one batch, the variation in the predicted codes is very large: the model usually predicts very long mel-code sequences and sometimes cannot even predict the stop token. The audio output sounds like a long stretched-out "uhhhhhhhhhh" or simply silence. These problems don't occur in the 16k training pipeline.

Hi, can you share the configuration for 16k VQVAE model training? I've been training a 16k VQVAE model, but the results are always unsatisfactory: the generated audio has a noise floor, and some words are pronounced strangely.
