training on a large dataset #89

Closed · AlvL1225 opened this issue Jun 29, 2022 · 13 comments

@AlvL1225

Hi @lucidrains
Thank you for your great work!

I am going to train Imagen on a relatively large dataset (100M images, a subset of LAION-5B) with 8x RTX 3090.

For the text-to-image DDPM, I am considering either a 128-dim non-efficient unet or a 256-dim efficient unet. Which one would you prefer? (The 256-dim efficient unet is about 2x faster, at around 2 days/epoch.)

Learned a lot from your discussion in issue #72!
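(For reference, a rough sketch of the two candidate configurations being compared. The keyword names below, e.g. `dim` and `memory_efficient`, are assumptions about the imagen-pytorch `Unet` API at the time rather than anything stated in this thread; check the README of your installed version.)

```python
# Rough sketch of the two options under discussion; all keyword names
# (dim, cond_dim, dim_mults, memory_efficient) are assumed imagen-pytorch
# Unet arguments -- verify against your installed version.
from imagen_pytorch import Unet

# Option A: 128-dim base unet with the standard (non-efficient) blocks
unet_128_standard = Unet(
    dim = 128,
    cond_dim = 512,
    dim_mults = (1, 2, 3, 4),
    memory_efficient = False,
)

# Option B: 256-dim base unet with the memory-efficient architecture
# (reported above as roughly 2x faster, ~2 days/epoch on 8x RTX 3090)
unet_256_efficient = Unet(
    dim = 256,
    cond_dim = 512,
    dim_mults = (1, 2, 3, 4),
    memory_efficient = True,
)
```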

@lucidrains (Owner)

@yli1994 hey! so i think non-efficient is the more conservative route, until we close out issue 72

128 dimensions with the non-efficient unet should be sufficient, but your guess is as good as mine 😆

@AlvL1225 (Author)

@lucidrains thanks! I will try the 128-dim non-efficient unet first, but the speed of the 256-dim efficient one is so tempting 😆😆

@lucidrains (Owner) commented Jun 29, 2022

@yli1994 definitely chat with Aidan and Zion on the LAION Discord, as they have been training the dalle2-pytorch unet with good results and probably have some training tips 😄

@lucidrains (Owner)

@yli1994 i'll try to get the lightning training code in place by week's end!

@lucidrains (Owner)

@yli1994 Francisco just reported that the color shifting issue seems to be absent in the latest version (caveat: only at 10k steps), so maybe it is a cautious yellow light to proceed with the memory-efficient unet

@srelbo commented Jun 29, 2022

Here is our run at ~90k steps (memory-efficient unet)... no color shifting as far as I can see (samples in the report).

https://wandb.ai/elbo-ai/imagen/runs/1y5gc6d2?workspace=user-elbo

This is from v0.11.5, IIRC.

@camlaedtke commented Jun 29, 2022

@srelbo Great report! Looks like you're starting to get good results. A few questions:

  • For your training loop, are you training one unet at a time (i.e. train unet1 for 25 epochs, then train unet2 for 25 epochs, etc.). Or are you training each unet through all mini-batches once per epoch (i.e. train unets 1 and 2 for one epoch, then train unets 1 and 2 for the next epoch, and so on)?
  • How many images does your dataset contain?
  • Have you noticed a pretty big increase in model size and a decrease in training speed between 0.11.5 and 0.14.1? I definitely have.

Anyway, I am currently training on a single RTX 3090 on the CocoCaptions dataset. Here's my run so far:
https://wandb.ai/camlaedtke/imagen?workspace=user-camlaedtke

@srelbo commented Jun 29, 2022

Thanks @camlaedtke ! Looks like you are keeping your 3090 super busy 😄

> For your training loop, are you training one unet at a time (i.e. train unet1 for 25 epochs, then train unet2 for 25 epochs, etc.). Or are you training each unet through all mini-batches once per epoch (i.e. train unets 1 and 2 for one epoch, then train unets 1 and 2 for the next epoch, and so on)?

We are doing the latter. This is just our first experiment, but doing one unet at a time is interesting.

> How many images does your dataset contain?

It's a subset of the LAION-Aesthetics dataset, and it's really large (~2.6 TB). We have not completed even a single epoch after 2 days.

> Have you noticed a pretty big increase in model size and a decrease in training speed between 0.11.5 and 0.14.1? I definitely have.

I will keep an 👀 on it when we pick up the latest changes. Perhaps in the next few days.

@AlvL1225 (Author)

> @yli1994 i'll try to get the lightning training code in place by week's end!

Sounds great! Looking forward to it!

@AlvL1225 (Author)

> For your training loop, are you training one unet at a time (i.e. train unet1 for 25 epochs, then train unet2 for 25 epochs, etc.). Or are you training each unet through all mini-batches once per epoch (i.e. train unets 1 and 2 for one epoch, then train unets 1 and 2 for the next epoch, and so on)?

Training one unet at a time could be cheaper. You can see your generation results sooner and adjust your strategy, especially when using only a few GPUs.
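(A minimal sketch of the two loop structures being compared, assuming the imagen-pytorch `ImagenTrainer` API of the time, where both the forward pass and the optimizer update take a `unet_number` argument; treat the exact signatures and the setup as assumptions.)

```python
# Sketch of the two training-loop strategies discussed above.
# Assumes `trainer = ImagenTrainer(imagen)` and a `dataloader` yielding
# (images, text_embeds) pairs; construction of the unets, Imagen, trainer
# and dataloader is omitted here -- see the imagen-pytorch README.
from imagen_pytorch import Unet, Imagen, ImagenTrainer  # noqa: F401

epochs = 25  # as in the example above

# Strategy A: one unet at a time -- faster feedback loop on a small GPU budget
for unet_number in (1, 2):
    for epoch in range(epochs):
        for images, text_embeds in dataloader:
            loss = trainer(images, text_embeds = text_embeds, unet_number = unet_number)
            trainer.update(unet_number = unet_number)

# Strategy B: step every unet within each epoch
for epoch in range(epochs):
    for images, text_embeds in dataloader:
        for unet_number in (1, 2):
            loss = trainer(images, text_embeds = text_embeds, unet_number = unet_number)
            trainer.update(unet_number = unet_number)
```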

@camlaedtke

@srelbo Good to know, thanks! Yeah, using a single GPU is really testing my patience. It looks like it'll take a couple of days to train 30,000 steps, but hopefully that will be enough for somewhat decent results.

@lucidrains (Owner)

should be ready for that with the new accelerate integration

@AlvL1225 (Author) commented Jul 15, 2022

Hi @lucidrains
Should I wrap my dataloader and trainer instance with accelerator.prepare in my train.py script? When I call accelerate launch, my dataloader is not split across the N processes (I use trainer.accelerate.prepare(dataloader)).

In the end I used DistributedSampler instead.
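(For anyone hitting the same issue, a minimal sketch of sharding the dataset manually with torch's DistributedSampler, as an alternative to relying on accelerator.prepare(dataloader). It assumes the process group has already been initialized, e.g. by `accelerate launch` or torchrun; `dataset` and `epochs` are placeholders for your own objects.)

```python
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

# Explicitly give each process its own shard of the dataset instead of
# relying on accelerator.prepare(dataloader). Assumes dist.init_process_group
# has already been called (accelerate launch / torchrun does this for you).
sampler = DistributedSampler(
    dataset,                                  # your map-style dataset (placeholder)
    num_replicas = dist.get_world_size(),
    rank = dist.get_rank(),
    shuffle = True,
)
dataloader = DataLoader(dataset, batch_size = 32, sampler = sampler, num_workers = 4)

for epoch in range(epochs):                   # epochs is a placeholder
    sampler.set_epoch(epoch)                  # reshuffles the shards each epoch
    for batch in dataloader:
        ...                                   # forward / backward / update as usual
```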
