training on a large dataset #89

Closed · AlvL1225 opened this issue Jun 29, 2022 · 13 comments

@AlvL1225

Hi @lucidrains
Thank you for your great work!

I am going to train Imagen on a relatively large dataset (100M images, a subset of LAION-5B) with 8x RTX 3090.

For the text-to-image DDPM, I am considering either a 128-dim non-efficient unet or a 256-dim efficient unet. Which one would you prefer? (The 256-dim efficient unet is about 2x faster, at around 2 days/epoch.)

Learned a lot from your discussion in issue #72!
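(For reference, a rough sketch of the two candidate configurations being compared. The keyword names below, e.g. `dim` and `memory_efficient`, are assumptions about the imagen-pytorch `Unet` API at the time rather than anything stated in this thread; check the README of your installed version.)

```python
# Rough sketch of the two options under discussion; all keyword names
# (dim, cond_dim, dim_mults, memory_efficient) are assumed imagen-pytorch
# Unet arguments -- verify against your installed version.
from imagen_pytorch import Unet

# Option A: 128-dim base unet with the standard (non-efficient) blocks
unet_128_standard = Unet(
    dim = 128,
    cond_dim = 512,
    dim_mults = (1, 2, 3, 4),
    memory_efficient = False,
)

# Option B: 256-dim base unet with the memory-efficient architecture
# (reported above as roughly 2x faster, ~2 days/epoch on 8x RTX 3090)
unet_256_efficient = Unet(
    dim = 256,
    cond_dim = 512,
    dim_mults = (1, 2, 3, 4),
    memory_efficient = True,
)
```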

@lucidrains (Owner)

@yli1994 hey! so i think non-efficient is the more conservative route, until we close out issue 72

128 dimensions with the non-efficient unet should be sufficient, but your guess is as good as mine 😆

@AlvL1225 (Author)

@lucidrains thanks! I will try the 128-dim non-efficient unet first, but the speed of the 256-dim efficient one is so tempting 😆😆

@lucidrains (Owner) commented Jun 29, 2022

@yli1994 definitely chat with Aidan and Zion on the LAION Discord, as they have been training the dalle2-pytorch unet with good results and probably have some training tips 😄

@lucidrains (Owner)

@yli1994 i'll try to get the lightning training code in place by week's end!

@lucidrains (Owner)

@yli1994 Francisco just reported that the color shifting issue seems to be absent in the latest version (caveat: only at 10k steps), so maybe it is a cautious yellow light to proceed with the memory-efficient unet

@srelbo commented Jun 29, 2022

Here is our run at ~90k steps (memory-efficient unet)... no color shifting as far as I can see (samples in the report).

https://wandb.ai/elbo-ai/imagen/runs/1y5gc6d2?workspace=user-elbo

This is from v0.11.5, IIRC.

@camlaedtke commented Jun 29, 2022

@srelbo Great report! Looks like you're starting to get good results. A few questions:

  • For your training loop, are you training one unet at a time (i.e. train unet1 for 25 epochs, then train unet2 for 25 epochs, etc.). Or are you training each unet through all mini-batches once per epoch (i.e. train unets 1 and 2 for one epoch, then train unets 1 and 2 for the next epoch, and so on)?
  • How many images does your dataset contain?
  • Have you noticed a pretty big increase in model size and a decrease in training speed between 0.11.5 and 0.14.1? I definitely have.

Anyway, I am currently training on a single RTX 3090 on the CocoCaptions dataset. Here's my run so far:
https://wandb.ai/camlaedtke/imagen?workspace=user-camlaedtke

@srelbo commented Jun 29, 2022

Thanks @camlaedtke ! Looks like you are keeping your 3090 super busy 😄

> For your training loop, are you training one unet at a time (i.e. train unet1 for 25 epochs, then train unet2 for 25 epochs, etc.). Or are you training each unet through all mini-batches once per epoch (i.e. train unets 1 and 2 for one epoch, then train unets 1 and 2 for the next epoch, and so on)?

We are doing the latter. This is just our first experiment, but doing one unet at a time is interesting.

> How many images does your dataset contain?

It's a subset of the LAION-Aesthetics dataset, and it's really large (~2.6 TB). We have not completed even a single epoch after 2 days.

> Have you noticed a pretty big increase in model size and a decrease in training speed between 0.11.5 and 0.14.1? I definitely have.

I will keep an 👀 on it when we pick up the latest changes. Perhaps in the next few days.

@AlvL1225 (Author)

> @yli1994 i'll try to get the lightning training code in place by week's end!

Sounds great! Looking forward to it!

@AlvL1225 (Author)

> For your training loop, are you training one unet at a time (i.e. train unet1 for 25 epochs, then train unet2 for 25 epochs, etc.). Or are you training each unet through all mini-batches once per epoch (i.e. train unets 1 and 2 for one epoch, then train unets 1 and 2 for the next epoch, and so on)?

Training one unet at a time could be cheaper. You can see your generation results sooner and adjust your strategy, especially when using only a few GPUs.
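(A minimal sketch of the two loop structures being compared, assuming the imagen-pytorch `ImagenTrainer` API of the time, where both the forward pass and the optimizer update take a `unet_number` argument; treat the exact signatures and the setup as assumptions.)

```python
# Sketch of the two training-loop strategies discussed above.
# Assumes `trainer = ImagenTrainer(imagen)` and a `dataloader` yielding
# (images, text_embeds) pairs; construction of the unets, Imagen, trainer
# and dataloader is omitted here -- see the imagen-pytorch README.
from imagen_pytorch import Unet, Imagen, ImagenTrainer  # noqa: F401

epochs = 25  # as in the example above

# Strategy A: one unet at a time -- faster feedback loop on a small GPU budget
for unet_number in (1, 2):
    for epoch in range(epochs):
        for images, text_embeds in dataloader:
            loss = trainer(images, text_embeds = text_embeds, unet_number = unet_number)
            trainer.update(unet_number = unet_number)

# Strategy B: step every unet within each epoch
for epoch in range(epochs):
    for images, text_embeds in dataloader:
        for unet_number in (1, 2):
            loss = trainer(images, text_embeds = text_embeds, unet_number = unet_number)
            trainer.update(unet_number = unet_number)
```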

@camlaedtke

@srelbo Good to know, thanks! Yeah, using a single GPU is really testing my patience. It looks like it'll take a couple of days to train 30,000 steps, but hopefully that will be enough for somewhat decent results.

@lucidrains (Owner)

should be ready for that with the new accelerate integration

@AlvL1225 (Author) commented Jul 15, 2022

Hi @lucidrains
Should I wrap my dataloader and trainer instance with accelerator.prepare in my train.py script? When I call accelerate launch, my dataloader is not split across the N processes (I use trainer.accelerate.prepare(dataloader)).

In the end I used DistributedSampler instead.
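(For anyone hitting the same issue, a minimal sketch of sharding the dataset manually with torch's DistributedSampler, as an alternative to relying on accelerator.prepare(dataloader). It assumes the process group has already been initialized, e.g. by `accelerate launch` or torchrun; `dataset` and `epochs` are placeholders for your own objects.)

```python
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

# Explicitly give each process its own shard of the dataset instead of
# relying on accelerator.prepare(dataloader). Assumes dist.init_process_group
# has already been called (accelerate launch / torchrun does this for you).
sampler = DistributedSampler(
    dataset,                                  # your map-style dataset (placeholder)
    num_replicas = dist.get_world_size(),
    rank = dist.get_rank(),
    shuffle = True,
)
dataloader = DataLoader(dataset, batch_size = 32, sampler = sampler, num_workers = 4)

for epoch in range(epochs):                   # epochs is a placeholder
    sampler.set_epoch(epoch)                  # reshuffles the shards each epoch
    for batch in dataloader:
        ...                                   # forward / backward / update as usual
```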
