
Slow Training Speed #21

Open
s13kman opened this issue Nov 2, 2021 · 3 comments

Comments


s13kman commented Nov 2, 2021

Hi,
First of all, great work! I really loved it. To replicate the results, I tried training on the Conceptual 12M dataset with the same depth and dims as the pretrained models, but training was very slow: after 4 days it was still on the first (0th) epoch. I'm training on an NVIDIA Quadro RTX A6000, which I don't think is particularly slow.
Any suggestions to improve the training speed? I have multi-GPU access, but it seems that isn't supported right now.
Thanks!

mehdidc (Owner) commented Nov 5, 2021

Hi @s13kman, thanks for your interest! Since you have access to multiple GPUs, I would suggest using multi-GPU training to speed things up. Multi-GPU is actually supported, through Horovod (https://github.com/horovod/horovod).
Once you install Horovod, you basically don't need to change anything else; launch training with something like:

horovodrun -np number_of_gpus python main.py your_config_file.yaml

Given that the dataset is relatively big, I usually train the models for only a single epoch.
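
For reference, here is a minimal sketch of the Horovod data-parallel pattern in a PyTorch training loop. The model, dataset, and hyperparameters below are illustrative placeholders, not this repo's actual main.py; the real script already wires Horovod up for you once launched with horovodrun.

```python
# Hedged sketch: placeholder model/data, not the repo's actual training code.
import torch
import torch.nn as nn
import horovod.torch as hvd

hvd.init()                                # one process per GPU
torch.cuda.set_device(hvd.local_rank())   # pin each process to its own GPU

# Placeholder model and dataset; substitute the real ones.
model = nn.Linear(512, 512).cuda()
dataset = torch.utils.data.TensorDataset(torch.randn(1024, 512))

# Shard the data so each worker sees a distinct slice.
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = torch.utils.data.DataLoader(dataset, batch_size=32, sampler=sampler)

# Common convention: scale the learning rate by the number of workers.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4 * hvd.size())
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Start all workers from identical weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for (x,) in loader:
    x = x.cuda()
    loss = model(x).pow(2).mean()         # dummy loss for the sketch
    optimizer.zero_grad()
    loss.backward()                       # gradients are allreduced across GPUs
    optimizer.step()
```

Launched with horovodrun -np number_of_gpus as above, Horovod runs one such process per GPU and averages gradients across them at every step.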


CrossLee1 commented Nov 22, 2021

How long did it take you to train a single epoch?

mehdidc (Owner) commented Jul 9, 2022

Hi @CrossLee1, sorry for the late reply. It takes around 6 hours, but I train on 64 A100 GPUs (data parallel with Horovod) to speed up the process. I'm quite sure there are a lot of things that could be optimized in terms of hardware usage; I was mostly going for fast experiments (short walltime) to figure out what works best (architecture, data augmentation, losses, etc.) rather than optimizing training speed.
