training is slow #14

Closed
ggsonic opened this issue Jul 15, 2022 · 5 comments
Comments

@ggsonic

ggsonic commented Jul 15, 2022

The dataloader seems to be using only a single process for image processing; even with num_workers set, only one worker process is active. The GPU is fast, but the CPU is slow. Why is that, and how can I make training faster?
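For reference, a minimal sketch (not this repo's training code; the dataset path and transforms are placeholders) of how a plain PyTorch DataLoader is expected to behave: setting num_workers > 0 spawns that many CPU worker processes for decoding and augmentation, and those processes should show up busy in a process monitor.

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Placeholder dataset/transforms, just to illustrate the worker settings.
train_set = datasets.ImageFolder(
    "path/to/imagenet/train",
    transform=transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.ToTensor(),
    ]),
)

loader = DataLoader(
    train_set,
    batch_size=128,
    shuffle=True,
    num_workers=8,            # should spawn 8 worker processes for this loader
    pin_memory=True,          # speeds up host-to-device copies
    persistent_workers=True,  # keep workers alive between epochs (PyTorch >= 1.7)
)
```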

@ahatamiz
Collaborator

Hi @ggsonic

That's an interesting observation. We primarily use the timm library to facilitate training and did not notice this issue with the dataloader.

Could you please provide more information about the setup you are using for training?

Best,

@ggsonic
Author

ggsonic commented Jul 18, 2022

CPU: AMD EPYC 7763 64-Core Processor (AuthenticAMD). GPU: A100. OS: Ubuntu 20.04. CUDA: 11.3. PyTorch: 1.10.2+cu113. timm: 0.5.4.
We are using SLURM with 8 A100 GPUs, each task with num_workers=8. As you can see in the snapshot, only 8 CPU processes are working at 100% (we are using 8 A100 GPUs, not 4 as in your paper).
(screenshot: snap1)

@ahatamiz
Collaborator

ahatamiz commented Jul 22, 2022

Hi @ggsonic

Thank you for sharing this information. We trained the model using 4 computational nodes, as specified in the paper. However, each node comprises 8 GPUs, hence a total of 32 GPUs for this task. Using a batch size of 128, training finished in around 22 hours on NVIDIA's NGC cluster.

Judging from your snapshot, the CPU is clearly the bottleneck: the A100 GPUs consume data at a higher rate than the CPU cores can keep up with. The timm library, which our work primarily relies on, does a great job of addressing these bottlenecks in the dataloader, but it is still challenging to keep pace with blazingly fast hardware such as the A100.

Best,
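One way to confirm this kind of CPU-side bottleneck is to time how long each step waits on the dataloader versus how long the forward/backward pass takes. The snippet below is only an illustrative sketch (profile_loader is a hypothetical helper, not part of timm or this repo):

```python
import time
import torch

def profile_loader(loader, model, criterion, optimizer, device="cuda", steps=50):
    """Rough split of per-step time into data-wait vs. GPU compute."""
    model.train()
    data_time, step_time = 0.0, 0.0
    it = iter(loader)
    for _ in range(steps):
        t0 = time.time()
        images, targets = next(it)  # blocks here if CPU workers can't keep up
        images, targets = images.to(device), targets.to(device)
        torch.cuda.synchronize()
        t1 = time.time()
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()
        t2 = time.time()
        data_time += t1 - t0
        step_time += t2 - t1
    print(f"avg data wait: {data_time / steps:.3f}s, avg GPU step: {step_time / steps:.3f}s")
```

If the data wait dominates, the dataloader (CPU side) is the limiting factor rather than the GPUs.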

@ggsonic
Author

ggsonic commented Jul 26, 2022

It was my fault. I changed the code to run under SLURM and set the wrong create_loader params. After setting distributed=True, everything is fine now. Thanks!
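For anyone who hits the same issue, here is a rough sketch of the fix described above, assuming timm 0.5.4's create_loader API (the dataset root and hyperparameters are placeholders). With distributed=True, each rank wraps the dataset in a DistributedSampler and starts its own pool of num_workers processes, so the CPU cores of all ranks are used:

```python
from timm.data import create_dataset, create_loader

# Placeholder root/split; each distributed rank builds its own loader.
dataset_train = create_dataset(
    "", root="path/to/imagenet", split="train", is_training=True, batch_size=128
)

loader_train = create_loader(
    dataset_train,
    input_size=(3, 224, 224),
    batch_size=128,
    is_training=True,
    use_prefetcher=True,  # timm's CUDA prefetcher overlaps copies with compute
    num_workers=8,        # per-rank worker processes
    distributed=True,     # the parameter that was set incorrectly under SLURM
    pin_memory=True,
)
```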

@ahatamiz
Collaborator

Hi @ggsonic ,

Thanks for letting us know. In addition, we have updated our model and provided a checkpoint with improved performance.

Best
