training is slow #14

Closed
ggsonic opened this issue Jul 15, 2022 · 5 comments
Comments

@ggsonic

ggsonic commented Jul 15, 2022

The dataloader seems to be using only a single process for image processing; even with num_workers set, only one worker process is active. The GPU is fast, but the CPU is slow. Why is that, and how can I make training faster?
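For reference, a minimal sketch (not this repo's training code; the dataset path and transforms are placeholders) of how a plain PyTorch DataLoader is expected to behave: setting num_workers > 0 spawns that many CPU worker processes for decoding and augmentation, and those processes should show up busy in a process monitor.

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Placeholder dataset/transforms, just to illustrate the worker settings.
train_set = datasets.ImageFolder(
    "path/to/imagenet/train",
    transform=transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.ToTensor(),
    ]),
)

loader = DataLoader(
    train_set,
    batch_size=128,
    shuffle=True,
    num_workers=8,            # should spawn 8 worker processes for this loader
    pin_memory=True,          # speeds up host-to-device copies
    persistent_workers=True,  # keep workers alive between epochs (PyTorch >= 1.7)
)
```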

@ahatamiz
Collaborator

Hi @ggsonic

That's an interesting observation. We primarily use the timm library to facilitate training and did not notice this issue with the dataloader.

Could you please provide more information about the setup you are using for training?

Best,

@ggsonic
Author

ggsonic commented Jul 18, 2022

CPU: AMD EPYC 7763 64-Core Processor (AuthenticAMD). GPU: A100. OS: Ubuntu 20.04. CUDA: 11.3. PyTorch: 1.10.2+cu113. timm: 0.5.4.
We are using SLURM with 8 A100 GPUs, each task with num_workers=8. As you can see in the snapshot, only 8 CPU processes are working at 100% (we are using 8 A100 GPUs, not 4 as in your paper).
(screenshot: snap1)

@ahatamiz
Collaborator

ahatamiz commented Jul 22, 2022

Hi @ggsonic

Thank you for sharing this information. We trained the model using 4 computational nodes, as specified in the paper. However, each node comprises 8 GPUs, hence a total of 32 GPUs for this task. Using a batch size of 128, training finished in around 22 hours on NVIDIA's NGC cluster.

Judging from your snapshot, the CPU is clearly the bottleneck: the A100 GPUs consume data at a higher rate than the CPU cores can keep up with. The timm library, which our work primarily relies on, does a great job of addressing these bottlenecks in the dataloader, but it is still challenging to keep pace with blazingly fast hardware such as the A100.

Best,
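One way to confirm this kind of CPU-side bottleneck is to time how long each step waits on the dataloader versus how long the forward/backward pass takes. The snippet below is only an illustrative sketch (profile_loader is a hypothetical helper, not part of timm or this repo):

```python
import time
import torch

def profile_loader(loader, model, criterion, optimizer, device="cuda", steps=50):
    """Rough split of per-step time into data-wait vs. GPU compute."""
    model.train()
    data_time, step_time = 0.0, 0.0
    it = iter(loader)
    for _ in range(steps):
        t0 = time.time()
        images, targets = next(it)  # blocks here if CPU workers can't keep up
        images, targets = images.to(device), targets.to(device)
        torch.cuda.synchronize()
        t1 = time.time()
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()
        t2 = time.time()
        data_time += t1 - t0
        step_time += t2 - t1
    print(f"avg data wait: {data_time / steps:.3f}s, avg GPU step: {step_time / steps:.3f}s")
```

If the data wait dominates, the dataloader (CPU side) is the limiting factor rather than the GPUs.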

@ggsonic
Author

ggsonic commented Jul 26, 2022

It was my fault. I changed the code to run under SLURM and set the wrong create_loader params. After setting distributed=True, everything is fine now. Thanks!
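For anyone who hits the same issue, here is a rough sketch of the fix described above, assuming timm 0.5.4's create_loader API (the dataset root and hyperparameters are placeholders). With distributed=True, each rank wraps the dataset in a DistributedSampler and starts its own pool of num_workers processes, so the CPU cores of all ranks are used:

```python
from timm.data import create_dataset, create_loader

# Placeholder root/split; each distributed rank builds its own loader.
dataset_train = create_dataset(
    "", root="path/to/imagenet", split="train", is_training=True, batch_size=128
)

loader_train = create_loader(
    dataset_train,
    input_size=(3, 224, 224),
    batch_size=128,
    is_training=True,
    use_prefetcher=True,  # timm's CUDA prefetcher overlaps copies with compute
    num_workers=8,        # per-rank worker processes
    distributed=True,     # the parameter that was set incorrectly under SLURM
    pin_memory=True,
)
```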

@ahatamiz
Collaborator

Hi @ggsonic ,

Thanks for letting us know. In addition, we have updated our model and provided a checkpoint with improved performance.

Best
