training is slow #14

The dataloader seems to be using only a single process to handle image processing; the num_workers workers are all waiting on that one process. The GPU is fast but the CPU is slow. Why is that, and how can I make training faster?
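
For context, one way to confirm that CPU-side image preprocessing is the bottleneck is to time the dataloader in isolation, without any model on the GPU. This is an illustrative sketch, not code from the project; the dataset path and transform are placeholders:

```python
# Rough benchmark: iterate the loader alone and compare wall time as the
# number of worker processes grows. num_workers=0 loads in the main
# process; with workers > 0 the first batches also include worker startup,
# but the comparison is still informative.
import time

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("/path/to/imagenet/train", transform=transform)

for num_workers in (0, 4, 8, 16):
    loader = DataLoader(dataset, batch_size=128, shuffle=True,
                        num_workers=num_workers, pin_memory=True)
    start = time.perf_counter()
    for i, (images, targets) in enumerate(loader):
        if i == 50:  # a fixed number of batches is enough for a comparison
            break
    print(f"num_workers={num_workers}: "
          f"{time.perf_counter() - start:.1f}s for 50 batches")
```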

Hi @ggsonic, that's an interesting observation. We primarily use the timm library to facilitate training, but we did not notice this issue with the dataloader. Could you please provide more information about the setup you are using for training? Best,

Hi @ggsonic, thank you for sharing this information. We trained the model using 4 compute nodes, as specified in the paper; each node comprises 8 GPUs, so 32 GPUs in total for this task. With a batch size of 128, training finished in around 22 hours on NVIDIA's NGC cluster. According to your snapshot, I believe the CPU is clearly the bottleneck: the A100 GPUs consume data at a higher rate than the CPU cores can supply it. The timm library, which our work primarily relies on, does a great job of addressing these bottlenecks in the dataloader, but keeping up with blazing-fast hardware such as the A100 is still challenging. Best,
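
As an aside, timm exposes these mitigations through its loader factory. A minimal sketch of a configuration aimed at keeping fast GPUs fed is shown below; the argument names follow `create_loader` in recent timm releases and should be checked against the installed version, and the dataset root is a placeholder:

```python
# Illustrative only: a timm loader tuned for fast GPUs.
from timm.data import create_dataset, create_loader

dataset = create_dataset("", root="/path/to/imagenet",
                         split="train", is_training=True)
loader = create_loader(
    dataset,
    input_size=(3, 224, 224),
    batch_size=128,
    is_training=True,
    use_prefetcher=True,  # async CUDA prefetch overlaps H2D copies with compute
    num_workers=8,        # several CPU workers per GPU, not one
    pin_memory=True,
    distributed=True,     # shards data per rank; needs an initialized process group
)
```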

It was my fault. I changed the code to run under Slurm and set the wrong create_loader params. Then I set the correct ones.
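
The thread does not show which create_loader params were wrong. A common pitfall when moving a script to Slurm is that the distributed setup and the loader flags fall out of sync; a hedged sketch of typical per-task initialization follows (the SLURM_* variables are standard Slurm exports, everything else is illustrative):

```python
# Each Slurm task becomes one distributed process. env:// rendezvous
# expects MASTER_ADDR and MASTER_PORT to be set in the environment.
import os

import torch
import torch.distributed as dist

rank = int(os.environ["SLURM_PROCID"])
world_size = int(os.environ["SLURM_NTASKS"])
local_rank = int(os.environ["SLURM_LOCALID"])

torch.cuda.set_device(local_rank)
dist.init_process_group(
    backend="nccl",
    init_method="env://",
    rank=rank,
    world_size=world_size,
)
# With the process group in place, create_loader(..., distributed=True)
# shards the dataset per rank via a DistributedSampler.
```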

Hi @ggsonic, thanks for letting us know. In addition, we have updated our model and now provide a checkpoint with improved performance. Best