Questions about the training efficiency #57
Comments
It's strange. When I use DDP, the GPU utilization fluctuates but is always high (70%–100%). Can you try to train the model using one GPU and check the GPU utilization?
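(For anyone reproducing this check: `watch -n 1 nvidia-smi` works, or a small polling script like the sketch below. This is a minimal sketch, not part of the repo — it assumes `nvidia-smi` is on the PATH, and the 1-second interval is arbitrary.)

```python
import subprocess
import time

# Minimal GPU-utilization logger: polls nvidia-smi once per second and
# prints a timestamped utilization reading for every GPU.
while True:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    ).stdout.strip()
    print(time.strftime("%H:%M:%S"), out.replace("\n", " | "))
    time.sleep(1)
```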
@JingyunLiang Thanks for your quick reply~ Following your instruction, I tried training the model on a single GPU to verify the GPU utilization, once with DP and once with DDP. I will try to solve this problem in the coming days and update my progress here. If there is no progress on the GPU utilization, I will close this issue. Thanks.
By the way, I'm using the …
@XiaoqiangZhou any update on this? I am also facing a similarly slow training time. With batch size 16 and 1000 iterations per epoch, it takes about 1000 seconds to run a single epoch — roughly 1 second per iteration, so 500k iterations would take close to 6 days. Any insights on this @JingyunLiang?
I have the same problem; GPU utilization is very low.
Has anyone solved this problem?
You can try cropping the large images in the training set into small sub-images, because a lot of I/O time is spent reading the high-resolution images. After using this method, the GPU utilization no longer dropped to 0 during my training. If the problem persists, the CPU in your server is probably too slow.
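(This is not the repo's own preprocessing script — BasicSR, for example, ships `scripts/data_preparation/extract_subimages.py` for exactly this — but a minimal sketch of the idea. The directory names, patch size, and step below are illustrative placeholders:)

```python
import os
from PIL import Image

def crop_to_subimages(src_dir, dst_dir, patch, step):
    """Slide a patch x patch window over every image in src_dir and save
    the tiles to dst_dir. Assumes images are larger than the patch size;
    the last partial row/column is not specially handled."""
    os.makedirs(dst_dir, exist_ok=True)
    for name in sorted(os.listdir(src_dir)):
        img = Image.open(os.path.join(src_dir, name))
        w, h = img.size
        stem, ext = os.path.splitext(name)
        for top in range(0, h - patch + 1, step):
            for left in range(0, w - patch + 1, step):
                tile = img.crop((left, top, left + patch, top + patch))
                # Zero-pad coordinates so sorted filenames keep HR/LR pairs aligned.
                tile.save(os.path.join(dst_dir, f"{stem}_{top:04d}_{left:04d}{ext}"))

# Hypothetical paths. For paired SR data, crop the LR set with patch and
# step divided by the scale so the sub-images stay aligned (here x2).
crop_to_subimages("trainsets/DIV2K_HR", "trainsets/DIV2K_HR_sub", patch=480, step=240)
crop_to_subimages("trainsets/DIV2K_LR_x2", "trainsets/DIV2K_LR_x2_sub", patch=240, step=120)
```

After this, each read during training touches a small file instead of a full ~2K-resolution image, which is usually enough to keep the data pipeline ahead of the GPU.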
Thank you @songwg188 |
Should you do it for both the high-res and low-res images?
Thanks for releasing the code of SwinIR, which is really great work for low-level vision tasks.
However, when I train the SwinIR model following the guidance provided in the repo, I find the training efficiency is relatively low.
Specifically, the GPU utilization rate stays at 0 for a while from time to time (it runs for about 14 seconds, then sleeps for about 14 seconds). When the GPU utilization is 0, the CPU utilization is also 0. It's worth noting that I use DDP training on 8 TITAN RTX GPUs with the default batch_size. I train the classical SR task on the DIV2K dataset at ×2 scale. After half a day of training, the epoch, iteration, and PSNR on Set5 are about 1500, 42000, and 35.73 dB, respectively. At this rate, it will take about 5 days to finish the 500k iterations, far exceeding the 2 days reported in the README.
Could you please help me figure out the reason for the low training efficiency?
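(A quick way to tell whether such a stall is in data loading rather than GPU compute is to time the gap between batches. A minimal sketch, assuming a standard PyTorch `DataLoader`; `train_loader` and `model_step` are hypothetical placeholders, not the repo's actual training loop:)

```python
import time
import torch

def profile_loader(train_loader, model_step, log_every=50):
    """Time how long each iteration waits for data vs. how long the
    training step takes. model_step(batch) is a placeholder for the
    actual forward/backward/optimizer code."""
    t0 = time.time()
    for i, batch in enumerate(train_loader):
        t_data = time.time() - t0          # time spent waiting for this batch
        model_step(batch)                  # hypothetical training step
        torch.cuda.synchronize()           # flush queued GPU work before timing
        t_compute = time.time() - t0 - t_data
        if i % log_every == 0:
            print(f"iter {i}: data {t_data:.3f}s  compute {t_compute:.3f}s")
        t0 = time.time()
```

If the data time dominates, raising `num_workers`, setting `pin_memory=True`, or cropping the training set into sub-images as suggested above are the usual fixes.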