[BUG] Multi-GPU Training frozen after finishing first epoch #4730
Comments
I met the same problem. Before re-running your training command, I would first add more specific logs to locate where it hangs.
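For example, a minimal sketch of what I mean by per-rank logging (this assumes a PyTorch Lightning module, since that is what the fix below uses; on_train_epoch_start/on_train_epoch_end are standard Lightning hooks, and MyModel is just a placeholder name):

import pytorch_lightning as pl
import torch.distributed as dist

class MyModel(pl.LightningModule):
    def _rank(self):
        # 0 when running single-process, the real rank under DDP/DeepSpeed.
        return dist.get_rank() if dist.is_initialized() else 0

    def on_train_epoch_start(self):
        print(f"[rank {self._rank()}] epoch {self.current_epoch} START", flush=True)

    def on_train_epoch_end(self):
        # If some ranks print this and others never do, the hang is in an
        # epoch-end collective (e.g. a synchronized logging call).
        print(f"[rank {self._rank()}] epoch {self.current_epoch} END", flush=True)

If one rank reaches END while the others are stuck, a collective operation is being called on some ranks but not all of them.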
I'm training a diffusion pipeline with DeepSpeed stage 2 on 8 A100 GPUs. The first epoch trains fine, but the process freezes at the start of the second epoch, so I added some debug logs to find the problem. In my case:
Thank you for sharing your experience. Actually, in my case, the problem was caused by the logger. For reference, the environment set up for training was as follows.
To cut to the chase: if I used DeepSpeed without adding the argument rank_zero_only=True to self.log, training froze after the first epoch. To summarize:
# before
self.log("[Train] Total Loss",
aeloss,
prog_bar=False,
logger=True,
on_step=True,
on_epoch=True,
sync_dist=True,
)
# after
self.log("[Train] Total Loss",
         aeloss,
         prog_bar=False,
         logger=True,
         on_step=True,
         on_epoch=True,
         sync_dist=True,
         rank_zero_only=True,  # added argument
         )

Additionally, the link below helped me solve my problem.
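As for why that argument matters, my understanding (not verified against the Lightning internals) is that sync_dist=True makes self.log perform a collective reduction across all ranks, so every rank must reach the same log call; if the calls are unbalanced, the remaining ranks block forever. A hypothetical standalone script illustrating that failure mode, independent of Lightning (launch with: torchrun --nproc_per_node=2 repro.py):

import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)
loss = torch.ones(1, device="cuda")

# all_reduce is a collective: every rank must call it. Here rank 0
# deliberately skips the call, so the other ranks block inside
# all_reduce forever -- the "frozen" symptom described in this issue.
if rank != 0:
    dist.all_reduce(loss)

dist.barrier()  # never reached by the non-zero ranks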
Thank you for sharing, but unfortunately it doesn't work in my case /(ㄒoㄒ)/~~. Anyway, thanks!
I'm also having the same problem, working locally on the HICO dataset.
Describe the bug
Hello, I am training a VAE. The resolution of the data is large, so I have recently been using DeepSpeed. With DeepSpeed, every iteration of the first epoch runs, but training never moves on to the next epoch; even after a few hours it is still stuck in the first epoch, as shown below.
However, if I reduce the number of GPUs to one, it moves straight to the next epoch. Since the hang appears exactly at the epoch boundary, it looks like a communication problem between the GPUs. How can this be solved?
Screenshots
As you can see, GPU-Utils are all 100%.
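(100% utilization on every GPU is consistent with ranks busy-waiting inside a stuck NCCL collective rather than doing useful work. A sketch of diagnostic settings that can turn a silent hang into per-rank logs or an explicit error; these are standard PyTorch/NCCL environment variables, set before the Trainer starts:)

import os

# Per-rank NCCL logs: shows which collective each rank last entered.
os.environ["NCCL_DEBUG"] = "INFO"

# Extra consistency checks that report mismatched collective calls
# across ranks instead of hanging silently.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"

# Abort a stuck collective once the process-group timeout expires, so
# the job dies with a traceback instead of freezing forever. On newer
# PyTorch versions the name is TORCH_NCCL_ASYNC_ERROR_HANDLING.
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"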
System info: