
[BUG] Multi-GPU Training frozen after finishing first epoch #4730

Open
drizzle0171 opened this issue Nov 27, 2023 · 5 comments
Labels
bug (Something isn't working), training

Comments

@drizzle0171

drizzle0171 commented Nov 27, 2023

Describe the bug
Hello, I am training a VAE. The resolution of the data is large, so I have recently been using DeepSpeed. With DeepSpeed, every iteration of the first epoch completes, but training never moves on to the next epoch; even after a few hours it is still stuck in the first epoch, as shown below.
However, if I reduce the number of GPUs to one, it goes straight to the next epoch. It looks like a communication problem between the GPUs at the epoch boundary. How can I solve this?

Screenshots
(two screenshots: the training progress output stuck in the first epoch, and the GPU utilization readout)
As you can see, GPU utilization is at 100% on every GPU.

System info (please complete the following information):

  • GPU count and types: four to eight machines with A100s
@drizzle0171 drizzle0171 added the bug (Something isn't working) and training labels Nov 27, 2023
@HUAFOR

HUAFOR commented Dec 12, 2023

I'm hitting the same issue. I suggest adding more detailed logging to pinpoint the problem; set these before running your training command:
export TORCH_DISTRIBUTED_DEBUG=DETAIL
export DEEPSPEED_LOG_LEVEL=debug
export OMPI_MCA_btl_base_verbose=1
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export TORCH_CPP_LOG_LEVEL=INFO
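
A minimal sketch (not from the original comment, and only an assumption about how training is launched): if you start training from a Python entry script rather than a shell wrapper, the same debug switches can be set in the script itself.

import os

# Illustrative only: these must be set before torch.distributed / DeepSpeed is initialized.
debug_env = {
    "TORCH_DISTRIBUTED_DEBUG": "DETAIL",
    "DEEPSPEED_LOG_LEVEL": "debug",
    "OMPI_MCA_btl_base_verbose": "1",
    "NCCL_DEBUG": "INFO",
    "NCCL_DEBUG_SUBSYS": "ALL",
    "TORCH_CPP_LOG_LEVEL": "INFO",
}
os.environ.update(debug_env)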

@HUAFOR

HUAFOR commented Dec 12, 2023

I'm training a diffusion pipeline with DeepSpeed stage 2 on 8 A100 GPUs. The first epoch trains fine, but the process freezes when the second epoch starts, so I added some debug logging to track down the problem:
export TORCH_DISTRIBUTED_DEBUG=DETAIL
export DEEPSPEED_LOG_LEVEL=debug
export OMPI_MCA_btl_base_verbose=1
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export TORCH_CPP_LOG_LEVEL=INFO
After adding the debug logging, the process no longer freezes for hours; instead it quickly fails with an error.

In my case:
base model + DeepSpeed trains fine.
base model + custom neural module B + DeepSpeed does not train; it freezes in the second epoch.
If I disable DeepSpeed, base model + custom neural module B trains fine.

@drizzle0171
Author

Thank you for sharing your experience. Actually, in my case, the problem was caused by the logger.

For reference, the environment set up for training was as follows.

  • pytorch lightning 1.7.7
  • using the WandbLogger implemented in PyTorch Lightning

To cut to the chase: when I used DeepSpeed without adding the argument rank_zero_only=True to self.log(), training would stop after the first epoch, which is the problem I mentioned. After adding that argument to self.log, it worked fine.

To summarize,

# before
self.log("[Train] Total Loss",
         aeloss,
         prog_bar=False,
         logger=True,
         on_step=True,
         on_epoch=True,
         sync_dist=True,
         )

# after
self.log("[Train] Total Loss",
         aeloss,
         prog_bar=False,
         logger=True,
         on_step=True,
         on_epoch=True,
         sync_dist=True,
         rank_zero_only=True,  # added argument
         )

Additionally, the link below helped me solve my problem.
Lightning-AI/pytorch-lightning#11242
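
For context, here is a minimal sketch of where such a call typically sits in a LightningModule training_step; the class name and loss computation are placeholders, not taken from the original code.

import pytorch_lightning as pl

class VAEModule(pl.LightningModule):  # hypothetical name, for illustration only
    def training_step(self, batch, batch_idx):
        aeloss = self.compute_loss(batch)  # placeholder for the actual VAE loss
        # rank_zero_only=True restricts this log call to rank 0, which resolved the hang described above.
        self.log("[Train] Total Loss",
                 aeloss,
                 prog_bar=False,
                 logger=True,
                 on_step=True,
                 on_epoch=True,
                 sync_dist=True,
                 rank_zero_only=True,
                 )
        return aeloss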

@HUAFOR

HUAFOR commented Dec 13, 2023

Thank you for sharing, but it doesn't work for my case /(ㄒoㄒ)/~~. Thanks anyway!

@itsik1

itsik1 commented Dec 20, 2023

I'm also having the same problem.
In trainer.test, it gets stuck after a fixed number of examples (images).
I tried halving the batch size, and it got stuck at the same number of images.
I tried changing the images I'm using; same problem.
Using rtx_8000 / rtx_6000
Latest pytorch-lightning (pytorch-lightning==2.1.2)

Working locally on the HICO dataset:
https://websites.umich.edu/~ywchao/hico/
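
One guess worth ruling out (not a confirmed diagnosis for this case): a hang at a fixed sample count can happen when ranks end up with different numbers of batches during evaluation. A sketch of checking that with a plain PyTorch DataLoader; the dataset variable and batch size are illustrative.

from torch.utils.data import DataLoader, DistributedSampler

# test_dataset is a placeholder for your actual dataset object.
sampler = DistributedSampler(test_dataset, shuffle=False, drop_last=True)
test_loader = DataLoader(test_dataset,
                         batch_size=16,   # illustrative value
                         sampler=sampler,
                         num_workers=4,
                         drop_last=True)  # keeps every rank on the same number of batches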
