
[BUG] Multi-GPU Training frozen after finishing first epoch #4730

Open
drizzle0171 opened this issue Nov 27, 2023 · 5 comments
Labels
bug (Something isn't working), training

Comments

@drizzle0171

drizzle0171 commented Nov 27, 2023

Describe the bug
Hello, I am training a VAE. The resolution of the data is large, so I have recently been using DeepSpeed. With DeepSpeed, every iteration of the first epoch completes, but training never moves on to the next epoch; even after a few hours it is still stuck in the first epoch, as shown below.
However, if I reduce the number of GPUs to one, it goes straight to the next epoch. It looks like a communication problem between the GPUs at the epoch boundary. How can I solve this?

Screenshots
(two screenshots: the training progress output stuck in the first epoch, and the GPU utilization readout)
As you can see, GPU utilization is at 100% on every GPU.

System info (please complete the following information):

  • GPU count and types: four to eight machines with A100s
@drizzle0171 drizzle0171 added the bug (Something isn't working) and training labels Nov 27, 2023
@HUAFOR

HUAFOR commented Dec 12, 2023

I'm hitting the same issue. I suggest adding more detailed logging to pinpoint the problem; set these before running your training command:
export TORCH_DISTRIBUTED_DEBUG=DETAIL
export DEEPSPEED_LOG_LEVEL=debug
export OMPI_MCA_btl_base_verbose=1
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export TORCH_CPP_LOG_LEVEL=INFO
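
A minimal sketch (not from the original comment, and only an assumption about how training is launched): if you start training from a Python entry script rather than a shell wrapper, the same debug switches can be set in the script itself.

import os

# Illustrative only: these must be set before torch.distributed / DeepSpeed is initialized.
debug_env = {
    "TORCH_DISTRIBUTED_DEBUG": "DETAIL",
    "DEEPSPEED_LOG_LEVEL": "debug",
    "OMPI_MCA_btl_base_verbose": "1",
    "NCCL_DEBUG": "INFO",
    "NCCL_DEBUG_SUBSYS": "ALL",
    "TORCH_CPP_LOG_LEVEL": "INFO",
}
os.environ.update(debug_env)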

@HUAFOR

HUAFOR commented Dec 12, 2023

I'm training a diffusion pipeline with DeepSpeed stage 2 on 8 A100 GPUs. The first epoch trains fine, but the process freezes when the second epoch starts, so I added some debug logging to track down the problem:
export TORCH_DISTRIBUTED_DEBUG=DETAIL
export DEEPSPEED_LOG_LEVEL=debug
export OMPI_MCA_btl_base_verbose=1
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export TORCH_CPP_LOG_LEVEL=INFO
After adding the debug logging, the process no longer freezes for hours; instead it quickly fails with an error.

In my case:
base model + DeepSpeed trains fine.
base model + custom neural module B + DeepSpeed does not train; it freezes in the second epoch.
If I disable DeepSpeed, base model + custom neural module B trains fine.

@drizzle0171
Author

Thank you for sharing your experience. Actually, in my case, the problem was caused by the logger.

For reference, the environment set up for training was as follows.

  • pytorch lightning 1.7.7
  • using the WandbLogger implemented in PyTorch Lightning

To cut to the chase: when I used DeepSpeed without adding the argument rank_zero_only=True to self.log(), training would stop after the first epoch, which is the problem I mentioned. After adding that argument to self.log, it worked fine.

To summarize,

# before
self.log("[Train] Total Loss",
         aeloss,
         prog_bar=False,
         logger=True,
         on_step=True,
         on_epoch=True,
         sync_dist=True,
         )

# after
self.log("[Train] Total Loss",
         aeloss,
         prog_bar=False,
         logger=True,
         on_step=True,
         on_epoch=True,
         sync_dist=True,
         rank_zero_only=True,  # added argument
         )

Additionally, the link below helped me solve my problem.
Lightning-AI/pytorch-lightning#11242
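
For context, here is a minimal sketch of where such a call typically sits in a LightningModule training_step; the class name and loss computation are placeholders, not taken from the original code.

import pytorch_lightning as pl

class VAEModule(pl.LightningModule):  # hypothetical name, for illustration only
    def training_step(self, batch, batch_idx):
        aeloss = self.compute_loss(batch)  # placeholder for the actual VAE loss
        # rank_zero_only=True restricts this log call to rank 0, which resolved the hang described above.
        self.log("[Train] Total Loss",
                 aeloss,
                 prog_bar=False,
                 logger=True,
                 on_step=True,
                 on_epoch=True,
                 sync_dist=True,
                 rank_zero_only=True,
                 )
        return aeloss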

@HUAFOR

HUAFOR commented Dec 13, 2023

Thank you for sharing, but it doesn't work for my case /(ㄒoㄒ)/~~. Thanks anyway!

@itsik1

itsik1 commented Dec 20, 2023

I'm also having the same problem.
In trainer.test, it gets stuck after a fixed number of examples (images).
I tried halving the batch size, and it got stuck at the same number of images.
I tried changing the images I'm using; same problem.
Using rtx_8000 / rtx_6000
Latest pytorch-lightning (pytorch-lightning==2.1.2)

Working locally on the HICO dataset:
https://websites.umich.edu/~ywchao/hico/
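
One guess worth ruling out (not a confirmed diagnosis for this case): a hang at a fixed sample count can happen when ranks end up with different numbers of batches during evaluation. A sketch of checking that with a plain PyTorch DataLoader; the dataset variable and batch size are illustrative.

from torch.utils.data import DataLoader, DistributedSampler

# test_dataset is a placeholder for your actual dataset object.
sampler = DistributedSampler(test_dataset, shuffle=False, drop_last=True)
test_loader = DataLoader(test_dataset,
                         batch_size=16,   # illustrative value
                         sampler=sampler,
                         num_workers=4,
                         drop_last=True)  # keeps every rank on the same number of batches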
