I am training a PyTorch Lightning encoder-decoder model on 2 GPUs. The model trains fine (with roughly the 2x acceleration expected) when using DeepSpeed stage 2. When using DeepSpeed stage 3, the model passes through the full training phase of epoch 1 fine, but when it reaches the validation phase, it processes just 1 sample before hanging permanently with no error message.

Logging appears to show that it is running 2 validation samples at once (even though the validation batch size is 1), and the hang starts right after one of the validation samples has finished and produced a result while the other is still running.

Is this a common problem when using DeepSpeed stage 3? If so, what is the fix? The model has custom code that would be too big to paste here, but I can try to paste relevant sections of it if there is a particular spot that would be useful to see. Thanks!
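One plausible explanation for this symptom (an assumption, not confirmed as the cause of this issue) is that the two ranks execute different numbers of validation forward passes. ZeRO stage 3 shards parameters across ranks, so every forward pass requires a collective all-gather involving all ranks; if one rank exits its validation loop early while the other still has a batch left, the remaining rank blocks forever inside the collective, which matches the "one sample finishes, then everything hangs" pattern. A quick sanity check is to compare per-rank batch counts. The sketch below (hypothetical helper name) mimics the per-rank sample-count logic of `torch.utils.data.DistributedSampler`, which by default pads the dataset with repeated samples so every rank sees the same number of batches, and truncates instead when `drop_last=True`:

```python
import math

def per_rank_batches(dataset_size: int, world_size: int,
                     batch_size: int, drop_last_sampler: bool = False) -> int:
    """Hypothetical helper: number of batches each rank runs per epoch,
    mirroring DistributedSampler's sample-count arithmetic."""
    if drop_last_sampler:
        # DistributedSampler(drop_last=True) truncates so counts divide evenly.
        samples_per_rank = dataset_size // world_size
    else:
        # Default behavior: pad with repeated samples so all ranks match.
        samples_per_rank = math.ceil(dataset_size / world_size)
    return math.ceil(samples_per_rank / batch_size)

# 101 validation samples, 2 GPUs, val batch size 1:
print(per_rank_batches(101, 2, 1))                          # 51 on each rank
print(per_rank_batches(101, 2, 1, drop_last_sampler=True))  # 50 on each rank
```

If Lightning is allowed to wrap the validation dataloader in a `DistributedSampler` (its default behavior with DDP-style strategies), counts stay equal and this failure mode should not occur; a hang like the one described can appear when a custom sampler, an `IterableDataset` with uneven per-rank splits, or per-rank early exits bypass that guarantee.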