Deepspeed stage 3 hanging after 1st validation sample #5635

Open
AceMcAwesome77 opened this issue Jun 10, 2024 · 0 comments
Labels
bug (Something isn't working), training

Comments

AceMcAwesome77 commented Jun 10, 2024

I am training a PyTorch Lightning encoder-decoder model on 2 GPUs. With DeepSpeed stage 2, the model trains fine, with roughly the 2x speedup expected.

With DeepSpeed stage 3, the model completes the full training phase of epoch 1 without problems, but once it reaches validation it gets through just 1 sample and then hangs permanently with no error message. Logging appears to show that 2 validation samples are running at once (even though the validation batch size is 1), and the hang starts right after one of those samples has finished and produced a result while the other is still running.

Is this a common problem with DeepSpeed stage 3? If so, what is the fix? The model has custom code that is too big to paste here, but I can paste relevant sections if there is a particular spot that would be useful to see. Thanks!
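
The symptoms above are consistent with one known failure mode, offered here as an assumption rather than a confirmed diagnosis for this model. With 2 GPUs in data-parallel mode, each rank processes its own validation sample, which would explain seeing 2 samples with a batch size of 1. Under stage 3 the parameters are partitioned across ranks and gathered during each forward pass, so every rank has to make the same number of forward calls. If validation runs an autoregressive decoding loop that stops as soon as a rank's own sample reaches end-of-sequence, the rank that finishes first stops participating and the other rank waits forever. The pure-Python sketch below only simulates that call-count mismatch; `forward_calls_per_rank`, `would_hang`, and the step counts are hypothetical names and numbers, not part of DeepSpeed or PyTorch Lightning.

```python
def forward_calls_per_rank(steps_needed, synced):
    """Simulate how many decoder forward passes each rank makes.

    steps_needed[i] is the number of decoding steps rank i needs before its
    own sample reaches end-of-sequence (hypothetical values for illustration).
    """
    if synced:
        # Every rank keeps stepping until the slowest rank is finished,
        # so all ranks make the same number of forward calls.
        total = max(steps_needed)
        return [total for _ in steps_needed]
    # Unsynced: each rank stops as soon as its own sample is finished.
    return list(steps_needed)


def would_hang(calls):
    """With sharded parameters, mismatched forward-call counts leave one
    rank waiting on a gather that the other rank never joins."""
    return len(set(calls)) > 1


# Assumed example: rank 0 finishes after 5 steps, rank 1 after 9.
unsynced = forward_calls_per_rank([5, 9], synced=False)
synced = forward_calls_per_rank([5, 9], synced=True)
print(unsynced, would_hang(unsynced))
print(synced, would_hang(synced))
```

If decoding goes through Hugging Face `generate`, its `synced_gpus=True` argument is meant for this situation. For a custom decoding loop, the equivalent would be to keep running forward passes on every rank until all ranks report that they are finished.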

AceMcAwesome77 added the bug and training labels on Jun 10, 2024