I am training a PyTorch Lightning encoder-decoder model on 2 GPUs. The model trains fine (with roughly the 2x acceleration expected) when using DeepSpeed stage 2. When using DeepSpeed stage 3, the model passes through the full training phase of epoch 1 fine, but when it reaches the validation phase, it processes just 1 sample before hanging permanently with no error message.

Logging appears to show that it is running 2 validation samples at once (even though the validation batch size is 1), and the hang starts right after one of the validation samples has finished and produced a result while the other is still running.

Is this a common problem when using DeepSpeed stage 3? If so, what is the fix? The model has custom code that would be too big to paste here, but I can try to paste relevant sections of it if there is a particular spot that would be useful to see. Thanks!
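One plausible explanation for this symptom (an assumption, not confirmed as the cause of this issue) is that the two ranks execute different numbers of validation forward passes. ZeRO stage 3 shards parameters across ranks, so every forward pass requires a collective all-gather involving all ranks; if one rank exits its validation loop early while the other still has a batch left, the remaining rank blocks forever inside the collective, which matches the "one sample finishes, then everything hangs" pattern. A quick sanity check is to compare per-rank batch counts. The sketch below (hypothetical helper name) mimics the per-rank sample-count logic of `torch.utils.data.DistributedSampler`, which by default pads the dataset with repeated samples so every rank sees the same number of batches, and truncates instead when `drop_last=True`:

```python
import math

def per_rank_batches(dataset_size: int, world_size: int,
                     batch_size: int, drop_last_sampler: bool = False) -> int:
    """Hypothetical helper: number of batches each rank runs per epoch,
    mirroring DistributedSampler's sample-count arithmetic."""
    if drop_last_sampler:
        # DistributedSampler(drop_last=True) truncates so counts divide evenly.
        samples_per_rank = dataset_size // world_size
    else:
        # Default behavior: pad with repeated samples so all ranks match.
        samples_per_rank = math.ceil(dataset_size / world_size)
    return math.ceil(samples_per_rank / batch_size)

# 101 validation samples, 2 GPUs, val batch size 1:
print(per_rank_batches(101, 2, 1))                          # 51 on each rank
print(per_rank_batches(101, 2, 1, drop_last_sampler=True))  # 50 on each rank
```

If Lightning is allowed to wrap the validation dataloader in a `DistributedSampler` (its default behavior with DDP-style strategies), counts stay equal and this failure mode should not occur; a hang like the one described can appear when a custom sampler, an `IterableDataset` with uneven per-rank splits, or per-rank early exits bypass that guarantee.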