[BUG] `load_checkpoint` fails after deepspeed engine started training #1612
Labels: bug
Describe the bug
`load_checkpoint` works when a fresh deepspeed engine has just been created, but if you train with that engine and then call `load_checkpoint` again, it fails. This is exactly the same issue as reported in #1394, which was closed with a workaround but not with a solution.
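A minimal sketch of the sequence that triggers the failure (the config dict, checkpoint path and tag below are illustrative, not taken from the actual reproduction script linked further down):

```python
import torch
import deepspeed

# Illustrative toy model and config (assumes a recent DeepSpeed where
# deepspeed.initialize accepts a config dict).
model = torch.nn.Linear(10, 10)
ds_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}

engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

# ... training loop: out = engine(batch); engine.backward(loss); engine.step() ...

engine.save_checkpoint("checkpoints", tag="best")

# Fails here: the engine has already been used for training, whereas the same
# call on a freshly initialized engine works.
engine.load_checkpoint("checkpoints", tag="best")
```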
I used the same re-init workaround I proposed on that issue in Transformers (huggingface/transformers#14652) for the case where users want to reload the best model at the end of training, but this would be hugely slow for any large model, because everything has to be re-allocated. A sketch of that workaround is below.
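Sketch of the re-init workaround, using the same illustrative names as the snippet above: instead of calling `load_checkpoint` on the trained engine, the engine is thrown away and rebuilt, and the checkpoint is loaded into the fresh engine.

```python
# Workaround: discard the trained engine and build a fresh one before loading.
del engine

engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

# load_checkpoint now succeeds because the engine is in its freshly
# initialized state, but re-allocating everything this way is very slow
# for any large model.
engine.load_checkpoint("checkpoints", tag="best")
```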
Thank you!
Reproduction script: #1750 (comment)
@tjruwase, @jeffra