[BUG] load_checkpoint fails after deepspeed engine started training #1612

Open
stas00 opened this issue Dec 6, 2021 · 0 comments
Labels
bug Something isn't working

Comments

@stas00
Collaborator

stas00 commented Dec 6, 2021

Describe the bug

load_checkpoint works on a freshly created deepspeed engine, but once that engine has been used for training, calling load_checkpoint again fails.

This is exactly the same issue as reported in #1394, which was closed with a workaround rather than a solution.

In Transformers (huggingface/transformers#14652) I applied the same re-init workaround I proposed in that issue, for the case where users want to reload the best model at the end of training. However, this would be hugely slow for any large model, because everything has to be reallocated.
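For reference, the re-init workaround can be sketched roughly as follows. This is only an illustration of the idea, not the code from the Transformers PR: the helper name `reload_via_fresh_engine` and the `build_engine` callback are hypothetical, and the sketch assumes that `load_checkpoint` returns a `(load_path, client_state)` tuple whose `load_path` is `None` when loading fails.

```python
from typing import Any, Callable, Optional


def reload_via_fresh_engine(
    build_engine: Callable[[], Any],
    checkpoint_dir: str,
    tag: Optional[str] = None,
) -> Any:
    """Re-create the engine, then load the checkpoint into the fresh instance.

    `build_engine` is a hypothetical callback that returns a brand-new
    deepspeed engine, e.g. by calling deepspeed.initialize(...) again.
    The engine that was already used for training is simply discarded.
    """
    engine = build_engine()  # everything is reallocated here, hence the cost
    load_path, _client_state = engine.load_checkpoint(checkpoint_dir, tag=tag)
    if load_path is None:
        # Assumption: a None path signals that nothing was loaded.
        raise RuntimeError(f"failed to load checkpoint from {checkpoint_dir!r}")
    return engine


# Possible wiring with real deepspeed (not executed here; `model` and
# `ds_config` are placeholders for your own objects):
#
#   import deepspeed
#
#   def build_engine():
#       engine, _, _, _ = deepspeed.initialize(
#           model=model,
#           model_parameters=model.parameters(),
#           config=ds_config,
#       )
#       return engine
#
#   engine = reload_via_fresh_engine(build_engine, "output/checkpoint-best")
```

The call to `build_engine()` is where the reallocation happens, which is why this approach scales poorly with model size.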

Thank you!

Reproduction script: #1750 (comment)

@tjruwase, @jeffra

@stas00 stas00 added the bug Something isn't working label Dec 6, 2021
@stas00 stas00 changed the title from "[BUG]" to "[BUG] load_checkpoint fails after deepspeed engine started training" Dec 6, 2021