Please fix Lora model resume in transformers when using DeepSpeed #746
Comments
cc @younesbelkada 😉
Hey @ArthurZucker, any updates on this?
Hey! I am not working on this, both @younesbelkada and @pacman100 should know more about this and whether this was fixed in the latest release or not!
I have bumped into a somewhat similar problem, and what I found may help you too. I didn't hit your exact error message above, but the error I got when attempting to resume from a LoRA model checkpoint saved by DeepSpeed was the same one stated in the issue linked below. With PEFT and LoRA only some of the model layers are updated, so only those layers are written when saving; hence, when loading, only those layers can be restored, and a strict load fails because the other layers were never stored in the checkpoint. The solution proposed there was to edit DeepSpeedEngine's checkpoint loading,
and also this answer:
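For context, a quick way to see why the strict load fails: PEFT persists only the adapter tensors, which are a small subset of the wrapped model's full state dict. A minimal sketch (the base model name and target modules below are placeholders, not taken from this thread):

```python
from peft import LoraConfig, get_peft_model, get_peft_model_state_dict
from transformers import AutoModelForCausalLM

# Placeholder base model and LoRA config; substitute your own.
base = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model = get_peft_model(base, LoraConfig(r=8, target_modules=["q_proj", "v_proj"]))

full_keys = set(model.state_dict().keys())
saved_keys = set(get_peft_model_state_dict(model).keys())  # what PEFT actually persists

print(f"{len(saved_keys)} adapter keys saved vs. {len(full_keys)} keys in the model")
# Because the checkpoint contains only the adapter subset, resuming with
# DeepSpeed's default load_module_strict=True reports missing keys, while
# load_module_strict=False tolerates the missing (frozen) base weights.
```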
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
I ran into the same issue. I checked the transformers code and it seems like the suggestion hasn't been implemented yet.
I have tweaked a fork of Transformers to handle this resuming issue. Did you find a relevant pull request for this in the Transformers repository, or should I create a new one?
Having the same issue with PEFT + DeepSpeed stage 1. The @kazemf78 trick fixed the problem for me:

```python
load_path, _ = deepspeed_engine.load_checkpoint(
    resume_from_checkpoint,
    load_optimizer_states=True,
    load_lr_scheduler_states=True,
    load_module_strict=False,  # add this param
)
```
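If you drive the DeepSpeed engine yourself rather than through the Trainer, one way to make every resume non-strict is to wrap the engine's `load_checkpoint` once. This is only a sketch of that idea, not an official fix, and the helper name is made up for illustration:

```python
import functools

def patch_engine_for_lora_resume(engine):
    """Force load_module_strict=False on every checkpoint load for this engine."""
    original_load = engine.load_checkpoint

    @functools.wraps(original_load)
    def load_checkpoint_non_strict(*args, **kwargs):
        kwargs["load_module_strict"] = False  # tolerate missing frozen base weights
        return original_load(*args, **kwargs)

    engine.load_checkpoint = load_checkpoint_non_strict
    return engine
```

This works because `load_module_strict` is a regular keyword argument of DeepSpeed's `load_checkpoint` (it defaults to `True`), so forcing it to `False` is equivalent to the edit discussed above.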
I found a potential bug in transformers when training with DeepSpeed:
Obviously the model will resume via DeepSpeed if deepspeed is not None.
And DeepSpeed will only try to load from a directory like checkpoint-1200 that contains a global_step1200 folder holding the state. The model's state dict as defined in transformers has keys like this:
But DeepSpeed's state dict looks like this:
How do they load correctly???
I am so confused, please guide me.
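To see the mismatch concretely, a small diagnostic sketch may help. It assumes the usual ZeRO stage 1/2 checkpoint layout (ZeRO-3 shards model states differently, so adjust the path for your setup) and that `model` is your already-constructed PEFT/LoRA-wrapped model:

```python
import torch

# Usual ZeRO-1/2 layout: checkpoint-<step>/global_step<step>/mp_rank_00_model_states.pt
ckpt = torch.load(
    "checkpoint-1200/global_step1200/mp_rank_00_model_states.pt",
    map_location="cpu",
)
saved_keys = set(ckpt["module"].keys())          # what DeepSpeed wrote
expected_keys = set(model.state_dict().keys())   # what transformers expects; `model` is assumed defined

print("in model but missing from checkpoint:", sorted(expected_keys - saved_keys)[:5])
print("in checkpoint but not in model:", sorted(saved_keys - expected_keys)[:5])
# Any non-empty "missing" list is exactly why a strict resume fails here and why
# load_module_strict=False (or remapping the key names) is needed for LoRA resume.
```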