
Please fix Lora model resume in transformers when using DeepSpeed #746

Closed
lucasjinreal opened this issue May 31, 2023 · 9 comments

@lucasjinreal

I found a potential bug in transformers when training with DeepSpeed:

  1. transformers resumes the checkpoint through DeepSpeed when DeepSpeed is enabled:


Obviously the model resumes through DeepSpeed whenever deepspeed is not None.


And DeepSpeed will only look inside a directory like checkpoint-1200 that contains a global_step1200 folder to find the state.
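
Roughly, the resume path looks like the sketch below (a paraphrase, not the exact transformers source; names and locations differ between versions):

    # Paraphrased sketch of how transformers resumes when DeepSpeed is enabled.
    import glob
    import os

    def resume_with_deepspeed(deepspeed_engine, resume_from_checkpoint):
        # DeepSpeed only treats the checkpoint as resumable if it finds a
        # global_step* sub-directory, e.g. checkpoint-1200/global_step1200.
        checkpoint_dirs = sorted(glob.glob(os.path.join(resume_from_checkpoint, "global_step*")))
        if len(checkpoint_dirs) > 0:
            load_path, _ = deepspeed_engine.load_checkpoint(
                resume_from_checkpoint,
                load_optimizer_states=True,
                load_lr_scheduler_states=True,
            )
            if load_path is None:
                raise ValueError(f"failed to resume from checkpoint {resume_from_checkpoint}")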

  2. And then, boom, the problem appears:

The state dict of the model defined in transformers looks like this:

"base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight", "base_model.model.model.layers.0.self_attn.q_proj.lora_B.weight", 
"base_model.model.model.layers.0.self_attn.k_proj.lora_A.weight", "base_model.model.model.layers.0.self_attn.k_proj.lora_B.weight", "base_model.model.model.layers.0.self_attn.v_proj.lora_A.weight", 
"base_model.model.model.layers.0.self_attn.v_proj.lora_B.weight", "base_model.model.model.layers.0.self_attn.o_proj.lora_A.weight", "base_model.model.model.layers.0.self_attn.o_proj.lora_B.weight", 
"base_model.model.model.layers.0.mlp.gate_proj.lora_A.weight", "base_model.model.model.layers.0.mlp.gate_proj.lora_B.weight", "base_model.model.model.layers.0.mlp.down_proj.lora_A.weight", 
"base_model.model.model.layers.0.mlp.down_proj.lora_B.weight", "base_model.model.model.layers.0.mlp.up_proj.lora_A.weight", "base_model.model.model.layers.0.mlp.up_proj.lora_B.weight", 
"base_model.model.model.layers.1.self_attn.q_proj.lora_A.weight", "base_model.model.model.layers.1.self_attn.q_proj.lora_B.weight", "base_model.model.model.layers.1.self_attn.k_proj.lora_A.weight", 
"base_model.model.model.layers.1.self_attn.k_proj.lora_B.weight", "base_model.model.model.layers.1.self_attn.v_proj.lora_A.weight",

But DeepSpeed's state dict looks like this:

dict_keys(['base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight', 'base_model.model.model.layers.0.self_attn.k_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.k_proj.lora_B.default.weight', 'base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight', 'base_model.model.model.layers.0.self_attn.o_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.o_proj.lora_B.default.weight', 
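
The only difference is the adapter name inserted into each key. A tiny sketch to make the mismatch concrete (the strip_adapter_name helper is hypothetical, written only for this comparison):

    # The full model state dict carries the adapter name ("default");
    # the other layout does not.
    peft_key  = "base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight"
    model_key = "base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight"

    def strip_adapter_name(key, adapter_name="default"):
        # Hypothetical helper: "...lora_A.default.weight" -> "...lora_A.weight"
        return key.replace(f".lora_A.{adapter_name}.", ".lora_A.").replace(
            f".lora_B.{adapter_name}.", ".lora_B."
        )

    assert strip_adapter_name(model_key) == peft_key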

How do these load correctly?

I am so confused, please guide me.

@lucasjinreal
Author

And I found that in transformers, the "default" suffix is only added to LoRA weight names when the LoRA model type is ADALORA.


But mine is not ADALORA, so how are DeepSpeed checkpoints normally resumed correctly?
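
For reference, a way to see both key layouts side by side (a sketch that assumes model is an already-built PeftModel and a peft version that exposes get_peft_model_state_dict; behavior may differ across peft releases):

    # Sketch: the in-memory keys keep the adapter name ("default"), while the
    # state dict peft prepares for saving strips it. Assumes `model` is a PeftModel.
    from peft import get_peft_model_state_dict

    full_keys = [k for k in model.state_dict().keys() if "lora_" in k]
    saved_keys = list(get_peft_model_state_dict(model).keys())

    print(full_keys[0])   # e.g. ...self_attn.q_proj.lora_A.default.weight
    print(saved_keys[0])  # e.g. ...self_attn.q_proj.lora_A.weight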

@ArthurZucker
Collaborator

cc @younesbelkada 😉

@huggingface deleted a comment from the github-actions bot on Jun 30, 2023
@acforvs

acforvs commented Jul 17, 2023

Hey @ArthurZucker, any updates on this?

@ArthurZucker
Collaborator

Hey! I am not working on this; both @younesbelkada and @pacman100 should know more about it and whether it was fixed in the latest release!

@younesbelkada transferred this issue from huggingface/transformers on Jul 24, 2023
@kazemf78

kazemf78 commented Jul 26, 2023

I have bumped into a somewhat similar problem, and I think what I found will help you, too.

I didn't hit your exact error message above, but the error I got when attempting to resume from a LoRA model checkpoint saved by DeepSpeed was the same one stated in the issue linked below: RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM: Missing key(s) in state_dict:

Your issue is essentially the same, I guess. With PEFT and LoRA, only some of the model's layers are updated, so only those layers are written to the checkpoint. When loading, only those layers can be restored, yet DeepSpeed expects every layer to be present. The proposed solution was to change the DeepSpeedEngine load_checkpoint function so that it no longer insists on loading every layer, as described in these two linked issues and answers: 1 2.

Because the trainer only saves some of the parameters, and DeepSpeed defaults to strict=True when loading the model, which requires every weight to be present, an error occurs.
The DeepSpeed source code needs to be modified. Because DeepSpeed versions differ, the exact code location cannot be given directly.
The approximate fix is to change the default value of the load_module_strict parameter of the load_checkpoint function in deepspeed/runtime/engine.py to False.

and also this answer:

I think you have already checked #464. Here is the detailed explanation: since only the LoRA part (plus embed_tokens and lm_head) of the full model is saved to the checkpoint, we must allow missing keys when resuming training and loading that checkpoint. When using Transformers with DeepSpeed, Transformers doesn't expose any parameter for allowing missing keys (see here), so we have to modify the source code.

There are two ways to fix the problem:

  1. Modify DeepSpeed:
     Continue training based on the last saved checkpoint #464 (comment)
  2. Modify Transformers:
     Change the original resume-from-checkpoint-with-DeepSpeed code to:

          load_path, _ = deepspeed_engine.load_checkpoint(
              resume_from_checkpoint,
              load_optimizer_states=True,
              load_lr_scheduler_states=True,
              load_module_strict=False,
          )

If you find a more convenient method, you are welcome to share it with us.
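
One less invasive variant (an untested sketch, not an official API) is to wrap the engine's load_checkpoint on the instance so that load_module_strict defaults to False, instead of editing the installed sources:

    import functools

    def allow_missing_keys(deepspeed_engine):
        # Wrap DeepSpeedEngine.load_checkpoint on this instance so that
        # load_module_strict defaults to False, letting the partially saved
        # LoRA weights load without raising on missing keys.
        original = deepspeed_engine.load_checkpoint

        @functools.wraps(original)
        def load_checkpoint(*args, **kwargs):
            kwargs.setdefault("load_module_strict", False)
            return original(*args, **kwargs)

        deepspeed_engine.load_checkpoint = load_checkpoint

The awkward part is getting hold of the engine before the Trainer calls load_checkpoint, which is why the two source-level patches above are what people actually use.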

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@dachenlian

I ran into the same issue. I checked the transformers code and it seems like the suggestion hasn't been implemented yet.

@kazemf78

I have tweaked a fork of Transformers to handle this resuming issue. Did you find a relevant pull request for this in the Transformers repository, or should I create a new one?

@ccdv-ai

ccdv-ai commented Jan 29, 2024

Having the same issue with PEFT + DeepSpeed stage 1.

The @kazemf78 trick fixes the problem for me:
In deepspeed.py:

    load_path, _ = deepspeed_engine.load_checkpoint(
        resume_from_checkpoint,
        load_optimizer_states=True,
        load_lr_scheduler_states=True,
        load_module_strict=False,  # add this param
    )

@younesbelkada, @pacman100
