
Please fix Lora model resume in transformers when using DeepSpeed #746

Closed
lucasjinreal opened this issue May 31, 2023 · 9 comments

@lucasjinreal

I found a potential bug in transformers when training with DeepSpeed:

  1. transformers resumes the checkpoint through DeepSpeed when DeepSpeed is enabled:


Obviously the model resumes through DeepSpeed whenever deepspeed is not None.


And DeepSpeed will only look inside a directory like checkpoint-1200 that contains a global_step1200 folder to find the state.
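
Roughly, the resume path looks like the sketch below (a paraphrase, not the exact transformers source; names and locations differ between versions):

    # Paraphrased sketch of how transformers resumes when DeepSpeed is enabled.
    import glob
    import os

    def resume_with_deepspeed(deepspeed_engine, resume_from_checkpoint):
        # DeepSpeed only treats the checkpoint as resumable if it finds a
        # global_step* sub-directory, e.g. checkpoint-1200/global_step1200.
        checkpoint_dirs = sorted(glob.glob(os.path.join(resume_from_checkpoint, "global_step*")))
        if len(checkpoint_dirs) > 0:
            load_path, _ = deepspeed_engine.load_checkpoint(
                resume_from_checkpoint,
                load_optimizer_states=True,
                load_lr_scheduler_states=True,
            )
            if load_path is None:
                raise ValueError(f"failed to resume from checkpoint {resume_from_checkpoint}")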

  2. And then, boom, the problem appears:

The state dict of the model defined in transformers looks like this:

"base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight", "base_model.model.model.layers.0.self_attn.q_proj.lora_B.weight", 
"base_model.model.model.layers.0.self_attn.k_proj.lora_A.weight", "base_model.model.model.layers.0.self_attn.k_proj.lora_B.weight", "base_model.model.model.layers.0.self_attn.v_proj.lora_A.weight", 
"base_model.model.model.layers.0.self_attn.v_proj.lora_B.weight", "base_model.model.model.layers.0.self_attn.o_proj.lora_A.weight", "base_model.model.model.layers.0.self_attn.o_proj.lora_B.weight", 
"base_model.model.model.layers.0.mlp.gate_proj.lora_A.weight", "base_model.model.model.layers.0.mlp.gate_proj.lora_B.weight", "base_model.model.model.layers.0.mlp.down_proj.lora_A.weight", 
"base_model.model.model.layers.0.mlp.down_proj.lora_B.weight", "base_model.model.model.layers.0.mlp.up_proj.lora_A.weight", "base_model.model.model.layers.0.mlp.up_proj.lora_B.weight", 
"base_model.model.model.layers.1.self_attn.q_proj.lora_A.weight", "base_model.model.model.layers.1.self_attn.q_proj.lora_B.weight", "base_model.model.model.layers.1.self_attn.k_proj.lora_A.weight", 
"base_model.model.model.layers.1.self_attn.k_proj.lora_B.weight", "base_model.model.model.layers.1.self_attn.v_proj.lora_A.weight",

But DeepSpeed's state dict looks like this:

dict_keys(['base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight', 'base_model.model.model.layers.0.self_attn.k_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.k_proj.lora_B.default.weight', 'base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight', 'base_model.model.model.layers.0.self_attn.o_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.o_proj.lora_B.default.weight', 
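
The only difference is the adapter name inserted into each key. A tiny sketch to make the mismatch concrete (the strip_adapter_name helper is hypothetical, written only for this comparison):

    # The full model state dict carries the adapter name ("default");
    # the other layout does not.
    peft_key  = "base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight"
    model_key = "base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight"

    def strip_adapter_name(key, adapter_name="default"):
        # Hypothetical helper: "...lora_A.default.weight" -> "...lora_A.weight"
        return key.replace(f".lora_A.{adapter_name}.", ".lora_A.").replace(
            f".lora_B.{adapter_name}.", ".lora_B."
        )

    assert strip_adapter_name(model_key) == peft_key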

How do these load correctly?

I am so confused, please guide me.

@lucasjinreal
Author

And I found that in transformers, the "default" suffix is only added to LoRA weight names when the LoRA model type is ADALORA.


But mine is not ADALORA, so how are DeepSpeed checkpoints normally resumed correctly?
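
For reference, a way to see both key layouts side by side (a sketch that assumes model is an already-built PeftModel and a peft version that exposes get_peft_model_state_dict; behavior may differ across peft releases):

    # Sketch: the in-memory keys keep the adapter name ("default"), while the
    # state dict peft prepares for saving strips it. Assumes `model` is a PeftModel.
    from peft import get_peft_model_state_dict

    full_keys = [k for k in model.state_dict().keys() if "lora_" in k]
    saved_keys = list(get_peft_model_state_dict(model).keys())

    print(full_keys[0])   # e.g. ...self_attn.q_proj.lora_A.default.weight
    print(saved_keys[0])  # e.g. ...self_attn.q_proj.lora_A.weight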

@ArthurZucker
Collaborator

cc @younesbelkada 😉

@huggingface deleted a comment from the github-actions bot on Jun 30, 2023
@acforvs

acforvs commented Jul 17, 2023

Hey @ArthurZucker, any updates on this?

@ArthurZucker
Collaborator

Hey! I am not working on this; both @younesbelkada and @pacman100 should know more about it and whether it was fixed in the latest release!

@younesbelkada transferred this issue from huggingface/transformers on Jul 24, 2023
@kazemf78

kazemf78 commented Jul 26, 2023

I have bumped into a somewhat similar problem, and I think what I found will help you, too.

I didn't hit your exact error message above, but the error I got when attempting to resume from a LoRA model checkpoint saved by DeepSpeed was the same one stated in the issue linked below: RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM: Missing key(s) in state_dict:

Your issue is essentially the same, I guess. With PEFT and LoRA, only some of the model's layers are updated, so only those layers are written to the checkpoint. When loading, only those layers can be restored, yet DeepSpeed expects every layer to be present. The proposed solution was to change the DeepSpeedEngine load_checkpoint function so that it no longer insists on loading every layer, as described in these two linked issues and answers: 1 2.

Because the trainer only saves some of the parameters, and DeepSpeed defaults to strict=True when loading the model, which requires every weight to be present, an error occurs.
The DeepSpeed source code needs to be modified. Because DeepSpeed versions differ, the exact code location cannot be given directly.
The approximate fix is to change the default value of the load_module_strict parameter of the load_checkpoint function in deepspeed/runtime/engine.py to False.

and also this answer:

I think you have already checked #464. Here is the detailed explanation: since only the LoRA part (plus embed_tokens and lm_head) of the full model is saved to the checkpoint, we must allow missing keys when resuming training and loading that checkpoint. When using Transformers with DeepSpeed, Transformers doesn't expose any parameter for allowing missing keys (see here), so we have to modify the source code.

There are two ways to fix the problem:

  1. Modify DeepSpeed:
     Continue training based on the last saved checkpoint #464 (comment)
  2. Modify Transformers:
     Change the original resume-from-checkpoint-with-DeepSpeed code to:

          load_path, _ = deepspeed_engine.load_checkpoint(
              resume_from_checkpoint,
              load_optimizer_states=True,
              load_lr_scheduler_states=True,
              load_module_strict=False,
          )

If you find a more convenient method, you are welcome to share it with us.
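
One less invasive variant (an untested sketch, not an official API) is to wrap the engine's load_checkpoint on the instance so that load_module_strict defaults to False, instead of editing the installed sources:

    import functools

    def allow_missing_keys(deepspeed_engine):
        # Wrap DeepSpeedEngine.load_checkpoint on this instance so that
        # load_module_strict defaults to False, letting the partially saved
        # LoRA weights load without raising on missing keys.
        original = deepspeed_engine.load_checkpoint

        @functools.wraps(original)
        def load_checkpoint(*args, **kwargs):
            kwargs.setdefault("load_module_strict", False)
            return original(*args, **kwargs)

        deepspeed_engine.load_checkpoint = load_checkpoint

The awkward part is getting hold of the engine before the Trainer calls load_checkpoint, which is why the two source-level patches above are what people actually use.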

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@dachenlian

I ran into the same issue. I checked the transformers code and it seems like the suggestion hasn't been implemented yet.

@kazemf78

I have tweaked a fork of Transformers to handle this resuming issue. Did you find a relevant pull request for this in the Transformers repository, or should I create a new one?

@ccdv-ai

ccdv-ai commented Jan 29, 2024

Having the same issue with PEFT + DeepSpeed stage 1.

The @kazemf78 trick fixes the problem for me:
In deepspeed.py:

    load_path, _ = deepspeed_engine.load_checkpoint(
        resume_from_checkpoint,
        load_optimizer_states=True,
        load_lr_scheduler_states=True,
        load_module_strict=False,  # add this param
    )

@younesbelkada, @pacman100
