
[Usage] resume_from_checkpoint fails when finetuning in the lora settings #1200

zengxingchen opened this issue Feb 29, 2024 · 11 comments

@zengxingchen

Describe the issue

I think the code is trying to resume_from_checkpoint as if it were a full-parameter fine-tuning checkpoint.
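
For context, the resume path in llava/train/train.py looks roughly like this (paraphrased, not an exact copy of the repo code; trainer and training_args come from the surrounding training script):

    # Roughly the resume entry point in LLaVA's training script (paraphrased).
    # The resume branch below is the one that fails for LoRA checkpoints.
    import pathlib

    if list(pathlib.Path(training_args.output_dir).glob("checkpoint-*")):
        trainer.train(resume_from_checkpoint=True)
    else:
        trainer.train()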

@zengxingchen zengxingchen changed the title [Usage] resume_from_checkpoint always failed when lora finetuning [Usage] resume_from_checkpoint fails when finetuning in the lora settings Feb 29, 2024
@zengxingchen

(Screenshot attached, dated 2024-02-29, showing the error from the failed resume.)

@CynthiaChuang

I have the same issue. Can anyone tell me how to fix it?

@qingyuanxingsi

+1

@sunhm15

sunhm15 commented Apr 1, 2024

+1

@davidhalladay

I encountered this error while resuming a checkpoint from LoRA training. I found that it is basically due to the old version of Transformers that LLaVA uses. Please refer to this issue: huggingface/peft#746.

In that issue, the key names in checkpoints saved via DeepSpeed do not match those saved via Transformers: an extra ".default." is inserted into each key of the non-trainable parameters, which causes errors when loading the checkpoint.

Here is a solution that I found. I have only tested it for LoRA training, where it works well; I haven't tested it with other features, so it may introduce further errors:

  1. This "mismatch" problem has been fixed in newer Transformers releases, so update the Transformers package:
    pip install transformers==4.39.3
  2. Then update Accelerate to a version matching that Transformers release:
    pip install accelerate==0.27.2

Again, this has only worked for me with LoRA training so far; I'm not sure whether it will introduce other errors.
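
For reference, here is a minimal sketch (not part of the original comment) of the kind of key renaming the ".default." mismatch implies, assuming the mismatched weights sit in an ordinary PyTorch state dict. The path is a placeholder, DeepSpeed ZeRO checkpoints may store weights in a different layout, and depending on which side carries the extra segment you may need the opposite replacement:

    # Hedged sketch: rename state-dict keys that carry an extra ".default."
    # segment so they match what the loader expects. The path is a placeholder.
    import torch

    ckpt_path = "checkpoints/llava-lora/checkpoint-1000/pytorch_model.bin"  # placeholder
    state_dict = torch.load(ckpt_path, map_location="cpu")
    fixed = {k.replace(".default.", "."): v for k, v in state_dict.items()}
    torch.save(fixed, ckpt_path)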

@STARRY2001

STARRY2001 commented Apr 25, 2024

(Quotes @davidhalladay's solution above in full.)

But I ran into some problems when running pip; how can I solve this? (Screenshot of the pip output attached.)

@Linjyan00

(Quotes @STARRY2001's comment above, including the question about the pip output.)

Just ignore it.

@davidhalladay

davidhalladay commented Apr 26, 2024

On my end, this compatibility issue only causes errors during testing. Therefore, I maintain two separate conda environments: one for training (with transformers==4.39.3) and one for testing (with transformers==4.37.1). While this setup may seem redundant, it offers a quick solution to address the problem.

@user074

user074 commented Apr 30, 2024

(Quotes @davidhalladay's solution above in full.)

Thanks! This solved my issue. I had been having trouble saving and loading the LoRA checkpoints for a while.

@wenyisir

wenyisir commented May 9, 2024

I fixed this bug by modifying site-packages/deepspeed/runtime/engine.py (around line 2675) to pass load_module_strict=False.
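
A related, hedged sketch (not the exact change above): when driving DeepSpeed directly rather than through the HF Trainer, load_checkpoint() already exposes load_module_strict as a keyword, so strict key matching can be relaxed at the call site instead of patching site-packages. The model, config, and paths below are toy placeholders:

    # Minimal sketch using DeepSpeed's public API; inside the HF Trainer the
    # equivalent call is internal, which is why the site-packages edit is needed there.
    import torch
    import deepspeed

    model = torch.nn.Linear(8, 8)  # placeholder for the LoRA-wrapped LLaVA model
    ds_config = {
        "train_batch_size": 1,
        "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    }
    engine, _, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )
    engine.load_checkpoint(
        "checkpoints/llava-lora/checkpoint-1000",  # placeholder checkpoint dir
        load_module_strict=False,                  # tolerate mismatched module keys
    )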

@tetsu-kikuchi

tetsu-kikuchi commented May 9, 2024

I am afraid that non_lora_trainables.bin will not be loaded just by setting trainer.train(resume_from_checkpoint=True), because non_lora_trainables.bin is a filename specific to LLaVA and outside the scope of Hugging Face.
Could anyone clarify this point?

Added: it seems that non_lora_trainables.bin is not even saved at intermediate saving steps (every args.save_steps iterations); it is saved only when the whole training schedule has finished. In any case, I am afraid that non_lora_trainables.bin will not be loaded through Hugging Face APIs, including via other approaches such as the one in #1027.

Maybe we have to insert code to load non_lora_trainables.bin in llava/train/train.py, just as is done, for example, in llava/eval/model_vqa.py. I would appreciate comments if I am misunderstanding.
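
Along those lines, here is a hedged sketch of manually restoring non_lora_trainables.bin, modeled loosely on the eval-side loading that llava/eval/model_vqa.py goes through in llava/model/builder.py. The checkpoint directory and the model object are placeholders, and the prefix handling may need adjusting to your checkpoint:

    # Hedged sketch: load the LLaVA-specific non-LoRA weights (e.g. the
    # mm_projector) on top of the model, with strict=False so other missing
    # keys are tolerated. `model` is assumed to be the LoRA-wrapped model.
    import os
    import torch

    ckpt_dir = "checkpoints/llava-lora"  # placeholder: directory holding non_lora_trainables.bin
    weights = torch.load(os.path.join(ckpt_dir, "non_lora_trainables.bin"), map_location="cpu")
    # Strip the wrapper prefix that PEFT adds when saving (adjust if your keys differ).
    weights = {(k[len("base_model.model."):] if k.startswith("base_model.model.") else k): v
               for k, v in weights.items()}
    model.load_state_dict(weights, strict=False)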
