
[Usage] finetune_task_lora.sh checkpoints usage #1423

Open
leechangdong opened this issue Apr 18, 2024 · 5 comments

@leechangdong

leechangdong commented Apr 18, 2024

Describe the issue

Issue:

Train args:
finetune_task_lora.sh

deepspeed llava/train/train_mem.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path liuhaotian/llava-v1.5-7b \
    --version v1 \
    --data_path ./playground/data/epoch_test.json\
    --image_folder ./playground/data \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ./checkpoints/epoch_test \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 0.05 \
    --save_total_limit 3 \
    --learning_rate 2e-6 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb
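A note on the `--save_steps 0.05` flag: in recent transformers versions, a float in [0, 1) is interpreted as a fraction of total training steps rather than an absolute step count, so the actual checkpoint interval depends on dataset size and GPU count. A minimal sketch of that arithmetic (the dataset size and world size below are made-up illustrations, not values from this issue):

```python
import math

# Made-up numbers for illustration; substitute your own dataset size / GPU count.
num_examples = 10_000   # size of epoch_test.json (hypothetical)
per_device_batch = 8    # --per_device_train_batch_size
grad_accum = 1          # --gradient_accumulation_steps
world_size = 4          # number of GPUs (hypothetical)
epochs = 1              # --num_train_epochs

steps_per_epoch = math.ceil(num_examples / (per_device_batch * grad_accum * world_size))
total_steps = steps_per_epoch * epochs

# A float --save_steps in [0, 1) is treated as a ratio of total steps.
save_steps_ratio = 0.05
save_interval = math.ceil(total_steps * save_steps_ratio)
print(total_steps, save_interval)  # → 313 16
```

With `--save_total_limit 3`, only the three most recent of those checkpoints are kept on disk.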

With this training setup, I fine-tuned on my custom data, and checkpoints were saved at each save step.

[screenshot: checkpoint-* folder contents, including global_step* subfolders]

The contents of those checkpoint folders are shown in the screenshot above. To use these checkpoints for inference, I believe I need to merge the LoRA weights with "scripts/merge_lora_weights.py", so I did.

command:

python scripts/merge_lora_weights.py \
--model-path ./checkpoints/llava-v1.5-7b-task-lora-13/checkpoint-3-lora \
--model-base liuhaotian/llava-v1.5-7b \
--save-model-path ./checkpoints/llava-v1.5-7b-task-lora-13/merged/llava-v1.5-7b-epoch-test

But I get this error:

ValueError: Unrecognized configuration class <class 'transformers.models.llava.configuration_llava.LlavaConfig'> for this kind of AutoModel: AutoModelForCausalLM.
Model type should be one of BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BloomConfig, CamembertConfig, LlamaConfig, CodeGenConfig, CpmAntConfig, CTRLConfig, Data2VecTextConfig, ElectraConfig, ErnieConfig, FalconConfig, FuyuConfig, GitConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, LlamaConfig, MarianConfig, MBartConfig, MegaConfig, MegatronBertConfig, MistralConfig, MixtralConfig, MptConfig, MusicgenConfig, MvpConfig, OpenLlamaConfig, OpenAIGPTConfig, OPTConfig, PegasusConfig, PersimmonConfig, PhiConfig, PLBartConfig, ProphetNetConfig, QDQBertConfig, Qwen2Config, ReformerConfig, RemBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, Speech2Text2Config, TransfoXLConfig, TrOCRConfig, WhisperConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig, LlavaConfig, LlavaMptConfig, LlavaMistralConfig.

My purpose is to use the checkpoints I saved while training "finetune_task_lora.sh" for inference.
Does anyone know if what I did above is correct and how to fix this error?

  • my transformers version: transformers==4.37.2
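For background on what the merge step does (this is just the LoRA arithmetic, not the repo's actual implementation): merging folds the low-rank update into the base weight as W_merged = W + (alpha / r) * B @ A. With the flags above (`--lora_r 128 --lora_alpha 256`) the scaling factor is 256/128 = 2.0. A toy pure-Python sketch with made-up numbers:

```python
# LoRA merge arithmetic: W_merged = W + (alpha / r) * B @ A.
alpha, r = 256, 128
scale = alpha / r          # 2.0 with the flags used above

# Toy 2x2 weight with a rank-1 update (all numbers made up):
W = [[1.0, 0.0],
     [0.0, 1.0]]           # base weight
B = [[0.5], [0.25]]        # (out_dim, r) factor
A = [[0.1, 0.2]]           # (r, in_dim) factor

delta = [[scale * sum(B[i][k] * A[k][j] for k in range(len(A)))
          for j in range(len(W[0]))] for i in range(len(W))]
W_merged = [[W[i][j] + delta[i][j] for j in range(len(W[0]))] for i in range(len(W))]
print(W_merged)  # → [[1.1, 0.2], [0.05, 1.1]]
```

As for the ValueError itself: one commonly reported cause is that transformers >= 4.36 ships its own "llava" model type, so AutoModelForCausalLM resolves the checkpoint's config to the built-in LlavaConfig instead of this repo's classes; the replies below pointing to #844 discuss loading through the repo's own builder instead.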
@wuwu-C

wuwu-C commented Apr 20, 2024

I also ran into a problem when trying to use merge_lora_weights.

`OSError: /data/user4/cww/LLaVA/checkpoint/LLaVA.finetune/geochat-LLAVA15-7B-Vicuna-finetune_lora/checkpoint-2234 does not appear to have a file named config.json. Checkout 'https://huggingface.co//data/user4/cww/LLaVA/checkpoint/LLaVA.finetune/geochat-LLAVA15-7B-Vicuna-finetune_lora/checkpoint-2234/main' for available files.
`

Did you solve it?
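That OSError usually means the intermediate checkpoint directory contains only the adapter/trainer files and no config.json. One workaround (an assumption based on the linked issues, not an official fix) is to copy a config.json from the final adapter output directory, or from the base model, into the checkpoint folder. A minimal self-contained sketch with hypothetical paths:

```python
import json
import shutil
import tempfile
from pathlib import Path

def add_missing_config(checkpoint_dir: str, source_config: str) -> Path:
    """Copy a config.json into a checkpoint dir that lacks one."""
    dst = Path(checkpoint_dir) / "config.json"
    if not dst.exists():
        shutil.copy(source_config, dst)
    return dst

# Tiny self-contained demo with temporary directories (stand-ins for the
# real base-model and checkpoint paths):
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "base_config.json"
    src.write_text(json.dumps({"model_type": "llava"}))
    ckpt = Path(tmp) / "checkpoint-2234"
    ckpt.mkdir()
    cfg = add_missing_config(str(ckpt), str(src))
    print(json.loads(cfg.read_text())["model_type"])  # → llava
```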

@leechangdong
Author

#844

Check this out

@CHENGY12

CHENGY12 commented Apr 26, 2024

Hi leechangdong, did you modify the code beyond #844? Currently I changed the trainer as in #844, but I only get
[screenshot: checkpoint folder without global_step* subfolders]

I don't have the gradient information, like the "global_step*" folders you show.

@wuwu-C

wuwu-C commented Apr 28, 2024

@leechangdong Thank you! But I have another problem. I modified my code following #844, but it does not save the files as expected. When I tried to investigate the error, I found that no "print" or "rank0_print" call produces any output; it seems my local train.py/LLaVATrainer.py is not being executed at all.

@user074

user074 commented May 1, 2024

Not for the task script, but for plain LoRA: I followed #729 to save it, and was able to load the LoRA weights after following #1200.
