
[Usage] finetune_task_lora.sh checkpoints usage #1423

Open
leechangdong opened this issue Apr 18, 2024 · 5 comments

@leechangdong

leechangdong commented Apr 18, 2024

Describe the issue

Issue:

Train args:
finetune_task_lora.sh

deepspeed llava/train/train_mem.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path liuhaotian/llava-v1.5-7b \
    --version v1 \
    --data_path ./playground/data/epoch_test.json\
    --image_folder ./playground/data \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ./checkpoints/epoch_test \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 0.05 \
    --save_total_limit 3 \
    --learning_rate 2e-6 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb
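A note on the `--save_steps 0.05` flag: in recent transformers versions, a float in [0, 1) is interpreted as a fraction of total training steps rather than an absolute step count, so the actual checkpoint interval depends on dataset size and GPU count. A minimal sketch of that arithmetic (the dataset size and world size below are made-up illustrations, not values from this issue):

```python
import math

# Made-up numbers for illustration; substitute your own dataset size / GPU count.
num_examples = 10_000   # size of epoch_test.json (hypothetical)
per_device_batch = 8    # --per_device_train_batch_size
grad_accum = 1          # --gradient_accumulation_steps
world_size = 4          # number of GPUs (hypothetical)
epochs = 1              # --num_train_epochs

steps_per_epoch = math.ceil(num_examples / (per_device_batch * grad_accum * world_size))
total_steps = steps_per_epoch * epochs

# A float --save_steps in [0, 1) is treated as a ratio of total steps.
save_steps_ratio = 0.05
save_interval = math.ceil(total_steps * save_steps_ratio)
print(total_steps, save_interval)  # → 313 16
```

With `--save_total_limit 3`, only the three most recent of those checkpoints are kept on disk.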

With this training setup, I fine-tuned on my custom data, and checkpoints were saved at each save step.

[screenshot: checkpoint-* folder contents, including global_step* subfolders]

The contents of those checkpoint folders are shown in the screenshot above. To use these checkpoints for inference, I believe I need to merge the LoRA weights with "scripts/merge_lora_weights.py", so I did.

command:

python scripts/merge_lora_weights.py \
--model-path ./checkpoints/llava-v1.5-7b-task-lora-13/checkpoint-3-lora \
--model-base liuhaotian/llava-v1.5-7b \
--save-model-path ./checkpoints/llava-v1.5-7b-task-lora-13/merged/llava-v1.5-7b-epoch-test

But I get this error:

ValueError: Unrecognized configuration class <class 'transformers.models.llava.configuration_llava.LlavaConfig'> for this kind of AutoModel: AutoModelForCausalLM.
Model type should be one of BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BloomConfig, CamembertConfig, LlamaConfig, CodeGenConfig, CpmAntConfig, CTRLConfig, Data2VecTextConfig, ElectraConfig, ErnieConfig, FalconConfig, FuyuConfig, GitConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, LlamaConfig, MarianConfig, MBartConfig, MegaConfig, MegatronBertConfig, MistralConfig, MixtralConfig, MptConfig, MusicgenConfig, MvpConfig, OpenLlamaConfig, OpenAIGPTConfig, OPTConfig, PegasusConfig, PersimmonConfig, PhiConfig, PLBartConfig, ProphetNetConfig, QDQBertConfig, Qwen2Config, ReformerConfig, RemBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, Speech2Text2Config, TransfoXLConfig, TrOCRConfig, WhisperConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig, LlavaConfig, LlavaMptConfig, LlavaMistralConfig.

My purpose is to use the checkpoints I saved while training "finetune_task_lora.sh" for inference.
Does anyone know if what I did above is correct and how to fix this error?

  • my transformers version: transformers==4.37.2
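For background on what the merge step does (this is just the LoRA arithmetic, not the repo's actual implementation): merging folds the low-rank update into the base weight as W_merged = W + (alpha / r) * B @ A. With the flags above (`--lora_r 128 --lora_alpha 256`) the scaling factor is 256/128 = 2.0. A toy pure-Python sketch with made-up numbers:

```python
# LoRA merge arithmetic: W_merged = W + (alpha / r) * B @ A.
alpha, r = 256, 128
scale = alpha / r          # 2.0 with the flags used above

# Toy 2x2 weight with a rank-1 update (all numbers made up):
W = [[1.0, 0.0],
     [0.0, 1.0]]           # base weight
B = [[0.5], [0.25]]        # (out_dim, r) factor
A = [[0.1, 0.2]]           # (r, in_dim) factor

delta = [[scale * sum(B[i][k] * A[k][j] for k in range(len(A)))
          for j in range(len(W[0]))] for i in range(len(W))]
W_merged = [[W[i][j] + delta[i][j] for j in range(len(W[0]))] for i in range(len(W))]
print(W_merged)  # → [[1.1, 0.2], [0.05, 1.1]]
```

As for the ValueError itself: one commonly reported cause is that transformers >= 4.36 ships its own "llava" model type, so AutoModelForCausalLM resolves the checkpoint's config to the built-in LlavaConfig instead of this repo's classes; the replies below pointing to #844 discuss loading through the repo's own builder instead.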
@wuwu-C

wuwu-C commented Apr 20, 2024

I also ran into a problem when trying to use merge_lora_weights.

`OSError: /data/user4/cww/LLaVA/checkpoint/LLaVA.finetune/geochat-LLAVA15-7B-Vicuna-finetune_lora/checkpoint-2234 does not appear to have a file named config.json. Checkout 'https://huggingface.co//data/user4/cww/LLaVA/checkpoint/LLaVA.finetune/geochat-LLAVA15-7B-Vicuna-finetune_lora/checkpoint-2234/main' for available files.
`

Did you solve it?
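That OSError usually means the intermediate checkpoint directory contains only the adapter/trainer files and no config.json. One workaround (an assumption based on the linked issues, not an official fix) is to copy a config.json from the final adapter output directory, or from the base model, into the checkpoint folder. A minimal self-contained sketch with hypothetical paths:

```python
import json
import shutil
import tempfile
from pathlib import Path

def add_missing_config(checkpoint_dir: str, source_config: str) -> Path:
    """Copy a config.json into a checkpoint dir that lacks one."""
    dst = Path(checkpoint_dir) / "config.json"
    if not dst.exists():
        shutil.copy(source_config, dst)
    return dst

# Tiny self-contained demo with temporary directories (stand-ins for the
# real base-model and checkpoint paths):
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "base_config.json"
    src.write_text(json.dumps({"model_type": "llava"}))
    ckpt = Path(tmp) / "checkpoint-2234"
    ckpt.mkdir()
    cfg = add_missing_config(str(ckpt), str(src))
    print(json.loads(cfg.read_text())["model_type"])  # → llava
```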

@leechangdong
Author

#844

Check this out

@CHENGY12

CHENGY12 commented Apr 26, 2024

Hi leechangdong, did you modify the code beyond #844? Currently I changed the trainer as in #844, but I only get
[screenshot: checkpoint folder without global_step* subfolders]

I don't have the gradient information, like the "global_step*" folders you show.

@wuwu-C

wuwu-C commented Apr 28, 2024

@leechangdong Thank you! But I have another problem. I modified my code following #844, but it does not save the files as expected. When I tried to investigate the error, I found that no "print" or "rank0_print" call produces any output; it seems my local train.py/LLaVATrainer.py is not being executed at all.

@user074

user074 commented May 1, 2024

Not for the task script, but for plain LoRA: I followed #729 to save it, and was able to load the LoRA weights after following #1200.
