qwen2.5 vl: error when loading a checkpoint to resume training #6064

@Yimi81

Description

System Info

peft: 0.16.0
transformers: 4.51.3
torch: 2.6.0
swift: 3.9.0-dev

Who can help?

No response

Reproduction

# 150 visual token
export MASTER_PORT=29501
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=5,6 \
MAX_PIXELS=117600 \
swift sft \
    --model /media/Qwen2.5-VL-7B-Instruct \
    --dataset my_data \
    --split_dataset_ratio 0.01 \
    --resume_from_checkpoint  /my_checkpoint/v0-20250930-174328/checkpoint-699 \
    --resume_only_model True \
    --ignore_data_skip True \
    --train_type custom \
    --external_plugins 'examples/train/multimodal/lora_llm_full_vit/custom_plugin.py' \
    --torch_dtype bfloat16 \
    --num_train_epochs 2 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-5 \
    --vit_lr 1e-5 \
    --aligner_lr 1e-5 \
    --lora_rank 16 \
    --lora_alpha 32 \
    --gradient_accumulation_steps 4 \
    --eval_steps 50 \
    --save_steps 100 \
    --save_total_limit 5 \
    --logging_steps 5 \
    --max_length 32678 \
    --output_dir output/test \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 4 \
    --attn_impl flash_attn \
    --deepspeed zero2 \
    --save_only_model true \
    --report_to wandb
Running the command above fails with the following error:
Traceback (most recent call last):
[rank3]:   File "/media/ssd2/markyi/projects/ms-swift/swift/cli/sft.py", line 10, in <module>
[rank3]:     sft_main()                                                                                                                           
[rank3]:   File "/media/ssd2/markyi/projects/ms-swift/swift/llm/train/sft.py", line 338, in sft_main                                              
[rank3]:     return SwiftSft(args).main()                                                                                                         
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^                                                                                                         
[rank3]:   File "/media/ssd2/markyi/projects/ms-swift/swift/llm/base.py", line 49, in main                                                        
[rank3]:     result = self.run()                                                                                                                  
[rank3]:              ^^^^^^^^^^                                                                                                                  
[rank3]:   File "/media/ssd2/markyi/projects/ms-swift/swift/llm/train/sft.py", line 178, in run                                                   
[rank3]:     self.model = self.prepare_model(self.args, self.model, template=self.template, train_dataset=train_dataset)                          
[rank3]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                          
[rank3]:   File "/media/ssd2/markyi/projects/ms-swift/swift/llm/train/tuner.py", line 342, in prepare_model                                       
[rank3]:     model = tuner.from_pretrained(model, args.resume_from_checkpoint or args.adapters[0], is_trainable=True)                             
[rank3]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                           
[rank3]:   File "/media/ssd2/markyi/projects/ms-swift/examples/train/multimodal/lora_llm_full_vit/custom_plugin.py", line 29, in from_pretrained  
[rank3]:     model = Swift.from_pretrained(model, model_id, **kwargs)                                                                             
[rank3]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                             
[rank3]:   File "/media/ssd2/markyi/projects/ms-swift/swift/tuners/base.py", line 909, in from_pretrained                                         
[rank3]:     peft_model = load_peft_model(model, 'default')                                                                                       
[rank3]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                       
[rank3]:   File "/media/ssd2/markyi/projects/ms-swift/swift/tuners/base.py", line 895, in load_peft_model                                         
[rank3]:     return PeftModel.from_pretrained(                                                                                                    
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                    
[rank3]:   File "/media/ssd2/markyi/projects/ms-swift/swift/tuners/peft.py", line 357, in from_pretrained                                         
[rank3]:     return module_class.from_pretrained(model, model_id, *args, **kwargs)                                                                
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                
[rank3]:   File "/media/ssd2/markyi/projects/ms-swift/.venv/lib/python3.12/site-packages/peft/peft_model.py", line 547, in from_pretrained        
[rank3]:     model = MODEL_TYPE_TO_PEFT_MODEL_MAPPING[config.task_type](                                                                          
[rank3]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                          
[rank3]:   File "/media/ssd2/markyi/projects/ms-swift/.venv/lib/python3.12/site-packages/peft/peft_model.py", line 1810, in __init__              
[rank3]:     super().__init__(model, peft_config, adapter_name, **kwargs)                                                                         
[rank3]:   File "/media/ssd2/markyi/projects/ms-swift/.venv/lib/python3.12/site-packages/peft/peft_model.py", line 130, in __init__               
[rank3]:     self.base_model = cls(model, {adapter_name: peft_config}, adapter_name)           
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                    [391/1843]
[rank0]:   File "/media/ssd2/markyi/projects/ms-swift/swift/tuners/peft.py", line 304, in __new_init__                                            
[rank0]:     self.__init_origin__(model, config, adapter_name)                                                                                    
[rank0]:   File "/media/ssd2/markyi/projects/ms-swift/.venv/lib/python3.12/site-packages/peft/tuners/lora/model.py", line 143, in __init__
[rank0]:     super().__init__(model, config, adapter_name, low_cpu_mem_usage=low_cpu_mem_usage)
[rank0]:   File "/media/ssd2/markyi/projects/ms-swift/.venv/lib/python3.12/site-packages/peft/tuners/tuners_utils.py", line 203, in __init__      
[rank0]:     self.inject_adapter(self.model, adapter_name, low_cpu_mem_usage=low_cpu_mem_usage)                                                   
[rank0]:   File "/media/ssd2/markyi/projects/ms-swift/.venv/lib/python3.12/site-packages/peft/tuners/tuners_utils.py", line 550, in inject_adapter
[rank0]:     raise ValueError(error_msg)                                                                                                          
[rank0]: ValueError: Target modules ^(model.language_model.*\.(q_proj|gate_proj|k_proj|v_proj|down_proj|o_proj|up_proj))$ not found in the base model. Please check the target modules and try again.
[rank0]:[W1009 16:51:12.467035344 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W1009 16:51:14.051000 85790 .venv/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 85899 closing signal SIGTERM
W1009 16:51:14.052000 85790 .venv/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 85901 closing signal SIGTERM
W1009 16:51:14.052000 85790 .venv/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 85903 closing signal SIGTERM
E1009 16:51:14.682000 85790 .venv/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 3 (pid: 85904) of binary: /media/ssd2/markyi/projects/ms-swift/.venv/bin/python3
Traceback (most recent call last):                                                                                                                
  File "<frozen runpy>", line 198, in _run_module_as_main                                                                                         
  File "<frozen runpy>", line 88, in _run_code                                                                                                    
  File "/media/ssd2/markyi/projects/ms-swift/.venv/lib/python3.12/site-packages/torch/distributed/run.py", line 922, in <module>
    main()                                                                                                                                        
  File "/media/ssd2/markyi/projects/ms-swift/.venv/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)                                                                                                                     
           ^^^^^^^^^^^^^^^^^^                                                                                                                     
  File "/media/ssd2/markyi/projects/ms-swift/.venv/lib/python3.12/site-packages/torch/distributed/run.py", line 918, in main                    
    run(args)                                                                                                                                     
  File "/media/ssd2/markyi/projects/ms-swift/.venv/lib/python3.12/site-packages/torch/distributed/run.py", line 909, in run                       
    elastic_launch(                                                                                                                               
  File "/media/ssd2/markyi/projects/ms-swift/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in __call__         
    return launch_agent(self._config, self._entrypoint, list(args))                                                                               
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                               
  File "/media/ssd2/markyi/projects/ms-swift/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent     
    raise ChildFailedError(                                                                                                                       
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:                                                                                
============================================================   

Expected behavior

With an older version, the same shell script started without problems. The error began after I pulled the latest code last week and ran pip install -e .
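
To help narrow down the ValueError in the traceback, here is a minimal sketch that checks the reported target-module pattern against two module-name layouts. It assumes that PEFT compares a string-valued target_modules against each module name with a full regular-expression match. The module names in both lists are hypothetical and only illustrate a layout with and without a model.language_model prefix; the real names should be read from named_modules() on the loaded base model. If the installed transformers version exposes the language model without that prefix, the pattern would match nothing, which would be consistent with the error.

```python
import re

# Pattern copied verbatim from the ValueError in the traceback.
TARGET_PATTERN = (
    r"^(model.language_model.*\.(q_proj|gate_proj|k_proj|v_proj|down_proj|o_proj|up_proj))$"
)

# Hypothetical module names, for illustration only. List the real ones with
# `[name for name, _ in model.named_modules()]` on the loaded base model.
flat_layout = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.gate_proj",
    "visual.blocks.0.attn.qkv",
]
nested_layout = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.mlp.gate_proj",
    "model.visual.blocks.0.attn.qkv",
]


def matched_modules(pattern, names):
    """Return the names that the pattern matches in full (re.fullmatch)."""
    return [name for name in names if re.fullmatch(pattern, name)]


# No flat-layout name starts with "model.language_model", so nothing matches.
print(matched_modules(TARGET_PATTERN, flat_layout))
# In the nested layout, only the two language-model projections match.
print(matched_modules(TARGET_PATTERN, nested_layout))
```

Running the same function over the real module names of the base model, under the installed transformers version, shows whether the prefix in the pattern exists there.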
