Closed
Description
System Info
peft: 0.16.0
transformers: 4.51.3
torch: 2.6.0
swift: 3.9.0-dev
Who can help?
No response
Reproduction
# 150 visual token
export MASTER_PORT=29501
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=5,6 \
MAX_PIXELS=117600 \
swift sft \
--model /media/Qwen2.5-VL-7B-Instruct \
--dataset my_data \
--split_dataset_ratio 0.01 \
--resume_from_checkpoint /my_checkpoint/v0-20250930-174328/checkpoint-699 \
--resume_only_model True \
--ignore_data_skip True \
--train_type custom \
--external_plugins 'examples/train/multimodal/lora_llm_full_vit/custom_plugin.py' \
--torch_dtype bfloat16 \
--num_train_epochs 2 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--learning_rate 1e-5 \
--vit_lr 1e-5 \
--aligner_lr 1e-5 \
--lora_rank 16 \
--lora_alpha 32 \
--gradient_accumulation_steps 4 \
--eval_steps 50 \
--save_steps 100 \
--save_total_limit 5 \
--logging_steps 5 \
--max_length 32678 \
--output_dir output/test \
--warmup_ratio 0.05 \
--dataloader_num_workers 4 \
--dataset_num_proc 4 \
--attn_impl flash_attn \
--deepspeed zero2 \
--save_only_model true \
--report_to wandb
Running the code fails with the following error:
Traceback (most recent call last):
[rank3]: File "/media/ssd2/markyi/projects/ms-swift/swift/cli/sft.py", line 10, in <module>
[rank3]: sft_main()
[rank3]: File "/media/ssd2/markyi/projects/ms-swift/swift/llm/train/sft.py", line 338, in sft_main
[rank3]: return SwiftSft(args).main()
[rank3]: ^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/media/ssd2/markyi/projects/ms-swift/swift/llm/base.py", line 49, in main
[rank3]: result = self.run()
[rank3]: ^^^^^^^^^^
[rank3]: File "/media/ssd2/markyi/projects/ms-swift/swift/llm/train/sft.py", line 178, in run
[rank3]: self.model = self.prepare_model(self.args, self.model, template=self.template, train_dataset=train_dataset)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/media/ssd2/markyi/projects/ms-swift/swift/llm/train/tuner.py", line 342, in prepare_model
[rank3]: model = tuner.from_pretrained(model, args.resume_from_checkpoint or args.adapters[0], is_trainable=True)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/media/ssd2/markyi/projects/ms-swift/examples/train/multimodal/lora_llm_full_vit/custom_plugin.py", line 29, in from_pretrained
[rank3]: model = Swift.from_pretrained(model, model_id, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/media/ssd2/markyi/projects/ms-swift/swift/tuners/base.py", line 909, in from_pretrained
[rank3]: peft_model = load_peft_model(model, 'default')
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/media/ssd2/markyi/projects/ms-swift/swift/tuners/base.py", line 895, in load_peft_model
[rank3]: return PeftModel.from_pretrained(
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/media/ssd2/markyi/projects/ms-swift/swift/tuners/peft.py", line 357, in from_pretrained
[rank3]: return module_class.from_pretrained(model, model_id, *args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/media/ssd2/markyi/projects/ms-swift/.venv/lib/python3.12/site-packages/peft/peft_model.py", line 547, in from_pretrained
[rank3]: model = MODEL_TYPE_TO_PEFT_MODEL_MAPPING[config.task_type](
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/media/ssd2/markyi/projects/ms-swift/.venv/lib/python3.12/site-packages/peft/peft_model.py", line 1810, in __init__
[rank3]: super().__init__(model, peft_config, adapter_name, **kwargs)
[rank3]: File "/media/ssd2/markyi/projects/ms-swift/.venv/lib/python3.12/site-packages/peft/peft_model.py", line 130, in __init__
[rank3]: self.base_model = cls(model, {adapter_name: peft_config}, adapter_name)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [391/1843]
[rank0]: File "/media/ssd2/markyi/projects/ms-swift/swift/tuners/peft.py", line 304, in __new_init__
[rank0]: self.__init_origin__(model, config, adapter_name)
[rank0]: File "/media/ssd2/markyi/projects/ms-swift/.venv/lib/python3.12/site-packages/peft/tuners/lora/model.py", line 143, in __init__
[rank0]: super().__init__(model, config, adapter_name, low_cpu_mem_usage=low_cpu_mem_usage)
[rank0]: File "/media/ssd2/markyi/projects/ms-swift/.venv/lib/python3.12/site-packages/peft/tuners/tuners_utils.py", line 203, in __init__
[rank0]: self.inject_adapter(self.model, adapter_name, low_cpu_mem_usage=low_cpu_mem_usage)
[rank0]: File "/media/ssd2/markyi/projects/ms-swift/.venv/lib/python3.12/site-packages/peft/tuners/tuners_utils.py", line 550, in inject_adapter
[rank0]: raise ValueError(error_msg)
[rank0]: ValueError: Target modules ^(model.language_model.*\.(q_proj|gate_proj|k_proj|v_proj|down_proj|o_proj|up_proj))$ not found in the base model. Please check the target modules and try again.
[rank0]:[W1009 16:51:12.467035344 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W1009 16:51:14.051000 85790 .venv/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 85899 closing signal SIGTERM
W1009 16:51:14.052000 85790 .venv/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 85901 closing signal SIGTERM
W1009 16:51:14.052000 85790 .venv/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 85903 closing signal SIGTERM
E1009 16:51:14.682000 85790 .venv/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 3 (pid: 85904) of binary: /media/ssd2/markyi/projects/ms-swift/.venv/bin/python3
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/media/ssd2/markyi/projects/ms-swift/.venv/lib/python3.12/site-packages/torch/distributed/run.py", line 922, in <module>
main()
File "/media/ssd2/markyi/projects/ms-swift/.venv/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/media/ssd2/markyi/projects/ms-swift/.venv/lib/python3.12/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/media/ssd2/markyi/projects/ms-swift/.venv/lib/python3.12/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/media/ssd2/markyi/projects/ms-swift/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/ssd2/markyi/projects/ms-swift/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
Expected behavior
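With an older version, the same shell script started normally. After I pulled the latest code last week and ran pip install -e ., this error started to appear.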
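A minimal sketch for narrowing down the ValueError above. It assumes PEFT checks a string-valued target_modules against each module name with re.fullmatch. The module names in both lists are hypothetical examples of two possible naming layouts and were not read from the actual Qwen2.5-VL model. One possible explanation, not confirmed here, is that the regex stored with the checkpoint expects a model.language_model prefix that the loaded base model does not have.

```python
import re

# Regex copied verbatim from the ValueError in the traceback.
TARGET_REGEX = (
    r'^(model.language_model.*\.'
    r'(q_proj|gate_proj|k_proj|v_proj|down_proj|o_proj|up_proj))$'
)


def matching_modules(names, pattern=TARGET_REGEX):
    """Return the module names that the pattern matches in full."""
    return [name for name in names if re.fullmatch(pattern, name)]


# Hypothetical names (assumption): a layout without a `language_model` prefix.
flat_layout = [
    'model.layers.0.self_attn.q_proj',
    'visual.blocks.0.attn.qkv',
]

# Hypothetical names (assumption): a layout with the prefix the regex expects.
nested_layout = [
    'model.language_model.layers.0.self_attn.q_proj',
    'model.visual.blocks.0.attn.qkv',
]

print(matching_modules(flat_layout))    # []
print(matching_modules(nested_layout))  # ['model.language_model.layers.0.self_attn.q_proj']
```

To check a real model, pass [name for name, _ in model.named_modules()] instead of the hypothetical lists. An empty result would be consistent with the "not found in the base model" error.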