Description
Describe the bug
Traceback (most recent call last):
  File "/opt/swift/lib/python3.10/site-packages/swift/cli/sft.py", line 5, in <module>
    sft_main()
  File "/opt/swift/lib/python3.10/site-packages/swift/utils/run_utils.py", line 31, in x_main
    result = llm_x(args, **kwargs)
  File "/opt/swift/lib/python3.10/site-packages/swift/llm/sft.py", line 236, in llm_sft
    trainer.train(training_args.resume_from_checkpoint)
  File "/opt/swift/lib/python3.10/site-packages/swift/trainers/trainers.py", line 50, in train
    res = super().train(*args, **kwargs)
  File "/opt/swift/lib/python3.10/site-packages/transformers/trainer.py", line 1624, in train
    return inner_training_loop(
  File "/opt/swift/lib/python3.10/site-packages/transformers/trainer.py", line 1776, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/opt/swift/lib/python3.10/site-packages/accelerate/accelerator.py", line 1228, in prepare
    result = tuple(
  File "/opt/swift/lib/python3.10/site-packages/accelerate/accelerator.py", line 1229, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/opt/swift/lib/python3.10/site-packages/accelerate/accelerator.py", line 1105, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/opt/swift/lib/python3.10/site-packages/accelerate/accelerator.py", line 1356, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/opt/swift/lib/python3.10/site-packages/swift/llm/utils/utils.py", line 857, in <lambda>
    _old_ddp_init(self, model, *args, **kwargs))
  File "/opt/swift/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 697, in __init__
    self._log_and_throw(
  File "/opt/swift/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1037, in _log_and_throw
    raise err_type(err_msg)
ValueError: DistributedDataParallel's input module must be on the same type of devices, but input module parameters locate in {'cuda', 'meta'}.
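
The device-type check that raises this error can be reproduced without ms-swift. Below is a minimal Python sketch (an illustration I wrote, not taken from the swift codebase; it assumes a machine with one CUDA GPU and PyTorch installed): a layer created on the 'meta' device has shape and dtype but no storage, mimicking weights that a sharded or offloaded load never materialized, and DistributedDataParallel refuses any module whose parameters span more than one device type.

import torch
import torch.nn as nn

# Hypothetical repro: one layer materialized on the GPU, one left on 'meta',
# as happens when device_map-style loading cannot fit every weight.
cuda_layer = nn.Linear(8, 8).cuda()          # parameters on 'cuda'
meta_layer = nn.Linear(8, 8, device="meta")  # parameters on 'meta' (no storage)
mixed = nn.Sequential(cuda_layer, meta_layer)

# This is the same device-type set DDP checks before wrapping a module;
# {'cuda', 'meta'} matches the ValueError above.
print({p.device.type for p in mixed.parameters()})  # -> {'cuda', 'meta'} (order may vary)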
Your hardware and system info
V100-32G * 8
ms-swift==1.7.3
Additional context
Run command:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=2 \
swift sft \
--model_type qwen1half-72b-chat \
--sft_type lora \
--dtype AUTO \
--output_dir output \
--dataset ms-bench-mini \
--train_dataset_sample 1000 \
--num_train_epochs 3 \
--max_length 4096 \
--check_dataset_strategy warning \
--lora_target_modules ALL \
--self_cognition_sample 500 \
--model_name 小黄 'Xiao Huang' \
--model_author 魔搭 ModelScope
If NPROC_PER_NODE=1 is set, the command runs fine;
the same command also works when the model is changed to 14b;
so is this related to GPU memory?
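
The numbers are at least consistent with a memory explanation. A back-of-the-envelope sketch (my assumptions: fp16/bf16 weights at 2 bytes per parameter, visible GPUs split evenly between DDP ranks, activations and optimizer state ignored):

visible_gpus = 8                                # CUDA_VISIBLE_DEVICES=0..7
nproc_per_node = 2                              # two DDP ranks
gpus_per_rank = visible_gpus // nproc_per_node  # 4 GPUs per rank
mem_per_rank_gb = gpus_per_rank * 32            # 4 x V100-32G = 128 GB

weights_72b_gb = 72e9 * 2 / 1e9                 # ~144 GB, more than 128 GB per rank
weights_14b_gb = 14e9 * 2 / 1e9                 # ~28 GB, fits easily

# A plausible mechanism: when a rank's GPU shard cannot hold all the weights,
# part of the model stays on 'meta' (or is offloaded), and DDP then rejects
# the mixed {'cuda', 'meta'} placement. With NPROC_PER_NODE=1 the single
# rank sees all 8 GPUs (256 GB), which would explain why that case runs.
print(weights_72b_gb, weights_14b_gb, mem_per_rank_gb)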