
qwen1.5-72B-Chat DDP+MP fine-tuning raises "input module parameters locate in {'cuda', 'meta'}" #634

Description

@kratorado

Describe the bug

Traceback (most recent call last):
  File "/opt/swift/lib/python3.10/site-packages/swift/cli/sft.py", line 5, in <module>
    sft_main()
  File "/opt/swift/lib/python3.10/site-packages/swift/utils/run_utils.py", line 31, in x_main
    result = llm_x(args, **kwargs)
  File "/opt/swift/lib/python3.10/site-packages/swift/llm/sft.py", line 236, in llm_sft
    trainer.train(training_args.resume_from_checkpoint)
  File "/opt/swift/lib/python3.10/site-packages/swift/trainers/trainers.py", line 50, in train
    res = super().train(*args, **kwargs)
  File "/opt/swift/lib/python3.10/site-packages/transformers/trainer.py", line 1624, in train
    return inner_training_loop(
  File "/opt/swift/lib/python3.10/site-packages/transformers/trainer.py", line 1776, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/opt/swift/lib/python3.10/site-packages/accelerate/accelerator.py", line 1228, in prepare
    result = tuple(
  File "/opt/swift/lib/python3.10/site-packages/accelerate/accelerator.py", line 1229, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/opt/swift/lib/python3.10/site-packages/accelerate/accelerator.py", line 1105, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/opt/swift/lib/python3.10/site-packages/accelerate/accelerator.py", line 1356, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/opt/swift/lib/python3.10/site-packages/swift/llm/utils/utils.py", line 857, in <lambda>
    _old_ddp_init(self, model, *args, **kwargs))
  File "/opt/swift/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 697, in __init__
    self._log_and_throw(
  File "/opt/swift/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1037, in _log_and_throw
    raise err_type(err_msg)
ValueError: DistributedDataParallel's input module must be on the same type of devices, but input module parameters locate in {'cuda', 'meta'}.
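For reference, the check that raises this error sits in DistributedDataParallel.__init__ (the distributed.py frame in the traceback): it collects the device types of all module parameters and refuses to wrap a module whose parameters span more than one device type. Below is a minimal single-process sketch of that check (my own illustration, not ms-swift code), which triggers the same ValueError using the gloo backend and a cpu/meta mix instead of cuda/meta:

import os
import torch
import torch.distributed as dist
import torch.nn as nn

# Single-process process group so that DDP can be constructed at all.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Linear(4, 4)
# Simulate a weight that was never materialized from the meta device, as can
# happen when part of a device_map-loaded model is not moved to a real device.
model.weight = nn.Parameter(torch.empty(4, 4, device="meta"))

# Raises: ValueError: DistributedDataParallel's input module must be on the
# same type of devices, but input module parameters locate in {'cpu', 'meta'}.
ddp = nn.parallel.DistributedDataParallel(model)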

Your hardware and system info
V100-32G * 8
ms-swift==1.7.3

Additional context
Command used:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=2 \
swift sft \
    --model_type qwen1half-72b-chat \
    --sft_type lora \
    --dtype AUTO \
    --output_dir output \
    --dataset ms-bench-mini \
    --train_dataset_sample 1000 \
    --num_train_epochs 3 \
    --max_length 4096 \
    --check_dataset_strategy warning \
    --lora_target_modules ALL \
    --self_cognition_sample 500 \
    --model_name 小黄 'Xiao Huang' \
    --model_author 魔搭 ModelScope

With NPROC_PER_NODE=1 the command runs fine.
The same command also works when switched to the 14b model.
So is this related to GPU memory?
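A quick way to confirm whether some weights were left on the meta device (i.e. never materialized) is to count parameter device types on the loaded model right before the DDP wrap. This is only an illustrative sketch; parameter_device_types and the trainer.model usage are hypothetical, not ms-swift APIs:

import torch

def parameter_device_types(model: torch.nn.Module) -> dict:
    """Count parameters per device type, e.g. {'cuda': 480, 'meta': 3}."""
    counts: dict = {}
    for p in model.parameters():
        counts[p.device.type] = counts.get(p.device.type, 0) + 1
    return counts

# Hypothetical usage: inspect the model before accelerate wraps it in DDP.
# print(parameter_device_types(trainer.model))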
