Training speed with DeepSpeed multi-node multi-GPU training #1551

sun1092469590 · 2023-11-17T07:52:32Z

Reminder

I have read the README and searched the existing issues.

Reproduction

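Hello, I am fine-tuning the baichuan2-13B-Chat model in multi-node multi-GPU mode on two A800 machines with 2x80G each, but the training speed looks odd. With identical hyperparameters, training code, training data, and model, LoRA fine-tuning on one A800 machine (2x80G) shows an estimated training time of 143.5 hours, whereas the same LoRA fine-tuning on both machines (4x80G in total) shows an estimated training time of 282 hours, nearly double. Training should get faster with more GPUs, shouldn't it? What might be causing this?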

Multi-node multi-GPU training command:
deepspeed --hostfile=hostfile --include="gpu-001:0,1@gpu-002:0,1" src/train_bash.py \
    --stage sft \
    --model_name_or_path /home/Baichuan2-13B-Chat/ \
    --do_train \
    --dataset alpaca_gpt4_zh \
    --template baichuan2 \
    --finetuning_type lora \
    --lora_target W_pack \
    --output_dir /home/path_to_sft_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --bf16 \
    --deepspeed ds_config_zero3.json

Single-node training command:
deepspeed --include localhost:0,1 src/train_bash.py \
    --stage sft \
    --model_name_or_path /home/Baichuan2-13B-Chat/ \
    --do_train \
    --dataset alpaca_gpt4_zh \
    --template baichuan2 \
    --finetuning_type lora \
    --lora_target W_pack \
    --output_dir /home/path_to_sft_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --bf16 \
    --deepspeed ds_config_zero3.json
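
The two estimates are easier to compare per optimizer step than per run. A rough sketch, assuming both runs use the same dataset and epoch count and that the trainer's estimate is simply steps times time per step: since both commands use the same `--per_device_train_batch_size` and `--gradient_accumulation_steps`, doubling the GPU count doubles the global batch and halves the number of steps, so a doubled estimate implies each step got roughly four times slower. The function names below are illustrative only.

```python
# Flags shared by both launch commands in this issue.
PER_DEVICE_BATCH = 2   # --per_device_train_batch_size
GRAD_ACCUM = 4         # --gradient_accumulation_steps


def global_batch(num_gpus: int) -> int:
    """Samples consumed per optimizer step under data parallelism."""
    return PER_DEVICE_BATCH * GRAD_ACCUM * num_gpus


def step_time_ratio(eta_a_h: float, gpus_a: int, eta_b_h: float, gpus_b: int) -> float:
    """How many times slower one optimizer step is in run B than in run A.

    Assumes ETA = steps * step_time and that the step count is inversely
    proportional to the global batch size (same data, same epochs).
    """
    steps_b_over_a = global_batch(gpus_a) / global_batch(gpus_b)
    return (eta_b_h / eta_a_h) / steps_b_over_a


single_node_batch = global_batch(2)
multi_node_batch = global_batch(4)
ratio = step_time_ratio(143.5, 2, 282.0, 4)

print(f"global batch: {single_node_batch} (1 node) vs {multi_node_batch} (2 nodes)")
print(f"per-step slowdown on 2 nodes: about {ratio:.2f}x")
```

A slowdown of this size per step points at something other than compute, for example inter-node communication.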

Expected behavior

No response

System Info

No response

Others

No response

hiyouga · 2023-11-17T08:17:26Z

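ZeRO3 requires far more inter-node communication than ZeRO2. Without a high-bandwidth interconnect such as InfiniBand, it slows parallel training down considerably.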
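
To get a feel for the scale, here is a back-of-envelope sketch of the parameter traffic ZeRO3 may generate between nodes. Everything in it is an assumption for illustration: a nominal 13e9 parameters taken from the model name, bf16 weights, parameters all-gathered once in the forward and once in the backward pass of every micro-batch, and each node receiving the remote node's shards at least once per all-gather. The two bandwidth figures are hypothetical, since the issue does not state the network. Overlap with compute, LoRA gradient traffic, and collective algorithm details are ignored.

```python
def cross_node_gb_per_step(n_params: float, bytes_per_param: int,
                           nodes: int, grad_accum: int) -> float:
    """Rough lower bound on GB each node receives per optimizer step."""
    param_gb = n_params * bytes_per_param / 1e9
    # Fraction of the sharded parameters that live on other nodes.
    remote_fraction = (nodes - 1) / nodes
    # One all-gather in forward and one in backward, per micro-batch.
    gathers_per_step = 2 * grad_accum
    return param_gb * remote_fraction * gathers_per_step


traffic_gb = cross_node_gb_per_step(
    n_params=13e9,      # assumption: nominal size from the model name
    bytes_per_param=2,  # bf16
    nodes=2,
    grad_accum=4,       # --gradient_accumulation_steps
)

# Hypothetical link speeds in GB/s, NOT measured on the reporter's cluster.
links = {"10 Gb/s Ethernet": 1.25, "200 Gb/s InfiniBand": 25.0}
for name, gb_per_s in links.items():
    print(f"{name}: ~{traffic_gb / gb_per_s:.1f} s of transfer per optimizer step")
```

Under these assumptions the slower link spends on the order of a minute per step just moving parameters, while the faster one spends a few seconds.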
Also, LoRA training does not need DeepSpeed at all; Accelerate's MULTI_GPU mode is enough for multi-node multi-GPU training.
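
A minimal sketch of what the MULTI_GPU route might look like, written as a small helper that builds the `accelerate launch` command for each machine. It is not the maintainer's exact setup: the IP address and port are hypothetical placeholders, the helper name is made up, and only a few of the training flags from the issue are repeated (the rest would carry over, minus `--deepspeed`). Note that `--num_processes` is the total across all machines and that `--machine_rank` differs per machine.

```python
import shlex

# A subset of the training flags from the issue, without --deepspeed.
TRAIN_ARGS = [
    "src/train_bash.py",
    "--stage", "sft",
    "--model_name_or_path", "/home/Baichuan2-13B-Chat/",
    "--do_train",
    "--finetuning_type", "lora",
    "--lora_target", "W_pack",
    "--bf16",
]


def build_launch_cmd(machine_rank: int, main_ip: str, num_machines: int = 2,
                     gpus_per_machine: int = 2, port: int = 29500) -> list[str]:
    """Build the accelerate launch command to run on one machine."""
    return [
        "accelerate", "launch",
        "--multi_gpu",
        "--num_machines", str(num_machines),
        "--num_processes", str(num_machines * gpus_per_machine),  # total, not per node
        "--machine_rank", str(machine_rank),
        "--main_process_ip", main_ip,   # address of the rank-0 machine
        "--main_process_port", str(port),
        "--mixed_precision", "bf16",
        *TRAIN_ARGS,
    ]


# Hypothetical address for the first machine; replace with the real one.
for rank in range(2):
    print(shlex.join(build_launch_cmd(rank, "10.0.0.1")))
```

Each printed command would be run on the machine with the matching rank. The same settings can alternatively be stored by running `accelerate config` on each machine.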

sun1092469590 · 2023-11-21T07:45:01Z

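ZeRO3 requires far more inter-node communication than ZeRO2. Without a high-bandwidth interconnect such as InfiniBand, it slows parallel training down considerably. Also, LoRA training does not need DeepSpeed at all; Accelerate's MULTI_GPU mode is enough for multi-node multi-GPU training.

Thanks. One more question: what should the Accelerate configuration look like for multi-node multi-GPU training?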

bravelyi · 2024-03-29T10:45:39Z

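ZeRO3 requires far more inter-node communication than ZeRO2. Without a high-bandwidth interconnect such as InfiniBand, it slows parallel training down considerably. Also, LoRA training does not need DeepSpeed at all; Accelerate's MULTI_GPU mode is enough for multi-node multi-GPU training.

Thanks. One more question: what should the Accelerate configuration look like for multi-node multi-GPU training?

Hello, did you manage to solve this problem?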

hiyouga added the "solved" label (This problem has been already solved.) on Nov 17, 2023

hiyouga closed this as completed Nov 17, 2023

youqugit mentioned this issue Apr 1, 2024

Can the training framework support Megatron-DeepSpeed for multi-node multi-GPU training? #2956

Open

1 task
