
On the speed of DeepSpeed multi-node multi-GPU training #1551

Closed
1 task done
sun1092469590 opened this issue Nov 17, 2023 · 3 comments
Labels
solved This problem has been already solved.

Comments

sun1092469590 commented Nov 17, 2023

Reminder

  • I have read the README and searched the existing issues.

Reproduction

Hello, I am fine-tuning the Baichuan2-13B-Chat model in multi-node multi-GPU mode on two A800 machines (2x80G each), but the training speed looks strange. With exactly the same parameter settings, training code, training data, and model, LoRA fine-tuning on one A800 machine (i.e. 2x80G) gives an estimated training time of 143.5 hours, whereas the same LoRA fine-tuning on two machines (i.e. 4x80G in total) gives an estimated 282 hours, nearly twice as long. Shouldn't training get faster with more GPUs? What could be going on here?

Command used for multi-node multi-GPU training:
deepspeed --hostfile=hostfile --include="gpu-001:0,1@gpu-002:0,1" src/train_bash.py \
    --stage sft \
    --model_name_or_path /home/Baichuan2-13B-Chat/ \
    --do_train \
    --dataset alpaca_gpt4_zh \
    --template baichuan2 \
    --finetuning_type lora \
    --lora_target W_pack \
    --output_dir /home/path_to_sft_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --bf16 \
    --deepspeed ds_config_zero3.json

Single-node training command:
deepspeed --include localhost:0,1 src/train_bash.py \
    --stage sft \
    --model_name_or_path /home/Baichuan2-13B-Chat/ \
    --do_train \
    --dataset alpaca_gpt4_zh \
    --template baichuan2 \
    --finetuning_type lora \
    --lora_target W_pack \
    --output_dir /home/path_to_sft_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --bf16 \
    --deepspeed ds_config_zero3.json

Expected behavior

No response

System Info

No response

Others

No response

hiyouga (Owner) commented Nov 17, 2023

ZeRO-3 requires far more inter-node communication than ZeRO-2; without an interconnect such as InfiniBand, it drastically slows down parallel training.
Besides, LoRA training does not need DeepSpeed at all; Accelerate's MULTI_GPU mode is sufficient for multi-node multi-GPU training.
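
For reference, a minimal sketch of such an Accelerate launch (run once on each node; the main-process IP, port, and --machine_rank values below are illustrative assumptions, with gpu-001 taken as the main node, and are not confirmed in this thread):

# run on gpu-001 (main node); on gpu-002 change --machine_rank to 1
accelerate launch \
    --multi_gpu \
    --num_machines 2 \
    --num_processes 4 \
    --machine_rank 0 \
    --main_process_ip gpu-001 \
    --main_process_port 29500 \
    --mixed_precision bf16 \
    src/train_bash.py \
    --stage sft \
    ... (same training arguments as above, without --deepspeed)

Alternatively, running accelerate config interactively on each node generates an equivalent configuration file.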

@hiyouga hiyouga added the solved This problem has been already solved. label Nov 17, 2023
@hiyouga hiyouga closed this as completed Nov 17, 2023
sun1092469590 (Author) commented

ZeRO-3 requires far more inter-node communication than ZeRO-2; without an interconnect such as InfiniBand, it drastically slows down parallel training. Besides, LoRA training does not need DeepSpeed at all; Accelerate's MULTI_GPU mode is sufficient for multi-node multi-GPU training.

Thanks. One more question: when using Accelerate for multi-node multi-GPU training, what should the Accelerate configuration parameters be?

bravelyi commented

ZeRO-3 requires far more inter-node communication than ZeRO-2; without an interconnect such as InfiniBand, it drastically slows down parallel training. Besides, LoRA training does not need DeepSpeed at all; Accelerate's MULTI_GPU mode is sufficient for multi-node multi-GPU training.

Thanks. One more question: when using Accelerate for multi-node multi-GPU training, what should the Accelerate configuration parameters be?

Hello, have you managed to resolve this problem?
