DDLRUN + DeepSpeed on SUMMIT #61
Comments
Hello! Thank you for your interest in DeepSpeed. DeepSpeed uses its own launcher and relies on NCCL for communication instead of MPI. Your code needs to use DeepSpeed's small API to run, and Horovod is not used.

To launch a DeepSpeed program, you just need a hostfile, which is compatible with many MPI implementations. DeepSpeed searches for

Finally, you can launch with:

deepspeed cifar_deepspeed.py --deepspeed --deepspeed_config=ds_config.json
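As a rough illustration of the hostfile mentioned above, here is a minimal Python sketch that parses the `hostname slots=N` line format shared by DeepSpeed and Open MPI style hostfiles. The host names `worker-1` and `worker-2`, the slot counts, and the helper name `parse_hostfile` are made up for the example; this is not DeepSpeed's own parser.

```python
from typing import Dict


def parse_hostfile(text: str) -> Dict[str, int]:
    """Parse 'hostname slots=N' lines into {hostname: slots}.

    Hypothetical helper for illustration only, not part of DeepSpeed.
    """
    hosts: Dict[str, int] = {}
    for raw in text.splitlines():
        # Drop trailing comments and skip blank lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError(f"bad hostfile line: {raw!r}")
        name, rest = parts
        key, _, value = rest.strip().partition("=")
        if key != "slots" or not value.isdigit():
            raise ValueError(f"bad hostfile line: {raw!r}")
        if name in hosts:
            raise ValueError(f"duplicate host: {name}")
        hosts[name] = int(value)
    return hosts


# Example hostfile contents (host names and slot counts are assumptions).
example = """\
worker-1 slots=4
worker-2 slots=4
"""

resources = parse_hostfile(example)
total_gpus = sum(resources.values())
```

Each `slots` value is the number of GPUs available on that node, so `total_gpus` is the number of processes such a launch would typically start.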
Thanks for the clarification.
@agemagician, we just merged a new PR that should make this a bit easier for you and others who want to use MPI. Please see the new text in our README for more details: https://github.com/microsoft/DeepSpeed/#mpi-compatibility

In your case you should be able to do something like:

Also make sure to install the Python package mpi4py if you don't already have it.
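For readers wondering what an MPI-launched process needs in order to join an NCCL-based job, the sketch below maps the per-process variables exported by Open MPI style launchers (`OMPI_COMM_WORLD_RANK`, `OMPI_COMM_WORLD_SIZE`, `OMPI_COMM_WORLD_LOCAL_RANK`) onto the variables that PyTorch's `env://` initialization reads. This is only an illustration under the assumption that the launcher exports those Open MPI variables; the helper name `torch_env_from_ompi` and the host name `node-0` are hypothetical, and this is not necessarily how DeepSpeed's MPI support is implemented.

```python
from typing import Dict, Mapping


def torch_env_from_ompi(
    env: Mapping[str, str],
    master_addr: str,
    master_port: int = 29500,
) -> Dict[str, str]:
    """Translate Open MPI rank variables into torch.distributed env:// variables.

    Hypothetical helper for illustration; raises KeyError if the process
    was not started by a launcher that exports the OMPI_* variables.
    """
    return {
        "RANK": env["OMPI_COMM_WORLD_RANK"],
        "WORLD_SIZE": env["OMPI_COMM_WORLD_SIZE"],
        "LOCAL_RANK": env["OMPI_COMM_WORLD_LOCAL_RANK"],
        # Every rank must agree on where rank 0 is listening.
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
    }


# Simulated environment of the third process in a 12-process job
# (values are made up for the example).
fake_env = {
    "OMPI_COMM_WORLD_RANK": "2",
    "OMPI_COMM_WORLD_SIZE": "12",
    "OMPI_COMM_WORLD_LOCAL_RANK": "2",
}

settings = torch_env_from_ompi(fake_env, master_addr="node-0")
```

In a real job you would pass `os.environ` instead of `fake_env` and copy the result back into `os.environ` before initializing the process group.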
Thanks @jeffra for the update.
ddlrun didn't work out, as follows:
When I tried jsrun, I got another error:
I then changed the distributed-backend parameter to ddl and got another error:
Oh, that was actually for the Megatron-LM code, which doesn't use DeepSpeed's distributed code. I will test it again with the CIFAR test.
Hi,
I am trying to use DeepSpeed on SUMMIT with ddlrun, but it doesn't work properly.
I am testing it with the CIFAR example like this:
ddlrun deepspeed cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json
Could you please give us an example of using DeepSpeed with Horovod, MPI, and ddlrun?