SSH-less connection on Kubernetes #2679

Closed
dogacancolak opened this issue Jan 9, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@dogacancolak

Is your feature request related to a problem? Please describe.
What is the recommended way to run DeepSpeed on Kubernetes? Running an SSH daemon seems suboptimal, since each container would need two processes: one for DeepSpeed and one for sshd.

Using MPI isn't a solution either, since MPI launchers also typically use SSH to establish the connection.

@dogacancolak dogacancolak added the enhancement New feature or request label Jan 9, 2023
@dogacancolak dogacancolak reopened this Apr 13, 2023
@hjp709394

What's the status of this item?

@dogacancolak
Author

I ended up with two methods of running a multi-node task. The first is indeed to set up passwordless SSH connections between the pods and use the deepspeed launcher with a hostfile that lists the pods' SSH hostnames.
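For the hostfile approach, a minimal sketch of generating the file contents is shown here. It assumes the usual DeepSpeed hostfile layout of one `<hostname> slots=<gpu count>` line per node. The pod hostnames and the helper name `build_hostfile` are hypothetical and only for illustration.

```python
from typing import Iterable


def build_hostfile(hostnames: Iterable[str], slots_per_host: int) -> str:
    """Return hostfile text with one '<hostname> slots=<n>' line per pod."""
    if slots_per_host < 1:
        raise ValueError("slots_per_host must be at least 1")
    lines = [f"{name} slots={slots_per_host}" for name in hostnames]
    return "\n".join(lines) + "\n"


# Hypothetical pod hostnames; use names that resolve over SSH in your cluster.
pods = ["worker-0.workers", "worker-1.workers"]
hostfile_text = build_hostfile(pods, slots_per_host=8)
print(hostfile_text, end="")
```

The resulting text can be written to a file and handed to the deepspeed launcher through its `--hostfile` option.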

The other way is to use the torchrun command. In that case, call torch.distributed.init_process_group instead of deepspeed.init_distributed in the training script. Also, unlike the deepspeed launcher, torchrun must be run independently on every pod, and you must pass the address of one of the pods to act as the master node.
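To make the per-pod invocation concrete, here is a rough sketch that assembles the torchrun command line for each pod. The script name, pod count, GPU count, and master address are assumptions for illustration, and `torchrun_command` is a hypothetical helper, not part of PyTorch or DeepSpeed.

```python
from typing import List


def torchrun_command(script: str, nnodes: int, node_rank: int,
                     nproc_per_node: int, master_addr: str,
                     master_port: int = 29500) -> List[str]:
    """Build the torchrun argv that one pod would execute."""
    if not 0 <= node_rank < nnodes:
        raise ValueError("node_rank must be in [0, nnodes)")
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",            # differs on every pod
        f"--nproc_per_node={nproc_per_node}",  # usually the GPUs per pod
        f"--master_addr={master_addr}",        # same pod address everywhere
        f"--master_port={master_port}",
        script,
    ]


# Hypothetical two-pod job; each pod runs only the line for its own rank.
for rank in range(2):
    cmd = torchrun_command("train.py", nnodes=2, node_rank=rank,
                           nproc_per_node=8, master_addr="worker-0.workers")
    print(" ".join(cmd))
```

Only `--node_rank` changes between pods. All other arguments, including the master address, have to be identical on every pod.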

@loadams
Contributor

loadams commented Aug 14, 2023

@dogacancolak - that's probably the best way to do this for now.

@loadams loadams closed this as completed Aug 14, 2023
@jungyh0218

This issue is already closed but I want to share the detailed method I used to deal with this issue.

Using torchrun instead of deepspeed launcher
As far as I know, the deepspeed launcher is not required in order to use a DeepSpeed configuration. As @dogacancolak mentioned, I also tried the torchrun launcher. Modify your code as follows:

# Replace the DeepSpeed call with the torch.distributed equivalent.
# deepspeed.init_distributed()
import torch.distributed
torch.distributed.init_process_group(backend='nccl')

Then run the code with the torchrun launcher. Since torchrun does not start the worker nodes via pdsh, the command has to be run independently on every pod. Kubeflow's MPIJob or PyTorchJob can do this automatically.
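On the script side, torchrun passes each process its rank information through the environment variables RANK, LOCAL_RANK, and WORLD_SIZE. The sketch below shows one way the training script might pick these up before initializing the process group. The helper names are hypothetical, and binding each process to the GPU given by LOCAL_RANK is an assumption about the setup rather than something stated in this thread.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class TorchrunEnv:
    rank: int        # global rank across all pods
    local_rank: int  # rank within this pod
    world_size: int  # total number of processes


def read_torchrun_env(env: Mapping[str, str] = os.environ) -> TorchrunEnv:
    """Parse the rank variables that torchrun exports to every worker."""
    missing = [k for k in ("RANK", "LOCAL_RANK", "WORLD_SIZE") if k not in env]
    if missing:
        raise RuntimeError("missing torchrun variables: " + ", ".join(missing))
    return TorchrunEnv(int(env["RANK"]), int(env["LOCAL_RANK"]),
                       int(env["WORLD_SIZE"]))


def init_distributed_from_torchrun() -> TorchrunEnv:
    """Initialize torch.distributed inside a process started by torchrun."""
    # Imported lazily so the parsing helper also works without PyTorch.
    import torch
    import torch.distributed as dist

    info = read_torchrun_env()
    torch.cuda.set_device(info.local_rank)   # assumes one GPU per process
    dist.init_process_group(backend="nccl")  # default init reads the env vars
    return info
```

Calling `init_distributed_from_torchrun()` once at the top of the training script would take the place of the commented-out DeepSpeed call.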

Setting passwordless SSH in kubernetes pods
To set up passwordless SSH in Kubernetes, OpenSSH and pdsh have to be installed in the Docker image. I created a Dockerfile based on the NGC PyTorch image:
https://gist.github.com/jungyh0218/c04a285aadaa1efe3df289a61cf77f74

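To use passwordless SSH, create an SSH key pair.

# Generate a key pair in the current directory with an empty passphrase.
ssh-keygen -t rsa -f ./id_rsa -N ""
# Authorize the same public key for incoming connections.
cp id_rsa.pub authorized_keys

To pass these key files to the pods, use a Kubernetes secret.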

kubectl create secret generic my-ssh-key --from-file=path/to/id_rsa --from-file=path/to/id_rsa.pub --from-file=path/to/authorized_keys

After creating the secret, deploy your Kubernetes deployment and mount the secret as a volume. When the pods start, make sure that the OpenSSH server (sshd) is running. You also need to copy id_rsa and authorized_keys into the /root/.ssh folder. Do these tasks in your init container or in the container commands.
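The copy step could look roughly like the sketch below. OpenSSH is strict about file permissions: the client refuses a private key that other users can read, and sshd by default rejects an authorized_keys file with loose permissions. The sketch therefore sets modes explicitly after copying. The mount path /etc/ssh-keys and the helper name `install_ssh_keys` are assumptions for illustration.

```python
import os
import shutil
from pathlib import Path


def install_ssh_keys(secret_dir: Path, ssh_dir: Path) -> None:
    """Copy key files from a mounted secret into ssh_dir with strict modes."""
    ssh_dir.mkdir(parents=True, exist_ok=True)
    os.chmod(ssh_dir, 0o700)  # directory accessible by the owner only
    for name in ("id_rsa", "authorized_keys"):
        target = ssh_dir / name
        # copyfile follows symlinks, so it copies the file contents even
        # when the mounted secret exposes its files as symlinks.
        shutil.copyfile(secret_dir / name, target)
        os.chmod(target, 0o600)  # owner read/write only


# In the container, before starting sshd (both paths are assumptions):
# install_ssh_keys(Path("/etc/ssh-keys"), Path("/root/.ssh"))
```

Copying rather than mounting the secret directly at /root/.ssh keeps the file modes and ownership under the container's control.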

If anyone has a better idea, please share it. :)
