SSH-less connection on Kubernetes #2679
Comments
What's the status of this item? |
I ended up with two methods of running a multi-node task. The first method is indeed setting up password-less SSH connections between the pods and just using the deepspeed launcher with the hostfile listing SSH hostnames. The other way is using the torchrun command. If you use this one, you need to call |
@dogacancolak - that's probably the best way to do this for now. |
This issue is already closed, but I want to share the detailed method I used to deal with it.

Using torchrun instead of the deepspeed launcher
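For the torchrun route, every pod has to start its own copy of the job with its own node rank. Here is a rough sketch of how that per-pod command could be assembled. It is only an illustration: the helper names, the `worker-N` hostname pattern (as produced by a StatefulSet), the `worker-0.workers` address, and `train.py` are all assumptions, not something taken from this thread. The torchrun flags themselves (`--nnodes`, `--nproc_per_node`, `--node_rank`, `--master_addr`, `--master_port`) are the standard ones.

```python
import re
import shlex


def node_rank_from_hostname(hostname: str) -> int:
    """Derive a node rank from a pod name ending in an ordinal, e.g. 'worker-2' -> 2.

    Assumption: pods are named '<name>-<ordinal>' as a StatefulSet would name them.
    """
    match = re.search(r"-(\d+)$", hostname)
    if match is None:
        raise ValueError(f"no trailing ordinal in hostname {hostname!r}")
    return int(match.group(1))


def build_torchrun_cmd(node_rank, nnodes, nproc_per_node,
                       master_addr, master_port, script, script_args=()):
    """Return the torchrun argument list that a single pod should execute."""
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        script,
        *script_args,
    ]


if __name__ == "__main__":
    # Hypothetical values: 2 pods with 8 GPUs each, rank 0 reachable as worker-0.workers.
    cmd = build_torchrun_cmd(
        node_rank=node_rank_from_hostname("worker-1"),
        nnodes=2,
        nproc_per_node=8,
        master_addr="worker-0.workers",
        master_port=29500,
        script="train.py",
    )
    print(shlex.join(cmd))
```

Only `--node_rank` differs between pods; all other arguments must be identical everywhere so the workers can rendezvous.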
And run the code with the torchrun launcher. Since the torchrun launcher does not invoke worker nodes via pdsh, the code has to be run independently on every pod. Kubeflow's MPIJob or PyTorchJob can do this automatically.

Setting up passwordless SSH in Kubernetes pods

To use passwordless SSH, create SSH keys.
To pass these key files to the pods, use a Kubernetes Secret.
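As one possible illustration, the Secret can be written as a manifest whose `data` values are the base64-encoded key files. The sketch below builds such a manifest as JSON, which `kubectl apply -f` accepts. The function name, the Secret name `ssh-keys`, and the placeholder key contents are assumptions made for the example.

```python
import base64
import json


def build_ssh_secret(name, files):
    """Build a Kubernetes Secret manifest from a mapping of file name -> bytes.

    Values under `data` must be base64-encoded, as the Secret API requires.
    """
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": {
            key: base64.b64encode(content).decode("ascii")
            for key, content in files.items()
        },
    }


if __name__ == "__main__":
    # Placeholder contents; in practice these would be read from the generated key files.
    manifest = build_ssh_secret(
        "ssh-keys",  # hypothetical Secret name
        {
            "id_rsa": b"<private key contents>",
            "authorized_keys": b"<public key contents>",
        },
    )
    print(json.dumps(manifest, indent=2))
```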
After creating the Secret, create your Kubernetes Deployment and mount the Secret as a volume. When the pods start, make sure that the sshd service (OpenSSH server) is running. You also need to copy id_rsa and authorized_keys into the /root/.ssh folder. Do these tasks in an init container or in the container commands.

If anyone has a better idea, please share it. :)
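The copy step described above could look roughly like the following, run from an init container or at container start. This is a sketch under assumptions: the function name and the default file list are made up for the example. The files are copied rather than used in place because Secret volumes are mounted read-only, while OpenSSH expects the key files and the `.ssh` directory to have restrictive permissions.

```python
import os
import shutil
from pathlib import Path


def install_ssh_keys(secret_dir, ssh_dir, names=("id_rsa", "authorized_keys")):
    """Copy key files from a mounted Secret directory into ssh_dir with strict permissions."""
    secret_dir, ssh_dir = Path(secret_dir), Path(ssh_dir)
    ssh_dir.mkdir(parents=True, exist_ok=True)
    os.chmod(ssh_dir, 0o700)  # directory accessible by the owner only
    installed = []
    for name in names:
        target = ssh_dir / name
        # copyfile follows symlinks, so the real file content is copied
        shutil.copyfile(secret_dir / name, target)
        os.chmod(target, 0o600)  # readable and writable by the owner only
        installed.append(target)
    return installed
```

With the Secret mounted at a hypothetical path such as `/etc/ssh-keys`, the call would be `install_ssh_keys("/etc/ssh-keys", "/root/.ssh")`.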
Is your feature request related to a problem? Please describe.
What is the advised way to run DeepSpeed on Kubernetes? Starting an ssh daemon does not sound optimal, since we would need to have two processes per container: one for deepspeed and one for ssh.
Using MPI isn't a solution either since MPI also uses ssh to establish the connection.