Multi Node Distributing - RuntimeError: Connection reset by peer #1502
Comments
Hi @nithin8702, if you do |
Hi @jeffra, ds_ssh hostname returns the worker nodes correctly. However, when I execute the deepspeed command it throws "Connection reset by peer". I have the hostfile both in /job/hostfile and in the present working directory. The full command is "deepspeed --hostfile=hostfile cifar10_deepspeed.py --deepspeed_config ds_config.json". Here's my ssh config file at /root/.ssh/config: Host gpu2 |
Hi @jeffra, do you have any documentation on how to set up DeepSpeed in Kubernetes? Also, it would be good to have setup documentation for Kubeflow integration. |
Hi @jeffra @nithin8702 Did you get something? |
Hi @jeffra |
Hi, in case anybody is still searching, I can describe how I managed to set up DeepSpeed on Kubeflow (EKS backend). It won't be a full tutorial, but maybe it will save some people time. Generally speaking, you need something to set up the pods and the connection between them. Most tutorials guide you to use mpi-operator, but if you've never worked with MPI, I would say there are easier options to choose from, especially PyTorchJob. I would advise creating a Python function that generates this dynamically (for the sake of names, or the number of nodes and GPUs). Remember that you need to run your script both in the master pod and in all worker pods with a different node. This operator will create the necessary env vars for you: $RANK, $MASTER_PORT, $MASTER_ADDR, etc. Then you will probably want to use this operator as a pipeline component. You can use ResourceOp to achieve this. Now, how to launch it. I've managed to do it with
The sleep was mainly for debugging issues. Setting interface The last part was about adjusting the code. I needed to change
I couldn't manage to do it with either And most importantly, you don't need to change your So that's it. That worked for me. I hope it helps someone. |
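To make the environment-variable part of the comment above concrete, here is a minimal sketch of reading the rendezvous settings that the operator is said to inject. The helper name `read_rendezvous` and the `Rendezvous` dataclass are hypothetical, and `WORLD_SIZE` is an assumption (the comment only lists $RANK, $MASTER_PORT, $MASTER_ADDR "etc."). These are the same variables that `torch.distributed` reads when initialized with `init_method="env://"`.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Names the sketch expects; WORLD_SIZE is an assumption, the other three
# are the variables mentioned in the comment above.
REQUIRED = ("RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")


@dataclass(frozen=True)
class Rendezvous:
    rank: int
    world_size: int
    master_addr: str
    master_port: int


def read_rendezvous(env: Mapping[str, str] = os.environ) -> Rendezvous:
    """Collect the rendezvous settings, failing loudly if any are missing."""
    missing = [name for name in REQUIRED if name not in env]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return Rendezvous(
        rank=int(env["RANK"]),
        world_size=int(env["WORLD_SIZE"]),
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
    )
```

Calling `read_rendezvous()` at the top of the training script in each pod and logging the result is a cheap way to confirm that every pod sees the same master address and port but a different rank before any distributed initialization runs.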
Hi,
I am trying to run DeepSpeed across multiple nodes on AWS Kubernetes (EKS). It works on a single node (the current machine, without the hostfile), but not when I try to connect to different nodes.
hostfile
gpu2 slots=1
Running ssh gpu2 date returns the date.
When I execute the following command in the terminal, it fails with "RuntimeError: Connection reset by peer":
deepspeed --hostfile=hostfile cifar10_deepspeed.py --deepspeed_config ds_config.json
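Since SSH to gpu2 works but the launch still fails, one thing that may be worth ruling out is whether other TCP ports between the nodes are reachable at all. The following is a small diagnostic sketch using only the Python standard library; it is not part of DeepSpeed, and the function name `can_connect` is hypothetical.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, resets, timeouts, and name-resolution failures.
        return False
```

For example, `can_connect("gpu2", 22)` should return True whenever `ssh gpu2 date` works. Checking the rendezvous port the same way (29500 is assumed here; the launcher accepts `--master_port` to change it) only gives a meaningful answer while a process is actually listening on that port on the target node.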