Multi Node Distributing - RuntimeError: Connection reset by peer #1502

Open
nithin8702 opened this issue Oct 29, 2021 · 6 comments

@nithin8702

Hi,

I am trying to run multi-node training on AWS Kubernetes (EKS). It works on a single node (the current machine, without the hostfile), but when I try to connect to other nodes it does not.

hostfile
gpu2 slots=1

ssh gpu2 date returns the date correctly.

When I execute the following command in the terminal, it throws the error "RuntimeError: Connection reset by peer":
deepspeed --hostfile=hostfile cifar10_deepspeed.py --deepspeed_config ds_config.json

[screenshot of the error output]

@jeffra
Contributor

jeffra commented Oct 29, 2021

Hi @nithin8702, if you run ds_ssh hostname it should show all the hosts you intend to execute on. ds_ssh assumes the hostfile is at /job/hostfile; I just pushed #1504 to allow passing in a custom path for the hostfile. In your case, if you want to execute across 2 nodes, you'll want to add both nodes to your hostfile.
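
For example, a hostfile listing both nodes (the hostnames here are placeholders, not taken from this thread) would look like:

gpu1 slots=1
gpu2 slots=1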

@nithin8702
Author

Hi @jeffra

ds_ssh hostname returns the worker nodes correctly. However, when I execute the deepspeed command it throws "Connection reset by peer".

I have the hostfile both at /job/hostfile and in the present working directory.

Full command is "deepspeed --hostfile=hostfile cifar10_deepspeed.py --deepspeed_config ds_config.json"

Here's my ssh config file at /root/.ssh/config

Host gpu2
Hostname xx.xxx.x.xxx
user ec2-user
IdentityFile /app/ri.pem
Port 22

@nithin8702
Author

Hi @jeffra

Do you have any documentation on how to set up DeepSpeed on Kubernetes? Also, it would be good to have setup documentation for Kubeflow integration.

@a-cavalcanti

> Hi @jeffra
> Do you have any documentation on how to set up DeepSpeed on Kubernetes? Also, it would be good to have setup documentation for Kubeflow integration.

Hi @jeffra
Do you have any updates on how to run deepspeed in Kubernetes or kubeflow? Or some documentation that can help us?

@nithin8702 Did you get something?

@dogacancolak

Hi @jeffra
Would love to see some information on running deepspeed in k8s as the others mentioned. Thank you!

@Szakulli07

Hi, in case anybody is still searching for this, I can describe how I managed to set up DeepSpeed on Kubeflow (EKS backend). It won't be a full tutorial, but maybe it will save at least some people some time.

So generally speaking, you need something to set up the pods and the connections between them. Most tutorials point you to mpi-operator, but if you've never worked with MPI, I would say there are easier options, especially PyTorchJob. I would advise writing a Python function that creates the manifest dynamically (to parametrize the names, the number of nodes, and the GPUs); a sketch follows below.
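
To illustrate the kind of helper I mean, here is a minimal sketch that builds a PyTorchJob manifest as a Python dict (the job name, image, and command are placeholders I made up, not values from this thread):

def make_pytorchjob(name, image, num_nodes, gpus_per_node, command):
    # Hypothetical helper: returns a PyTorchJob manifest with the node and
    # GPU counts parametrized.
    def replica_spec(replicas):
        return {
            "replicas": replicas,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    "containers": [{
                        "name": "pytorch",  # the training operator expects this container name
                        "image": image,
                        "command": ["/bin/bash", "-c", command],
                        "resources": {"limits": {"nvidia.com/gpu": gpus_per_node}},
                    }]
                }
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica_spec(1),
                # the master counts as one node, so workers = num_nodes - 1
                "Worker": replica_spec(num_nodes - 1),
            }
        },
    }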

Remember that you need to run your script on both the master pod and all the worker pods, each with a different node rank. The operator will create the necessary env vars for you: $RANK, $MASTER_PORT, $MASTER_ADDR, etc.

Then you will probably want to use this operator as a pipeline component. You can use ResourceOp to achieve this; see the sketch below.
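
As a rough sketch of that pipeline step, assuming the kfp v1 SDK and the make_pytorchjob helper from above (the success/failure condition strings are illustrative, not something I've verified against this exact setup):

import kfp.dsl as dsl

@dsl.pipeline(name="deepspeed-training")
def deepspeed_pipeline():
    manifest = make_pytorchjob(
        name="deepspeed-cifar",              # placeholder names
        image="my-training-image",
        num_nodes=2,
        gpus_per_node=1,
        command="<the torchrun launch script shown below>")
    # ResourceOp submits the manifest to the cluster and waits for the
    # success/failure condition on the created resource's status.
    dsl.ResourceOp(
        name="launch-pytorchjob",
        k8s_resource=manifest,
        action="create",
        success_condition="status.replicaStatuses.Master.succeeded == 1",
        failure_condition="status.replicaStatuses.Master.failed > 0",
    )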

Now, how to launch it. I managed to do it with torchrun, and the launch script looked like this:

sleep 20 \
        && export NCCL_SOCKET_IFNAME=eth0 \
        && torchrun \
            --nnodes={num_of_nodes} \
            --nproc-per-node={num_of_gpu} \
            --node_rank=$RANK \
            --master_addr=$MASTER_ADDR \
            --master_port=$MASTER_PORT \
            -m script

The sleep was mainly for debugging issues. I set the interface to eth0 manually because it was sometimes detected incorrectly.

The last part was adjusting the code. I needed to change deepspeed.init_distributed to:

import os
import torch.distributed

torch.distributed.init_process_group(
    backend='nccl',
    init_method=f'tcp://{os.getenv("MASTER_ADDR")}:{os.getenv("MASTER_PORT")}',
    world_size=int(os.getenv("WORLD_SIZE")),  # env vars are strings, so cast to int
    rank=int(os.getenv("RANK")))
torch.distributed.barrier()

I couldn't manage to get it working with either the gloo backend or the env:// init method, or without torch.distributed.barrier(). But I'm not sure those are strictly required; it was a fairly messy process of guessing for me, so maybe they are not needed.
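
For completeness, after that manual process-group initialization the DeepSpeed engine is still created the usual way. Here is a minimal sketch, with names taken from cifar10_deepspeed.py (args, net, parameters, trainset); dist_init_required=False is my assumption for telling DeepSpeed not to initialize the process group a second time:

import deepspeed

model_engine, optimizer, trainloader, _ = deepspeed.initialize(
    args=args,                     # includes --deepspeed_config ds_config.json
    model=net,
    model_parameters=parameters,
    training_data=trainset,
    dist_init_required=False)      # assumption: process group was created manually above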

And most importantly: you don't need to change your Dockerfile at all!

So that's it. That worked for me. I hope it will help someone.
