Multi Node Distributing - RuntimeError: Connection reset by peer #1502

Open
nithin8702 opened this issue Oct 29, 2021 · 6 comments

@nithin8702

Hi,

I am trying to run multi-node training on AWS Kubernetes (EKS). It works on a single node (the current machine, without the hostfile), but when I try to connect to other nodes it does not.

hostfile
gpu2 slots=1

ssh gpu2 date returns the date correctly.

When I execute the following command in the terminal, it throws the error "RuntimeError: Connection reset by peer":
deepspeed --hostfile=hostfile cifar10_deepspeed.py --deepspeed_config ds_config.json

[screenshot of the error output]

@jeffra
Contributor

jeffra commented Oct 29, 2021

Hi @nithin8702, if you run ds_ssh hostname it should show all the hosts you intend to execute on. ds_ssh assumes the hostfile is at /job/hostfile; I just pushed #1504 to allow passing in a custom path for the hostfile. In your case, if you want to execute across 2 nodes, you'll want to add both nodes to your hostfile.
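
For example, a hostfile listing both nodes (the hostnames here are placeholders, not taken from this thread) would look like:

gpu1 slots=1
gpu2 slots=1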

@nithin8702
Author

Hi @jeffra

ds_ssh hostname returns the worker nodes correctly. However, when I execute the deepspeed command it throws "Connection reset by peer".

I have the hostfile both at /job/hostfile and in the present working directory.

Full command is "deepspeed --hostfile=hostfile cifar10_deepspeed.py --deepspeed_config ds_config.json"

Here's my ssh config file at /root/.ssh/config

Host gpu2
Hostname xx.xxx.x.xxx
user ec2-user
IdentityFile /app/ri.pem
Port 22

@nithin8702
Author

Hi @jeffra

Do you have any documentation on how to set up DeepSpeed on Kubernetes? Also, it would be good to have setup documentation for Kubeflow integration.

@a-cavalcanti

> Hi @jeffra
> Do you have any documentation on how to set up DeepSpeed on Kubernetes? Also, it would be good to have setup documentation for Kubeflow integration.

Hi @jeffra
Do you have any updates on how to run deepspeed in Kubernetes or kubeflow? Or some documentation that can help us?

@nithin8702 Did you get something?

@dogacancolak

Hi @jeffra
Would love to see some information on running deepspeed in k8s as the others mentioned. Thank you!

@Szakulli07

Hi, in case anybody is still searching for this, I can describe how I managed to set up DeepSpeed on Kubeflow (EKS backend). It won't be a full tutorial, but maybe it will save at least some people some time.

So generally speaking, you need something to set up the pods and the connections between them. Most tutorials point you to mpi-operator, but if you've never worked with MPI, I would say there are easier options, especially PyTorchJob. I would advise writing a Python function that creates the manifest dynamically (to parametrize the names, the number of nodes, and the GPUs); a sketch follows below.
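
To illustrate the kind of helper I mean, here is a minimal sketch that builds a PyTorchJob manifest as a Python dict (the job name, image, and command are placeholders I made up, not values from this thread):

def make_pytorchjob(name, image, num_nodes, gpus_per_node, command):
    # Hypothetical helper: returns a PyTorchJob manifest with the node and
    # GPU counts parametrized.
    def replica_spec(replicas):
        return {
            "replicas": replicas,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    "containers": [{
                        "name": "pytorch",  # the training operator expects this container name
                        "image": image,
                        "command": ["/bin/bash", "-c", command],
                        "resources": {"limits": {"nvidia.com/gpu": gpus_per_node}},
                    }]
                }
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica_spec(1),
                # the master counts as one node, so workers = num_nodes - 1
                "Worker": replica_spec(num_nodes - 1),
            }
        },
    }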

Remember that you need to run your script on both the master pod and all the worker pods, each with a different node rank. The operator will create the necessary env vars for you: $RANK, $MASTER_PORT, $MASTER_ADDR, etc.

Then you will probably want to use this operator as a pipeline component. You can use ResourceOp to achieve this; see the sketch below.
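
As a rough sketch of that pipeline step, assuming the kfp v1 SDK and the make_pytorchjob helper from above (the success/failure condition strings are illustrative, not something I've verified against this exact setup):

import kfp.dsl as dsl

@dsl.pipeline(name="deepspeed-training")
def deepspeed_pipeline():
    manifest = make_pytorchjob(
        name="deepspeed-cifar",              # placeholder names
        image="my-training-image",
        num_nodes=2,
        gpus_per_node=1,
        command="<the torchrun launch script shown below>")
    # ResourceOp submits the manifest to the cluster and waits for the
    # success/failure condition on the created resource's status.
    dsl.ResourceOp(
        name="launch-pytorchjob",
        k8s_resource=manifest,
        action="create",
        success_condition="status.replicaStatuses.Master.succeeded == 1",
        failure_condition="status.replicaStatuses.Master.failed > 0",
    )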

Now, how to launch it. I managed to do it with torchrun, and the launch script looked like this:

sleep 20 \
        && export NCCL_SOCKET_IFNAME=eth0 \
        && torchrun \
            --nnodes={num_of_nodes} \
            --nproc-per-node={num_of_gpu} \
            --node_rank=$RANK \
            --master_addr=$MASTER_ADDR \
            --master_port=$MASTER_PORT \
            -m script

The sleep was mainly for debugging issues. I set the interface to eth0 manually because it was sometimes detected incorrectly.

The last part was adjusting the code. I needed to change deepspeed.init_distributed to:

import os
import torch.distributed

torch.distributed.init_process_group(
    backend='nccl',
    init_method=f'tcp://{os.getenv("MASTER_ADDR")}:{os.getenv("MASTER_PORT")}',
    world_size=int(os.getenv("WORLD_SIZE")),  # env vars are strings, so cast to int
    rank=int(os.getenv("RANK")))
torch.distributed.barrier()

I couldn't manage to get it working with either the gloo backend or the env:// init method, or without torch.distributed.barrier(). But I'm not sure those are strictly required; it was a fairly messy process of guessing for me, so maybe they are not needed.
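
For completeness, after that manual process-group initialization the DeepSpeed engine is still created the usual way. Here is a minimal sketch, with names taken from cifar10_deepspeed.py (args, net, parameters, trainset); dist_init_required=False is my assumption for telling DeepSpeed not to initialize the process group a second time:

import deepspeed

model_engine, optimizer, trainloader, _ = deepspeed.initialize(
    args=args,                     # includes --deepspeed_config ds_config.json
    model=net,
    model_parameters=parameters,
    training_data=trainset,
    dist_init_required=False)      # assumption: process group was created manually above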

And most importantly: you don't need to change your Dockerfile at all!

So that's it. That worked for me. I hope it will help someone.
