
Multi-node setup, host can't connect to its own provided IP address #647

Closed · neel04 opened this issue Aug 21, 2022 · 21 comments
@neel04 commented Aug 21, 2022

Hi 馃 I have a 2 Nodes, each of 8xA100 for a total of 16 GPUs. I'm utilizing SLURM for launching the jobs:
SLURM scripts for the curious: https://rentry.co/9geu8n

Here, the main script takes the 2 allotted nodes and runs srun over them, i.e. each node executes the Python file once.

Env

  • Accelerate version: 0.13.0.dev0
  • Platform: Linux-5.10.126-117.518.amzn2.x86_64-x86_64-with-glibc2.10
  • Python version: 3.8.13
  • Numpy version: 1.22.4
  • PyTorch version (GPU?): 1.13.0a0+08820cb (True)
  • Accelerate default config:
    Not found

Now, I noticed a peculiar behavior. When I'm on a single node (no SLURM, no multi-node, only multi-GPU) and run this:

accelerate launch --num_processes 8 --num_machines 1 --multi_gpu \
--mixed_precision fp16 --machine_rank 0 --main_process_ip 172.... --main_process_port 69420 \
\
scripts/...

The script won't run: the command simply exits, and I'm back at the command prompt again, with no stdout or stderr.

But with

accelerate launch --num_processes 8 --num_machines 1 --multi_gpu \
--mixed_precision fp16  \
\
scripts/torch_...

It works fine. The script runs on all 8 GPUs, and I can monitor the WandB logs.

This is a little quirk that puzzled me, and I can make neither head nor tail of it. I suspect it might mean something to someone here.


Multi-node training

For multi-node training, this is the Python script being executed: https://rentry.co/tz465

  • This script works correctly for multi-GPU cases, but NOT for multi-node

Most of it is standard snippets, but it may have some glaring flaw.

Output:

This is the output of the main sbatch script, which tells SLURM what to deploy:

Number of Nodes: 2
Name of all Hosts: gpu-st-p4d-24xlarge-60 gpu-st-p4d-24xlarge-61 # two nodes here, each 8xA100s
Master IP: 172.3.... # IP address of the main node
MASTER_PORT= 16543
ID: 0 # Each node reporting its RANK
ID: 1
NODE_COUNT=2 #number of nodes deployed

[18:14:34] WARNING  The following values were not passed to `accelerate launch` and had defaults used instead:                                             launch.py:838
                            `--num_cpu_threads_per_process` was set to `48` to improve out-of-box performance
                    To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[18:14:35] WARNING  The following values were not passed to `accelerate launch` and had defaults used instead:                                             launch.py:838
                            `--num_cpu_threads_per_process` was set to `48` to improve out-of-box performance
                    To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
{Waiting about 15 mins}

[E socket.cpp:858] [c10d] The client socket has timed out after 900s while trying to connect to (gpu-st-p4d-24xlarge-60, 16543).
[E socket.cpp:858] [c10d] The client socket has timed out after 900s while trying to connect to (gpu-st-p4d-24xlarge-60, 16543).

Trying random ports yields no results.

I think it might be connected to the problem described above. Does anyone have any idea?
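For anyone debugging the same timeout: this is the kind of minimal reachability check I'd run from a worker node before launching, with the hostname and port taken from the log above (just a sketch, not part of my training code). It should be run while the rank-0 launcher is already up and listening on that port:

import socket
import sys

MASTER_ADDR = "gpu-st-p4d-24xlarge-60"  # hostname from the log above
MASTER_PORT = 16543

try:
    # getaddrinfo mirrors the lookup c10d does before opening its client socket
    infos = socket.getaddrinfo(MASTER_ADDR, MASTER_PORT, type=socket.SOCK_STREAM)
    print("resolved to:", sorted({info[4][0] for info in infos}))
except socket.gaierror as err:
    sys.exit(f"cannot resolve {MASTER_ADDR}: {err}")

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    sock.settimeout(5)
    try:
        sock.connect((MASTER_ADDR, MASTER_PORT))
        print("TCP connect OK")
    except OSError as err:
        print("TCP connect failed:", err)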

@neel04 (Author) commented Aug 21, 2022

So I think I have a lead.
Command (Singularity is just the HPC version of Docker):

singularity exec --nv --cleanenv -B /fsx/awesome:/home/awesome torch_cuda_11_7.sif accelerate launch --num_processes=8 \
--num_machines 2 --multi_gpu --mixed_precision fp16 --main_process_port 4444 scripts/torch_convnext.py \
--....

output:

[22:45:53] WARNING  The following values were not passed to `accelerate launch` and had defaults used instead:                                             launch.py:838
                            `--num_cpu_threads_per_process` was set to `48` to improve out-of-box performance
                    To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[W socket.cpp:599] [c10d] The IPv6 network addresses of (None, 4444) cannot be retrieved (gai error: -2 - Name or service not known).

So it appears that when I set the main_process_ip flag on the rank-0 host (the main one) to its own IP address (since that is the main host), it tries to connect to itself and obviously times out as shown above.

But if I leave that field empty (as above), it doesn't work either. I feel that either I'm setting something up wrong (which may not be explicit in the docs) or this might happen to be a bug 🤔
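As a rough sketch of the workaround I'm leaning towards: instead of leaving main_process_ip empty, resolve the rank-0 hostname to an IP explicitly and pass that same value on every node. The hostname below is just the one from my logs, and rank0_hostname/main_process_ip are my own placeholder names:

import socket

rank0_hostname = "gpu-st-p4d-24xlarge-60"      # e.g. the first host in the SLURM node list
main_process_ip = socket.gethostbyname(rank0_hostname)
print("--main_process_ip", main_process_ip)    # pass this same value on every node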

@pacman100 (Collaborator)

Hello @neel04, the peculiar behaviour for the single-node multi-GPU setup has nothing to do with the accelerate or PyTorch launchers. It is the 69420 port number that you pass for main_process_port. When I changed that to a conventional port number such as 8888, things worked fine. Please retry with other ports and you will find the issue resolved.
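For reference, valid TCP ports only go up to 65535, so 69420 is out of range. A quick sketch (not part of accelerate itself; port_is_usable is just a hypothetical helper) to sanity-check a candidate port before launching:

import socket

def port_is_usable(port: int, host: str = "0.0.0.0") -> bool:
    if not 0 < port < 65536:        # TCP ports are 16-bit, so 69420 is out of range
        return False
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind((host, port)) # binds only if the port is free on this machine
            return True
        except OSError:
            return False

print(port_is_usable(69420))  # False: not a valid TCP port at all
print(port_is_usable(8888))   # True, as long as nothing else is listening there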

@neel04 changed the title from "Multi-node training does not work - The client socket has timed out after 900s while trying to connect" to "Multi-node setup, host can't connect to its own provided IP address" on Aug 22, 2022
@pacman100 (Collaborator)

With respect to multi-node multi-GPU using SLURM, a user seems to have provided an approach in this comment on a related issue: Using accelerate config for SLURM cluster with dist_url input · Issue #145 · huggingface/accelerate · GitHub

Here is the Gist from that comment: distributed dalle2 laion (github.com)

Could you please try it out and let us know if that works?

@neel04 (Author) commented Aug 22, 2022

@pacman100 Thank you for the reply! I had indeed tried conventional ports, but they don't seem to work either.

>>> accelerate launch ...
[E socket.cpp:793] [c10d] The client socket has timed out after 900s while trying to connect to (172.31.****, 8888).
>>>

I'm using the eth0 IP for my host, which looks like 172.31.****, and I can confirm I can ping it fine from both the host and the child machine.

As for SLURM, I'm using scripts from that same user 😄 You'll find they're remarkably similar to mine.

  • However, I'm NOT using SLURM right now for debugging; instead I'm running the commands on both machines individually and manually. Sorry for throwing that in to confuse things 😓

  • Additionally, I was using Singularity containers, but fearing network misconfiguration, I'm now on plain conda envs.

  • Here are the config YAMLs for both machines; they're the same except that the rank is updated: https://rentry.co/d4ev3

So far, having isolated all the factors, it simply seems accelerate is unable to connect. I'm new to multi-node distributed training, so if you have any obvious advice, please don't hesitate to point it out in case I haven't set things up correctly! 👍

EDIT: BTW, I've also tested accelerate test on both machines to ensure the installation is correct. Might not be relevant...
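For completeness, here is the kind of minimal c10d TCPStore smoke test I can run on both machines to isolate the rendezvous from accelerate entirely. This is just a sketch: the default IP and port are placeholders for my eth0 address and a free port, and IS_MASTER is my own env-var convention (set to 1 on the rank-0 host, 0 on the other node):

import os
from datetime import timedelta

from torch.distributed import TCPStore

master_addr = os.environ.get("MASTER_ADDR", "172.31.0.1")   # placeholder for my eth0 IP
master_port = int(os.environ.get("MASTER_PORT", "8888"))
is_master = os.environ.get("IS_MASTER", "0") == "1"         # 1 on the rank-0 host, 0 elsewhere

# world_size=2: one process per machine, just for the smoke test
store = TCPStore(master_addr, master_port, 2, is_master, timeout=timedelta(seconds=60))
store.set("hello_from_master" if is_master else "hello_from_worker", "ok")
# get() blocks until the master has set its key, so reaching the print means TCP rendezvous works
print("key exchange OK:", store.get("hello_from_master"))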

@pacman100 (Collaborator) commented Aug 22, 2022

@neel04, regarding single-node multi-GPU, I am then unable to reproduce the error; the below command works fine:

accelerate launch --num_processes 2 --num_machines 1 --multi_gpu --mixed_precision "fp16" --machine_rank 0 \
--main_process_ip "192.xxx.x.xx" --main_process_port 8888 accelerate/examples/nlp_example.py 

Regarding multi-node multi-GPU, do you observe issues when launching with torchrun or torch.distributed.launch? Could you try the below ways of launching and check?

Using torch.distributed.launch:

NCCL_DEBUG=INFO python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=$NODE_RANK --master_addr="192.xxx.x.xx" --master_port=52178 --use_env accelerate/examples/nlp_example.py

Using torchrun:

NCCL_DEBUG=INFO torchrun --nproc_per_node=2 --nnodes=2 --node_rank=$NODE_RANK --master_addr="192.xxx.x.xx" --master_port=52178 accelerate/examples/nlp_example.py
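For context, the script launched this way only needs the standard environment variables that both launchers export. A bare sketch of what each launched process sees and does (the real example script handles the equivalent through Accelerate/DDP internally):

import os

import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun (and by --use_env)
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

torch.cuda.set_device(local_rank)
# MASTER_ADDR / MASTER_PORT are exported by the launcher, so env:// needs nothing else
dist.init_process_group(backend="nccl", init_method="env://")
print(f"rank {rank}/{world_size} on local GPU {local_rank} initialised")
dist.barrier()
dist.destroy_process_group()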

@neel04 (Author) commented Aug 22, 2022

An interesting bag of results. Using the new torch.distributed.launch commands, the first one half-works: it complains about local_rank, but it waits at the ***** Setting part unless I run the same command on the second machine, which implies there is at least some inter-node communication.

I feel the error could be resolved with some effort; I'll update later on :)

The second command seems to work quite well 👌 I wasn't able to train more than a couple of steps (pre-emption), but the synchronized initial loss leads me to believe that at least the parameters synced initially, and since training worked, inter-node comms are working.

So there appears to be some problem in accelerate's multi-node networking. While torchrun works, I think I'd need to add AMP to my setup for fp16. I'd still love to get to the core of this issue so that future users have no problems, and as such I'm up for further debugging and testing on my side 🤗 Let me know if there's anything further you'd like me to try to help triage the bug! 🚀
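Regarding the fp16 point above, this is roughly the torch.cuda.amp boilerplate I'd have to bolt onto the plain DDP/torchrun path. Just a sketch: model, optimizer, criterion, batch and targets are placeholders standing in for my actual training loop:

import torch
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()

def train_step(model, optimizer, criterion, batch, targets):
    optimizer.zero_grad(set_to_none=True)
    with autocast():                      # forward pass runs in fp16 where it is safe
        loss = criterion(model(batch), targets)
    scaler.scale(loss).backward()         # scale the loss so fp16 grads don't underflow
    scaler.step(optimizer)                # unscales grads and skips the step on inf/NaN
    scaler.update()
    return loss.detach()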

I've put the error traceback for the first command just in case, though I'm pretty sure I can get it to work.

Error @ command - 1 [Main Host]

FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
usage: torch_convnext.py [-h] [--model_name MODEL_NAME] [--batch_size BATCH_SIZE] [--pretrained PRETRAINED] [--epochs EPOCHS] [--lr LR] [--optimizer OPTIMIZER]
                         [--log_frequency LOG_FREQUENCY] [--val_frequency VAL_FREQUENCY] [--input_shape INPUT_SHAPE] [--weight_decay WEIGHT_DECAY]
                         [--group_name GROUP_NAME]
torch_convnext.py: error: unrecognized arguments: --local_rank=0
usage: torch_convnext.py [-h] [--model_name MODEL_NAME] [--batch_size BATCH_SIZE] [--pretrained PRETRAINED] [--epochs EPOCHS] [--lr LR] [--optimizer OPTIMIZER]
                         [--log_frequency LOG_FREQUENCY] [--val_frequency VAL_FREQUENCY] [--input_shape INPUT_SHAPE] [--weight_decay WEIGHT_DECAY]
                         [--group_name GROUP_NAME]
torch_convnext.py: error: unrecognized arguments: --local_rank=1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 5445) of binary: /home/awesome/awesome/anaconda3/envs/SUMO_dist/bin/python
Traceback (most recent call last):
  File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts/torch_convnext.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-08-22_19:41:49
  host      : gpu-st-p4d-24xlarge-433.hpc-1click-prod450.pcluster
  rank      : 1 (local_rank: 1)
  exitcode  : 2 (pid: 5446)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-08-22_19:41:49
  host      : gpu-st-p4d-24xlarge-433.hpc-1click-prod450.pcluster
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 5445)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
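The unrecognized --local_rank error above is just my argparse setup. Something roughly like this should make the script tolerate both launchers; this is only a sketch of the relevant part of torch_convnext.py, not the actual file:

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--model_name", default="convnext_base_in22k")
# ... the rest of my existing arguments ...
# Accept the flag torch.distributed.launch injects, defaulting to the env var torchrun sets
parser.add_argument("--local_rank", type=int,
                    default=int(os.environ.get("LOCAL_RANK", 0)))
args, _unknown = parser.parse_known_args()   # tolerate any other launcher-added flags
local_rank = args.local_rank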

@muellerzr (Collaborator)

@neel04 we should have a solution here soon for you to try. Thanks for the clear bug report!

@muellerzr (Collaborator) commented Sep 2, 2022

@neel04 could you try this again, and install accelerate via:

pip install git+https://github.com/huggingface/accelerate@rdvz-fix

Thanks!

Note: You will need to run accelerate config again

@neel04 (Author) commented Sep 2, 2022

@muellerzr you might've forgotten to make the git branch public 😄

@muellerzr (Collaborator)

@neel04 actually it's now in main 😅 so just do pip install git+https://github.com/huggingface/accelerate!

@neel04 (Author) commented Sep 3, 2022

I tried training on 2 nodes (16 GPUs in total), and it seems to be training well.

accelerate launch --config_file './root_config.yaml' scripts/torch_convnext.py --model_name='convnext_base_in22k' --num_workers 2 --lr=6e-5 --optimizer='AdamW' --weight_decay=0.002 --group_name='ConvNext_Baselines'

But it's terribly slow: it takes three times as long as vanilla DDP. More importantly, at 3.1 s/step I'd expect a runtime of roughly 50 minutes for 1000 steps, while the wall time of the run is an exorbitant 3 hours.

Code & Configs: https://rentry.co/xxh9s

I haven't changed much (I kept a separate copy for accelerate), and these slowdowns are hard to understand. Assuming the dataloader isn't to blame (it's fast with DDP), and given the spiky GPU utilization, I can only guess that syncing between nodes is taking more time than expected and keeping the GPUs waiting. I swapped the rdzv backend from static to c10d (which DDP used); no improvement.

I'll try more tests and fixes, but if there's anything obvious I might've missed, do let me know.
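For what it's worth, this is roughly how I plan to separate dataloader wait time from compute plus gradient-sync time in the accelerate version. Only a sketch: accelerator, model, optimizer and train_loader are assumed to come from my actual training script:

import time

import torch
import torch.nn.functional as F

data_time, compute_time = 0.0, 0.0
end = time.perf_counter()
for step, (images, labels) in enumerate(train_loader, start=1):
    t0 = time.perf_counter()
    data_time += t0 - end                  # time spent waiting on the dataloader

    logits = model(images)
    loss = F.cross_entropy(logits, labels)
    accelerator.backward(loss)             # gradient all-reduce across nodes happens here
    optimizer.step()
    optimizer.zero_grad()

    torch.cuda.synchronize()               # make the step timing honest
    end = time.perf_counter()
    compute_time += end - t0
    if step % 50 == 0:
        print(f"avg data wait {data_time / step:.3f}s | avg compute+sync {compute_time / step:.3f}s")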

@pacman100 (Collaborator) commented Sep 3, 2022

Hello, what do you mean by vanilla DDP? Do you mean multi-GPU training instead of multi-node multi-GPU training? Given your observation of spiky GPU utilisation, communication between the nodes is likely the reason behind the slowdown. How are the nodes interconnected? For maximum speed, nodes need to be interconnected using InfiniBand.

@neel04 (Author) commented Sep 3, 2022

@pacman100 This is all multi-node. I meant that using torch.distributed's DDP with Elastic/torchrun is much faster than Accelerate. As I understand it, the nodes on my HPC are connected via Ethernet, which NCCL uses for Elastic. In that case, should the same_network arg in the config be set to true, especially since I'm using a private IP address for Accelerate?
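Related to the Ethernet point, I've also been thinking of pinning the communication backends to eth0 explicitly at the top of the script, roughly like this (a sketch; eth0 is just what my nodes happen to use):

import os

# Must run before Accelerator() / init_process_group so NCCL picks it up at init time
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")
os.environ.setdefault("GLOO_SOCKET_IFNAME", "eth0")
os.environ.setdefault("NCCL_DEBUG", "INFO")   # logs which interface/transport NCCL chose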

@muellerzr (Collaborator)

@neel04 what is your torch distributed command that achieves the speedup? (A bit unclear in your messages)

@neel04 (Author) commented Sep 3, 2022

torchrun --nproc_per_node=6 --nnodes=$COUNT_NODE --node_rank=$THEID --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT  \
scripts/ddp_convnext.py --model_name ....

@pacman100 (Collaborator)

@neel04, can you try with same_network being True?

@neel04 (Author) commented Sep 3, 2022

I have, there seems to be no difference.
Code & configs: https://rentry.co/xxh9s

@pacman100 (Collaborator)

> I have, there seems to be no difference.
> Code & configs: https://rentry.co/xxh9s

In the above config, it is false.

@neel04 (Author) commented Sep 3, 2022

Right, I changed the rdzv backend to static and same_network to true:

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
fsdp_config: {}
machine_rank: 0
main_process_ip: 172.31.231.142 
main_process_port: 8887
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 16
rdzv_backend: static
same_network: true
use_cpu: false

@github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

The github-actions bot closed this as completed on Oct 5, 2022.
@flckv commented May 27, 2023

@pacman100 @muellerzr Is the accelerate command updated now to do what torchrun or torch.distributed.launch do? Is accelerate still slow?
