
Multi-node setup, host can't connect to its own provided IP address #647

Closed · neel04 opened this issue Aug 21, 2022 · 21 comments
@neel04 commented Aug 21, 2022

Hi 馃 I have a 2 Nodes, each of 8xA100 for a total of 16 GPUs. I'm utilizing SLURM for launching the jobs:
SLURM scripts for the curious: https://rentry.co/9geu8n

Here, the main script takes the 2 allotted nodes and runs srun over them, i.e. each node executes the Python file once.

Env

  • Accelerate version: 0.13.0.dev0
  • Platform: Linux-5.10.126-117.518.amzn2.x86_64-x86_64-with-glibc2.10
  • Python version: 3.8.13
  • Numpy version: 1.22.4
  • PyTorch version (GPU?): 1.13.0a0+08820cb (True)
  • Accelerate default config:
    Not found

Now, I noticed a peculiar behavior. When I'm on a single node (no SLURM, no multi-node, only multi-GPU) and run this:

accelerate launch --num_processes 8 --num_machines 1 --multi_gpu \
--mixed_precision fp16 --machine_rank 0 --main_process_ip 172.... --main_process_port 69420 \
\
scripts/...

The script won't run: the command simply exits, and I'm back at the command prompt again, with no stdout or stderr.

But with

accelerate launch --num_processes 8 --num_machines 1 --multi_gpu \
--mixed_precision fp16  \
\
scripts/torch_...

It works fine. The script runs on all 8 GPUs, and I can monitor the WandB logs.

This is a little quirk that puzzled me, and I can make neither head nor tail of it. I suspect it might mean something to someone here.


Multi-node training

For multi-node training, this is the Python script being executed: https://rentry.co/tz465

  • This script works correctly for multi-GPU cases, but NOT for multi-node

Most of it is standard snippets, but it may have some glaring flaw.

Output:

This is the output of the main sbatch script, which tells SLURM what to deploy:

Number of Nodes: 2
Name of all Hosts: gpu-st-p4d-24xlarge-60 gpu-st-p4d-24xlarge-61 # two nodes here, each 8xA100s
Master IP: 172.3.... # IP address of the main node
MASTER_PORT= 16543
ID: 0 # Each node reporting its RANK
ID: 1
NODE_COUNT=2 #number of nodes deployed

[18:14:34] WARNING  The following values were not passed to `accelerate launch` and had defaults used instead:                                             launch.py:838
                            `--num_cpu_threads_per_process` was set to `48` to improve out-of-box performance
                    To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[18:14:35] WARNING  The following values were not passed to `accelerate launch` and had defaults used instead:                                             launch.py:838
                            `--num_cpu_threads_per_process` was set to `48` to improve out-of-box performance
                    To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
{Waiting about 15 mins}

[E socket.cpp:858] [c10d] The client socket has timed out after 900s while trying to connect to (gpu-st-p4d-24xlarge-60, 16543).
[E socket.cpp:858] [c10d] The client socket has timed out after 900s while trying to connect to (gpu-st-p4d-24xlarge-60, 16543).

Trying random ports yields no results.

I think it might be connected to the problem described above. Does anyone have any idea?
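For anyone debugging the same timeout: this is the kind of minimal reachability check I'd run from a worker node before launching, with the hostname and port taken from the log above (just a sketch, not part of my training code). It should be run while the rank-0 launcher is already up and listening on that port:

import socket
import sys

MASTER_ADDR = "gpu-st-p4d-24xlarge-60"  # hostname from the log above
MASTER_PORT = 16543

try:
    # getaddrinfo mirrors the lookup c10d does before opening its client socket
    infos = socket.getaddrinfo(MASTER_ADDR, MASTER_PORT, type=socket.SOCK_STREAM)
    print("resolved to:", sorted({info[4][0] for info in infos}))
except socket.gaierror as err:
    sys.exit(f"cannot resolve {MASTER_ADDR}: {err}")

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    sock.settimeout(5)
    try:
        sock.connect((MASTER_ADDR, MASTER_PORT))
        print("TCP connect OK")
    except OSError as err:
        print("TCP connect failed:", err)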

@neel04 (Author) commented Aug 21, 2022

So I think I have a lead.
Command (Singularity is just the HPC version of Docker):

singularity exec --nv --cleanenv -B /fsx/awesome:/home/awesome torch_cuda_11_7.sif accelerate launch --num_processes=8 \
--num_machines 2 --multi_gpu --mixed_precision fp16 --main_process_port 4444 scripts/torch_convnext.py \
--....

output:

[22:45:53] WARNING  The following values were not passed to `accelerate launch` and had defaults used instead:                                             launch.py:838
                            `--num_cpu_threads_per_process` was set to `48` to improve out-of-box performance
                    To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[W socket.cpp:599] [c10d] The IPv6 network addresses of (None, 4444) cannot be retrieved (gai error: -2 - Name or service not known).

So it appears that when I set the main_process_ip flag on the rank-0 host (the main one) to its own IP address (since that is the main host), it tries to connect to itself and obviously times out as shown above.

But if I leave that field empty (as above), it doesn't work either. I feel that either I'm setting something up wrong (which may not be explicit in the docs) or this might happen to be a bug 🤔
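As a rough sketch of the workaround I'm leaning towards: instead of leaving main_process_ip empty, resolve the rank-0 hostname to an IP explicitly and pass that same value on every node. The hostname below is just the one from my logs, and rank0_hostname/main_process_ip are my own placeholder names:

import socket

rank0_hostname = "gpu-st-p4d-24xlarge-60"      # e.g. the first host in the SLURM node list
main_process_ip = socket.gethostbyname(rank0_hostname)
print("--main_process_ip", main_process_ip)    # pass this same value on every node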

@pacman100 (Collaborator)

Hello @neel04, the peculiar behaviour for the single-node multi-GPU setup has nothing to do with the accelerate or PyTorch launchers. It is the 69420 port number that you pass for main_process_port. When I changed that to a conventional port number such as 8888, things worked fine. Please retry with other ports and you will find the issue resolved.
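For reference, valid TCP ports only go up to 65535, so 69420 is out of range. A quick sketch (not part of accelerate itself; port_is_usable is just a hypothetical helper) to sanity-check a candidate port before launching:

import socket

def port_is_usable(port: int, host: str = "0.0.0.0") -> bool:
    if not 0 < port < 65536:        # TCP ports are 16-bit, so 69420 is out of range
        return False
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind((host, port)) # binds only if the port is free on this machine
            return True
        except OSError:
            return False

print(port_is_usable(69420))  # False: not a valid TCP port at all
print(port_is_usable(8888))   # True, as long as nothing else is listening there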

@neel04 changed the title from "Multi-node training does not work - The client socket has timed out after 900s while trying to connect" to "Multi-node setup, host can't connect to its own provided IP address" on Aug 22, 2022
@pacman100 (Collaborator)

With respect to multi-node multi-GPU using SLURM, a user seems to have provided an approach in this comment on a related issue: Using accelerate config for SLURM cluster with dist_url input · Issue #145 · huggingface/accelerate · GitHub

Here is the Gist from that comment: distributed dalle2 laion (github.com)

Could you please try it out and let us know if that works?

@neel04 (Author) commented Aug 22, 2022

@pacman100 Thank you for the reply! I had indeed tried conventional ports, but they don't seem to work either.

>>> accelerate launch ...
[E socket.cpp:793] [c10d] The client socket has timed out after 900s while trying to connect to (172.31.****, 8888).
>>>

I'm using the eth0 IP for my host, which looks like 172.31.****, and I can confirm I can ping it fine from both the host and the child machine.

As for SLURM, I'm using scripts from that same user 😄 You'll find they're remarkably similar to mine.

  • However, I'm NOT using SLURM right now for debugging; instead I'm running the commands on both machines individually and manually. Sorry for throwing that in to confuse things 😓

  • Additionally, I was using Singularity containers, but fearing network misconfiguration, I'm now on plain conda envs.

  • Here are the config YAMLs for both machines; they're the same except that the rank is updated: https://rentry.co/d4ev3

So far, having isolated all the factors, it simply seems accelerate is unable to connect. I'm new to multi-node distributed training, so if you have any obvious advice, please don't hesitate to point it out in case I haven't set things up correctly! 👍

EDIT: BTW, I've also tested accelerate test on both machines to ensure the installation is correct. Might not be relevant...
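For completeness, here is the kind of minimal c10d TCPStore smoke test I can run on both machines to isolate the rendezvous from accelerate entirely. This is just a sketch: the default IP and port are placeholders for my eth0 address and a free port, and IS_MASTER is my own env-var convention (set to 1 on the rank-0 host, 0 on the other node):

import os
from datetime import timedelta

from torch.distributed import TCPStore

master_addr = os.environ.get("MASTER_ADDR", "172.31.0.1")   # placeholder for my eth0 IP
master_port = int(os.environ.get("MASTER_PORT", "8888"))
is_master = os.environ.get("IS_MASTER", "0") == "1"         # 1 on the rank-0 host, 0 elsewhere

# world_size=2: one process per machine, just for the smoke test
store = TCPStore(master_addr, master_port, 2, is_master, timeout=timedelta(seconds=60))
store.set("hello_from_master" if is_master else "hello_from_worker", "ok")
# get() blocks until the master has set its key, so reaching the print means TCP rendezvous works
print("key exchange OK:", store.get("hello_from_master"))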

@pacman100 (Collaborator) commented Aug 22, 2022

@neel04, regarding single-node multi-GPU, I am then unable to reproduce the error; the below command works fine:

accelerate launch --num_processes 2 --num_machines 1 --multi_gpu --mixed_precision "fp16" --machine_rank 0 \
--main_process_ip "192.xxx.x.xx" --main_process_port 8888 accelerate/examples/nlp_example.py 

Regarding multi-node multi-GPU, do you observe issues when launching with torchrun or torch.distributed.launch? Could you try the below ways of launching and check?

Using torch.distributed.launch:

NCCL_DEBUG=INFO python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=$NODE_RANK --master_addr="192.xxx.x.xx" --master_port=52178 --use_env accelerate/examples/nlp_example.py

Using torchrun:

NCCL_DEBUG=INFO torchrun --nproc_per_node=2 --nnodes=2 --node_rank=$NODE_RANK --master_addr="192.xxx.x.xx" --master_port=52178 accelerate/examples/nlp_example.py
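For context, the script launched this way only needs the standard environment variables that both launchers export. A bare sketch of what each launched process sees and does (the real example script handles the equivalent through Accelerate/DDP internally):

import os

import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun (and by --use_env)
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

torch.cuda.set_device(local_rank)
# MASTER_ADDR / MASTER_PORT are exported by the launcher, so env:// needs nothing else
dist.init_process_group(backend="nccl", init_method="env://")
print(f"rank {rank}/{world_size} on local GPU {local_rank} initialised")
dist.barrier()
dist.destroy_process_group()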

@neel04 (Author) commented Aug 22, 2022

An interesting bag of results. Using the new torch.distributed.launch commands, the first one half-works: it complains about local_rank, but it waits at the ***** Setting part unless I run the same command on the second machine, which implies there is at least some inter-node communication.

I feel the error could be resolved with some effort; I'll update later on :)

The second command seems to work quite well 👌 I wasn't able to train more than a couple of steps (pre-emption), but the synchronized initial loss leads me to believe that at least the parameters synced initially, and since training worked, inter-node comms are working.

So there appears to be some problem in accelerate's multi-node networking. While torchrun works, I think I'd need to add AMP to my setup for fp16. I'd still love to get to the core of this issue so that future users have no problems, and as such I'm up for further debugging and testing on my side 🤗 Let me know if there's anything further you'd like me to try to help triage the bug! 🚀
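Regarding the fp16 point above, this is roughly the torch.cuda.amp boilerplate I'd have to bolt onto the plain DDP/torchrun path. Just a sketch: model, optimizer, criterion, batch and targets are placeholders standing in for my actual training loop:

import torch
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()

def train_step(model, optimizer, criterion, batch, targets):
    optimizer.zero_grad(set_to_none=True)
    with autocast():                      # forward pass runs in fp16 where it is safe
        loss = criterion(model(batch), targets)
    scaler.scale(loss).backward()         # scale the loss so fp16 grads don't underflow
    scaler.step(optimizer)                # unscales grads and skips the step on inf/NaN
    scaler.update()
    return loss.detach()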

I've put the error traceback for the first command just in case, though I'm pretty sure I can get it to work.

Error @ command - 1 [Main Host]

FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
usage: torch_convnext.py [-h] [--model_name MODEL_NAME] [--batch_size BATCH_SIZE] [--pretrained PRETRAINED] [--epochs EPOCHS] [--lr LR] [--optimizer OPTIMIZER]
                         [--log_frequency LOG_FREQUENCY] [--val_frequency VAL_FREQUENCY] [--input_shape INPUT_SHAPE] [--weight_decay WEIGHT_DECAY]
                         [--group_name GROUP_NAME]
torch_convnext.py: error: unrecognized arguments: --local_rank=0
usage: torch_convnext.py [-h] [--model_name MODEL_NAME] [--batch_size BATCH_SIZE] [--pretrained PRETRAINED] [--epochs EPOCHS] [--lr LR] [--optimizer OPTIMIZER]
                         [--log_frequency LOG_FREQUENCY] [--val_frequency VAL_FREQUENCY] [--input_shape INPUT_SHAPE] [--weight_decay WEIGHT_DECAY]
                         [--group_name GROUP_NAME]
torch_convnext.py: error: unrecognized arguments: --local_rank=1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 5445) of binary: /home/awesome/awesome/anaconda3/envs/SUMO_dist/bin/python
Traceback (most recent call last):
  File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts/torch_convnext.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-08-22_19:41:49
  host      : gpu-st-p4d-24xlarge-433.hpc-1click-prod450.pcluster
  rank      : 1 (local_rank: 1)
  exitcode  : 2 (pid: 5446)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-08-22_19:41:49
  host      : gpu-st-p4d-24xlarge-433.hpc-1click-prod450.pcluster
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 5445)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
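The unrecognized --local_rank error above is just my argparse setup. Something roughly like this should make the script tolerate both launchers; this is only a sketch of the relevant part of torch_convnext.py, not the actual file:

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--model_name", default="convnext_base_in22k")
# ... the rest of my existing arguments ...
# Accept the flag torch.distributed.launch injects, defaulting to the env var torchrun sets
parser.add_argument("--local_rank", type=int,
                    default=int(os.environ.get("LOCAL_RANK", 0)))
args, _unknown = parser.parse_known_args()   # tolerate any other launcher-added flags
local_rank = args.local_rank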

@muellerzr (Collaborator)

@neel04 we should have a solution here soon for you to try. Thanks for the clear bug report!

@muellerzr (Collaborator) commented Sep 2, 2022

@neel04 could you try this again, and install accelerate via:

pip install git+https://github.com/huggingface/accelerate@rdvz-fix

Thanks!

Note: You will need to run accelerate config again

@neel04 (Author) commented Sep 2, 2022

@muellerzr you might've forgotten to make the git branch public 😄

@muellerzr (Collaborator)

@neel04 actually it's now in main 😅 so just do pip install git+https://github.com/huggingface/accelerate!

@neel04 (Author) commented Sep 3, 2022

I tried training on 2 nodes (16 GPUs in total), and it seems to be training well.

accelerate launch --config_file './root_config.yaml' scripts/torch_convnext.py --model_name='convnext_base_in22k' --num_workers 2 --lr=6e-5 --optimizer='AdamW' --weight_decay=0.002 --group_name='ConvNext_Baselines'

But it's terribly slow: it takes three times as long as vanilla DDP. More importantly, at 3.1 s/step I'd expect a runtime of roughly 50 minutes for 1000 steps, while the wall time of the run is an exorbitant 3 hours.

Code & Configs: https://rentry.co/xxh9s

I haven't changed much (I kept a separate copy for accelerate), and these slowdowns are hard to understand. Assuming the dataloader isn't to blame (it's fast with DDP), and given the spiky GPU utilization, I can only guess that syncing between nodes is taking more time than expected and keeping the GPUs waiting. I swapped the rdzv backend from static to c10d (which DDP used); no improvement.

I'll try more tests and fixes, but if there's anything obvious I might've missed, do let me know.
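For what it's worth, this is roughly how I plan to separate dataloader wait time from compute plus gradient-sync time in the accelerate version. Only a sketch: accelerator, model, optimizer and train_loader are assumed to come from my actual training script:

import time

import torch
import torch.nn.functional as F

data_time, compute_time = 0.0, 0.0
end = time.perf_counter()
for step, (images, labels) in enumerate(train_loader, start=1):
    t0 = time.perf_counter()
    data_time += t0 - end                  # time spent waiting on the dataloader

    logits = model(images)
    loss = F.cross_entropy(logits, labels)
    accelerator.backward(loss)             # gradient all-reduce across nodes happens here
    optimizer.step()
    optimizer.zero_grad()

    torch.cuda.synchronize()               # make the step timing honest
    end = time.perf_counter()
    compute_time += end - t0
    if step % 50 == 0:
        print(f"avg data wait {data_time / step:.3f}s | avg compute+sync {compute_time / step:.3f}s")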

@pacman100 (Collaborator) commented Sep 3, 2022

Hello, what do you mean by vanilla DDP? Do you mean multi-GPU training instead of multi-node multi-GPU training? Given your observation of spiky GPU utilisation, communication between the nodes is likely the reason behind the slowdown. How are the nodes interconnected? For maximum speed, nodes need to be interconnected using InfiniBand.

@neel04 (Author) commented Sep 3, 2022

@pacman100 This is all multi-node. I meant that using torch.distributed's DDP with Elastic/torchrun is much faster than Accelerate. As I understand it, the nodes on my HPC are connected via Ethernet, which NCCL uses for Elastic. In that case, should the same_network arg in the config be set to true, especially since I'm using a private IP address for Accelerate?
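Related to the Ethernet point, I've also been thinking of pinning the communication backends to eth0 explicitly at the top of the script, roughly like this (a sketch; eth0 is just what my nodes happen to use):

import os

# Must run before Accelerator() / init_process_group so NCCL picks it up at init time
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")
os.environ.setdefault("GLOO_SOCKET_IFNAME", "eth0")
os.environ.setdefault("NCCL_DEBUG", "INFO")   # logs which interface/transport NCCL chose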

@muellerzr (Collaborator)

@neel04 what is your torch distributed command that achieves the speedup? (A bit unclear in your messages)

@neel04 (Author) commented Sep 3, 2022

torchrun --nproc_per_node=6 --nnodes=$COUNT_NODE --node_rank=$THEID --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT  \
scripts/ddp_convnext.py --model_name ....

@pacman100 (Collaborator)

@neel04, can you try with same_network being True?

@neel04 (Author) commented Sep 3, 2022

I have, there seems to be no difference.
Code & configs: https://rentry.co/xxh9s

@pacman100 (Collaborator)

> I have, there seems to be no difference.
> Code & configs: https://rentry.co/xxh9s

In the above config, it is false.

@neel04 (Author) commented Sep 3, 2022

Right, I changed the rdzv backend to static and same_network to true:

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
fsdp_config: {}
machine_rank: 0
main_process_ip: 172.31.231.142 
main_process_port: 8887
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 16
rdzv_backend: static
same_network: true
use_cpu: false

@github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

The github-actions bot closed this as completed on Oct 5, 2022.
@flckv commented May 27, 2023

@pacman100 @muellerzr Is the accelerate command updated now to do what torchrun or torch.distributed.launch do? Is accelerate still slow?
