Multi-node setup, host can't connect to its own provided IP address #647
So I think I have a lead:

singularity exec --nv --cleanenv -B /fsx/awesome:/home/awesome torch_cuda_11_7.sif accelerate launch --num_processes=8 \
--num_machines 2 --multi_gpu --mixed_precision fp16 --main_process_port 4444 scripts/torch_convnext.py \
--....

output:

[22:45:53] WARNING The following values were not passed to `accelerate launch` and had defaults used instead: launch.py:838
`--num_cpu_threads_per_process` was set to `48` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[W socket.cpp:599] [c10d] The IPv6 network addresses of (None, 4444) cannot be retrieved (gai error: -2 - Name or service not known).

So it appears when I add
But if I leave that field empty (above) it doesn't work. I feel either that I may be setting something wrong, which may not be explicitly in the docs, or it might happen to be a bug 🤔 |
Hello @neel04, the peculiar behaviour for single-node multi-gpu setup has nothing to do with accelerate or pytorch launchers. It is the |
With respect to multi-node multi-gpu using Slurm, a user seems to have provided some approach in this comment on a related issue: Using accelerate config for SLURM cluster with dist_url input · Issue #145 · huggingface/accelerate · GitHub. Here is the Git Gist from that comment: distributed dalle2 laion (github.com). Could you please try it out and let us know if that works? |
@pacman100 Thank you for the reply! I'd indeed tried conventional ports, but they don't seem to work very well either.

>>> accelerate launch ...
[E socket.cpp:793] [c10d] The client socket has timed out after 900s while trying to connect to (172.31.****, 8888).
>>>

I'm using the
As for the SLURM, I'm using scripts from the same user 😄 You'll find they're remarkably similar to mine.
So far, isolating all factors - it simply seems
EDIT: BTW, I've also tested |
@neel04, wrt single-node multi-gpu, I am unable to reproduce the error then. The below command works fine:

accelerate launch --num_processes 2 --num_machines 1 --multi_gpu --mixed_precision "fp16" --machine_rank 0 \
--main_process_ip "192.xxx.x.xx" --main_process_port 8888 accelerate/examples/nlp_example.py

wrt multi-node multi-gpu, do you observe issues when launching using:

NCCL_DEBUG=INFO python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=$NODE_RANK --master_addr="192.xxx.x.xx" --master_port=52178 --use_env accelerate/examples/nlp_example.py

NCCL_DEBUG=INFO torchrun --nproc_per_node=2 --nnodes=2 --node_rank=$NODE_RANK --master_addr="192.xxx.x.xx" --master_port=52178 accelerate/examples/nlp_example.py |
An interesting bag of results. Using the new
I feel the error could be resolved after some effort, for which I will update later on :)

The second command seems to work quite well 👌 I wasn't able to train more than a couple steps (pre-emption), but the synchronized initial loss leads me to believe that at least the parameters synced initially - and since training worked, inter-node comms are working. So it appears some problem in

I've put the error traceback for the first command just in case, though I'm pretty sure I can get it to work.

Error @ command - 1 [Main Host]

FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
usage: torch_convnext.py [-h] [--model_name MODEL_NAME] [--batch_size BATCH_SIZE] [--pretrained PRETRAINED] [--epochs EPOCHS] [--lr LR] [--optimizer OPTIMIZER]
[--log_frequency LOG_FREQUENCY] [--val_frequency VAL_FREQUENCY] [--input_shape INPUT_SHAPE] [--weight_decay WEIGHT_DECAY]
[--group_name GROUP_NAME]
torch_convnext.py: error: unrecognized arguments: --local_rank=0
usage: torch_convnext.py [-h] [--model_name MODEL_NAME] [--batch_size BATCH_SIZE] [--pretrained PRETRAINED] [--epochs EPOCHS] [--lr LR] [--optimizer OPTIMIZER]
[--log_frequency LOG_FREQUENCY] [--val_frequency VAL_FREQUENCY] [--input_shape INPUT_SHAPE] [--weight_decay WEIGHT_DECAY]
[--group_name GROUP_NAME]
torch_convnext.py: error: unrecognized arguments: --local_rank=1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 5445) of binary: /home/awesome/awesome/anaconda3/envs/SUMO_dist/bin/python
Traceback (most recent call last):
File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/awesome/awesome/anaconda3/envs/SUMO_dist/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts/torch_convnext.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2022-08-22_19:41:49
host : gpu-st-p4d-24xlarge-433.hpc-1click-prod450.pcluster
rank : 1 (local_rank: 1)
exitcode : 2 (pid: 5446)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-08-22_19:41:49
host : gpu-st-p4d-24xlarge-433.hpc-1click-prod450.pcluster
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 5445)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================ |
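For context, the exitcode-2 failures above come from torch_convnext.py's argument parser rejecting the --local_rank=N flag that the launcher appends, exactly as the deprecation notice suggests. A minimal sketch of how the parser could tolerate it; the script's real parser is not shown in this thread, so the option names below are only illustrative:

```python
# Illustrative sketch only: torch_convnext.py's actual parser is not shown in the thread.
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--model_name", default="convnext_base_in22k")  # assumed option

# torch.distributed.launch injects --local_rank=N into argv (the flag the real parser
# rejected above); accept it, or fall back to the LOCAL_RANK environment variable
# that torchrun / accelerate launch set instead.
parser.add_argument("--local_rank", type=int, default=-1)
args, _unknown = parser.parse_known_args()

local_rank = args.local_rank
if local_rank == -1:
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
print(f"running as local_rank={local_rank}")
```

With the flag accepted (or read from os.environ["LOCAL_RANK"], as the deprecation notice recommends), the same script should run under both the deprecated launcher and torchrun.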
@neel04 should have a solution here soon for you to try. Thanks for the clear bug report! |
@neel04 could you try this again, and install accelerate via:

pip install git+https://github.com/huggingface/accelerate@rdvz-fix

Thanks! Note: You will need to run |
@muellerzr you might've forgotten to make the git branch public 😄 |
@neel04 actually it's now in main 😅 so just do |
I tried training on

accelerate launch --config_file './root_config.yaml' scripts/torch_convnext.py --model_name='convnext_base_in22k' --num_workers 2 --lr=6e-5 --optimizer='AdamW' --weight_decay=0.002 --group_name='ConvNext_Baselines'

But it's terribly slow. It takes thrice the time vs vanilla DDP. But more importantly, if it takes

Code & Configs: https://rentry.co/xxh9s

I haven't changed much (I kept a separate copy for

I'll try more tests and fixes, but if there's anything obvious I might've missed - do let me know. |
Hello, what do you mean by Vanilla DDP? Do you mean multi-gpu training instead of multi-node multi-gpu training? As per your information on spiky GPU utilisation, communication between nodes is likely the reason behind the slowdown. How are the nodes interconnected? For maximum speeds, nodes need to be interconnected using InfiniBand. |
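One rough way to check whether inter-node communication is the bottleneck is to time a large all_reduce across the two nodes. The snippet below is only an illustrative sketch (launched on each node the same way as the training commands above, e.g. with torchrun), not one of the scripts from this thread:

```python
# Rough inter-node bandwidth probe (illustrative sketch, not a script from this thread).
# Launch on every node, e.g.: torchrun --nproc_per_node=8 --nnodes=2 ... bandwidth_probe.py
import os
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # RANK/WORLD_SIZE/MASTER_ADDR come from the launcher
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# 256 MB fp32 payload; an all_reduce across 2 nodes forces traffic over the interconnect.
payload = torch.randn(64 * 1024 * 1024, device="cuda")

for _ in range(3):  # warm-up so NCCL builds its communicators before timing
    dist.all_reduce(payload)
torch.cuda.synchronize()

iters = 10
start = time.time()
for _ in range(iters):
    dist.all_reduce(payload)
torch.cuda.synchronize()
per_iter = (time.time() - start) / iters

if dist.get_rank() == 0:
    size_gb = payload.numel() * payload.element_size() / 1e9
    print(f"all_reduce of {size_gb:.2f} GB took {per_iter * 1000:.1f} ms per iteration")

dist.destroy_process_group()
```

Running it with NCCL_DEBUG=INFO (as in the commands earlier in the thread) also shows in the log which transport NCCL actually picked for the inter-node traffic.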
@pacman100 This is all multi-node. I meant that using |
@neel04 what is your torch distributed command that achieves the speedup? (A bit unclear in your messages) |
torchrun --nproc_per_node=6 --nnodes=$COUNT_NODE --node_rank=$THEID --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT \
scripts/ddp_convnext.py --model_name .... |
@neel04, can you try with |
I have, there seems to be no difference. |
In the above config, it is false |
right, I changed backend to

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
fsdp_config: {}
machine_rank: 0
main_process_ip: 172.31.231.142
main_process_port: 8887
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 16
rdzv_backend: static
same_network: true
use_cpu: false
|
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
Hi 🤗 I have 2 nodes, each of 8xA100, for a total of 16 GPUs. I'm utilizing SLURM for launching the jobs.
SLURM scripts for the curious: https://rentry.co/9geu8n
Here, the main script uses the allotted 2 nodes and runs srun over it, i.e. each node is given the PY file to execute once.
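The SLURM scripts themselves are only available behind the rentry link, so as a rough illustration of that per-node pattern, here is a hypothetical bootstrap the srun'd file could perform; the port, script path, and per-node GPU count are assumptions, not values taken from the actual scripts:

```python
# Hypothetical per-node bootstrap (sketch only; the real SLURM scripts live at the
# rentry link above). srun starts this file once per node; it derives the rendezvous
# info from SLURM's environment and hands off to `accelerate launch`.
import os
import subprocess

node_rank = int(os.environ["SLURM_NODEID"])          # 0 or 1 with two nodes
num_nodes = int(os.environ["SLURM_JOB_NUM_NODES"])   # 2 in this setup
gpus_per_node = 8                                    # assumed: 8x A100 per node

# First hostname in the allocation acts as the main process host.
main_host = subprocess.run(
    ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]],
    capture_output=True, text=True, check=True,
).stdout.splitlines()[0]

subprocess.run(
    [
        "accelerate", "launch",
        "--multi_gpu",
        "--mixed_precision", "fp16",
        "--num_machines", str(num_nodes),
        "--num_processes", str(num_nodes * gpus_per_node),
        "--machine_rank", str(node_rank),
        "--main_process_ip", main_host,
        "--main_process_port", "29500",               # assumed port
        "scripts/torch_convnext.py",                  # placeholder training script
    ],
    check=True,
)
```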
Env
Accelerate version: 0.13.0.dev0
Accelerate default config: Not found
Now, I noticed a peculiar behavior. When I'm on a single node (no SLURM, no multi-node, only multi-GPU) and run this:

The script won't run - the command simply executes, and I'm back at the command prompt again - no stdout or stderr.

But with

It works fine. The script runs alone on the 8 GPUs, and I can monitor the WandB logs.

This is a little quirk which puzzled me, and I can neither make head nor tail of it. I suspect it might mean something to someone here..
Multi-node training

For multi-node training, this is the PY script being executed: https://rentry.co/tz465
Most of it is standard snippets, but it may have some glaring flaw.

Output:

This is the output of the main sbatch script, which tells SLURM to deploy

Trying random ports yields no results.
I think it might be connected with the problem specified above. Does anyone have any idea?
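Since the title's symptom is the host failing to reach its own advertised IP and port, a plain TCP bind-and-connect test (independent of NCCL, PyTorch, or Accelerate) can confirm whether that address is usable at all. A small sketch, with the IP and port taken from the config pasted earlier in the thread:

```python
# Hypothetical connectivity self-check (not from the thread): bind the advertised
# address, then connect to it from the same host.
import socket

HOST, PORT = "172.31.231.142", 8887   # values from the config shown earlier

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind((HOST, PORT))             # fails if the host cannot bind its own IP
server.listen(1)

client = socket.create_connection((HOST, PORT), timeout=5)
print("bind + self-connect succeeded")

client.close()
server.close()
```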