Multi-node GPU training on Slurm cluster #2246

Closed

wormyu opened this issue Dec 13, 2023 · 2 comments

wormyu commented Dec 13, 2023

System Info

- `Accelerate` version: 0.25.0
- Platform: Linux-3.10.0-1160.95.1.el7.x86_64-x86_64-with-glibc2.17
- Python version: 3.9.18
- Numpy version: 1.23.5
- PyTorch version (GPU?): 2.1.1+cu118 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 503.39 GB
- GPU type: NVIDIA A100-SXM4-80GB
- `Accelerate` default config:
	Not found

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Apologies for the long log; I just want to give detailed information about the error. Please let me know if any part is redundant and I can remove it.

My training Python file comes from here; it is a modified version of the Hugging Face run_mlm_no_trainer.py script. For multi-node training on the Slurm cluster, I adapted the script from here. My script is:

#!/bin/bash
#SBATCH -A p32013 ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH -p gengpu ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --gres gpu:a100:2
#SBATCH -t 48:00:00 ## Required: How long will the job need to run (remember different partitions have restrictions on this parameter)
#SBATCH --nodes 4 ## how many computers/nodes do you need (no default)
#SBATCH --cpus-per-task 24
#SBATCH --ntasks-per-node 2 ## how many cpus or processors (gpu) do you need on per computer/node (default value 1)
#SBATCH --mem 200G ## how much RAM do you need per computer/node (this affects your FairShare score so be careful to not ask for more than you need))

module purge
module load python-miniconda3/4.12.0
module load moose/1.0.0
module load cuda/11.4.0-gcc
module load gcc/9.2.0

conda init bash
source ~/.bashrc

conda activate outlier
conda list

cd /projects/p32013/outlier-free-transformers

export PYTHONPATH=${PYTHONPATH}:$(realpath "$PWD")

export NCCL_DEBUG=INFO

export GPUS_PER_NODE=2
head_node_ip=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
# export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4) + $SLURM_ARRAY_TASK_ID)
export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))

echo "WORLD_SIZE="$WORLD_SIZE
echo "SLURM_PROCID="$SLURM_PROCID
echo "MASTER_PORT="$MASTER_PORT

# Reference from https://github.com/huggingface/accelerate/blob/main/examples/slurm/submit_multinode.sh.
export LAUNCHER="accelerate launch \
     --multi_gpu \
     --num_processes $WORLD_SIZE \
     --num_machines $SLURM_NNODES \
     --machine_rank $SLURM_PROCID \
     --rdzv_backend c10d  \
     --main_process_ip "$head_node_ip" \
     --main_process_port $MASTER_PORT \
     --mixed_precision fp16 \
     "

# Test torchrun
# export LAUNCHER=" \
#    torchrun \
#    --nnodes $SLURM_NNODES \
#    --nproc_per_node $GPUS_PER_NODE \
#    --rdzv_id $RANDOM \
#    --rdzv_backend c10d \
#    --rdzv_endpoint $head_node_ip:$MASTER_PORT \
#    "

export SCRIPT="/projects/p32013/outlier-free-transformers/run_mlm.py"
export SCRIPT_ARGS=" \
    --with_tracking \
    --report_to tensorboard \
    --extra_tb_stats \
    --seed 1000 \
    --dataset_setup wikitext_2 \
    --preprocessing_num_workers 4 \
    --data_cache_dir /projects/p32013/cache/.hf_data \
    --model_cache_dir /projects/p32013/cache/.hf_cache \
    --model_type bert \
    --tokenizer_name bert-base-uncased \
    --max_seq_length 128 \
    --mlm_probability 0.15 \
    --learning_rate 0.0001 \
    --lr_scheduler_type linear \
    --max_train_steps 1000000 \
    --num_warmup_steps 10000 \
    --per_device_train_batch_size 256 \
    --per_device_eval_batch_size 256 \
    --gradient_accumulation_steps 1 \
    --max_grad_norm 1.0 \
    --weight_decay 0.01 \
    --config_name bert-base-uncased \
    --checkpointing_steps 100000 \
    --tb_scalar_log_interval 2000 \
    --tb_hist_log_interval 100000 \
    --attn_softmax vanilla \
    --output_dir output \
    "

# This step is necessary because accelerate launch does not handle multiline arguments properly
export CMD="$LAUNCHER $SCRIPT $SCRIPT_ARGS" 
srun $CMD

The above script contains both accelerate launch and torchrun (referenced from here) for launching a multi-node job. I tested both and both failed. I briefly describe the errors here and put the complete error logs at the end of this issue.

Accelerate launch

First, even though I randomly select a port number for each job I submit, the error always says:

[W socket.cpp:436] [c10d] The server socket has failed to bind to [::]:12211 (errno: 98 - Address already in use).
[W socket.cpp:436] [c10d] The server socket has failed to bind to 0.0.0.0:12211 (errno: 98 - Address already in use).
[E socket.cpp:472] [c10d] The server socket has failed to listen on any local network address.
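
As a side check (not part of my submission script), here is a minimal Python sketch I can run on the assigned node to see whether the chosen MASTER_PORT is actually free before launching; the file name check_port.py is just illustrative.

# check_port.py (illustrative name): try to bind MASTER_PORT on this node.
import os
import socket

port = int(os.environ.get("MASTER_PORT", "12211"))
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s.bind(("", port))  # raises OSError (errno 98) if another process already holds the port
    print(f"port {port} is free on {socket.gethostname()}")
except OSError as e:
    print(f"port {port} is NOT free on {socket.gethostname()}: {e}")
finally:
    s.close()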

The program then keeps running for a while (loading the configuration, loading the tokenized data from cache, ...) until it reaches line 321 in run_mlm.py, which reads with accelerator.main_process_first(). At that point an NCCL error appears and the job fails.
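
For context, here is a self-contained sketch of what that context manager does (my own minimal example, not the original run_mlm.py code): the non-main ranks wait at a distributed barrier before entering the block, and that barrier is the first NCCL collective, which is why NCCL initialization (and the failure) happens here.

# minimal_mpf.py (illustrative name): reproduce the barrier behind main_process_first().
from accelerate import Accelerator

accelerator = Accelerator()
# Non-main processes call wait_for_everyone() (a torch.distributed.barrier) before
# entering the block; the main process calls it on exit, so this is the first point
# where every rank must be able to communicate with the others.
with accelerator.main_process_first():
    print(f"rank {accelerator.process_index} entered the block")
accelerator.wait_for_everyone()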

...
  File "/projects/p32013/outlier-free-transformers/run_mlm_origin.py", line 725, in <module>
Traceback (most recent call last):
  File "/projects/p32013/outlier-free-transformers/run_mlm_origin.py", line 725, in <module>
    main()
  File "/projects/p32013/outlier-free-transformers/run_mlm_origin.py", line 321, in main
    with accelerator.main_process_first():
Traceback (most recent call last):
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/contextlib.py", line 119, in __enter__
  File "/projects/p32013/outlier-free-transformers/run_mlm_origin.py", line 725, in <module>
    main()
  File "/projects/p32013/outlier-free-transformers/run_mlm_origin.py", line 321, in main
Traceback (most recent call last):
  File "/projects/p32013/outlier-free-transformers/run_mlm_origin.py", line 725, in <module>
    return next(self.gen)
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/accelerate/accelerator.py", line 816, in main_process_first
    with accelerator.main_process_first():
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/contextlib.py", line 119, in __enter__
    main()
  File "/projects/p32013/outlier-free-transformers/run_mlm_origin.py", line 321, in main
    with self.state.main_process_first():
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/contextlib.py", line 119, in __enter__
    return next(self.gen)
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/accelerate/accelerator.py", line 816, in main_process_first
    with accelerator.main_process_first():
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/contextlib.py", line 119, in __enter__
    return next(self.gen)
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/accelerate/state.py", line 956, in main_process_first
    main()
  File "/projects/p32013/outlier-free-transformers/run_mlm_origin.py", line 321, in main
    return next(self.gen)
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/accelerate/accelerator.py", line 816, in main_process_first
    with self.state.main_process_first():
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/contextlib.py", line 119, in __enter__
    with accelerator.main_process_first():
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/contextlib.py", line 119, in __enter__
    with PartialState().main_process_first():
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/contextlib.py", line 119, in __enter__
    return next(self.gen)
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/accelerate/state.py", line 956, in main_process_first
    return next(self.gen)
    with self.state.main_process_first():
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/contextlib.py", line 119, in __enter__
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/accelerate/accelerator.py", line 816, in main_process_first
    return next(self.gen)
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/accelerate/state.py", line 513, in main_process_first
    return next(self.gen)
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/accelerate/state.py", line 956, in main_process_first
    with PartialState().main_process_first():
    yield from self._goes_first(self.is_main_process)
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/accelerate/state.py", line 408, in _goes_first
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/contextlib.py", line 119, in __enter__
    with self.state.main_process_first():
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/contextlib.py", line 119, in __enter__
    return next(self.gen)
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/accelerate/state.py", line 513, in main_process_first
    self.wait_for_everyone()
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/accelerate/state.py", line 402, in wait_for_everyone
    with PartialState().main_process_first():
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/contextlib.py", line 119, in __enter__
    return next(self.gen)
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/accelerate/state.py", line 956, in main_process_first
    yield from self._goes_first(self.is_main_process)
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/accelerate/state.py", line 408, in _goes_first
    torch.distributed.barrier()
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return next(self.gen)
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/accelerate/state.py", line 513, in main_process_first
    return func(*args, **kwargs)
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
    self.wait_for_everyone()
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/accelerate/state.py", line 402, in wait_for_everyone
    yield from self._goes_first(self.is_main_process)
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/accelerate/state.py", line 408, in _goes_first
    with PartialState().main_process_first():
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/contextlib.py", line 119, in __enter__
    torch.distributed.barrier()
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return next(self.gen)
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/accelerate/state.py", line 513, in main_process_first
    self.wait_for_everyone()
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/accelerate/state.py", line 402, in wait_for_everyone
    return func(*args, **kwargs)
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
    torch.distributed.barrier()
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    yield from self._goes_first(self.is_main_process)
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/accelerate/state.py", line 408, in _goes_first
    return func(*args, **kwargs)
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
    self.wait_for_everyone()
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/accelerate/state.py", line 402, in wait_for_everyone
    torch.distributed.barrier()
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.18.6
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Last error:
Duplicate GPU detected : rank 6 and rank 4 both on CUDA device 31000
    return func(*args, **kwargs)
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
    work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.18.6
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Last error:
Duplicate GPU detected : rank 5 and rank 7 both on CUDA device 4b000
...

For now, it looks like my program fails as soon as all the processes have to communicate with each other. I suspect this is caused by the master port not being set successfully, but I am not sure.
To get a clearer message from NCCL about why it is failing, I set NCCL_DEBUG=INFO. The detailed NCCL INFO log is below.

qgpu2008:21283:21283 [0] NCCL INFO Bootstrap : Using ib0:172.41.212.8<0>
qgpu2008:21283:21283 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
qgpu2008:21283:21283 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
qgpu2018:59523:59523 [0] NCCL INFO cudaDriverVersion 12000
qgpu2018:59525:59525 [1] NCCL INFO cudaDriverVersion 12000
qgpu2008:21283:21283 [0] NCCL INFO cudaDriverVersion 12000
qgpu2018:59522:59522 [0] NCCL INFO cudaDriverVersion 12000
NCCL version 2.18.6+cuda11.8
qgpu2018:59524:59524 [1] NCCL INFO cudaDriverVersion 12000
qgpu2008:21284:21284 [1] NCCL INFO cudaDriverVersion 12000
qgpu2018:59523:59523 [0] NCCL INFO Bootstrap : Using ib0:172.41.212.18<0>
qgpu2018:59525:59525 [1] NCCL INFO Bootstrap : Using ib0:172.41.212.18<0>
qgpu2008:21284:21284 [1] NCCL INFO Bootstrap : Using ib0:172.41.212.8<0>
qgpu2018:59522:59522 [0] NCCL INFO Bootstrap : Using ib0:172.41.212.18<0>
qgpu2018:59524:59524 [1] NCCL INFO Bootstrap : Using ib0:172.41.212.18<0>
qgpu2008:21284:21284 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
qgpu2008:21284:21284 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
qgpu2018:59525:59525 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
qgpu2018:59525:59525 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
qgpu2018:59522:59522 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
qgpu2018:59522:59522 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
qgpu2018:59523:59523 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
qgpu2018:59523:59523 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
qgpu2018:59524:59524 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
qgpu2018:59524:59524 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
qgpu2008:21284:21443 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/RoCE [RO]; OOB ib0:172.41.212.8<0>
qgpu2008:21284:21443 [1] NCCL INFO Using network IB
qgpu2008:21283:21442 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/RoCE [RO]; OOB ib0:172.41.212.8<0>
qgpu2008:21283:21442 [0] NCCL INFO Using network IB
qgpu2018:59522:59993 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/RoCE [RO]; OOB ib0:172.41.212.18<0>
qgpu2018:59522:59993 [0] NCCL INFO Using network IB
qgpu2018:59523:59996 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/RoCE [RO]; OOB ib0:172.41.212.18<0>
qgpu2018:59523:59996 [0] NCCL INFO Using network IB
qgpu2018:59524:59995 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/RoCE [RO]; OOB ib0:172.41.212.18<0>
qgpu2018:59524:59995 [1] NCCL INFO Using network IB
qgpu2018:59525:59994 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/RoCE [RO]; OOB ib0:172.41.212.18<0>
qgpu2018:59525:59994 [1] NCCL INFO Using network IB
qgpu2009:4956:4956 [0] NCCL INFO cudaDriverVersion 12000
qgpu2009:4956:4956 [0] NCCL INFO Bootstrap : Using ib0:172.41.212.9<0>
qgpu2009:4956:4956 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
qgpu2009:4956:4956 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
qgpu2009:4957:4957 [1] NCCL INFO cudaDriverVersion 12000
qgpu2009:4957:4957 [1] NCCL INFO Bootstrap : Using ib0:172.41.212.9<0>
qgpu2009:4957:4957 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
qgpu2009:4957:4957 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
qgpu2009:4956:5096 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/RoCE [RO]; OOB ib0:172.41.212.9<0>
qgpu2009:4956:5096 [0] NCCL INFO Using network IB
qgpu2009:4957:5097 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/RoCE [RO]; OOB ib0:172.41.212.9<0>
qgpu2009:4957:5097 [1] NCCL INFO Using network IB
qgpu2018:59524:59995 [1] NCCL INFO comm 0xd1f0ef0 rank 7 nranks 8 cudaDev 1 nvmlDev 1 busId 4b000 commId 0xec16febf07bc8cdc - Init START
qgpu2008:21284:21443 [1] NCCL INFO comm 0xdc4e6a0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 4b000 commId 0xec16febf07bc8cdc - Init START
qgpu2008:21283:21442 [0] NCCL INFO comm 0xeef0580 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 31000 commId 0xec16febf07bc8cdc - Init START
qgpu2009:4956:5096 [0] NCCL INFO comm 0xd3c8f90 rank 2 nranks 8 cudaDev 0 nvmlDev 0 busId 31000 commId 0xec16febf07bc8cdc - Init START
qgpu2009:4957:5097 [1] NCCL INFO comm 0xe004430 rank 3 nranks 8 cudaDev 1 nvmlDev 1 busId 4b000 commId 0xec16febf07bc8cdc - Init START
qgpu2018:59522:59993 [0] NCCL INFO comm 0xcd690e0 rank 6 nranks 8 cudaDev 0 nvmlDev 0 busId 31000 commId 0xec16febf07bc8cdc - Init START
qgpu2018:59523:59996 [0] NCCL INFO comm 0xe3a02a0 rank 4 nranks 8 cudaDev 0 nvmlDev 0 busId 31000 commId 0xec16febf07bc8cdc - Init START
qgpu2018:59525:59994 [1] NCCL INFO comm 0xd6b2260 rank 5 nranks 8 cudaDev 1 nvmlDev 1 busId 4b000 commId 0xec16febf07bc8cdc - Init START

qgpu2018:59522:59993 [0] init.cc:795 NCCL WARN Duplicate GPU detected : rank 6 and rank 4 both on CUDA device 31000

qgpu2018:59524:59995 [1] init.cc:795 NCCL WARN Duplicate GPU detected : rank 7 and rank 5 both on CUDA device 4b000
qgpu2018:59522:59993 [0] NCCL INFO init.cc:1358 -> 5
qgpu2018:59524:59995 [1] NCCL INFO init.cc:1358 -> 5
qgpu2018:59522:59993 [0] NCCL INFO group.cc:65 -> 5 [Async thread]
qgpu2018:59524:59995 [1] NCCL INFO group.cc:65 -> 5 [Async thread]
qgpu2018:59522:59522 [0] NCCL INFO group.cc:406 -> 5
qgpu2018:59522:59522 [0] NCCL INFO group.cc:96 -> 5
qgpu2018:59524:59524 [1] NCCL INFO group.cc:406 -> 5
qgpu2018:59524:59524 [1] NCCL INFO group.cc:96 -> 5

qgpu2018:59525:59994 [1] init.cc:795 NCCL WARN Duplicate GPU detected : rank 5 and rank 7 both on CUDA device 4b000
qgpu2018:59525:59994 [1] NCCL INFO init.cc:1358 -> 5
qgpu2018:59525:59994 [1] NCCL INFO group.cc:65 -> 5 [Async thread]

qgpu2018:59523:59996 [0] init.cc:795 NCCL WARN Duplicate GPU detected : rank 4 and rank 6 both on CUDA device 31000
qgpu2018:59523:59996 [0] NCCL INFO init.cc:1358 -> 5
qgpu2018:59523:59996 [0] NCCL INFO group.cc:65 -> 5 [Async thread]
qgpu2018:59523:59523 [0] NCCL INFO group.cc:406 -> 5
qgpu2018:59523:59523 [0] NCCL INFO group.cc:96 -> 5
qgpu2018:59525:59525 [1] NCCL INFO group.cc:406 -> 5
qgpu2018:59525:59525 [1] NCCL INFO group.cc:96 -> 5
qgpu2009:4957:5097 [1] NCCL INFO Setting affinity for GPU 1 to ffffff
qgpu2009:4956:5096 [0] NCCL INFO Setting affinity for GPU 0 to ffffff
qgpu2008:21284:21443 [1] NCCL INFO Setting affinity for GPU 1 to ffffff
qgpu2008:21283:21442 [0] NCCL INFO Setting affinity for GPU 0 to ffffff
qgpu2018:59524:59524 [1] NCCL INFO comm 0xd1f0ef0 rank 7 nranks 8 cudaDev 1 busId 4b000 - Abort COMPLETE
qgpu2018:59523:59523 [0] NCCL INFO comm 0xe3a02a0 rank 4 nranks 8 cudaDev 0 busId 31000 - Abort COMPLETE
qgpu2018:59522:59522 [0] NCCL INFO comm 0xcd690e0 rank 6 nranks 8 cudaDev 0 busId 31000 - Abort COMPLETE
qgpu2018:59525:59525 [1] NCCL INFO comm 0xd6b2260 rank 5 nranks 8 cudaDev 1 busId 4b000 - Abort COMPLETE

qgpu2008:21283:21442 [0] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer qgpu2018-ib0.quest.it.northwestern.edu<40232>
qgpu2008:21283:21442 [0] NCCL INFO misc/socket.cc:57 -> 6
qgpu2008:21283:21442 [0] NCCL INFO misc/socket.cc:786 -> 6
qgpu2008:21283:21442 [0] NCCL INFO bootstrap.cc:71 -> 6
qgpu2008:21283:21442 [0] NCCL INFO bootstrap.cc:395 -> 6
qgpu2008:21283:21442 [0] NCCL INFO init.cc:938 -> 6
qgpu2008:21283:21442 [0] NCCL INFO init.cc:1358 -> 6
qgpu2008:21283:21442 [0] NCCL INFO group.cc:65 -> 6 [Async thread]
qgpu2008:21283:21283 [0] NCCL INFO group.cc:406 -> 6
qgpu2008:21283:21283 [0] NCCL INFO group.cc:96 -> 6
qgpu2008:21283:21283 [0] NCCL INFO comm 0xeef0580 rank 0 nranks 8 cudaDev 0 busId 31000 - Abort COMPLETE

qgpu2008:21284:21443 [1] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer qgpu2008-ib0.quest.it.northwestern.edu<49866>
qgpu2008:21284:21443 [1] NCCL INFO misc/socket.cc:57 -> 6
qgpu2008:21284:21443 [1] NCCL INFO misc/socket.cc:786 -> 6
qgpu2008:21284:21443 [1] NCCL INFO bootstrap.cc:71 -> 6
qgpu2008:21284:21443 [1] NCCL INFO bootstrap.cc:395 -> 6
qgpu2008:21284:21443 [1] NCCL INFO init.cc:938 -> 6
qgpu2008:21284:21443 [1] NCCL INFO init.cc:1358 -> 6
qgpu2008:21284:21443 [1] NCCL INFO group.cc:65 -> 6 [Async thread]
qgpu2008:21284:21284 [1] NCCL INFO group.cc:406 -> 6
qgpu2008:21284:21284 [1] NCCL INFO group.cc:96 -> 6
qgpu2008:21284:21284 [1] NCCL INFO comm 0xdc4e6a0 rank 1 nranks 8 cudaDev 1 busId 4b000 - Abort COMPLETE

qgpu2009:4956:5096 [0] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer qgpu2008-ib0.quest.it.northwestern.edu<56368>
qgpu2009:4956:5096 [0] NCCL INFO misc/socket.cc:57 -> 6
qgpu2009:4956:5096 [0] NCCL INFO misc/socket.cc:786 -> 6
qgpu2009:4956:5096 [0] NCCL INFO bootstrap.cc:71 -> 6
qgpu2009:4956:5096 [0] NCCL INFO bootstrap.cc:395 -> 6
qgpu2009:4956:5096 [0] NCCL INFO init.cc:938 -> 6
qgpu2009:4956:5096 [0] NCCL INFO init.cc:1358 -> 6
qgpu2009:4956:5096 [0] NCCL INFO group.cc:65 -> 6 [Async thread]
qgpu2009:4956:4956 [0] NCCL INFO group.cc:406 -> 6
qgpu2009:4956:4956 [0] NCCL INFO group.cc:96 -> 6
qgpu2009:4956:4956 [0] NCCL INFO comm 0xd3c8f90 rank 2 nranks 8 cudaDev 0 busId 31000 - Abort COMPLETE

qgpu2009:4957:5097 [1] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer qgpu2009-ib0.quest.it.northwestern.edu<46674>
qgpu2009:4957:5097 [1] NCCL INFO misc/socket.cc:57 -> 6
qgpu2009:4957:5097 [1] NCCL INFO misc/socket.cc:786 -> 6
qgpu2009:4957:5097 [1] NCCL INFO bootstrap.cc:71 -> 6
qgpu2009:4957:5097 [1] NCCL INFO bootstrap.cc:395 -> 6
qgpu2009:4957:5097 [1] NCCL INFO init.cc:938 -> 6
qgpu2009:4957:5097 [1] NCCL INFO init.cc:1358 -> 6
qgpu2009:4957:5097 [1] NCCL INFO group.cc:65 -> 6 [Async thread]
qgpu2009:4957:4957 [1] NCCL INFO group.cc:406 -> 6
qgpu2009:4957:4957 [1] NCCL INFO group.cc:96 -> 6
qgpu2009:4957:4957 [1] NCCL INFO comm 0xe004430 rank 3 nranks 8 cudaDev 1 busId 4b000 - Abort COMPLETE

torchrun

The torchrun error is similar to the above case but slightly different.
The master port is still reported as already in use:

[W socket.cpp:436] [c10d] The server socket has failed to bind to [::]:12211 (errno: 98 - Address already in use).
[W socket.cpp:436] [c10d] The server socket has failed to bind to 0.0.0.0:12211 (errno: 98 - Address already in use).
[E socket.cpp:472] [c10d] The server socket has failed to listen on any local network address.

The error still happens at with accelerator.main_process_first(); the error log is slightly different.

...
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    torch.distributed.barrier()
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
    return func(*args, **kwargs)
  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
    work = default_pg.barrier(opts=opts)
torch.distributed    .work = default_pg.barrier(opts=opts)DistBackendError
: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.18.6
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Last error:
Duplicate GPU detected : rank 3 and rank 1 both on CUDA device 4b000torch.distributed.
DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.18.6
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Last error:
Duplicate GPU detected : rank 2 and rank 0 both on CUDA device 31000
...

Here is the NCCL log for the torchrun case:

qgpu2008:35182:35182 [0] NCCL INFO Bootstrap : Using ib0:172.41.212.8<0>
qgpu2008:35182:35182 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
qgpu2008:35182:35182 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
qgpu2008:35182:35182 [0] NCCL INFO cudaDriverVersion 12000
NCCL version 2.18.6+cuda11.8
qgpu2008:35184:35184 [1] NCCL INFO cudaDriverVersion 12000
qgpu2008:35185:35185 [1] NCCL INFO cudaDriverVersion 12000
qgpu2009:9502:9502 [0] NCCL INFO cudaDriverVersion 12000
qgpu2017:4304:4304 [0] NCCL INFO cudaDriverVersion 12000
qgpu2009:9503:9503 [1] NCCL INFO cudaDriverVersion 12000
qgpu2017:4305:4305 [1] NCCL INFO cudaDriverVersion 12000
qgpu2008:35184:35184 [1] NCCL INFO Bootstrap : Using ib0:172.41.212.8<0>
qgpu2008:35185:35185 [1] NCCL INFO Bootstrap : Using ib0:172.41.212.8<0>
qgpu2009:9502:9502 [0] NCCL INFO Bootstrap : Using ib0:172.41.212.9<0>
qgpu2017:4304:4304 [0] NCCL INFO Bootstrap : Using ib0:172.41.212.17<0>
qgpu2017:4305:4305 [1] NCCL INFO Bootstrap : Using ib0:172.41.212.17<0>
qgpu2009:9503:9503 [1] NCCL INFO Bootstrap : Using ib0:172.41.212.9<0>
qgpu2008:35184:35184 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
qgpu2008:35184:35184 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
qgpu2008:35185:35185 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
qgpu2008:35185:35185 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
qgpu2009:9502:9502 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
qgpu2009:9503:9503 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
qgpu2009:9502:9502 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
qgpu2009:9503:9503 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
qgpu2017:4304:4304 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
qgpu2017:4305:4305 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
qgpu2017:4304:4304 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
qgpu2017:4305:4305 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
qgpu2008:35183:35183 [0] NCCL INFO cudaDriverVersion 12000
qgpu2008:35183:35183 [0] NCCL INFO Bootstrap : Using ib0:172.41.212.8<0>
qgpu2008:35183:35183 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
qgpu2008:35183:35183 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
qgpu2008:35182:35220 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/RoCE [RO]; OOB ib0:172.41.212.8<0>
qgpu2008:35182:35220 [0] NCCL INFO Using network IB
qgpu2017:4304:4330 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/RoCE [RO]; OOB ib0:172.41.212.17<0>
qgpu2017:4304:4330 [0] NCCL INFO Using network IB
qgpu2017:4305:4329 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/RoCE [RO]; OOB ib0:172.41.212.17<0>
qgpu2017:4305:4329 [1] NCCL INFO Using network IB
qgpu2009:9503:9526 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/RoCE [RO]; OOB ib0:172.41.212.9<0>
qgpu2009:9503:9526 [1] NCCL INFO Using network IB
qgpu2009:9502:9525 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/RoCE [RO]; OOB ib0:172.41.212.9<0>
qgpu2009:9502:9525 [0] NCCL INFO Using network IB
qgpu2008:35184:35221 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/RoCE [RO]; OOB ib0:172.41.212.8<0>
qgpu2008:35184:35221 [1] NCCL INFO Using network IB
qgpu2008:35185:35222 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/RoCE [RO]; OOB ib0:172.41.212.8<0>
qgpu2008:35185:35222 [1] NCCL INFO Using network IB
qgpu2008:35183:35223 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/RoCE [RO]; OOB ib0:172.41.212.8<0>
qgpu2008:35183:35223 [0] NCCL INFO Using network IB
qgpu2017:4305:4329 [1] NCCL INFO comm 0x52589c20 rank 7 nranks 8 cudaDev 1 nvmlDev 1 busId 4b000 commId 0x26b3ece72ee71028 - Init START
qgpu2008:35182:35220 [0] NCCL INFO comm 0x562e3380 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 31000 commId 0x26b3ece72ee71028 - Init START
qgpu2008:35184:35221 [1] NCCL INFO comm 0x555bcd30 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 4b000 commId 0x26b3ece72ee71028 - Init START
qgpu2008:35183:35223 [0] NCCL INFO comm 0x54e23e50 rank 2 nranks 8 cudaDev 0 nvmlDev 0 busId 31000 commId 0x26b3ece72ee71028 - Init START
qgpu2008:35185:35222 [1] NCCL INFO comm 0x546df700 rank 3 nranks 8 cudaDev 1 nvmlDev 1 busId 4b000 commId 0x26b3ece72ee71028 - Init START
qgpu2009:9502:9525 [0] NCCL INFO comm 0x54a8c5a0 rank 4 nranks 8 cudaDev 0 nvmlDev 0 busId 31000 commId 0x26b3ece72ee71028 - Init START
qgpu2009:9503:9526 [1] NCCL INFO comm 0x53704630 rank 5 nranks 8 cudaDev 1 nvmlDev 1 busId 4b000 commId 0x26b3ece72ee71028 - Init START
qgpu2017:4304:4330 [0] NCCL INFO comm 0x55263d60 rank 6 nranks 8 cudaDev 0 nvmlDev 0 busId 31000 commId 0x26b3ece72ee71028 - Init START

qgpu2008:35182:35220 [0] init.cc:795 NCCL WARN Duplicate GPU detected : rank 0 and rank 2 both on CUDA device 31000
qgpu2008:35182:35220 [0] NCCL INFO init.cc:1358 -> 5

qgpu2008:35184:35221 [1] init.cc:795 NCCL WARN Duplicate GPU detected : rank 1 and rank 3 both on CUDA device 4b000
qgpu2008:35182:35220 [0] NCCL INFO group.cc:65 -> 5 [Async thread]
qgpu2008:35184:35221 [1] NCCL INFO init.cc:1358 -> 5

qgpu2008:35185:35222 [1] init.cc:795 NCCL WARN Duplicate GPU detected : rank 3 and rank 1 both on CUDA device 4b000

qgpu2008:35183:35223 [0] init.cc:795 NCCL WARN Duplicate GPU detected : rank 2 and rank 0 both on CUDA device 31000
qgpu2008:35185:35222 [1] NCCL INFO init.cc:1358 -> 5
qgpu2008:35183:35223 [0] NCCL INFO init.cc:1358 -> 5
qgpu2008:35185:35222 [1] NCCL INFO group.cc:65 -> 5 [Async thread]
qgpu2008:35183:35223 [0] NCCL INFO group.cc:65 -> 5 [Async thread]
qgpu2008:35184:35221 [1] NCCL INFO group.cc:65 -> 5 [Async thread]
qgpu2008:35184:35184 [1] NCCL INFO group.cc:406 -> 5
qgpu2008:35184:35184 [1] NCCL INFO group.cc:96 -> 5
qgpu2008:35185:35185 [1] NCCL INFO group.cc:406 -> 5
qgpu2008:35183:35183 [0] NCCL INFO group.cc:406 -> 5
qgpu2008:35183:35183 [0] NCCL INFO group.cc:96 -> 5
qgpu2008:35185:35185 [1] NCCL INFO group.cc:96 -> 5
qgpu2008:35182:35182 [0] NCCL INFO group.cc:406 -> 5
qgpu2008:35182:35182 [0] NCCL INFO group.cc:96 -> 5
qgpu2017:4304:4330 [0] NCCL INFO Setting affinity for GPU 0 to ffffff
qgpu2009:9503:9526 [1] NCCL INFO Setting affinity for GPU 1 to ffffff
qgpu2017:4305:4329 [1] NCCL INFO Setting affinity for GPU 1 to ffffff
qgpu2009:9502:9525 [0] NCCL INFO Setting affinity for GPU 0 to ffffff
qgpu2008:35183:35183 [0] NCCL INFO comm 0x54e23e50 rank 2 nranks 8 cudaDev 0 busId 31000 - Abort COMPLETE
qgpu2008:35184:35184 [1] NCCL INFO comm 0x555bcd30 rank 1 nranks 8 cudaDev 1 busId 4b000 - Abort COMPLETE
qgpu2008:35182:35182 [0] NCCL INFO comm 0x562e3380 rank 0 nranks 8 cudaDev 0 busId 31000 - Abort COMPLETE
qgpu2008:35185:35185 [1] NCCL INFO comm 0x546df700 rank 3 nranks 8 cudaDev 1 busId 4b000 - Abort COMPLETE

qgpu2009:9502:9525 [0] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer qgpu2008-ib0.quest.it.northwestern.edu<44340>
qgpu2009:9502:9525 [0] NCCL INFO misc/socket.cc:57 -> 6
qgpu2009:9502:9525 [0] NCCL INFO misc/socket.cc:786 -> 6
qgpu2009:9502:9525 [0] NCCL INFO bootstrap.cc:71 -> 6
qgpu2009:9502:9525 [0] NCCL INFO bootstrap.cc:395 -> 6
qgpu2009:9502:9525 [0] NCCL INFO init.cc:938 -> 6
qgpu2009:9502:9525 [0] NCCL INFO init.cc:1358 -> 6
qgpu2009:9502:9525 [0] NCCL INFO group.cc:65 -> 6 [Async thread]
qgpu2009:9502:9502 [0] NCCL INFO group.cc:406 -> 6
qgpu2009:9502:9502 [0] NCCL INFO group.cc:96 -> 6
qgpu2009:9502:9502 [0] NCCL INFO comm 0x54a8c5a0 rank 4 nranks 8 cudaDev 0 busId 31000 - Abort COMPLETE

qgpu2009:9503:9526 [1] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer qgpu2009-ib0.quest.it.northwestern.edu<58900>
qgpu2009:9503:9526 [1] NCCL INFO misc/socket.cc:57 -> 6
qgpu2009:9503:9526 [1] NCCL INFO misc/socket.cc:786 -> 6
qgpu2009:9503:9526 [1] NCCL INFO bootstrap.cc:71 -> 6
qgpu2009:9503:9526 [1] NCCL INFO bootstrap.cc:395 -> 6
qgpu2009:9503:9526 [1] NCCL INFO init.cc:938 -> 6
qgpu2009:9503:9526 [1] NCCL INFO init.cc:1358 -> 6
qgpu2009:9503:9526 [1] NCCL INFO group.cc:65 -> 6 [Async thread]
qgpu2009:9503:9503 [1] NCCL INFO group.cc:406 -> 6
qgpu2009:9503:9503 [1] NCCL INFO group.cc:96 -> 6
qgpu2009:9503:9503 [1] NCCL INFO comm 0x53704630 rank 5 nranks 8 cudaDev 1 busId 4b000 - Abort COMPLETE

qgpu2017:4304:4330 [0] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer qgpu2009-ib0.quest.it.northwestern.edu<59044>
qgpu2017:4304:4330 [0] NCCL INFO misc/socket.cc:57 -> 6
qgpu2017:4304:4330 [0] NCCL INFO misc/socket.cc:786 -> 6
qgpu2017:4304:4330 [0] NCCL INFO bootstrap.cc:71 -> 6
qgpu2017:4304:4330 [0] NCCL INFO bootstrap.cc:395 -> 6
qgpu2017:4304:4330 [0] NCCL INFO init.cc:938 -> 6
qgpu2017:4304:4330 [0] NCCL INFO init.cc:1358 -> 6
qgpu2017:4304:4330 [0] NCCL INFO group.cc:65 -> 6 [Async thread]
qgpu2017:4304:4304 [0] NCCL INFO group.cc:406 -> 6
qgpu2017:4304:4304 [0] NCCL INFO group.cc:96 -> 6
qgpu2017:4304:4304 [0] NCCL INFO comm 0x55263d60 rank 6 nranks 8 cudaDev 0 busId 31000 - Abort COMPLETE

qgpu2017:4305:4329 [1] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer qgpu2017-ib0.quest.it.northwestern.edu<44122>
qgpu2017:4305:4329 [1] NCCL INFO misc/socket.cc:57 -> 6
qgpu2017:4305:4329 [1] NCCL INFO misc/socket.cc:786 -> 6
qgpu2017:4305:4329 [1] NCCL INFO bootstrap.cc:71 -> 6
qgpu2017:4305:4329 [1] NCCL INFO bootstrap.cc:395 -> 6
qgpu2017:4305:4329 [1] NCCL INFO init.cc:938 -> 6
qgpu2017:4305:4329 [1] NCCL INFO init.cc:1358 -> 6
qgpu2017:4305:4329 [1] NCCL INFO group.cc:65 -> 6 [Async thread]
qgpu2017:4305:4305 [1] NCCL INFO group.cc:406 -> 6
qgpu2017:4305:4305 [1] NCCL INFO group.cc:96 -> 6
qgpu2017:4305:4305 [1] NCCL INFO comm 0x52589c20 rank 7 nranks 8 cudaDev 1 busId 4b000 - Abort COMPLETE

Expected behavior

I have successfully trained on a single node with multiple GPUs using torchrun:

python -m torch.distributed.run --standalone --nnodes 1 --nproc_per_node=2 run_mlm.py \
--with_tracking \
...

but errors appear in the multi-node case. Is it because the master port isn't set correctly? If so, what might cause this problem: could it be an issue with our cluster, or did I simply not launch the job correctly?

Thanks for reading; I appreciate any help!

TJ-Solergibert (Contributor) commented Dec 13, 2023

Hi!
torchrun (and accelerate launch) just takes care of spawning the processes and setting the rank & world_size environment variables. So it would be interesting to print these env variables at the beginning of the Python script, to check whether the processes get initialized and what values these variables take. If you are having problems with the network, the Python script will never execute. You should ask your system support about the available ports.
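
Something like the following minimal sketch would be enough (debug_env.py is just an example name; run it with your launcher in place of run_mlm.py):

# debug_env.py (example name): print the distributed env vars set by the launcher.
import os
import socket

keys = ["RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"]
report = " ".join(f"{k}={os.environ.get(k, '<unset>')}" for k in keys)
print(f"host={socket.gethostname()} {report}")

With 4 nodes and 2 GPUs per node you would expect 8 lines, ranks 0 to 7, all showing the same MASTER_ADDR and MASTER_PORT.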

I also noticed the following: you should set --ntasks-per-node to 1. It doesn't matter whether you have 1, 2, 4 or 8 GPUs, since you are running torchrun, which counts as a single task (even though it creates more processes). You also set export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE)); just leave that to torchrun. I suggest the following bash script:

#!/bin/bash
#SBATCH -A p32013 ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH -p gengpu ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --gres gpu:a100:2
#SBATCH -t 48:00:00 ## Required: How long will the job need to run (remember different partitions have restrictions on this parameter)
#SBATCH --nodes 4 ## how many computers/nodes do you need (no default)
#SBATCH --cpus-per-task 24
#SBATCH --ntasks-per-node 1 ## how many cpus or processors (gpu) do you need on per computer/node (default value 1)
#SBATCH --mem 200G ## how much RAM do you need per computer/node (this affects your FairShare score so be careful to not ask for more than you need))

module purge
module load python-miniconda3/4.12.0
module load moose/1.0.0
module load cuda/11.4.0-gcc
module load gcc/9.2.0

conda init bash
source ~/.bashrc

conda activate outlier
conda list

cd /projects/p32013/outlier-free-transformers

export PYTHONPATH=${PYTHONPATH}:$(realpath "$PWD")

export NCCL_DEBUG=INFO

export GPUS_PER_NODE=2
head_node_ip=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
# export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4) + $SLURM_ARRAY_TASK_ID)
#export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
#export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))

# echo "WORLD_SIZE="$WORLD_SIZE
# echo "SLURM_PROCID="$SLURM_PROCID
# echo "MASTER_PORT="$MASTER_PORT

# Reference from https://github.com/huggingface/accelerate/blob/main/examples/slurm/submit_multinode.sh.
#export LAUNCHER="accelerate launch \
#     --multi_gpu \
#     --num_processes $WORLD_SIZE \
#     --num_machines $SLURM_NNODES \
#     --machine_rank $SLURM_PROCID \
#     --rdzv_backend c10d  \
#     --main_process_ip "$head_node_ip" \
#     --main_process_port $MASTER_PORT \
#     --mixed_precision fp16 \
#     "

# Test torchrun
 export LAUNCHER=" \
    torchrun \
    --nnodes $SLURM_NNODES \
    --nproc_per_node $GPUS_PER_NODE \
    --rdzv_id $RANDOM \
    --rdzv_backend c10d \
    --rdzv_endpoint $head_node_ip:$UID \
    "

export SCRIPT="/projects/p32013/outlier-free-transformers/run_mlm.py"
export SCRIPT_ARGS=" \
    --with_tracking \
    --report_to tensorboard \
    --extra_tb_stats \
    --seed 1000 \
    --dataset_setup wikitext_2 \
    --preprocessing_num_workers 4 \
    --data_cache_dir /projects/p32013/cache/.hf_data \
    --model_cache_dir /projects/p32013/cache/.hf_cache \
    --model_type bert \
    --tokenizer_name bert-base-uncased \
    --max_seq_length 128 \
    --mlm_probability 0.15 \
    --learning_rate 0.0001 \
    --lr_scheduler_type linear \
    --max_train_steps 1000000 \
    --num_warmup_steps 10000 \
    --per_device_train_batch_size 256 \
    --per_device_eval_batch_size 256 \
    --gradient_accumulation_steps 1 \
    --max_grad_norm 1.0 \
    --weight_decay 0.01 \
    --config_name bert-base-uncased \
    --checkpointing_steps 100000 \
    --tb_scalar_log_interval 2000 \
    --tb_hist_log_interval 100000 \
    --attn_softmax vanilla \
    --output_dir output \
    "

# This step is necessary because accelerate launch does not handle multiline arguments properly
export CMD="$LAUNCHER $SCRIPT $SCRIPT_ARGS" 
srun $CMD

I set the master port to $UID. If you are able to initialize the processes on multiple nodes, you will be able to print world_size and rank, and the Duplicate GPU detected issue would then be a separate problem. Maybe you are spawning more processes than GPUs, so they are fighting over them (e.g. 4 processes spawned for 2 GPUs).
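
To confirm that, here is a quick sketch (again just an example script, not your run_mlm.py) that makes every rank report which GPU it would use; two ranks on the same host with the same LOCAL_RANK means too many processes are being launched per node:

# gpu_map.py (example name): report the rank-to-GPU mapping of every process.
import os
import socket

import torch

rank = os.environ.get("RANK", "<unset>")
local_rank = os.environ.get("LOCAL_RANK", "<unset>")
visible = os.environ.get("CUDA_VISIBLE_DEVICES", "<all>")
print(
    f"host={socket.gethostname()} rank={rank} local_rank={local_rank} "
    f"CUDA_VISIBLE_DEVICES={visible} device_count={torch.cuda.device_count()}"
)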

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
