Multi-node GPU training on Slurm cluster #2246
Comments
Hi! I also noticed the following: you should set --ntasks-per-node to 1; it doesn't matter whether you have 1, 2, 4 or 8 GPUs, because you are running torchrun, which counts as a single Slurm task (even though it creates more processes). A minimal sketch of this one-task-per-node pattern follows after the script below. You set:
#!/bin/bash
#SBATCH -A p32013 ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH -p gengpu ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --gres gpu:a100:2
#SBATCH -t 48:00:00 ## Required: How long will the job need to run (remember different partitions have restrictions on this parameter)
#SBATCH --nodes 4 ## how many computers/nodes do you need (no default)
#SBATCH --cpus-per-task 24
#SBATCH --ntasks-per-node 1 ## how many tasks to launch per computer/node (default value 1); with torchrun this stays 1
#SBATCH --mem 200G ## how much RAM do you need per computer/node (this affects your FairShare score so be careful to not ask for more than you need)
module purge
module load python-miniconda3/4.12.0
module load moose/1.0.0
module load cuda/11.4.0-gcc
module load gcc/9.2.0
conda init bash
source ~/.bashrc
conda activate outlier
conda list
cd /projects/p32013/outlier-free-transformers
export PYTHONPATH=${PYTHONPATH}:$(realpath "$PWD")
export NCCL_DEBUG=INFO
export GPUS_PER_NODE=2
head_node_ip=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
# export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4) + $SLURM_ARRAY_TASK_ID)
#export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
#export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))
# echo "WORLD_SIZE="$WORLD_SIZE
# echo "SLURM_PROCID="$SLURM_PROCID
# echo "MASTER_PORT="$MASTER_PORT
# Reference from https://github.com/huggingface/accelerate/blob/main/examples/slurm/submit_multinode.sh.
#export LAUNCHER="accelerate launch \
# --multi_gpu \
# --num_processes $WORLD_SIZE \
# --num_machines $SLURM_NNODES \
# --machine_rank $SLURM_PROCID \
# --rdzv_backend c10d \
# --main_process_ip "$head_node_ip" \
# --main_process_port $MASTER_PORT \
# --mixed_precision fp16 \
# "
# Test torchrun
export LAUNCHER=" \
torchrun \
--nnodes $SLURM_NNODES \
--nproc_per_node $GPUS_PER_NODE \
--rdzv_id $RANDOM \
--rdzv_backend c10d \
--rdzv_endpoint $head_node_ip:$UID \
"
export SCRIPT="/projects/p32013/outlier-free-transformers/run_mlm.py"
export SCRIPT_ARGS=" \
--with_tracking \
--report_to tensorboard \
--extra_tb_stats \
--seed 1000 \
--dataset_setup wikitext_2 \
--preprocessing_num_workers 4 \
--data_cache_dir /projects/p32013/cache/.hf_data \
--model_cache_dir /projects/p32013/cache/.hf_cache \
--model_type bert \
--tokenizer_name bert-base-uncased \
--max_seq_length 128 \
--mlm_probability 0.15 \
--learning_rate 0.0001 \
--lr_scheduler_type linear \
--max_train_steps 1000000 \
--num_warmup_steps 10000 \
--per_device_train_batch_size 256 \
--per_device_eval_batch_size 256 \
--gradient_accumulation_steps 1 \
--max_grad_norm 1.0 \
--weight_decay 0.01 \
--config_name bert-base-uncased \
--checkpointing_steps 100000 \
--tb_scalar_log_interval 2000 \
--tb_hist_log_interval 100000 \
--attn_softmax vanilla \
--output_dir output \
"
# This step is necessary because accelerate launch does not handle multiline arguments properly
export CMD="$LAUNCHER $SCRIPT $SCRIPT_ARGS"
srun $CMD
I set the master port to
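For reference, a minimal sketch of the one-task-per-node pattern described in the comment above; the node count, GPU count, port 29500 and train.py are placeholders, not values taken from this issue. Slurm launches exactly one task per node, and torchrun spawns one worker process per GPU on each node.
#!/bin/bash
#SBATCH --nodes=2 ## number of machines
#SBATCH --ntasks-per-node=1 ## one Slurm task per node; torchrun creates the per-GPU workers
#SBATCH --gres=gpu:2 ## GPUs visible to torchrun on each node
head_node_ip=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
srun torchrun \
  --nnodes "$SLURM_NNODES" \
  --nproc_per_node 2 \
  --rdzv_backend c10d \
  --rdzv_id "$SLURM_JOB_ID" \
  --rdzv_endpoint "$head_node_ip:29500" \
  train.py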
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
Information
Tasks
An officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
Sorry for the long log; I just want to give detailed information about the error. Please let me know if you think some part is redundant and I can delete it.
My training Python file comes from here; it is a modified version of the Hugging Face run_mlm_no_trainer.py script. For multi-node training on a Slurm cluster, I adapted the launch script from here; my script is the one shown above.
The above script contains both accelerate launch and torchrun (reference here) to launch a multi-node job. I tested both, but they both failed. I briefly describe the errors here and put the complete error logs at the end of this issue.
Accelerate launch
First, even though I randomly select the port number for each job I submit, the error always says:
And then the program still keeps running for a while (loading the configuration, loading tokenized data from cache...) until it reaches line 321 in run_mlm.py, which reads with accelerator.main_process_first(); at that point an NCCL error appears and the job fails. For now, it looks like my program fails as soon as all the processes have to communicate with each other. I suspect this is caused by the master_port not being set successfully, but I am not sure.
To get a clearer message from NCCL about why it is failing, I set NCCL_DEBUG=INFO. I give the detailed NCCL INFO log below.
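In case it helps narrow things down, a couple of related NCCL variables can be set next to NCCL_DEBUG; the interface name below is a placeholder that depends on the cluster's network, so treat these as things to try rather than a fix:
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET ## limit the log to initialization and network setup
# export NCCL_SOCKET_IFNAME=ib0 ## assumption: pin NCCL to the cluster's interconnect interface
# export NCCL_IB_DISABLE=1 ## assumption: fall back to plain TCP if InfiniBand setup is the problem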
torchrun
The torchrun error is similar to the above case, but slightly different. The master port is still reported as already in use.
The error still happens at with accelerator.main_process_first(), but the error log is slightly different. Here is the NCCL log for the torchrun case.
NCCL log:
Expected behavior
I have successfully trained on a single node with multiple GPUs using torchrun, but errors appear in the multi-node case. Is it because the master port isn't set correctly? If so, what might cause this problem: could it be an issue with our cluster, or did I just not launch the job correctly?
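One way to separate a cluster problem from a launch problem is to run a trivial distributed script with exactly the same launcher before touching run_mlm.py; if this already fails, the launch or network setup is at fault. A sketch, assuming the same $LAUNCHER, $GPUS_PER_NODE and $head_node_ip as in the script above (hello_dist.py is a made-up name):
cat > hello_dist.py <<'EOF'
import os
import torch
import torch.distributed as dist

# torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT for us
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
t = torch.ones(1, device="cuda")
dist.all_reduce(t)  # forces real NCCL communication across all ranks
print(f"rank {dist.get_rank()}/{dist.get_world_size()} on {os.uname().nodename}: sum={t.item()}", flush=True)
dist.destroy_process_group()
EOF
srun $LAUNCHER hello_dist.py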
Thanks for reading; I appreciate any help!