MiniVit: Some NCCL operations have failed or timed out #96

Closed
Ga-Lee opened this issue Jun 9, 2022 · 7 comments

@Ga-Lee

Ga-Lee commented Jun 9, 2022

When I try to run Mini-DeiT with 6 GPUs on the same node, training stops within the first few epochs with an error like:

[E ProcessGroupNCCL.cpp:587] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808699 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808705 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808703 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808731 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808750 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808749 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

Can you tell me what might cause this problem? Thank you!
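
For context, the Timeout(ms)=1800000 in the log is the default 30-minute collective timeout of the NCCL process group. Below is a minimal sketch, not taken from the Mini-DeiT code, of the standard PyTorch/NCCL knobs for surfacing more diagnostics and raising that timeout while debugging (it assumes the process group is created in your own script):

import os
from datetime import timedelta

import torch.distributed as dist

# Verbose NCCL logs on every rank: which collectives start and on which devices.
os.environ.setdefault("NCCL_DEBUG", "INFO")
# Turn a hang into a hard error instead of a silent 30-minute stall.
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")

# init_process_group accepts a timeout; the NCCL default is 30 minutes,
# which matches the 1800000 ms reported by the watchdog above.
dist.init_process_group(backend="nccl", timeout=timedelta(hours=2))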

@Ga-Lee Ga-Lee changed the title from "MiniVitSome NCCL operations have failed or timed out" to "MiniVit: Some NCCL operations have failed or timed out" Jun 9, 2022
@wkcn wkcn added the MiniViT label Jun 9, 2022
@wkcn
Contributor

wkcn commented Jun 9, 2022

Hi @Ga-Lee, thanks for your attention to our work!

Could you please share the following information?

nvcc --version
python -c "import torch;print('Torch:', torch.__version__, torch.version.cuda)"
python -c "import torchvision;print('TorchVision:', torchvision.__version__)"
python -c "import timm;print('TIMM:', timm.__version__)"
nvidia-smi  | grep "CUDA Version"
nvidia-smi  --list-gpus

And the training command, for example:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --model mini_deit_small_patch16_224 --batch-size 128 --data-path ./ImageNet --output_dir ./outputs  --teacher-model regnety_160 --distillation-type soft --distillation-alpha 1.0 --drop-path 0.0

@Ga-Lee
Author

Ga-Lee commented Jun 9, 2022


Thank you for your help!
Following your guidance, I got this info:

[catarinajli@g-node04 ~]$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0
[catarinajli@g-node04 ~]$ python -c "import torchvision;print('TorchVision:', torchvision.version)"
TorchVision: 0.11.0
[catarinajli@g-node04 ~]$ python -c "import timm;print('TIMM:', timm.version)"
TIMM: 0.5.4
[catarinajli@g-node04 ~]$ nvidia-smi | grep "CUDA Version"
| NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.4 |
[catarinajli@g-node04 ~]$ nvidia-smi --list-gpus
GPU 0: NVIDIA A30
GPU 1: NVIDIA A30
GPU 2: NVIDIA A30
GPU 3: NVIDIA A30
GPU 4: NVIDIA A30
GPU 5: NVIDIA A30
GPU 6: NVIDIA A30
GPU 7: NVIDIA A30

As for the command, I ran it successfully several days ago and trained for 50 epochs without any error in another conda environment, which was set up following the project's requirements.txt.

Actually, the environment I am using now (where the error occurs) was cloned from that one, but I didn't recompile the custom operations afterwards. Could this cause the timeout error?
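
As a quick sanity check for a cloned environment, one can try importing the compiled extension on its own before launching the distributed job. This is only a sketch; the module name rpe_index_cpp is an assumption about what the rpe_ops build installs and may differ in your setup:

import importlib

# Hypothetical smoke test: "rpe_index_cpp" is an assumed name for the compiled
# rpe_ops extension; adjust it to whatever your build actually installs.
try:
    importlib.import_module("rpe_index_cpp")
    print("compiled extension imports fine in this environment")
except ImportError as exc:
    print("compiled extension missing or built against another environment:", exc)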

@wkcn
Contributor

wkcn commented Jun 9, 2022

@Ga-Lee

The versions of CUDA, torchvision, timm, and the GPUs look fine.
The custom operator does not call any NCCL functions, so I don't think it is related to this issue.
I have tried training the model on 6 GPUs, and no exception was raised.

Is there any other training program running on the machine? Can you reproduce the issue?
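
One quick way to check for other programs holding the GPUs is to list the compute processes that nvidia-smi reports; the sketch below only wraps the standard query flags in Python:

import subprocess

# List every compute process currently running on the GPUs of this node.
result = subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,process_name", "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)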

@Ga-Lee
Author

Ga-Lee commented Jun 9, 2022

Yes. Actually, it happens after I replaced the teacher model with a pretrained ConvNeXt (loaded via timm) and the student model with AS-MLP (https://github.com/svip-lab/AS-MLP), using its models directory. Maybe the shift operation in it causes this error? I will check that part, thank you!

@wkcn
Contributor

wkcn commented Jun 9, 2022

You are welcome : )

I have checked the training script of Mini-DeiT. There is no problem with synchronization operations such as metric_logger.synchronize_between_processes() in engine.py.

Replacing the teacher model with ConvNeXt does not cause the problem.
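
For completeness: when this timeout does appear after swapping in a different student or teacher, the usual cause is that the ranks stop executing the same collectives in the same order, for example a rank-dependent branch, or parameters of the new model that never receive gradients so DDP's all-reduce does not complete on every rank (DistributedDataParallel's find_unused_parameters=True flag exists for that case). A tiny sketch of the anti-pattern and the safe form; the helper extra_metric is hypothetical and the process group is assumed to be initialized already:

import torch
import torch.distributed as dist

def extra_metric() -> torch.Tensor:
    # Hypothetical per-rank statistic, used only to illustrate the pattern.
    return torch.zeros(1, device="cuda")

# Anti-pattern (do NOT do this): only rank 0 enters the collective, so the
# other ranks block in their next collective and the NCCL watchdog eventually
# reports an ALLREDUCE timeout like the one in this issue.
#
#     if dist.get_rank() == 0:
#         dist.all_reduce(extra_metric())

# Safe pattern: every rank calls the same collective, in the same order.
dist.all_reduce(extra_metric())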

@Ga-Lee
Author

Ga-Lee commented Jun 10, 2022

Thank you for your help!
I re-ran ConvNeXt + AS-MLP last night. The only difference is that this time I used the virtual environment with the compiled RPE operations (although, as you know, I am not using a model from the Mini-DeiT project), and it has now been running for 30 epochs (it is still running) without the error that occurred before.
I feel pretty confused : ) I don't know which part accounts for the change. But at least the code now runs successfully and AS-MLP seems to be learning from its CNN teacher; I will look into the reason myself later.
Again, I appreciate your help!

@Ga-Lee Ga-Lee closed this as completed Jun 10, 2022
@wkcn
Contributor

wkcn commented Jun 10, 2022

You are welcome! : )
