MiniVit: Some NCCL operations have failed or timed out #96

Closed
Ga-Lee opened this issue Jun 9, 2022 · 7 comments

@Ga-Lee

Ga-Lee commented Jun 9, 2022

When I try to run Mini-DeiT with 6 GPUs on the same node, training stops within the first few epochs with an error like:

[E ProcessGroupNCCL.cpp:587] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808699 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808705 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808703 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808731 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808750 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808749 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

Can you tell me what might cause this problem? Thank you!
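
For context, the Timeout(ms)=1800000 in the log is the default 30-minute collective timeout of the NCCL process group. Below is a minimal sketch, not taken from the Mini-DeiT code, of the standard PyTorch/NCCL knobs for surfacing more diagnostics and raising that timeout while debugging (it assumes the process group is created in your own script):

import os
from datetime import timedelta

import torch.distributed as dist

# Verbose NCCL logs on every rank: which collectives start and on which devices.
os.environ.setdefault("NCCL_DEBUG", "INFO")
# Turn a hang into a hard error instead of a silent 30-minute stall.
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")

# init_process_group accepts a timeout; the NCCL default is 30 minutes,
# which matches the 1800000 ms reported by the watchdog above.
dist.init_process_group(backend="nccl", timeout=timedelta(hours=2))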

@Ga-Lee Ga-Lee changed the title from "MiniVitSome NCCL operations have failed or timed out" to "MiniVit: Some NCCL operations have failed or timed out" Jun 9, 2022
@wkcn wkcn added the MiniViT label Jun 9, 2022
@wkcn
Contributor

wkcn commented Jun 9, 2022

Hi @Ga-Lee, thanks for your attention to our work!

Could you please share the following information?

nvcc --version
python -c "import torch;print('Torch:', torch.__version__, torch.version.cuda)"
python -c "import torchvision;print('TorchVision:', torchvision.__version__)"
python -c "import timm;print('TIMM:', timm.__version__)"
nvidia-smi  | grep "CUDA Version"
nvidia-smi  --list-gpus

And the training command, for example:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --model mini_deit_small_patch16_224 --batch-size 128 --data-path ./ImageNet --output_dir ./outputs  --teacher-model regnety_160 --distillation-type soft --distillation-alpha 1.0 --drop-path 0.0

@Ga-Lee
Author

Ga-Lee commented Jun 9, 2022


Thank you for your help!
Following your guidance, I got this info:

[catarinajli@g-node04 ~]$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0
[catarinajli@g-node04 ~]$ python -c "import torchvision;print('TorchVision:', torchvision.version)"
TorchVision: 0.11.0
[catarinajli@g-node04 ~]$ python -c "import timm;print('TIMM:', timm.version)"
TIMM: 0.5.4
[catarinajli@g-node04 ~]$ nvidia-smi | grep "CUDA Version"
| NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.4 |
[catarinajli@g-node04 ~]$ nvidia-smi --list-gpus
GPU 0: NVIDIA A30
GPU 1: NVIDIA A30
GPU 2: NVIDIA A30
GPU 3: NVIDIA A30
GPU 4: NVIDIA A30
GPU 5: NVIDIA A30
GPU 6: NVIDIA A30
GPU 7: NVIDIA A30

As for the command, I ran it successfully several days ago and trained for 50 epochs without any error in another conda environment, which was set up following the project's requirements.txt.

Actually, the environment I am using now (where the error occurs) was cloned from that one, but I didn't recompile the custom operations afterwards. Could this cause the timeout error?
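
As a quick sanity check for a cloned environment, one can try importing the compiled extension on its own before launching the distributed job. This is only a sketch; the module name rpe_index_cpp is an assumption about what the rpe_ops build installs and may differ in your setup:

import importlib

# Hypothetical smoke test: "rpe_index_cpp" is an assumed name for the compiled
# rpe_ops extension; adjust it to whatever your build actually installs.
try:
    importlib.import_module("rpe_index_cpp")
    print("compiled extension imports fine in this environment")
except ImportError as exc:
    print("compiled extension missing or built against another environment:", exc)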

@wkcn
Contributor

wkcn commented Jun 9, 2022

@Ga-Lee

The versions of CUDA, torchvision, timm, and the GPUs look fine.
The custom operator does not call any NCCL functions, so I don't think it is related to this issue.
I have tried training the model on 6 GPUs, and no exception was raised.

Is there any other training program running on the machine? Can you reproduce the issue?
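
One quick way to check for other programs holding the GPUs is to list the compute processes that nvidia-smi reports; the sketch below only wraps the standard query flags in Python:

import subprocess

# List every compute process currently running on the GPUs of this node.
result = subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,process_name", "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)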

@Ga-Lee
Author

Ga-Lee commented Jun 9, 2022

Yes. Actually, it happens after I replaced the teacher model with a pretrained ConvNeXt (loaded via timm) and the student model with AS-MLP (https://github.com/svip-lab/AS-MLP), using its models directory. Maybe the shift operation in it causes this error? I will check that part, thank you!

@wkcn
Contributor

wkcn commented Jun 9, 2022

You are welcome : )

I have checked the training script of Mini-DeiT. There is no problem with synchronization operations such as metric_logger.synchronize_between_processes() in engine.py.

Replacing the teacher model with ConvNeXt does not cause the problem.
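
For completeness: when this timeout does appear after swapping in a different student or teacher, the usual cause is that the ranks stop executing the same collectives in the same order, for example a rank-dependent branch, or parameters of the new model that never receive gradients so DDP's all-reduce does not complete on every rank (DistributedDataParallel's find_unused_parameters=True flag exists for that case). A tiny sketch of the anti-pattern and the safe form; the helper extra_metric is hypothetical and the process group is assumed to be initialized already:

import torch
import torch.distributed as dist

def extra_metric() -> torch.Tensor:
    # Hypothetical per-rank statistic, used only to illustrate the pattern.
    return torch.zeros(1, device="cuda")

# Anti-pattern (do NOT do this): only rank 0 enters the collective, so the
# other ranks block in their next collective and the NCCL watchdog eventually
# reports an ALLREDUCE timeout like the one in this issue.
#
#     if dist.get_rank() == 0:
#         dist.all_reduce(extra_metric())

# Safe pattern: every rank calls the same collective, in the same order.
dist.all_reduce(extra_metric())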

@Ga-Lee
Author

Ga-Lee commented Jun 10, 2022

Thank you for your help!
I re-ran ConvNeXt + AS-MLP last night. The only difference is that this time I used the virtual environment with the compiled RPE operations (although, as you know, I am not using a model from the Mini-DeiT project), and it has now been running for 30 epochs (it is still running) without the error that occurred before.
I feel pretty confused : ) I don't know which part accounts for the change. But at least the code now runs successfully and AS-MLP seems to be learning from its CNN teacher; I will look into the reason myself later.
Again, I appreciate your help!

@Ga-Lee Ga-Lee closed this as completed Jun 10, 2022
@wkcn
Contributor

wkcn commented Jun 10, 2022

You are welcome! : )
