MiniVit: Some NCCL operations have failed or timed out #96
When I try to run Mini-DeiT with 6 GPUs on the same node, training stops during one of the first few epochs with errors like:

```
[E ProcessGroupNCCL.cpp:587] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808699 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808705 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808703 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808731 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808750 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808749 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
```

Can you tell me what might cause this problem? Thank you a lot!
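The `Timeout(ms)=1800000` in that log is torch.distributed's default 30-minute collective timeout. If a step is merely slow rather than hung, the limit can be raised when the process group is created; a minimal sketch using the standard `init_process_group` API, assuming the usual rendezvous environment variables are provided by the launcher:

```python
import datetime
import torch.distributed as dist

# Raise the NCCL collective timeout from the default 30 minutes.
# This only helps steps that are legitimately slow; if the ranks have
# truly desynchronized, the watchdog will still fire, just later.
dist.init_process_group(
    backend="nccl",  # assumes MASTER_ADDR/MASTER_PORT etc. come from the launcher
    timeout=datetime.timedelta(hours=2),
)
```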
Comments
Hi @Ga-Lee, thanks for your interest in our work! Could you please share the following information?

```
nvcc --version
python -c "import torch; print('Torch:', torch.__version__, torch.version.cuda)"
python -c "import torchvision; print('TorchVision:', torchvision.__version__)"
python -c "import timm; print('TIMM:', timm.__version__)"
nvidia-smi | grep "CUDA Version"
nvidia-smi --list-gpus
```

And the training command you used.
Thank you for your help!! As for the command: I ran it successfully several days ago and trained for 50 epochs without any error, using another conda environment that was set up following the project's requirements.txt. The environment I am using now (where the error occurs) was cloned from that one, but I didn't recompile the operations afterwards. Could that cause the timeout error?
The CUDA, torchvision, timm, and GPU versions seem fine. Is there any other training program running on the machine? Can you reproduce the issue?
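When trying to reproduce, the standard PyTorch/NCCL debug switches make it much easier to see which rank stalls first; a minimal sketch, assuming PyTorch >= 1.10 for `TORCH_DISTRIBUTED_DEBUG`, set before the process group is initialized:

```python
import os

# Standard debug switches; they can equally be exported in the shell
# before launching. Set them before init_process_group is called.
os.environ["NCCL_DEBUG"] = "INFO"                 # verbose NCCL logs per rank
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # PyTorch >= 1.10
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"     # surface desyncs as errors
```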
Yes. Actually, it happens after I replaced the teacher model with a pretrained ConvNeXt (loaded via timm) and the student model with AS-MLP (https://github.com/svip-lab/AS-MLP), using its models directory. Maybe the shift operation in it causes this error? I will check that part, thank you!!!
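If the suspicion is the compiled shift extension (for example a stale build carried over from the cloned environment), one quick check is to exercise the op once on every rank before training, so a broken build fails fast instead of leaving the other ranks blocked in an all-reduce. A minimal sketch; the module name `shift_cuda` and its call signature are hypothetical placeholders for AS-MLP's actual extension:

```python
import torch

def smoke_test_shift(device: torch.device) -> None:
    # Hypothetical module and call; substitute AS-MLP's real compiled op.
    # A missing or stale build raises here, on every rank, instead of
    # hanging a collective for 30 minutes later in training.
    import shift_cuda  # placeholder name for the compiled extension
    x = torch.randn(2, 8, 16, 16, device=device)
    y = shift_cuda.shift(x)  # placeholder signature
    torch.cuda.synchronize(device)
    print(f"shift op OK on {device}, output shape {tuple(y.shape)}")
```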
You are welcome : ) I have checked the training script of Mini-DeiT; there is no problem with its synchronous (collective) operations, and replacing the teacher model with ConvNeXt does not cause the problem.
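For context, the usual way a training script does trigger this watchdog timeout is rank-dependent control flow around a collective; a minimal sketch of the anti-pattern, not code from Mini-DeiT:

```python
import torch
import torch.distributed as dist

def bad_step(loss: torch.Tensor) -> torch.Tensor:
    # Anti-pattern: whether this branch runs can differ per rank
    # (e.g. a NaN loss on one GPU only). Ranks that skip the collective
    # leave the others blocked in all_reduce until the NCCL watchdog
    # reports "Watchdog caught collective operation timeout".
    if torch.isfinite(loss):
        dist.all_reduce(loss)  # reached by only some ranks -> hang
    return loss
```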
Thank you for your help!
You are welcome! : )