
mmyolo 0.1.0 RTMDet reproduction error #139

Closed · 3 tasks done
huoshuai-dot opened this issue Oct 9, 2022 · 10 comments
Labels
question Further information is requested

Comments

@huoshuai-dot

Prerequisite

💬 Describe the reimplementation questions

When training RTMDet on multiple GPUs with dist_train.sh, the job hangs. Forcing an interrupt shows multiple worker threads stuck, and GPU memory is nowhere near full. Running train.py, or dist_train.sh restricted to a single GPU, trains normally.
YOLOX trains fine under the same setup; the problem only appears with RTMDet on multiple GPUs. How should I go about debugging this?
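
A minimal multi-GPU sanity check, independent of MMYOLO, can help separate a general NCCL/distributed-setup problem from an RTMDet-specific one. The sketch below assumes a torchrun launch; the file name and GPU count are illustrative.

```python
# nccl_sanity_check.py -- minimal multi-GPU check, independent of MMYOLO.
# Launch with, e.g.:  torchrun --nproc_per_node=2 nccl_sanity_check.py
# If this also hangs, the issue is in the NCCL / driver / distributed setup,
# not in the RTMDet config or dataloader.
import os

import torch
import torch.distributed as dist


def main():
    # torchrun exports LOCAL_RANK for every worker process.
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend='nccl')

    # A single all-reduce across all ranks; a hang here points at NCCL,
    # not at the model or the data pipeline.
    t = torch.ones(1, device=f'cuda:{local_rank}') * dist.get_rank()
    dist.all_reduce(t)
    print(f'rank {dist.get_rank()}: all_reduce sum = {t.item()}')

    dist.destroy_process_group()


if __name__ == '__main__':
    main()
```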

Environment

fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
sys.platform: linux
Python: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA GeForce GTX TITAN X
CUDA_HOME: :/usr/local/cuda-10.2
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.12.1
PyTorch compiling details: PyTorch built with:

  • GCC 9.3
  • C++ Version: 201402
  • Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 11.3
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  • CuDNN 8.3.2 (built against CUDA 11.5)
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

TorchVision: 0.13.1
OpenCV: 4.6.0
MMEngine: 0.1.0
MMCV: 2.0.0rc1
MMDetection: 3.0.0rc1
MMYOLO: 0.1.1+

Expected results

No response

Additional information

No response

@mm-assistant

mm-assistant bot commented Oct 9, 2022

We recommend using English or English & Chinese for issues so that we could have broader discussion.

@hhaAndroid
Collaborator

@huoshuai-dot Is it possible to reduce the batch size and try it?
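
For reference, lowering the per-GPU batch size is usually done with a small override config rather than by editing the base file. A sketch, in which the base config name and keys are assumptions that should be checked against the rtmdet config actually in use:

```python
# rtmdet_lower_bs.py -- hypothetical override config for a smaller per-GPU batch size.
_base_ = './rtmdet_l_syncbn_8xb32-300e_coco.py'  # assumed base config name

train_dataloader = dict(
    batch_size=8,    # smaller per-GPU batch size for the experiment
    num_workers=2,   # fewer dataloader workers also helps isolate data-loading hangs
)
```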

hhaAndroid added the question (Further information is requested) label on Oct 9, 2022
@huoshuai-dot
Author

> @huoshuai-dot Is it possible to reduce the batch size and try it?

With a single GPU, a batch size of 16 works fine. When the multi-GPU run hangs, GPU memory sits at around 500 MB the whole time, whereas single-GPU training uses far more memory, so it feels as if something in the data loading is causing the hang. Reducing the batch size does not make the problem go away.

@huoshuai-dot
Author

@hhaAndroid One more observation: after I comment out the pretrained model (i.e. do not use the ImageNet-pretrained model to initialize the weights), training is able to proceed. Could this be related to loading the pretrained model?
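
For context, the ImageNet-pretrained backbone weights are normally hooked into an RTMDet-style config through init_cfg, roughly as below; the keys follow the usual MMDet/MMYOLO pattern and may differ in detail, and the checkpoint URL is left elided.

```python
# Sketch of the pretrained-backbone hook in an RTMDet-style config
# (keys assumed from the common MMDet/MMYOLO pattern; URL elided).
model = dict(
    backbone=dict(
        init_cfg=dict(
            type='Pretrained',
            prefix='backbone.',
            checkpoint='...',  # ImageNet-pretrained backbone checkpoint
        )))
```

Commenting this out skips downloading and loading the checkpoint at startup, which is the step bypassed in the experiment described above.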

@hhaAndroid
Collaborator

@huoshuai-dot I have not encountered this situation. Can you upload your training log?

@hhaAndroid
Collaborator

@huoshuai-dot Is it possible that the pytorch version is too high? Can you consider switching to pytorch1.9 and try it?

@huoshuai-dot
Author

> @huoshuai-dot Is it possible that the pytorch version is too high? Can you consider switching to pytorch1.9 and try it?

My torch version is 1.12.1, which is indeed rather high. I will set up the Python environment following the configuration in the README and try again.

@huoshuai-dot
Author

> @huoshuai-dot Is it possible that the pytorch version is too high? Can you consider switching to pytorch1.9 and try it?

@hhaAndroid I switched to torch 1.10 following the README and the problem is still there. What else could be the cause?

@huoshuai-dot
Author

@hhaAndroid Hi, yesterday I set up the Docker image and ran an example, and hit the same problem. This time, after hanging for a long time, it reported the following error:
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=6, timeout=0:30:00)
Traceback (most recent call last):
  File "./tools/train.py", line 106, in <module>
    main()
  File "./tools/train.py", line 95, in main
    runner = Runner.from_cfg(cfg)
  File "/opt/conda/lib/python3.7/site-packages/mmengine/runner/runner.py", line 458, in from_cfg
    cfg=cfg,
  File "/opt/conda/lib/python3.7/site-packages/mmengine/runner/runner.py", line 345, in __init__
    self.setup_env(env_cfg)
  File "/opt/conda/lib/python3.7/site-packages/mmengine/runner/runner.py", line 644, in setup_env
    init_dist(self.launcher, **dist_cfg)
  File "/opt/conda/lib/python3.7/site-packages/mmengine/dist/utils.py", line 56, in init_dist
    _init_dist_pytorch(backend, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/mmengine/dist/utils.py", line 86, in _init_dist_pytorch
    torch_dist.init_process_group(backend=backend, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
    rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=6, timeout=0:30:00)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1565) of binary: /opt/conda/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
I am not sure how to debug this; any suggestions?
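
One way to get more detail before the 30-minute barrier timeout fires is to rerun the same two-GPU job with verbose NCCL and torch.distributed logging enabled. A sketch, in which the config path is a placeholder:

```python
# rerun_with_nccl_debug.py -- relaunch the 2-GPU training run with verbose logging.
# The environment variables are standard NCCL / PyTorch ones; the config path
# below is a placeholder for the RTMDet config actually being used.
import os
import subprocess

env = dict(os.environ)
env.update({
    'NCCL_DEBUG': 'INFO',                 # NCCL prints transport / ring setup details
    'NCCL_DEBUG_SUBSYS': 'INIT,NET',      # focus on init and network subsystems
    'TORCH_DISTRIBUTED_DEBUG': 'DETAIL',  # extra checks from torch.distributed
})

subprocess.run(
    ['bash', './tools/dist_train.sh', 'configs/rtmdet/<your_rtmdet_config>.py', '2'],
    env=env,
    check=False,
)
```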

@hhaAndroid
Collaborator

NCCL driver issue, resolved
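
Since the root cause turned out to be the NCCL/driver setup, a quick check of the NCCL and CUDA versions that PyTorch itself reports can confirm the fix; the snippet uses only standard torch APIs.

```python
# env_check.py -- print the CUDA / cuDNN / NCCL versions PyTorch was built with.
import torch

print('torch:', torch.__version__)
print('CUDA runtime (build):', torch.version.cuda)
print('cuDNN:', torch.backends.cudnn.version())
print('NCCL:', torch.cuda.nccl.version())
print('GPUs visible:', torch.cuda.device_count())
```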
