
mmyolo 0.1.0 RTMDet reproduction error #139

Closed · 3 tasks done
huoshuai-dot opened this issue Oct 9, 2022 · 10 comments
Labels
question Further information is requested

Comments

@huoshuai-dot

Prerequisite

💬 Describe the reimplementation questions

When training RTMDet on multiple GPUs with dist_train.sh, the job hangs. Forcing an interrupt shows multiple worker threads stuck, and GPU memory is nowhere near full. Running train.py, or dist_train.sh restricted to a single GPU, trains normally.
YOLOX trains fine under the same setup; the problem only appears with RTMDet on multiple GPUs. How should I go about debugging this?
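
A minimal multi-GPU sanity check, independent of MMYOLO, can help separate a general NCCL/distributed-setup problem from an RTMDet-specific one. The sketch below assumes a torchrun launch; the file name and GPU count are illustrative.

```python
# nccl_sanity_check.py -- minimal multi-GPU check, independent of MMYOLO.
# Launch with, e.g.:  torchrun --nproc_per_node=2 nccl_sanity_check.py
# If this also hangs, the issue is in the NCCL / driver / distributed setup,
# not in the RTMDet config or dataloader.
import os

import torch
import torch.distributed as dist


def main():
    # torchrun exports LOCAL_RANK for every worker process.
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend='nccl')

    # A single all-reduce across all ranks; a hang here points at NCCL,
    # not at the model or the data pipeline.
    t = torch.ones(1, device=f'cuda:{local_rank}') * dist.get_rank()
    dist.all_reduce(t)
    print(f'rank {dist.get_rank()}: all_reduce sum = {t.item()}')

    dist.destroy_process_group()


if __name__ == '__main__':
    main()
```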

Environment

fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
sys.platform: linux
Python: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA GeForce GTX TITAN X
CUDA_HOME: :/usr/local/cuda-10.2
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.12.1
PyTorch compiling details: PyTorch built with:

  • GCC 9.3
  • C++ Version: 201402
  • Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 11.3
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  • CuDNN 8.3.2 (built against CUDA 11.5)
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

TorchVision: 0.13.1
OpenCV: 4.6.0
MMEngine: 0.1.0
MMCV: 2.0.0rc1
MMDetection: 3.0.0rc1
MMYOLO: 0.1.1+

Expected results

No response

Additional information

No response

@mm-assistant

mm-assistant bot commented Oct 9, 2022

We recommend using English or English & Chinese for issues so that we could have broader discussion.

@hhaAndroid
Collaborator

@huoshuai-dot Is it possible to reduce the batch size and try it?
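
For reference, lowering the per-GPU batch size is usually done with a small override config rather than by editing the base file. A sketch, in which the base config name and keys are assumptions that should be checked against the rtmdet config actually in use:

```python
# rtmdet_lower_bs.py -- hypothetical override config for a smaller per-GPU batch size.
_base_ = './rtmdet_l_syncbn_8xb32-300e_coco.py'  # assumed base config name

train_dataloader = dict(
    batch_size=8,    # smaller per-GPU batch size for the experiment
    num_workers=2,   # fewer dataloader workers also helps isolate data-loading hangs
)
```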

hhaAndroid added the question (Further information is requested) label on Oct 9, 2022
@huoshuai-dot
Author

> @huoshuai-dot Is it possible to reduce the batch size and try it?

With a single GPU, a batch size of 16 works fine. When the multi-GPU run hangs, GPU memory sits at around 500 MB the whole time, whereas single-GPU training uses far more memory, so it feels as if something in the data loading is causing the hang. Reducing the batch size does not make the problem go away.

@huoshuai-dot
Author

@hhaAndroid One more observation: after I comment out the pretrained model (i.e. do not use the ImageNet-pretrained model to initialize the weights), training is able to proceed. Could this be related to loading the pretrained model?
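
For context, the ImageNet-pretrained backbone weights are normally hooked into an RTMDet-style config through init_cfg, roughly as below; the keys follow the usual MMDet/MMYOLO pattern and may differ in detail, and the checkpoint URL is left elided.

```python
# Sketch of the pretrained-backbone hook in an RTMDet-style config
# (keys assumed from the common MMDet/MMYOLO pattern; URL elided).
model = dict(
    backbone=dict(
        init_cfg=dict(
            type='Pretrained',
            prefix='backbone.',
            checkpoint='...',  # ImageNet-pretrained backbone checkpoint
        )))
```

Commenting this out skips downloading and loading the checkpoint at startup, which is the step bypassed in the experiment described above.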

@hhaAndroid
Collaborator

@huoshuai-dot I have not encountered this situation. Can you upload your training log?

@hhaAndroid
Collaborator

@huoshuai-dot Is it possible that the pytorch version is too high? Can you consider switching to pytorch1.9 and try it?

@huoshuai-dot
Author

> @huoshuai-dot Is it possible that the pytorch version is too high? Can you consider switching to pytorch1.9 and try it?

My torch version is 1.12.1, which is indeed rather high. I will set up the Python environment following the configuration in the README and try again.

@huoshuai-dot
Author

> @huoshuai-dot Is it possible that the pytorch version is too high? Can you consider switching to pytorch1.9 and try it?

@hhaAndroid I switched to torch 1.10 following the README and the problem is still there. What else could be the cause?

@huoshuai-dot
Author

@hhaAndroid Hi, yesterday I set up the Docker image and ran an example, and hit the same problem. This time, after hanging for a long time, it reported the following error:
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=6, timeout=0:30:00)
Traceback (most recent call last):
  File "./tools/train.py", line 106, in <module>
    main()
  File "./tools/train.py", line 95, in main
    runner = Runner.from_cfg(cfg)
  File "/opt/conda/lib/python3.7/site-packages/mmengine/runner/runner.py", line 458, in from_cfg
    cfg=cfg,
  File "/opt/conda/lib/python3.7/site-packages/mmengine/runner/runner.py", line 345, in __init__
    self.setup_env(env_cfg)
  File "/opt/conda/lib/python3.7/site-packages/mmengine/runner/runner.py", line 644, in setup_env
    init_dist(self.launcher, **dist_cfg)
  File "/opt/conda/lib/python3.7/site-packages/mmengine/dist/utils.py", line 56, in init_dist
    _init_dist_pytorch(backend, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/mmengine/dist/utils.py", line 86, in _init_dist_pytorch
    torch_dist.init_process_group(backend=backend, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
    rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=6, timeout=0:30:00)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1565) of binary: /opt/conda/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
I am not sure how to debug this; any suggestions?
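
One way to get more detail before the 30-minute barrier timeout fires is to rerun the same two-GPU job with verbose NCCL and torch.distributed logging enabled. A sketch, in which the config path is a placeholder:

```python
# rerun_with_nccl_debug.py -- relaunch the 2-GPU training run with verbose logging.
# The environment variables are standard NCCL / PyTorch ones; the config path
# below is a placeholder for the RTMDet config actually being used.
import os
import subprocess

env = dict(os.environ)
env.update({
    'NCCL_DEBUG': 'INFO',                 # NCCL prints transport / ring setup details
    'NCCL_DEBUG_SUBSYS': 'INIT,NET',      # focus on init and network subsystems
    'TORCH_DISTRIBUTED_DEBUG': 'DETAIL',  # extra checks from torch.distributed
})

subprocess.run(
    ['bash', './tools/dist_train.sh', 'configs/rtmdet/<your_rtmdet_config>.py', '2'],
    env=env,
    check=False,
)
```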

@hhaAndroid
Collaborator

NCCL driver issue, resolved
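
Since the root cause turned out to be the NCCL/driver setup, a quick check of the NCCL and CUDA versions that PyTorch itself reports can confirm the fix; the snippet uses only standard torch APIs.

```python
# env_check.py -- print the CUDA / cuDNN / NCCL versions PyTorch was built with.
import torch

print('torch:', torch.__version__)
print('CUDA runtime (build):', torch.version.cuda)
print('cuDNN:', torch.backends.cudnn.version())
print('NCCL:', torch.cuda.nccl.version())
print('GPUs visible:', torch.cuda.device_count())
```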
