AssertionError: Default process group is not initialized #22

Closed
bushou-yhh opened this issue Oct 16, 2020 · 4 comments
Labels
help wanted Extra attention is needed

Comments

@bushou-yhh

[10/16 15:51:32 pods.engine.trainer]: Starting training from iteration 0
ERROR [10/16 15:51:32 pods.engine.trainer]: Exception during training:
Traceback (most recent call last):
File "/media/sda6/yhh/FCOS/BorderDet/cvpods/engine/trainer.py", line 89, in train
self.run_step()
File "/media/sda6/yhh/FCOS/BorderDet/cvpods/engine/trainer.py", line 193, in run_step
loss_dict = self.model(data)
File "/home/yons/anaconda3/envs/borderdet140-yhh/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in call
result = self.forward(*input, **kwargs)
File "/media/sda6/yhh/FCOS/BorderDet/cvpods/modeling/meta_arch/borderdet.py", line 167, in forward
bd_box_delta,
File "/media/sda6/yhh/FCOS/BorderDet/cvpods/modeling/meta_arch/borderdet.py", line 237, in losses
dist.all_reduce(num_foreground)
File "/home/yons/anaconda3/envs/borderdet140-yhh/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 901, in all_reduce
_check_default_pg()
File "/home/yons/anaconda3/envs/borderdet140-yhh/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 193, in _check_default_pg
"Default process group is not initialized"
AssertionError: Default process group is not initialized
[10/16 15:51:32 pods.engine.hooks]: Total training time: 0:00:00 (0:00:00 on hooks)
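
For context, the failure comes from torch.distributed.all_reduce being called while no default process group exists: with num_gpus=1 the launcher never runs init_process_group, so the all_reduce on num_foreground in losses() asserts. A minimal sketch of how such a reduction can be guarded (a hypothetical helper, not the actual BorderDet code):

import torch
import torch.distributed as dist

def reduce_sum(tensor):
    # All-reduce only when a default process group is actually initialized;
    # in a plain single-process run, return the tensor unchanged.
    if dist.is_available() and dist.is_initialized() and dist.get_world_size() > 1:
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    return tensor

num_foreground = reduce_sum(torch.tensor(128.0))  # example value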

@bushou-yhh
Author

soft link to Outputs/model_logs/cvpods_playground/detection/coco/borderdet/borderdet.res101.fpn.coco.800size.2x
Command Line Args: Namespace(dist_url='tcp://127.0.0.1:50152', eval_only=False, machine_rank=0, num_gpus=1, num_machines=1, opts=[], resume=False)
[10/16 16:29:02 cvpods]: Rank of current process: 0. World size: 1
[10/16 16:29:02 cvpods]: Environment info:


sys.platform linux
Python 3.7.9 (default, Aug 31 2020, 12:42:55) [GCC 7.3.0]
numpy 1.18.2
cvpods 0.1 @/media/sda6/yhh/FCOS/BorderDet/cvpods
cvpods compiler GCC 5.5
cvpods CUDA compiler 10.1
cvpods arch flags sm_61
cvpods_ENV_MODULE
PyTorch 1.4.0 @/home/yons/anaconda3/envs/borderdet140-yhh/lib/python3.7/site-packages/torch
PyTorch debug build False
CUDA available True
GPU 0 GeForce GTX 1080 Ti
CUDA_HOME /usr/local/cuda-10.1
NVCC Cuda compilation tools, release 10.1, V10.1.105
Pillow 6.2.0
torchvision 0.5.0 @/home/yons/anaconda3/envs/borderdet140-yhh/lib/python3.7/site-packages/torchvision
torchvision arch flags sm_35, sm_50, sm_60, sm_70, sm_75
cv2 4.4.0


PyTorch built with:

  • GCC 7.3
  • Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CUDA Runtime 10.1
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  • CuDNN 7.6.3
  • Magma 2.5.1
  • Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,

[10/16 16:29:02 cvpods]: Command line arguments: Namespace(dist_url='tcp://127.0.0.1:50152', eval_only=False, machine_rank=0, num_gpus=1, num_machines=1, opts=[], resume=False)

@bushou-yhh
Author

In cvpods/engine/launch.py, I changed
if world_size > 1:
# pytorch/pytorch#14391
# TODO prctl in spawned processes
to
if world_size >= 1:
# pytorch/pytorch#14391
# TODO prctl in spawned processes

and it works now.
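
For reference, that change forces the launcher through the spawn/distributed path even with a single GPU, so a default process group is created before training and the later dist.all_reduce call no longer asserts. A minimal sketch of what that single-process initialization amounts to (hypothetical, not the actual cvpods launch code):

import torch.distributed as dist

# Single-worker process group: world_size=1, rank=0.
# The init_method mirrors the dist_url shown in the logs above.
dist.init_process_group(
    backend="nccl",                       # "gloo" also works on CPU-only setups
    init_method="tcp://127.0.0.1:50152",
    world_size=1,
    rank=0,
)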

@FateScript
Member

Well, it seems you are using a single GPU during training? Such an error shouldn't happen. Could you please provide the command you are using?

@Maycbj Maycbj added bug Something isn't working help wanted Extra attention is needed and removed bug Something isn't working labels Oct 20, 2020
@FateScript
Member

Since the reporter hasn't replied for a week, we are closing this issue.
