AssertionError: Default process group is not initialized #22

Closed
bushou-yhh opened this issue Oct 16, 2020 · 4 comments
Labels
help wanted Extra attention is needed

Comments

@bushou-yhh

[10/16 15:51:32 pods.engine.trainer]: Starting training from iteration 0
ERROR [10/16 15:51:32 pods.engine.trainer]: Exception during training:
Traceback (most recent call last):
File "/media/sda6/yhh/FCOS/BorderDet/cvpods/engine/trainer.py", line 89, in train
self.run_step()
File "/media/sda6/yhh/FCOS/BorderDet/cvpods/engine/trainer.py", line 193, in run_step
loss_dict = self.model(data)
File "/home/yons/anaconda3/envs/borderdet140-yhh/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in call
result = self.forward(*input, **kwargs)
File "/media/sda6/yhh/FCOS/BorderDet/cvpods/modeling/meta_arch/borderdet.py", line 167, in forward
bd_box_delta,
File "/media/sda6/yhh/FCOS/BorderDet/cvpods/modeling/meta_arch/borderdet.py", line 237, in losses
dist.all_reduce(num_foreground)
File "/home/yons/anaconda3/envs/borderdet140-yhh/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 901, in all_reduce
_check_default_pg()
File "/home/yons/anaconda3/envs/borderdet140-yhh/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 193, in _check_default_pg
"Default process group is not initialized"
AssertionError: Default process group is not initialized
[10/16 15:51:32 pods.engine.hooks]: Total training time: 0:00:00 (0:00:00 on hooks)
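
For context, the failure comes from torch.distributed.all_reduce being called while no default process group exists: with num_gpus=1 the launcher never runs init_process_group, so the all_reduce on num_foreground in losses() asserts. A minimal sketch of how such a reduction can be guarded (a hypothetical helper, not the actual BorderDet code):

import torch
import torch.distributed as dist

def reduce_sum(tensor):
    # All-reduce only when a default process group is actually initialized;
    # in a plain single-process run, return the tensor unchanged.
    if dist.is_available() and dist.is_initialized() and dist.get_world_size() > 1:
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    return tensor

num_foreground = reduce_sum(torch.tensor(128.0))  # example value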

@bushou-yhh
Author

soft link to Outputs/model_logs/cvpods_playground/detection/coco/borderdet/borderdet.res101.fpn.coco.800size.2x
Command Line Args: Namespace(dist_url='tcp://127.0.0.1:50152', eval_only=False, machine_rank=0, num_gpus=1, num_machines=1, opts=[], resume=False)
[10/16 16:29:02 cvpods]: Rank of current process: 0. World size: 1
[10/16 16:29:02 cvpods]: Environment info:


sys.platform linux
Python 3.7.9 (default, Aug 31 2020, 12:42:55) [GCC 7.3.0]
numpy 1.18.2
cvpods 0.1 @/media/sda6/yhh/FCOS/BorderDet/cvpods
cvpods compiler GCC 5.5
cvpods CUDA compiler 10.1
cvpods arch flags sm_61
cvpods_ENV_MODULE
PyTorch 1.4.0 @/home/yons/anaconda3/envs/borderdet140-yhh/lib/python3.7/site-packages/torch
PyTorch debug build False
CUDA available True
GPU 0 GeForce GTX 1080 Ti
CUDA_HOME /usr/local/cuda-10.1
NVCC Cuda compilation tools, release 10.1, V10.1.105
Pillow 6.2.0
torchvision 0.5.0 @/home/yons/anaconda3/envs/borderdet140-yhh/lib/python3.7/site-packages/torchvision
torchvision arch flags sm_35, sm_50, sm_60, sm_70, sm_75
cv2 4.4.0


PyTorch built with:

  • GCC 7.3
  • Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CUDA Runtime 10.1
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  • CuDNN 7.6.3
  • Magma 2.5.1
  • Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,

[10/16 16:29:02 cvpods]: Command line arguments: Namespace(dist_url='tcp://127.0.0.1:50152', eval_only=False, machine_rank=0, num_gpus=1, num_machines=1, opts=[], resume=False)

@bushou-yhh
Author

In cvpods/engine/launch.py, I changed
if world_size > 1:
# pytorch/pytorch#14391
# TODO prctl in spawned processes
to
if world_size >= 1:
# pytorch/pytorch#14391
# TODO prctl in spawned processes

and it works now.
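
For reference, that change forces the launcher through the spawn/distributed path even with a single GPU, so a default process group is created before training and the later dist.all_reduce call no longer asserts. A minimal sketch of what that single-process initialization amounts to (hypothetical, not the actual cvpods launch code):

import torch.distributed as dist

# Single-worker process group: world_size=1, rank=0.
# The init_method mirrors the dist_url shown in the logs above.
dist.init_process_group(
    backend="nccl",                       # "gloo" also works on CPU-only setups
    init_method="tcp://127.0.0.1:50152",
    world_size=1,
    rank=0,
)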

@FateScript
Member

Well, it seems you are using a single GPU during training? Such an error shouldn't happen. Could you please provide the command you are using?

@Maycbj Maycbj added bug Something isn't working help wanted Extra attention is needed and removed bug Something isn't working labels Oct 20, 2020
@FateScript
Member

Since the reporter hasn't replied for a week, we are closing this issue.
