multi-gpu training problem #1424

Closed
wakananai opened this issue Sep 20, 2019 · 2 comments

Comments

@wakananai

wakananai commented Sep 20, 2019

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug
I ran the command CUDA_VISIBLE_DEVICES=0,1,2,3 ./tools/dist_train.sh configs/ssd512_coco.py 4 --validate to start multi-GPU training. After the messages about loading the COCO annotations, the run fails with the error messages shown in the Error traceback section below.

It seems to be an OpenMP problem, but I do not know how to solve it.

Reproduction

  1. What command or script did you run?
    CUDA_VISIBLE_DEVICES=0,1,2,3 ./tools/dist_train.sh configs/ssd512_coco.py 4 --validate
  2. Did you make any modifications on the code or config? Did you understand what you have modified?
    No, I used the default code for training.
  3. What dataset did you use?
    coco2017

Environment

  • OS: Ubuntu 18.04.1
  • GCC: 7.3.0
  • PyTorch version: 1.2.0
  • How you installed PyTorch: conda
  • GPU model: V100
  • CUDA and CUDNN version: CUDA 10.1 / CUDNN 7.6.2
  • [optional] Other information that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)

Error traceback

OMP: Error #13: Assertion failure at z_Linux_util.cpp(2361).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see http://www.intel.com/software/products/support/.
OMP: Error #13: Assertion failure at z_Linux_util.cpp(2361).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see http://www.intel.com/software/products/support/.
OMP: Error #13: Assertion failure at z_Linux_util.cpp(2361).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see http://www.intel.com/software/products/support/.
OMP: Error #13: Assertion failure at z_Linux_util.cpp(2361).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see http://www.intel.com/software/products/support/.

Bug fix
The linked issue reports errors with intel-openmp=2019.5.
The suggested solution is to downgrade intel-openmp with:
conda install -y intel-openmp=2019.4
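
Before downgrading, it may help to confirm that the environment actually has the affected version. The following is a minimal sketch, not part of mmdetection: the helper names (installed_version, needs_downgrade, query_conda) are hypothetical, and it assumes that conda is on PATH and that conda list prints the package name and version as the first two whitespace-separated columns.

```python
import subprocess

# Version reported as failing in this thread (assumption: only this one is affected).
KNOWN_BAD = "2019.5"


def installed_version(conda_list_output, package="intel-openmp"):
    """Return the version of `package` found in `conda list` text, or None."""
    for line in conda_list_output.splitlines():
        fields = line.split()
        # Skip blank lines and the '#' header lines that conda prints.
        if not fields or fields[0].startswith("#"):
            continue
        if fields[0] == package and len(fields) >= 2:
            return fields[1]
    return None


def needs_downgrade(version):
    """True when the installed version is the one reported as failing."""
    return version == KNOWN_BAD


def query_conda(package="intel-openmp"):
    """Run `conda list <package>` in the active environment and return its output."""
    result = subprocess.run(
        ["conda", "list", package],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

With the training environment activated, needs_downgrade(installed_version(query_conda())) returning True suggests trying the conda install -y intel-openmp=2019.4 command above.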

@lsongx

lsongx commented Sep 22, 2019

Hi @wakananai ,

I got the same error and solved it with conda install -y intel-openmp=2019.4, following this issue.

@wakananai
Author

Hi @LcDog ,

After downgrading intel-openmp with conda install -y intel-openmp=2019.4, the multi-GPU training code runs without errors.

Thank you for your kind assistance.

liuhuiCNN pushed a commit to liuhuiCNN/mmdetection that referenced this issue May 21, 2021
* add tools/anchor_cluster.py