Background information
I'm trying to run Open MPI with Horovod and it breaks during MPI_Init(). I think it's something to do with PMI.
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
openmpi-4.1.0rc2
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Tarball distribution from open-mpi.org. I did rerun perl autogen.pl in order to pick up an MCA component I'm working on.
Please describe the system on which you are running
GCC 8.3, CUDA 10.1
- Operating system/version: CentOS7
- Computer hardware: Intel Ivy Bridge
- Network type: InfiniBand
Details of the problem
I'm trying to get Horovod (a deep learning tool built on Python + MPI) to run with Open MPI, and something is breaking during MPI_Init(). I want to say it's something to do with the PMI layer.
I've been able to run programs from the OSU micro-benchmarks, which are single-threaded, without any issues. Each Horovod process spawns a background thread, and those background threads are the ones responsible for calling MPI_Init(); I think my issue has something to do with that.
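To illustrate the pattern, here is a stripped-down reproducer I put together (my own sketch, not Horovod's actual code): MPI_Init_thread() is called with MPI_THREAD_MULTIPLE from a spawned background thread instead of the main thread, which is roughly what Horovod does.

```c
/* Minimal sketch of the pattern described above (hypothetical, not Horovod code):
 * MPI_Init_thread() is invoked from a spawned pthread rather than main(). */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

static void *init_from_thread(void *arg)
{
    int provided = 0;

    /* Horovod requests MPI_THREAD_MULTIPLE; initialization happens here. */
    MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, &provided);

    int rank = -1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d initialized with thread level %d\n", rank, provided);

    MPI_Finalize();
    return NULL;
}

int main(void)
{
    /* Spawn a background thread and let it do all the MPI work. */
    pthread_t tid;
    pthread_create(&tid, NULL, init_from_thread, NULL);
    pthread_join(tid, NULL);
    return 0;
}
```

I can build and run this with mpicc and mpirun if a self-contained test case is useful.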
There is a pre-installed copy of Open MPI 3.2.1 on the cluster, which runs Horovod without any issues. I've tried linking against the libpmi.so it uses, but it still doesn't work.
I've pasted the error message I get below.
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_ess_init failed
--> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_mpi_init: ompi_rte_init failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)