
orte_ess_init failed while trying to run Horovod application #8193

@BKitor

Description


Background information

I'm trying to run Open MPI with Horovod and it breaks during MPI_Init(). I think it has something to do with PMI.

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

openmpi-4.1.0rc2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Installed from the distribution tarball from open-mpi.org. I did re-run perl autogen.pl in order to pick up an MCA component I'm working on.

Please describe the system on which you are running

GCC 8.3, CUDA 10.1

  • Operating system/version: CentOS 7
  • Computer hardware: Intel Ivy Bridge
  • Network type: InfiniBand

Details of the problem

I'm trying to get Horovod (a deep learning tool built on Python + MPI) to run with Open MPI, and something is breaking during MPI_Init(). My guess is that it's something to do with the PMI layer.

I've been able to run single-threaded programs from the OSU microbenchmarks without any issues. Each Horovod process spawns a background thread, and it is those threads that are responsible for calling MPI_Init(); I think my issue has something to do with that (see the sketch below).
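
To make that threading pattern concrete, here is a minimal standalone sketch of how I understand the initialization happens: the main thread spawns a worker thread, and the worker calls MPI_Init_thread() requesting MPI_THREAD_MULTIPLE. This is only my approximation of the pattern, not Horovod's actual code, and I haven't verified that it triggers the same failure:

```c
/* sketch.c - approximation of Horovod's init pattern, NOT Horovod source:
 * the background thread, not main(), performs MPI initialization. */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

static void *background_init(void *arg)
{
    int provided = 0;

    /* Horovod requests MPI_THREAD_MULTIPLE; in the real run the reported
     * failure happens during this call. */
    MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, &provided);

    int rank = -1, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d/%d initialized, provided thread level %d\n",
           rank, size, provided);

    MPI_Finalize();
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, background_init, NULL);
    pthread_join(&t, NULL);
    return 0;
}
```

I would compile this with mpicc and launch it with mpirun the same way I launch the Horovod script.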

There is a pre-installed copy of Open MPI 3.2.1 on the cluster which runs Horovod without any issues. I've tried linking against the libpmi.so it uses, but that still doesn't work.

I've pasted the error message I get below.

--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
