v4.x: Using --map-by core:span causes error in MPI_Init #7867

@lyu

Description

Background information

What version of Open MPI are you using?

4.0.4, built from the release tarball. I can't test the git version since building it requires pandoc.

Describe how Open MPI was installed

Open MPI was configured with:
./configure --prefix=$PREFIX --without-verbs --with-hwloc=internal
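For completeness, the rest of a typical tarball build after that configure step looks roughly like this (a sketch; the parallelism level and environment setup are my usual defaults, not part of the original report):

make -j 8
make install
# make the new install visible to the shell and the runtime linker
export PATH=$PREFIX/bin:$PATH
export LD_LIBRARY_PATH=$PREFIX/lib:$LD_LIBRARY_PATH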

Please describe the system on which you are running

  • Operating system/version: CentOS Linux release 7.5.1804
  • Computer hardware: Intel Xeon CPU E5-2690 v3
  • Network type: Mellanox Technologies MT27600 [Connect-IB]

A few more systems were tested (laptops, ARM machines), and they all show the same issue.


Details of the problem

Compile the following MPI program:

#include "mpi.h"

/* Minimal reproducer: the failure happens inside MPI_Init itself. */
int main(void) {
    MPI_Init(NULL, NULL);
    MPI_Finalize();
    return 0;
}
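It can be built with the Open MPI compiler wrapper, e.g. (the source file name repro.c is just illustrative):

mpicc repro.c -o a.out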

Run it with:
mpirun -n 4 --map-by core:span ./a.out

Error message:

[login2:55588] PMIX ERROR: NOT-FOUND in file dstore_base.c at line 2866
[login2:55588] PMIX ERROR: NOT-FOUND in file server/pmix_server.c at line 3408
[login2:55592] PMIX ERROR: OUT-OF-RESOURCE in file client/pmix_client.c at line 231
[login2:55592] OPAL ERROR: Error in file pmix3x_client.c at line 112
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[login2:55592] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[21660,1],0]
  Exit code:    1
--------------------------------------------------------------------------
[login2:55588] PMIX ERROR: ERROR in file gds_ds21_lock_pthread.c at line 99
[login2:55588] PMIX ERROR: ERROR in file gds_ds21_lock_pthread.c at line 99

I tested this on both the login node and a compute node; the error persists as long as I pass --map-by core:span or --map-by numa:span to mpirun.

Open MPI 4.0.3 works fine.
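For triage, the contrasting invocations side by side (a sketch; the plain core/numa mappings are assumed to work, as the note above implies, and only the :span variants were observed to fail):

mpirun -n 4 --map-by core ./a.out        # no :span modifier - no error observed
mpirun -n 4 --map-by numa ./a.out        # no :span modifier - no error observed
mpirun -n 4 --map-by core:span ./a.out   # fails in MPI_Init
mpirun -n 4 --map-by numa:span ./a.out   # fails in MPI_Init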
