Background information
What version of Open MPI are you using?
4.0.4, installed from the release tarball. I can't test the git version, since building from git requires pandoc.
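For reference, the version of the installation actually picked up at run time can be double-checked with standard Open MPI tooling (nothing specific to this report):

```shell
# Both commands report the version of the Open MPI installation on PATH.
mpirun --version
ompi_info | grep "Open MPI:"
```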
Describe how Open MPI was installed
Open MPI configuration:
```shell
./configure --prefix=$PREFIX --without-verbs --with-hwloc=internal
```
Please describe the system on which you are running
- Operating system/version: CentOS Linux release 7.5.1804
- Computer hardware: Intel Xeon CPU E5-2690 v3
- Network type: Mellanox Technologies MT27600 [Connect-IB]
I also tested a few more systems (laptops, ARM machines); they all show the same issue.
Details of the problem
Compile the following MPI program:
```c
#include <mpi.h>

int main(void) {
    MPI_Init(NULL, NULL);
    MPI_Finalize();
    return 0;
}
```
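Any standard wrapper-compiler invocation will do; for instance, with the mpicc wrapper from the installation under test (test.c is a placeholder name for the file above):

```shell
# mpicc without -o produces ./a.out, matching the mpirun command below.
mpicc test.c
```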
Run it with:
```shell
mpirun -n 4 --map-by core:span ./a.out
```
Error message:
```
[login2:55588] PMIX ERROR: NOT-FOUND in file dstore_base.c at line 2866
[login2:55588] PMIX ERROR: NOT-FOUND in file server/pmix_server.c at line 3408
[login2:55592] PMIX ERROR: OUT-OF-RESOURCE in file client/pmix_client.c at line 231
[login2:55592] OPAL ERROR: Error in file pmix3x_client.c at line 112
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[login2:55592] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[21660,1],0]
Exit code: 1
--------------------------------------------------------------------------
[login2:55588] PMIX ERROR: ERROR in file gds_ds21_lock_pthread.c at line 99
[login2:55588] PMIX ERROR: ERROR in file gds_ds21_lock_pthread.c at line 99
```
I tested this on both the login node and the compute node; the error persists as long as I pass --map-by core:span or --map-by numa:span to mpirun. Open MPI 4.0.3 works fine.
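To isolate the trigger, a minimal A/B pair: based on the behaviour described above, the first command is expected to pass (I have not included its output here), while the second reproduces the failure:

```shell
# Expected to pass: same mapping policy without the span modifier.
mpirun -n 4 --map-by core ./a.out
# Fails on 4.0.4 with the PMIX NOT-FOUND errors shown above.
mpirun -n 4 --map-by core:span ./a.out
```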