Description
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v4.1.1
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
- using EasyBuild v4.4.2
- using tarball on top of EasyBuild module file
   1) GCCcore/10.3.0                    8) libpciaccess/0.16-GCCcore-10.3.0
   2) zlib/1.2.11-GCCcore-10.3.0        9) hwloc/2.4.1-GCCcore-10.3.0
   3) binutils/2.36.1-GCCcore-10.3.0   10) OpenSSL/1.1
   4) GCC/10.3.0                       11) libevent/2.1.12-GCCcore-10.3.0
   5) numactl/2.0.14-GCCcore-10.3.0    12) UCX/1.10.0-GCCcore-10.3.0
   6) XZ/5.2.5-GCCcore-10.3.0          13) libfabric/1.12.1-GCCcore-10.3.0
   7) libxml2/2.9.10-GCCcore-10.3.0    14) PMIx/3.2.3-GCCcore-10.3.0
- using tarball on top of system gcc
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
no
Please describe the system on which you are running
- Operating system/version: CentOS 7.9
- Computer hardware: Dell C6320 with dual Xeon E5-2640 v3
- Network type: QLogic Corp. IBA7322 QDR InfiniBand HCA
Installed Packages
Name : infinipath-psm
Arch : x86_64
Version : 3.3
Release : 26_g604758e_open.2.el7
Size : 434 k
Repo : installed
From repo : anaconda
Summary : QLogic PSM Libraries
URL : https://www.openfabrics.org/
License : BSD or GPLv2
Description : The PSM Messaging API, or PSM API, is QLogic's low-level
: user-level communications interface for the Truescale
: family of products. PSM users are enabled with mechanisms
: necessary to implement higher level communications
: interfaces in parallel environments.
Details of the problem
- configure, make and make install run without error.
- ompi_info shows the expected mtl, btl and pml components:
ompi_info | grep -e mtl -e btl -e pml
MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.1.1)
MCA btl: ofi (MCA v2.1.0, API v3.1.0, Component v4.1.1)
MCA btl: openib (MCA v2.1.0, API v3.1.0, Component v4.1.1)
MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.1.1)
MCA btl: uct (MCA v2.1.0, API v3.1.0, Component v4.1.1)
MCA btl: usnic (MCA v2.1.0, API v3.1.0, Component v4.1.1)
MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.1.1)
MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA mtl: ofi (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA mtl: psm (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA pml: v (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA pml: cm (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA pml: monitoring (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA pml: ob1 (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.1.1)
- Compilation of the examples is OK.
When executed, every example hits an assertion failure:
lpouillo@host:~/openmpi-4.1.1/examples$ mpirun -np 2 ./hello_c
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: host
Local device: qib0
--------------------------------------------------------------------------
Hello, world, I am 0 of 2, (Open MPI v4.1.1, package: Open MPI lpouillo@host Distribution, ident: 4.1.1, repo rev: v4.1.1, Apr 24, 2021, 118)
Hello, world, I am 1 of 2, (Open MPI v4.1.1, package: Open MPI lpouillo@host Distribution, ident: 4.1.1, repo rev: v4.1.1, Apr 24, 2021, 118)
host.17542Assertion failure at psm_ep.c:1074: ep->mctxt_master == ep
host.17543Assertion failure at psm_ep.c:1074: ep->mctxt_master == ep
[host:17542] *** Process received signal ***
[host:17542] Signal: Aborted (6)
[host:17542] Signal code: (-6)
[host:17542] [ 0] /lib64/libpthread.so.0(+0xf630)[0x7f95a370f630]
[host:17542] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f95a33683d7]
[host:17542] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f95a3369ac8]
[host:17542] [ 3] /lib64/libpsm_infinipath.so.1(+0x19b4a)[0x7f95955e3b4a]
[host:17542] [ 4] /lib64/libpsm_infinipath.so.1(+0x19fb1)[0x7f95955e3fb1]
[host:17542] [ 5] /lib64/libpsm_infinipath.so.1(psm_ep_close+0x3cf)[0x7f95955e2b1f]
[host:17542] [ 6] /lib64/libpsm_infinipath.so.1(__psm_finalize+0x4e)[0x7f95955ea92e]
[host:17542] [ 7] /home/lpouillo/openmpi_system/lib/openmpi/mca_mtl_psm.so(ompi_mtl_psm_finalize+0x40)[0x7f9589bc5d60]
[host:17542] [ 8] /home/lpouillo/openmpi_system/lib/libmpi.so.40(ompi_mpi_finalize+0x628)[0x7f95a396f218]
[host:17542] [haswell-t16-44:17543] *** Process received signal ***
[host:17543] Signal: Aborted (6)
[host:17543] Signal code: (-6)
[ 9] ./hello_c[0x4008af]
[host:17542] [10] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f95a3354555]
[host:17542] [11] ./hello_c[0x400759]
[host:17542] *** End of error message ***
[host:17543] [ 0] /lib64/libpthread.so.0(+0xf630)[0x7fe41c8f0630]
[host:17543] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7fe41c5493d7]
[host:17543] [ 2] /lib64/libc.so.6(abort+0x148)[0x7fe41c54aac8]
[host:17543] [ 3] /lib64/libpsm_infinipath.so.1(+0x19b4a)[0x7fe40a608b4a]
[host:17543] [ 4] /lib64/libpsm_infinipath.so.1(+0x19fb1)[0x7fe40a608fb1]
[host:17543] [ 5] /lib64/libpsm_infinipath.so.1(psm_ep_close+0x3cf)[0x7fe40a607b1f]
[host:17543] [ 6] /lib64/libpsm_infinipath.so.1(__psm_finalize+0x4e)[0x7fe40a60f92e]
[host:17543] [ 7] /home/lpouillo/openmpi_system/lib/openmpi/mca_mtl_psm.so(ompi_mtl_psm_finalize+0x40)[0x7fe402d88d60]
[host:17543] [ 8] /home/lpouillo/openmpi_system/lib/libmpi.so.40(ompi_mpi_finalize+0x628)[0x7fe41cb50218]
[host:17543] [ 9] ./hello_c[0x4008af]
[host:17543] [10] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fe41c535555]
[host:17543] [11] ./hello_c[0x400759]
[host:17543] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node haswell-t16-44 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
[haswell-t16-44:17537] 1 more process has sent help message help-mpi-btl-openib.txt / error in device init
[haswell-t16-44:17537] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
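Because the output of the two ranks is interleaved above, a small script can help read each backtrace in order. The following is a minimal Python sketch that groups backtrace frames by PID and prints the chain of named symbols for each process. The sample lines are copied from the output above; the names `frames_by_pid`, `symbol` and `FRAME_RE` are illustrative assumptions and not part of Open MPI or PSM. Lines that lost their `[host:PID]` prefix through interleaving are simply skipped.

```python
import re
from collections import defaultdict

# Matches backtrace lines such as:
# [host:17542] [ 5] /lib64/libpsm_infinipath.so.1(psm_ep_close+0x3cf)[0x7f95955e2b1f]
FRAME_RE = re.compile(
    r"^\[[^:\]]+:(?P<pid>\d+)\] \[\s*(?P<idx>\d+)\] (?P<frame>\S+)$"
)


def frames_by_pid(lines):
    """Group backtrace frames by PID, ordered by frame index."""
    grouped = defaultdict(dict)
    for line in lines:
        m = FRAME_RE.match(line.strip())
        if m:
            grouped[int(m.group("pid"))][int(m.group("idx"))] = m.group("frame")
    return {
        pid: [frames[i] for i in sorted(frames)]
        for pid, frames in grouped.items()
    }


def symbol(frame):
    """Return the symbol in 'lib(symbol+0xoff)[addr]', or None if unnamed."""
    m = re.search(r"\((\w+)\+0x[0-9a-f]+\)", frame)
    return m.group(1) if m else None


# Sample lines copied from the mpirun output in this report.
SAMPLE = """\
[host:17542] [ 5] /lib64/libpsm_infinipath.so.1(psm_ep_close+0x3cf)[0x7f95955e2b1f]
[host:17542] [ 6] /lib64/libpsm_infinipath.so.1(__psm_finalize+0x4e)[0x7f95955ea92e]
[host:17542] [ 7] /home/lpouillo/openmpi_system/lib/openmpi/mca_mtl_psm.so(ompi_mtl_psm_finalize+0x40)[0x7f9589bc5d60]
[host:17542] [ 8] /home/lpouillo/openmpi_system/lib/libmpi.so.40(ompi_mpi_finalize+0x628)[0x7f95a396f218]
[host:17543] [ 5] /lib64/libpsm_infinipath.so.1(psm_ep_close+0x3cf)[0x7fe40a607b1f]
""".splitlines()

# Print the innermost-to-outermost chain of named symbols per process.
for pid, frames in frames_by_pid(SAMPLE).items():
    print(pid, " <- ".join(s for s in map(symbol, frames) if s))
```

For the frames of PID 17542 this gives the chain psm_ep_close, __psm_finalize, ompi_mtl_psm_finalize, ompi_mpi_finalize, which matches the backtrace above: the abort is reached from ompi_mpi_finalize through the PSM MTL finalize path.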
The example compiles and runs perfectly with Open MPI 3.1.1.
Thank you in advance for any ideas on how to solve this issue.
Best regards,
Laurent