Assertion failure on MPI_Finalize with libpsm_infinipath #9386

@lpouillo

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.1.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

  • using EasyBuild v4.4.2
  • using the tarball, on top of these EasyBuild modules:
      1) GCCcore/10.3.0                   8) libpciaccess/0.16-GCCcore-10.3.0
      2) zlib/1.2.11-GCCcore-10.3.0       9) hwloc/2.4.1-GCCcore-10.3.0
      3) binutils/2.36.1-GCCcore-10.3.0  10) OpenSSL/1.1
      4) GCC/10.3.0                      11) libevent/2.1.12-GCCcore-10.3.0
      5) numactl/2.0.14-GCCcore-10.3.0   12) UCX/1.10.0-GCCcore-10.3.0
      6) XZ/5.2.5-GCCcore-10.3.0         13) libfabric/1.12.1-GCCcore-10.3.0
      7) libxml2/2.9.10-GCCcore-10.3.0   14) PMIx/3.2.3-GCCcore-10.3.0
  • using the tarball, on top of the system GCC

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

no

Please describe the system on which you are running

  • Operating system/version: CentOS 7.9
  • Computer hardware: Dell C6320 with dual Xeon E5-2640 v3
  • Network type: QLogic Corp. IBA7322 QDR InfiniBand HCA
    Installed Packages
    Name : infinipath-psm
    Arch : x86_64
    Version : 3.3
    Release : 26_g604758e_open.2.el7
    Size : 434 k
    Repo : installed
    From repo : anaconda
    Summary : QLogic PSM Libraries
    URL : https://www.openfabrics.org/
    License : BSD or GPLv2
    Description : The PSM Messaging API, or PSM API, is QLogic's low-level
    : user-level communications interface for the Truescale
    : family of products. PSM users are enabled with mechanisms
    : necessary to implement higher level communications
    : interfaces in parallel environments.

Details of the problem

  • configure, make and make install run without error.
  • ompi_info shows the expected mtl, btl and pml components:
ompi_info | grep -e mtl -e btl -e pml
                 MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.1.1)
                 MCA btl: ofi (MCA v2.1.0, API v3.1.0, Component v4.1.1)
                 MCA btl: openib (MCA v2.1.0, API v3.1.0, Component v4.1.1)
                 MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.1.1)
                 MCA btl: uct (MCA v2.1.0, API v3.1.0, Component v4.1.1)
                 MCA btl: usnic (MCA v2.1.0, API v3.1.0, Component v4.1.1)
                 MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.1.1)
                MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v4.1.1)
                 MCA mtl: ofi (MCA v2.1.0, API v2.0.0, Component v4.1.1)
                 MCA mtl: psm (MCA v2.1.0, API v2.0.0, Component v4.1.1)
                 MCA pml: v (MCA v2.1.0, API v2.0.0, Component v4.1.1)
                 MCA pml: cm (MCA v2.1.0, API v2.0.0, Component v4.1.1)
                 MCA pml: monitoring (MCA v2.1.0, API v2.0.0, Component v4.1.1)
                 MCA pml: ob1 (MCA v2.1.0, API v2.0.0, Component v4.1.1)
                 MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.1.1)
  • The examples compile without error.

Every example aborts at run time with an assertion failure inside MPI_Finalize:

lpouillo@host:~/openmpi-4.1.1/examples$ mpirun -np 2 ./hello_c
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   host
  Local device: qib0
--------------------------------------------------------------------------
Hello, world, I am 0 of 2, (Open MPI v4.1.1, package: Open MPI lpouillo@host Distribution, ident: 4.1.1, repo rev: v4.1.1, Apr 24, 2021, 118)
Hello, world, I am 1 of 2, (Open MPI v4.1.1, package: Open MPI lpouillo@host Distribution, ident: 4.1.1, repo rev: v4.1.1, Apr 24, 2021, 118)
host.17542Assertion failure at psm_ep.c:1074: ep->mctxt_master == ep
host.17543Assertion failure at psm_ep.c:1074: ep->mctxt_master == ep
[host:17542] *** Process received signal ***
[host:17542] Signal: Aborted (6)
[host:17542] Signal code:  (-6)
[host:17542] [ 0] /lib64/libpthread.so.0(+0xf630)[0x7f95a370f630]
[host:17542] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f95a33683d7]
[host:17542] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f95a3369ac8]
[host:17542] [ 3] /lib64/libpsm_infinipath.so.1(+0x19b4a)[0x7f95955e3b4a]
[host:17542] [ 4] /lib64/libpsm_infinipath.so.1(+0x19fb1)[0x7f95955e3fb1]
[host:17542] [ 5] /lib64/libpsm_infinipath.so.1(psm_ep_close+0x3cf)[0x7f95955e2b1f]
[host:17542] [ 6] /lib64/libpsm_infinipath.so.1(__psm_finalize+0x4e)[0x7f95955ea92e]
[host:17542] [ 7] /home/lpouillo/openmpi_system/lib/openmpi/mca_mtl_psm.so(ompi_mtl_psm_finalize+0x40)[0x7f9589bc5d60]
[host:17542] [ 8] /home/lpouillo/openmpi_system/lib/libmpi.so.40(ompi_mpi_finalize+0x628)[0x7f95a396f218]
[host:17542] [haswell-t16-44:17543] *** Process received signal ***
[host:17543] Signal: Aborted (6)
[host:17543] Signal code:  (-6)
[ 9] ./hello_c[0x4008af]
[host:17542] [10] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f95a3354555]
[host:17542] [11] ./hello_c[0x400759]
[host:17542] *** End of error message ***
[host:17543] [ 0] /lib64/libpthread.so.0(+0xf630)[0x7fe41c8f0630]
[host:17543] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7fe41c5493d7]
[host:17543] [ 2] /lib64/libc.so.6(abort+0x148)[0x7fe41c54aac8]
[host:17543] [ 3] /lib64/libpsm_infinipath.so.1(+0x19b4a)[0x7fe40a608b4a]
[host:17543] [ 4] /lib64/libpsm_infinipath.so.1(+0x19fb1)[0x7fe40a608fb1]
[host:17543] [ 5] /lib64/libpsm_infinipath.so.1(psm_ep_close+0x3cf)[0x7fe40a607b1f]
[host:17543] [ 6] /lib64/libpsm_infinipath.so.1(__psm_finalize+0x4e)[0x7fe40a60f92e]
[host:17543] [ 7] /home/lpouillo/openmpi_system/lib/openmpi/mca_mtl_psm.so(ompi_mtl_psm_finalize+0x40)[0x7fe402d88d60]
[host:17543] [ 8] /home/lpouillo/openmpi_system/lib/libmpi.so.40(ompi_mpi_finalize+0x628)[0x7fe41cb50218]
[host:17543] [ 9] ./hello_c[0x4008af]
[host:17543] [10] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fe41c535555]
[host:17543] [11] ./hello_c[0x400759]
[host:17543] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node haswell-t16-44 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
[haswell-t16-44:17537] 1 more process has sent help message help-mpi-btl-openib.txt / error in device init
[haswell-t16-44:17537] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
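
Because the output of the two ranks is interleaved, the backtrace is easier to read once the frames are grouped by PID. The script below is only a rough sketch of how that could be done. The regular expression and the helper name `frames_by_pid` are assumptions made for illustration and are not part of Open MPI or PSM. The sample lines are copied from the output above.

```python
import re
from collections import defaultdict

# A few frames copied verbatim from the backtrace above.
LOG = """\
[host:17542] [ 5] /lib64/libpsm_infinipath.so.1(psm_ep_close+0x3cf)[0x7f95955e2b1f]
[host:17542] [ 6] /lib64/libpsm_infinipath.so.1(__psm_finalize+0x4e)[0x7f95955ea92e]
[host:17542] [ 7] /home/lpouillo/openmpi_system/lib/openmpi/mca_mtl_psm.so(ompi_mtl_psm_finalize+0x40)[0x7f9589bc5d60]
[host:17542] [ 8] /home/lpouillo/openmpi_system/lib/libmpi.so.40(ompi_mpi_finalize+0x628)[0x7f95a396f218]
[host:17543] [ 5] /lib64/libpsm_infinipath.so.1(psm_ep_close+0x3cf)[0x7fe40a607b1f]
"""

# Assumed frame layout: "[host:PID] [ N] /path/lib.so(symbol+0xoff)[0xaddr]".
# The "(symbol+0xoff)" part is optional, and the symbol may be empty.
FRAME_RE = re.compile(
    r"\[[^\]:]+:(?P<pid>\d+)\]\s+\[\s*(?P<idx>\d+)\]\s+"
    r"(?P<lib>[^\s(\[]+)"
    r"(?:\((?P<sym>[^)+]*)(?:\+0x[0-9a-f]+)?\))?"
)


def frames_by_pid(log):
    """Group backtrace frames by PID as (index, library, symbol) tuples.

    Lines that do not match the assumed layout (for example frames that
    lost their "[host:PID]" prefix through interleaving) are skipped.
    """
    grouped = defaultdict(list)
    for line in log.splitlines():
        match = FRAME_RE.search(line)
        if match is None:
            continue
        library = match["lib"].rsplit("/", 1)[-1]
        symbol = match["sym"] or "?"  # unresolved frames have no symbol name
        grouped[int(match["pid"])].append((int(match["idx"]), library, symbol))
    return {pid: sorted(frames) for pid, frames in grouped.items()}


if __name__ == "__main__":
    for pid, frames in frames_by_pid(LOG).items():
        print(pid)
        for index, library, symbol in frames:
            print(f"  #{index} {library}: {symbol}")
```

Read from the highest frame down, the grouped frames for PID 17542 give the call chain ompi_mpi_finalize, ompi_mtl_psm_finalize, __psm_finalize, psm_ep_close, which ends at the assertion in psm_ep.c.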

The same example compiles and runs without error with Open MPI 3.1.1.

Thank you in advance for any ideas on how to solve this issue.

Best regards,

Laurent
