Cray CXI SHS11.1 and openmpi@main fail with intra-node communication #13148

@germanne

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

Branch main, 10 March 2025

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Installed via Spack, using OSS libfabric with CXI support. Configure arguments:

--enable-shared --disable-silent-rules --disable-sphinx --enable-builtin-atomics --disable-static --with-slingshot --enable-mpi1-compatibility --without-psm --without-psm2 --without-fca --without-cma --without-knem --with-xpmem=/usr --without-hcoll --without-mxm --with-ofi=/afs/psi.ch/sys/spack/develop/opt/spack/testing/[padded-to-256-chars]/linux-sles15-aarch64/gcc-14.2.0/libfabric-main-iegtmu74ojgb2pvywqfrlzvedlcz7cps --without-ucc --without-ucx --without-verbs --with-cray-xpmem --without-sge --without-alps --without-loadleveler --without-tm --with-slurm --without-lsf --disable-memchecker --with-libevent=/afs/psi.ch/sys/spack/develop/opt/spack/testing/[padded-to-256-chars]/linux-sles15-aarch64/gcc-7.5.0/libevent-2.1.12-p5qzh7hez5qdbtl7j3avjhf4mry5fm4n --without-lustre --with-pmix=internal --with-zlib=/afs/psi.ch/sys/spack/develop/opt/spack/testing/[padded-to-256-chars]/linux-sles15-aarch64/gcc-7.5.0/zlib-ng-2.2.3-o7xbhxhnqpk5ljcpjmgple6rv3i75z2h --with-hwloc=/afs/psi.ch/sys/spack/develop/opt/spack/testing/[padded-to-256-chars]/linux-sles15-aarch64/gcc-14.2.0/hwloc-2.11.1-3jxzkohocpqjyvd2irytycprfi2bom5q --disable-java --disable-mpi-java --disable-io-romio --with-gpfs=no --enable-dlopen --with-cuda=/afs/psi.ch/sys/spack/develop/opt/spack/testing/[padded-to-256-chars]/linux-sles15-aarch64/gcc-14.2.0/cuda-12.8.0-ne7ulo7g6hhe7dv5nhh4nxchdenls3r2 --with-cuda-libdir=/afs/psi.ch/sys/spack/develop/opt/spack/testing/[padded-to-256-chars]/linux-sles15-aarch64/gcc-14.2.0/cuda-12.8.0-ne7ulo7g6hhe7dv5nhh4nxchdenls3r2/lib64/stubs --enable-wrapper-rpath --disable-wrapper-runpath --with-wrapper-ldflags=-Wl,-rpath,/afs/psi.ch/sys/spack/develop/opt/spack/unstable/linux-sles15-aarch64/gcc-7.5.0/gcc-14.2.0-tln2ck4lolcipi2fj2klu5dei3oac4sv/lib/gcc/aarch64-unknown-linux-gnu/14.2.0 -Wl,-rpath,/afs/psi.ch/sys/spack/develop/opt/spack/unstable/linux-sles15-aarch64/gcc-7.5.0/gcc-14.2.0-tln2ck4lolcipi2fj2klu5dei3oac4sv/lib64 CFLAGS=-DYY_BUF_SIZE=1048576 --disable-debug

spack spec:

-   scyqclc  openmpi@main%gcc@14.2.0+atomics+cuda~debug~gpfs~internal-hwloc~internal-libevent+internal-pmix~java~lustre~memchecker~openshmem~romio+rsh~static~two_level_namespace+vt+wrapper-rpath build_system=autotools cuda_arch=90 fabrics=ofi,xpmem romio-filesystem=none schedulers=slurm arch=linux-sles15-aarch64
[+]  mcdzcmr      ^autoconf@2.72%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  btyzacb      ^automake@1.16.5%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  ne7ulo7      ^cuda@12.8.0%gcc@14.2.0~allow-unsupported-compilers~dev build_system=generic arch=linux-sles15-aarch64
[+]  gzc3f4t          ^libxml2@2.13.5%gcc@7.5.0~http+pic~python+shared build_system=autotools arch=linux-sles15-aarch64
[+]  bu4jqoi              ^libiconv@1.17%gcc@7.5.0 build_system=autotools libs=shared,static arch=linux-sles15-aarch64
[+]  5ss23k5              ^pkg-config@0.29.2%gcc@7.5.0+internal_glib build_system=autotools arch=linux-sles15-aarch64
[e]  fkyyhdc              ^xz@5.2.3%gcc@7.5.0~pic build_system=autotools libs=shared,static arch=linux-sles15-aarch64
[+]  nyb2vfy      ^gcc-runtime@14.2.0%gcc@14.2.0 build_system=generic arch=linux-sles15-aarch64
[e]  3egpojh      ^glibc@2.31%gcc@14.2.0 build_system=autotools arch=linux-sles15-aarch64
[+]  hpibhrn      ^gnuconfig@2024-07-27%gcc@14.2.0 build_system=generic arch=linux-sles15-aarch64
[+]  3jxzkoh      ^hwloc@2.11.1%gcc@14.2.0~cairo+cuda~gl~level_zero~libudev+libxml2~nvml~opencl+pci~rocm build_system=autotools cuda_arch=90 libs=shared,static arch=linux-sles15-aarch64
[+]  4eajtzs          ^libpciaccess@0.17%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  txi65ob              ^gcc-runtime@7.5.0%gcc@7.5.0 build_system=generic arch=linux-sles15-aarch64
[+]  nwu26be              ^util-macros@1.20.1%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  wfg3cd7                  ^gcc-runtime@12.3%gcc@12.3 build_system=generic arch=linux-sles15-aarch64
[+]  zia4ebj          ^ncurses@6.5%gcc@7.5.0~symlinks+termlib abi=none build_system=autotools patches=7a351bc arch=linux-sles15-aarch64
[+]  p5qzh7h      ^libevent@2.1.12%gcc@7.5.0+openssl build_system=autotools arch=linux-sles15-aarch64
[+]  m3gwtgf          ^openssl@3.4.0%gcc@7.5.0~docs+shared build_system=generic certs=mozilla arch=linux-sles15-aarch64
[+]  3aq2syu              ^ca-certificates-mozilla@2023-05-30%gcc@7.5.0 build_system=generic arch=linux-sles15-aarch64
[e]  eyczfjv              ^perl@5.26.1%gcc@7.5.0+cpanm+opcode+open+shared+threads build_system=generic patches=0eac10e,8cf4302 arch=linux-sles15-aarch64
 -   zgkq6vw      ^libfabric@main%gcc@14.2.0+cuda~debug~kdreg~level_zero+uring build_system=autotools cuda_arch=90 fabrics=cxi,sockets,tcp,udp,xpmem arch=linux-sles15-aarch64
[+]  u5d4zw4          ^curl@8.11.1%gcc@7.5.0~gssapi~ldap~libidn2~librtmp~libssh~libssh2+nghttp2 build_system=autotools libs=shared,static tls=openssl arch=linux-sles15-aarch64
[+]  6mvnrnk              ^nghttp2@1.64.0%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  5t2hvib          ^json-c@0.16%gcc@7.5.0~ipo build_system=cmake build_type=Release generator=make arch=linux-sles15-aarch64
[+]  u2nmjzn              ^cmake@3.31.4%gcc@7.5.0~doc+ncurses+ownlibs~qtgui build_system=generic build_type=Release arch=linux-sles15-aarch64
 -   nhjhwto          ^libcxi@main%gcc@14.2.0+cuda~level_zero~rocm build_system=autotools arch=linux-sles15-aarch64
 -   5wgs3er              ^cassini-headers@main%gcc@14.2.0 build_system=generic arch=linux-sles15-aarch64
 -   u6zzijb              ^cxi-driver@main%gcc@14.2.0 build_system=generic arch=linux-sles15-aarch64
[+]  7cnxi2c              ^libconfig@1.7.3%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  rq33jed                  ^automake@1.16.5%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  dcjewuo                      ^autoconf@2.72%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  l6mjl5c                  ^gcc-runtime@14.2.0%gcc@14.2.0 build_system=generic arch=linux-sles15-aarch64
[+]  p7ozge3                  ^libtool@2.4.7%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  trzm7v5                      ^findutils@4.9.0%gcc@7.5.0 build_system=autotools patches=440b954 arch=linux-sles15-aarch64
[+]  jppuqwv                      ^m4@1.4.19%gcc@7.5.0+sigsegv build_system=autotools patches=9dc5fbd,bfdffa7 arch=linux-sles15-aarch64
[+]  ivhh3c7                          ^libsigsegv@2.14%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  oj3ovvr                  ^texinfo@7.1%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  laxgbis                      ^gcc-runtime@12.3%gcc@12.3 build_system=generic arch=linux-sles15-aarch64
[+]  fmk7hej                      ^ncurses@6.5%gcc@7.5.0~symlinks+termlib abi=none build_system=autotools patches=7a351bc arch=linux-sles15-aarch64
 -   pjzsvu4              ^libfuse@2.9.9%gcc@14.2.0~strip~system_install~useroot+utils build_system=meson buildtype=release default_library=shared arch=linux-sles15-aarch64
[+]  yczbssx                  ^meson@1.5.1%gcc@7.5.0 build_system=python_pip patches=0f0b1bd arch=linux-sles15-aarch64
[+]  zvoecxo                      ^py-pip@24.3.1%gcc@7.5.0 build_system=generic arch=linux-sles15-aarch64
[+]  qqyoi74                      ^py-setuptools@75.8.0%gcc@7.5.0 build_system=generic arch=linux-sles15-aarch64
[+]  wk6kswr                      ^py-wheel@0.41.2%gcc@7.5.0 build_system=generic arch=linux-sles15-aarch64
[+]  q7jjsue                      ^python@3.12.8%gcc@7.5.0+bz2+crypt+ctypes+dbm~debug+libxml2+lzma~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3+ssl~tkinter+uuid+zlib build_system=generic arch=linux-sles15-aarch64
[+]  3hlzxo5                          ^expat@2.6.4%gcc@7.5.0+libbsd build_system=autotools arch=linux-sles15-aarch64
[+]  lducxxr                              ^libbsd@0.10.0%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  qerkf3p                          ^gdbm@1.24%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  ds2kwc3                          ^libffi@3.4.6%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  mfeth7l                          ^readline@8.2%gcc@7.5.0 build_system=autotools patches=1ea4349,24f587b,3d9885e,5911a5b,622ba38,6c8adf8,758e2ec,79572ee,a177edc,bbf97f1,c7b45ff,e0013d9,e065038 arch=linux-sles15-aarch64
[+]  teiwdpd                          ^sqlite@3.46.0%gcc@7.5.0+column_metadata+dynamic_extensions+fts~functions+rtree build_system=autotools arch=linux-sles15-aarch64
[+]  suaqfjx                          ^util-linux-uuid@2.40.2%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  hjcwama                      ^python-venv@1.0%gcc@7.5.0 build_system=generic arch=linux-sles15-aarch64
[+]  saybo2v                  ^ninja@1.12.1%gcc@7.5.0~re2c build_system=generic arch=linux-sles15-aarch64
[+]  zq5s6y2                      ^gcc-runtime@13.2.0%gcc@13.2.0 build_system=generic arch=linux-sles15-aarch64
[+]  3bgmubm                      ^python@3.8.19%gcc@7.5.0~bz2~crypt+ctypes~dbm~debug+libxml2+lzma~nis~optimizations+pic~pyexpat+pythoncmd~readline+shared~sqlite3~ssl~tkinter~uuid+zlib build_system=generic patches=0d98e93,4c24573,ebdca64,f2fd060 arch=linux-sles15-aarch64
[e]  gnju5co                          ^gettext@0.20.2%gcc@7.5.0+bzip2+curses+git~libunistring+libxml2+pic+shared+tar+xz build_system=autotools arch=linux-sles15-aarch64
[+]  s2jrvfo                          ^zlib-ng@2.1.6%gcc@7.5.0+compat+new_strategies+opt+pic+shared build_system=autotools arch=linux-sles15-aarch64
[+]  qlqjhch                              ^gnuconfig@2022-09-17%gcc@12.3 build_system=generic arch=linux-sles15-aarch64
[+]  xnbrchw              ^libnl@3.3.0%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  3aey2ot                  ^flex@2.6.3%gcc@7.5.0+lex~nls build_system=autotools arch=linux-sles15-aarch64
[+]  3e23eaj              ^libuv@1.48.0%gcc@14.2.0 build_system=autotools arch=linux-sles15-aarch64
[+]  7sa6suu              ^libyaml@0.2.5%gcc@14.2.0 build_system=autotools arch=linux-sles15-aarch64
[+]  pnnsys3              ^lm-sensors@3-6-0%gcc@14.2.0 build_system=makefile arch=linux-sles15-aarch64
[+]  sawly4e                  ^bison@3.8.2%gcc@7.5.0~color build_system=autotools arch=linux-sles15-aarch64
[+]  ghouivi                      ^m4@1.4.19%gcc@7.5.0~sigsegv build_system=autotools patches=9dc5fbd,bfdffa7 arch=linux-sles15-aarch64
[+]  r2g4qhm                  ^flex@2.6.3%gcc@7.5.0+lex~nls build_system=autotools arch=linux-sles15-aarch64
[+]  5cb63ad          ^liburing@2.3%gcc@14.2.0 build_system=autotools arch=linux-sles15-aarch64
[+]  3gxsior      ^numactl@2.0.18%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  l2qugjv          ^autoconf@2.72%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  tvosith          ^automake@1.16.5%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  2xmogbm      ^openssh@9.9p1%gcc@7.5.0+gssapi build_system=autotools arch=linux-sles15-aarch64
[+]  7cgifzm          ^krb5@1.21.3%gcc@7.5.0+shared build_system=autotools arch=linux-sles15-aarch64
[+]  q7hhrig              ^bison@3.8.2%gcc@7.5.0~color build_system=autotools arch=linux-sles15-aarch64
[+]  obetosr          ^libedit@3.1-20240808%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  w6hnyfk          ^libxcrypt@4.4.35%gcc@7.5.0~obsolete_api build_system=autotools patches=4885da3 arch=linux-sles15-aarch64
[+]  iacvnhj      ^pkg-config@0.29.2%gcc@7.5.0+internal_glib build_system=autotools arch=linux-sles15-aarch64
[e]  2d7jkg5      ^slurm@24.05.3%gcc@7.5.0+cgroup~cray_shasta+gtk~hdf5+hwloc+mariadb+nvml+pam+pmix+readline+restd~rsmi build_system=autotools sysconfdir=PREFIX/etc arch=linux-sles15-aarch64
[e]  znxqplr      ^xpmem@2.9.6-1.1%gcc@14.2.0+kernel-module build_system=autotools arch=linux-sles15-aarch64

Please describe the system on which you are running

  • Operating system/version: SLES15 14.21-150500.55.65_13.0.73-cray_shasta_c_64k aarch64
  • Computer hardware: Grace Hopper GPU, aarch64
  • Network type: CXI, SHS11.1

Details of the problem

Multi-node jobs run without any problem. Jobs with multiple tasks on a single node fail with the following error:

[gpu001.merlin7.psi.ch:177843] [[46903,1],1] selected pml ob1, but peer [[46903,1],0] on gpu001 selected pml cm

ompi_info reports that the btl ofi component is present, but its initialization still fails on one of the two ranks (see the btl_base_verbose output below):

ompi_info
...
MCA btl: ofi (MCA v2.1.0, API v3.3.0, Component v5.1.0)
shell$ mpirun --mca btl_base_verbose 100 -np 2 osu_bw -d cuda D D
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: registering framework btl components
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: found loaded component self
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: component self register function successful
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: found loaded component ofi
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: component ofi register function successful
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: found loaded component sm
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: component sm register function successful
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: found loaded component tcp
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: component tcp register function successful
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: found loaded component smcuda
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: component smcuda register function successful
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: opening btl components
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: found loaded component self
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: component self open function successful
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: found loaded component ofi
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: component ofi open function successful
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: found loaded component sm
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: component sm open function successful
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: found loaded component tcp
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: component tcp open function successful
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: found loaded component smcuda
[gpu001.merlin7.psi.ch:177843] btl: smcuda: cuda_max_send_size=131072, max_send_size=32768, max_frag_size=131072
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: component smcuda open function successful
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: registering framework btl components
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: found loaded component self
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: component self register function successful
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: found loaded component ofi
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: component ofi register function successful
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: found loaded component sm
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: component sm register function successful
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: found loaded component tcp
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: component tcp register function successful
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: found loaded component smcuda
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: component smcuda register function successful
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: opening btl components
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: found loaded component self
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: component self open function successful
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: found loaded component ofi
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: component ofi open function successful
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: found loaded component sm
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: component sm open function successful
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: found loaded component tcp
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: component tcp open function successful
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: found loaded component smcuda
[gpu001.merlin7.psi.ch:177842] btl: smcuda: cuda_max_send_size=131072, max_send_size=32768, max_frag_size=131072
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: component smcuda open function successful
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: gpu001
  Location: mtl_ofi_component.c:1007
  Error: Function not implemented (70368744177702)
--------------------------------------------------------------------------
[gpu001.merlin7.psi.ch:177843] select: initializing btl component self
[gpu001.merlin7.psi.ch:177843] select: init of component self returned success
[gpu001.merlin7.psi.ch:177843] select: initializing btl component ofi
[gpu001.merlin7.psi.ch:177842] select: initializing btl component self
[gpu001.merlin7.psi.ch:177842] select: init of component self returned success
[gpu001.merlin7.psi.ch:177842] select: initializing btl component ofi
[gpu001.merlin7.psi.ch:177843] select: init of component ofi returned failure
[gpu001.merlin7.psi.ch:177842] select: init of component ofi returned success
[gpu001.merlin7.psi.ch:177842] select: initializing btl component sm
[gpu001.merlin7.psi.ch:177842] select: init of component sm returned success
[gpu001.merlin7.psi.ch:177842] select: initializing btl component tcp
[gpu001.merlin7.psi.ch:177842] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[gpu001.merlin7.psi.ch:177842] btl: tcp: Found match: 127.0.0.1 (lo)
[gpu001.merlin7.psi.ch:177842] btl: tcp: Using interface: sppp 
[gpu001.merlin7.psi.ch:177842] btl:tcp: 0x323ce000: if nmn0 kidx 2 cnt 0 addr 10.100.36.33 IPv4 bw 1000 lt 100
[gpu001.merlin7.psi.ch:177842] btl:tcp: 0x32860a90: if hsn0 kidx 3 cnt 0 addr 172.30.138.1 IPv4 bw 200000 lt 100
[gpu001.merlin7.psi.ch:177842] btl:tcp: 0x32861280: if hsn2 kidx 4 cnt 0 addr 172.30.138.3 IPv4 bw 200000 lt 100
[gpu001.merlin7.psi.ch:177842] btl:tcp: 0x32861b70: if hsn3 kidx 5 cnt 0 addr 172.30.138.4 IPv4 bw 200000 lt 100
[gpu001.merlin7.psi.ch:177842] btl:tcp: 0x32862380: if hsn1 kidx 6 cnt 0 addr 172.30.138.2 IPv4 bw 200000 lt 100
[gpu001.merlin7.psi.ch:177842] btl:tcp: Attempting to bind to AF_INET port 1024
[gpu001.merlin7.psi.ch:177842] btl:tcp: Successfully bound to AF_INET port 1024
[gpu001.merlin7.psi.ch:177842] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[gpu001.merlin7.psi.ch:177842] btl: tcp: exchange: 0 2 IPv4 10.100.36.33
[gpu001.merlin7.psi.ch:177842] btl: tcp: exchange: 1 3 IPv4 172.30.138.1
[gpu001.merlin7.psi.ch:177842] btl: tcp: exchange: 2 4 IPv4 172.30.138.3
[gpu001.merlin7.psi.ch:177842] btl: tcp: exchange: 3 5 IPv4 172.30.138.4
[gpu001.merlin7.psi.ch:177842] btl: tcp: exchange: 4 6 IPv4 172.30.138.2
[gpu001.merlin7.psi.ch:177842] select: init of component tcp returned success
[gpu001.merlin7.psi.ch:177842] select: initializing btl component smcuda
[gpu001.merlin7.psi.ch:177842] select: init of component smcuda returned success
[gpu001.merlin7.psi.ch:177843] mca: base: close: component ofi closed
[gpu001.merlin7.psi.ch:177843] mca: base: close: unloading component ofi
[gpu001.merlin7.psi.ch:177843] select: initializing btl component sm
[gpu001.merlin7.psi.ch:177843] select: init of component sm returned success
[gpu001.merlin7.psi.ch:177843] select: initializing btl component tcp
[gpu001.merlin7.psi.ch:177843] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[gpu001.merlin7.psi.ch:177843] btl: tcp: Found match: 127.0.0.1 (lo)
[gpu001.merlin7.psi.ch:177843] btl: tcp: Using interface: sppp 
[gpu001.merlin7.psi.ch:177843] btl:tcp: 0x37a72a80: if nmn0 kidx 2 cnt 0 addr 10.100.36.33 IPv4 bw 1000 lt 100
[gpu001.merlin7.psi.ch:177843] btl:tcp: 0x37a72f80: if hsn0 kidx 3 cnt 0 addr 172.30.138.1 IPv4 bw 200000 lt 100
[gpu001.merlin7.psi.ch:177843] btl:tcp: 0x37a73660: if hsn2 kidx 4 cnt 0 addr 172.30.138.3 IPv4 bw 200000 lt 100
[gpu001.merlin7.psi.ch:177843] btl:tcp: 0x37a6d130: if hsn3 kidx 5 cnt 0 addr 172.30.138.4 IPv4 bw 200000 lt 100
[gpu001.merlin7.psi.ch:177843] btl:tcp: 0x37a6da20: if hsn1 kidx 6 cnt 0 addr 172.30.138.2 IPv4 bw 200000 lt 100
[gpu001.merlin7.psi.ch:177843] btl:tcp: Attempting to bind to AF_INET port 1024
[gpu001.merlin7.psi.ch:177843] btl:tcp: Attempting to bind to AF_INET port 1025
[gpu001.merlin7.psi.ch:177843] btl:tcp: Successfully bound to AF_INET port 1025
[gpu001.merlin7.psi.ch:177843] btl:tcp: my listening v4 socket is 0.0.0.0:1025
[gpu001.merlin7.psi.ch:177843] btl: tcp: exchange: 0 2 IPv4 10.100.36.33
[gpu001.merlin7.psi.ch:177843] btl: tcp: exchange: 1 3 IPv4 172.30.138.1
[gpu001.merlin7.psi.ch:177843] btl: tcp: exchange: 2 4 IPv4 172.30.138.3
[gpu001.merlin7.psi.ch:177843] btl: tcp: exchange: 3 5 IPv4 172.30.138.4
[gpu001.merlin7.psi.ch:177843] btl: tcp: exchange: 4 6 IPv4 172.30.138.2
[gpu001.merlin7.psi.ch:177843] select: init of component tcp returned success
[gpu001.merlin7.psi.ch:177843] select: initializing btl component smcuda
[gpu001.merlin7.psi.ch:177843] select: init of component smcuda returned success
[gpu001.merlin7.psi.ch:177843] [[46903,1],1] selected pml ob1, but peer [[46903,1],0] on gpu001 selected pml cm
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another.  This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used.  Your MPI job will now abort.

You may wish to try to narrow down the problem;

 * Check the output of ompi_info to see which BTL/MTL plugins are
   available.
 * Run your application with MPI_THREAD_SINGLE.
 * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
   if using MTL-based communications) to see exactly which
   communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_mpi_instance_init failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
[gpu001:00000] *** An error occurred in MPI_Init
[gpu001:00000] *** reported by process [3073835009,281470681743361]
[gpu001:00000] *** on a NULL communicator
[gpu001:00000] *** Unknown error
[gpu001:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[gpu001:00000] ***    and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------
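
To make the asymmetry in the output above easier to see, here is a minimal Python sketch that summarizes, per process ID, which BTL components reported success or failure during init, extracts the two PML names from the mismatch message, and masks the large error value printed for fi_domain down to its low 32 bits. The function names (`summarize_btl_init`, `pml_mismatch`, `low_error_bits`) and the regular expressions are my own and only match the line formats seen in this log; the `LOG` string is an abridged copy of lines from the output above. The low 32 bits of 70368744177702 are 38, which is ENOSYS ("Function not implemented") on Linux. That may hint that only the low bits carry the real error code, but this is an observation, not a confirmed diagnosis.

```python
import re
from collections import defaultdict

# Abridged excerpt of the verbose output shown above.
LOG = """\
  Error: Function not implemented (70368744177702)
[gpu001.merlin7.psi.ch:177843] select: init of component self returned success
[gpu001.merlin7.psi.ch:177842] select: init of component self returned success
[gpu001.merlin7.psi.ch:177843] select: init of component ofi returned failure
[gpu001.merlin7.psi.ch:177842] select: init of component ofi returned success
[gpu001.merlin7.psi.ch:177842] select: init of component sm returned success
[gpu001.merlin7.psi.ch:177843] select: init of component sm returned success
[gpu001.merlin7.psi.ch:177843] [[46903,1],1] selected pml ob1, but peer [[46903,1],0] on gpu001 selected pml cm
"""

# These patterns are assumptions based only on the line formats in this log.
INIT_RE = re.compile(
    r"^\[[^:\]]+:(\d+)\] select: init of component (\w+) returned (success|failure)$"
)
PML_RE = re.compile(r"selected pml (\w+), but peer .* selected pml (\w+)")
ERR_RE = re.compile(r"Error: .*\((\d+)\)")


def summarize_btl_init(log):
    """Map each PID to a {component: 'success' | 'failure'} dict."""
    results = defaultdict(dict)
    for line in log.splitlines():
        match = INIT_RE.match(line.strip())
        if match:
            pid, component, status = match.groups()
            results[int(pid)][component] = status
    return dict(results)


def pml_mismatch(log):
    """Return (local_pml, peer_pml) from the mismatch message, or None."""
    match = PML_RE.search(log)
    return match.groups() if match else None


def low_error_bits(log):
    """Return the low 32 bits of the numeric value in the 'Error:' line, or None."""
    match = ERR_RE.search(log)
    return int(match.group(1)) & 0xFFFFFFFF if match else None


if __name__ == "__main__":
    for pid, components in sorted(summarize_btl_init(LOG).items()):
        print(pid, components)
    print("pml mismatch:", pml_mismatch(LOG))
    print("low 32 bits of error value:", low_error_bits(LOG))
```

Run against the full log, this shows that PID 177842 initialized btl ofi successfully while PID 177843 did not, which matches the two ranks ending up on different PMLs (cm and ob1).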

This looks very similar to issue #12038, but since I am using the main branch, that should already have been fixed.

Thanks a lot for any help in advance!
