UCX ERROR no active messages transport #4742

Closed
yimin-zhao opened this issue Feb 7, 2020 · 10 comments
@yimin-zhao

Describe the bug

When I try to run an MPI application (ior), it throws these error messages:

[1581060992.288391] [daishan:9817 :0]         select.c:438  UCX  ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy
Abort(1091471) on node 11 (rank 11 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)........:
MPID_Init(958)...............:
MPIDI_OFI_mpi_init_hook(1382): OFI get address vector map failed
[1581060992.288448] [daishan:9814 :0]         select.c:438  UCX  ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy
[1581060992.288452] [daishan:9815 :0]         select.c:438  UCX  ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy
[1581060992.288445] [daishan:9816 :0]         select.c:438  UCX  ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy
[1581060992.288448] [daishan:9818 :0]         select.c:438  UCX  ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy
[1581060992.288446] [daishan:9820 :0]         select.c:438  UCX  ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy
Abort(1091471) on node 13 (rank 13 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)........:
MPID_Init(958)...............:
MPIDI_OFI_mpi_init_hook(1382): OFI get address vector map failed
[1581060992.288452] [daishan:9821 :0]         select.c:438  UCX  ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy
[1581060992.288449] [daishan:9822 :0]         select.c:438  UCX  ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy
Abort(1091471) on node 15 (rank 15 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)........:
MPID_Init(958)...............:
MPIDI_OFI_mpi_init_hook(1382): OFI get address vector map failed

Steps to Reproduce

  • Command line
mpirun -n 16 -ppn 8 -f ./hostfile /home/user/Repository/io-500-dev/build/ior/src/ior -w -s 50000 -a MVFS --mvfs.sock=io500.sock -i 1 -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -o /opt/datafiles/io500.2020.02.07-15.41.54/ior_hard/IOR_file -O stoneWallingStatusFile=/opt/datafiles/io500.2020.02.07-15.41.54/ior_hard/stonewall -O stoneWallingWearOut=1 -D 20
  • UCX version used
    ucx 1.4 and ucx 1.7 (I found a similar question in this repo, so I switched to ucx 1.7, but I got the same errors)
  • Any UCX environment variables used
    No
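
A quick sanity check of the reported version and environment could look like this (a sketch; assumes ucx_info is on the PATH):

ucx_info -v          # print the UCX version and build configuration actually in use
env | grep '^UCX_'   # confirm that no UCX_* variables are exported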

Setup and versions

  • OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
    • CentOS 7.6, kernel 4.20
  • For RDMA/IB/RoCE related issues:
    • Driver version:
      • MLNX_OFED_LINUX-4.7-1.0.0.1
    • HW information
CA 'mlx5_0'
	CA type: MT4117
	Number of ports: 1
	Firmware version: 14.26.1040
	Hardware version: 0
	Node GUID: 0x506b4b0300494a2a
	System image GUID: 0x506b4b0300494a2a
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 25
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0x526b4bfffe494a2a
		Link layer: Ethernet
CA 'mlx5_1'
	CA type: MT4119
	Number of ports: 1
	Firmware version: 16.26.1040
	Hardware version: 0
	Node GUID: 0x98039b0300855d92
	System image GUID: 0x98039b0300855d92
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0x9a039bfffe855d92
		Link layer: Ethernet
CA 'mlx5_2'
	CA type: MT4119
	Number of ports: 1
	Firmware version: 16.26.1040
	Hardware version: 0
	Node GUID: 0x98039b0300855d93
	System image GUID: 0x98039b0300855d92
	Port 1:
		State: Down
		Physical state: Disabled
		Rate: 40
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0x9a039bfffe855d93
		Link layer: Ethernet

Additional information (depending on the issue)

  • OpenMPI version
    Intel MPI (Intel(R) MPI Library for Linux* OS, Version 2019 Update 6, Build 20191024)

  • Output of ucx_info -d to show transports and devices recognized by UCX
    ucx_info.txt.txt

At the very beginning, UCX 1.4 worked fine with Intel MPI. I ran into this error after I uninstalled OpenMPI and switched to Intel MPI; I don't know whether it's because I removed some necessary components during that procedure. In any case, when I installed Intel MPI it didn't warn me about anything.

Please let me know if you have any ideas.
Thanks in advance.

@yimin-zhao yimin-zhao added the Bug label Feb 7, 2020
@brminich
Contributor

brminich commented Feb 7, 2020

The errors come from Intel MPI -> OFI -> UCX.
It looks like inter-node transports, such as IB or TCP, are disabled (or not available?).
It's worth checking the Intel MPI and OFI settings, which could set some UCX environment variables.
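
A minimal sketch of such a check (the mpirun arguments mirror the reproducer above; I_MPI_DEBUG, FI_LOG_LEVEL and UCX_LOG_LEVEL are the standard debug variables of Intel MPI, libfabric/OFI and UCX):

export I_MPI_DEBUG=5        # Intel MPI prints its provider/transport selection at startup
export FI_LOG_LEVEL=debug   # libfabric (OFI) provider selection logging
export UCX_LOG_LEVEL=info   # UCX transport selection logging
mpirun -n 2 -ppn 1 -f ./hostfile hostname   # small run, just to inspect the startup output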

@dmitrygx
Member

dmitrygx commented Feb 7, 2020

ofiwg/libfabric#5281
It seems that the OFI community no longer supports libfabric over UCX.
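
A sketch of how to check whether the UCX-backed mlx provider is present in the libfabric build that Intel MPI picks up (fi_info ships with libfabric):

fi_info -l    # list the providers compiled into the libfabric on the library path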

@yosefe
Contributor

yosefe commented Feb 7, 2020

@yimin-zhao please see https://openucx.readthedocs.io/en/master/running.html for details on how to run OpenMPI with UCX.
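
That page describes Open MPI invocations of roughly this shape (a sketch; mlx5_0:1 is taken from the hardware listing above and may need to be adjusted to the active port):

mpirun -np 16 --mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 ./app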

@yimin-zhao
Author

ofiwg/libfabric#5281
It seems that the OFI community no longer supports libfabric over UCX.

So if I understand correctly, OFI removed the Mellanox (UCX-based) provider in the latest release, but how does that explain that I was able to run Intel MPI just before I installed OpenMPI? Anyway, thanks for the info.

@dmitrygx
Member

but how does that explain that I was able to run Intel MPI just before I installed OpenMPI

Actually, mixing two MPIs is not a good idea; it's better to run them in different terminals to avoid any potential errors.

@yimin-zhao do you have any other questions regarding OMPI or UCX? If not, is it OK to close the issue?

@yimin-zhao
Author

Sure, no more questions so far, I will just close this issue for now.

@dmitrygx
Member

Sure, no more questions so far, I will just close this issue for now.

thank you!

@ddurnov

ddurnov commented Feb 10, 2020

IMPI 2019 U6 uses the dc transport by default. Unfortunately, we faced a set of issues related to unexpected fallback from dc to ud at scale, so we had to force it. According to the provided ucx_info output, you don't have that transport available. You may try to set UCX_TLS=ud,sm,self.
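
For example, the variable can either be exported before launching or passed to every rank with Intel MPI's -genv option (a sketch; the trailing ior options are abbreviated, see the full command in the reproducer above):

export UCX_TLS=ud,sm,self
mpirun -n 16 -ppn 8 -f ./hostfile /home/user/Repository/io-500-dev/build/ior/src/ior ...

# or, equivalently
mpirun -genv UCX_TLS ud,sm,self -n 16 -ppn 8 -f ./hostfile /home/user/Repository/io-500-dev/build/ior/src/ior ...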

@yimin-zhao
Author

I see, will try it later. Thank you!

@jvdp1

jvdp1 commented Dec 1, 2020

FOR INFORMATION:

I tried to run a Co-array Fortran program with Intel oneAPI (2021.1.10.2477) and ran into an error that led me to this issue. Since I found my solution on this page, I thought I would write this message in case it helps other people. Please remove it if inappropriate.

On a Fedora 31 computer, I tried to run a simple distributed Co-array Fortran program on 4 images and got the following error message:

[1606848334.356770] [L0146910:5338 :0]         select.c:409  UCX  ERROR no active messages transport to <no debug data>: self/self - Destination is unreachable
[1606848334.356762] [L0146910:5339 :0]         select.c:409  UCX  ERROR no active messages transport to <no debug data>: self/self - Destination is unreachable
[1606848334.356767] [L0146910:5340 :0]         select.c:409  UCX  ERROR no active messages transport to <no debug data>: self/self - Destination is unreachable
[1606848334.356764] [L0146910:5341 :0]         select.c:409  UCX  ERROR no active messages transport to <no debug data>: self/self - Destination is unreachable

SOLUTION:
Exporting UCX_TLS=ud,sm,self, as suggested in ddurnov's comment above, solved the issue.
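
For completeness, that amounts to something like the following before launching the images (./my_caf_app is a placeholder for the compiled Co-array executable):

export UCX_TLS=ud,sm,self
./my_caf_app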
