Cannot open openib btl component #5280

Closed
vmaffione opened this issue Jun 16, 2018 · 34 comments

@vmaffione

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

v3.1.0

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Installed from precompiled distribution package.

Please describe the system on which you are running

  • Operating system/version:
    Linux, Archlinux
  • Computer hardware:
    x86_64, processor belonging to the Intel i7 family
  • Network type:
    Mellanox Technologies MT27500 Family [ConnectX-3]

Details of the problem

Hello,
I'm trying to run a simple (echo) MPI program over two hosts connected through a direct cable over InfiniBand.
Each host has a 40 Gbps Mellanox card with InfiniBand support. The InfiniBand/ibverbs/RDMA software is provided by the rdma-core package (https://github.com/linux-rdma/rdma-core).
(The two hosts can also communicate over a separate IP LAN.)
The InfiniBand setup is already working: I'm able to run libibverbs programs over the Mellanox NICs.

$ ibv_devinfo 
hca_id:	mlx4_0
	transport:			InfiniBand (0)
	fw_ver:				2.36.5000
	node_guid:			e41d:2d03:001e:68a0
	sys_image_guid:			e41d:2d03:001e:68a3
	vendor_id:			0x02c9
	vendor_part_id:			4099
	hw_ver:				0x1
	board_id:			MT_1090120019
	phys_port_cnt:			2
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			1
			port_lid:		1
			port_lmc:		0x00
			link_layer:		InfiniBand

		port:	2
			state:			PORT_DOWN (1)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			0
			port_lid:		0
			port_lmc:		0x00
			link_layer:		InfiniBand

However, when I try to run my MPI program forcing openib, I get the following:

$ mpirun --mca btl self,openib --host bach,server ./mpiprodcons
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:      server
Framework: btl
Component: openib
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  mca_bml_base_open() failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
[server:26865] *** An error occurred in MPI_Init
[server:26865] *** reported by process [4043636737,1]
[server:26865] *** on a NULL communicator
[server:26865] *** Unknown error
[server:26865] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[server:26865] ***    and potentially your MPI job)
[bach:28421] 1 more process has sent help message help-mca-base.txt / find-available:not-valid
[bach:28421] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[bach:28421] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[bach:28421] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle

It seems it fails to load openib, but I don't see enough information to narrow down the issue.
Any suggestions?

Thanks a lot,
Vincenzo

@jsquyres
Member

Can you run with --mca btl_base_verbose 100? That should include quite a bit more output, and perhaps shed light on why it's failing to open the openib BTL.

@vmaffione
Author

Thank you for the quick reply.
I tried to increase the verbosity as you suggested, and also to set orte_base_help_aggregate to 0, but unfortunately that gives no further information.

$ mpirun --mca btl self,openib --mca btl_base_verbose 100 --mca orte_base_help_aggregate 0 --host bach,server ./mpiprodcons
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:      server
Framework: btl
Component: openib
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  mca_bml_base_open() failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
[server:00658] *** An error occurred in MPI_Init
[server:00658] *** reported by process [2519400449,1]
[server:00658] *** on a NULL communicator
[server:00658] *** Unknown error
[server:00658] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[server:00658] ***    and potentially your MPI job)
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:      bach
Framework: btl
Component: openib
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  mca_bml_base_open() failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
[bach:02097] *** An error occurred in MPI_Init
[bach:02097] *** reported by process [2519400449,0]
[bach:02097] *** on a NULL communicator
[bach:02097] *** Unknown error
[bach:02097] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[bach:02097] ***    and potentially your MPI job)

Is Open MPI supposed to work if the only RDMA/InfiniBand software installed on the system is rdma-core?

@vmaffione
Author

Is there a simple way to check whether my Open MPI installation includes openib support? It may be that my distribution compiled Open MPI without that component...

@rhc54
Contributor

rhc54 commented Jun 19, 2018

yeah - just use "ompi_info" and you'll see the list of available components

@vmaffione
Author

$ ompi_info | grep btl
[hadron:31660] mca_base_component_repository_open: unable to open mca_fcoll_two_phase: /usr/lib/openmpi/openmpi/mca_fcoll_two_phase.so: undefined symbol: mca_common_ompio_register_print_entry (ignored)
[hadron:31660] mca_base_component_repository_open: unable to open mca_fcoll_individual: /usr/lib/openmpi/openmpi/mca_fcoll_individual.so: undefined symbol: mca_common_ompio_file_write (ignored)
[hadron:31660] mca_base_component_repository_open: unable to open mca_fcoll_static: /usr/lib/openmpi/openmpi/mca_fcoll_static.so: undefined symbol: mca_common_ompio_register_print_entry (ignored)
[hadron:31660] mca_base_component_repository_open: unable to open mca_fcoll_dynamic: /usr/lib/openmpi/openmpi/mca_fcoll_dynamic.so: undefined symbol: mca_common_ompio_register_print_entry (ignored)
[hadron:31660] mca_base_component_repository_open: unable to open mca_fcoll_dynamic_gen2: /usr/lib/openmpi/openmpi/mca_fcoll_dynamic_gen2.so: undefined symbol: mca_common_ompio_register_print_entry (ignored)
                 MCA btl: tcp (MCA v2.1.0, API v3.0.0, Component v3.1.0)
                 MCA btl: vader (MCA v2.1.0, API v3.0.0, Component v3.1.0)
                 MCA btl: self (MCA v2.1.0, API v3.0.0, Component v3.1.0)
                MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v3.1.0)

I guess this means that I'm missing the openib BTL?

@rhc54
Contributor

rhc54 commented Jun 19, 2018

Yep - you can look in config.log to see why
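
For instance, something along these lines should show whether configure found the verbs headers/libraries (a rough sketch; the exact config.log messages vary between Open MPI versions):

$ grep -i verbs config.log | head
$ grep -i openib config.log | head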

@vmaffione
Author

Great, thanks a lot! At least now I know where the issue comes from. I'm using an Open MPI package built by my distribution (Arch), and my distribution does not officially support OFED.
So I'll need to tweak the ./configure invocation a bit to force it to use the rdma-core components.
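
Something like this is what I have in mind (a rough sketch; the prefix is hypothetical, and --with-verbs is the configure switch that enables the verbs-based openib support, picking up the rdma-core headers/libraries):

$ ./configure --prefix=/opt/openmpi-3.1.0 --with-verbs
$ make -j$(nproc) && make install
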
I'll close this for now, thanks for the help!

@vmaffione
Author

It seems I now have the openib component, and it is able to find the Mellanox NICs. However, my "hello world" MPI program does not even start; it hangs at startup:

$ mpirun --mca btl self,openib --mca btl_base_verbose 100 --host bach,server ./mpiprodcons -n 1
[server:11637] mca: base: components_register: registering framework btl components
[server:11637] mca: base: components_register: found loaded component self
[server:11637] mca: base: components_register: component self register function successful
[server:11637] mca: base: components_register: found loaded component openib
[server:11637] mca: base: components_register: component openib register function successful
[server:11637] mca: base: components_open: opening btl components
[server:11637] mca: base: components_open: found loaded component self
[server:11637] mca: base: components_open: component self open function successful
[server:11637] mca: base: components_open: found loaded component openib
[server:11637] mca: base: components_open: component openib open function successful
[server:11637] select: initializing btl component self
[server:11637] select: init of component self returned success
[server:11637] select: initializing btl component openib
[server:11637] Checking distance from this process to device=mlx4_0
[server:11637] Process is not bound: distance to device is 0.000000
[server:11637] rdmacm CPC only supported when the first QP is a PP QP; skipped
[server:11637] openib BTL: rdmacm CPC unavailable for use on mlx4_0:1; skipped
[server:11637] [rank=1] openib: using port mlx4_0:1
[server:11637] select: init of component openib returned success
[bach:13285] mca: base: components_register: registering framework btl components
[bach:13285] mca: base: components_register: found loaded component openib
[bach:13285] mca: base: components_register: component openib register function successful
[bach:13285] mca: base: components_register: found loaded component self
[bach:13285] mca: base: components_register: component self register function successful
[bach:13285] mca: base: components_open: opening btl components
[bach:13285] mca: base: components_open: found loaded component openib
[bach:13285] mca: base: components_open: component openib open function successful
[bach:13285] mca: base: components_open: found loaded component self
[bach:13285] mca: base: components_open: component self open function successful
[bach:13285] select: initializing btl component openib
[bach:13285] Checking distance from this process to device=mlx4_0
[bach:13285] Process is not bound: distance to device is 0.000000
[bach:13285] rdmacm CPC only supported when the first QP is a PP QP; skipped
[bach:13285] openib BTL: rdmacm CPC unavailable for use on mlx4_0:1; skipped
[bach:13285] [rank=0] openib: using port mlx4_0:1
[bach:13285] select: init of component openib returned success
[bach:13285] select: initializing btl component self
[bach:13285] select: init of component self returned success
[bach:13285] mca: bml: Using self btl for send to [[44511,1],0] on node bach
[server:11637] mca: bml: Using self btl for send to [[44511,1],1] on node server

(does not terminate)

I can add that in my setup I'm able to use both libibverbs programs directly and legacy IP applications through IPoIB (I have network interfaces ib0 and ib1 on both hosts).
Any more suggestions? Am I correct in using --mca btl self,openib, therefore excluding tcp?

@vmaffione vmaffione reopened this Jun 19, 2018
@rhc54
Contributor

rhc54 commented Jun 19, 2018

The problem is right here:

[bach:13285] rdmacm CPC only supported when the first QP is a PP QP; skipped
[bach:13285] openib BTL: rdmacm CPC unavailable for use on mlx4_0:1; skipped

IB was unable to make the connection, and so openib disqualified itself.

@jsquyres
Member

Specifically, I think you have two options here:

  1. Use the UCX PML instead of the verbs-based openib BTL (i.e., get UCX from openucx.org -- it's emerging as the replacement of choice for vendors who have verbs-based devices).
  2. Use --mca btl_openib_cpc_base_exclude rdmacm to have the openib BTL exclude the rdmacm connection method.

@vmaffione
Author

Thanks a lot for your help. If rdmacm refers to the RDMA communication library (librdmacm.so), then I don't understand the failure. I have a libibverbs application up and running that uses librdmacm to establish the connection (before starting to use libibverbs directly)... Any explanation for this?

I tried to add the suggested flag (removing base_), and it seems to have an effect (the warning message disappeared), but it still hangs in the same place:

mpirun --mca btl self,openib --mca btl_base_verbose 100 --mca btl_openib_cpc_exclude rdmacm --host bach,server ./mpiprodcons -n 10
[server:20859] mca: base: components_register: registering framework btl components
[server:20859] mca: base: components_register: found loaded component self
[server:20859] mca: base: components_register: component self register function successful
[server:20859] mca: base: components_register: found loaded component openib
[server:20859] mca: base: components_register: component openib register function successful
[server:20859] mca: base: components_open: opening btl components
[server:20859] mca: base: components_open: found loaded component self
[server:20859] mca: base: components_open: component self open function successful
[server:20859] mca: base: components_open: found loaded component openib
[server:20859] mca: base: components_open: component openib open function successful
[server:20859] select: initializing btl component self
[server:20859] select: init of component self returned success
[server:20859] select: initializing btl component openib
[server:20859] Checking distance from this process to device=mlx4_0
[server:20859] Process is not bound: distance to device is 0.000000
[server:20859] [rank=1] openib: using port mlx4_0:1
[server:20859] select: init of component openib returned success
[bach:16202] mca: base: components_register: registering framework btl components
[bach:16202] mca: base: components_register: found loaded component openib
[bach:16202] mca: base: components_register: component openib register function successful
[bach:16202] mca: base: components_register: found loaded component self
[bach:16202] mca: base: components_register: component self register function successful
[bach:16202] mca: base: components_open: opening btl components
[bach:16202] mca: base: components_open: found loaded component openib
[bach:16202] mca: base: components_open: component openib open function successful
[bach:16202] mca: base: components_open: found loaded component self
[bach:16202] mca: base: components_open: component self open function successful
[bach:16202] select: initializing btl component openib
[bach:16202] Checking distance from this process to device=mlx4_0
[bach:16202] Process is not bound: distance to device is 0.000000
[bach:16202] [rank=0] openib: using port mlx4_0:1
[bach:16202] select: init of component openib returned success
[bach:16202] select: initializing btl component self
[bach:16202] select: init of component self returned success
[bach:16202] mca: bml: Using self btl for send to [[41284,1],0] on node bach
[server:20859] mca: bml: Using self btl for send to [[41284,1],1] on node server

I would need to use MPI directly over ibverbs if possible... does this UCX add an additional abstraction layer in the "datapath"? Do I need to build and install UCX and then re-configure and re-build Open MPI so that it finds the new component?

Thanks

@thananon
Member

I would need to use MPI directly over ibverbs if possible... does this UCX add an additional abstraction layer in the "datapath"?

It does, but UCX is actively maintained and optimized for InfiniBand hardware. Openib, on the other hand, is old and in "maintenance only" mode. From our test results, UCX generally performs better. (You can actually compare them yourself; just disable rdmacm on the openib BTL as suggested.)

Do I need to build and install UCX and then re-configure and re-build openmpi so that it finds the new component?

Yes, you do.
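
A rough sketch of that workflow (install prefixes are hypothetical and the exact steps depend on the UCX release; --with-ucx points Open MPI's configure at the UCX installation, and pml ucx selects the UCX PML at run time):

$ # build and install UCX
$ cd ucx-* && ./configure --prefix=/opt/ucx && make -j$(nproc) && make install
$ # rebuild Open MPI against it
$ cd openmpi-3.1.0 && ./configure --prefix=/opt/openmpi --with-ucx=/opt/ucx && make -j$(nproc) && make install
$ # run using the UCX PML instead of the openib BTL
$ mpirun --mca pml ucx --host bach,server ./mpiprodcons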

@vmaffione
Author

Hi, thanks for the help. If you look at my latest run output, I actually tried to disable rdmacm, but with no luck. Anyway, I'll definitely try this UCX. I am interested in performance limits with very small messages (e.g. 8-16 bytes per message), which is why I want as little overhead/abstraction over ibverbs as possible.
That's why it's a pity that openib does not work for me.

@thananon
Member

thananon commented Jun 22, 2018

Hey @vmaffione sorry I didn't read your output log.

So from what I'm seeing, the openib btl is initialized properly.

[bach:16202] select: initializing btl component openib
[bach:16202] Checking distance from this process to device=mlx4_0
[bach:16202] Process is not bound: distance to device is 0.000000
[bach:16202] [rank=0] openib: using port mlx4_0:1
[bach:16202] select: init of component openib returned success

So the problem is no longer the one you originally had. The hang suggests that something is wrong in the component selection (i.e., which component to use to send data from node x to node y).

Can you try to use tcp to see if the application works? -mca btl tcp,self
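
For example (same hosts as your earlier runs, just swapping the BTL list):

$ mpirun --mca btl tcp,self --host bach,server ./mpiprodcons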

@vmaffione
Author

vmaffione commented Jun 22, 2018

Yes, excluding rdmacm has an effect, but it still hangs. This is weird, because I use rdmacm in another (non-MPI) program with the very same setup (not at the same time, of course).
I have tried many times with the tcp BTL, and everything works correctly. If this is relevant, my hosts have multiple interfaces, as explained in my very first message.

@thananon
Member

How many messages are you sending at the same time?

Can you try openib btl with --mca btl_openib_btls_per_lid 8 ?

@vmaffione
Author

I'm sending one message at a time, but that's not the point. My MPI program prints a "Hello world" after MPI_Init and before sending/receiving any message. I don't see that hello world, so I think it hangs in MPI_Init.

$ mpirun --mca btl self,openib --mca btl_base_verbose 100 --mca btl_openib_cpc_exclude rdmacm --mca btl_openib_btls_per_lid 8  --host bach,server ./mpiprodcons -n 10
[server:26717] mca: base: components_register: registering framework btl components
[server:26717] mca: base: components_register: found loaded component self
[server:26717] mca: base: components_register: component self register function successful
[server:26717] mca: base: components_register: found loaded component openib
[server:26717] mca: base: components_register: component openib register function successful
[server:26717] mca: base: components_open: opening btl components
[server:26717] mca: base: components_open: found loaded component self
[server:26717] mca: base: components_open: component self open function successful
[server:26717] mca: base: components_open: found loaded component openib
[server:26717] mca: base: components_open: component openib open function successful
[server:26717] select: initializing btl component self
[server:26717] select: init of component self returned success
[server:26717] select: initializing btl component openib
[server:26717] Checking distance from this process to device=mlx4_0
[server:26717] Process is not bound: distance to device is 0.000000
[server:26717] [rank=1] openib: using port mlx4_0:1
[server:26717] [rank=1] openib: using port mlx4_0:1
[server:26717] [rank=1] openib: using port mlx4_0:1
[server:26717] [rank=1] openib: using port mlx4_0:1
[server:26717] [rank=1] openib: using port mlx4_0:1
[server:26717] [rank=1] openib: using port mlx4_0:1
[server:26717] [rank=1] openib: using port mlx4_0:1
[server:26717] [rank=1] openib: using port mlx4_0:1
[server:26717] select: init of component openib returned success
[bach:21923] mca: base: components_register: registering framework btl components
[bach:21923] mca: base: components_register: found loaded component openib
[bach:21923] mca: base: components_register: component openib register function successful
[bach:21923] mca: base: components_register: found loaded component self
[bach:21923] mca: base: components_register: component self register function successful
[bach:21923] mca: base: components_open: opening btl components
[bach:21923] mca: base: components_open: found loaded component openib
[bach:21923] mca: base: components_open: component openib open function successful
[bach:21923] mca: base: components_open: found loaded component self
[bach:21923] mca: base: components_open: component self open function successful
[bach:21923] select: initializing btl component openib
[bach:21923] Checking distance from this process to device=mlx4_0
[bach:21923] Process is not bound: distance to device is 0.000000
[bach:21923] [rank=0] openib: using port mlx4_0:1
[bach:21923] [rank=0] openib: using port mlx4_0:1
[bach:21923] [rank=0] openib: using port mlx4_0:1
[bach:21923] [rank=0] openib: using port mlx4_0:1
[bach:21923] [rank=0] openib: using port mlx4_0:1
[bach:21923] [rank=0] openib: using port mlx4_0:1
[bach:21923] [rank=0] openib: using port mlx4_0:1
[bach:21923] [rank=0] openib: using port mlx4_0:1
[bach:21923] select: init of component openib returned success
[bach:21923] select: initializing btl component self
[bach:21923] select: init of component self returned success
[bach:21923] mca: bml: Using self btl for send to [[52125,1],0] on node bach
[server:26717] mca: bml: Using self btl for send to [[52125,1],1] on node server

@vmaffione
Author

I added a printf() before and after MPI_Init, and now I can confirm that the MPI program hangs on both hosts inside MPI_Init.

@rhc54
Contributor

rhc54 commented Jun 22, 2018

Have you tried stripping the program of everything except MPI_Init, the print, and MPI_Finalize to see if that works? If it does, then it tells you that you truly are hanging on the communication, and the print statement after MPI_Init might be failing to show up for other reasons (like the daemons can't forward it because they are hung doing something else).

I would then add --mca pmix_base_verbose 5 --mca grpcomm_base_verbose 5 to your cmd line and see if maybe the problem is that you are hanging in the fence operations
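
A minimal reproducer along those lines would be something like this (a sketch; the file name mpi_init_only.c is just an example):

$ cat > mpi_init_only.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);      /* the hang reported above happens in here */
    printf("after MPI_Init\n");
    MPI_Finalize();
    return 0;
}
EOF
$ mpicc mpi_init_only.c -o mpi_init_only
$ mpirun --mca btl self,openib --host bach,server ./mpi_init_only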

@vmaffione
Author

Yes, right now my program is just a call to MPI_Init followed by a call to MPI_Finalize and exit(0).
Adding the suggested parameters adds more output, but it still hangs inside MPI_Init:

$ mpirun --mca btl self,openib --mca btl_base_verbose 100 --mca btl_openib_cpc_exclude rdmacm --mca btl_openib_btls_per_lid 8 --mca pmix_base_verbose 5 --mca grpcomm_base_verbose 5 --host bach,server ./mpiprodcons -n 10
[bach:22102] psec: native init
[bach:22102] psec: none init
[bach:22102] pmix:server init called
[bach:22102] pmix_register_event_hdlr shifting to progress thread
[bach:22102] pmix: register event_hdlr with 1 infos
[bach:22102] pmix: _add_hdlr
[bach:22102] PMIX server errreg_cbfunc - error handler registered status=0, reference=0
[bach:22102] pmix:server _register_nspace 3361079296
[bach:22102] mca:grpcomm:select: checking available component direct
[bach:22102] mca:grpcomm:select: Querying component [direct]
[bach:22102] [[51286,0],0]: Final grpcomm priorities
[bach:22102] 	Component: direct Priority: 85
[server:26888] psec: native init
[server:26888] psec: none init
[server:26888] pmix:server init called
[server:26888] pmix_register_event_hdlr shifting to progress thread
[server:26888] pmix: register event_hdlr with 1 infos
[server:26888] pmix: _add_hdlr
[server:26888] PMIX server errreg_cbfunc - error handler registered status=0, reference=0
[server:26888] pmix:server _register_nspace 3361079296
[server:26888] mca:grpcomm:select: checking available component direct
[server:26888] mca:grpcomm:select: Querying component [direct]
[server:26888] [[51286,0],1]: Final grpcomm priorities
[server:26888] 	Component: direct Priority: 85
[server:26888] [[51286,0],1] pmix2x:client get on proc [[51286,0],1] key opal.puri
[server:26888] pmix: get_nb value for proc 3361079296:1 key opal.puri
[server:26888] pmix: getnbfn value for proc 3361079296:1 key opal.puri
[server:26888] pmix:client get completed
[server:26888] [[51286,0],1] pmix2x:client get on proc [[51286,0],0] key opal.puri
[server:26888] pmix: get_nb value for proc 3361079296:0 key opal.puri
[server:26888] pmix: getnbfn value for proc 3361079296:0 key opal.puri
[server:26888] pmix:client get completed
[bach:22102] [[51286,0],0] pmix2x:client get on proc [[51286,0],0] key opal.puri
[bach:22102] pmix: get_nb value for proc 3361079296:0 key opal.puri
[bach:22102] pmix: getnbfn value for proc 3361079296:0 key opal.puri
[bach:22102] pmix:client get completed
[bach:22102] [[51286,0],0] pmix2x:client get on proc [[51286,0],1] key opal.puri
[bach:22102] pmix: get_nb value for proc 3361079296:1 key opal.puri
[bach:22102] pmix: getnbfn value for proc 3361079296:1 key opal.puri
[bach:22102] pmix:client get completed
[bach:22102] pmix:server register client 3361079297:0
[bach:22102] pmix:server _register_client for nspace 3361079297 rank 0
[bach:22102] pmix:server _register_nspace 3361079297
[server:26888] pmix:server register client 3361079297:1
[server:26888] pmix:server _register_client for nspace 3361079297 rank 1
[server:26888] pmix:server _register_nspace 3361079297
[server:26888] pmix:server setup_fork for nspace 3361079297 rank 1
[bach:22102] pmix:server setup_fork for nspace 3361079297 rank 0
BEFORE MPI_INIT
[server:26892] PMIx_client initialized
[server:26892] PMIx_client init
BEFORE MPI INIT
[server:26892] psec: native init
[server:26892] psec: none init
[server:26888] psec: native validate_cred ��
[server:26888] psec: native credential 1000:1000 valid
[server:26888] credential validated
[server:26892] pmix: init called
[server:26888] SWITCHYARD for 3361079297:1:24
[server:26888] recvd pmix cmd 0 from 3361079297:1
[server:26888] SWITCHYARD for 3361079297:1:24
[server:26888] recvd pmix cmd 13 from 3361079297:1
[server:26888] recvd register events
[server:26888] server register events: ninfo =0 rc =0
[server:26888] server:regevents_cbfunc called status = 0
[server:26892] pmix: get_nb value for proc 3361079297:4294967294 key pmix.dbg.init
[server:26892] pmix: getnbfn value for proc 3361079297:4294967294 key pmix.dbg.init
[server:26892] pmix:client get completed
[server:26892] pmix_register_event_hdlr shifting to progress thread
[server:26892] pmix: register event_hdlr with 0 infos
[server:26892] pmix: _add_hdlr
[server:26892] pmix: _add_hdlr sending to server
[server:26892] pmix: regevents callback recvd
[server:26892] PMIX client errreg_cbfunc - error handler registered status=0, reference=0
[server:26892] [[51286,1],1] pmix2x:client get on proc [[51286,1],1] key pmix.lrank
[server:26892] pmix: get_nb value for proc 3361079297:1 key pmix.lrank
[server:26892] pmix: getnbfn value for proc 3361079297:1 key pmix.lrank
[server:26892] pmix:client get completed
[server:26892] [[51286,1],1] pmix2x:client get on proc [[51286,1],1] key pmix.nrank
[server:26892] pmix: get_nb value for proc 3361079297:1 key pmix.nrank
[server:26892] pmix: getnbfn value for proc 3361079297:1 key pmix.nrank
[server:26892] pmix:client get completed
[server:26892] [[51286,1],1] pmix2x:client get on proc [[51286,1],WILDCARD] key pmix.max.size
[server:26892] pmix: get_nb value for proc 3361079297:4294967294 key pmix.max.size
[server:26892] pmix: getnbfn value for proc 3361079297:4294967294 key pmix.max.size
[server:26892] pmix:client get completed
[server:26892] [[51286,1],1] pmix2x:client get on proc [[51286,1],WILDCARD] key pmix.job.size
[server:26892] pmix: get_nb value for proc 3361079297:4294967294 key pmix.job.size
[server:26892] pmix: getnbfn value for proc 3361079297:4294967294 key pmix.job.size
[server:26892] pmix:client get completed
[server:26892] [[51286,1],1] pmix2x:client get on proc [[51286,1],1] key pmix.appnum
[server:26892] pmix: get_nb value for proc 3361079297:1 key pmix.appnum
[server:26892] pmix: getnbfn value for proc 3361079297:1 key pmix.appnum
[server:26892] pmix:client get completed
[server:26892] [[51286,1],1] pmix2x:client get on proc [[51286,1],WILDCARD] key pmix.local.size
[server:26892] pmix: get_nb value for proc 3361079297:4294967294 key pmix.local.size
[server:26892] pmix: getnbfn value for proc 3361079297:4294967294 key pmix.local.size
[server:26892] pmix:client get completed
[server:26892] [[51286,1],1] pmix2x:client get on proc [[51286,1],WILDCARD] key pmix.num.nodes
[server:26892] pmix: get_nb value for proc 3361079297:4294967294 key pmix.num.nodes
[server:26892] pmix: getnbfn value for proc 3361079297:4294967294 key pmix.num.nodes
[server:26892] pmix:client get completed
[server:26892] [[51286,1],1] pmix2x:client get on proc [[51286,1],WILDCARD] key pmix.tmpdir
[server:26892] pmix: get_nb value for proc 3361079297:4294967294 key pmix.tmpdir
[server:26892] pmix: getnbfn value for proc 3361079297:4294967294 key pmix.tmpdir
[server:26892] pmix:client get completed
[server:26892] [[51286,1],1] pmix2x:client get on proc [[51286,1],WILDCARD] key pmix.nsdir
[server:26892] pmix: get_nb value for proc 3361079297:4294967294 key pmix.nsdir
[server:26892] pmix: getnbfn value for proc 3361079297:4294967294 key pmix.nsdir
[server:26892] pmix:client get completed
[server:26892] [[51286,1],1] pmix2x:client get on proc [[51286,1],WILDCARD] key pmix.pdir
[server:26892] pmix: get_nb value for proc 3361079297:4294967294 key pmix.pdir
[server:26892] pmix: getnbfn value for proc 3361079297:4294967294 key pmix.pdir
[server:26892] pmix:client get completed
[server:26892] [[51286,1],1] pmix2x:client get on proc [[51286,1],WILDCARD] key pmix.tdir.rmclean
[server:26892] pmix: get_nb value for proc 3361079297:4294967294 key pmix.tdir.rmclean
[server:26892] pmix: getnbfn value for proc 3361079297:4294967294 key pmix.tdir.rmclean
[server:26892] pmix:client get completed
[bach:22108] PMIx_client initialized
[bach:22108] PMIx_client init
[bach:22108] psec: native init
[bach:22108] psec: none init
[bach:22108] pmix: init called
[bach:22102] psec: native validate_cred ��
[bach:22102] psec: native credential 1000:100 valid
[bach:22102] credential validated
[bach:22102] SWITCHYARD for 3361079297:0:29
[bach:22102] recvd pmix cmd 0 from 3361079297:0
[server:26892] pmix_register_event_hdlr shifting to progress thread
[server:26892] pmix: register event_hdlr with 1 infos
[server:26892] pmix: _add_hdlr
[server:26892] PMIX2x errreg_cbfunc - error handler registered status=0, reference=1
[server:26892] mca:grpcomm:select: checking available component direct
[server:26892] mca:grpcomm:select: Querying component [direct]
[server:26892] [[51286,1],1]: Final grpcomm priorities
[server:26892] 	Component: direct Priority: 85
[server:26892] PMIx_client put
[server:26892] pmix: executing put for key pmix.hname type 3
[server:26892] pmix_register_event_hdlr shifting to progress thread
[server:26892] pmix: register event_hdlr with 2 infos
[server:26892] pmix: _add_hdlr
[server:26888] SWITCHYARD for 3361079297:1:24
[server:26888] recvd pmix cmd 13 from 3361079297:1
[server:26888] recvd register events
[server:26888] server register events: ninfo =0 rc =0
[server:26888] server:regevents_cbfunc called status = 0
[server:26892] PMIX2x errreg_cbfunc - error handler registered status=0, reference=2
[server:26892] pmix_register_event_hdlr shifting to progress thread
[server:26892] pmix: register event_hdlr with 1 infos
[server:26892] pmix: _add_hdlr
[server:26892] pmix: _add_hdlr sending to server
[server:26892] pmix: regevents callback recvd
[server:26892] PMIX2x errreg_cbfunc - error handler registered status=0, reference=3
[server:26892] PMIx_client init
[server:26892] client: notifying server 3361079297:1 of status PMIX MODEL DECLARED
[server:26892] 3361079297:1 invoke_local_event_hdlr for status PMIX MODEL DECLARED
[server:26892] [event/pmix_event_notification.c:767] INVOKING EVHDLR MPI-Model-Declarations
[server:26892] [[51286,1],1] RECEIVED NOTIFICATION OF STATUS -147
[server:26892] [[51286,1],1] CONVERTED STATUS -147 TO STATUS -69
[server:26892] [[51286,1],1] _EVENT_HDLR CALLING EVHDLR
[server:26892] pmix_client_notify_event source = 3361079297:1 event_status =-147, rc=0
[server:26892] PMIx_client finalize
[server:26892] mca: base: components_register: registering framework btl components
[server:26892] mca: base: components_register: found loaded component self
[server:26892] mca: base: components_register: component self register function successful
[server:26892] mca: base: components_register: found loaded component openib
[server:26892] mca: base: components_register: component openib register function successful
[server:26892] mca: base: components_open: opening btl components
[server:26892] mca: base: components_open: found loaded component self
[server:26892] mca: base: components_open: component self open function successful
[server:26892] mca: base: components_open: found loaded component openib
[server:26892] mca: base: components_open: component openib open function successful
[server:26892] select: initializing btl component self
[server:26892] select: init of component self returned success
[server:26892] select: initializing btl component openib
[server:26892] Checking distance from this process to device=mlx4_0
[server:26892] Process is not bound: distance to device is 0.000000
[server:26892] [[51286,1],1] pmix2x:client get on proc [[51286,1],WILDCARD] key pmix.pmem
[server:26892] pmix: get_nb value for proc 3361079297:4294967294 key pmix.pmem
[server:26892] pmix: getnbfn value for proc 3361079297:4294967294 key pmix.pmem
[server:26892] pmix:client get completed
[bach:22108] pmix: get_nb value for proc 3361079297:4294967294 key pmix.dbg.init
[bach:22108] pmix: getnbfn value for proc 3361079297:4294967294 key pmix.dbg.init
[server:26892] [rank=1] openib: using port mlx4_0:1
[bach:22108] pmix:client get completed
[bach:22108] pmix_register_event_hdlr shifting to progress thread
[bach:22108] pmix: register event_hdlr with 0 infos
[bach:22108] pmix: _add_hdlr
[bach:22108] pmix: _add_hdlr sending to server
[bach:22102] SWITCHYARD for 3361079297:0:29
[bach:22102] recvd pmix cmd 13 from 3361079297:0
[bach:22102] recvd register events
[bach:22102] server register events: ninfo =0 rc =0
[bach:22102] server:regevents_cbfunc called status = 0
[bach:22108] pmix: regevents callback recvd
[bach:22108] PMIX client errreg_cbfunc - error handler registered status=0, reference=0
[bach:22108] [[51286,1],0] pmix2x:client get on proc [[51286,1],0] key pmix.lrank
[bach:22108] pmix: get_nb value for proc 3361079297:0 key pmix.lrank
[bach:22108] pmix: getnbfn value for proc 3361079297:0 key pmix.lrank
[bach:22108] pmix:client get completed
[bach:22108] [[51286,1],0] pmix2x:client get on proc [[51286,1],0] key pmix.nrank
[bach:22108] pmix: get_nb value for proc 3361079297:0 key pmix.nrank
[bach:22108] pmix: getnbfn value for proc 3361079297:0 key pmix.nrank
[bach:22108] pmix:client get completed
[bach:22108] [[51286,1],0] pmix2x:client get on proc [[51286,1],WILDCARD] key pmix.max.size
[bach:22108] pmix: get_nb value for proc 3361079297:4294967294 key pmix.max.size
[bach:22108] pmix: getnbfn value for proc 3361079297:4294967294 key pmix.max.size
[bach:22108] pmix:client get completed
[bach:22108] [[51286,1],0] pmix2x:client get on proc [[51286,1],WILDCARD] key pmix.job.size
[bach:22108] pmix: get_nb value for proc 3361079297:4294967294 key pmix.job.size
[bach:22108] pmix: getnbfn value for proc 3361079297:4294967294 key pmix.job.size
[bach:22108] pmix:client get completed
[bach:22108] [[51286,1],0] pmix2x:client get on proc [[51286,1],0] key pmix.appnum
[bach:22108] pmix: get_nb value for proc 3361079297:0 key pmix.appnum
[bach:22108] pmix: getnbfn value for proc 3361079297:0 key pmix.appnum
[bach:22108] pmix:client get completed
[bach:22108] [[51286,1],0] pmix2x:client get on proc [[51286,1],WILDCARD] key pmix.local.size
[bach:22108] pmix: get_nb value for proc 3361079297:4294967294 key pmix.local.size
[bach:22108] pmix: getnbfn value for proc 3361079297:4294967294 key pmix.local.size
[bach:22108] pmix:client get completed
[bach:22108] [[51286,1],0] pmix2x:client get on proc [[51286,1],WILDCARD] key pmix.num.nodes
[bach:22108] pmix: get_nb value for proc 3361079297:4294967294 key pmix.num.nodes
[bach:22108] pmix: getnbfn value for proc 3361079297:4294967294 key pmix.num.nodes
[bach:22108] pmix:client get completed
[bach:22108] [[51286,1],0] pmix2x:client get on proc [[51286,1],WILDCARD] key pmix.tmpdir
[bach:22108] pmix: get_nb value for proc 3361079297:4294967294 key pmix.tmpdir
[bach:22108] pmix: getnbfn value for proc 3361079297:4294967294 key pmix.tmpdir
[bach:22108] pmix:client get completed
[bach:22108] [[51286,1],0] pmix2x:client get on proc [[51286,1],WILDCARD] key pmix.nsdir
[bach:22108] pmix: get_nb value for proc 3361079297:4294967294 key pmix.nsdir
[bach:22108] pmix: getnbfn value for proc 3361079297:4294967294 key pmix.nsdir
[bach:22108] pmix:client get completed
[bach:22108] [[51286,1],0] pmix2x:client get on proc [[51286,1],WILDCARD] key pmix.pdir
[bach:22108] pmix: get_nb value for proc 3361079297:4294967294 key pmix.pdir
[bach:22108] pmix: getnbfn value for proc 3361079297:4294967294 key pmix.pdir
[bach:22108] pmix:client get completed
[bach:22108] [[51286,1],0] pmix2x:client get on proc [[51286,1],WILDCARD] key pmix.tdir.rmclean
[bach:22108] pmix: get_nb value for proc 3361079297:4294967294 key pmix.tdir.rmclean
[bach:22108] pmix: getnbfn value for proc 3361079297:4294967294 key pmix.tdir.rmclean
[bach:22108] pmix:client get completed
[server:26892] [rank=1] openib: using port mlx4_0:1
[server:26892] [rank=1] openib: using port mlx4_0:1
[server:26892] [rank=1] openib: using port mlx4_0:1
[server:26892] [rank=1] openib: using port mlx4_0:1
[server:26892] [rank=1] openib: using port mlx4_0:1
[server:26892] [rank=1] openib: using port mlx4_0:1
[server:26892] [rank=1] openib: using port mlx4_0:1
[server:26892] PMIx_client put
[server:26892] pmix: executing put for key btl.openib.3.1 type 27
[server:26892] select: init of component openib returned success
[server:26888] SWITCHYARD for 3361079297:1:24
[server:26888] recvd pmix cmd 2 from 3361079297:1
[server:26888] 3361079296:1 EXECUTE COMMIT FOR 3361079297:1
[server:26888] SWITCHYARD for 3361079297:1:24
[server:26888] recvd pmix cmd 3 from 3361079297:1
[server:26888] recvd FENCE
[server:26888] recvd fence from 3361079297:1 with 1 procs
[server:26888] get_tracker called with 1 procs
[server:26888] new_tracker called with 1 procs
[server:26888] adding new tracker with 1 procs
[server:26888] adding local proc 3361079297.1 to tracker
[server:26888] fence complete
[server:26888] fence - assembling data
[server:26888] [[51286,0],1] FENCE CALLED
[server:26892] pmix:client wait_cbfunc received
[server:26892] PMIx_client fencenb
[server:26892] pmix: fence_nb called
[server:26888] [[51286,0],1] grpcomm:base: creating new coll for ORTE_SIG  Procs: [[51286,1],WILDCARD]
[bach:22102] [[51286,0],0] grpcomm:base: creating new coll for ORTE_SIG  Procs: [[51286,1],WILDCARD]
[bach:22108] pmix_register_event_hdlr shifting to progress thread
[bach:22108] pmix: register event_hdlr with 1 infos
[bach:22108] pmix: _add_hdlr
[bach:22108] PMIX2x errreg_cbfunc - error handler registered status=0, reference=1
[bach:22108] mca:grpcomm:select: checking available component direct
[bach:22108] mca:grpcomm:select: Querying component [direct]
[bach:22108] [[51286,1],0]: Final grpcomm priorities
[bach:22108] 	Component: direct Priority: 85
[bach:22108] PMIx_client put
[bach:22108] pmix: executing put for key pmix.hname type 3
[bach:22108] pmix_register_event_hdlr shifting to progress thread
[bach:22108] pmix: register event_hdlr with 2 infos
[bach:22108] pmix: _add_hdlr
[bach:22108] PMIX2x errreg_cbfunc - error handler registered status=0, reference=2
[bach:22108] pmix_register_event_hdlr shifting to progress thread
[bach:22108] pmix: register event_hdlr with 1 infos
[bach:22108] pmix: _add_hdlr
[bach:22108] pmix: _add_hdlr sending to server
[bach:22102] SWITCHYARD for 3361079297:0:29
[bach:22102] recvd pmix cmd 13 from 3361079297:0
[bach:22102] recvd register events
[bach:22102] server register events: ninfo =0 rc =0
[bach:22102] server:regevents_cbfunc called status = 0
[bach:22108] pmix: regevents callback recvd
[bach:22108] PMIX2x errreg_cbfunc - error handler registered status=0, reference=3
[bach:22108] PMIx_client init
[bach:22108] client: notifying server 3361079297:0 of status PMIX MODEL DECLARED
[bach:22108] 3361079297:0 invoke_local_event_hdlr for status PMIX MODEL DECLARED
[bach:22108] [event/pmix_event_notification.c:767] INVOKING EVHDLR MPI-Model-Declarations
[bach:22108] [[51286,1],0] RECEIVED NOTIFICATION OF STATUS -147
[bach:22108] [[51286,1],0] CONVERTED STATUS -147 TO STATUS -69
[bach:22108] [[51286,1],0] _EVENT_HDLR CALLING EVHDLR
[bach:22108] pmix_client_notify_event source = 3361079297:0 event_status =-147, rc=0
[bach:22108] PMIx_client finalize
[bach:22108] mca: base: components_register: registering framework btl components
[bach:22108] mca: base: components_register: found loaded component openib
[bach:22108] mca: base: components_register: component openib register function successful
[bach:22108] mca: base: components_register: found loaded component self
[bach:22108] mca: base: components_register: component self register function successful
[bach:22108] mca: base: components_open: opening btl components
[bach:22108] mca: base: components_open: found loaded component openib
[bach:22108] mca: base: components_open: component openib open function successful
[bach:22108] mca: base: components_open: found loaded component self
[bach:22108] mca: base: components_open: component self open function successful
[bach:22108] select: initializing btl component openib
[bach:22108] Checking distance from this process to device=mlx4_0
[bach:22108] Process is not bound: distance to device is 0.000000
[bach:22108] [[51286,1],0] pmix2x:client get on proc [[51286,1],WILDCARD] key pmix.pmem
[bach:22108] pmix: get_nb value for proc 3361079297:4294967294 key pmix.pmem
[bach:22108] pmix: getnbfn value for proc 3361079297:4294967294 key pmix.pmem
[bach:22108] pmix:client get completed
[bach:22108] [rank=0] openib: using port mlx4_0:1
[bach:22108] [rank=0] openib: using port mlx4_0:1
[bach:22108] [rank=0] openib: using port mlx4_0:1
[bach:22108] [rank=0] openib: using port mlx4_0:1
[bach:22108] [rank=0] openib: using port mlx4_0:1
[bach:22108] [rank=0] openib: using port mlx4_0:1
[bach:22108] [rank=0] openib: using port mlx4_0:1
[bach:22108] [rank=0] openib: using port mlx4_0:1
[bach:22108] PMIx_client put
[bach:22108] pmix: executing put for key btl.openib.3.1 type 27
[bach:22108] select: init of component openib returned success
[bach:22108] select: initializing btl component self
[bach:22108] select: init of component self returned success
[bach:22102] SWITCHYARD for 3361079297:0:29
[bach:22102] recvd pmix cmd 2 from 3361079297:0
[bach:22102] 3361079296:0 EXECUTE COMMIT FOR 3361079297:0
[bach:22108] pmix:client wait_cbfunc received
[bach:22108] PMIx_client fencenb
[bach:22108] pmix: fence_nb called
[bach:22102] SWITCHYARD for 3361079297:0:29
[bach:22102] recvd pmix cmd 3 from 3361079297:0
[bach:22102] recvd FENCE
[bach:22102] recvd fence from 3361079297:0 with 1 procs
[bach:22102] get_tracker called with 1 procs
[bach:22102] new_tracker called with 1 procs
[bach:22102] adding new tracker with 1 procs
[bach:22102] adding local proc 3361079297.0 to tracker
[bach:22102] fence complete
[bach:22102] fence - assembling data
[bach:22102] [[51286,0],0] FENCE CALLED
[bach:22102] server:modex_cbfunc called with 1032 bytes
[bach:22102] server:modex_cbfunc reply being sent to 3361079297:0
[bach:22108] pmix: fence_nb callback recvd
[bach:22108] client:unpack fence called
[bach:22108] client:unpack fence received status 0
[bach:22108] [[51286,1],0] pmix2x:client get on proc [[51286,1],WILDCARD] key pmix.mapby
[bach:22108] pmix: get_nb value for proc 3361079297:4294967294 key pmix.mapby
[bach:22108] pmix: getnbfn value for proc 3361079297:4294967294 key pmix.mapby
[bach:22108] pmix:client get completed
[bach:22108] [[51286,1],0] pmix2x:client get on proc [[51286,1],WILDCARD] key pmix.lpeers
[bach:22108] pmix: get_nb value for proc 3361079297:4294967294 key pmix.lpeers
[bach:22108] pmix: getnbfn value for proc 3361079297:4294967294 key pmix.lpeers
[server:26892] pmix: fence_nb callback recvd
[server:26892] client:unpack fence called
[server:26888] server:modex_cbfunc called with 1032 bytes
[server:26888] server:modex_cbfunc reply being sent to 3361079297:1
[server:26892] client:unpack fence received status 0
[bach:22108] pmix:client get completed
[bach:22108] mca: bml: Using self btl for send to [[51286,1],0] on node bach
[server:26892] [[51286,1],1] pmix2x:client get on proc [[51286,1],WILDCARD] key pmix.mapby
[server:26892] pmix: get_nb value for proc 3361079297:4294967294 key pmix.mapby
[server:26892] pmix: getnbfn value for proc 3361079297:4294967294 key pmix.mapby
[server:26892] pmix:client get completed
[server:26892] [[51286,1],1] pmix2x:client get on proc [[51286,1],WILDCARD] key pmix.lpeers
[server:26892] pmix: get_nb value for proc 3361079297:4294967294 key pmix.lpeers
[server:26892] pmix: getnbfn value for proc 3361079297:4294967294 key pmix.lpeers
[server:26892] pmix:client get completed
[server:26892] mca: bml: Using self btl for send to [[51286,1],1] on node server
[bach:22108] [[51286,1],0] pmix2x:client get on proc [[51286,1],0] key btl.openib.3.1
[bach:22108] pmix: get_nb value for proc 3361079297:0 key btl.openib.3.1
[bach:22108] pmix: getnbfn value for proc 3361079297:0 key btl.openib.3.1
[server:26892] [[51286,1],1] pmix2x:client get on proc [[51286,1],1] key btl.openib.3.1
[server:26892] pmix: get_nb value for proc 3361079297:1 key btl.openib.3.1
[bach:22108] pmix:client get completed
[server:26892] pmix: getnbfn value for proc 3361079297:1 key btl.openib.3.1
[server:26892] pmix:client get completed
[bach:22108] PMIx_client fencenb
[bach:22108] pmix: fence_nb called
[bach:22102] SWITCHYARD for 3361079297:0:29
[bach:22102] recvd pmix cmd 3 from 3361079297:0
[bach:22102] recvd FENCE
[bach:22102] recvd fence from 3361079297:0 with 1 procs
[bach:22102] get_tracker called with 1 procs
[bach:22102] new_tracker called with 1 procs
[bach:22102] adding new tracker with 1 procs
[bach:22102] adding local proc 3361079297.0 to tracker
[bach:22102] fence complete
[bach:22102] [[51286,0],0] FENCE CALLED
[bach:22102] [[51286,0],0] grpcomm:base: creating new coll for ORTE_SIG  Procs: [[51286,1],WILDCARD]

@vmaffione
Author

From gdb I see that there are two threads blocked in a libevent-based event loop. The third thread is the main thread, and it seems to keep running something (it's not stuck):

#0  0x00007fadbba67f09 in syscall () from /usr/lib/libc.so.6
#1  0x00007fadbb484a92 in ?? () from /usr/lib/openmpi/libopen-pal.so.40
#2  0x00007fadbb4287aa in opal_free_list_grow_st () from /usr/lib/openmpi/libopen-pal.so.40
#3  0x00007fada8919b2f in mca_btl_openib_post_srr () from /usr/lib/openmpi/openmpi/mca_btl_openib.so
#4  0x00007fada891eacc in mca_btl_openib_endpoint_post_recvs () from /usr/lib/openmpi/openmpi/mca_btl_openib.so
#5  0x00007fada892f378 in ?? () from /usr/lib/openmpi/openmpi/mca_btl_openib.so
#6  0x00007fada892f8f6 in ?? () from /usr/lib/openmpi/openmpi/mca_btl_openib.so
#7  0x00007fada89129eb in ?? () from /usr/lib/openmpi/openmpi/mca_btl_openib.so
#8  0x00007fada89130d5 in mca_btl_openib_add_procs () from /usr/lib/openmpi/openmpi/mca_btl_openib.so
#9  0x00007fada8d4528d in ?? () from /usr/lib/openmpi/openmpi/mca_bml_r2.so
#10 0x00007fada82c8d6c in mca_pml_ob1_add_procs () from /usr/lib/openmpi/openmpi/mca_pml_ob1.so
#11 0x00007fadbbf9ac9b in ompi_mpi_init () from /usr/lib/openmpi/libmpi.so.40
#12 0x00007fadbbfca8f9 in PMPI_Init () from /usr/lib/openmpi/libmpi.so.40
#13 0x000055884f6c59bc in main (argc=3, argv=0x7ffcb6089e48) at mpiprodcons.c:153
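
(For reference, this kind of per-thread backtrace can be captured by attaching gdb to the hung rank; the PID here is just the one from the run above:)

$ gdb -p 26892
(gdb) thread apply all bt
(gdb) detach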

@rhc54
Contributor

rhc54 commented Jun 22, 2018

Yes - here is the problem:

[bach:22102] [[51286,0],0] FENCE CALLED
[bach:22102] [[51286,0],0] grpcomm:base: creating new coll for ORTE_SIG  Procs: [[51286,1],WILDCARD]

Looks like you are stuck in the fence. So try adding these to the cmd line (just to test this out):
--mca pmix_base_async_modex 1 --mca pmix_base_collect_data 0

This tells OMPI not to execute a fence during MPI_Init

@vmaffione
Author

Thanks. Unfortunately it seems to have no effect (unless the parameters are not spelled correctly; as far as I understand, you get no feedback if they are misspelled).

$ mpirun --mca btl self,openib --mca btl_base_verbose 100 --mca btl_openib_cpc_exclude rdmacm --mca btl_openib_btls_per_lid 8 --mca pmix_base_verbose 5 --mca grpcomm_base_verbose 5 --mca pmix_base_async_modex 1 --mca pmix_base_collect_data 0 --host bach,server ./mpiprodcons -n 2
[bach:22275] psec: native init
[bach:22275] psec: none init
[bach:22275] pmix:server init called
[bach:22275] pmix_register_event_hdlr shifting to progress thread
[bach:22275] pmix: register event_hdlr with 1 infos
[bach:22275] pmix: _add_hdlr
[bach:22275] PMIX server errreg_cbfunc - error handler registered status=0, reference=0
[bach:22275] pmix:server _register_nspace 3372417024
[bach:22275] mca:grpcomm:select: checking available component direct
[bach:22275] mca:grpcomm:select: Querying component [direct]
[bach:22275] [[51459,0],0]: Final grpcomm priorities
[bach:22275] 	Component: direct Priority: 85
[server:27095] psec: native init
[server:27095] psec: none init
[server:27095] pmix:server init called
[server:27095] pmix_register_event_hdlr shifting to progress thread
[server:27095] pmix: register event_hdlr with 1 infos
[server:27095] pmix: _add_hdlr
[server:27095] PMIX server errreg_cbfunc - error handler registered status=0, reference=0
[server:27095] pmix:server _register_nspace 3372417024
[server:27095] mca:grpcomm:select: checking available component direct
[server:27095] mca:grpcomm:select: Querying component [direct]
[server:27095] [[51459,0],1]: Final grpcomm priorities
[server:27095] 	Component: direct Priority: 85
[server:27095] [[51459,0],1] pmix2x:client get on proc [[51459,0],1] key opal.puri
[server:27095] pmix: get_nb value for proc 3372417024:1 key opal.puri
[server:27095] pmix: getnbfn value for proc 3372417024:1 key opal.puri
[server:27095] pmix:client get completed
[server:27095] [[51459,0],1] pmix2x:client get on proc [[51459,0],0] key opal.puri
[server:27095] pmix: get_nb value for proc 3372417024:0 key opal.puri
[server:27095] pmix: getnbfn value for proc 3372417024:0 key opal.puri
[server:27095] pmix:client get completed
[bach:22275] [[51459,0],0] pmix2x:client get on proc [[51459,0],0] key opal.puri
[bach:22275] pmix: get_nb value for proc 3372417024:0 key opal.puri
[bach:22275] pmix: getnbfn value for proc 3372417024:0 key opal.puri
[bach:22275] pmix:client get completed
[bach:22275] [[51459,0],0] pmix2x:client get on proc [[51459,0],1] key opal.puri
[bach:22275] pmix: get_nb value for proc 3372417024:1 key opal.puri
[bach:22275] pmix: getnbfn value for proc 3372417024:1 key opal.puri
[bach:22275] pmix:client get completed
[bach:22275] pmix:server register client 3372417025:0
[bach:22275] pmix:server _register_client for nspace 3372417025 rank 0
[bach:22275] pmix:server _register_nspace 3372417025
[server:27095] pmix:server register client 3372417025:1
[server:27095] pmix:server _register_client for nspace 3372417025 rank 1
[server:27095] pmix:server _register_nspace 3372417025
[bach:22275] pmix:server setup_fork for nspace 3372417025 rank 0
[server:27095] pmix:server setup_fork for nspace 3372417025 rank 1
BEFORE MPI_INIT
BEFORE MPI INIT
[server:27099] PMIx_client initialized
[server:27099] PMIx_client init
[server:27099] psec: native init
[server:27099] psec: none init
[server:27095] psec: native validate_cred ��
[server:27095] psec: native credential 1000:1000 valid
[server:27095] credential validated
[server:27099] pmix: init called
[server:27095] SWITCHYARD for 3372417025:1:24
[server:27095] recvd pmix cmd 0 from 3372417025:1
[bach:22281] PMIx_client initialized
[bach:22281] PMIx_client init
[server:27095] SWITCHYARD for 3372417025:1:24
[server:27095] recvd pmix cmd 13 from 3372417025:1
[server:27095] recvd register events
[server:27095] server register events: ninfo =0 rc =0
[server:27095] server:regevents_cbfunc called status = 0
[server:27099] pmix: get_nb value for proc 3372417025:4294967294 key pmix.dbg.init
[server:27099] pmix: getnbfn value for proc 3372417025:4294967294 key pmix.dbg.init
[server:27099] pmix:client get completed
[server:27099] pmix_register_event_hdlr shifting to progress thread
[server:27099] pmix: register event_hdlr with 0 infos
[server:27099] pmix: _add_hdlr
[server:27099] pmix: _add_hdlr sending to server
[server:27099] pmix: regevents callback recvd
[server:27099] PMIX client errreg_cbfunc - error handler registered status=0, reference=0
[server:27099] [[51459,1],1] pmix2x:client get on proc [[51459,1],1] key pmix.lrank
[server:27099] pmix: get_nb value for proc 3372417025:1 key pmix.lrank
[server:27099] pmix: getnbfn value for proc 3372417025:1 key pmix.lrank
[server:27099] pmix:client get completed
[server:27099] [[51459,1],1] pmix2x:client get on proc [[51459,1],1] key pmix.nrank
[server:27099] pmix: get_nb value for proc 3372417025:1 key pmix.nrank
[server:27099] pmix: getnbfn value for proc 3372417025:1 key pmix.nrank
[server:27099] pmix:client get completed
[server:27099] [[51459,1],1] pmix2x:client get on proc [[51459,1],WILDCARD] key pmix.max.size
[server:27099] pmix: get_nb value for proc 3372417025:4294967294 key pmix.max.size
[server:27099] pmix: getnbfn value for proc 3372417025:4294967294 key pmix.max.size
[server:27099] pmix:client get completed
[server:27099] [[51459,1],1] pmix2x:client get on proc [[51459,1],WILDCARD] key pmix.job.size
[server:27099] pmix: get_nb value for proc 3372417025:4294967294 key pmix.job.size
[server:27099] pmix: getnbfn value for proc 3372417025:4294967294 key pmix.job.size
[server:27099] pmix:client get completed
[server:27099] [[51459,1],1] pmix2x:client get on proc [[51459,1],1] key pmix.appnum
[server:27099] pmix: get_nb value for proc 3372417025:1 key pmix.appnum
[server:27099] pmix: getnbfn value for proc 3372417025:1 key pmix.appnum
[server:27099] pmix:client get completed
[server:27099] [[51459,1],1] pmix2x:client get on proc [[51459,1],WILDCARD] key pmix.local.size
[server:27099] pmix: get_nb value for proc 3372417025:4294967294 key pmix.local.size
[server:27099] pmix: getnbfn value for proc 3372417025:4294967294 key pmix.local.size
[server:27099] pmix:client get completed
[server:27099] [[51459,1],1] pmix2x:client get on proc [[51459,1],WILDCARD] key pmix.num.nodes
[server:27099] pmix: get_nb value for proc 3372417025:4294967294 key pmix.num.nodes
[server:27099] pmix: getnbfn value for proc 3372417025:4294967294 key pmix.num.nodes
[server:27099] pmix:client get completed
[server:27099] [[51459,1],1] pmix2x:client get on proc [[51459,1],WILDCARD] key pmix.tmpdir
[server:27099] pmix: get_nb value for proc 3372417025:4294967294 key pmix.tmpdir
[server:27099] pmix: getnbfn value for proc 3372417025:4294967294 key pmix.tmpdir
[server:27099] pmix:client get completed
[server:27099] [[51459,1],1] pmix2x:client get on proc [[51459,1],WILDCARD] key pmix.nsdir
[bach:22281] psec: native init
[bach:22281] psec: none init
[server:27099] pmix: get_nb value for proc 3372417025:4294967294 key pmix.nsdir
[server:27099] pmix: getnbfn value for proc 3372417025:4294967294 key pmix.nsdir
[server:27099] pmix:client get completed
[server:27099] [[51459,1],1] pmix2x:client get on proc [[51459,1],WILDCARD] key pmix.pdir
[server:27099] pmix: get_nb value for proc 3372417025:4294967294 key pmix.pdir
[server:27099] pmix: getnbfn value for proc 3372417025:4294967294 key pmix.pdir
[server:27099] pmix:client get completed
[server:27099] [[51459,1],1] pmix2x:client get on proc [[51459,1],WILDCARD] key pmix.tdir.rmclean
[server:27099] pmix: get_nb value for proc 3372417025:4294967294 key pmix.tdir.rmclean
[server:27099] pmix: getnbfn value for proc 3372417025:4294967294 key pmix.tdir.rmclean
[server:27099] pmix:client get completed
[bach:22281] pmix: init called
[bach:22275] psec: native validate_cred
[bach:22275] psec: native credential 1000:100 valid
[bach:22275] credential validated
[bach:22275] SWITCHYARD for 3372417025:0:29
[bach:22275] recvd pmix cmd 0 from 3372417025:0
[bach:22281] pmix: get_nb value for proc 3372417025:4294967294 key pmix.dbg.init
[bach:22281] pmix: getnbfn value for proc 3372417025:4294967294 key pmix.dbg.init
[bach:22281] pmix:client get completed
[bach:22281] pmix_register_event_hdlr shifting to progress thread
[bach:22281] pmix: register event_hdlr with 0 infos
[bach:22281] pmix: _add_hdlr
[bach:22281] pmix: _add_hdlr sending to server
[bach:22275] SWITCHYARD for 3372417025:0:29
[bach:22275] recvd pmix cmd 13 from 3372417025:0
[bach:22275] recvd register events
[bach:22275] server register events: ninfo =0 rc =0
[bach:22275] server:regevents_cbfunc called status = 0
[bach:22281] pmix: regevents callback recvd
[bach:22281] PMIX client errreg_cbfunc - error handler registered status=0, reference=0
[bach:22281] [[51459,1],0] pmix2x:client get on proc [[51459,1],0] key pmix.lrank
[bach:22281] pmix: get_nb value for proc 3372417025:0 key pmix.lrank
[bach:22281] pmix: getnbfn value for proc 3372417025:0 key pmix.lrank
[bach:22281] pmix:client get completed
[bach:22281] [[51459,1],0] pmix2x:client get on proc [[51459,1],0] key pmix.nrank
[bach:22281] pmix: get_nb value for proc 3372417025:0 key pmix.nrank
[bach:22281] pmix: getnbfn value for proc 3372417025:0 key pmix.nrank
[bach:22281] pmix:client get completed
[bach:22281] [[51459,1],0] pmix2x:client get on proc [[51459,1],WILDCARD] key pmix.max.size
[bach:22281] pmix: get_nb value for proc 3372417025:4294967294 key pmix.max.size
[bach:22281] pmix: getnbfn value for proc 3372417025:4294967294 key pmix.max.size
[bach:22281] pmix:client get completed
[bach:22281] [[51459,1],0] pmix2x:client get on proc [[51459,1],WILDCARD] key pmix.job.size
[bach:22281] pmix: get_nb value for proc 3372417025:4294967294 key pmix.job.size
[bach:22281] pmix: getnbfn value for proc 3372417025:4294967294 key pmix.job.size
[bach:22281] pmix:client get completed
[bach:22281] [[51459,1],0] pmix2x:client get on proc [[51459,1],0] key pmix.appnum
[bach:22281] pmix: get_nb value for proc 3372417025:0 key pmix.appnum
[server:27099] pmix_register_event_hdlr shifting to progress thread
[server:27099] pmix: register event_hdlr with 1 infos
[server:27099] pmix: _add_hdlr
[server:27099] PMIX2x errreg_cbfunc - error handler registered status=0, reference=1
[bach:22281] pmix: getnbfn value for proc 3372417025:0 key pmix.appnum
[bach:22281] pmix:client get completed
[bach:22281] [[51459,1],0] pmix2x:client get on proc [[51459,1],WILDCARD] key pmix.local.size
[bach:22281] pmix: get_nb value for proc 3372417025:4294967294 key pmix.local.size
[server:27099] mca:grpcomm:select: checking available component direct
[server:27099] mca:grpcomm:select: Querying component [direct]
[bach:22281] pmix: getnbfn value for proc 3372417025:4294967294 key pmix.local.size
[bach:22281] pmix:client get completed
[bach:22281] [[51459,1],0] pmix2x:client get on proc [[51459,1],WILDCARD] key pmix.num.nodes
[bach:22281] pmix: get_nb value for proc 3372417025:4294967294 key pmix.num.nodes
[server:27099] [[51459,1],1]: Final grpcomm priorities
[server:27099] 	Component: direct Priority: 85
[bach:22281] pmix: getnbfn value for proc 3372417025:4294967294 key pmix.num.nodes
[bach:22281] pmix:client get completed
[bach:22281] [[51459,1],0] pmix2x:client get on proc [[51459,1],WILDCARD] key pmix.tmpdir
[bach:22281] pmix: get_nb value for proc 3372417025:4294967294 key pmix.tmpdir
[bach:22281] pmix: getnbfn value for proc 3372417025:4294967294 key pmix.tmpdir
[bach:22281] pmix:client get completed
[bach:22281] [[51459,1],0] pmix2x:client get on proc [[51459,1],WILDCARD] key pmix.nsdir
[bach:22281] pmix: get_nb value for proc 3372417025:4294967294 key pmix.nsdir
[bach:22281] pmix: getnbfn value for proc 3372417025:4294967294 key pmix.nsdir
[bach:22281] pmix:client get completed
[bach:22281] [[51459,1],0] pmix2x:client get on proc [[51459,1],WILDCARD] key pmix.pdir
[bach:22281] pmix: get_nb value for proc 3372417025:4294967294 key pmix.pdir
[bach:22281] pmix: getnbfn value for proc 3372417025:4294967294 key pmix.pdir
[bach:22281] pmix:client get completed
[bach:22281] [[51459,1],0] pmix2x:client get on proc [[51459,1],WILDCARD] key pmix.tdir.rmclean
[bach:22281] pmix: get_nb value for proc 3372417025:4294967294 key pmix.tdir.rmclean
[bach:22281] pmix: getnbfn value for proc 3372417025:4294967294 key pmix.tdir.rmclean
[bach:22281] pmix:client get completed
[server:27099] PMIx_client put
[server:27099] pmix: executing put for key pmix.hname type 3
[server:27099] pmix_register_event_hdlr shifting to progress thread
[server:27099] pmix: register event_hdlr with 2 infos
[server:27099] pmix: _add_hdlr
[server:27099] PMIX2x errreg_cbfunc - error handler registered status=0, reference=2
[server:27095] SWITCHYARD for 3372417025:1:24
[server:27095] recvd pmix cmd 13 from 3372417025:1
[server:27095] recvd register events
[server:27095] server register events: ninfo =0 rc =0
[server:27095] server:regevents_cbfunc called status = 0
[server:27099] pmix_register_event_hdlr shifting to progress thread
[server:27099] pmix: register event_hdlr with 1 infos
[server:27099] pmix: _add_hdlr
[server:27099] pmix: _add_hdlr sending to server
[server:27099] pmix: regevents callback recvd
[server:27099] PMIX2x errreg_cbfunc - error handler registered status=0, reference=3
[server:27099] PMIx_client init
[server:27099] client: notifying server 3372417025:1 of status PMIX MODEL DECLARED
[server:27099] 3372417025:1 invoke_local_event_hdlr for status PMIX MODEL DECLARED
[server:27099] [event/pmix_event_notification.c:767] INVOKING EVHDLR MPI-Model-Declarations
[server:27099] [[51459,1],1] RECEIVED NOTIFICATION OF STATUS -147
[server:27099] [[51459,1],1] CONVERTED STATUS -147 TO STATUS -69
[server:27099] [[51459,1],1] _EVENT_HDLR CALLING EVHDLR
[server:27099] pmix_client_notify_event source = 3372417025:1 event_status =-147, rc=0
[server:27099] PMIx_client finalize
[server:27099] mca: base: components_register: registering framework btl components
[server:27099] mca: base: components_register: found loaded component self
[server:27099] mca: base: components_register: component self register function successful
[server:27099] mca: base: components_register: found loaded component openib
[server:27099] mca: base: components_register: component openib register function successful
[server:27099] mca: base: components_open: opening btl components
[server:27099] mca: base: components_open: found loaded component self
[server:27099] mca: base: components_open: component self open function successful
[server:27099] mca: base: components_open: found loaded component openib
[server:27099] mca: base: components_open: component openib open function successful
[server:27099] select: initializing btl component self
[server:27099] select: init of component self returned success
[server:27099] select: initializing btl component openib
[server:27099] Checking distance from this process to device=mlx4_0
[server:27099] Process is not bound: distance to device is 0.000000
[server:27099] [[51459,1],1] pmix2x:client get on proc [[51459,1],WILDCARD] key pmix.pmem
[server:27099] pmix: get_nb value for proc 3372417025:4294967294 key pmix.pmem
[server:27099] pmix: getnbfn value for proc 3372417025:4294967294 key pmix.pmem
[server:27099] pmix:client get completed
[server:27099] [rank=1] openib: using port mlx4_0:1
[server:27099] [rank=1] openib: using port mlx4_0:1
[server:27099] [rank=1] openib: using port mlx4_0:1
[server:27099] [rank=1] openib: using port mlx4_0:1
[server:27099] [rank=1] openib: using port mlx4_0:1
[server:27099] [rank=1] openib: using port mlx4_0:1
[server:27099] [rank=1] openib: using port mlx4_0:1
[server:27099] [rank=1] openib: using port mlx4_0:1
[server:27099] PMIx_client put
[server:27099] pmix: executing put for key btl.openib.3.1 type 27
[server:27099] select: init of component openib returned success
[bach:22281] pmix_register_event_hdlr shifting to progress thread
[bach:22281] pmix: register event_hdlr with 1 infos
[bach:22281] pmix: _add_hdlr
[bach:22281] PMIX2x errreg_cbfunc - error handler registered status=0, reference=1
[bach:22281] mca:grpcomm:select: checking available component direct
[bach:22281] mca:grpcomm:select: Querying component [direct]
[bach:22281] [[51459,1],0]: Final grpcomm priorities
[bach:22281] 	Component: direct Priority: 85
[bach:22281] PMIx_client put
[bach:22281] pmix: executing put for key pmix.hname type 3
[bach:22281] pmix_register_event_hdlr shifting to progress thread
[bach:22281] pmix: register event_hdlr with 2 infos
[bach:22281] pmix: _add_hdlr
[bach:22281] PMIX2x errreg_cbfunc - error handler registered status=0, reference=2
[bach:22281] pmix_register_event_hdlr shifting to progress thread
[bach:22281] pmix: register event_hdlr with 1 infos
[bach:22281] pmix: _add_hdlr
[bach:22281] pmix: _add_hdlr sending to server
[bach:22275] SWITCHYARD for 3372417025:0:29
[bach:22275] recvd pmix cmd 13 from 3372417025:0
[bach:22275] recvd register events
[bach:22275] server register events: ninfo =0 rc =0
[bach:22275] server:regevents_cbfunc called status = 0
[bach:22281] pmix: regevents callback recvd
[bach:22281] PMIX2x errreg_cbfunc - error handler registered status=0, reference=3
[bach:22281] PMIx_client init
[bach:22281] client: notifying server 3372417025:0 of status PMIX MODEL DECLARED
[bach:22281] 3372417025:0 invoke_local_event_hdlr for status PMIX MODEL DECLARED
[bach:22281] [event/pmix_event_notification.c:767] INVOKING EVHDLR MPI-Model-Declarations
[bach:22281] [[51459,1],0] RECEIVED NOTIFICATION OF STATUS -147
[bach:22281] [[51459,1],0] CONVERTED STATUS -147 TO STATUS -69
[bach:22281] [[51459,1],0] _EVENT_HDLR CALLING EVHDLR
[bach:22281] pmix_client_notify_event source = 3372417025:0 event_status =-147, rc=0
[bach:22281] PMIx_client finalize
[server:27095] SWITCHYARD for 3372417025:1:24
[server:27095] recvd pmix cmd 2 from 3372417025:1
[server:27095] 3372417024:1 EXECUTE COMMIT FOR 3372417025:1
[server:27099] pmix:client wait_cbfunc received
[server:27099] [[51459,1],1] pmix2x:client get on proc [[51459,1],WILDCARD] key pmix.mapby
[server:27099] pmix: get_nb value for proc 3372417025:4294967294 key pmix.mapby
[server:27099] pmix: getnbfn value for proc 3372417025:4294967294 key pmix.mapby
[server:27099] pmix:client get completed
[server:27099] [[51459,1],1] pmix2x:client get on proc [[51459,1],WILDCARD] key pmix.lpeers
[server:27099] pmix: get_nb value for proc 3372417025:4294967294 key pmix.lpeers
[server:27099] pmix: getnbfn value for proc 3372417025:4294967294 key pmix.lpeers
[server:27099] pmix:client get completed
[server:27099] mca: bml: Using self btl for send to [[51459,1],1] on node server
[bach:22281] mca: base: components_register: registering framework btl components
[bach:22281] mca: base: components_register: found loaded component openib
[bach:22281] mca: base: components_register: component openib register function successful
[bach:22281] mca: base: components_register: found loaded component self
[bach:22281] mca: base: components_register: component self register function successful
[bach:22281] mca: base: components_open: opening btl components
[bach:22281] mca: base: components_open: found loaded component openib
[bach:22281] mca: base: components_open: component openib open function successful
[bach:22281] mca: base: components_open: found loaded component self
[bach:22281] mca: base: components_open: component self open function successful
[bach:22281] select: initializing btl component openib
[bach:22281] Checking distance from this process to device=mlx4_0
[bach:22281] Process is not bound: distance to device is 0.000000
[bach:22281] [[51459,1],0] pmix2x:client get on proc [[51459,1],WILDCARD] key pmix.pmem
[bach:22281] pmix: get_nb value for proc 3372417025:4294967294 key pmix.pmem
[bach:22281] pmix: getnbfn value for proc 3372417025:4294967294 key pmix.pmem
[bach:22281] pmix:client get completed
[server:27099] [[51459,1],1] pmix2x:client get on proc [[51459,1],1] key btl.openib.3.1
[server:27099] pmix: get_nb value for proc 3372417025:1 key btl.openib.3.1
[server:27099] pmix: getnbfn value for proc 3372417025:1 key btl.openib.3.1
[server:27099] pmix:client get completed
[bach:22281] [rank=0] openib: using port mlx4_0:1
[bach:22281] [rank=0] openib: using port mlx4_0:1
[bach:22281] [rank=0] openib: using port mlx4_0:1
[bach:22281] [rank=0] openib: using port mlx4_0:1
[bach:22281] [rank=0] openib: using port mlx4_0:1
[bach:22281] [rank=0] openib: using port mlx4_0:1
[bach:22281] [rank=0] openib: using port mlx4_0:1
[bach:22281] [rank=0] openib: using port mlx4_0:1
[bach:22281] PMIx_client put
[bach:22281] pmix: executing put for key btl.openib.3.1 type 27
[bach:22281] select: init of component openib returned success
[bach:22281] select: initializing btl component self
[bach:22281] select: init of component self returned success
[bach:22275] SWITCHYARD for 3372417025:0:29
[bach:22275] recvd pmix cmd 2 from 3372417025:0
[bach:22275] 3372417024:0 EXECUTE COMMIT FOR 3372417025:0
[bach:22281] pmix:client wait_cbfunc received
[bach:22281] [[51459,1],0] pmix2x:client get on proc [[51459,1],WILDCARD] key pmix.mapby
[bach:22281] pmix: get_nb value for proc 3372417025:4294967294 key pmix.mapby
[bach:22281] pmix: getnbfn value for proc 3372417025:4294967294 key pmix.mapby
[bach:22281] pmix:client get completed
[bach:22281] [[51459,1],0] pmix2x:client get on proc [[51459,1],WILDCARD] key pmix.lpeers
[bach:22281] pmix: get_nb value for proc 3372417025:4294967294 key pmix.lpeers
[bach:22281] pmix: getnbfn value for proc 3372417025:4294967294 key pmix.lpeers
[bach:22281] pmix:client get completed
[bach:22281] mca: bml: Using self btl for send to [[51459,1],0] on node bach
[bach:22281] [[51459,1],0] pmix2x:client get on proc [[51459,1],0] key btl.openib.3.1
[bach:22281] pmix: get_nb value for proc 3372417025:0 key btl.openib.3.1
[bach:22281] pmix: getnbfn value for proc 3372417025:0 key btl.openib.3.1
[bach:22281] pmix:client get completed
[bach:22281] PMIx_client fencenb
[bach:22281] pmix: fence_nb called
[bach:22275] SWITCHYARD for 3372417025:0:29
[bach:22275] recvd pmix cmd 3 from 3372417025:0
[bach:22275] recvd FENCE
[bach:22275] recvd fence from 3372417025:0 with 1 procs
[bach:22275] get_tracker called with 1 procs
[bach:22275] new_tracker called with 1 procs
[bach:22275] adding new tracker with 1 procs
[bach:22275] adding local proc 3372417025.0 to tracker
[bach:22275] fence complete
[bach:22275] [[51459,0],0] FENCE CALLED
[bach:22275] [[51459,0],0] grpcomm:base: creating new coll for ORTE_SIG  Procs: [[51459,1],WILDCARD]

@vmaffione
Author

Hi,
I just installed UCX and recompiled Open MPI. It seems I'm now able to load the component, but messages are not flowing: my program just sends a single 8-byte message from process 1 to process 2.
I get the following errors:

$ mpirun --mca pml ucx --host bach,server ./mpiprodcons -n 1
Hello world from processor bach, rank 0 out of 2 processors
Hello world from processor server, rank 1 out of 2 processors
[1529871139.428305] [server:28772:0]         rcache.c:534  UCX  WARN  failed to register region 0x563b3f909550 [0x7fae34a00000..0x7fae35000000]: Input/output error
[1529871139.428786] [server:28772:0]          mpool.c:176  UCX  ERROR Failed to allocate memory pool (name=ud_tx_skb) chunk: Input/output error
[1529871139.428805] [server:28772:0]         rcache.c:534  UCX  WARN  failed to register region 0x563b3f909550 [0x7fae34a00000..0x7fae35000000]: Input/output error
[1529871139.429251] [server:28772:0]          mpool.c:176  UCX  ERROR Failed to allocate memory pool (name=ud_tx_skb) chunk: Input/output error
[1529871139.429263] [server:28772:0]         rcache.c:534  UCX  WARN  failed to register region 0x563b3f909550 [0x7fae34a00000..0x7fae35000000]: Input/output error
[1529871139.429704] [server:28772:0]          mpool.c:176  UCX  ERROR Failed to allocate memory pool (name=ud_tx_skb) chunk: Input/output error
[WARN and ERROR repeat indefinitely]

Any suggestions?
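
For reference, a single 8-byte exchange of the kind described above can be reproduced with a minimal MPI program along these lines. This is only a sketch, not the actual mpiprodcons source; the file name, payload value, and build line are illustrative.

/* sketch_sendrecv.c - illustrative sketch of a single 8-byte exchange,
 * not the actual mpiprodcons source.
 * Build (assumed): mpicc sketch_sendrecv.c -o sketch_sendrecv */
#include <mpi.h>
#include <stdint.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank = 0, size = 0;
    uint64_t payload = 0;                 /* the single 8-byte message */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0 && size > 1) {
        payload = 0x1122334455667788ULL;  /* arbitrary test pattern */
        MPI_Send(&payload, 1, MPI_UINT64_T, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&payload, 1, MPI_UINT64_T, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received 0x%016llx\n", (unsigned long long)payload);
    }

    MPI_Finalize();
    return 0;
}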

@mike-dubman
Member

It seems like you have an issue with the OFED installation on your node. Do the low-level tests (ibv_send_bw, etc.) work on your node?

@mike-dubman
Member

@alinask

@vmaffione
Author

Yes, I am able to use ibverbs on my hardware: both custom libibverbs/librdmacm programs and benchmark tools work. It may be relevant that I am not using OFED but rdma-core (https://github.com/linux-rdma/rdma-core), as this is what my distro (Arch) uses.

@yosefe
Contributor

yosefe commented Jun 25, 2018

is there any error in dmesg?

@vmaffione
Author

Nothing, unfortunately. Could it be a permissions problem?

@vmaffione
Author

Any ibv program or tool works correctly. Example:

$ ibv_ud_pingpong server
  local address:  LID 0x0002, QPN 0x00059e, PSN 0x6bb9af: GID ::
  remote address: LID 0x0001, QPN 0x000429, PSN 0x44d77c, GID ::
4096000 bytes in 0.01 seconds = 5586.09 Mbit/sec
1000 iters in 0.01 seconds = 5.87 usec/iter
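
A check like this can also be written directly against plain libibverbs, with no Open MPI involved. The following is a minimal sketch (file name and build line are illustrative) that just enumerates and opens the local HCAs, which is enough to confirm that rdma-core itself is functional.

/* sketch_ibv_check.c - illustrative sketch: enumerate and open the local HCAs
 * with plain libibverbs (rdma-core), independently of Open MPI.
 * Build (assumed): gcc sketch_ibv_check.c -o sketch_ibv_check -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);

    if (!devs) {
        perror("ibv_get_device_list");
        return 1;
    }
    if (num == 0)
        fprintf(stderr, "no RDMA devices found\n");

    for (int i = 0; i < num; i++) {
        /* try to open each device and report whether it succeeds */
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        printf("%s: %s\n", ibv_get_device_name(devs[i]),
               ctx ? "open OK" : "open FAILED");
        if (ctx)
            ibv_close_device(ctx);
    }
    ibv_free_device_list(devs);
    return 0;
}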

@vmaffione
Author

I was able to solve this: openucx/ucx#2710

@zrss

zrss commented Jan 19, 2019

It seems that I now get the openib component, and the component is able to find the Mellanox NICs. However, my "hello world" MPI program does not even start; it hangs at startup:

$ mpirun --mca btl self,openib --mca btl_base_verbose 100 --host bach,server ./mpiprodcons -n 1
[server:11637] mca: base: components_register: registering framework btl components
[server:11637] mca: base: components_register: found loaded component self
[server:11637] mca: base: components_register: component self register function successful
[server:11637] mca: base: components_register: found loaded component openib
[server:11637] mca: base: components_register: component openib register function successful
[server:11637] mca: base: components_open: opening btl components
[server:11637] mca: base: components_open: found loaded component self
[server:11637] mca: base: components_open: component self open function successful
[server:11637] mca: base: components_open: found loaded component openib
[server:11637] mca: base: components_open: component openib open function successful
[server:11637] select: initializing btl component self
[server:11637] select: init of component self returned success
[server:11637] select: initializing btl component openib
[server:11637] Checking distance from this process to device=mlx4_0
[server:11637] Process is not bound: distance to device is 0.000000
[server:11637] rdmacm CPC only supported when the first QP is a PP QP; skipped
[server:11637] openib BTL: rdmacm CPC unavailable for use on mlx4_0:1; skipped
[server:11637] [rank=1] openib: using port mlx4_0:1
[server:11637] select: init of component openib returned success
[bach:13285] mca: base: components_register: registering framework btl components
[bach:13285] mca: base: components_register: found loaded component openib
[bach:13285] mca: base: components_register: component openib register function successful
[bach:13285] mca: base: components_register: found loaded component self
[bach:13285] mca: base: components_register: component self register function successful
[bach:13285] mca: base: components_open: opening btl components
[bach:13285] mca: base: components_open: found loaded component openib
[bach:13285] mca: base: components_open: component openib open function successful
[bach:13285] mca: base: components_open: found loaded component self
[bach:13285] mca: base: components_open: component self open function successful
[bach:13285] select: initializing btl component openib
[bach:13285] Checking distance from this process to device=mlx4_0
[bach:13285] Process is not bound: distance to device is 0.000000
[bach:13285] rdmacm CPC only supported when the first QP is a PP QP; skipped
[bach:13285] openib BTL: rdmacm CPC unavailable for use on mlx4_0:1; skipped
[bach:13285] [rank=0] openib: using port mlx4_0:1
[bach:13285] select: init of component openib returned success
[bach:13285] select: initializing btl component self
[bach:13285] select: init of component self returned success
[bach:13285] mca: bml: Using self btl for send to [[44511,1],0] on node bach
[server:11637] mca: bml: Using self btl for send to [[44511,1],1] on node server

(does not terminate)

I can add that in my setup I'm able to use both libibverbs programs directly and legacy IP applications through IPoIB (I have network interfaces ib0 and ib1 on both hosts).
Any more suggestions? Am I correct in using --mca btl self,openib, thereby excluding tcp?

Thanks for pointing out ompi_info | grep btl. I'm also stuck in this case, but I don't know how to install the openib component; do you have an example?

@vmaffione
Author

I was never able to use openib. I switched to openucx and everything worked fine. You can follow the link posted in my last comment.

@zrss

zrss commented Jan 20, 2019

er .. thx for the reply
