rdmacm sometimes hangs in finalize in Mellanox Jenkins #1829

Description

@jsquyres

http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/2638/console -- which is a Jenkins run off #1821 -- shows a problem that we've been seeing in a few Jenkins runs: the rdmacm CPC in the openib BTL hangs during finalize.

Here's the command that is run:

/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/bin/mpirun -np 8 -bind-to core --report-state-on-timeout --get-stack-traces --timeout 600 -mca btl_openib_receive_queues P,65536,256,192,128:S,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64 -mca btl_openib_cpc_include rdmacm -mca pml ob1 -mca btl self,openib -mca btl_if_include mlx4_0:2 /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_c

This warning comes up; I don't know if it's significant:

11:17:09 --------------------------------------------------------------------------
11:17:09 No OpenFabrics connection schemes reported that they were able to be
11:17:09 used on a specific port.  As such, the openib BTL (OpenFabrics
11:17:09 support) will be disabled for this port.
11:17:09 
11:17:09   Local host:           jenkins01
11:17:09   Local device:         mlx5_0
11:17:09   Local port:           1
11:17:09   CPCs attempted:       rdmacm
11:17:09 --------------------------------------------------------------------------

But then all procs have a backtrace like this during finalize:

11:27:13    Thread 1 (Thread 0x7ffff73e3700 (LWP 10492)):
11:27:13    #0  0x0000003d6980b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
11:27:13    #1  0x00007fffeeb81c6d in rdmacm_endpoint_finalize (endpoint=0x7fbdb0) at connect/btl_openib_connect_rdmacm.c:1229
11:27:13    #2  0x00007fffeeb6cdeb in mca_btl_openib_endpoint_destruct (endpoint=0x7fbdb0) at btl_openib_endpoint.c:368
11:27:13    #3  0x00007fffeeb559b7 in opal_obj_run_destructors (object=0x7fbdb0) at ../../../../opal/class/opal_object.h:460
11:27:13    #4  0x00007fffeeb5ae97 in mca_btl_openib_del_procs (btl=0x748ed0, nprocs=1, procs=0x7fffffffc768, peers=0x802fa0) at btl_openib.c:1328
11:27:13    #5  0x00007fffeefb2159 in mca_bml_r2_del_procs (nprocs=8, procs=0x78bf60) at bml_r2.c:623
11:27:13    #6  0x00007fffee2ba612 in mca_pml_ob1_del_procs (procs=0x78bf60, nprocs=8) at pml_ob1.c:455
11:27:13    #7  0x00007ffff7ca4b94 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:333
11:27:13    #8  0x00007ffff7cd6fe1 in PMPI_Finalize () at pfinalize.c:47
11:27:13    #9  0x0000000000400890 in main (argc=1, argv=0x7fffffffcb38) at hello_c.c:24
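
For context, frame #1 shows rdmacm_endpoint_finalize() blocked in pthread_cond_wait(). Below is a minimal sketch of that kind of teardown pattern -- hypothetical names, not the actual OMPI code -- just to illustrate what the backtrace is consistent with: finalize waits for the CM event-handler thread to release the endpoint's rdmacm contexts, and if the matching signal never comes (e.g. the expected disconnect event is never delivered), the wait never returns.

#include <pthread.h>

/* Hypothetical per-endpoint teardown state; names do not match the OMPI source. */
struct endpoint_teardown {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             pending_contexts;  /* CM contexts still owned by the event thread */
};

/* Main thread, called from finalize: block until the event-handler thread
 * has released every rdmacm context belonging to this endpoint. */
void endpoint_finalize_wait(struct endpoint_teardown *t)
{
    pthread_mutex_lock(&t->lock);
    while (t->pending_contexts > 0) {
        /* This is the wait seen in frame #1 of the backtrace.  If the event
         * thread never delivers the matching signal -- for example because
         * the expected RDMA CM disconnect event never arrives -- this call
         * blocks forever and finalize hangs. */
        pthread_cond_wait(&t->cond, &t->lock);
    }
    pthread_mutex_unlock(&t->lock);
}

/* CM event-handler thread, called when a context is torn down. */
void endpoint_context_released(struct endpoint_teardown *t)
{
    pthread_mutex_lock(&t->lock);
    t->pending_contexts--;
    if (0 == t->pending_contexts) {
        pthread_cond_signal(&t->cond);
    }
    pthread_mutex_unlock(&t->lock);
}

If the real teardown follows roughly this shape, the predicate loop already protects against a signal racing ahead of the wait, so the more likely failure here is that the count never reaches zero because some disconnect/teardown event is never seen by the handler thread. That is only an inference from the stack trace, though.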

@jladd-mlnx @artpol84 Can you look into this?

@larrystevenwise @bharatpotnuri Is this happening on the v2.x branch for iWARP?

@hppritcha Is this a v2.0.0 or a v2.0.1 item? IIRC, rdmacm is a non-default CPC for IB, and you have to add a per-peer QP to use it. I thought we had this conversation before about a previous rdmacm CPC error: we agreed to push the fix to v2.0.1, but then we didn't because it also affected iWARP. Is my memory correct?
