Open
Description
We are currently seeing failures with rdmacm (openib BTL) when multiple threads are used. No similar problem is seen with udcm, so this is probably either an rdmacm issue or an issue with the CTS connection paths used by rdmacm. Opening this bug to track the issue. Stack trace from a failing run:
19:15:47 STACK TRACE FOR PROC [[61451,1],0] (jenkins01, PID 24302)
19:15:47 Thread 5 (Thread 0x7ffff67c8700 (LWP 24306)):
19:15:47 #0 0x0000003d690e9163 in epoll_wait () from /lib64/libc.so.6
19:15:47 #1 0x00007ffff7685ae6 in epoll_dispatch (base=0x6673f0, tv=<value optimized out>) at epoll.c:407
19:15:47 #2 0x00007ffff768b2a6 in opal_libevent2022_event_base_loop (base=0x6673f0, flags=1) at event.c:1630
19:15:47 #3 0x00007ffff762c19a in progress_engine (obj=0x666c48) at runtime/opal_progress_threads.c:105
19:15:47 #4 0x0000003d698079d1 in start_thread () from /lib64/libpthread.so.0
19:15:47 #5 0x0000003d690e8b6d in clone () from /lib64/libc.so.6
19:15:47 Thread 4 (Thread 0x7ffff5919700 (LWP 24307)):
19:15:47 #0 0x0000003d690e9163 in epoll_wait () from /lib64/libc.so.6
19:15:47 #1 0x00007ffff7685ae6 in epoll_dispatch (base=0x68c2b0, tv=<value optimized out>) at epoll.c:407
19:15:47 #2 0x00007ffff768b2a6 in opal_libevent2022_event_base_loop (base=0x68c2b0, flags=1) at event.c:1630
19:15:47 #3 0x00007ffff593ff55 in progress_engine (obj=0x68c2b0) at src/util/progress_threads.c:52
19:15:47 #4 0x0000003d698079d1 in start_thread () from /lib64/libpthread.so.0
19:15:47 #5 0x0000003d690e8b6d in clone () from /lib64/libc.so.6
19:15:47 Thread 3 (Thread 0x7fffce1a5700 (LWP 24339)):
19:15:47 #0 0x0000003d690e9163 in epoll_wait () from /lib64/libc.so.6
19:15:47 #1 0x00007fffecd1ef5f in mxm_async_thread_func (arg=<value optimized out>) at mxm/core/async.c:316
19:15:47 #2 0x0000003d698079d1 in start_thread () from /lib64/libpthread.so.0
19:15:47 #3 0x0000003d690e8b6d in clone () from /lib64/libc.so.6
19:15:47 Thread 2 (Thread 0x7fff980e6700 (LWP 24343)):
19:15:47 #0 0x0000003d6980a7d9 in pthread_mutex_unlock () from /lib64/libpthread.so.0
19:15:47 #1 0x00007fffee94c355 in opal_mutex_unlock (m=0x7fffeeb97110) at ../../../../opal/threads/mutex_unix.h:153
19:15:47 #2 0x00007fffee94c7e6 in opal_pointer_array_get_item (table=0x7fffeeb970e8, element_index=0) at ../../../../opal/class/opal_pointer_array.h:133
19:15:47 #3 0x00007fffee95820e in btl_openib_component_progress () at btl_openib_component.c:3775
19:15:47 #4 0x00007ffff7624f62 in opal_progress () at runtime/opal_progress.c:221
19:15:47 #5 0x00007ffff762d1be in sync_wait_mt (sync=0x7fff980e5d00) at threads/wait_sync.c:72
19:15:47 #6 0x00007fffee0aec9a in ompi_request_wait_completion (req=0x785780) at ../../../../ompi/request/request.h:385
19:15:47 #7 0x00007fffee0aff4f in mca_pml_ob1_recv (addr=0x0, count=0, datatype=0x601c40, src=0, tag=34532, comm=0x601640, status=0x0) at pml_ob1_irecv.c:123
19:15:47 #8 0x00007ffff7cf83bb in PMPI_Recv (buf=0x0, count=0, type=0x601c40, source=0, tag=34532, comm=0x601640, status=0x0) at precv.c:79
19:15:47 #9 0x0000000000400c63 in threadfunc ()
19:15:47 #10 0x0000003d698079d1 in start_thread () from /lib64/libpthread.so.0
19:15:47 #11 0x0000003d690e8b6d in clone () from /lib64/libc.so.6
19:15:47 Thread 1 (Thread 0x7ffff73e3700 (LWP 24302)):
19:15:47 #0 0x0000003d6980b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
19:15:47 #1 0x00007ffff762d18b in sync_wait_mt (sync=0x7fffffffc880) at threads/wait_sync.c:54
19:15:47 #2 0x00007ffff7ca1394 in ompi_request_default_wait_all (count=8, requests=0x7fffffffca00, statuses=0x0) at request/req_wait.c:226
19:15:47 #3 0x00007ffff7d16b58 in PMPI_Waitall (count=8, requests=0x7fffffffca00, statuses=0x0) at pwaitall.c:77
19:15:47 #4 0x0000000000400fe3 in main ()
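
For anyone reading the trace: Thread 1 is asleep in `pthread_cond_wait` inside `sync_wait_mt`, while Thread 2 is inside `sync_wait_mt` as well but is driving `opal_progress`. Below is a minimal pthreads sketch of that general "one waiter polls, the others sleep" pattern. It is a simplified illustration only, not the Open MPI source: `poll_once`, `wait_for`, `run_demo`, `progress_busy` and the `pending` counters are all hypothetical names, and the "network" is just a counter that always makes progress.

```c
#include <assert.h>
#include <pthread.h>

#define NWAITERS 2

/* All names here are hypothetical; this is not Open MPI code. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int progress_busy;     /* 1 while some thread is polling */
static int pending[NWAITERS]; /* outstanding requests per waiter */

/* Stand-in for a progress call: completes at most one pending request. */
static void poll_once(void)
{
    pthread_mutex_lock(&lock);
    for (int i = 0; i < NWAITERS; i++) {
        if (pending[i] > 0) {
            pending[i]--;
            break;
        }
    }
    pthread_mutex_unlock(&lock);
}

/* Block until all requests of waiter `id` are complete. */
static void wait_for(int id)
{
    pthread_mutex_lock(&lock);
    while (pending[id] > 0) {
        if (!progress_busy) {
            /* Nobody is polling: this thread takes over progress. */
            progress_busy = 1;
            pthread_mutex_unlock(&lock);
            poll_once();
            pthread_mutex_lock(&lock);
            progress_busy = 0;
            pthread_cond_broadcast(&cond);
        } else {
            /* Someone else is polling: sleep until woken. */
            pthread_cond_wait(&cond, &lock);
        }
    }
    pthread_mutex_unlock(&lock);
}

static void *thread_main(void *arg)
{
    wait_for(*(int *)arg);
    return NULL;
}

/* Mirrors the shape of the trace: one thread waits on 8 requests,
 * a second thread waits on 1. Returns the number still pending. */
int run_demo(void)
{
    pthread_t tid;
    int recv_id = 1;
    int rc;

    pending[0] = 8;
    pending[1] = 1;
    progress_busy = 0;

    rc = pthread_create(&tid, NULL, thread_main, &recv_id);
    assert(rc == 0);
    wait_for(0);
    rc = pthread_join(tid, NULL);
    assert(rc == 0);

    return pending[0] + pending[1];
}
```

In this sketch every poll completes a request, so `run_demo()` returns once both waiters are done (call it from a `main` and build with `-pthread`). If polling never completed the outstanding requests, for example because a connection was never fully established, one thread would keep spinning in the polling branch and the other would stay in `pthread_cond_wait`, which is the same shape as Threads 2 and 1 in the trace above.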