
collcomp multithreaded test hangs on v2.x #1828

@nysal

Description


With the latest v2.x branch, the collcomp test case from the ANL thread-tests suite hangs when run across two POWER8 nodes with 16 threads. Here's the stack:

Stack trace(s) for thread: 13
  1: -----------------
  1: Branch 1 of 1, 2 processes, 100% of total [0-1]
  1: -----------------
  1: start_thread() at ?:?
  2:  -----------------
  2:  Branch 1 of 2, 1 process, 50% of total [0]
  2:  -----------------
  2:  runfunc() at collcomp.c:140
  2:  -----------------
  2:  Branch 2 of 2, 1 process, 50% of total [1]
  2:  -----------------
  2:  runfunc() at collcomp.c:135
Stack trace(s) for thread: 14
  1: -----------------
  1: Branch 1 of 1, 2 processes, 100% of total [0-1]
  1: -----------------
  1: start_thread() at ?:?
  2:  -----------------
  2:  Branch 1 of 2, 1 process, 50% of total [0]
  2:  -----------------
  2:  runfunc() at collcomp.c:135
  2:  -----------------
  2:  Branch 2 of 2, 1 process, 50% of total [1]
  2:  -----------------
  2:  runfunc() at collcomp.c:147
Stack trace(s) for thread: 15
  1: -----------------
  1: Branch 1 of 1, 2 processes, 100% of total [0-1]
  1: -----------------
  1: start_thread() at ?:?
  2:  -----------------
  2:  Branch 1 of 2, 1 process, 50% of total [0]
  2:  -----------------
  2:  runfunc() at collcomp.c:140
  2:  -----------------
  2:  Branch 2 of 2, 1 process, 50% of total [1]
  2:  -----------------
  2:  runfunc() at collcomp.c:136
Stack trace(s) for thread: 16
  1: -----------------
  1: Branch 1 of 1, 2 processes, 100% of total [0-1]
  1: -----------------
  1: start_thread() at ?:?
  2:  runfunc() at collcomp.c:136
Stack trace(s) for thread: 17
  1: -----------------
  1: Branch 1 of 1, 2 processes, 100% of total [0-1]
  1: -----------------
  1: start_thread() at ?:?
  2:  runfunc() at collcomp.c:98
  3:   PMPI_Allreduce() at pallreduce.c:107
  4:    ompi_coll_tuned_allreduce_intra_dec_fixed() at /u/jnysal/src/mirror-ompi-release/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:77
  5:     -----------------
  5:     Branch 1 of 2, 1 process, 50% of total [0]
  5:     -----------------
  5:     ompi_coll_base_allreduce_intra_ring_segmented() at /u/jnysal/src/mirror-ompi-release/ompi/mca/coll/base/coll_base_allreduce.c:795
  6:      ompi_request_default_wait() at /u/jnysal/src/mirror-ompi-release/ompi/request/req_wait.c:41
  7:       ompi_request_wait_completion() at /u/jnysal/src/mirror-ompi-release/ompi/request/request.h:389
  5:     -----------------
  5:     Branch 2 of 2, 1 process, 50% of total [1]
  5:     -----------------
  5:     ompi_coll_base_allreduce_intra_ring_segmented() at /u/jnysal/src/mirror-ompi-release/ompi/mca/coll/base/coll_base_allreduce.c:749
  6:      mca_pml_ob1_send() at /u/jnysal/src/mirror-ompi-release/ompi/mca/pml/ob1/pml_ob1_isend.c:265
  7:       ompi_request_wait_completion() at /u/jnysal/src/mirror-ompi-release/ompi/request/request.h:385
  8:        sync_wait_mt() at /u/jnysal/src/mirror-ompi-release/opal/threads/wait_sync.c:72
  9:         opal_progress() at /u/jnysal/src/mirror-ompi-release/opal/runtime/opal_progress.c:216
Stack trace(s) for thread: 18
  1: -----------------
  1: Branch 1 of 1, 2 processes, 100% of total [0-1]
  1: -----------------
  1: start_thread() at ?:?
  2:  progress_engine() at /u/jnysal/src/mirror-ompi-release/opal/runtime/opal_progress_threads.c:105
  3:   opal_libevent2022_event_base_loop() at /u/jnysal/src/mirror-ompi-release/opal/mca/event/libevent2022/libevent/event.c:1630
  4:    poll_dispatch() at /u/jnysal/src/mirror-ompi-release/opal/mca/event/libevent2022/libevent/poll.c:165
  5:     poll() at ?:?
Stack trace(s) for thread: 19
  1: -----------------
  1: Branch 1 of 1, 2 processes, 100% of total [0-1]
  1: -----------------
  1: start_thread() at ?:?
  2:  progress_engine() at /u/jnysal/src/mirror-ompi-release/opal/mca/pmix/pmix112/pmix/src/util/progress_threads.c:49
  3:   opal_libevent2022_event_base_loop() at /u/jnysal/src/mirror-ompi-release/opal/mca/event/libevent2022/libevent/event.c:1630
  4:    epoll_dispatch() at /u/jnysal/src/mirror-ompi-release/opal/mca/event/libevent2022/libevent/epoll.c:407
  5:     epoll_wait() at ?:?

I was able to reproduce it with the TCP BTL, but it doesn't seem transport specific. The complete stacks from all threads are not pasted, as the remaining threads are just in the compute loop. Thread 17's stack seems to be the only relevant one.
