
Graph500 failure with ompi-release v2.x #1599

Description

@artpol84

Hello, I'm hitting the error below (output log included) with graph500 and ompi-release v2.x using ob1/openib. With ompi-release v1.10 this does not happen, and it also does not appear with v2.x and the yalla PML.
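For comparison, the passing v2.x run selects the yalla PML instead of ob1 via the pml MCA parameter; roughly like this (assuming an MXM-enabled build, other flags as in the failing command below):

mpirun -np 16 -bind-to core --map-by node -mca pml yalla ./graph500_mpi_simple 16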

Steps to reproduce:

  1. git clone ompi-release and switch to the v2.x branch.
  2. Build and install ompi-release (a command sketch follows this list).
  3. Use the following graph500 sources: http://www.graph500.org/sites/default/files/files/graph500-2.1.4.tar.bz2
  4. Untar, go to the ./mpi directory, and run make.
  5. Run with the command line below, adjusted to the actual environment.
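For reference, steps 1-4 can be scripted roughly as follows (install prefix and -j value are placeholders; adjust to the local setup):

git clone https://github.com/open-mpi/ompi-release.git
cd ompi-release && git checkout v2.x
./autogen.pl
./configure --prefix=$HOME/ompi-v2.x
make -j 8 install
export PATH=$HOME/ompi-v2.x/bin:$PATH

wget http://www.graph500.org/sites/default/files/files/graph500-2.1.4.tar.bz2
tar xjf graph500-2.1.4.tar.bz2
cd graph500-2.1.4/mpi && make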

The error output:

mpirun -np 16 -bind-to core -mca btl_openib_warn_default_gid_prefix 0 -report-bindings \
     -display-map -mca btl_openib_if_include mlx4_0:1 -mca pml ob1 -mca btl self,vader,openib \
     -mca opal_pmix_base_async_modex 0 -mca mpi_add_procs_cutoff 100000 \
     -mca scoll_mpi_enable 0 -mca coll_hcoll_enable 0 -mca scoll_fca_enable 0 \
     -mca coll_fca_enable 0 --map-by node ./graph500_mpi_simple 16

 Data for JOB [52131,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: cn31    Num slots: 8    Max slots: 0    Num procs: 4
    Process OMPI jobid: [52131,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B/././.][./././.]
    Process OMPI jobid: [52131,1] App: 0 Process rank: 4 Bound: socket 0[core 1[hwt 0]]:[./B/./.][./././.]
    Process OMPI jobid: [52131,1] App: 0 Process rank: 8 Bound: socket 0[core 2[hwt 0]]:[././B/.][./././.]
    Process OMPI jobid: [52131,1] App: 0 Process rank: 12 Bound: socket 0[core 3[hwt 0]]:[./././B][./././.]

 Data for node: cn32    Num slots: 8    Max slots: 0    Num procs: 4
    Process OMPI jobid: [52131,1] App: 0 Process rank: 1 Bound: socket 0[core 0[hwt 0]]:[B/././.][./././.]
    Process OMPI jobid: [52131,1] App: 0 Process rank: 5 Bound: socket 0[core 1[hwt 0]]:[./B/./.][./././.]
    Process OMPI jobid: [52131,1] App: 0 Process rank: 9 Bound: socket 0[core 2[hwt 0]]:[././B/.][./././.]
    Process OMPI jobid: [52131,1] App: 0 Process rank: 13 Bound: socket 0[core 3[hwt 0]]:[./././B][./././.]

 Data for node: cn33    Num slots: 8    Max slots: 0    Num procs: 4
    Process OMPI jobid: [52131,1] App: 0 Process rank: 2 Bound: socket 0[core 0[hwt 0]]:[B/././.][./././.]
    Process OMPI jobid: [52131,1] App: 0 Process rank: 6 Bound: socket 0[core 1[hwt 0]]:[./B/./.][./././.]
    Process OMPI jobid: [52131,1] App: 0 Process rank: 10 Bound: socket 0[core 2[hwt 0]]:[././B/.][./././.]
    Process OMPI jobid: [52131,1] App: 0 Process rank: 14 Bound: socket 0[core 3[hwt 0]]:[./././B][./././.]

 Data for node: cn34    Num slots: 8    Max slots: 0    Num procs: 4
    Process OMPI jobid: [52131,1] App: 0 Process rank: 3 Bound: socket 0[core 0[hwt 0]]:[B/././.][./././.]
    Process OMPI jobid: [52131,1] App: 0 Process rank: 7 Bound: socket 0[core 1[hwt 0]]:[./B/./.][./././.]
    Process OMPI jobid: [52131,1] App: 0 Process rank: 11 Bound: socket 0[core 2[hwt 0]]:[././B/.][./././.]
    Process OMPI jobid: [52131,1] App: 0 Process rank: 15 Bound: socket 0[core 3[hwt 0]]:[./././B][./././.]

 =============================================================
[cn31:05345] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././.][./././.]
[cn31:05345] MCW rank 4 bound to socket 0[core 1[hwt 0]]: [./B/./.][./././.]
[cn31:05345] MCW rank 8 bound to socket 0[core 2[hwt 0]]: [././B/.][./././.]
[cn31:05345] MCW rank 12 bound to socket 0[core 3[hwt 0]]: [./././B][./././.]
[cn34:15847] MCW rank 3 bound to socket 0[core 0[hwt 0]]: [B/././.][./././.]
[cn34:15847] MCW rank 7 bound to socket 0[core 1[hwt 0]]: [./B/./.][./././.]
[cn34:15847] MCW rank 11 bound to socket 0[core 2[hwt 0]]: [././B/.][./././.]
[cn34:15847] MCW rank 15 bound to socket 0[core 3[hwt 0]]: [./././B][./././.]
[cn32:10315] MCW rank 1 bound to socket 0[core 0[hwt 0]]: [B/././.][./././.]
[cn32:10315] MCW rank 5 bound to socket 0[core 1[hwt 0]]: [./B/./.][./././.]
[cn32:10315] MCW rank 9 bound to socket 0[core 2[hwt 0]]: [././B/.][./././.]
[cn32:10315] MCW rank 13 bound to socket 0[core 3[hwt 0]]: [./././B][./././.]
[cn33:05262] MCW rank 2 bound to socket 0[core 0[hwt 0]]: [B/././.][./././.]
[cn33:05262] MCW rank 6 bound to socket 0[core 1[hwt 0]]: [./B/./.][./././.]
[cn33:05262] MCW rank 10 bound to socket 0[core 2[hwt 0]]: [././B/.][./././.]
[cn33:05262] MCW rank 14 bound to socket 0[core 3[hwt 0]]: [./././B][./././.]
[warn] Epoll ADD(1) on fd 0 failed.  Old events were 0; read change was 1 (add); write change was 0 (none): Operation not permitted
graph_generation:               1.032993 s
construction_time:              0.308546 s
Running BFS 0
Time for BFS 0 is 0.110816
Validating BFS 0
[[52131,1],13][btl_openib_component.c:3544:handle_wc] from cn32 to: cn31 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 77a8c8 opcode 201326592  vendor error 136 qp_idx 3
[[52131,1],2][btl_openib_component.c:3544:handle_wc] from cn33 to: cn31 [[52131,1],5][btl_openib_component.c:3544:handle_wc] from cn32 to: cn31 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 77a8c8 opcode 201326592  vendor error 136 qp_idx 3
error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 77a278 opcode 0  vendor error 136 qp_idx 3
[[52131,1],7][btl_openib_component.c:3544:handle_wc] from cn34 to: cn31 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 77cd38 opcode 0  vendor error 136 qp_idx 3[[52131,1],11][btl_openib_component.c:3544:handle_wc] from cn34 to: cn31 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 77cd38 opcode 0  vendor error 136 qp_idx 3
[[52131,1],14][btl_openib_component.c:3544:handle_wc] from cn33 to: cn31 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 77a198 opcode 2  vendor error 136 qp_idx 3
[[52131,1],3][btl_openib_component.c:3544:handle_wc] from cn34 [[52131,1],15][btl_openib_component.c:3544:handle_wc] from cn34 to: cn31 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 77cc18 opcode 2  vendor error 136 qp_idx 3to: cn31 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 77cc58 opcode 2  vendor error 136 qp_idx 3[[52131,1],9][btl_openib_component.c:3544:handle_wc] from cn32 to: cn31 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 77a7a8 opcode 2  vendor error 136 qp_idx 3[[52131,1],6][btl_openib_component.c:3544:handle_wc] from cn33 to: cn31 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 77a2b8 opcode 1  vendor error 136 qp_idx 3

[[52131,1],1][btl_openib_component.c:3544:handle_wc] from cn32 to: cn31 [[52131,1],8][btl_openib_component.c:3544:handle_wc] from cn31 to: cn31 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 77a088 opcode 201326592  vendor error 136 qp_idx 3
error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 77a6a8 opcode 2  vendor error 136 qp_idx 3[[52131,1],12][btl_openib_component.c:3544:handle_wc] from cn31 to: cn31 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 779d28 opcode 2  vendor error 136 qp_idx 3
[[52131,1],10][btl_openib_component.c:3544:handle_wc] from cn33 to: cn31 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 77a198 opcode 1  vendor error 136 qp_idx 3
[[52131,1],0][btl_openib_component.c:3544:handle_wc] from cn31 to: cn31 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 77a068 opcode 0  vendor error 136 qp_idx 3
[[52131,1],4][btl_openib_component.c:3544:handle_wc] from cn31 to: cn31 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 77a088 opcode 1  vendor error 136 qp_idx 3

The backtrace shows a hang in PMPI_Win_fence():

rank=0
(gdb) bt
#0  0x00007ffff3aeb0e2 in check_cantmatch_for_match (btl=<value optimized out>, hdr=<value optimized out>, segments=0x8695fe00,
    num_segments=1, type=<value optimized out>) at pml_ob1_recvfrag.c:573
#1  mca_pml_ob1_recv_frag_match (btl=<value optimized out>, hdr=<value optimized out>, segments=0x8695fe00, num_segments=1,
    type=<value optimized out>) at pml_ob1_recvfrag.c:725
#2  0x00007ffff3aeb749 in mca_pml_ob1_recv_frag_callback_match (btl=<value optimized out>, tag=<value optimized out>,
    des=<value optimized out>, cbdata=<value optimized out>) at pml_ob1_recvfrag.c:249
#3  0x00007ffff3cfa5b5 in mca_btl_vader_poll_handle_frag (hdr=0x7fffee518888, endpoint=0x7ab840) at btl_vader_component.c:603
#4  0x00007ffff3cfa9e1 in mca_btl_vader_check_fboxes () at btl_vader_fbox.h:233
#5  mca_btl_vader_component_progress () at btl_vader_component.c:703
#6  0x00007ffff77fa5ea in opal_progress () at runtime/opal_progress.c:187
#7  0x00007ffff025ffe5 in opal_condition_wait (assert=<value optimized out>, win=<value optimized out>)
    at ../../../../opal/threads/condition.h:78
#8  ompi_osc_pt2pt_fence (assert=<value optimized out>, win=<value optimized out>) at osc_pt2pt_active_target.c:166
#9  0x00007ffff7d8348b in PMPI_Win_fence (assert=<value optimized out>, win=0xdb8110) at pwin_fence.c:59
#10 0x0000000000409ea2 in end_gather (g=0xdb80d0) at onesided.c:63
#11 0x0000000000409484 in validate_bfs_result (tg=0x7fffffffbb70, nglobalverts=<value optimized out>, nlocalverts=4096, root=29348,
    pred=<value optimized out>, edge_visit_count_ptr=0x7fffffffbc00) at validate.c:354
#12 0x0000000000403c49 in main (argc=2, argv=0x7fffffffbd68) at main.c:323

The corresponding backtrace on another rank (rank 3) is:

rank=3
(gdb) bt
#0  0x00007ffff3ae93a7 in opal_convertor_set_position () at ../../../../opal/datatype/opal_convertor.h:296
#1  mca_pml_ob1_progress () at pml_ob1_progress.c:66
#2  0x00007ffff77fa5ea in opal_progress () at runtime/opal_progress.c:187
#3  0x00007ffff025ffe5 in opal_condition_wait (assert=<value optimized out>, win=<value optimized out>)
    at ../../../../opal/threads/condition.h:78
#4  ompi_osc_pt2pt_fence (assert=<value optimized out>, win=<value optimized out>) at osc_pt2pt_active_target.c:166
#5  0x00007ffff7d8348b in PMPI_Win_fence (assert=<value optimized out>, win=0x1031480) at pwin_fence.c:59
#6  0x0000000000409ea2 in end_gather (g=0x1031440) at onesided.c:63
#7  0x0000000000409484 in validate_bfs_result (tg=0x7fffffffbb70, nglobalverts=<value optimized out>, nlocalverts=4096, root=29348,
    pred=<value optimized out>, edge_visit_count_ptr=0x7fffffffbc00) at validate.c:354
#8  0x0000000000403c49 in main (argc=2, argv=0x7fffffffbd68) at main.c:323
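For context, end_gather() in graph500's onesided.c closes an active-target RMA epoch, which is where every rank is stuck. A minimal standalone sketch of that fence/get/fence pattern (a hypothetical reduction for illustration, not the actual graph500 code) is:

/* fence_get.c - fence/get/fence pattern similar to graph500's gather.
 * Hypothetical reduction for illustration, not the real onesided.c. */
#include <mpi.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NELEMS 4096

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Expose a local buffer through an MPI window. */
    int64_t *base = malloc(NELEMS * sizeof(int64_t));
    for (int i = 0; i < NELEMS; ++i) base[i] = rank;
    MPI_Win win;
    MPI_Win_create(base, NELEMS * sizeof(int64_t), sizeof(int64_t),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    int64_t *dst = malloc(NELEMS * sizeof(int64_t));

    MPI_Win_fence(0, win);                   /* open the access epoch    */
    int peer = (rank + 1) % size;
    MPI_Get(dst, NELEMS, MPI_INT64_T, peer,  /* remote read from peer    */
            0, NELEMS, MPI_INT64_T, win);
    MPI_Win_fence(0, win);                   /* close it: the hang site  */

    if (rank == 0)
        printf("rank 0 gathered value %lld from rank %d\n",
               (long long)dst[0], peer);

    free(dst);
    MPI_Win_free(&win);
    free(base);
    MPI_Finalize();
    return 0;
}

Built with mpicc and launched with the same mpirun line as above, this should exercise the same osc/pt2pt over ob1/openib one-sided path during the second fence.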
