Conversation

@hjelmn (Member) commented Oct 3, 2018

This commit updates the uct btl to change the transports parameter
into a priority list. The dc_mlx5, rc_mlx5, and ud transports are added
to the priority list. This gives better out-of-the-box performance for
multi-threaded codes because the *_mlx5 transports can avoid the mlx5
lock inside libmlx5_rdmav2.

This commit also fixes a number of leaks and a possible deadlock when
using RDMA.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
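
For illustration only (this is not the btl/uct code; the helper name and the priority encoding are assumptions), a minimal C sketch of treating a comma-separated transports parameter as a priority list, where earlier entries are preferred and unlisted transports are rejected:

#include <stdio.h>
#include <string.h>

/* Return the position of tl_name in the comma-separated list (0 is the most
 * preferred), or -1 if the transport is not listed at all. */
static int tl_list_priority(const char *list, const char *tl_name)
{
    char buffer[256];
    char *token, *saveptr;
    int position = 0;

    strncpy(buffer, list, sizeof(buffer) - 1);
    buffer[sizeof(buffer) - 1] = '\0';

    for (token = strtok_r(buffer, ",", &saveptr); NULL != token;
         token = strtok_r(NULL, ",", &saveptr), ++position) {
        if (0 == strcmp(token, tl_name)) {
            return position;
        }
    }

    return -1;
}

int main(void)
{
    const char *transports = "dc_mlx5,rc_mlx5,ud";
    const char *candidates[] = { "rc_mlx5", "ud", "tcp" };

    for (size_t i = 0; i < sizeof(candidates) / sizeof(candidates[0]); ++i) {
        printf("tl %-8s priority %d\n", candidates[i],
               tl_list_priority(transports, candidates[i]));
    }

    return 0;
}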

@hjelmn (Member Author) commented Oct 3, 2018

@thananon See if this helps with the zcopy issue.

@thananon (Member) commented Oct 5, 2018

This patch gives me a segfault.

[a08:14689:0:14689] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))

@hjelmn (Member Author) commented Oct 5, 2018

I have some more changes that need to be pushed. Will push those this afternoon.

@hjelmn (Member Author) commented Oct 9, 2018

@thananon Pushed the updates.

@hjelmn (Member Author) commented Oct 9, 2018

:bot:retest:mellanox

@thananon (Member) commented Oct 9, 2018

I still have the same segfault with this commit. I will post the trace here once I have a debug build.

@thananon (Member) commented Oct 9, 2018

backtrace here:

#1  0x00007fffe9fe1473 in mca_btl_uct_endpoint_connect_iface (
    uct_btl=0x71cd30, tl=0x73b300, tl_context=0x6673e0,
    tl_endpoint=0x7c04f0, tl_data=0x7c07a5 "<e\003!#")
    at btl_uct_endpoint.c:117
#2  0x00007fffe9fe2dc8 in mca_btl_uct_endpoint_connect (
    uct_btl=0x71cd30, endpoint=0x7c0450, context_id=0,
    ep_addr=0x0, tl_index=0) at btl_uct_endpoint.c:373
#3  0x00007fffe9fe4f5e in mca_btl_uct_endpoint_check (
    module=0x71cd30, endpoint=0x7c0450, context=0x73bc90,
    ep_handle=0x7fffffffa920, tl_index=0) at btl_uct_endpoint.h:73
#4  0x00007fffe9fe508e in mca_btl_uct_endpoint_check_am (
    module=0x71cd30, endpoint=0x7c0450, context=0x73bc90,
    ep_handle=0x7fffffffa920) at btl_uct_endpoint.h:90
#5  0x00007fffe9fe5dd1 in mca_btl_uct_sendi (btl=0x71cd30,
    endpoint=0x7c0450, convertor=0x7fffffffaaf0,
    header=0x7fffffffac00, header_size=14, payload_size=4,
    order=255 '\377', flags=3, tag=65 'A', descriptor=0x0)
    at btl_uct_am.c:276
#6  0x00007fffdfd8dd0f in mca_bml_base_sendi (bml_btl=0x7c06c0,
    convertor=0x7fffffffaaf0, header=0x7fffffffac00,
    header_size=14, payload_size=4, order=255 '\377', flags=3,
    tag=65 'A', descriptor=0x0)
    at ../../../../ompi/mca/bml/bml.h:303
#7  0x00007fffdfd8ef9a in mca_pml_ob1_send_inline (buf=0x7beb24,
    count=1, datatype=0x7ffff7dba5c0 <ompi_mpi_int>, dst=0,
    tag=-27, seqn=1, dst_proc=0x7c0270, endpoint=0x7c0560,
    comm=0x603160 <ompi_mpi_comm_world>) at pml_ob1_isend.c:120
#8  0x00007fffdfd8f0b7 in mca_pml_ob1_isend (buf=0x7beb24,
    count=1, datatype=0x7ffff7dba5c0 <ompi_mpi_int>, dst=0,
    tag=-27, sendmode=MCA_PML_BASE_SEND_STANDARD,
    comm=0x603160 <ompi_mpi_comm_world>, request=0x7b92c0)
    at pml_ob1_isend.c:172
#9  0x00007fffdf949d75 in NBC_Start_round (handle=0x7c0158)
    at nbc.c:443
#10 0x00007fffdf94a585 in NBC_Start (handle=0x7c0158) at nbc.c:641
#11 0x00007fffdf950a3b in ompi_coll_libnbc_iallreduce (
    sendbuf=0x7beb24, recvbuf=0x7beb20, count=1,
    datatype=0x7ffff7dba5c0 <ompi_mpi_int>,
    op=0x7ffff7dd1840 <ompi_mpi_op_max>,
    comm=0x603160 <ompi_mpi_comm_world>, request=0x7fffffffaf30,
    module=0x7ba480) at nbc_iallreduce.c:207
#12 0x00007ffff7a3eb11 in ompi_comm_allreduce_intra_nb (
    inbuf=0x7beb24, outbuf=0x7beb20, count=1,
    op=0x7ffff7dd1840 <ompi_mpi_op_max>, context=0x7bead0,
    req=0x7fffffffaf30) at communicator/comm_cid.c:633
#13 0x00007ffff7a3dec7 in ompi_comm_allreduce_getnextcid (
    request=0x7bf488) at communicator/comm_cid.c:340
#14 0x00007ffff7a41f53 in ompi_comm_request_progress ()
    at communicator/comm_request.c:133
#15 0x00007ffff6e01a9a in opal_progress ()
    at runtime/opal_progress.c:231
#16 0x00007ffff7a3d523 in ompi_request_wait_completion (
    req=0x7bf488) at ../ompi/request/request.h:415
#17 0x00007ffff7a3dcf2 in ompi_comm_nextcid (newcomm=0x7bdfb0,
    comm=0x603160 <ompi_mpi_comm_world>, bridgecomm=0x0, arg0=0x0,
    arg1=0x0, send_first=false, mode=32)
    at communicator/comm_cid.c:295
#18 0x00007ffff7a3a8b1 in ompi_comm_dup_with_info (
    comm=0x603160 <ompi_mpi_comm_world>, info=0x0,

@hjelmn (Member Author) commented Oct 9, 2018

@thananon Can you share the program and mpirun line that triggers the SEGV?

@thananon (Member) commented Oct 9, 2018

The program is NetPIPE. I think it segfaulted in MPI_Init.
mpirun -np 2 --map-by node --bind-to none -mca pml ob1 -mca btl uct,self -mca btl_uct_memory_domains ib/mlx5_0 ./NPmpi

@thananon (Member) commented Oct 9, 2018

Let me take that back. The stack trace is from my benchmark, but I can reproduce it with NetPIPE, and both seem to crash in uct_endpoint_create_connected().

@hjelmn (Member Author) commented Oct 9, 2018

Well damn. It is working perfectly for me:

mpirun -np 2 --map-by node --bind-to none -mca pml ob1 -mca btl uct,self -mca btl_uct_memory_domains ib/mlx5_0 ./NPmpi
0: cn824
1: cn825
Now starting the main loop
  0:       1 bytes  37957 times -->      3.12 Mbps in       2.45 usec
  1:       2 bytes  40897 times -->      6.16 Mbps in       2.48 usec
  2:       3 bytes  40339 times -->      9.04 Mbps in       2.53 usec
  3:       4 bytes  26340 times -->     12.08 Mbps in       2.53 usec
  4:       6 bytes  29689 times -->     18.20 Mbps in       2.51 usec
  5:       8 bytes  19881 times -->     24.40 Mbps in       2.50 usec
  6:      12 bytes  24988 times -->     36.62 Mbps in       2.50 usec
  7:      13 bytes  16667 times -->     39.68 Mbps in       2.50 usec
  8:      16 bytes  18463 times -->     48.50 Mbps in       2.52 usec
  9:      19 bytes  22350 times -->     57.92 Mbps in       2.50 usec
 10:      21 bytes  25236 times -->     64.04 Mbps in       2.50 usec
 11:      24 bytes  26646 times -->     72.99 Mbps in       2.51 usec
 12:      27 bytes  28234 times -->     81.99 Mbps in       2.51 usec
 13:      29 bytes  17690 times -->     88.30 Mbps in       2.51 usec
 14:      32 bytes  19267 times -->     97.25 Mbps in       2.51 usec

@hjelmn (Member Author) commented Oct 9, 2018

Maybe something stale in your install?

@hjelmn (Member Author) commented Oct 9, 2018

What UCX version?

@thananon (Member) commented Oct 9, 2018

My install is fresh. I can try to distclean and make again.

My UCX version is master from April.

commit d58ab0b0b1536012b72d5dc8700bc82f9479dc5e (HEAD -> master, origin/master, origin/HEAD)
Merge: 31a9f510 243e77a4
Author: Yossi Itigin <yosefe@mellanox.com>
Date:   Mon Apr 9 14:56:02 2018 +0300

    Merge pull request #2495 from Artemy-Mellanox/topic/atomic_mr

    UCT/IB: Disable atomic UMR

@hjelmn (Member Author) commented Oct 9, 2018

@thananon I will try that version and see what happens.

@thananon (Member) commented Oct 9, 2018

@hjelmn I rebased to the current master and still have the same problem. I'm still waiting for my new build. Will report back.

@thananon (Member) commented Oct 9, 2018

Build finished. Still have the same problem. I think the error is from UCT, as everything looks normal from the MPI stack.

@hjelmn (Member Author) commented Oct 9, 2018

Very odd. Working for me with that revision, though I am testing on aarch64 and POWER9.

@hjelmn (Member Author) commented Oct 9, 2018

What transport is it using? It will say in the btl_base_verbose output.

@thananon (Member) commented Oct 9, 2018

[a08.alembert:12529] select: initializing btl component uct
[a08][[20152,1],1][btl_uct_component.c:413:mca_btl_uct_component_init] initializing uct btl
[a07][[20152,1],0][btl_uct_component.c:326:mca_btl_uct_component_process_uct_md] processing memory domain self
[a07][[20152,1],0][btl_uct_component.c:326:mca_btl_uct_component_process_uct_md] processing memory domain tcp
[a07][[20152,1],0][btl_uct_component.c:326:mca_btl_uct_component_process_uct_md] processing memory domain ib/mlx5_0
[a08][[20152,1],1][btl_uct_component.c:326:mca_btl_uct_component_process_uct_md] processing memory domain self
[a08][[20152,1],1][btl_uct_component.c:326:mca_btl_uct_component_process_uct_md] processing memory domain tcp
[a08][[20152,1],1][btl_uct_component.c:326:mca_btl_uct_component_process_uct_md] processing memory domain ib/mlx5_0
[a07][[20152,1],0][btl_uct_tl.c:550:mca_btl_uct_query_tls] tl filter: tl_name = rc, use = 1, priority = 3
[a08][[20152,1],1][btl_uct_tl.c:550:mca_btl_uct_query_tls] tl filter: tl_name = rc, use = 1, priority = 3
[a08][[20152,1],1][btl_uct_tl.c:302:mca_btl_uct_context_create] enabling progress for tl 0x71cd80 context id 0
[a08][[20152,1],1][btl_uct_tl.c:383:mca_btl_uct_create_tl] Interface CAPS for tl ib/mlx5_0::rc: 0xd2408000067f
[a07][[20152,1],0][btl_uct_tl.c:302:mca_btl_uct_context_create] enabling progress for tl 0x71cf90 context id 0
[a08][[20152,1],1][btl_uct_tl.c:550:mca_btl_uct_query_tls] tl filter: tl_name = ud, use = 1, priority = 2
[a07][[20152,1],0][btl_uct_tl.c:383:mca_btl_uct_create_tl] Interface CAPS for tl ib/mlx5_0::rc: 0xd2408000067f
[a07][[20152,1],0][btl_uct_tl.c:550:mca_btl_uct_query_tls] tl filter: tl_name = ud, use = 1, priority = 2
[a07][[20152,1],0][btl_uct_tl.c:302:mca_btl_uct_context_create] enabling progress for tl 0x73c460 context id 0
[a07][[20152,1],0][btl_uct_tl.c:383:mca_btl_uct_create_tl] Interface CAPS for tl ib/mlx5_0::ud: 0xf3400000000f
[a07][[20152,1],0][btl_uct_tl.c:550:mca_btl_uct_query_tls] tl filter: tl_name = cm, use = 1, priority = 3
[a08][[20152,1],1][btl_uct_tl.c:302:mca_btl_uct_context_create] enabling progress for tl 0x73c460 context id 0
[a08][[20152,1],1][btl_uct_tl.c:383:mca_btl_uct_create_tl] Interface CAPS for tl ib/mlx5_0::ud: 0xf3400000000f
[a08][[20152,1],1][btl_uct_tl.c:550:mca_btl_uct_query_tls] tl filter: tl_name = cm, use = 1, priority = 3
[a07][[20152,1],0][btl_uct_tl.c:302:mca_btl_uct_context_create] enabling progress for tl 0x760950 context id 0
[a07][[20152,1],0][btl_uct_tl.c:383:mca_btl_uct_create_tl] Interface CAPS for tl ib/mlx5_0::cm: 0x29000000000a
[a07][[20152,1],0][btl_uct_tl.c:466:mca_btl_uct_evaluate_tl] evaluating tl ud
[a07][[20152,1],0][btl_uct_tl.c:421:mca_btl_uct_set_tl_am] tl ud is suitable for active-messaging
[a07][[20152,1],0][btl_uct_tl.c:443:mca_btl_uct_set_tl_conn] tl ud is suitable for making connections
[a07][[20152,1],0][btl_uct_tl.c:483:mca_btl_uct_evaluate_tl] tl has flags 0xf3400000000f
[a07][[20152,1],0][btl_uct_tl.c:466:mca_btl_uct_evaluate_tl] evaluating tl rc
[a07][[20152,1],0][btl_uct_tl.c:390:mca_btl_uct_set_tl_rdma] tl rc is suitable for RDMA
[a07][[20152,1],0][btl_uct_tl.c:483:mca_btl_uct_evaluate_tl] tl has flags 0xd2408000067f
[a08][[20152,1],1][btl_uct_tl.c:302:mca_btl_uct_context_create] enabling progress for tl 0x7609c0 context id 0
[a08][[20152,1],1][btl_uct_tl.c:383:mca_btl_uct_create_tl] Interface CAPS for tl ib/mlx5_0::cm: 0x29000000000a
[a08][[20152,1],1][btl_uct_tl.c:466:mca_btl_uct_evaluate_tl] evaluating tl ud
[a08][[20152,1],1][btl_uct_tl.c:421:mca_btl_uct_set_tl_am] tl ud is suitable for active-messaging
[a08][[20152,1],1][btl_uct_tl.c:443:mca_btl_uct_set_tl_conn] tl ud is suitable for making connections
[a08][[20152,1],1][btl_uct_tl.c:483:mca_btl_uct_evaluate_tl] tl has flags 0xf3400000000f
[a08][[20152,1],1][btl_uct_tl.c:466:mca_btl_uct_evaluate_tl] evaluating tl rc
[a08][[20152,1],1][btl_uct_tl.c:390:mca_btl_uct_set_tl_rdma] tl rc is suitable for RDMA
[a08][[20152,1],1][btl_uct_tl.c:483:mca_btl_uct_evaluate_tl] tl has flags 0xd2408000067f

...

[a08][[20152,1],1][btl_uct_component.c:455:mca_btl_uct_component_init] uct btl initialization complete. found 1 suitable memory domains
[a08.alembert:12529] select: init of component uct returned success
[a08.alembert:12529] select: initializing btl component self
[a08.alembert:12529] select: init of component self returned success
[a07][[20152,1],0][btl_uct_module.c:56:mca_btl_uct_get_ep] endpoint initialized. new endpoint: 0x7b9ca0
[a07.alembert:03663] mca: bml: Using uct btl for send to [[20152,1],0] on node a07
[a08][[20152,1],1][btl_uct_module.c:56:mca_btl_uct_get_ep] endpoint initialized. new endpoint: 0x7b9ca0
[a07.alembert:03663] mca: bml: Using self btl for send to [[20152,1],0] on node a07
[a08.alembert:12529] mca: bml: Using uct btl for send to [[20152,1],1] on node a08
[a08.alembert:12529] mca: bml: Using self btl for send to [[20152,1],1] on node a08
0: a07
[a08][[20152,1],1][btl_uct_module.c:56:mca_btl_uct_get_ep] endpoint initialized. new endpoint: 0x7bf790
[a08.alembert:12529] mca: bml: Using uct btl for send to [[20152,1],0] on node a07
[a08][[20152,1],1][btl_uct_endpoint.c:316:mca_btl_uct_endpoint_connect] checking endpoint 0x7bf790 with context id 0. cached uct ep: (nil), ready: 0
[a08][[20152,1],1][btl_uct_endpoint.c:341:mca_btl_uct_endpoint_connect] received modex of size 46 for proc [[20152,1],0]. module count 1
[a08][[20152,1],1][btl_uct_endpoint.c:348:mca_btl_uct_endpoint_connect] found modex for md ib/mlx5_0, searching for ib/mlx5_0
[a08][[20152,1],1][btl_uct_endpoint.c:69:mca_btl_uct_process_modex] processing remote modex data
[a08][[20152,1],1][btl_uct_endpoint.c:72:mca_btl_uct_process_modex] modex contains RDMA data
[a08][[20152,1],1][btl_uct_endpoint.c:59:mca_btl_uct_process_modex_tl] processing modex for tl
                                                                                               . size: 12
[a08][[20152,1],1][btl_uct_endpoint.c:82:mca_btl_uct_process_modex] modex contains active message data
[a08][[20152,1],1][btl_uct_endpoint.c:59:mca_btl_uct_process_modex_tl] processing modex for tl . size: 16
[a08][[20152,1],1][btl_uct_endpoint.c:114:mca_btl_uct_endpoint_connect_iface] connecting endpoint to interface
[a08:12529:0:12529] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
1: a08
[a07][[20152,1],0][btl_uct_module.c:56:mca_btl_uct_get_ep] endpoint initialized. new endpoint: 0x7bf9d0
[a07.alembert:03663] mca: bml: Using uct btl for send to [[20152,1],1] on node a08
[a07][[20152,1],0][btl_uct_endpoint.c:316:mca_btl_uct_endpoint_connect] checking endpoint 0x7bf9d0 with context id 0. cached uct ep: (nil), ready: 0
[a07][[20152,1],0][btl_uct_endpoint.c:341:mca_btl_uct_endpoint_connect] received modex of size 46 for proc [[20152,1],1]. module count 1
[a07][[20152,1],0][btl_uct_endpoint.c:348:mca_btl_uct_endpoint_connect] found modex for md ib/mlx5_0, searching for ib/mlx5_0
[a07][[20152,1],0][btl_uct_endpoint.c:69:mca_btl_uct_process_modex] processing remote modex data
[a07][[20152,1],0][btl_uct_endpoint.c:72:mca_btl_uct_process_modex] modex contains RDMA data
[a07][[20152,1],0][btl_uct_endpoint.c:59:mca_btl_uct_process_modex_tl] processing modex for tl
                                                                                               . size: 12
[a07][[20152,1],0][btl_uct_endpoint.c:82:mca_btl_uct_process_modex] modex contains active message data
[a07][[20152,1],0][btl_uct_endpoint.c:59:mca_btl_uct_process_modex_tl] processing modex for tl . size: 16
[a07][[20152,1],0][btl_uct_endpoint.c:114:mca_btl_uct_endpoint_connect_iface] connecting endpoint to interface
[a07:3663 :0:3663] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))

I'm not sure.

@hjelmn (Member Author) commented Oct 9, 2018

That's the problem. UD is selected for active messages in your case. I will fix that now. Try with --mca btl_uct_transports rc,ud.
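
For reference, combined with the mpirun line above, the override would look like, e.g.:

mpirun -np 2 --map-by node --bind-to none -mca pml ob1 -mca btl uct,self -mca btl_uct_memory_domains ib/mlx5_0 --mca btl_uct_transports rc,ud ./NPmpi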

@hjelmn (Member Author) commented Oct 9, 2018

With UD as AM:

mpirun --mca btl_uct_transports ud,rc  -np 2 --map-by node --bind-to none -mca pml ob1 -mca btl uct,self -mca btl_uct_memory_domains ib/mlx5_0 ./NPmpi
0: cn826
1: cn827
[cn827:25061:0:25061] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[cn826:36812:0:36812] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace ====
===================
==== backtrace ====
===================

@thananon (Member) commented Oct 9, 2018

--mca btl_uct_transports rc,ud works, or at least it does up to 8 kB.

@hjelmn (Member Author) commented Oct 9, 2018

OK, adding a check to ensure we don't try to use UD for communication. It's a special case.
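
A minimal sketch of the kind of check described here (hypothetical names; not the actual commit): treat a connection-only transport such as ud as ineligible when picking the active-message or RDMA transport, while still allowing it for connection setup.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct tl_candidate {
    const char *name;
    bool conn_only;   /* usable for connection setup only (e.g. ud) */
};

/* Return the first candidate that may carry communication traffic; the
 * array is assumed to already be sorted by priority. */
static const struct tl_candidate *select_comm_tl(const struct tl_candidate *tls,
                                                 size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        if (tls[i].conn_only) {
            continue;   /* never use a connection-only tl for AM or RDMA */
        }
        return &tls[i];
    }
    return NULL;
}

int main(void)
{
    const struct tl_candidate tls[] = { { "ud", true }, { "rc", false } };
    const struct tl_candidate *chosen = select_comm_tl(tls, 2);

    printf("selected tl: %s\n", chosen ? chosen->name : "(none)");
    return 0;
}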

@hjelmn (Member Author) commented Oct 9, 2018

Done. Please try the latest version without setting the transports. UD will get thrown to the end of the list of TLs to check.

@thananon (Member) commented Oct 9, 2018

👍 works.

@hjelmn (Member Author) commented Oct 9, 2018

Did this fix the zero-copy issue?

@thananon (Member) commented Oct 9, 2018

It did.

hjelmn merged commit 121b492 into open-mpi:master on Oct 9, 2018
hjelmn deleted the uct_update branch on October 9, 2018 22:28
@thananon (Member) commented Oct 22, 2018

FYI: I did a performance run and this patch reduced multithreaded performance by around 20-25%.

Pre-patch:

1       5       100     1024            1643245.075239  0.038947 sec
1       6       100     1024            1697376.510113  0.045246 sec
1       9       100     1024            1705292.579801  0.067554 sec

Post patch:

1       7       100     1024            1331728.184706  0.067281 sec
1       8       100     1024            1343513.668205  0.076218 sec
1       9       100     1024            1256868.384769  0.091656 sec
1       10      100     1024            1279331.408613  0.100052 sec

Seems like something got funneled here.

@hjelmn (Member Author) commented Oct 22, 2018

@thananon Interesting. The only thing I can think of is the incorrect fragment fields making the results look better than they should have. The other possibility is that recursive progress is no longer made, because it can lead to deep recursion.
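
As a rough sketch of that second point (a guard of this kind is an assumption here, not the actual opal/btl code), recursive progress can be suppressed with a reentrancy flag:

#include <stdio.h>

static int in_progress = 0;   /* would be thread-local in a threaded build */

static int poll_transports(void)
{
    /* placeholder for polling the uct workers; returns completions found */
    return 0;
}

static int component_progress(void)
{
    int count;

    if (in_progress) {
        return 0;   /* already progressing further up the call stack */
    }

    in_progress = 1;
    count = poll_transports();
    in_progress = 0;

    return count;
}

int main(void)
{
    printf("completions: %d\n", component_progress());
    return 0;
}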

@thananon (Member) commented Oct 22, 2018

Hmm, thank you @hjelmn. I will look into this. I think you are right about the results looking better than they should. btl/uct should not beat the openib btl single-threaded, right?

@hjelmn (Member Author) commented Oct 22, 2018

@thananon Not sure. I would assume that going directly to verbs is better than UCT when not using the optimized paths (rc_mlx5, dc_mlx5). Otherwise UCT will be faster.
