Description
Running Open MPI with the Libfabric (RXM/verbs) MTL works fine, but there is no shared-memory support in that configuration. Open MPI also has an OFI BTL, but RXM is explicitly excluded there due to a lack of real FI_DELIVERY_COMPLETE support. Ref: 41acfee
If I comment out that exclusion, it "works": I can see shared memory being used for intra-node communication, with a massive performance benefit. Of course, I may have just opened myself up to subtle race conditions and correctness issues...
Can we re-evaluate whether this still needs to be excluded? This may be a libfabric question, so I will follow up there as well.
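For reference, this is roughly how I have been probing the libfabric side for providers whose default TX completion semantic is FI_DELIVERY_COMPLETE (a minimal standalone sketch, not Open MPI code; link with `-lfabric`). Whether a provider that comes back here implements the semantic natively, rather than emulating or silently downgrading it, is exactly the open question:

```c
/* Minimal probe: request FI_DELIVERY_COMPLETE as the default TX completion
 * semantic in the hints and see which providers fi_getinfo returns. */
#include <stdio.h>
#include <rdma/fabric.h>

int main(void)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL;
    int ret;

    if (!hints)
        return 1;

    hints->ep_attr->type = FI_EP_RDM;
    hints->caps = FI_MSG | FI_TAGGED;
    hints->tx_attr->op_flags = FI_DELIVERY_COMPLETE;  /* requested default semantic */

    ret = fi_getinfo(FI_VERSION(1, 9), NULL, NULL, 0, hints, &info);
    if (ret) {
        fprintf(stderr, "fi_getinfo: %s\n", fi_strerror(-ret));
    } else {
        for (struct fi_info *cur = info; cur; cur = cur->next)
            printf("provider: %s, domain: %s\n",
                   cur->fabric_attr->prov_name, cur->domain_attr->name);
        fi_freeinfo(info);
    }
    fi_freeinfo(hints);
    return ret ? 1 : 0;
}
```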
Out of curiosity, is there a reason why Open MPI needs the "delivery complete" semantic as opposed to "transmit complete"? If so, is there a test case or example of how the more relaxed guarantee could result in a correctness issue?
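To make the distinction concrete, my understanding is that both semantics can also be requested per operation via fi_sendmsg flags; a minimal sketch, with `ep`, `buf`, `len`, and `dest_addr` assumed to be set up elsewhere:

```c
/* Sketch: request a specific completion semantic on a single send.
 * With FI_TRANSMIT_COMPLETE the completion only means the transport has
 * accepted/ACKed the message; with FI_DELIVERY_COMPLETE it means the data
 * is visible in the target process's memory. */
#include <sys/uio.h>
#include <rdma/fi_endpoint.h>

static ssize_t send_with_semantic(struct fid_ep *ep, const void *buf,
                                  size_t len, fi_addr_t dest_addr,
                                  uint64_t completion_flag)
{
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    struct fi_msg msg = {
        .msg_iov   = &iov,
        .desc      = NULL,
        .iov_count = 1,
        .addr      = dest_addr,
        .context   = NULL,
        .data      = 0,
    };

    /* completion_flag is FI_TRANSMIT_COMPLETE or FI_DELIVERY_COMPLETE */
    return fi_sendmsg(ep, &msg, FI_COMPLETION | completion_flag);
}
```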
I could be wrong, but I assume that libfabric is basically using the underlying verbs RDMA write completion, which indeed doesn't guarantee that the data has actually landed in remote memory yet (only that the remote NIC has ACKed it). However, it is guaranteed that any subsequent read will reflect all previously written data from prior WQEs, so I am curious how any side effect of this weaker completion guarantee could actually be observed.
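For what it's worth, this is the verbs-level pattern I have in mind, sketched with libibverbs (`qp`, `mr`, the local buffer, and the remote addr/rkey are assumed to be set up elsewhere; error handling and completion polling omitted):

```c
/* Sketch: a signaled RDMA_WRITE whose completion only tells us the
 * responder's NIC ACKed the request, chained with an RDMA_READ on the
 * same QP, which is ordered after the write and should return the
 * written data even if placement in remote memory was still in flight
 * when the write completion was generated. */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

static int write_then_read(struct ibv_qp *qp, struct ibv_mr *mr,
                           void *local_buf, size_t len,
                           uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr write_wr, read_wr, *bad_wr = NULL;

    memset(&write_wr, 0, sizeof(write_wr));
    write_wr.wr_id               = 1;
    write_wr.sg_list             = &sge;
    write_wr.num_sge             = 1;
    write_wr.opcode              = IBV_WR_RDMA_WRITE;
    write_wr.send_flags          = IBV_SEND_SIGNALED;   /* completion = remote NIC ACK */
    write_wr.wr.rdma.remote_addr = remote_addr;
    write_wr.wr.rdma.rkey        = rkey;

    memset(&read_wr, 0, sizeof(read_wr));
    read_wr.wr_id               = 2;
    read_wr.sg_list             = &sge;
    read_wr.num_sge             = 1;
    read_wr.opcode              = IBV_WR_RDMA_READ;      /* ordered after the write on this QP */
    read_wr.send_flags          = IBV_SEND_SIGNALED;
    read_wr.wr.rdma.remote_addr = remote_addr;
    read_wr.wr.rdma.rkey        = rkey;

    write_wr.next = &read_wr;
    return ibv_post_send(qp, &write_wr, &bad_wr);
}
```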
I am also curious how this same thing is handled with UCX - as in, are they doing a full software ACK?