
Definition peer to peer support #8610

Closed
wzamazon opened this issue Mar 3, 2023 · 9 comments

Comments

@wzamazon
Contributor

wzamazon commented Mar 3, 2023

This came from discussion #8529.

The background is that applications like NCCL need a way to specify that a libfabric endpoint cannot make calls to the CUDA API to support CUDA memory.

@shefty suggested using the FI_OPT_FI_HMEM_P2P endpoint option with the FI_HMEM_P2P_REQUIRED value, which is currently documented as follows:

FI_HMEM_P2P_REQUIRED: Peer to peer support must be used for transfers, transfers that cannot be performed using p2p will be reported as failing.

From https://ofiwg.github.io/libfabric/main/man/fi_endpoint.3.html

However, to use this option for the purpose I described, we need a definition of "peer to peer support", which is lacking in the fi_endpoint document. So I opened this issue to ask whether the libfabric community can agree on a definition of "peer to peer" support.

One thing I want to mention is that NCCL does allow libfabric to use GDRcopy; see this comment from @jdinan. The EFA provider does use GDRcopy when used by NCCL and has found it to be efficient for small messages.

I understand that other providers, like RxM, also want to use GDRcopy to support NCCL.

Therefore, I think it would be ideal if we could define "peer to peer support" in a way that mechanisms like GDRcopy count as "peer to peer" support.
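
For context, here is a minimal sketch of how an application could request this behavior through fi_setopt (assuming an already-opened struct fid_ep *ep; this is just an illustration, not code from the discussion):

```c
/* Hedged sketch: ask the provider to use true peer-to-peer for all HMEM
 * transfers. Assumes "ep" is an already-opened struct fid_ep *; error
 * handling is omitted for brevity. */
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

static int require_hmem_p2p(struct fid_ep *ep)
{
	int p2p_opt = FI_HMEM_P2P_REQUIRED;

	/* FI_OPT_FI_HMEM_P2P is the endpoint-level option documented in
	 * fi_endpoint(3); a provider that cannot honor the requested mode
	 * is expected to fail the call. */
	return fi_setopt(&ep->fid, FI_OPT_ENDPOINT, FI_OPT_FI_HMEM_P2P,
			 &p2p_opt, sizeof(p2p_opt));
}
```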

@shefty
Member

shefty commented Mar 3, 2023

Peer to peer is meant to describe PCI peer to peer transfers, or device to device transfers that do not require bouncing data through host buffers. This could also apply to other device buses, not just PCI.

@wzamazon
Contributor Author

wzamazon commented Mar 6, 2023

I see.

I think for the case of NCCL, HMEM_P2P_REQUIRED is too strong. Basically, it needs a way to know whether the provider is capable of P2P, not necessarily that all transfers must go through peer to peer.

I am reading the man page for FI_HMEM_P2P_ENABLED. It does not specify what a provider should do if it does not support peer to peer.

Would it be reasonable for a provider to return -FI_EOPNOTSUPP if the user sets FI_HMEM_P2P_ENABLED and the provider is incapable of peer to peer support?
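
If providers agreed to behave that way, an application could probe for P2P capability with a pattern like the following sketch (the -FI_EOPNOTSUPP behavior is the proposal above, not currently documented semantics; assumes an already-opened struct fid_ep *ep):

```c
/* Hedged sketch of the probing pattern discussed above: try to enable P2P
 * and treat -FI_EOPNOTSUPP as "this provider cannot do peer to peer". */
#include <stdbool.h>
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_errno.h>

static bool provider_supports_p2p(struct fid_ep *ep)
{
	int p2p_opt = FI_HMEM_P2P_ENABLED;
	int ret;

	ret = fi_setopt(&ep->fid, FI_OPT_ENDPOINT, FI_OPT_FI_HMEM_P2P,
			&p2p_opt, sizeof(p2p_opt));
	return ret != -FI_EOPNOTSUPP;
}
```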

@shefty
Member

shefty commented Mar 6, 2023

Maybe the question is whether HMEM_P2P_REQUIRED is useful? Or is it only useful if it also allows gdrcopy?

Does gdrcopy behave the same as if p2p were used?

@wzamazon
Contributor Author

wzamazon commented Mar 6, 2023

> Maybe the question is whether HMEM_P2P_REQUIRED is useful? Or is it only useful if it also allows gdrcopy?

I think P2P_REQUIRED is still useful if we define P2P support as the NIC accessing HMEM memory directly.

I can think of at least one case where NCCL wants libfabric to only use the NIC to access HMEM memory (i.e., NOT use gdrcopy): when NCCL uses its LL128 protocol.

> Does gdrcopy behave the same as if p2p were used?

I do not think so. gdrcopy basically maps GPU memory into the host's memory address space and then does a memcpy, so the transfer is driven by the CPU.
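
For reference, a rough sketch of that map-then-memcpy flow using the GDRCopy API (the CUDA device pointer d_ptr is assumed to be already allocated, e.g. with cudaMalloc, and error handling is omitted):

```c
/* Hedged sketch of the CPU-driven copy path described above, using the
 * GDRCopy library. In real code the pinned address and size must be
 * GPU-page aligned. */
#include <stddef.h>
#include <gdrapi.h>

void copy_to_gpu_via_gdrcopy(unsigned long d_ptr, const void *host_src,
			     size_t len)
{
	gdr_t g = gdr_open();
	gdr_mh_t mh;
	void *map_va;

	/* Pin the GPU buffer and map it into the host address space. */
	gdr_pin_buffer(g, d_ptr, len, 0, 0, &mh);
	gdr_map(g, mh, &map_va, len);

	/* The data movement itself is a CPU-driven copy into the mapping,
	 * not a NIC-initiated PCIe peer-to-peer transfer. */
	gdr_copy_to_mapping(mh, map_va, host_src, len);

	gdr_unmap(g, mh, map_va, len);
	gdr_unpin_buffer(g, mh);
	gdr_close(g);
}
```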

@shefty
Member

shefty commented Mar 6, 2023

So, it sounds like we need some other option that can be used to query/restrict the type of operations that a provider can undertake. Maybe this is a new HMEM option, or some sort of XPU option. Right now there's no way to convey that P2P is okay, but if you can't use P2P, then only this 'other' mechanism is usable.

That's hard to define generically, however. Maybe it's something like P2P_OR_CPU_ONLY?

@shefty
Member

shefty commented Mar 7, 2023

From the ofiwg call: keep the current FI_HMEM_P2P options restrictive in their definition. May need a CUDA-specific option. NCCL restricts the use of any CUDA call from any lower layer. Proposal: FI_CUDA_API_ENABLED/ALLOWED/DISABLED/PERMITTED? A boolean option is sufficient.

@wzamazon
Contributor Author

wzamazon commented Mar 8, 2023

#8624 introduced FI_CUDA_API_PERMITTED
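
As a rough illustration, a NCCL-style consumer could set the new option like this (a sketch based on the fi_endpoint(3) description of FI_OPT_CUDA_API_PERMITTED; assumes an already-opened struct fid_ep *ep):

```c
/* Hedged sketch: tell the provider it must not call the CUDA API on this
 * endpoint. */
#include <stdbool.h>
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

static int forbid_cuda_calls(struct fid_ep *ep)
{
	bool permitted = false;

	/* A provider that cannot support CUDA memory without calling the
	 * CUDA API is expected to fail here (e.g. -FI_EOPNOTSUPP), letting
	 * the caller fall back to another provider or to host memory. */
	return fi_setopt(&ep->fid, FI_OPT_ENDPOINT, FI_OPT_CUDA_API_PERMITTED,
			 &permitted, sizeof(permitted));
}
```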

@shefty
Member

shefty commented Jun 5, 2023

Has this issue been resolved with the introduction of FI_CUDA_API_PERMITTED?

@wzamazon
Contributor Author

wzamazon commented Jun 5, 2023

Yes

@wzamazon wzamazon closed this as completed Jun 5, 2023