prov/shm: introduce gdrcopy awareness to hmem copy #8881
Conversation
Force-pushed from 50bee13 to 62335ec (compare)
Force-pushed from 62335ec to ba36f26 (compare)
prov/shm/src/smr_hmem.h (outdated diff)
		      uint64_t iov_offset)
{
	if (FI_VERSION_GE(smr_prov.fi_version, FI_VERSION(1,19)) &&
	    1 == iov_count && mr && mr[0] && FI_HMEM_CUDA == mr[0]->iface &&
Why 1 == iov_count?
I remember there was such a condition in shm for IPC. I guess the reason here is that we may not want to support mixed iface types.
Yes. SHM currently does not support mixed HMEM ifaces. Here is an example: https://github.com/ofiwg/libfabric/blob/main/prov/shm/src/smr_msg.c#L312
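A minimal sketch of that single-iface assumption, for illustration only (the helper name smr_first_iface is hypothetical, not actual smr code; it mirrors the pattern in the linked smr_msg.c, where the iface of the first descriptor is taken to cover the whole iov):

#include <rdma/fi_domain.h>	/* enum fi_hmem_iface */
#include "ofi_mr.h"		/* struct ofi_mr (libfabric internal header) */

/* Hypothetical helper: shm takes the iface from the first descriptor and does
 * not handle mixed ifaces across iov entries, which is why paths like the one
 * above also check iov_count == 1. Falls back to host memory without MR info. */
static inline enum fi_hmem_iface
smr_first_iface(void **desc, size_t iov_count)
{
	if (!iov_count || !desc || !desc[0])
		return FI_HMEM_SYSTEM;

	return ((struct ofi_mr *) desc[0])->iface;
}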
shm does not support mixed iface for the IPC path but does support mixed ifaces in the inline/inject paths, which is what this is optimizing. I added the ofi_copy_from_mr_iov functions for that reason.
I'm thinking it could be better to have the gdr copy size limit be global and move this to the mr_iov copy path, since the hmem data is part of the mr info. Thoughts?
> shm does not support mixed iface for the IPC path but does support mixed ifaces in the inline/inject paths which is what this is optimizing.
Actually we are optimizing the current cuda IPC path, which only supports one HMEM iface as mentioned. But what you are suggesting here also makes sense:
> it could be better to have the gdr copy size limit be global ...
It could be simpler to make the size limit global, because this limit only depends on the hardware and the cuda driver, not on the efa/shm/etc. provider.
> ... and move this to the mr_iov copy path
I considered this option but it sorta conflicted with our intention of separating gdrcopy from the generic hmem copy. Our idea is to make the caller explicitly choose the copy method, so hiding gdrcopy inside the mr iov copy path sounds contradictory to that end.
You pointed out a valid issue - the current implementation will not use gdrcopy for the inline/inject paths when there are mixed ifaces. This is not an issue here since we are optimizing the cuda IPC path, which has only one iface, i.e.
/* Do not inline/inject if IPC is available so device to device
 * transfer may occur if possible. */
if (iov_count == 1 && desc && desc[0]) {
	smr_desc = (struct ofi_mr *) *desc;
	iface = smr_desc->iface;
	use_ipc = ofi_hmem_is_ipc_enabled(iface) &&
		  smr_desc->flags & FI_HMEM_DEVICE_ONLY &&
		  !(op_flags & FI_INJECT);
	if (FI_VERSION_GE(smr_prov.fi_version, FI_VERSION(1,19)) &&
	    FI_HMEM_CUDA == iface &&
	    (OFI_HMEM_DATA_GDRCOPY_HANDLE & smr_desc->flags)) {
		assert(smr_desc->hmem_data);
		gdrcopy_available = true;
	}
}
An alternative to changing the common mr iov path is to change e.g. smr_copy_from_mr_iov and iterate over the ifaces, so we can choose the proper memcopy method case by case. The downside is code duplication with the mr iov copy, plus we don't have an immediate use for that.
@aingerson WDYT?
@wenduwan I see what you mean about the duplicate code... I'm leaning towards just doing it in the mr iov path, maybe with a compile check around the gdr check? shm is the only provider using the mr iov copy functions, and since the gdrcopy hmem data is in the ofi mr, I think it makes sense to say that if you use those functions you are opting in to the gdr copy whenever the gdr copy handle was registered.
If another provider/use case comes along that needs to use these functions but explicitly needs to avoid the gdr path on a system where it is available and the gdr handle was registered in the hmem data... then I think we can cross that bridge when we come to it.
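For illustration, the opt-in could look roughly like this inside the common path (a sketch only; the HAVE_GDRCOPY guard name and the helper are assumptions, not the actual patch):

#include <stdbool.h>
#include "ofi_mr.h"	/* struct ofi_mr (libfabric internal header) */

/* Hypothetical eligibility check for the common mr-iov copy path: registering
 * the gdr handle (OFI_HMEM_DATA_GDRCOPY_HANDLE + hmem_data) is treated as the
 * opt-in; builds without gdrcopy support compile this away. */
static inline bool ofi_mr_gdrcopy_eligible(const struct ofi_mr *mr)
{
#if HAVE_GDRCOPY
	return mr && mr->iface == FI_HMEM_CUDA &&
	       (mr->flags & OFI_HMEM_DATA_GDRCOPY_HANDLE) &&
	       mr->hmem_data;
#else
	(void) mr;
	return false;
#endif
}

Builds without gdrcopy would then behave exactly as today, which keeps the opt-in implicit in the MR flags.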
@aingerson Sneaky! Yeah if SHM is the only user then we can definitely do that.
But I just realized there is another caveat. For gdrcopy to work, the caller should make sure the src and dest cannot both be on cuda. Based on my reading SHM does not have this issue (I need to run more tests), so I will leave a note in the code.
Oh, if that's the case, then we can never use gdr copy for the inline/inject paths, since those always copy to the shm region, which will never be cuda. If we have to make sure they are both cuda, then we can only call it on the progress IPC path when cmd->iface and rx_entry->iface are both cuda. I think it would be best to keep the ofi_mr copy functions as is and just add a case in the progress IPC path to call the gdrcopy function.
@aingerson lol I think you read this in reverse?
> For gdrcopy to work, the caller should make sure the src and dest cannot both be on cuda
@wenduwan oh lol yup. my bad!
bot:aws:retest
Force-pushed from fbba715 to 3547270 (compare)
Just a few comments. I think this approach seems the best to me. Thank you!
From v1.19, the OFI_HMEM_DATA_GDRCOPY_HANDLE flag will signal the presence of a gdrcopy handle in ofi_mr.hmem_data. The SHM provider can take advantage of gdrcopy to achieve lower memcpy latencies from/to cuda devices. This patch introduces the logic to select gdrcopy on hmem copy paths, with the exception of cuda IPC, which is not supported by gdrcopy.
Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
Force-pushed from 3547270 to 721c96a (compare)
@aingerson Please see the update. Thanks for simplifying the end result!
Thank you for all the adjustments!
LGTM, one minor comment:
 * TODO: Fine tune the max data size to switch from gdrcopy to cudaMemcpy
 * Note: buf must be on the host since gdrcopy does not support D2D copy
 */
if (dir == OFI_COPY_BUF_TO_IOV)
Hmmm, so there is no way to have cuda_gdrcopy* called by ofi_hmem_copy*?
For that we need to pass along hmem_data in addition to device. Based on our prior discussion this is not a good idea since gdrcopy is the single use case right now. We prefer to keep the hmem copy API stable at this point and switch to gdrcopy at a higher level of the call stack, hence we are not changing the hmem copy functions.
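In other words the decision is made one level up, where struct ofi_mr and therefore hmem_data are in hand. A rough sketch of that dispatch (the wrapper name is hypothetical and the cuda_gdrcopy_to_dev/from_dev signatures are assumed):

/* Hypothetical higher-level dispatch: ofi_hmem_copy* only knows (iface,
 * device), so the gdrcopy branch lives in the mr-aware copy path instead.
 * buf is assumed to be on the host, since gdrcopy has no D2D copy. */
static void copy_mr_buf(struct ofi_mr *mr, void *mr_addr, void *buf,
			size_t len, int dir)
{
	if (mr->iface == FI_HMEM_CUDA &&
	    (mr->flags & OFI_HMEM_DATA_GDRCOPY_HANDLE)) {
		/* gdr handle assumed to be stored in hmem_data */
		if (dir == OFI_COPY_BUF_TO_IOV)
			cuda_gdrcopy_to_dev((uint64_t) (uintptr_t) mr->hmem_data,
					    mr_addr, buf, len);
		else
			cuda_gdrcopy_from_dev((uint64_t) (uintptr_t) mr->hmem_data,
					      buf, mr_addr, len);
	} else if (dir == OFI_COPY_BUF_TO_IOV) {
		ofi_copy_to_hmem(mr->iface, mr->device, mr_addr, buf, len);
	} else {
		ofi_copy_from_hmem(mr->iface, mr->device, buf, mr_addr, len);
	}
}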
Got it, makes sense. I see it is wrapped inside ofi_mr_iov_copy, which should be fine.
What is the error in Intel CI?
@shijin-aws Unrelated CI error. You can ignore.
From v1.19, the OFI_HMEM_DATA_GDRCOPY_HANDLE flag will signal the presence of a gdrcopy handle in ofi_mr.hmem_data. The SHM provider can take advantage of gdrcopy to achieve lower memcpy latencies from/to cuda devices.
This patch introduces the logic to select gdrcopy on hmem copy paths, with the exception of cuda IPC, which is not supported by gdrcopy.
For OMB single-node cuda->cuda latency we are seeing an improvement (via the EFA owner provider, which is responsible for gdr pinning).
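For completeness, the owner-provider side (EFA in our measurements) roughly does the gdr pinning like this; a sketch under assumptions only (ofi_hmem_dev_register as the pinning entry point, and the returned handle being what lands in ofi_mr.hmem_data), not the actual EFA code:

/* Hypothetical registration-side sketch: pin the cuda buffer with gdrcopy and
 * advertise the handle through the MR so peer providers (e.g. shm) can use it. */
static int register_gdrcopy_handle(struct ofi_mr *mr, const void *addr,
				   size_t len)
{
	uint64_t handle;
	int ret;

	if (mr->iface != FI_HMEM_CUDA)
		return 0;	/* nothing to advertise for non-cuda memory */

	ret = ofi_hmem_dev_register(FI_HMEM_CUDA, addr, len, &handle);
	if (ret)
		return 0;	/* no gdr handle; regular cudaMemcpy paths still work */

	mr->hmem_data = (void *) (uintptr_t) handle;	/* handle storage format assumed */
	mr->flags |= OFI_HMEM_DATA_GDRCOPY_HANDLE;
	return 0;
}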