prov/shm: introduce gdrcopy awareness to hmem copy #8881
Conversation
Force-pushed from 50bee13 to 62335ec (compare)
Force-pushed from 62335ec to ba36f26 (compare)
prov/shm/src/smr_hmem.h (outdated diff)
		      uint64_t iov_offset)
{
	if (FI_VERSION_GE(smr_prov.fi_version, FI_VERSION(1,19)) &&
	    1 == iov_count && mr && mr[0] && FI_HMEM_CUDA == mr[0]->iface &&
Why 1 == iov_count?
I remember there was such a condition in shm for IPC. I guess the reason here is that we may not want to support mixed iface types.
Yes. SHM currently does not support mixed HMEM ifaces. Here is an example: https://github.com/ofiwg/libfabric/blob/main/prov/shm/src/smr_msg.c#L312
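A minimal sketch of that single-iface assumption, for illustration only (the helper name smr_first_iface is hypothetical, not actual smr code; it mirrors the pattern in the linked smr_msg.c, where the iface of the first descriptor is taken to cover the whole iov):

#include <rdma/fi_domain.h>	/* enum fi_hmem_iface */
#include "ofi_mr.h"		/* struct ofi_mr (libfabric internal header) */

/* Hypothetical helper: shm takes the iface from the first descriptor and does
 * not handle mixed ifaces across iov entries, which is why paths like the one
 * above also check iov_count == 1. Falls back to host memory without MR info. */
static inline enum fi_hmem_iface
smr_first_iface(void **desc, size_t iov_count)
{
	if (!iov_count || !desc || !desc[0])
		return FI_HMEM_SYSTEM;

	return ((struct ofi_mr *) desc[0])->iface;
}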
shm does not support mixed iface for the IPC path but does support mixed ifaces in the inline/inject paths, which is what this is optimizing. I added the ofi_copy_from_mr_iov functions for that reason.
I'm thinking it could be better to have the gdr copy size limit be global and move this to the mr_iov copy path, since the hmem data is part of the mr info. Thoughts?
> shm does not support mixed iface for the IPC path but does support mixed ifaces in the inline/inject paths which is what this is optimizing.
Actually we are optimizing the current cuda IPC path, which only supports one HMEM iface as mentioned. But what you are suggesting here also makes sense:
> it could be better to have the gdr copy size limit be global ...
It could be simpler to make the size limit global, because this limit only depends on the hardware and the cuda driver, not on the efa/shm/etc. provider.
> ... and move this to the mr_iov copy path
I considered this option but it sorta conflicted with our intention of separating gdrcopy from the generic hmem copy. Our idea is to make the caller explicitly choose the copy method, so hiding gdrcopy inside the mr iov copy path sounds contradictory to that end.
You pointed out a valid issue - the current implementation will not use gdrcopy for the inline/inject paths when there are mixed ifaces. This is not an issue here since we are optimizing the cuda IPC path, which has only one iface, i.e.
/* Do not inline/inject if IPC is available so device to device
 * transfer may occur if possible. */
if (iov_count == 1 && desc && desc[0]) {
	smr_desc = (struct ofi_mr *) *desc;
	iface = smr_desc->iface;
	use_ipc = ofi_hmem_is_ipc_enabled(iface) &&
		  smr_desc->flags & FI_HMEM_DEVICE_ONLY &&
		  !(op_flags & FI_INJECT);
	if (FI_VERSION_GE(smr_prov.fi_version, FI_VERSION(1,19)) &&
	    FI_HMEM_CUDA == iface &&
	    (OFI_HMEM_DATA_GDRCOPY_HANDLE & smr_desc->flags)) {
		assert(smr_desc->hmem_data);
		gdrcopy_available = true;
	}
}
An alternative to changing the common mr iov path is to change e.g. smr_copy_from_mr_iov and iterate over the ifaces, so we can choose the proper memcopy method case by case. The downside is code duplication with the mr iov copy, plus we don't have an immediate use for that.
@aingerson WDYT?
@wenduwan I see what you mean about the duplicate code... I'm leaning towards just doing it in the mr iov path, maybe with a compile check around the gdr check? shm is the only provider using the mr iov copy functions, and since the gdrcopy hmem data is in the ofi mr, I think it makes sense to say that if you use those functions you are opting in to the gdr copy whenever the gdr copy handle was registered.
If another provider/use case comes along that needs to use these functions but explicitly needs to avoid the gdr path on a system where it is available and the gdr handle was registered in the hmem data... then I think we can cross that bridge when we come to it.
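For illustration, the opt-in could look roughly like this inside the common path (a sketch only; the HAVE_GDRCOPY guard name and the helper are assumptions, not the actual patch):

#include <stdbool.h>
#include "ofi_mr.h"	/* struct ofi_mr (libfabric internal header) */

/* Hypothetical eligibility check for the common mr-iov copy path: registering
 * the gdr handle (OFI_HMEM_DATA_GDRCOPY_HANDLE + hmem_data) is treated as the
 * opt-in; builds without gdrcopy support compile this away. */
static inline bool ofi_mr_gdrcopy_eligible(const struct ofi_mr *mr)
{
#if HAVE_GDRCOPY
	return mr && mr->iface == FI_HMEM_CUDA &&
	       (mr->flags & OFI_HMEM_DATA_GDRCOPY_HANDLE) &&
	       mr->hmem_data;
#else
	(void) mr;
	return false;
#endif
}

Builds without gdrcopy would then behave exactly as today, which keeps the opt-in implicit in the MR flags.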
@aingerson Sneaky! Yeah if SHM is the only user then we can definitely do that.
But I just realized there is another caveat. For gdrcopy to work, the caller should make sure the src and dest cannot both be on cuda. Based on my reading SHM does not have this issue (I need to run more tests), so I will leave a note in the code.
Oh, if that's the case, then we can never use gdr copy for the inline/inject paths, since those always copy to the shm region, which will never be cuda. If we have to make sure they are both cuda, then we can only call it on the progress IPC path when cmd->iface and rx_entry->iface are both cuda. I think it would be best to keep the ofi_mr copy functions as is and just add a case in the progress IPC path to call the gdrcopy function.
@aingerson lol I think you read this in reverse?
> For gdrcopy to work, the caller should make sure the src and dest cannot both be on cuda
@wenduwan oh lol yup. my bad!
bot:aws:retest
Force-pushed from fbba715 to 3547270 (compare)
Just a few comments. I think this approach seems the best to me. Thank you!
From v1.19, the OFI_HMEM_DATA_GDRCOPY_HANDLE flag will signal the presence of a gdrcopy handle in ofi_mr.hmem_data. The SHM provider can take advantage of gdrcopy to achieve lower memcpy latencies from/to cuda devices. This patch introduces the logic to select gdrcopy on hmem copy paths, with the exception of cuda IPC, which is not supported by gdrcopy.
Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
Force-pushed from 3547270 to 721c96a (compare)
@aingerson Please see the update. Thanks for simplifying the end result!
Thank you for all the adjustments!
LGTM, one minor comment:
 * TODO: Fine tune the max data size to switch from gdrcopy to cudaMemcpy
 * Note: buf must be on the host since gdrcopy does not support D2D copy
 */
if (dir == OFI_COPY_BUF_TO_IOV)
Hmmm, so there is no way to have cuda_gdrcopy* called by ofi_hmem_copy*?
For that we need to pass along hmem_data in addition to device. Based on our prior discussion this is not a good idea since gdrcopy is the single use case right now. We prefer to keep the hmem copy API stable at this point and switch to gdrcopy at a higher level of the call stack, hence we are not changing the hmem copy functions.
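In other words the decision is made one level up, where struct ofi_mr and therefore hmem_data are in hand. A rough sketch of that dispatch (the wrapper name is hypothetical and the cuda_gdrcopy_to_dev/from_dev signatures are assumed):

/* Hypothetical higher-level dispatch: ofi_hmem_copy* only knows (iface,
 * device), so the gdrcopy branch lives in the mr-aware copy path instead.
 * buf is assumed to be on the host, since gdrcopy has no D2D copy. */
static void copy_mr_buf(struct ofi_mr *mr, void *mr_addr, void *buf,
			size_t len, int dir)
{
	if (mr->iface == FI_HMEM_CUDA &&
	    (mr->flags & OFI_HMEM_DATA_GDRCOPY_HANDLE)) {
		/* gdr handle assumed to be stored in hmem_data */
		if (dir == OFI_COPY_BUF_TO_IOV)
			cuda_gdrcopy_to_dev((uint64_t) (uintptr_t) mr->hmem_data,
					    mr_addr, buf, len);
		else
			cuda_gdrcopy_from_dev((uint64_t) (uintptr_t) mr->hmem_data,
					      buf, mr_addr, len);
	} else if (dir == OFI_COPY_BUF_TO_IOV) {
		ofi_copy_to_hmem(mr->iface, mr->device, mr_addr, buf, len);
	} else {
		ofi_copy_from_hmem(mr->iface, mr->device, buf, mr_addr, len);
	}
}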
Got it, makes sense. I see it is wrapped inside ofi_mr_iov_copy, which should be fine.
What is the error in Intel CI?
@shijin-aws Unrelated CI error. You can ignore.
From v1.19, the OFI_HMEM_DATA_GDRCOPY_HANDLE flag will signal the presence of a gdrcopy handle in ofi_mr.hmem_data. The SHM provider can take advantage of gdrcopy to achieve lower memcpy latencies from/to cuda devices.
This patch introduces the logic to select gdrcopy on hmem copy paths, with the exception of cuda IPC, which is not supported by gdrcopy.
For OMB single-node cuda->cuda latency we are seeing an improvement (via the EFA owner provider, which is responsible for gdr pinning).
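For completeness, the owner-provider side (EFA in our measurements) roughly does the gdr pinning like this; a sketch under assumptions only (ofi_hmem_dev_register as the pinning entry point, and the returned handle being what lands in ofi_mr.hmem_data), not the actual EFA code:

/* Hypothetical registration-side sketch: pin the cuda buffer with gdrcopy and
 * advertise the handle through the MR so peer providers (e.g. shm) can use it. */
static int register_gdrcopy_handle(struct ofi_mr *mr, const void *addr,
				   size_t len)
{
	uint64_t handle;
	int ret;

	if (mr->iface != FI_HMEM_CUDA)
		return 0;	/* nothing to advertise for non-cuda memory */

	ret = ofi_hmem_dev_register(FI_HMEM_CUDA, addr, len, &handle);
	if (ret)
		return 0;	/* no gdr handle; regular cudaMemcpy paths still work */

	mr->hmem_data = (void *) (uintptr_t) handle;	/* handle storage format assumed */
	mr->flags |= OFI_HMEM_DATA_GDRCOPY_HANDLE;
	return 0;
}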