-
Notifications
You must be signed in to change notification settings - Fork 371
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
core/hmem: introduce hmem_data field to memory registration #8787
Conversation
include/rdma/fabric.h
Outdated
@@ -171,6 +171,7 @@ typedef struct fid *fid_t; | |||
#define FI_XPU_TRIGGER (1ULL << 44) | |||
#define FI_HMEM_HOST_ALLOC (1ULL << 45) | |||
#define FI_HMEM_DEVICE_ONLY (1ULL << 46) | |||
#define FI_HMEM_DATA_GDR (1ULL << 47) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though this value is equal to FI_HMEM, it shouldn't be a problem. They are not expected to be used in the same way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably makes sense to just model it as an alias for FI_HMEM
in that case
#define FI_HMEM_DATA_GDR (1ULL << 47) | |
#define FI_HMEM_DATA_GDR FI_HMEM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh? IMO it is confusing since it is incidental that these 2 flags are equal.
Or maybe I should simply pick a different value to avoid this confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having the same value and being an alias are not the same. Flags for mr_attr should go into fi_domain.h. The FI_HMEM_HOST_ALLOC and FI_HMEM_DEVICE_ONLY flags should have gone there (and probably have been given different values). See comment in fi_domain.h on memory registration flags. I would not choose a value that conflicts with FI_HMEM. The possible need to use that flag is too high.
In general, flags for functions typically start at bit 59 and go down from there, skipping any conflicts (mostly with caps). So, I'd make this bit 59, since that's still available.
I think I think the flag should be |
I agree. Renamed the flag to |
include/rdma/fabric.h
Outdated
@@ -171,6 +171,7 @@ typedef struct fid *fid_t; | |||
#define FI_XPU_TRIGGER (1ULL << 44) | |||
#define FI_HMEM_HOST_ALLOC (1ULL << 45) | |||
#define FI_HMEM_DEVICE_ONLY (1ULL << 46) | |||
#define FI_HMEM_DATA_GDR (1ULL << 47) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having the same value and being an alias are not the same. Flags for mr_attr should go into fi_domain.h. The FI_HMEM_HOST_ALLOC and FI_HMEM_DEVICE_ONLY flags should have gone there (and probably have been given different values). See comment in fi_domain.h on memory registration flags. I would not choose a value that conflicts with FI_HMEM. The possible need to use that flag is too high.
In general, flags for functions typically start at bit 59 and go down from there, skipping any conflicts (mostly with caps). So, I'd make this bit 59, since that's still available.
man/fi_mr.3.md
Outdated
@@ -736,6 +743,10 @@ The follow flag may be specified to any memory registration call. | |||
API AllocHost call (as opposed to using malloc-like host allocation, | |||
unified/shared memory allocation, or AllocDevice). | |||
|
|||
*FI_HMEM_DATA_GDR_HANDLE* | |||
: This flag indicates that hmem_data points to a GDR handle. The owner of the | |||
handle is responsible for unsetting this flag if the handle is invalidated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove comment about 'owner of the handle'. This man page is user-facing, and that comment doesn't make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will remove this for now. My intention is to define the expectation if hmem_data is passed from the application, such that libfabric does not own the handle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having the same value and being an alias are not the same. Flags for mr_attr should go into fi_domain.h. The FI_HMEM_HOST_ALLOC and FI_HMEM_DEVICE_ONLY flags should have gone there (and probably have been given different values). See comment in fi_domain.h on memory registration flags. I would not choose a value that conflicts with FI_HMEM. The possible need to use that flag is too high.
In general, flags for functions typically start at bit 59 and go down from there, skipping any conflicts (mostly with caps). So, I'd make this bit 59, since that's still available.
I see. Thanks for the tip. I will include a commit to move the flags.
should be |
I see. Updated the name. |
b2ba0f7
to
c27e273
Compare
include/rdma/fi_domain.h
Outdated
@@ -115,6 +115,11 @@ struct fid_av { | |||
* Tracks registered memory regions, primarily for remote access, | |||
* but also for local access until we can remove that need. | |||
*/ | |||
|
|||
#define FI_HMEM_HOST_ALLOC (1ULL << 60) | |||
#define FI_HMEM_DEVICE_ONLY (1ULL << 61) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't change these flags. This breaks the API. I was only saying that they should have been given a different value, not that can be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
phew.. near miss. I will revert this part. Thanks!
But I doubt if this will actually break anything at this point. Did a quick code search, and the usage is strictly local, i.e. no exchange between endpoints. So I guess their values can be changed.... later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These have been exposed to existing applications, which could be using them as input into a memory registration call. E.g. fabtests uses them. So, they can't change even later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😬
include/rdma/fi_domain.h
Outdated
|
||
#define FI_HMEM_HOST_ALLOC (1ULL << 60) | ||
#define FI_HMEM_DEVICE_ONLY (1ULL << 61) | ||
#define FI_HMEM_DATA_GDRCOPY_HANDLE (1ULL << 62) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The values need to start at 59 and go down (i.e. 58, 57, etc.). The bit values 60-63 are reserved for provider specific use (or by libfabric for internal use).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. I was using a different endianess.
man/fi_mr.3.md
Outdated
An optional pointer to data defined for the HMEM interface. This field is | ||
used alongside flags to provide additional data required for the HMEM ops | ||
if device identifier alone is not sufficient. This field is ignored unless | ||
the iface field is valid. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add: is valid and an appropriate call flag is set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Reworded this section.
@shefty Could you confirm if continuous-integration/jenkins/pr-head failure is caused by this change? |
@wenduwan It doesn't look related. We had our cluster go down last week and this looks like lingering issues from that (lack of resources causing timeouts). The builds passed and a few tests passed and the others just couldn't get nodes so I think it's fine. |
include/rdma/fi_domain.h
Outdated
|
||
#define FI_HMEM_HOST_ALLOC (1ULL << 45) | ||
#define FI_HMEM_DEVICE_ONLY (1ULL << 46) | ||
#define FI_HMEM_DATA_GDRCOPY_HANDLE (1ULL << 59) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move these to line 298 (where there's a comment regarding mr flags)
GDRCOPY_HANDLE should probably be something like GDR_MAP_HANDLE based on the comments in gdrapi.h:
https://github.com/NVIDIA/gdrcopy/blob/master/include/gdrapi.h
man/fi_mr.3.md
Outdated
@@ -736,6 +743,9 @@ The follow flag may be specified to any memory registration call. | |||
API AllocHost call (as opposed to using malloc-like host allocation, | |||
unified/shared memory allocation, or AllocDevice). | |||
|
|||
*FI_HMEM_DATA_GDRCOPY_HANDLE* | |||
: This flag indicates that hmem_data points to a gdrcopy_handle struct. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
struct gdrcopy_handle is something internal to libfabric. A user of the API won't have this. For the external API, this should be *gdr_mh_t. If additional information beyond gdr_mh_t is needed, then we will need to define a structure in the exported API files for callers to use. Such a structure cannot create a dependency between the libfabric build and cuda (or any other GPU library).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could actually expose gdrcopy_handle for the application to do the memory pinning etc. and pass it to libfabric.
// src/hmem_cuda_gdrcopy.c
struct gdrcopy_handle {
gdr_mh_t mh; /* memory handler */
void *cuda_ptr; /* page aligned gpu pointer */
void *user_ptr; /* user space ptr mapped to GPU memory */
size_t length; /* page aligned length */
};
Because of gdrcopy's page alignment requirements, gdr_mh_t
alone is not sufficient for address translation, so we do need to define a different struct for callers to use.
libfabric build should not have a hard dependency on gdrcopy, and we should skip gdrcopy if it is not available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot have gdr_mh_t in any of the libfabric external header files (include/rdma/*.h) without forcing a build dependency on the gdrcopy library, which is a no-go. The register mr APIs already have parameters to pass in an array of addresses and lengths. To register this memory for use with gdr copy, which of the ptrs do we actually need?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To register this memory for use with gdr copy, which of the ptrs do we actually need?
All of these fields are actually required to use gdrcopy. cuda_ptr and user_ptr are both needed to calculate offset induced by page alignment.
We cannot have gdr_mh_t in any of the libfabric external header files (include/rdma/*.h) without forcing a build dependency on the gdrcopy library, which is a no-go.
This makes sense. Practically gdr_mh_t is simply
typedef struct gdr_mh_s {
unsigned long h;
} gdr_mh_t;
But for libfabric it is an opaque value so probably not a good idea to replace it with unsigned long
.
In this case, we probably shouldn't advertise struct gdrcopy_handle
and keep it for internal use only as you suggested.
I can remove the mentioning of struct gdrcopy_handle
from man page.
man/fi_mr.3.md
Outdated
@@ -736,6 +743,10 @@ The follow flag may be specified to any memory registration call. | |||
API AllocHost call (as opposed to using malloc-like host allocation, | |||
unified/shared memory allocation, or AllocDevice). | |||
|
|||
*FI_HMEM_DATA_GDRCOPY_HANDLE* | |||
: This flag indicates that hmem_data points to a gdrcopy handle. This flag is | |||
ignored unless iface is set to FI_HMEM_CUDA. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is defined and exposed to users, it needs to be well defined, so that it's usable. If we're only going to use this for some internal calls, then we can use one of the reserved internal flags, with some internal structure.
These are the relevant gdrcopy calls:
gdr_pin_buffer(gdr_t g, unsigned long addr, size_t size, uint64_t p2p_token, uint32_t va_space, gdr_mh_t *handle)
gdr_map(gdr_t g, gdr_mh_t handle, void **va, size_t size);
gdr_copy_to_mapping(gdr_mh_t handle, void *map_d_ptr, const void *h_ptr, size_t size)
This is the flow I anticipated:
gdr_mh_t gdr_mem_handle;
void *gdr_mapped_addr;
struct fi_mr_attr mr_attr;
struct iovec iov;
gdr_pin_buffer(g, addr, len, ?, ?, &gdr_mem_handle);
gdr_map(g, gdr_mem_handle, &gdr_mapped_addr, len);
/* Register the gdr mapped buffer */
iov.iov_base = gdr_mapped_addr;
iov.iov_len = len;
mr_attr.iov = &iov;
mr_attr. ...
mr_attr.hmem_data = gdr_mem_handle;
fi_mr_regattr(..., &mr_attr, FI_HMEM_GDR_MAPPED);
The grpcopy routine needs the gdr_mem_handle and gdr_mapped_addr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense if gdr pinning and address translation are delegated to the application, e.g. application ensures page alignment, and always passes page-aligned addresses to libfabric.
It's different in the current gdrcopy implementation. libfabric encapsulates this process from the application, and caches the alignment information in struct gdrcopy_handle
, used for address translation during gdr_copy_from/to_mapping
.
If we're only going to use this for some internal calls, then we can use one of the reserved internal flags, with some internal structure.
This is a good suggestion. In this case an internal flag & struct should be clear and sufficient. I will propose a change accordingly.
include/rdma/fi_domain.h
Outdated
#define FI_HMEM_DEVICE_ONLY (1ULL << 46) | ||
|
||
/* libfabric internal flags starting from 60 to 63 */ | ||
#define FI_HMEM_DATA_GDRCOPY_HANDLE (1ULL << 60) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @shefty I set this flag to 60 based on your prior comment of internal flags. Please let me know if you prefer a different value/name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are still changing the public header files. These changes need to be kept internal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack. Addressed in updated.
include/ofi_mr.h
Outdated
@@ -264,13 +264,17 @@ int ofi_mr_map_verify(struct ofi_mr_map *map, uintptr_t *io_addr, | |||
* registration. | |||
*/ | |||
|
|||
/* libfabric internal flags starting from 60 to 63 */ | |||
#define FI_HMEM_DATA_GDRCOPY_HANDLE (1ULL << 60) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To my understanding ofi_mr.h
is more suited for this declaration than ofi_hmem.h
, since this flag should be used next to struct ofi_mr
. Please let me know otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rename with OFI prefix to indicate that this is an internal flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack. Will address.
Before that I would like to clarify the change to struct fi_mr_attr
. Please see the other response.
I don't imagine the CI failure is related. Please let me know otherwise @shefty / @aingerson |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
forgot to submit review
include/ofi_mr.h
Outdated
@@ -264,13 +264,17 @@ int ofi_mr_map_verify(struct ofi_mr_map *map, uintptr_t *io_addr, | |||
* registration. | |||
*/ | |||
|
|||
/* libfabric internal flags starting from 60 to 63 */ | |||
#define FI_HMEM_DATA_GDRCOPY_HANDLE (1ULL << 60) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rename with OFI prefix to indicate that this is an internal flag.
man/fi_mr.3.md
Outdated
: For libfabric internal use only. This flag indicates that hmem_data points to | ||
a gdrcopy_handle data structure. This flag is ignored unless iface is set to | ||
FI_HMEM_CUDA. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All updates to this man page should be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Addressed comments. Stripped away unrelated changes as well since we are not exposing the new OFI_HMEM_*
flag. This makes the change minimal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CI appears to be down recently. Assuming the failure is unrelated.
Introduce a wildcard field hmem_data to provide an extension point for HMEM interfaces that require additional information to device id. The immediate use case is to allow different providers to communicate the presence of a GDR handle. This is achieved with a companion flag, i.e. OFI_HMEM_DATA_GDRCOPY_HANDLE. Furthermore, this change makes it possible to offload similar operations to the user application, e.g. GDR pinning. Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes work, and if not, we should be safe to change the implementation later. Thanks!
Recent appveyor changes seem to be widening race windows, resulting in more frequent failures. Ignore for now, since the CI only tests the socket provider. I plan on changing that in the near future. |
@wzamazon Do you have more comments on this PR? |
"continuous-integration/appveyor/pr" failure is not related. |
Introduce a wildcard field hmem_data to provide an extension point for HMEM interfaces that require additional information to device id.
The immediate use case is to allow different providers to communicate the presence of a GDR handle. This is achieved with a companion flag, i.e. FI_HMEM_DATA_GDR_HANDLE.
Furthermore, this change makes it possible to offload similar operations to the user application, e.g. GDR pinning.