Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core/hmem: introduce hmem_data field to memory registration #8787

Merged
merged 1 commit into from
Apr 18, 2023

Conversation

wenduwan
Copy link
Contributor

@wenduwan wenduwan commented Apr 11, 2023

Introduce a wildcard field hmem_data to provide an extension point for HMEM interfaces that require additional information to device id.

The immediate use case is to allow different providers to communicate the presence of a GDR handle. This is achieved with a companion flag, i.e. FI_HMEM_DATA_GDR_HANDLE.

Furthermore, this change makes it possible to offload similar operations to the user application, e.g. GDR pinning.

@wenduwan wenduwan requested review from shefty and a team April 11, 2023 19:49
@@ -171,6 +171,7 @@ typedef struct fid *fid_t;
#define FI_XPU_TRIGGER (1ULL << 44)
#define FI_HMEM_HOST_ALLOC (1ULL << 45)
#define FI_HMEM_DEVICE_ONLY (1ULL << 46)
#define FI_HMEM_DATA_GDR (1ULL << 47)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though this value is equal to FI_HMEM, it shouldn't be a problem. They are not expected to be used in the same way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably makes sense to just model it as an alias for FI_HMEM in that case

Suggested change
#define FI_HMEM_DATA_GDR (1ULL << 47)
#define FI_HMEM_DATA_GDR FI_HMEM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh? IMO it is confusing since it is incidental that these 2 flags are equal.

Or maybe I should simply pick a different value to avoid this confusion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having the same value and being an alias are not the same. Flags for mr_attr should go into fi_domain.h. The FI_HMEM_HOST_ALLOC and FI_HMEM_DEVICE_ONLY flags should have gone there (and probably have been given different values). See comment in fi_domain.h on memory registration flags. I would not choose a value that conflicts with FI_HMEM. The possible need to use that flag is too high.

In general, flags for functions typically start at bit 59 and go down from there, skipping any conflicts (mostly with caps). So, I'd make this bit 59, since that's still available.

@wzamazon
Copy link
Contributor

I think FI_HMEM_DATA_GDR is not a good name. GDR is a much bigger concept than gdrcopy.

I think the flag should be FI_HMEM_CUDA_GDRCOPY_HANDLE

@wenduwan
Copy link
Contributor Author

I think FI_HMEM_DATA_GDR is not a good name. GDR is a much bigger concept than gdrcopy.

I think the flag should be FI_HMEM_CUDA_GDRCOPY_HANDLE

I agree. Renamed the flag to FI_HMEM_DATA_GDR_HANDLE which is shorter.

@@ -171,6 +171,7 @@ typedef struct fid *fid_t;
#define FI_XPU_TRIGGER (1ULL << 44)
#define FI_HMEM_HOST_ALLOC (1ULL << 45)
#define FI_HMEM_DEVICE_ONLY (1ULL << 46)
#define FI_HMEM_DATA_GDR (1ULL << 47)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having the same value and being an alias are not the same. Flags for mr_attr should go into fi_domain.h. The FI_HMEM_HOST_ALLOC and FI_HMEM_DEVICE_ONLY flags should have gone there (and probably have been given different values). See comment in fi_domain.h on memory registration flags. I would not choose a value that conflicts with FI_HMEM. The possible need to use that flag is too high.

In general, flags for functions typically start at bit 59 and go down from there, skipping any conflicts (mostly with caps). So, I'd make this bit 59, since that's still available.

man/fi_mr.3.md Outdated
@@ -736,6 +743,10 @@ The follow flag may be specified to any memory registration call.
API AllocHost call (as opposed to using malloc-like host allocation,
unified/shared memory allocation, or AllocDevice).

*FI_HMEM_DATA_GDR_HANDLE*
: This flag indicates that hmem_data points to a GDR handle. The owner of the
handle is responsible for unsetting this flag if the handle is invalidated.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove comment about 'owner of the handle'. This man page is user-facing, and that comment doesn't make sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove this for now. My intention is to define the expectation if hmem_data is passed from the application, such that libfabric does not own the handle.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having the same value and being an alias are not the same. Flags for mr_attr should go into fi_domain.h. The FI_HMEM_HOST_ALLOC and FI_HMEM_DEVICE_ONLY flags should have gone there (and probably have been given different values). See comment in fi_domain.h on memory registration flags. I would not choose a value that conflicts with FI_HMEM. The possible need to use that flag is too high.
In general, flags for functions typically start at bit 59 and go down from there, skipping any conflicts (mostly with caps). So, I'd make this bit 59, since that's still available.

I see. Thanks for the tip. I will include a commit to move the flags.

@wzamazon
Copy link
Contributor

FI_HMEM_DATA_GDR_HANDLE which is shorter.

should be GDRCOPY_HANDLE. GDR is different from GDRcopy

@wenduwan
Copy link
Contributor Author

FI_HMEM_DATA_GDR_HANDLE which is shorter.

should be GDRCOPY_HANDLE. GDR is different from GDRcopy

I see. Updated the name.

@wenduwan wenduwan force-pushed the extend_mr branch 3 times, most recently from b2ba0f7 to c27e273 Compare April 11, 2023 20:38
@@ -115,6 +115,11 @@ struct fid_av {
* Tracks registered memory regions, primarily for remote access,
* but also for local access until we can remove that need.
*/

#define FI_HMEM_HOST_ALLOC (1ULL << 60)
#define FI_HMEM_DEVICE_ONLY (1ULL << 61)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't change these flags. This breaks the API. I was only saying that they should have been given a different value, not that can be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

phew.. near miss. I will revert this part. Thanks!

But I doubt if this will actually break anything at this point. Did a quick code search, and the usage is strictly local, i.e. no exchange between endpoints. So I guess their values can be changed.... later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These have been exposed to existing applications, which could be using them as input into a memory registration call. E.g. fabtests uses them. So, they can't change even later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😬


#define FI_HMEM_HOST_ALLOC (1ULL << 60)
#define FI_HMEM_DEVICE_ONLY (1ULL << 61)
#define FI_HMEM_DATA_GDRCOPY_HANDLE (1ULL << 62)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The values need to start at 59 and go down (i.e. 58, 57, etc.). The bit values 60-63 are reserved for provider specific use (or by libfabric for internal use).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I was using a different endianess.

man/fi_mr.3.md Outdated
An optional pointer to data defined for the HMEM interface. This field is
used alongside flags to provide additional data required for the HMEM ops
if device identifier alone is not sufficient. This field is ignored unless
the iface field is valid.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add: is valid and an appropriate call flag is set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Reworded this section.

@wenduwan
Copy link
Contributor Author

@shefty Could you confirm if continuous-integration/jenkins/pr-head failure is caused by this change?

@wenduwan wenduwan requested a review from a team April 12, 2023 17:24
@aingerson
Copy link
Contributor

@wenduwan It doesn't look related. We had our cluster go down last week and this looks like lingering issues from that (lack of resources causing timeouts). The builds passed and a few tests passed and the others just couldn't get nodes so I think it's fine.

@wenduwan wenduwan requested a review from shefty April 12, 2023 23:39

#define FI_HMEM_HOST_ALLOC (1ULL << 45)
#define FI_HMEM_DEVICE_ONLY (1ULL << 46)
#define FI_HMEM_DATA_GDRCOPY_HANDLE (1ULL << 59)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move these to line 298 (where there's a comment regarding mr flags)

GDRCOPY_HANDLE should probably be something like GDR_MAP_HANDLE based on the comments in gdrapi.h:

https://github.com/NVIDIA/gdrcopy/blob/master/include/gdrapi.h

man/fi_mr.3.md Outdated
@@ -736,6 +743,9 @@ The follow flag may be specified to any memory registration call.
API AllocHost call (as opposed to using malloc-like host allocation,
unified/shared memory allocation, or AllocDevice).

*FI_HMEM_DATA_GDRCOPY_HANDLE*
: This flag indicates that hmem_data points to a gdrcopy_handle struct.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

struct gdrcopy_handle is something internal to libfabric. A user of the API won't have this. For the external API, this should be *gdr_mh_t. If additional information beyond gdr_mh_t is needed, then we will need to define a structure in the exported API files for callers to use. Such a structure cannot create a dependency between the libfabric build and cuda (or any other GPU library).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could actually expose gdrcopy_handle for the application to do the memory pinning etc. and pass it to libfabric.

// src/hmem_cuda_gdrcopy.c
struct gdrcopy_handle {
	gdr_mh_t mh; /* memory handler */
	void *cuda_ptr; /* page aligned gpu pointer */
	void *user_ptr; /* user space ptr mapped to GPU memory */
	size_t length; /* page aligned length */
};

Because of gdrcopy's page alignment requirements, gdr_mh_t alone is not sufficient for address translation, so we do need to define a different struct for callers to use.

libfabric build should not have a hard dependency on gdrcopy, and we should skip gdrcopy if it is not available.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot have gdr_mh_t in any of the libfabric external header files (include/rdma/*.h) without forcing a build dependency on the gdrcopy library, which is a no-go. The register mr APIs already have parameters to pass in an array of addresses and lengths. To register this memory for use with gdr copy, which of the ptrs do we actually need?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To register this memory for use with gdr copy, which of the ptrs do we actually need?

All of these fields are actually required to use gdrcopy. cuda_ptr and user_ptr are both needed to calculate offset induced by page alignment.

We cannot have gdr_mh_t in any of the libfabric external header files (include/rdma/*.h) without forcing a build dependency on the gdrcopy library, which is a no-go.

This makes sense. Practically gdr_mh_t is simply

typedef struct gdr_mh_s {
  unsigned long h;
} gdr_mh_t;

But for libfabric it is an opaque value so probably not a good idea to replace it with unsigned long.

In this case, we probably shouldn't advertise struct gdrcopy_handle and keep it for internal use only as you suggested.

I can remove the mentioning of struct gdrcopy_handle from man page.

include/ofi_mr.h Show resolved Hide resolved
man/fi_mr.3.md Outdated
@@ -736,6 +743,10 @@ The follow flag may be specified to any memory registration call.
API AllocHost call (as opposed to using malloc-like host allocation,
unified/shared memory allocation, or AllocDevice).

*FI_HMEM_DATA_GDRCOPY_HANDLE*
: This flag indicates that hmem_data points to a gdrcopy handle. This flag is
ignored unless iface is set to FI_HMEM_CUDA.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is defined and exposed to users, it needs to be well defined, so that it's usable. If we're only going to use this for some internal calls, then we can use one of the reserved internal flags, with some internal structure.

These are the relevant gdrcopy calls:

gdr_pin_buffer(gdr_t g, unsigned long addr, size_t size, uint64_t p2p_token, uint32_t va_space, gdr_mh_t *handle)
gdr_map(gdr_t g, gdr_mh_t handle, void **va, size_t size);
gdr_copy_to_mapping(gdr_mh_t handle, void *map_d_ptr, const void *h_ptr, size_t size)

This is the flow I anticipated:

gdr_mh_t gdr_mem_handle;
void *gdr_mapped_addr;
struct fi_mr_attr mr_attr;
struct iovec iov;

gdr_pin_buffer(g, addr, len, ?, ?, &gdr_mem_handle);
gdr_map(g, gdr_mem_handle, &gdr_mapped_addr, len);

/* Register the gdr mapped buffer */
iov.iov_base = gdr_mapped_addr;
iov.iov_len = len;
mr_attr.iov = &iov;
mr_attr. ...
mr_attr.hmem_data = gdr_mem_handle;
fi_mr_regattr(..., &mr_attr, FI_HMEM_GDR_MAPPED);

The grpcopy routine needs the gdr_mem_handle and gdr_mapped_addr.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense if gdr pinning and address translation are delegated to the application, e.g. application ensures page alignment, and always passes page-aligned addresses to libfabric.

It's different in the current gdrcopy implementation. libfabric encapsulates this process from the application, and caches the alignment information in struct gdrcopy_handle, used for address translation during gdr_copy_from/to_mapping.

If we're only going to use this for some internal calls, then we can use one of the reserved internal flags, with some internal structure.

This is a good suggestion. In this case an internal flag & struct should be clear and sufficient. I will propose a change accordingly.

#define FI_HMEM_DEVICE_ONLY (1ULL << 46)

/* libfabric internal flags starting from 60 to 63 */
#define FI_HMEM_DATA_GDRCOPY_HANDLE (1ULL << 60)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @shefty I set this flag to 60 based on your prior comment of internal flags. Please let me know if you prefer a different value/name.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are still changing the public header files. These changes need to be kept internal

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack. Addressed in updated.

include/ofi_mr.h Outdated
@@ -264,13 +264,17 @@ int ofi_mr_map_verify(struct ofi_mr_map *map, uintptr_t *io_addr,
* registration.
*/

/* libfabric internal flags starting from 60 to 63 */
#define FI_HMEM_DATA_GDRCOPY_HANDLE (1ULL << 60)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To my understanding ofi_mr.h is more suited for this declaration than ofi_hmem.h, since this flag should be used next to struct ofi_mr. Please let me know otherwise.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rename with OFI prefix to indicate that this is an internal flag.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack. Will address.

Before that I would like to clarify the change to struct fi_mr_attr. Please see the other response.

@wenduwan
Copy link
Contributor Author

I don't imagine the CI failure is related. Please let me know otherwise @shefty / @aingerson

Copy link
Member

@shefty shefty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

forgot to submit review

include/ofi_mr.h Outdated
@@ -264,13 +264,17 @@ int ofi_mr_map_verify(struct ofi_mr_map *map, uintptr_t *io_addr,
* registration.
*/

/* libfabric internal flags starting from 60 to 63 */
#define FI_HMEM_DATA_GDRCOPY_HANDLE (1ULL << 60)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rename with OFI prefix to indicate that this is an internal flag.

man/fi_mr.3.md Outdated
: For libfabric internal use only. This flag indicates that hmem_data points to
a gdrcopy_handle data structure. This flag is ignored unless iface is set to
FI_HMEM_CUDA.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All updates to this man page should be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Addressed comments. Stripped away unrelated changes as well since we are not exposing the new OFI_HMEM_* flag. This makes the change minimal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CI appears to be down recently. Assuming the failure is unrelated.

include/rdma/fi_domain.h Show resolved Hide resolved
Introduce a wildcard field hmem_data to provide an extension point for
HMEM interfaces that require additional information to device id.

The immediate use case is to allow different providers to communicate
the presence of a GDR handle. This is achieved with a companion flag,
i.e. OFI_HMEM_DATA_GDRCOPY_HANDLE.

Furthermore, this change makes it possible to offload similar operations
to the user application, e.g. GDR pinning.

Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
Copy link
Member

@shefty shefty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes work, and if not, we should be safe to change the implementation later. Thanks!

@shefty
Copy link
Member

shefty commented Apr 18, 2023

Recent appveyor changes seem to be widening race windows, resulting in more frequent failures. Ignore for now, since the CI only tests the socket provider. I plan on changing that in the near future.

@wenduwan
Copy link
Contributor Author

@wzamazon Do you have more comments on this PR?

@wzamazon
Copy link
Contributor

"continuous-integration/appveyor/pr" failure is not related.

@wzamazon wzamazon merged commit 49d1a76 into ofiwg:main Apr 18, 2023
@wenduwan wenduwan deleted the extend_mr branch April 18, 2023 23:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants