From 9a39e432ac0b1ba346671ed9c45eca42ca194d1e Mon Sep 17 00:00:00 2001
From: Sean Hefty
Date: Wed, 15 Feb 2017 15:24:06 -0800
Subject: [PATCH 1/3] fabric/cq-eq: Have application provide error buffer

The current CQ/EQ man pages document that the err_data referenced by
error events points to a provider-owned data buffer. This results in
serialization issues for the application.

Allow the application to provide a data buffer for error data,
including the size of the input buffer. Add a domain attribute to
report the maximum size of error data that a provider may need.

Fixes #2720

Signed-off-by: Sean Hefty
---
 include/rdma/fabric.h |  1 +
 include/rdma/fi_eq.h  |  1 +
 man/fi_cq.3.md        | 18 +++++++++++++++---
 man/fi_domain.3.md    |  8 +++++++-
 man/fi_eq.3.md        | 15 +++++++++++----
 5 files changed, 35 insertions(+), 8 deletions(-)

diff --git a/include/rdma/fabric.h b/include/rdma/fabric.h
index 3aab9560a8c..b518e09e83b 100644
--- a/include/rdma/fabric.h
+++ b/include/rdma/fabric.h
@@ -344,6 +344,7 @@ struct fi_domain_attr {
 	uint64_t		mode;
 	uint8_t			*auth_key;
 	size_t			auth_keylen;
+	size_t			max_err_data;
 };

 struct fi_fabric_attr {
diff --git a/include/rdma/fi_eq.h b/include/rdma/fi_eq.h
index cbb2f6837c9..1cf7a6f4787 100644
--- a/include/rdma/fi_eq.h
+++ b/include/rdma/fi_eq.h
@@ -229,6 +229,7 @@ struct fi_cq_err_entry {
 	int			prov_errno;
 	/* err_data is available until the next time the CQ is read */
 	void			*err_data;
+	size_t			err_data_size;
 };

 enum fi_cq_wait_cond {
diff --git a/man/fi_cq.3.md b/man/fi_cq.3.md
index 0d605d5aefd..eea9283bc25 100644
--- a/man/fi_cq.3.md
+++ b/man/fi_cq.3.md
@@ -395,6 +395,7 @@ struct fi_cq_err_entry {
 	int      err;           /* positive error code */
 	int      prov_errno;    /* provider error code */
 	void    *err_data;      /* error data */
+	size_t   err_data_size; /* size of err_data */
 };
 ```

@@ -490,9 +491,20 @@ of these fields are the same for all CQ entry structure formats.
   associated with an error. The use of this field and its meaning is provider specific.
   It is intended to be used as a debugging aid. See fi_cq_strerror for
   additional details on converting this error data into
-  a human readable string. Providers are allowed to reuse a single internal
-  buffer to store additional error information. As a result, error data
-  is only guaranteed to be available until the next time the CQ is read.
+  a human readable string.
+
+*err_data_size*
+: On input, err_data_size indicates the size of the err_data buffer in bytes.
+  On output, err_data_size will be set to the number of bytes copied to the
+  err_data buffer. The err_data information is typically used with
+  fi_cq_strerror to provide details about the type of error that occurred.
+
+  For compatibility purposes, if err_data_size is 0 on input, or the fabric
+  was opened with release < 1.5, err_data will be set to a data buffer
+  owned by the provider. The contents of the buffer will remain valid until a
+  subsequent read call against the CQ. Applications must serialize access
+  to the CQ when processing errors to ensure that the buffer referenced by
+  err_data does not change.

 # COMPLETION FLAGS

diff --git a/man/fi_domain.3.md b/man/fi_domain.3.md
index 031526a3c30..1c8b820d969 100644
--- a/man/fi_domain.3.md
+++ b/man/fi_domain.3.md
@@ -132,6 +132,7 @@ struct fi_domain_attr {
 	uint64_t              mode;
 	uint8_t               *auth_key;
 	size_t                auth_keylen;
+	size_t                max_err_data;
 };
 ```

@@ -588,7 +589,6 @@ The operational mode bit related to using the domain.
   to only be used with endpoints, transmit contexts, and receive contexts that
   have the same set of capability flags.

-
 ## Default authorization key (auth_key)

 The default authorization key to associate with endpoint and memory
@@ -603,6 +603,12 @@ registrations created within the domain unless specified in the endpoint or
 memory registration attributes. This field is ignored unless the fabric is
 opened with API version 1.5 or greater.

+## Max Error Data Size (max_err_data)
+
+: The maximum amount of error data, in bytes, that may be returned as part of
+  a completion or event queue error. This value corresponds to the
+  err_data_size field in struct fi_cq_err_entry and struct fi_eq_err_entry.
+
 # RETURN VALUE

 Returns 0 on success. On error, a negative value corresponding to fabric
diff --git a/man/fi_eq.3.md b/man/fi_eq.3.md
index 02320b4dada..5acfb56eb6b 100644
--- a/man/fi_eq.3.md
+++ b/man/fi_eq.3.md
@@ -406,10 +406,17 @@ through the prov_errno and err_data fields. Users may call fi_eq_strerror to
 convert provider specific error information into a printable string for
 debugging purposes.

-If err_data_size is > 0, then the buffer referenced by err_data is directly
-user-accessible. The contents of the buffer will remain valid until a
-subsequent read call against the EQ. Applications which read the err_data
-buffer must ensure that they do not read past the end of the referenced buffer.
+On input, err_data_size indicates the size of the err_data buffer in bytes.
+On output, err_data_size will be set to the number of bytes copied to the
+err_data buffer. The err_data information is typically used with
+fi_eq_strerror to provide details about the type of error that occurred.
+
+For compatibility purposes, if err_data_size is 0 on input, or the fabric
+was opened with release < 1.5, err_data will be set to a data buffer
+owned by the provider. The contents of the buffer will remain valid until a
+subsequent read call against the EQ. Applications must serialize access
+to the EQ when processing errors to ensure that the buffer referenced by
+err_data does not change.

 # NOTES

From 4c04bb6573b10a5139c685e38e54e602acf9bdbd Mon Sep 17 00:00:00 2001
From: Sean Hefty
Date: Tue, 21 Feb 2017 15:47:30 -0800
Subject: [PATCH 2/3] fabric: Define app behavior when FI_MMU_NOTIFY is set

Add a new call, fi_mr_refresh, that must be invoked by the application
whenever there is a change to the pages backing a registered memory
region. This completes the definition and behavior of what it means
when the provider sets FI_MMU_NOTIFY.

Signed-off-by: Sean Hefty
---
 include/rdma/fabric.h    |  1 +
 include/rdma/fi_domain.h | 16 ++++++++++++++++
 man/fi_mr.3.md           | 29 ++++++++++++++++++++++++++++-
 3 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/include/rdma/fabric.h b/include/rdma/fabric.h
index b518e09e83b..979c16d8483 100644
--- a/include/rdma/fabric.h
+++ b/include/rdma/fabric.h
@@ -493,6 +493,7 @@ enum {
 	FI_QUEUE_WORK,		/* struct fi_deferred_work */
 	FI_CANCEL_WORK,		/* struct fi_deferred_work */
 	FI_FLUSH_WORK,		/* NULL */
+	FI_REFRESH,		/* mr: fi_mr_modify */
 };

 static inline int fi_control(struct fid *fid, int command, void *arg)
diff --git a/include/rdma/fi_domain.h b/include/rdma/fi_domain.h
index 89b058e3ad4..418263ce404 100644
--- a/include/rdma/fi_domain.h
+++ b/include/rdma/fi_domain.h
@@ -107,6 +107,11 @@ struct fi_mr_attr {
 	uint8_t			*auth_key;
 };

+struct fi_mr_modify {
+	uint64_t		flags;
+	struct fi_mr_attr	attr;
+};
+
 #ifdef FABRIC_DIRECT
 #include

@@ -332,6 +337,17 @@ static inline int fi_mr_bind(struct fid_mr *mr, struct fid *bfid, uint64_t flags
 	return mr->fid.ops->bind(&mr->fid, bfid, flags);
 }

+static inline int
+fi_mr_refresh(struct fid_mr *mr, const struct iovec *iov, size_t count,
+	      uint64_t flags)
+{
+	struct fi_mr_modify modify = {0};
+	modify.flags = flags;
+	modify.attr.mr_iov = iov;
+	modify.attr.iov_count = count;
+	return mr->fid.ops->control(&mr->fid, FI_REFRESH, &modify);
+}
+
 static inline int
 fi_av_open(struct fid_domain *domain, struct fi_av_attr *attr,
 	   struct fid_av **av, void *context)
diff --git a/man/fi_mr.3.md b/man/fi_mr.3.md
index 27d9c68ccf8..0eaf2d406cb 100644
--- a/man/fi_mr.3.md
+++ b/man/fi_mr.3.md
@@ -34,6 +34,9 @@ fi_mr_unmap_key
 fi_mr_bind
 : Associate a registered memory region with a completion counter.

+fi_mr_refresh
+: Updates the memory pages associated with a memory region.
+
 # SYNOPSIS

 ```c
@@ -65,6 +68,9 @@ int fi_mr_map_raw(struct fid_domain *domain, uint64_t base_addr,
 int fi_mr_unmap_key(struct fid_domain *domain, uint64_t key);

 int fi_mr_bind(struct fid_mr *mr, struct fid *bfid, uint64_t flags);
+
+int fi_mr_refresh(struct fid_mr *mr, const struct iovec *iov, size_t count,
+    uint64_t flags);
 ```

 # ARGUMENTS
@@ -209,7 +215,7 @@ The following apply to memory registration.
   informs the provider that all necessary physical pages now back the
   region. The notification is necessary for providers that cannot
   hook directly into the operating system page tables or memory management
-  unit. TODO: Define notification mechanism and data.
+  unit. See fi_mr_refresh() for notification details.

 *Basic Memory Registration*
 : Basic memory registration is indicated by the FI_MR_BASIC mr_mode bit
@@ -351,6 +357,27 @@ memory region is based on the bitwise OR of the following flags.
   through which the MR is accessed be created with the FI_RMA_EVENT
   capability.

+## fi_mr_refresh
+
+The use of this call is required to notify the provider of any change
+to the physical pages backing a registered memory region if the
+FI_MR_MMU_NOTIFY mode bit has been set. This call informs the provider
+that the page table entries associated with the region may have been
+modified, and the provider should verify and update the registered
+region accordingly. The iov parameter is optional and may be used
+to specify which portions of the registered region require updating.
+If provided, providers are only guaranteed to update the specified
+address ranges.
+
+The refresh operation has the effect of disabling and re-enabling
+access to the registered region. Any operations from peers that attempt
+to access the region will fail while the refresh is occurring.
+Additionally, attempts to access the region by the local process
+through libfabric APIs may result in a page fault or other fatal operation.
+
+The fi_mr_refresh call is only needed if the physical pages might have
+been updated after the memory region was created.
+
 # MEMORY REGION ATTRIBUTES

 Memory regions are created using the following attributes. The struct

From 0a78d20acbb25def091b3ea5bb4630ff5f211b60 Mon Sep 17 00:00:00 2001
From: Sean Hefty
Date: Tue, 21 Feb 2017 16:21:09 -0800
Subject: [PATCH 3/3] fabric: Introduce FI_MR_RMA_EVENT mode bit

MRs associated with RMA events may require that the app fully
configure the MR prior to its use. This handles the case where
the provider cannot dynamically assign a counter to a MR while it
is actively being used. Such MRs are considered disabled on creation.
Once configured they must explicitly be enabled (similar to enabling
an EP).

Signed-off-by: Sean Hefty
---
 include/rdma/fabric.h    |  1 +
 include/rdma/fi_domain.h |  7 +++++++
 man/fi_domain.3.md       |  4 ++++
 man/fi_mr.3.md           | 35 ++++++++++++++++++++++++++++++++++-
 src/fi_tostr.c           |  1 +
 5 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/include/rdma/fabric.h b/include/rdma/fabric.h
index 979c16d8483..257f85a543c 100644
--- a/include/rdma/fabric.h
+++ b/include/rdma/fabric.h
@@ -203,6 +203,7 @@ enum fi_mr_mode {
 #define FI_MR_ALLOCATED		(1 << 5)
 #define FI_MR_PROV_KEY		(1 << 6)
 #define FI_MR_MMU_NOTIFY	(1 << 7)
+#define FI_MR_RMA_EVENT		(1 << 8)

 enum fi_progress {
 	FI_PROGRESS_UNSPEC,
diff --git a/include/rdma/fi_domain.h b/include/rdma/fi_domain.h
index 418263ce404..2fe1129e407 100644
--- a/include/rdma/fi_domain.h
+++ b/include/rdma/fi_domain.h
@@ -192,6 +192,8 @@ struct fi_ops_domain {
 			struct fi_atomic_attr *attr, uint64_t flags);
 };

+/* Memory registration flags */
+/* #define FI_RMA_EVENT		(1ULL << 56) */

 struct fi_ops_mr {
 	size_t	size;
@@ -348,6 +350,11 @@ fi_mr_refresh(struct fid_mr *mr, const struct iovec *iov, size_t count,
 	return mr->fid.ops->control(&mr->fid, FI_REFRESH, &modify);
 }

+static inline int fi_mr_enable(struct fid_mr *mr)
+{
+	return mr->fid.ops->control(&mr->fid, FI_ENABLE, NULL);
+}
+
 static inline int
 fi_av_open(struct fid_domain *domain, struct fi_av_attr *attr,
 	   struct fid_av **av, void *context)
diff --git a/man/fi_domain.3.md b/man/fi_domain.3.md
index 1c8b820d969..cfcbb98ae7c 100644
--- a/man/fi_domain.3.md
+++ b/man/fi_domain.3.md
@@ -439,6 +439,10 @@ The following values may be specified.
   when the page tables referencing a registered memory region may have been
   updated.

+*FI_MR_RMA_EVENT*
+: Indicates that the memory regions associated with completion counters
+  must be explicitly enabled after being bound to any counter.
+
 *FI_MR_UNSPEC*
 : Defined for compatibility -- library versions 1.4 and earlier. Setting
   mr_mode to 0 indicates that FI_MR_BASIC or FI_MR_SCALABLE are requested
diff --git a/man/fi_mr.3.md b/man/fi_mr.3.md
index 0eaf2d406cb..64e40f327dc 100644
--- a/man/fi_mr.3.md
+++ b/man/fi_mr.3.md
@@ -37,6 +37,9 @@ fi_mr_bind
 fi_mr_refresh
 : Updates the memory pages associated with a memory region.

+fi_mr_enable
+: Enables a memory region for use.
+
 # SYNOPSIS

 ```c
@@ -71,6 +74,8 @@ int fi_mr_bind(struct fid_mr *mr, struct fid *bfid, uint64_t flags);

 int fi_mr_refresh(struct fid_mr *mr, const struct iovec *iov, size_t count,
     uint64_t flags);
+
+int fi_mr_enable(struct fid_mr *mr);
 ```

 # ARGUMENTS
@@ -217,6 +222,21 @@ The following apply to memory registration.
   hook directly into the operating system page tables or memory management
   unit. See fi_mr_refresh() for notification details.

+*FI_MR_RMA_EVENT*
+: This mode bit indicates that the provider must configure memory
+  regions that are associated with RMA events prior to their use. This
+  includes all memory regions that are associated with completion counters.
+  When set, applications must indicate if a memory region will be
+  associated with a completion counter as part of the region's creation.
+  This is done by passing in the FI_RMA_EVENT flag to the memory
+  registration call.
+
+  Such memory regions will be created in a disabled state and must be
+  associated with all completion counters prior to being enabled. To
+  enable a memory region, the application must call fi_mr_enable().
+  After calling fi_mr_enable(), no further resource bindings may be
+  made to the memory region.
+
 *Basic Memory Registration*
 : Basic memory registration is indicated by the FI_MR_BASIC mr_mode bit
   in library versions 1.4 and earlier. Basic registration is equivalent
@@ -378,6 +398,14 @@ through libfabric APIs may result in a page fault or other fatal operation.
 The fi_mr_refresh call is only needed if the physical pages might have
 been updated after the memory region was created.

+## fi_mr_enable
+
+The enable call is used with memory registration associated with the
+FI_MR_RMA_EVENT mode bit. Memory regions created in the disabled state
+must be explicitly enabled after being fully configured by the
+application. Any resource bindings to the MR must be done prior
+to enabling the MR.
+
 # MEMORY REGION ATTRIBUTES

 Memory regions are created using the following attributes. The struct
@@ -495,7 +523,12 @@ desirable for highly scalable apps.

 # FLAGS

-Flags are reserved for future use and must be 0.
+The following flag may be specified to any memory registration call.
+
+*FI_RMA_EVENT*
+: This flag indicates that the specified memory region will be
+  associated with a completion counter used to count RMA operations
+  that access the MR.

 # RETURN VALUES

diff --git a/src/fi_tostr.c b/src/fi_tostr.c
index 1f61aeafff0..f6e1d02b2da 100644
--- a/src/fi_tostr.c
+++ b/src/fi_tostr.c
@@ -425,6 +425,7 @@ static void fi_tostr_mr_mode(char *buf, int mr_mode)
 	IFFLAGSTR(mr_mode, FI_MR_ALLOCATED);
 	IFFLAGSTR(mr_mode, FI_MR_PROV_KEY);
 	IFFLAGSTR(mr_mode, FI_MR_MMU_NOTIFY);
+	IFFLAGSTR(mr_mode, FI_MR_RMA_EVENT);

 	fi_remove_comma(buf);
 }
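
Note on patch 1/3: the in/out behavior of err_data_size described in the fi_cq.3.md and fi_eq.3.md hunks can be modeled without libfabric, which may help when reviewing the wording. The sketch below is only an illustration under stated assumptions. `struct err_entry_sketch` and `sketch_fill_err_data()` are hypothetical stand-ins, not libfabric definitions, and reporting a length on the compatibility path is an assumption that the man page text does not spell out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the two error-data fields of struct fi_cq_err_entry. */
struct err_entry_sketch {
	void   *err_data;       /* in: application buffer; out: error data */
	size_t  err_data_size;  /* in: buffer capacity; out: bytes returned */
};

/*
 * Hypothetical provider-side helper (not a libfabric call) that models
 * how a read of the error queue would fill in the entry.
 */
static void sketch_fill_err_data(struct err_entry_sketch *entry,
				 const void *prov_data, size_t prov_len)
{
	static char prov_buf[64];	/* provider-owned compatibility buffer */
	size_t n;

	if (entry->err_data_size == 0) {
		/* Compatibility path: hand back a provider-owned buffer that
		 * is only valid until the next read. Reporting its length in
		 * err_data_size is an assumption of this sketch. */
		n = prov_len < sizeof(prov_buf) ? prov_len : sizeof(prov_buf);
		memcpy(prov_buf, prov_data, n);
		entry->err_data = prov_buf;
		entry->err_data_size = n;
		return;
	}

	/* New path: copy into the caller's buffer, truncating to its size. */
	assert(entry->err_data != NULL);
	n = prov_len < entry->err_data_size ? prov_len : entry->err_data_size;
	memcpy(entry->err_data, prov_data, n);
	entry->err_data_size = n;
}
```

In a real application the buffer would presumably be sized from the new max_err_data domain attribute, and err_data and err_data_size would be set before each error read, since the size field is overwritten on output.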