ofi - add MCA parameters to not use FI_HMEM #11929
This commit adds two MCA parameters:
mtl_ofi_disable_hmem
btl_ofi_disable_hmem
They allow disabling the use of FI_HMEM when a provider advertises HMEM support that does not actually work, and the provider does not observe the OFI libfabric FI_HMEM_DISABLE_P2P environment variable.
As of the writing of this commit, this is the situation on certain systems, owing to limitations in kernel support for registration of accelerator memory. The OFI provider on such systems unfortunately still advertises support for FI_HMEM with ZE but fails when trying to register memory. These MCA parameters allow turning off use of FI_HMEM in such cases.
Related to ofiwg/libfabric#9315
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
@@ -59,6 +59,7 @@ typedef struct mca_mtl_ofi_module_t {
int enable_sep; /* MCA to enable/disable SEP feature */
int thread_grouping; /* MCA for thread grouping feature */
int num_ofi_contexts; /* MCA for number of contexts to use */
bool disable_hmem; /* MCA to enable/disable request for FI_HMEM support from provider */
Curious about a difference between mtl/ofi and btl/ofi.
For mtl/ofi I see the MCA parameters are registered under struct mca_mtl_ofi_module_t, but btl/ofi uses mca_btl_ofi_component_t. Is there a reason for this?
Not that I'm aware of. Good point, though. I'll double-check which of the struct types to use for these booleans.
Looking across different frameworks/components, it seems a bit arbitrary when fields are put in mca_XXX_component_t vs. mca_XXX_module_t (it seems the module string is sometimes left out). I'd prefer to keep to the existing pattern in each of the ofi components with this PR.
Thanks for checking. I don't have an opinion on that.
@edgargabriel could you review when you have a chance?
LGTM