hppritcha
Member

This commit adds two MCA parameters:

mtl_ofi_disable_hmem
btl_ofi_disable_hmem

to allow disabling the use of FI_HMEM when a provider advertises HMEM support that it cannot actually deliver, and does not honor the OFI libfabric FI_HMEM_DISABLE_P2P environment variable.

As of this commit, this is the situation on certain systems, owing to limitations in kernel support for registering accelerator memory. The OFI provider on such systems unfortunately still advertises support for FI_HMEM with ZE but fails when trying to register memory. These MCA parameters allow turning off the use of FI_HMEM in such cases.
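As a rough illustration of the intent, the sketch below shows how a boolean such as `disable_hmem` could gate whether an HMEM capability bit is requested from the provider. It is not the code from this PR: the `sketch_*` names and the `SKETCH_CAP_*` bit values are hypothetical stand-ins, and the real capability flags come from the libfabric headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for provider capability bits. The real flags
 * (for example FI_HMEM) are defined by libfabric and are not reproduced here. */
#define SKETCH_CAP_TAGGED (1ULL << 0)
#define SKETCH_CAP_HMEM   (1ULL << 1)

/* Hypothetical holder for the MCA-controlled flag. */
typedef struct sketch_ofi_module_t {
    bool disable_hmem; /* true: do not request HMEM support from the provider */
} sketch_ofi_module_t;

/* Return the capability bits to request: the base set, plus the HMEM bit
 * unless the user has disabled it. */
static uint64_t sketch_requested_caps(const sketch_ofi_module_t *module,
                                      uint64_t base_caps)
{
    uint64_t caps = base_caps;

    if (!module->disable_hmem) {
        caps |= SKETCH_CAP_HMEM;
    }
    return caps;
}
```

With the flag left at false the HMEM bit is added to the request; with it set to true the request falls back to the base capabilities only. At run time, MCA parameters like these are normally set on the command line, for example `mpirun --mca mtl_ofi_disable_hmem 1 ...`.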

Related to ofiwg/libfabric#9315

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
@@ -59,6 +59,7 @@ typedef struct mca_mtl_ofi_module_t {
     int enable_sep; /* MCA to enable/disable SEP feature */
     int thread_grouping; /* MCA for thread grouping feature */
     int num_ofi_contexts; /* MCA for number of contexts to use */
+    bool disable_hmem; /* MCA to enable/disable request for FI_HMEM support from provider */
Contributor

Curious about a difference between mtl/ofi and btl/ofi.

For mtl/ofi I see the MCA parameters are stored in struct mca_mtl_ofi_module_t, but btl/ofi uses mca_btl_ofi_component_t. Is there a reason for this?

Member Author

Not that I'm aware of. Good point, though; I'll double-check which of the struct types to use for these booleans.

Member Author

Looking across different frameworks and components, it seems a bit arbitrary when fields are put in mca_XXX_component_t vs mca_XXX_module_t (it seems the module string is sometimes left out). I'd prefer to keep to the existing pattern in each of the ofi components with this PR.

Contributor

Thanks for checking. I don't have an opinion on that.

@hppritcha
Member Author

@edgargabriel could you review when you have a chance?

Member

@edgargabriel edgargabriel left a comment

LGTM

@hppritcha hppritcha merged commit e3c55a7 into open-mpi:main Sep 15, 2023