GNI: FI_MR_SCALABLE support needs significant refactoring #6194
hppritcha added a commit to hppritcha/ompi that referenced this issue on Aug 13, 2020: "Uncovered a problem using the GNI provider with the OFI MTL. See ofiwg/libfabric#6194. Related to open-mpi#8001. Signed-off-by: Howard Pritchard <hppritcha@gmail.com>"
hppritcha added further commits with the same message to hppritcha/ompi that referenced this issue on Aug 26, Aug 28, and Aug 31, 2020; the Aug 28 and Aug 31 commits were cherry-picked from commit d6ac41c.
mdosanjh pushed two commits with the same message to mdosanjh/ompi that referenced this issue on Mar 16, 2021.
Closing, as we have a good-enough workaround in the Open MPI OFI MTL.
Open MPI's master and v4.1.x branches no longer work with even very simple tests when using the GNI provider. For example, the OSU osu_bw test fails once the message size reaches 16384 bytes.
The problem stems from the fact that these branches had to switch to requesting FI_VERSION 1.5 in order to work with other providers. Since Open MPI's OFI interface was not explicitly requesting an mr_mode in the domain attributes when opening a domain, the GNI provider falls back to its version of FI_MR_SCALABLE, namely FI_MR_LOCAL | FI_MR_MMU_NOTIFY. The FI_MR_LOCAL bit is particularly irritating: code that worked with older FI_VERSIONs of the GNI provider no longer works, because the libfabric consumer now has to do its own memory registration management for local buffers when it did not before. It may actually have been an oversight to include FI_MR_LOCAL, since the GNI provider registers memory internally for buffers supplied to send/recv operations. The osu_latency test, for instance, appears to run without issue, although latency is higher than with FI_MR_BASIC.
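A consumer can avoid this kind of provider-chosen default by stating its memory-registration expectations explicitly in the fi_getinfo hints. The sketch below is illustrative only (it is not the actual Open MPI OFI MTL code, and error handling is elided); the API calls themselves (fi_allocinfo, fi_getinfo, fi_freeinfo) are standard libfabric:

```c
/* Sketch: explicitly request an mr_mode when discovering a provider,
 * so the provider cannot silently pick a mode (e.g. one including
 * FI_MR_LOCAL) that the consumer is not prepared to handle.
 * Illustrative only -- not the actual Open MPI OFI MTL code. */
#include <stddef.h>
#include <rdma/fabric.h>

struct fi_info *pick_provider(void)
{
    struct fi_info *hints = fi_allocinfo(), *info = NULL;

    if (!hints)
        return NULL;

    /* Advertise the MR mode we support; providers that require more
     * (such as GNI's FI_MR_LOCAL | FI_MR_MMU_NOTIFY) will not match. */
    hints->domain_attr->mr_mode = FI_MR_BASIC;

    if (fi_getinfo(FI_VERSION(1, 5), NULL, NULL, 0, hints, &info))
        info = NULL;            /* no matching provider */
    fi_freeinfo(hints);
    return info;                /* caller must fi_freeinfo(info) */
}
```

This fragment requires a libfabric installation to build, so it is shown as an API-usage fragment rather than a runnable program.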
The way support for FI_MR_SCALABLE was implemented in the GNI provider makes it pretty much useless for real MPI applications. The provider uses the Aries VMDH feature, but in a way that does not allow MDD sharing. There are only 4096 MDDs per Aries NIC, and by default this resource is split between two RDMA credentials, so on the order of 2048 MDDs per NIC are available to Open MPI. That budget has to be shared by all the MPI ranks on the node. Note that if MDD sharing is enabled this is not an issue, since each MDD can handle thousands of memory registrations in that mode. However, because the provider has effectively disabled MDD sharing, the number of memory registrations the provider (or the application) can perform is seriously restricted.

Compounding the problem, the provider's MR cache is disabled when using FI_MR_SCALABLE, so in the case of osu_bw it performs a memory registration for every MPI_Isend and MPI_Irecv once the message transfer involves the Aries DMA engine. The switchover to the DMA engine occurs at a 16384-byte message size, hence the failure at that transfer size in osu_bw. The GNI provider's VMDH management needs to be significantly refactored to allow a subset of the MDDs to be used with MDD sharing enabled, and the MR cache needs to be re-enabled for provider-internal memory registrations. This would also allow the FI_MR_LOCAL bit to be left unset.
An additional issue was uncovered while investigating this problem. It turns out that if Open MPI's mpirun is used to launch the job, the way the GNI provider determines how many local MPI ranks are present, and hence how many MDDs each rank gets, does not work. Without a fix, this will prevent use of the GNI provider in scalable memory mode on the ANL Theta system. Part of resolving this issue will require adding PMIx hooks to libfabric; PMIx provides the locality information the GNI provider needs.
A workaround is being added to Open MPI: fall back to FI_MR_BASIC when using the GNI provider, even when FI_VERSION 1.5 is being requested.