GNI: FI_MR_SCALABLE support needs significant refactoring #6194

Closed
hppritcha opened this issue Aug 13, 2020 · 1 comment
@hppritcha
Contributor

Open MPI's master and v4.1.x no longer work with even very simple tests using the GNI provider.
For example, here's what happens with the OSU osu_bw test:

hpp@nid00012:/XXXX/osu-micro-benchmarks-5.6.1/mpi/pt2pt>mpirun -np 2 -N 1 ./osu_bw
# OSU MPI Bandwidth Test v5.6.1
# Size      Bandwidth (MB/s)
1                       0.70
2                       1.42
4                       2.86
8                       5.77
16                     11.64
32                     23.49
64                     47.31
128                    94.82
256                   187.32
512                   376.39
1024                  654.13
2048                 1124.04
4096                 1766.85
8192                 2436.37
[nid00012:00000] *** An error occurred in MPI_Isend
[nid00012:00000] *** reported by process [3194224641,0]
[nid00012:00000] *** on communicator MPI_COMM_WORLD
[nid00012:00000] *** MPI_ERR_OTHER: known error not in list
[nid00012:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[nid00012:00000] ***    and MPI will try to terminate your MPI job as well)

The problem has to do with the fact that we had to switch to requesting FI_VERSION 1.5 on these branches to work with other providers. Since Open MPI's OFI interface wasn't explicitly requesting an mr_mode on the domain attribute when opening a domain, the GNI provider tries to do its version of FI_MR_SCALABLE, namely FI_MR_LOCAL | FI_MR_MMU_NOTIFY. The FI_MR_LOCAL bit being set is pretty irritating, because what worked with older FI_VERSIONs of the GNI provider no longer works: the consumer of libfabric has to start doing its own memory registration management when it didn't before. It may actually have been an oversight to include FI_MR_LOCAL, since the GNI provider does memory registrations internally for buffers supplied to send/recv ops. The osu_latency test, for instance, appears to run without issue, except that latency has gone up compared to using FI_MR_BASIC.
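
For reference, here's a minimal standalone sketch (not Open MPI code) of the query involved: open the GNI provider at FI_VERSION(1,5) without constraining domain_attr->mr_mode and check which MR mode bits come back. The explicit "gni" provider filter is just for illustration.

```c
/* Minimal sketch, not Open MPI code: query the GNI provider at
 * FI_VERSION(1,5) without constraining domain_attr->mr_mode and see
 * which MR mode bits the provider asks the consumer to honor. */
#include <stdio.h>
#include <string.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

int main(void)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL;

    hints->fabric_attr->prov_name = strdup("gni");
    /* mr_mode left at 0 (FI_MR_UNSPEC): the provider chooses.  Per the
     * report above, GNI answers with FI_MR_LOCAL | FI_MR_MMU_NOTIFY. */

    if (fi_getinfo(FI_VERSION(1, 5), NULL, NULL, 0, hints, &info) == 0) {
        if (info->domain_attr->mr_mode & FI_MR_LOCAL)
            printf("consumer must register local send/recv buffers itself\n");
        if (info->domain_attr->mr_mode & FI_MR_MMU_NOTIFY)
            printf("consumer must notify the provider of page-table changes\n");
        fi_freeinfo(info);
    }
    fi_freeinfo(hints);
    return 0;
}
```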

The way FI_MR_SCALABLE support was implemented in the GNI provider makes it pretty much useless for real MPI applications. The provider makes use of the Aries VMDH feature, but in such a way that MDD sharing is not allowed. There are only 4096 MDDs per Aries NIC, and by default this resource is split between two RDMA credentials, so on the order of 2048 MDDs per Aries NIC are available to Open MPI. This resource has to be shared by all the MPI ranks on the node. Note that if MDD sharing is enabled this is no issue, as each MDD can handle thousands of memory registrations when used in that mode. However, since the provider has effectively disabled MDD sharing, the number of memory registrations the provider (or the application) can do is seriously restricted. Compounding the problem is the fact that the provider's MR cache is disabled when using FI_MR_SCALABLE, so, in the case of osu_bw, it's doing a memory registration for every MPI_Isend and MPI_Irecv once the message transfer involves the Aries DMA engine. The switchover to the DMA engine occurs at a 16384-byte message size, hence the failure at that transfer size in osu_bw.
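
To make the resource pressure concrete, here's a back-of-the-envelope calculation using the numbers above; the 64-local-ranks figure is just an example (e.g. one rank per core on a KNL node):

```c
/* Illustrative arithmetic only, using the numbers from the paragraph
 * above; the local_ranks value is a made-up example. */
#include <stdio.h>

int main(void)
{
    const int mdds_per_nic = 4096; /* Aries hardware limit            */
    const int rdma_creds   = 2;    /* default credential split        */
    const int local_ranks  = 64;   /* example: one rank per core      */

    int mdds_for_ompi = mdds_per_nic / rdma_creds;   /* ~2048 */
    int mdds_per_rank = mdds_for_ompi / local_ranks; /* ~32   */

    printf("registrations per rank without MDD sharing: %d\n", mdds_per_rank);
    /* With MDD sharing enabled, each MDD handles thousands of
     * registrations, so this per-rank limit largely disappears. */
    return 0;
}
```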

The GNI provider's VMDH management needs to be significantly refactored to allow a subset of the MDDs to be used with MDD sharing enabled, and the MR cache needs to be re-enabled for provider-internal memory registrations. That would also allow the FI_MR_LOCAL bit to be dropped.

An additional issue was uncovered while investigating this problem. It turns out that if Open MPI's mpirun is used to launch the job, the way the GNI provider figures out how many local MPI ranks are present, and hence how many MDDs each rank gets, doesn't work. This will prevent use of the GNI provider in scalable memory mode on the ANL Theta system until it is fixed. Part of resolving this issue will require adding PMIx hooks to libfabric, since PMIx provides the locality information the GNI provider needs.
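
For illustration, here's a standalone sketch of the kind of PMIx query a provider could use to get the local rank count. This is not the proposed libfabric hook, just the PMIx_Get/PMIX_LOCAL_SIZE pattern:

```c
/* Standalone illustration (not the proposed libfabric hook): ask PMIx
 * how many ranks share this node via PMIX_LOCAL_SIZE. */
#include <stdio.h>
#include <string.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t myproc, wildcard;
    pmix_value_t *val = NULL;

    if (PMIx_Init(&myproc, NULL, 0) != PMIX_SUCCESS)
        return 1;

    /* Node/job-level keys are queried with a wildcard rank. */
    PMIX_PROC_CONSTRUCT(&wildcard);
    strncpy(wildcard.nspace, myproc.nspace, PMIX_MAX_NSLEN);
    wildcard.rank = PMIX_RANK_WILDCARD;

    if (PMIx_Get(&wildcard, PMIX_LOCAL_SIZE, NULL, 0, &val) == PMIX_SUCCESS) {
        printf("local ranks on this node: %u\n", val->data.uint32);
        PMIX_VALUE_RELEASE(val);
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}
```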

A workaround is being added to Open MPI to fall back to FI_MR_BASIC when using the GNI provider, even when FI_VERSION 1.5 is being requested.
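
A rough sketch of what such a fallback could look like on the consumer side (the actual Open MPI change may be structured differently; the helper name here is made up):

```c
/* Hypothetical helper, not the actual Open MPI patch: if the matched
 * provider is "gni", redo the query with mr_mode forced to FI_MR_BASIC. */
#include <string.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

static struct fi_info *getinfo_with_gni_fallback(struct fi_info *hints)
{
    struct fi_info *info = NULL;

    if (fi_getinfo(FI_VERSION(1, 5), NULL, NULL, 0, hints, &info) != 0)
        return NULL;

    if (info->fabric_attr->prov_name != NULL &&
        strcmp(info->fabric_attr->prov_name, "gni") == 0) {
        /* Steer GNI away from its FI_MR_SCALABLE path. */
        fi_freeinfo(info);
        info = NULL;
        hints->domain_attr->mr_mode = FI_MR_BASIC;
        if (fi_getinfo(FI_VERSION(1, 5), NULL, NULL, 0, hints, &info) != 0)
            return NULL;
    }
    return info;
}
```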

@hppritcha hppritcha self-assigned this Aug 13, 2020
hppritcha added a commit to hppritcha/ompi that referenced this issue Aug 13, 2020
Uncovered a problem using the GNI provider with the OFI MTL.
See ofiwg/libfabric#6194.

Related to open-mpi#8001

Signed-off-by: Howard Pritchard <hppritcha@gmail.com>
hppritcha added a commit to hppritcha/ompi that referenced this issue Aug 26, 2020
Uncovered a problem using the GNI provider with the OFI MTL.
See ofiwg/libfabric#6194.

Related to open-mpi#8001
hppritcha added a commit to hppritcha/ompi that referenced this issue Aug 26, 2020
Uncovered a problem using the GNI provider with the OFI MTL.
See ofiwg/libfabric#6194.

Related to open-mpi#8001

Signed-off-by: Howard Pritchard <hppritcha@gmail.com>
hppritcha added a commit to hppritcha/ompi that referenced this issue Aug 28, 2020
Uncovered a problem using the GNI provider with the OFI MTL.
See ofiwg/libfabric#6194.

Related to open-mpi#8001

Signed-off-by: Howard Pritchard <hppritcha@gmail.com>
(cherry picked from commit d6ac41c)
hppritcha added a commit to hppritcha/ompi that referenced this issue Aug 31, 2020
Uncovered a problem using the GNI provider with the OFI MTL.
See ofiwg/libfabric#6194.

Related to open-mpi#8001

Signed-off-by: Howard Pritchard <hppritcha@gmail.com>
(cherry picked from commit d6ac41c)
mdosanjh pushed a commit to mdosanjh/ompi that referenced this issue Mar 16, 2021
Uncovered a problem using the GNI provider with the OFI MTL.
See ofiwg/libfabric#6194.

Related to open-mpi#8001

Signed-off-by: Howard Pritchard <hppritcha@gmail.com>
mdosanjh pushed a commit to mdosanjh/ompi that referenced this issue Mar 16, 2021
Uncovered a problem using the GNI provider with the OFI MTL.
See ofiwg/libfabric#6194.

Related to open-mpi#8001

Signed-off-by: Howard Pritchard <hppritcha@gmail.com>
@hppritcha
Contributor Author

Closing as we have a good enough workaround in the Open MPI OFI MTL.
