Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix error check when initializing Horovod with existing mpi4py communicator #1391

Merged
merged 1 commit into from Sep 10, 2019

Conversation

@romerojosh
Copy link
Collaborator

commented Sep 9, 2019

This error check in init() when attempting to initialize Horovod with an existing mpi4py communicator breaks this functionality completely. The issue is that the flag indicating that MPI is enabled is set in within horovod_init_comm/InitializeHorovodOnce, the function being blocked by the error check.

@romerojosh romerojosh requested review from alsrgv and tgaddair Sep 9, 2019

@tgaddair

This comment has been minimized.

Copy link
Collaborator

commented Sep 9, 2019

Good catch @romerojosh! Instead of removing this guard completely, what about changing it to horovod_mpi_built(), which only checks that Horovod was installed with MPI support? Then we could change the error message to something like:

Horovod has not been built with MPI support.  Ensure MPI is installed and reinstall Horovod with HOROVOD_WITH_MPI=1 to debug the build error.

I think this would be useful diagnostic info since we can now fallback to using Gloo if MPI failed to install correctly.

@romerojosh

This comment has been minimized.

Copy link
Collaborator Author

commented Sep 10, 2019

Makes sense. Do you think a similar change should also be made for the error check in mpi_threads_supported?

@tgaddair

This comment has been minimized.

Copy link
Collaborator

commented Sep 10, 2019

That's a good question. The problem with changing it to check mpi_built() is that mpi_threads_supported() will blindly perform a dynamic cast on the controller:

auto mpiController =
      std::dynamic_pointer_cast<MPIController>(horovod_global.controller);
  return mpiController->IsMpiThreadsSupported() ? 1 : 0;

... which can fail if MPI is installed, but Gloo is being used as the controller.

Checking mpi_context.Enabled() takes care of most of the problem, but doesn't account for one fringe edge case: where you're using MPI for CPU operations (thus mpi_context is enabled), but Gloo for the controller. I don't think anyone is likely to do this in practice (other than benchmarking purposes), but since it's possible, I'll make a fix for that.

But I think in general, checking mpi_enabled() is correct for mpi_threads_supported() for that reason.

@romerojosh romerojosh force-pushed the romerojosh:mpi_comm_fix branch from 921ba40 to 3eb7bdc Sep 10, 2019

@romerojosh romerojosh changed the title Remove buggy error check when initializing Horovod with existing mpi4py communicator Fix error check when initializing Horovod with existing mpi4py communicator Sep 10, 2019

@romerojosh

This comment has been minimized.

Copy link
Collaborator Author

commented Sep 10, 2019

Sounds good to me @tgaddair! Pushed a change modifying rather than removing the existing error check.

@tgaddair
Copy link
Collaborator

left a comment

LGTM! Thanks!

@romerojosh

This comment has been minimized.

Copy link
Collaborator Author

commented Sep 10, 2019

Looks like there are some failing tests unrelated to this change.

@tgaddair

This comment has been minimized.

Copy link
Collaborator

commented Sep 10, 2019

Seems to be related to a change in Keras nightly (#1392).

@tgaddair

This comment has been minimized.

Copy link
Collaborator

commented Sep 10, 2019

@romerojosh Landed #1392, can you rebase and push again?

Fix error check when initializing Horovod with existing mpi4py commun…
…icator.

Signed-off-by: Josh Romero <joshr@nvidia.com>

@romerojosh romerojosh force-pushed the romerojosh:mpi_comm_fix branch from 3eb7bdc to e1d6ab2 Sep 10, 2019

@tgaddair tgaddair merged commit 317c1d9 into horovod:master Sep 10, 2019

2 checks passed

DCO DCO
Details
buildkite/horovod/pr Build #1155 passed (1 hour, 12 minutes, 28 seconds)
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
2 participants
You can’t perform that action at this time.