Added support for Gloo on macOS (#2254)
Signed-off-by: Travis Addair <taddair@uber.com>
tgaddair committed Sep 9, 2020
1 parent 3dc1ade commit 5370ebc
Showing 9 changed files with 59 additions and 58 deletions.
8 changes: 4 additions & 4 deletions CMakeLists.txt
@@ -213,12 +213,15 @@ message(FATAL_ERROR "You should not mix NCCL and MPI GPU due to a possible deadl
endif()

# Gloo
-if (NOT "$ENV{HOROVOD_WITHOUT_GLOO}" STREQUAL "1" AND NOT ${CMAKE_SYSTEM_NAME} MATCHES "Darwin")
+if (NOT "$ENV{HOROVOD_WITHOUT_GLOO}" STREQUAL "1")
if(HAVE_MPI)
set(USE_MPI TRUE)
else()
set(USE_MPI FALSE)
endif()
+if(${CMAKE_SYSTEM_NAME} MATCHES "Darwin")
+set(USE_LIBUV_DEFAULT ON)
+endif()
set(CMAKE_POLICY_DEFAULT_CMP0074 NEW)
add_subdirectory(third_party/gloo)
include_directories(third_party/gloo)
@@ -231,9 +234,6 @@ if (NOT "$ENV{HOROVOD_WITHOUT_GLOO}" STREQUAL "1" AND NOT ${CMAKE_SYSTEM_NAME} M
add_definitions(-DHAVE_GLOO=1)
set(HAVE_GLOO TRUE)
endif()
-if (NOT HAVE_MPI AND ${CMAKE_SYSTEM_NAME} MATCHES "Darwin")
-message(FATAL_ERROR "Gloo cannot be compiled on MacOS, install MPI.")
-endif()

# NCCL + MPI
if (HAVE_NCCL AND HAVE_MPI)
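
The Gloo branch of this file can be read as a small decision function. The Python sketch below mirrors the post-change CMake logic for illustration only; `gloo_build_options` and the shape of its return value are hypothetical and not part of Horovod's build.

```python
import re


def gloo_build_options(env, system_name, have_mpi):
    """Illustrative mirror of the Gloo branch in CMakeLists.txt after this commit.

    Returns None when Gloo is disabled, otherwise a dict of the CMake
    variables that the branch would set.
    """
    # Gloo is skipped only when HOROVOD_WITHOUT_GLOO is exactly "1".
    if env.get("HOROVOD_WITHOUT_GLOO") == "1":
        return None

    # USE_MPI follows HAVE_MPI on every platform.
    options = {"USE_MPI": bool(have_mpi)}

    # CMake's MATCHES is a regular-expression match against the system name.
    if re.search("Darwin", system_name):
        options["USE_LIBUV_DEFAULT"] = True

    return options


if __name__ == "__main__":
    print(gloo_build_options({}, "Darwin", have_mpi=False))
```

Before this commit, the `Darwin` case skipped Gloo entirely and failed the build when MPI was also missing.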
1 change: 0 additions & 1 deletion Dockerfile.gpu
@@ -48,7 +48,6 @@ RUN pip install tensorflow==${TENSORFLOW_VERSION} \
keras \
h5py

-# https://download.pytorch.org/whl/cu101/torch-1.6.0%2Bcu101-cp37-cp37m-linux_x86_64.whl
RUN PYTAGS=$(python -c "from packaging import tags; tag = list(tags.sys_tags())[0]; print(f'{tag.interpreter}-{tag.abi}')") && \
pip install https://download.pytorch.org/whl/cu101/torch-${PYTORCH_VERSION}%2Bcu101-${PYTAGS}-linux_x86_64.whl \
https://download.pytorch.org/whl/cu101/torchvision-${TORCHVISION_VERSION}%2Bcu101-${PYTAGS}-linux_x86_64.whl
23 changes: 5 additions & 18 deletions README.rst
@@ -107,21 +107,11 @@ To install Horovod:

1. Install `CMake <https://cmake.org/install/>`__

-2. *Optional*: Install `Open MPI <https://www.open-mpi.org/>`_ or another MPI implementation.
-
-Learn how to install Open MPI `on this page <https://www.open-mpi.org/faq/?category=building#easy-build>`_.
-
-**Note**: Open MPI 3.1.3 has an issue that may cause hangs. The recommended fix is to downgrade to Open MPI 3.1.2 or upgrade to Open MPI 4.0.0.
-
-**Note (Linux)**: Linux users can use `Gloo <https://github.com/facebookincubator/gloo>`__ as an alternative to MPI, which requires no extra dependencies.
-
-**Note (macOS)**: MPI is required for Horovod on macOS, as Gloo is currently unavailable.
-
.. raw:: html

<p/>

-3. If you've installed TensorFlow from `PyPI <https://pypi.org/project/tensorflow>`__, make sure that the ``g++-4.8.5`` or ``g++-4.9`` is installed.
+2. If you've installed TensorFlow from `PyPI <https://pypi.org/project/tensorflow>`__, make sure that the ``g++-4.8.5`` or ``g++-4.9`` is installed.

If you've installed PyTorch from `PyPI <https://pypi.org/project/torch>`__, make sure that the ``g++-4.9`` or above is installed.

@@ -131,7 +121,7 @@ To install Horovod:

<p/>

-4. Install the ``horovod`` pip package.
+3. Install the ``horovod`` pip package.

To run on CPUs:

@@ -145,12 +135,12 @@ To install Horovod:
$ HOROVOD_GPU_OPERATIONS=NCCL pip install horovod
This basic installation is good for laptops and for getting to know Horovod.

For more details on installing Horovod with GPU support, read `Horovod on GPU <docs/gpus.rst>`_.

For the full list of Horovod installation options, read the `Installation Guide <docs/install.rst>`_.

If you want to use MPI, read `Horovod with MPI <docs/mpi.rst>`_.

If you want to use Conda, read `Building a Conda environment with GPU support for Horovod <docs/conda.rst>`_.

If you want to use Docker, read `Horovod in Docker <docs/docker.rst>`_.
@@ -306,17 +296,14 @@ Gloo
----
`Gloo <https://github.com/facebookincubator/gloo>`_ is an open source collective communications library developed by Facebook.

-Gloo comes included with Horovod, and allows users to run Horovod without requiring MPI to be installed. Gloo support only requires
-that you have `CMake <https://cmake.org/>`_ installed, and is only supported on Linux at this time.
+Gloo comes included with Horovod, and allows users to run Horovod without requiring MPI to be installed.

For environments that support both MPI and Gloo, you can choose to use Gloo at runtime by passing the ``--gloo`` argument to ``horovodrun``:

.. code-block:: bash
$ horovodrun --gloo -np 2 python train.py
-Gloo support is still early in its development, and more features are coming soon.

mpi4py
------
Horovod supports mixing and matching Horovod collectives with other MPI libraries, such as `mpi4py <https://mpi4py.scipy.org>`_,
2 changes: 1 addition & 1 deletion build-docker-images.sh
@@ -22,7 +22,7 @@ function build_one()
docker build -f Dockerfile.${device} -t ${tag} --build-arg python=${py} --no-cache .
horovod_version=$(docker run --rm ${tag} pip show horovod | grep Version | awk '{print $2}')
tensorflow_version=$(docker run --rm ${tag} pip show ${tensorflow_pkg} | grep Version | awk '{print $2}')
-pytorch_version=$(docker run --rm ${tag} pip show torch | grep Version | awk '{print $2}')
+pytorch_version=$(docker run --rm ${tag} pip show torch | grep Version | sed 's/+/ /g' | awk '{print $2}')
mxnet_version=$(docker run --rm ${tag} pip show ${mxnet_pkg} | grep Version | awk '{print $2}')
final_tag=horovod/horovod:${horovod_version}-tf${tensorflow_version}-torch${pytorch_version}-mxnet${mxnet_version}-py${py}-${device}
docker tag ${tag} ${final_tag}
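
The added `sed 's/+/ /g'` step turns a local version suffix such as `+cu101` into a separate field, so `awk '{print $2}'` yields only the base PyTorch version. This is presumably needed because `+` is not a valid character in a Docker tag. A minimal Python sketch of the same text processing follows; `version_for_tag` is a hypothetical helper and the sample `pip show` output is made up for illustration.

```python
def version_for_tag(pip_show_output):
    """Emulate `grep Version | sed 's/+/ /g' | awk '{print $2}'`.

    Unlike the shell pipeline, which prints a field for every matching
    line, this sketch returns the result for the first match only.
    """
    for line in pip_show_output.splitlines():
        if "Version" in line:                         # grep Version
            fields = line.replace("+", " ").split()   # sed, then awk field split
            return fields[1]                          # awk '{print $2}'
    return ""


if __name__ == "__main__":
    sample = "Name: torch\nVersion: 1.6.0+cu101\nSummary: example\n"
    print(version_for_tag(sample))
```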
2 changes: 2 additions & 0 deletions docs/index.rst
@@ -117,6 +117,8 @@ Guides

gpus_include

+mpi_include
+
conda_include

docker_include
27 changes: 20 additions & 7 deletions docs/mpirun.rst → docs/mpi.rst
@@ -1,7 +1,18 @@
:orphan:
.. inclusion-marker-start-do-not-remove
+Horovod with MPI
+================
+
+MPI can be used as an alternative to Gloo for coordinating work between processes in Horovod. When using NCCL, performance
+will be similar between the two, but if you are doing CPU training, there are noticeable performance benefits to using MPI.
+
+First install `Open MPI <https://www.open-mpi.org/>`_ or another MPI implementation. Learn how to install Open MPI `on this page <https://www.open-mpi.org/faq/?category=building#easy-build>`_.
+
+**Note**: Open MPI 3.1.3 has an issue that may cause hangs. The recommended fix is to downgrade to Open MPI 3.1.2 or upgrade to Open MPI 4.0.0.
+
+mpirun
+------
+
-Run Horovod with Open MPI
-=========================
``horovodrun`` introduces a convenient, Open MPI-based wrapper for running Horovod scripts.

In some cases it is desirable to have fine-grained control over options passed to Open MPI. This page describes
@@ -56,7 +67,7 @@ With the ``-x`` option you can specify (``-x NCCL_DEBUG=INFO``) or copy (``-x LD
all the workers.

Custom SSH ports
-----------------
+~~~~~~~~~~~~~~~~

Specify custom SSH ports with ``-mca plm_rsh_args "-p <port>"`` as follows:

@@ -73,7 +84,7 @@ Specify custom SSH ports with ``-mca plm_rsh_args "-p <port>"`` as follows:
This is frequently useful in the case of `running Horovod in Docker environment <docker.rst>`_.

Open MPI with RDMA
-------------------
+~~~~~~~~~~~~~~~~~~

As noted above, using TCP for MPI communication does not have any significant effects on performance in the majority of
cases. Models that make heavy use of ``hvd.broadcast()`` and ``hvd.allgather()`` operations are exceptions to that rule.
@@ -95,7 +106,7 @@ Other MPI RDMA implementations may or may not benefit from disabling multithread
documentation.

Horovod Parameter Knobs
------------------------
+~~~~~~~~~~~~~~~~~~~~~~~

Many of the configurable parameters available as command line arguments to ``horovodrun`` can be used with ``mpirun``
through the use of environment variables.
@@ -121,7 +132,7 @@ Autotuning:
Note that when using ``horovodrun``, any command line arguments will override values set in the environment.

Hangs due to non-routed network interfaces
-------------------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Having network interfaces that are not routed can cause Open MPI to hang. An example of such interface is ``docker0``.

@@ -177,3 +188,5 @@ Example ``mpirun`` command with ``lo`` and ``docker0`` interfaces excluded:
-mca pml ob1 -mca btl ^openib \
-mca btl_tcp_if_exclude lo,docker0 \
python train.py
+.. inclusion-marker-end-do-not-remove
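
The interface exclusion shown in the last hunk amounts to passing a comma-separated list to `-mca btl_tcp_if_exclude`. A small Python sketch that builds just that argument fragment is given below; `exclude_interfaces_args` is a hypothetical helper, and the rest of the `mpirun` command line is intentionally left out because it is not visible in this hunk.

```python
def exclude_interfaces_args(interfaces):
    """Build the mpirun arguments that exclude the given network interfaces.

    Only the `-mca btl_tcp_if_exclude <list>` fragment from the docs is
    produced; nothing is executed.
    """
    if not interfaces:
        return []
    return ["-mca", "btl_tcp_if_exclude", ",".join(interfaces)]


if __name__ == "__main__":
    # Matches the example in the docs, where lo and docker0 are excluded.
    print(" ".join(exclude_interfaces_args(["lo", "docker0"])))
```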
3 changes: 3 additions & 0 deletions docs/mpi_include.rst
@@ -0,0 +1,3 @@
+.. include:: ./mpi.rst
+:start-after: inclusion-marker-start-do-not-remove
+:end-before: inclusion-marker-end-do-not-remove
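
The new `mpi_include.rst` pulls in only the part of `mpi.rst` between the two inclusion markers, which keeps the `:orphan:` field out of the included text. The sketch below imitates that slicing in Python, assuming the usual docutils behaviour: `start-after` keeps everything after the first occurrence of its text, and `end-before` then cuts at the first occurrence of its text in what remains. `slice_between` is a hypothetical name, not a docutils function.

```python
def slice_between(text, start_after, end_before):
    """Return the part of `text` between two marker strings.

    Raises ValueError when either marker is missing, similar in spirit
    to the error an include directive reports for a missing marker.
    """
    start = text.find(start_after)
    if start == -1:
        raise ValueError("start marker not found: %r" % start_after)
    remainder = text[start + len(start_after):]

    end = remainder.find(end_before)
    if end == -1:
        raise ValueError("end marker not found: %r" % end_before)
    return remainder[:end]


if __name__ == "__main__":
    doc = (
        ":orphan:\n"
        ".. inclusion-marker-start-do-not-remove\n"
        "Horovod with MPI\n"
        ".. inclusion-marker-end-do-not-remove\n"
    )
    print(repr(slice_between(doc,
                             "inclusion-marker-start-do-not-remove",
                             "inclusion-marker-end-do-not-remove")))
```

The leading `.. ` of the end marker line stays in the slice, where reStructuredText treats it as an empty comment.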
23 changes: 5 additions & 18 deletions docs/summary.rst
@@ -99,21 +99,11 @@ To install Horovod:

1. Install `CMake <https://cmake.org/install/>`__

-2. *Optional*: Install `Open MPI <https://www.open-mpi.org/>`_ or another MPI implementation.
-
-Learn how to install Open MPI `on this page <https://www.open-mpi.org/faq/?category=building#easy-build>`_.
-
-**Note**: Open MPI 3.1.3 has an issue that may cause hangs. The recommended fix is to downgrade to Open MPI 3.1.2 or upgrade to Open MPI 4.0.0.
-
-**Note (Linux)**: Linux users can use `Gloo <https://github.com/facebookincubator/gloo>`__ as an alternative to MPI, which requires no extra dependencies.
-
-**Note (macOS)**: MPI is required for Horovod on macOS, as Gloo is currently unavailable.
-
.. raw:: html

<p/>

-3. If you've installed TensorFlow from `PyPI <https://pypi.org/project/tensorflow>`__, make sure that the ``g++-4.8.5`` or ``g++-4.9`` is installed.
+2. If you've installed TensorFlow from `PyPI <https://pypi.org/project/tensorflow>`__, make sure that the ``g++-4.8.5`` or ``g++-4.9`` is installed.

If you've installed PyTorch from `PyPI <https://pypi.org/project/torch>`__, make sure that the ``g++-4.9`` or above is installed.

@@ -123,7 +113,7 @@ To install Horovod:

<p/>

-4. Install the ``horovod`` pip package.
+3. Install the ``horovod`` pip package.

To run on CPUs:

@@ -137,12 +127,12 @@ To install Horovod:
$ HOROVOD_GPU_OPERATIONS=NCCL pip install horovod
This basic installation is good for laptops and for getting to know Horovod.

For more details on installing Horovod with GPU support, read `Horovod on GPU <gpus.rst>`_.

For the full list of Horovod installation options, read the `Installation Guide <install.rst>`_.

If you want to use MPI, read `Horovod with MPI <mpi.rst>`_.

If you want to use Conda, read `Building a Conda environment with GPU support for Horovod <conda.rst>`_.

If you want to use Docker, read `Horovod in Docker <docker.rst>`_.
@@ -298,17 +288,14 @@ Gloo
----
`Gloo <https://github.com/facebookincubator/gloo>`_ is an open source collective communications library developed by Facebook.

-Gloo comes included with Horovod, and allows users to run Horovod without requiring MPI to be installed. Gloo support only requires
-that you have `CMake <https://cmake.org/>`_ installed, and is only supported on Linux at this time.
+Gloo comes included with Horovod, and allows users to run Horovod without requiring MPI to be installed.

For environments that support both MPI and Gloo, you can choose to use Gloo at runtime by passing the ``--gloo`` argument to ``horovodrun``:

.. code-block:: bash
$ horovodrun --gloo -np 2 python train.py
-Gloo support is still early in its development, and more features are coming soon.

mpi4py
------
Horovod supports mixing and matching Horovod collectives with other MPI libraries, such as `mpi4py <https://mpi4py.scipy.org>`_,
28 changes: 19 additions & 9 deletions horovod/common/gloo/gloo_context.cc
@@ -23,7 +23,17 @@
#include "gloo/rendezvous/context.h"
#include "gloo/rendezvous/file_store.h"
#include "gloo/rendezvous/prefix_store.h"

+#ifdef __linux__
#include "gloo/transport/tcp/device.h"
+using attr = gloo::transport::tcp::attr;
+constexpr auto CreateDevice = gloo::transport::tcp::CreateDevice;
+#else
+// Use uv on macOS as TCP requires epoll (Linux-only)
+#include "gloo/transport/uv/device.h"
+using attr = gloo::transport::uv::attr;
+constexpr auto CreateDevice = gloo::transport::uv::CreateDevice;
+#endif

#if HAVE_MPI
#include "gloo/mpi/context.h"
@@ -98,10 +108,10 @@ void GlooContext::InitializeFromMPI(MPIContext& mpi_ctx,

// TODO(sihan): Add support for multiple interfaces:
// https://github.com/facebookincubator/gloo/issues/190
-gloo::transport::tcp::attr attr;
-attr.iface = gloo_iface;
-attr.ai_family = AF_UNSPEC;
-auto dev = gloo::transport::tcp::CreateDevice(attr);
+attr device_attr;
+device_attr.iface = gloo_iface;
+device_attr.ai_family = AF_UNSPEC;
+auto dev = CreateDevice(device_attr);
auto timeout = GetTimeoutFromEnv();

auto context =
@@ -129,14 +139,14 @@ void GlooContext::Initialize(const std::string& gloo_iface) {
return;
}

-// Create a tcp device for communication
+// Create a device for communication
// TODO(sihan): Add support for multiple interfaces:
// https://github.com/facebookincubator/gloo/issues/190
-gloo::transport::tcp::attr attr;
-attr.iface = gloo_iface;
+attr device_attr;
+device_attr.iface = gloo_iface;

-attr.ai_family = AF_UNSPEC;
-auto dev = gloo::transport::tcp::CreateDevice(attr);
+device_attr.ai_family = AF_UNSPEC;
+auto dev = CreateDevice(device_attr);
auto timeout = GetTimeoutFromEnv();

auto host_env = std::getenv(HOROVOD_HOSTNAME);
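
The `#ifdef __linux__` block picks the transport once, and both call sites then use the shared `attr` and `CreateDevice` aliases without platform checks of their own. The same idea can be sketched in Python as a single selection function. This is an illustration only: `select_gloo_transport` is a hypothetical name, and Horovod makes this choice at compile time in C++ rather than at runtime.

```python
import sys


def select_gloo_transport(platform=None):
    """Pick a transport name the way the preprocessor branch does.

    Linux gets the epoll-based "tcp" transport; every other platform
    (macOS in practice) falls back to the libuv-based "uv" transport.
    """
    if platform is None:
        platform = sys.platform
    return "tcp" if platform.startswith("linux") else "uv"


if __name__ == "__main__":
    print(select_gloo_transport())
```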