
Edit Horovod docs (#1119)

Signed-off-by: Stephanie Blotner <sblotner@uber.com>
sblotner authored and alsrgv committed Jun 5, 2019
1 parent 30a2148 commit 89121661a2eaa36e4fab8566bd4f84e2361f3469
Showing with 94 additions and 95 deletions.
  1. +15 −20 docs/gpus.rst
  2. +3 −2 docs/index.rst
  3. +22 −22 docs/mpirun.rst
  4. +3 −3 docs/running.rst
  5. +37 −33 docs/summary.rst
  6. +1 −1 docs/summary_include.rst
  7. +8 −8 docs/timeline.rst
  8. +5 −6 docs/troubleshooting.rst
@@ -11,43 +11,38 @@ Have GPUs?
In most situations, using NCCL 2 will significantly improve performance over the CPU version. NCCL 2 provides the **allreduce**
operation optimized for NVIDIA GPUs and a variety of networking devices, such as RoCE or InfiniBand.

1. Install `NCCL 2 <https://developer.nvidia.com/nccl>`__ following `these steps <http://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html>`__.

If you have installed NCCL 2 using the ``nccl-<version>.txz`` package, you should add the library path to the ``LD_LIBRARY_PATH``
environment variable or register it in ``/etc/ld.so.conf``.

.. code-block:: bash

    $ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/nccl-<version>/lib
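If you prefer to register the library in ``/etc/ld.so.conf`` instead, one way to do it (the ``nccl.conf`` drop-in file name below is only an example):

.. code-block:: bash

    # Register the NCCL library path system-wide instead of exporting LD_LIBRARY_PATH
    $ echo "/usr/local/nccl-<version>/lib" | sudo tee /etc/ld.so.conf.d/nccl.conf
    $ sudo ldconfig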
2. (Optional) If you're using an NVIDIA Tesla GPU and NIC with GPUDirect RDMA support, you can further speed up NCCL 2
by installing an `nv_peer_memory <https://github.com/Mellanox/nv_peer_memory>`__ driver.

`GPUDirect <https://developer.nvidia.com/gpudirect>`__ allows GPUs to transfer memory directly among themselves without CPU
involvement, which significantly reduces latency and CPU load. NCCL 2 can use GPUDirect automatically for the
**allreduce** operation if it detects it.
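A quick way to check that the driver is loaded (assuming the kernel module name ``nv_peer_mem`` used by the Mellanox packages):

.. code-block:: bash

    # The module should appear in the list once nv_peer_memory is installed and loaded
    $ lsmod | grep nv_peer_mem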

3. Install `Open MPI <https://www.open-mpi.org/>`__ or another MPI implementation following `these steps <https://www.open-mpi.org/faq/?category=building#easy-build>`__.

**Note**: Open MPI 3.1.3 has an issue that may cause hangs. The recommended fix is to downgrade to Open MPI 3.1.2 or upgrade to Open MPI 4.0.0.

4. Install the ``horovod`` pip package.

If you have installed NCCL 2 using the ``nccl-<version>.txz`` package, you should specify the path to NCCL 2 using the ``HOROVOD_NCCL_HOME``
environment variable.

.. code-block:: bash

    $ HOROVOD_NCCL_HOME=/usr/local/nccl-<version> HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod

If you have installed NCCL 2 using the Ubuntu package, you can run:

.. code-block:: bash
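    # Likely command for this case (assumption: the Ubuntu package puts NCCL on the
    # default library path, so HOROVOD_NCCL_HOME is not required):
    $ HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod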
@@ -1,5 +1,6 @@
Horovod documentation
=====================

Horovod improves the speed, scale, and resource utilization of deep learning training.

.. toctree::
:maxdepth: 2
@@ -8,44 +8,44 @@ running Horovod training directly using Open MPI.

1. Run on a machine with 4 GPUs:

.. code-block:: bash

    horovodrun -np 4 -H localhost:4 python train.py

Equivalent Open MPI command:

.. code-block:: bash

    mpirun -np 4 \
        -H localhost:4 \
        -bind-to none -map-by slot \
        -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
        -mca pml ob1 -mca btl ^openib \
        python train.py
2. Run on 4 machines with 4 GPUs each:

.. code-block:: bash

    horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py

Equivalent Open MPI command:

.. code-block:: bash

    mpirun -np 16 \
        -H server1:4,server2:4,server3:4,server4:4 \
        -bind-to none -map-by slot \
        -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
        -mca pml ob1 -mca btl ^openib \
        python train.py
Starting with Open MPI 3, it's important to add the ``-bind-to none`` and ``-map-by slot`` arguments.
``-bind-to none`` tells Open MPI not to bind a training process to a single CPU core (which would hurt performance).
``-map-by slot`` allows you to have a mixture of different NUMA configurations because the default behavior is to bind
to the socket.

The ``-mca pml ob1`` and ``-mca btl ^openib`` flags force the use of TCP for MPI communication. This avoids many
multiprocessing issues that Open MPI has with RDMA, which typically result in segmentation faults. Using TCP for MPI
does not have a noticeable performance impact since most of the heavy communication is done by NCCL, which will use RDMA
via RoCE or InfiniBand if they're available (see `Horovod on GPU <gpus.md>`_). Notable exceptions from this rule are
@@ -58,7 +58,7 @@ all the workers.
Custom SSH ports
----------------

Specify custom SSH ports with ``-mca plm_rsh_args "-p <port>"`` as follows:

.. code-block:: bash
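    # Illustrative sketch (SSH assumed to listen on port 12345; other flags as in the commands above):
    mpirun -np 16 \
        -H server1:4,server2:4,server3:4,server4:4 \
        -bind-to none -map-by slot \
        -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
        -mca pml ob1 -mca btl ^openib \
        -mca plm_rsh_args "-p 12345" \
        python train.py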
@@ -4,19 +4,19 @@
Running Horovod
===============

This page includes examples for Open MPI that use ``horovodrun``. Check your MPI documentation for arguments to the ``mpirun``
command on your system.

Typically one GPU will be allocated per process, so if a server has 4 GPUs, you would run 4 processes. In ``horovodrun``,
the number of processes is specified with the ``-np`` flag.

To run on a machine with 4 GPUs:

.. code-block:: bash

    $ horovodrun -np 4 -H localhost:4 python train.py

To run on 4 machines with 4 GPUs each:

.. code-block:: bash
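    # One process per GPU, 16 in total across the four servers:
    $ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py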
@@ -30,8 +30,8 @@ Horovod
|

Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
The goal of Horovod is to make distributed deep learning fast and easy to use.


.. raw:: html
@@ -40,7 +40,7 @@ distributed Deep Learning fast and easy to use.


Horovod is hosted by the `LF AI Foundation <https://lfdl.io>`_ (LF AI). If you are a company that is deeply
committed to using open source technologies in artificial intelligence, machine learning, and deep learning, and want to support
the communities of open source projects in these domains, consider joining the LF AI Foundation. For details
about who's involved and how Horovod plays a role, read the LF AI `announcement <https://lfdl.io/press/2018/12/13/lf-deep-learning-welcomes-horovod-distributed-training-framework-as-newest-project/>`_.

@@ -50,7 +50,7 @@ about who's involved and how Horovod plays a role, read the LF AI `announcement

|

Why not traditional distributed TensorFlow?
-------------------------------------------

The primary motivation for this project is to make it easy to take a single-GPU TensorFlow program and successfully train
@@ -69,7 +69,7 @@ servers with 4 Pascal GPUs each connected by RoCE-capable 25 Gbit/s network:
:alt: 512-GPU Benchmark

Horovod achieves 90% scaling efficiency for both Inception V3 and ResNet-101, and 68% scaling efficiency for VGG-16.
See `Benchmarks <benchmarks.rst>`_ to find out how to reproduce these numbers.

While installing MPI and NCCL may seem like an extra hassle, it only needs to be done once by the team dealing
with infrastructure, while everyone else in the company who builds the models can enjoy the simplicity of training them at
@@ -83,8 +83,7 @@ To install Horovod:

1. Install `Open MPI <https://www.open-mpi.org/>`_ or another MPI implementation. Learn how to install Open MPI `on this page <https://www.open-mpi.org/faq/?category=building#easy-build>`_.

**Note**: Open MPI 3.1.3 has an issue that may cause hangs. The recommended fix is to downgrade to Open MPI 3.1.2 or upgrade to Open MPI 4.0.0.

2. Install the ``horovod`` pip package.

@@ -93,8 +92,8 @@ downgrade to Open MPI 3.1.2 or upgrade to Open MPI 4.0.0.
$ pip install horovod
This basic installation is good for laptops and for getting to know Horovod.
If you're installing Horovod on a server with GPUs, read `Horovod on GPU <gpus.rst>`_.
If you want to use Docker, read `Horovod in Docker <docker.rst>`_.


Concepts
@@ -112,24 +111,27 @@ To use Horovod, make the following additions to your program:
1. Run ``hvd.init()``.

2. Pin a server GPU to be used by this process using ``config.gpu_options.visible_device_list``.
With the typical setup of one GPU per process, you can set this to *local rank*. In that case, the first process on
the server will be allocated the first GPU, the second process will be allocated the second GPU, and so forth.

3. Scale the learning rate by the number of workers.

Effective batch size in synchronous distributed training is scaled by the number of workers.
An increase in learning rate compensates for the increased batch size.

4. Wrap the optimizer in ``hvd.DistributedOptimizer``.

The distributed optimizer delegates gradient computation to the original optimizer, averages gradients using *allreduce* or *allgather*, and then applies those averaged gradients.

5. Add ``hvd.BroadcastGlobalVariablesHook(0)`` to broadcast initial variable states from rank 0 to all other processes.

This is necessary to ensure consistent initialization of all workers when training is started with random weights or restored from a checkpoint.
Alternatively, if you're not using ``MonitoredTrainingSession``, you can execute the ``hvd.broadcast_global_variables`` op after global variables have been initialized.

6. Modify your code to save checkpoints only on worker 0 to prevent other workers from corrupting them.

Accomplish this by passing ``checkpoint_dir=None`` to ``tf.train.MonitoredTrainingSession`` if ``hvd.rank() != 0``.

Example (see the `examples <examples/>`_ directory for full training examples):
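A minimal sketch of how these steps fit together in a TF1-style script follows; the toy model, learning rate, and paths are placeholders, and the repository's examples show complete, tested programs:

.. code-block:: python

    import tensorflow as tf
    import horovod.tensorflow as hvd

    # 1. Initialize Horovod.
    hvd.init()

    # 2. Pin one GPU per process using the local rank.
    config = tf.ConfigProto()
    config.gpu_options.visible_device_list = str(hvd.local_rank())

    # Toy model and loss standing in for a real network.
    features = tf.random_normal([32, 10])
    predictions = tf.layers.dense(features, 1)
    loss = tf.reduce_mean(tf.square(predictions))

    # 3. Scale the learning rate by the number of workers.
    opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())

    # 4. Wrap the optimizer so gradients are averaged across workers.
    opt = hvd.DistributedOptimizer(opt)
    global_step = tf.train.get_or_create_global_step()
    train_op = opt.minimize(loss, global_step=global_step)

    # 5. Broadcast initial variable states from rank 0 to all other processes.
    hooks = [hvd.BroadcastGlobalVariablesHook(0),
             tf.train.StopAtStepHook(last_step=1000)]

    # 6. Save checkpoints only on worker 0.
    checkpoint_dir = '/tmp/train_logs' if hvd.rank() == 0 else None

    with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                           config=config,
                                           hooks=hooks) as mon_sess:
        while not mon_sess.should_stop():
            mon_sess.run(train_op)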

@@ -177,28 +179,28 @@ Example (see the `examples <examples/>`_ directory for full training examples):
Running Horovod
---------------

The example commands below show how to run distributed training. See `Running Horovod <running.rst>`_
for more instructions, including RoCE/InfiniBand tweaks and tips for dealing with hangs.

1. To run on a machine with 4 GPUs:

.. code-block:: bash

    $ horovodrun -np 4 -H localhost:4 python train.py
2. To run on 4 machines with 4 GPUs each:

.. code-block:: bash

    $ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py

3. To run using Open MPI without the ``horovodrun`` wrapper, see `Running Horovod with Open MPI <mpirun.rst>`_.

4. To run in Docker, see `Horovod in Docker <docker.rst>`_.

5. To run in Kubernetes, see `Kubeflow <https://github.com/kubeflow/kubeflow/tree/master/kubeflow/mpi-job>`_, `MPI Operator <https://github.com/kubeflow/mpi-operator/>`_, `Helm Chart <https://github.com/kubernetes/charts/tree/master/stable/horovod/>`_, and `FfDL <https://github.com/IBM/FfDL/tree/master/etc/examples/horovod/>`_.

6. To run in Spark, see `Spark <spark.rst>`_.

Keras
-----
@@ -360,7 +362,7 @@ to batch small *allreduce* operations, which results in improved performance. We
See `here <tensor-fusion.rst>`__ for full details and tweaking instructions.


Analyzing Horovod performance
-----------------------------
Horovod has the ability to record the timeline of its activity, called Horovod Timeline.
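Recording one amounts to pointing the ``HOROVOD_TIMELINE`` environment variable at the file to create; a sketch of what that can look like with ``mpirun`` (the output path here is only an example), after which the JSON file can be opened in Chrome's ``chrome://tracing`` viewer:

.. code-block:: bash

    $ HOROVOD_TIMELINE=/tmp/timeline.json mpirun -np 4 -x HOROVOD_TIMELINE python train.py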

@@ -372,11 +374,13 @@ See `here <timeline.rst>`__ for full details and usage instructions.

Guides
------
1. Run distributed training in Microsoft Azure using `Batch AI and Horovod <https://github.com/Azure/BatchAI/tree/master/recipes/Horovod>`_.

Send us links to any user guides you want to publish on this site.

Troubleshooting
---------------
See `Troubleshooting <troubleshooting.rst>`_ and submit a `ticket <https://github.com/uber/horovod/issues/new>`_
if you can't find an answer.


@@ -1,4 +1,4 @@
User Guide
==========
.. include:: ./summary.rst
:start-after: inclusion-marker-start-do-not-remove
@@ -23,23 +23,23 @@ In the example above, you can see few tensors being reduced. There are two major

1. **Negotiation** - a phase when all workers send a signal to rank 0 that they're ready to reduce the given tensor.

* Each worker reporting readiness is represented by a tick under the ``NEGOTIATE_ALLREDUCE`` bar, so you can see which workers were early and which were late.

* Immediately after negotiation, rank 0 sends all other workers a signal to start reducing the tensor.

2. **Processing** - a phase when the operation actually happens. It is further subdivided into multiple sub-phases:

* ``WAIT_FOR_DATA`` indicates time taken to wait for the GPU to finish computing input to the **allreduce**, **allgather**, or **broadcast** operations. This happens because TensorFlow tries to smartly interleave scheduling and GPU computation. This is only applicable to situations where the Horovod operation is placed on the GPU.

* ``WAIT_FOR_OTHER_TENSOR_DATA`` indicates time taken to wait for the GPU to finish computing other inputs for other operations that are part of the same fusion batch.

* ``QUEUE`` happens when reduction is done with NCCL, and the previous NCCL operation has not finished yet.

* ``MEMCPY_IN_FUSION_BUFFER`` and ``MEMCPY_OUT_FUSION_BUFFER`` indicate time taken to copy data into and out of the fusion buffer.

* ``NCCL_ALLREDUCE``, ``MPI_ALLREDUCE``, ``MPI_ALLGATHER``, or ``MPI_BCAST`` indicate time taken to do the actual operation on the GPU (or CPU) and highlight whether the operation was performed using NCCL or pure MPI.

* In the case of ``HOROVOD_HIERARCHICAL_ALLREDUCE=1``, ``NCCL_ALLREDUCE`` will become a sequence or a subsequence of ``NCCL_REDUCESCATTER``, ``NCCL_REDUCE``, ``MEMCPY_IN_HOST_BUFFER``, ``MPI_ALLREDUCE``, ``MEMCPY_OUT_HOST_BUFFER``, ``NCCL_ALLGATHER``, ``NCCL_BCAST``.

Adding cycle markers
~~~~~~~~~~~~~~~~~~~~
