
Docs migration to Sphinx (#1088)

* rst migration: concepts

Signed-off-by: Igor Wilbert <igor.wilbert@gmail.com>

* rst migration: running horovod

Signed-off-by: Igor Wilbert <igor.wilbert@gmail.com>

* rst migration: benchmarks

Signed-off-by: Igor Wilbert <igor.wilbert@gmail.com>

* rst migration: docker

Signed-off-by: Igor Wilbert <igor.wilbert@gmail.com>

* rst migration: gpus

Signed-off-by: Igor Wilbert <igor.wilbert@gmail.com>

* rst migration: inference

Signed-off-by: Igor Wilbert <igor.wilbert@gmail.com>

* rst migration: spark

Signed-off-by: Igor Wilbert <igor.wilbert@gmail.com>

* rst migration: tensor-fusion

Signed-off-by: Igor Wilbert <igor.wilbert@gmail.com>

* rst migration: timeline

Signed-off-by: Igor Wilbert <igor.wilbert@gmail.com>

* rst migration: troubleshooting

Signed-off-by: Igor Wilbert <igor.wilbert@gmail.com>

* Changes requested in review 1

Signed-off-by: Igor Wilbert <igor.wilbert@gmail.com>

* Changes requested in review 2

Signed-off-by: Igor Wilbert <igor.wilbert@gmail.com>
IgorWilbert authored and alsrgv committed May 29, 2019
1 parent 45755f6 commit 6f400014b8cb45aa013077aad0060032a4dda713
@@ -69,7 +69,7 @@ servers with 4 Pascal GPUs each connected by RoCE-capable 25 Gbit/s network:
:alt: 512-GPU Benchmark

Horovod achieves 90% scaling efficiency for both Inception V3 and ResNet-101, and 68% scaling efficiency for VGG-16.
See the `Benchmarks <docs/benchmarks.md>`_ page to find out how to reproduce these numbers.
See the `Benchmarks <docs/benchmarks.rst>`_ page to find out how to reproduce these numbers.

While installing MPI and NCCL itself may seem like an extra hassle, it only needs to be done once by the team dealing
with infrastructure, while everyone else in the company who builds the models can enjoy the simplicity of training them at
@@ -88,20 +88,20 @@ downgrade to Open MPI 3.1.2 or upgrade to Open MPI 4.0.0.

2. Install the ``horovod`` pip package.

.. code-block:: python
.. code-block:: bash
pip install horovod
$ pip install horovod
This basic installation is good for laptops and for getting to know Horovod.
If you're installing Horovod on a server with GPUs, read the `Horovod on GPU <docs/gpus.md>`_ page.
If you want to use Docker, read the `Horovod in Docker <docs/docker.md>`_ page.
If you're installing Horovod on a server with GPUs, read the `Horovod on GPU <docs/gpus.rst>`_ page.
If you want to use Docker, read the `Horovod in Docker <docs/docker.rst>`_ page.


Concepts
--------

Horovod core principles are based on `MPI <http://mpi-forum.org/>`_ concepts such as *size*, *rank*,
*local rank*, *allreduce*, *allgather* and, *broadcast*. See `this page <docs/concepts.md>`_ for more details.
*local rank*, **allreduce**, **allgather**, and *broadcast*. See `this page <docs/concepts.rst>`_ for more details.


Usage
@@ -119,7 +119,7 @@ To use Horovod, make the following additions to your program:
the number of workers. An increase in learning rate compensates for the increased batch size.

4. Wrap optimizer in ``hvd.DistributedOptimizer``. The distributed optimizer delegates gradient computation
to the original optimizer, averages gradients using *allreduce* or *allgather*, and then applies those averaged
to the original optimizer, averages gradients using **allreduce** or **allgather**, and then applies those averaged
gradients.

5. Add ``hvd.BroadcastGlobalVariablesHook(0)`` to broadcast initial variable states from rank 0 to all other processes.
@@ -131,7 +131,7 @@ To use Horovod, make the following additions to your program:
This can be accomplished by passing ``checkpoint_dir=None`` to ``tf.train.MonitoredTrainingSession`` if
``hvd.rank() != 0``.

Example (see the `examples <examples/>`_ directory for full training examples):
Example (see the `examples <https://github.com/horovod/horovod/blob/master/examples/>`_ directory for full training examples):

.. code-block:: python
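
    # A condensed, illustrative sketch of the steps above (not the repository's
    # full example; see the examples/ directory for complete, tested scripts).
    # The toy model below is an assumption made here for brevity.
    import tensorflow as tf
    import horovod.tensorflow as hvd

    # Initialize Horovod.
    hvd.init()

    # Pin each process to a single GPU, selected by its local rank.
    config = tf.ConfigProto()
    config.gpu_options.visible_device_list = str(hvd.local_rank())

    # Toy model: fit y = 2x with a single weight.
    x = tf.random_normal([32, 1])
    w = tf.get_variable('w', initializer=[[0.0]])
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - 2.0 * x))

    # Step 3: scale the learning rate by the number of workers.
    opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())

    # Step 4: wrap the optimizer so gradients are averaged with allreduce.
    opt = hvd.DistributedOptimizer(opt)
    global_step = tf.train.get_or_create_global_step()
    train_op = opt.minimize(loss, global_step=global_step)

    # Step 5: broadcast initial variable states from rank 0 to all processes.
    hooks = [hvd.BroadcastGlobalVariablesHook(0)]

    # Checkpoint only on rank 0 so workers do not corrupt each other's files.
    checkpoint_dir = '/tmp/train_logs' if hvd.rank() == 0 else None

    with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                           config=config,
                                           hooks=hooks) as mon_sess:
        for _ in range(100):
            mon_sess.run(train_op)
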
@@ -177,34 +177,34 @@ Example (see the `examples <examples/>`_ directory for full training examples):
Running Horovod
---------------

The example commands below show how to run distributed training. See the `Running Horovod <docs/running.md>`_
The example commands below show how to run distributed training. See the `Running Horovod <docs/running.rst>`_
page for more instructions, including RoCE/InfiniBand tweaks and tips for dealing with hangs.

1. To run on a machine with 4 GPUs:

.. code-block:: bash
horovodrun -np 4 -H localhost:4 python train.py
$ horovodrun -np 4 -H localhost:4 python train.py
2. To run on 4 machines with 4 GPUs each:

.. code-block:: bash
horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py
$ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py
3. To run using Open MPI without the ``horovodrun`` wrapper, see the `Running Horovod with Open MPI <docs/mpirun.rst>`_ page.

4. To run in Docker, see the `Horovod in Docker <docs/docker.md>`_ page.
4. To run in Docker, see the `Horovod in Docker <docs/docker.rst>`_ page.

5. To run in Kubernetes, see `Kubeflow <https://github.com/kubeflow/kubeflow/tree/master/kubeflow/mpi-job>`_, `MPI Operator <https://github.com/kubeflow/mpi-operator/>`_, `Helm Chart <https://github.com/kubernetes/charts/tree/master/stable/horovod/>`_, and `FfDL <https://github.com/IBM/FfDL/tree/master/etc/examples/horovod/>`_.

6. To run in Spark, see the `Spark <docs/spark.md>`_ page.
6. To run in Spark, see the `Spark <docs/spark.rst>`_ page.

Keras
-----
Horovod supports Keras and regular TensorFlow in similar ways.

See full training `simple <examples/keras_mnist.py>`_ and `advanced <examples/keras_mnist_advanced.py>`_ examples.
See full training `simple <https://github.com/horovod/horovod/blob/master/examples/keras_mnist.py>`_ and `advanced <https://github.com/horovod/horovod/blob/master/examples/keras_mnist_advanced.py>`_ examples.

**Note**: Keras 2.0.9 has a `known issue <https://github.com/fchollet/keras/issues/8353>`_ that makes each worker allocate
all GPUs on the server, instead of the GPU assigned by the *local rank*. If you have multiple GPUs per server, upgrade
@@ -221,7 +221,7 @@ MXNet
-----
Horovod supports MXNet and regular TensorFlow in similar ways.

See full training `MNIST <examples/mxnet_mnist.py>`_ and `ImageNet <examples/mxnet_imagenet_resnet50.py>`_ examples. The script below provides a simple skeleton of code block based on MXNet Gluon API.
See full training `MNIST <https://github.com/horovod/horovod/blob/master/examples/mxnet_mnist.py>`_ and `ImageNet <https://github.com/horovod/horovod/blob/master/examples/mxnet_imagenet_resnet50.py>`_ examples. The script below provides a simple code skeleton based on the MXNet Gluon API.

.. code-block:: python
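
    # A condensed, illustrative sketch of Horovod with the MXNet Gluon API
    # (not the full example; see the MNIST and ImageNet scripts linked above).
    # The toy model and data are assumptions made here for brevity.
    import mxnet as mx
    from mxnet import autograd, gluon
    import horovod.mxnet as hvd

    # Initialize Horovod and pin one GPU per process by local rank,
    # falling back to CPU when no GPU is available.
    hvd.init()
    ctx = mx.gpu(hvd.local_rank()) if mx.context.num_gpus() > 0 else mx.cpu()

    # Toy model: fit y = 2x with a single dense layer.
    net = gluon.nn.Dense(1, in_units=1)
    net.initialize(mx.init.Xavier(), ctx=ctx)
    loss_fn = gluon.loss.L2Loss()

    # Scale the learning rate by the number of workers.
    opt = mx.optimizer.SGD(learning_rate=0.01 * hvd.size())

    # Broadcast initial parameters from rank 0, then wrap the trainer so
    # gradients are averaged across workers.
    params = net.collect_params()
    hvd.broadcast_parameters(params, root_rank=0)
    trainer = hvd.DistributedTrainer(params, opt)

    batch_size = 32
    for _ in range(100):
        x = mx.nd.random.normal(shape=(batch_size, 1), ctx=ctx)
        y = 2 * x
        with autograd.record():
            loss = loss_fn(net(x), y)
        loss.backward()
        trainer.step(batch_size)
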
@@ -349,15 +349,15 @@ You can check for MPI multi-threading support by querying the ``hvd.mpi_threads_
Inference
---------
Learn how to optimize your model for inference and remove Horovod operations from the graph `here <docs/inference.md>`_.
Learn how to optimize your model for inference and remove Horovod operations from the graph `here <docs/inference.rst>`_.


Tensor Fusion
-------------
One of the unique things about Horovod is its ability to interleave communication and computation coupled with the ability
to batch small *allreduce* operations, which results in improved performance. We call this batching feature Tensor Fusion.
to batch small **allreduce** operations, which results in improved performance. We call this batching feature Tensor Fusion.

See `here <docs/tensor-fusion.md>`__ for full details and tweaking instructions.
See `here <docs/tensor-fusion.rst>`__ for full details and tweaking instructions.


Analyzing Horovod Performance
@@ -367,7 +367,7 @@ Horovod has the ability to record the timeline of its activity, called Horovod T
.. image:: https://user-images.githubusercontent.com/16640218/29735271-9e148da0-89ac-11e7-9ae0-11d7a099ac89.png
:alt: Horovod Timeline

See `here <docs/timeline.md>`__ for full details and usage instructions.
See `here <docs/timeline.rst>`__ for full details and usage instructions.


Guides
@@ -376,7 +376,7 @@ Guides

Troubleshooting
---------------
See the `Troubleshooting <docs/troubleshooting.md>`_ page and please submit a `ticket <https://github.com/uber/horovod/issues/new>`_
See the `Troubleshooting <docs/troubleshooting.rst>`_ page and please submit a `ticket <https://github.com/uber/horovod/issues/new>`_
if you can't find an answer.


This file was deleted.

@@ -0,0 +1,68 @@

.. inclusion-marker-start-do-not-remove
Benchmarks
==========


.. image:: https://user-images.githubusercontent.com/16640218/38965607-bf5c46ca-4332-11e8-895a-b9c137e86013.png
:alt: 512-GPU Benchmark


The above benchmark was done on 128 servers with 4 Pascal GPUs each connected by a RoCE-capable 25 Gbit/s network. Horovod
achieves 90% scaling efficiency for both Inception V3 and ResNet-101, and 68% scaling efficiency for VGG-16.
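
Scaling efficiency here is understood as the measured aggregate throughput relative to perfect linear scaling of the single-GPU throughput; the short sketch below spells out that arithmetic with made-up placeholder numbers, not figures from the benchmark above.

.. code-block:: python

    def scaling_efficiency(total_images_per_sec, single_gpu_images_per_sec, num_gpus):
        """Measured aggregate throughput divided by ideal linear throughput."""
        return total_images_per_sec / (single_gpu_images_per_sec * num_gpus)

    # Hypothetical numbers for illustration only:
    print(scaling_efficiency(total_images_per_sec=1475.0,
                             single_gpu_images_per_sec=102.4,
                             num_gpus=16))  # ~0.90, i.e. 90% efficiency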

To reproduce the benchmarks:

1. Install Horovod using the instructions provided on the `Horovod on GPU <https://github.com/horovod/horovod/blob/master/docs/gpus.rst>`__ page.

2. Clone `https://github.com/tensorflow/benchmarks <https://github.com/tensorflow/benchmarks>`__

.. code-block:: bash

    $ git clone https://github.com/tensorflow/benchmarks
    $ cd benchmarks

3. Run the benchmark. Examples below are for Open MPI.

.. code-block:: bash

    $ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 \
        python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
            --model resnet101 \
            --batch_size 64 \
            --variable_update horovod

4. At the end of the run, you will see the number of images processed per second:

.. code-block:: bash

    total images/sec: 1656.82

**Real data benchmarks**

The benchmark instructions above are for the synthetic data benchmark.

To run the benchmark on real data, you need to download the `ImageNet dataset <http://image-net.org/download-images>`__
and convert it using the TFRecord `preprocessing script <https://github.com/tensorflow/models/blob/master/research/inception/inception/data/download_and_preprocess_imagenet.sh>`__.

Now, simply add ``--data_dir /path/to/imagenet/tfrecords --data_name imagenet --num_batches=2000`` to your training command:

.. code-block:: bash

    $ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 \
        python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
            --model resnet101 \
            --batch_size 64 \
            --variable_update horovod \
            --data_dir /path/to/imagenet/tfrecords \
            --data_name imagenet \
            --num_batches=2000

.. inclusion-marker-end-do-not-remove
@@ -0,0 +1,3 @@
.. include:: ./benchmarks.rst
:start-after: inclusion-marker-start-do-not-remove
:end-before: inclusion-marker-end-do-not-remove

This file was deleted.

@@ -0,0 +1,35 @@

.. inclusion-marker-start-do-not-remove
Concepts
========

Horovod core principles are based on the `MPI <http://mpi-forum.org/>`_ concepts *size*, *rank*,
*local rank*, *allreduce*, *allgather*, and *broadcast*. These are best explained by example (see the short sketch after this list). Say we launched
a training script on 4 servers, each having 4 GPUs. If we launched one copy of the script per GPU:

* *Size* would be the number of processes, in this case, 16.

* *Rank* would be the unique process ID from 0 to 15 (*size* - 1).

* *Local rank* would be the unique process ID within the server from 0 to 3.

* *Allreduce* is an operation that aggregates data among multiple processes and distributes results back to them. *Allreduce* is used to average dense tensors. Here's an illustration from the `MPI Tutorial <http://mpitutorial.com/tutorials/mpi-reduce-and-allreduce/>`__:

.. image:: http://mpitutorial.com/tutorials/mpi-reduce-and-allreduce/mpi_allreduce_1.png
:alt: Allreduce Illustration

* *Allgather* is an operation that gathers data from all processes on every process. *Allgather* is used to collect values of sparse tensors. Here's an illustration from the `MPI Tutorial <http://mpitutorial.com/tutorials/mpi-scatter-gather-and-allgather/>`__:

.. image:: http://mpitutorial.com/tutorials/mpi-scatter-gather-and-allgather/allgather.png
:alt: Allgather Illustration


* *Broadcast* is an operation that broadcasts data from one process, identified by root rank, onto every other process. Here's an illustration from the `MPI Tutorial <http://mpitutorial.com/tutorials/mpi-broadcast-and-collective-communication/>`__:

.. image:: http://mpitutorial.com/tutorials/mpi-broadcast-and-collective-communication/broadcast_pattern.png
:alt: Broadcast Illustration
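
To get a quick feel for *size*, *rank*, and *local rank*, the short sketch below prints the values each process sees. The ``horovod.tensorflow`` import is just one choice for illustration; the other framework modules expose the same functions.

.. code-block:: python

    import horovod.tensorflow as hvd

    hvd.init()

    # Launched as one copy per GPU on 4 servers with 4 GPUs each, every process
    # prints size=16, a rank in 0..15, and a local rank in 0..3.
    print('size=%d rank=%d local_rank=%d'
          % (hvd.size(), hvd.rank(), hvd.local_rank()))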


.. inclusion-marker-end-do-not-remove
@@ -0,0 +1,3 @@
.. include:: ./concepts.rst
:start-after: inclusion-marker-start-do-not-remove
:end-before: inclusion-marker-end-do-not-remove
@@ -41,6 +41,7 @@
'sphinx.ext.autodoc',
'sphinx.ext.viewcode',
'sphinxcontrib.napoleon',
'nbsphinx',
]

# Add any paths that contain templates here, relative to this directory.
