
Changes requested in review 1

IgorWilbert committed May 26, 2019
1 parent d8ca675 commit d2e1893f82e1e30c8c336d102e223cf09f482c1a
Showing with 112 additions and 120 deletions.
  1. +14 −14 README.rst
  2. +3 −3 docs/benchmarks.rst
  3. +11 −12 docs/docker.rst
  4. +9 −9 docs/gpus.rst
  5. +9 −11 docs/running.rst
  6. +16 −21 docs/spark.rst
  7. +7 −7 docs/tensor-fusion.rst
  8. +11 −11 docs/timeline.rst
  9. +32 −32 docs/troubleshooting.rst
@@ -69,7 +69,7 @@ servers with 4 Pascal GPUs each connected by RoCE-capable 25 Gbit/s network:
:alt: 512-GPU Benchmark

Horovod achieves 90% scaling efficiency for both Inception V3 and ResNet-101, and 68% scaling efficiency for VGG-16.
See the `Benchmarks <benchmarks_include.html>`_ page to find out how to reproduce these numbers.
See the `Benchmarks <https://github.com/horovod/horovod/blob/master/docs/benchmarks.rst>`_ page to find out how to reproduce these numbers.

While installing MPI and NCCL itself may seem like an extra hassle, it only needs to be done once by the team dealing
with infrastructure, while everyone else in the company who builds the models can enjoy the simplicity of training them at
@@ -88,20 +88,20 @@ downgrade to Open MPI 3.1.2 or upgrade to Open MPI 4.0.0.

2. Install the ``horovod`` pip package.

.. code-block:: python
.. code-block:: bash
pip install horovod
$ pip install horovod
This basic installation is good for laptops and for getting to know Horovod.
If you're installing Horovod on a server with GPUs, read the `Horovod on GPU <gpus_include.html>`_ page.
If you want to use Docker, read the `Horovod in Docker <docker_include.html>`_ page.
If you're installing Horovod on a server with GPUs, read the `Horovod on GPU <https://github.com/horovod/horovod/blob/master/docs/gpus.rst>`_ page.
If you want to use Docker, read the `Horovod in Docker <https://github.com/horovod/horovod/blob/master/docs/docker.rst>`_ page.


Concepts
--------

Horovod core principles are based on `MPI <http://mpi-forum.org/>`_ concepts such as *size*, *rank*,
*local rank*, *allreduce*, *allgather* and, *broadcast*. See `this page <concepts_include.html>`_ for more details.
*local rank*, *allreduce*, *allgather*, and *broadcast*. See `this page <https://github.com/horovod/horovod/blob/master/docs/concepts.rst>`_ for more details.
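
A minimal sketch of how these values surface in code, assuming the TensorFlow binding (the other framework bindings expose the same calls):

.. code-block:: python
import horovod.tensorflow as hvd

hvd.init()

# size: total number of Horovod processes in the job
# rank: unique id of this process across all servers
# local rank: unique id of this process within its own server
print(hvd.size(), hvd.rank(), hvd.local_rank())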


Usage
@@ -177,28 +177,28 @@ Example (see the `examples <examples/>`_ directory for full training examples):
Running Horovod
---------------

The example commands below show how to run distributed training. See the `Running Horovod <running_include.html>`_
The example commands below show how to run distributed training. See the `Running Horovod <https://github.com/horovod/horovod/blob/master/docs/running.rst>`_
page for more instructions, including RoCE/InfiniBand tweaks and tips for dealing with hangs.

1. To run on a machine with 4 GPUs:

.. code-block:: bash
horovodrun -np 4 -H localhost:4 python train.py
$ horovodrun -np 4 -H localhost:4 python train.py
2. To run on 4 machines with 4 GPUs each:

.. code-block:: bash
horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py
$ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py
3. To run using Open MPI without the ``horovodrun`` wrapper, see the `Running Horovod with Open MPI <docs/mpirun.rst>`_ page.

4. To run in Docker, see the `Horovod in Docker <docker_include.html>`_ page.
4. To run in Docker, see the `Horovod in Docker <https://github.com/horovod/horovod/blob/master/docs/docker.rst>`_ page.

5. To run in Kubernetes, see `Kubeflow <https://github.com/kubeflow/kubeflow/tree/master/kubeflow/mpi-job>`_, `MPI Operator <https://github.com/kubeflow/mpi-operator/>`_, `Helm Chart <https://github.com/kubernetes/charts/tree/master/stable/horovod/>`_, and `FfDL <https://github.com/IBM/FfDL/tree/master/etc/examples/horovod/>`_.

6. To run in Spark, see the `Spark <spark_include.html>`_ page.
6. To run in Spark, see the `Spark <https://github.com/horovod/horovod/blob/master/docs/spark.rst>`_ page.

Keras
-----
@@ -349,15 +349,15 @@ You can check for MPI multi-threading support by querying the ``hvd.mpi_threads_
Inference
---------
Learn how to optimize your model for inference and remove Horovod operations from the graph `here <inference_include.html>`_.
Learn how to optimize your model for inference and remove Horovod operations from the graph `here <https://github.com/horovod/horovod/blob/master/docs/inference.rst>`_.


Tensor Fusion
-------------
One of the unique things about Horovod is its ability to interleave communication and computation coupled with the ability
to batch small *allreduce* operations, which results in improved performance. We call this batching feature Tensor Fusion.

See `here <tensor-fusion_include.html>`__ for full details and tweaking instructions.
See `here <https://github.com/horovod/horovod/blob/master/docs/tensor-fusion.rst>`__ for full details and tweaking instructions.
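
As a quick illustration, the fusion buffer size can be tuned through an environment variable when launching training. This is a hedged sketch: the 32 MB value is an arbitrary example, and the exact knobs and defaults are described on the page linked above.

.. code-block:: bash
# HOROVOD_FUSION_THRESHOLD is specified in bytes; setting it to 0 disables Tensor Fusion.
$ HOROVOD_FUSION_THRESHOLD=33554432 horovodrun -np 4 -H localhost:4 python train.py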


Analyzing Horovod Performance
@@ -376,7 +376,7 @@ Guides

Troubleshooting
---------------
See the `Troubleshooting <troubleshooting_include.html>`_ page and please submit a `ticket <https://github.com/uber/horovod/issues/new>`_
See the `Troubleshooting <https://github.com/horovod/horovod/blob/master/docs/troubleshooting.rst>`_ page and please submit a `ticket <https://github.com/uber/horovod/issues/new>`_
if you can't find an answer.


@@ -10,12 +10,12 @@ Benchmarks
:alt: 512-GPU Benchmark


The above benchmark was done on 128 servers with 4 Pascal GPUs each connected by RoCE-capable 25 Gbit/s network. Horovod
The above benchmark was done on 128 servers with 4 Pascal GPUs each connected by a RoCE-capable 25 Gbit/s network. Horovod
achieves 90% scaling efficiency for both Inception V3 and ResNet-101, and 68% scaling efficiency for VGG-16.

To reproduce the benchmarks:

1. Install Horovod using the instructions provided on the `Horovod on GPU <gpus.md>`__ page.
1. Install Horovod using the instructions provided on the `Horovod on GPU <https://github.com/horovod/horovod/blob/master/docs/gpus.rst>`__ page.

2. Clone `https://github.com/tensorflow/benchmarks <https://github.com/tensorflow/benchmarks>`__

@@ -50,7 +50,7 @@ The benchmark instructions above are for the synthetic data benchmark.
To run the benchmark on real data, you need to download the `ImageNet dataset <http://image-net.org/download-images>`__
and convert it using the TFRecord `preprocessing script <https://github.com/tensorflow/models/blob/master/research/inception/inception/data/download_and_preprocess_imagenet.sh>`__.

Now, simply add `--data_dir /path/to/imagenet/tfrecords --data_name imagenet --num_batches=2000` to your training command:
Now, simply add ``--data_dir /path/to/imagenet/tfrecords --data_name imagenet --num_batches=2000`` to your training command:

.. code-block:: bash
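# A hedged sketch of the full command, assuming the tf_cnn_benchmarks layout from the
# cloned benchmarks repo and the 4x4 GPU setup used earlier; flags other than the three
# named above may need adjusting:
$ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 \
    python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
        --model resnet101 \
        --batch_size 64 \
        --variable_update horovod \
        --data_dir /path/to/imagenet/tfrecords \
        --data_name imagenet \
        --num_batches=2000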
@@ -3,16 +3,15 @@
Horovod in Docker
=================

To streamline the installation process on GPU machines, we have published the reference `Dockerfile <../Dockerfile>`__ so

you can get started with Horovod in minutes. The container includes `Examples <../examples>`__ in the `/examples`
To streamline the installation process on GPU machines, we have published the reference `Dockerfile <https://github.com/horovod/horovod/blob/master/Dockerfile>`__ so
you can get started with Horovod in minutes. The container includes `Examples <https://github.com/horovod/horovod/tree/master/examples>`__ in the ``/examples``
directory.

Pre-built docker containers with Horovod are available on `DockerHub <https://hub.docker.com/r/horovod/horovod>`__.
Pre-built Docker containers with Horovod are available on `DockerHub <https://hub.docker.com/r/horovod/horovod>`__.

**Building**

Before building, you can modify `Dockerfile` to your liking, e.g. select a different CUDA, TensorFlow or Python version.
Before building, you can modify ``Dockerfile`` to your liking, e.g. select a different CUDA, TensorFlow or Python version.

.. code-block:: bash
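# A hedged sketch of the build step, assuming it is run from the repository root where the
# Dockerfile lives; the image tag is arbitrary:
$ docker build -t horovod:latest .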
@@ -25,8 +24,8 @@ Before building, you can modify `Dockerfile` to your liking, e.g. select a diffe

After the container is built, run it using `nvidia-docker <https://github.com/NVIDIA/nvidia-docker>`__.

**Note**: you can replace `horovod:latest` with the `specific <https://hub.docker.com/r/horovod/horovod/tags>`__ pre-build
Docker container with Horovod instead of building it by yourself
**Note**: You can replace ``horovod:latest`` with a `specific <https://hub.docker.com/r/horovod/horovod/tags>`__ pre-built
Docker container with Horovod instead of building it yourself.

.. code-block:: bash
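# A hedged sketch for a single machine, giving the container access to all GPUs:
$ nvidia-docker run -it horovod:latest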
@@ -45,13 +44,13 @@ You can ignore this message.

**Running on multiple machines**

Here we describe a simple example involving a shared filesystem `/mnt/share` using a common port number `12345` for the SSH
daemon that will be run on all the containers. `/mnt/share/ssh` would contain a typical `id_rsa` and `authorized_keys`
Here we describe a simple example involving a shared filesystem ``/mnt/share`` using a common port number ``12345`` for the SSH
daemon that will be run on all the containers. ``/mnt/share/ssh`` would contain a typical ``id_rsa`` and ``authorized_keys``
pair that allows `passwordless authentication <http://www.linuxproblem.org/art_9.html>`__.

**Note**: These are not hard requirements but they make the example more concise. A shared filesystem can be replaced by
`rsync`ing SSH configuration and code across machines, and a common SSH port can be replaced by machine-specific ports
defined in `/root/.ssh/ssh_config` file.
**Note**: These are not hard requirements, but they make the example more concise. A shared filesystem can be replaced by ``rsync``-ing
SSH configuration and code across machines, and a common SSH port can be replaced by machine-specific ports
defined in the ``/root/.ssh/ssh_config`` file.

Primary worker:
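
A hedged sketch of what this could look like, assuming four hosts named host1-host4 with 4 GPUs each, the ``horovod:latest`` image built above, that ``horovodrun``'s ``-p`` flag selects the SSH port, and that ``keras_mnist_advanced.py`` is one of the bundled examples:

.. code-block:: bash
host1$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest
root@host1:/examples# horovodrun -np 16 -H host1:4,host2:4,host3:4,host4:4 -p 12345 python keras_mnist_advanced.py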

@@ -1,4 +1,4 @@
.. inclusion-marker-start-do-not-remove

Horovod on GPU
==============
@@ -8,15 +8,15 @@ To use Horovod on GPU, read the options below and see which one applies to you b

**Have GPUs?**

In most situations, using NCCL 2 will significantly improve performance over the CPU version. NCCL 2 provides the *allreduce*
In most situations, using NCCL 2 will significantly improve performance over the CPU version. NCCL 2 provides the **allreduce**
operation optimized for NVIDIA GPUs and a variety of networking devices, such as RoCE or InfiniBand.

1. Install `NCCL 2 <https://developer.nvidia.com/nccl>`__.

Steps to install NCCL 2 are listed `here <http://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html>`__.

If you have installed NCCL 2 using the `nccl-<version>.txz` package, you should add the library path to `LD_LIBRARY_PATH`
environment variable or register it in `/etc/ld.so.conf`.
If you have installed NCCL 2 using the ``nccl-<version>.txz`` package, you should add the library path to the ``LD_LIBRARY_PATH``
environment variable or register it in ``/etc/ld.so.conf``.

.. code-block:: bash
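# A hedged sketch, assuming NCCL 2 was unpacked into /usr/local/nccl-<version>:
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/nccl-<version>/lib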
@@ -28,7 +28,7 @@ by installing an `nv_peer_memory <https://github.com/Mellanox/nv_peer_memory>`__

`GPUDirect <https://developer.nvidia.com/gpudirect>`__ allows GPUs to transfer memory among each other without CPU
involvement, which significantly reduces latency and load on CPU. NCCL 2 is able to use GPUDirect automatically for
*allreduce* operation if it detects it.
the **allreduce** operation if it detects it.

3. Install `Open MPI <https://www.open-mpi.org/>`__ or another MPI implementation.

@@ -37,9 +37,9 @@ Steps to install Open MPI are listed `here <https://www.open-mpi.org/faq/?catego
**Note**: Open MPI 3.1.3 has an issue that may cause hangs. It is recommended
to downgrade to Open MPI 3.1.2 or upgrade to Open MPI 4.0.0.

4. Install the `horovod` pip package.
4. Install the ``horovod`` pip package.

If you have installed NCCL 2 using the `nccl-<version>.txz` package, you should specify the path to NCCL 2 using the `HOROVOD_NCCL_HOME`
If you have installed NCCL 2 using the ``nccl-<version>.txz`` package, you should specify the path to NCCL 2 using the ``HOROVOD_NCCL_HOME``
environment variable.

.. code-block:: bash
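# A hedged sketch, assuming NCCL 2 lives under /usr/local/nccl-<version>;
# HOROVOD_GPU_ALLREDUCE=NCCL is assumed here to select the NCCL allreduce build option:
$ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_NCCL_HOME=/usr/local/nccl-<version> pip install --no-cache-dir horovod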
@@ -55,7 +55,7 @@ If you have installed NCCL 2 using the Ubuntu package, you can simply run:
**Note**: Some models with a high computation to communication ratio benefit from doing allreduce on CPU, even if a
GPU version is available. To force allreduce to happen on CPU, pass `device_dense='/cpu:0'` to `hvd.DistributedOptimizer`:
GPU version is available. To force allreduce to happen on CPU, pass ``device_dense='/cpu:0'`` to ``hvd.DistributedOptimizer``:

.. code-block:: python
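# A minimal sketch based on the note above; `opt` is assumed to be the TensorFlow
# optimizer being wrapped for distributed training:
opt = hvd.DistributedOptimizer(opt, device_dense='/cpu:0')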
@@ -85,7 +85,7 @@ configure Horovod to use them as well:
**Note**: Allgather allocates an output tensor which is proportionate to the number of processes participating in the
training. If you find yourself running out of GPU memory, you can force allgather to happen on CPU by passing
`device_sparse='/cpu:0'` to `hvd.DistributedOptimizer`:
``device_sparse='/cpu:0'`` to ``hvd.DistributedOptimizer``:

.. code-block:: python
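# A minimal sketch based on the note above; `opt` is assumed to be the TensorFlow
# optimizer being wrapped:
opt = hvd.DistributedOptimizer(opt, device_sparse='/cpu:0')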
@@ -4,11 +4,11 @@
Running Horovod
===============

The examples below are for Open MPI and use `horovodrun`. Check your MPI documentation for arguments to the `mpirun`
The examples below are for Open MPI and use ``horovodrun``. Check your MPI documentation for arguments to the ``mpirun``
command on your system.

Typically one GPU will be allocated per process, so if a server has 4 GPUs, you would run 4 processes. In `horovodrun`,
the number of processes is specified with the `-np` flag.
Typically one GPU will be allocated per process, so if a server has 4 GPUs, you would run 4 processes. In ``horovodrun``,
the number of processes is specified with the ``-np`` flag.

1. To run on a machine with 4 GPUs:
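
For example (the same command shown in the README section above):

.. code-block:: bash
$ horovodrun -np 4 -H localhost:4 python train.py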

@@ -25,23 +25,21 @@ the number of processes is specified with the `-np` flag.
**Failures due to SSH issues**

The host where `horovodrun` is executed must be able to SSH to all other hosts without any prompts.
The host where ``horovodrun`` is executed must be able to SSH to all other hosts without any prompts.

If `horovodrun` fails with permission error, verify that you can ssh to every other server without entering a password or
If ``horovodrun`` fails with a permission error, verify that you can SSH to every other server without entering a password or
answering questions like this:


```
The authenticity of host '<hostname> (<ip address>)' can't be established.
RSA key fingerprint is xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx.
Are you sure you want to continue connecting (yes/no)?
```
``The authenticity of host '<hostname> (<ip address>)' can't be established.
RSA key fingerprint is xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx.
Are you sure you want to continue connecting (yes/no)?``


To learn more about setting up passwordless authentication, see `this page <http://www.linuxproblem.org/art_9.html>`__.

To avoid `The authenticity of host '<hostname> (<ip address>)' can't be established` prompts, add all the hosts to
the `~/.ssh/known_hosts` file using `ssh-keyscan`:
To avoid ``The authenticity of host '<hostname> (<ip address>)' can't be established`` prompts, add all the hosts to
the ``~/.ssh/known_hosts`` file using ``ssh-keyscan``:

.. code-block:: bash
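# A hedged sketch for the four-server example above; the key types accepted may vary by system:
$ ssh-keyscan -t rsa server1 server2 server3 server4 >> ~/.ssh/known_hosts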
@@ -3,7 +3,7 @@
Horovod in Spark
================

The `horovod.spark` package provides a convenient wrapper around Open
The ``horovod.spark`` package provides a convenient wrapper around Open
MPI that makes running Horovod jobs in Spark clusters easy.
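
At a high level, usage looks roughly like the sketch below. This is a hedged illustration only: the ``run`` entry point and the ``num_proc`` argument name are assumptions here, and the toy example referenced later on this page is the authoritative version.

.. code-block:: python
import horovod.spark

def train():
    # Regular Horovod training code (hvd.init(), model definition, etc.)
    # executes inside each Spark task.
    ...

# Launch `train` on 4 Spark tasks, one Horovod process per task.
horovod.spark.run(train, num_proc=4)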

In situations where training data originates from Spark, this enables
@@ -51,25 +51,20 @@ A toy example of running a Horovod job in Spark is provided below:
`keras_spark_rossmann.py script <../examples/keras_spark_rossmann.py>`__ provides
an example of end-to-end data preparation and training of a model for the
`Rossmann Store Sales <https://www.kaggle.com/c/rossmann-store-sales>`__ Kaggle
competition.
competition. It is inspired by an article `An Introduction to Deep Learning for Tabular Data <https://www.fast.ai/2018/04/29/categorical-embeddings/>`__
and leverages the code of the notebook referenced in the article. The example is split into three parts:
It is inspired by the article `An Introduction to Deep Learning for Tabular Data <https://www.fast.ai/2018/04/29/categorical-embeddings/>`__
and leverages the code of the notebook referenced in the article.
The example is split into three parts:
1. The first part performs complicated data preprocessing over an initial set
of CSV files provided by the competition and gathered by the community.
2. The second part defines a Keras model and performs a distributed training
of the model using Horovod in Spark.
3. The third part performs prediction using the best model and creates
a submission file.
#. The first part performs complicated data preprocessing over an initial set of CSV files provided by the competition and gathered by the community.
#. The second part defines a Keras model and performs distributed training of the model using Horovod in Spark.
#. The third part performs prediction using the best model and creates a submission file.
To run the example, please install the following dependencies:
* `pyspark`
* `petastorm >= 0.7.0`
* `h5py >= 2.9.0`
* `tensorflow-gpu >= 1.12.0` (or `tensorflow >= 1.12.0`)
* `horovod >= 0.15.3`
* ``pyspark``
* ``petastorm >= 0.7.0``
* ``h5py >= 2.9.0``
* ``tensorflow-gpu >= 1.12.0`` (or ``tensorflow >= 1.12.0``)
* ``horovod >= 0.15.3``
Run the example:
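
A hedged sketch, assuming the script and the competition CSV files have been downloaded into the current directory:

.. code-block:: bash
$ python keras_spark_rossmann.py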
@@ -90,7 +85,7 @@ for DL Spark cluster setup.
**GPU training**
For GPU training, one approach is to set up a separate GPU Spark cluster
and configure each executor with `# of CPU cores` = `# of GPUs`. This can
and configure each executor with ``# of CPU cores`` = ``# of GPUs``. This can
be accomplished in standalone mode as follows:
.. code-block:: bash
@@ -99,15 +94,15 @@ be accomplished in standalone mode as follows:
$ /path/to/spark/sbin/start-all.sh
This approach turns the `spark.task.cpus` setting to control # of GPUs
This approach uses the ``spark.task.cpus`` setting to control the number of GPUs
requested per process (defaults to 1).
The ongoing `SPARK-24615 <https://issues.apache.org/jira/browse/SPARK-24615>`__ effort aims to
introduce GPU-aware resource scheduling in future versions of Spark.
**CPU training**
For CPU training, one approach is to specify the `spark.task.cpus` setting
For CPU training, one approach is to specify the ``spark.task.cpus`` setting
during the training session creation:
.. code-block:: python
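# A hedged sketch of creating the training session with a larger spark.task.cpus value;
# the master URL and the core count are placeholders:
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().setAppName('training') \
    .setMaster('spark://training-master:7077') \
    .set('spark.task.cpus', '16')
spark = SparkSession.builder.config(conf=conf).getOrCreate()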
@@ -132,7 +127,7 @@ security to isolate Horovod jobs from potential attackers**.
**Environment knobs**
* `HOROVOD_SPARK_START_TIMEOUT` - sets the default timeout for Spark tasks to spawn, register, and start running the code. If executors for Spark tasks are scheduled on-demand and can take a long time to start, it may be useful to increase this timeout on a system level.
* ``HOROVOD_SPARK_START_TIMEOUT`` - sets the default timeout for Spark tasks to spawn, register, and start running the code. If executors for Spark tasks are scheduled on-demand and can take a long time to start, it may be useful to increase this timeout on a system level.
.. inclusion-marker-end-do-not-remove
