Commit

Adopt horovodrun (#924)
* Adopt horovodrun

Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>

* Update docs

Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>

* Add newline to split paragraphs

Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
alsrgv committed Mar 18, 2019
1 parent f3b1186 commit 420104b
Showing 9 changed files with 24 additions and 186 deletions.
23 changes: 4 additions & 19 deletions Dockerfile
@@ -4,7 +4,7 @@ FROM nvidia/cuda:9.0-devel-ubuntu16.04
ENV TENSORFLOW_VERSION=1.12.0
ENV PYTORCH_VERSION=1.0.0
ENV CUDNN_VERSION=7.4.1.5-1+cuda9.0
ENV NCCL_VERSION=2.4.2-1+cuda9.0
ENV NCCL_VERSION=2.3.7-1+cuda9.0
ENV MXNET_URL=https://s3-us-west-2.amazonaws.com/mxnet-python-packages-gcc5/mxnet_cu90_gcc5-1.4.0-py2.py3-none-manylinux1_x86_64.whl

# Python 2.7 or 3.5 is supported by Ubuntu Xenial out of the box
@@ -39,9 +39,9 @@ RUN pip install 'numpy<1.15.0' tensorflow-gpu==${TENSORFLOW_VERSION} keras h5py
# Install Open MPI
RUN mkdir /tmp/openmpi && \
cd /tmp/openmpi && \
wget https://www.open-mpi.org/software/ompi/v3.1/downloads/openmpi-3.1.2.tar.gz && \
tar zxf openmpi-3.1.2.tar.gz && \
cd openmpi-3.1.2 && \
wget https://www.open-mpi.org/software/ompi/v3.1/downloads/openmpi-4.0.0.tar.gz && \
tar zxf openmpi-4.0.0.tar.gz && \
cd openmpi-4.0.0 && \
./configure --enable-orterun-prefix-by-default && \
make -j $(nproc) all && \
make install && \
@@ -53,21 +53,6 @@ RUN ldconfig /usr/local/cuda-9.0/targets/x86_64-linux/lib/stubs && \
HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITH_MXNET=1 pip install --no-cache-dir horovod && \
ldconfig

# Create a wrapper for OpenMPI to allow running as root by default
RUN mv /usr/local/bin/mpirun /usr/local/bin/mpirun.real && \
echo '#!/bin/bash' > /usr/local/bin/mpirun && \
echo 'mpirun.real --allow-run-as-root "$@"' >> /usr/local/bin/mpirun && \
chmod a+x /usr/local/bin/mpirun

# Configure OpenMPI to run good defaults:
# --bind-to none --map-by slot --mca btl_tcp_if_exclude lo,docker0
RUN echo "hwloc_base_binding_policy = none" >> /usr/local/etc/openmpi-mca-params.conf && \
echo "rmaps_base_mapping_policy = slot" >> /usr/local/etc/openmpi-mca-params.conf && \
echo "btl_tcp_if_exclude = lo,docker0" >> /usr/local/etc/openmpi-mca-params.conf

# Set default NCCL parameters
RUN echo NCCL_DEBUG=INFO >> /etc/nccl.conf

# Install OpenSSH for MPI to communicate between containers
RUN apt-get install -y --no-install-recommends openssh-client openssh-server && \
mkdir -p /var/run/sshd
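The removed blocks above baked `mpirun`-friendly defaults into the image; `horovodrun` is expected to supply equivalent settings at launch time. If you still invoke `mpirun` directly inside the container, here is a minimal sketch of restoring those defaults by hand (an assumption: the image keeps Open MPI under the standard `/usr/local` prefix):

```bash
# Re-create the MCA defaults the old Dockerfile baked in; per the comments
# above, these are equivalent to passing --bind-to none --map-by slot and
# excluding the non-routed lo/docker0 interfaces on every mpirun invocation.
echo "hwloc_base_binding_policy = none" >> /usr/local/etc/openmpi-mca-params.conf
echo "rmaps_base_mapping_policy = slot" >> /usr/local/etc/openmpi-mca-params.conf
echo "btl_tcp_if_exclude = lo,docker0" >> /usr/local/etc/openmpi-mca-params.conf

# Keep NCCL debug logging on by default, as the old image did.
# (The removed root-user wrapper simply appended --allow-run-as-root to every call.)
echo "NCCL_DEBUG=INFO" >> /etc/nccl.conf
```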
14 changes: 2 additions & 12 deletions README.md
@@ -164,23 +164,13 @@ page for more instructions, including RoCE/InfiniBand tweaks and tips for dealin
1. To run on a machine with 4 GPUs:

```bash
$ mpirun -np 4 \
-H localhost:4 \
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-mca pml ob1 -mca btl ^openib \
python train.py
$ horovodrun -np 4 -H localhost:4 python train.py
```

2. To run on 4 machines with 4 GPUs each:

```bash
$ mpirun -np 16 \
-H server1:4,server2:4,server3:4,server4:4 \
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-mca pml ob1 -mca btl ^openib \
python train.py
$ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py
```

3. To run in Docker, see the [Horovod in Docker](docs/docker.md) page.
14 changes: 2 additions & 12 deletions docs/benchmarks.md
@@ -19,12 +19,7 @@ $ cd benchmarks
3. Run the benchmark. Examples below are for Open MPI.

```bash
$ mpirun -np 16 \
-H server1:4,server2:4,server3:4,server4:4 \
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-mca pml ob1 -mca btl ^openib \
\
$ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 \
python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
--model resnet101 \
--batch_size 64 \
@@ -47,12 +42,7 @@ and convert it using the TFRecord [preprocessing script](https://github.com/tens
Now, simply add `--data_dir /path/to/imagenet/tfrecords --data_name imagenet --num_batches=2000` to your training command:

```bash
$ mpirun -np 16 \
-H server1:4,server2:4,server3:4,server4:4 \
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-mca pml ob1 -mca btl ^openib \
\
$ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 \
python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
--model resnet101 \
--batch_size 64 \
10 changes: 2 additions & 8 deletions docs/docker.md
@@ -25,14 +25,9 @@ Docker container with Horovod instead of building it by yourself

```bash
$ nvidia-docker run -it horovod:latest
root@c278c88dd552:/examples# mpirun -np 4 -H localhost:4 python keras_mnist_advanced.py
root@c278c88dd552:/examples# horovodrun -np 4 -H localhost:4 python keras_mnist_advanced.py
```

This command omits the options recommended in other parts of the documentation. The
`-bind-to none -map-by slot -x NCCL_DEBUG=INFO` options are already set by default in the Docker container, so
you don't need to repeat them in the command. The `-x LD_LIBRARY_PATH -x PATH` options are not necessary because all
the software is assumed to be installed in the default system location in this Docker image.

If you don't run your container in privileged mode, you may see the following message:

```
@@ -55,8 +50,7 @@ Primary worker:

```bash
host1$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest
root@c278c88dd552:/examples# mpirun -np 16 -H host1:4,host2:4,host3:4,host4:4 \
-mca plm_rsh_args "-p 12345" python keras_mnist_advanced.py
root@c278c88dd552:/examples# horovodrun -np 16 -H host1:4,host2:4,host3:4,host4:4 -p 12345 python keras_mnist_advanced.py
```

Secondary workers:
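The secondary-worker command is truncated above. A hypothetical sketch of what each secondary worker would run, assuming the `openssh-server` installed by this image provides `/usr/sbin/sshd` and should listen on the same non-default port 12345 that the primary passes via `-p`:

```bash
# Hypothetical: start an SSH daemon on port 12345 so the primary worker can
# launch training processes in this container, then keep the container alive.
host2$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
    bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
```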
115 changes: 8 additions & 107 deletions docs/running.md
@@ -1,70 +1,28 @@
## Running Horovod

The examples below are for Open MPI. Check your MPI documentation for arguments to the `mpirun` command on your system.
The examples below are for Open MPI and use `horovodrun`. Check your MPI documentation for arguments to the `mpirun`
command on your system.

Typically one GPU will be allocated per process, so if a server has 4 GPUs, you would run 4 processes. In Open MPI,
Typically one GPU will be allocated per process, so if a server has 4 GPUs, you would run 4 processes. In `horovodrun`,
the number of processes is specified with the `-np` flag.

Starting with Open MPI 3, it's important to add the `-bind-to none` and `-map-by slot` arguments. `-bind-to none`
tells Open MPI not to bind a training process to a single CPU core (which would hurt performance). `-map-by slot`
allows you to have a mixture of different NUMA configurations, because the default behavior is to bind to the socket.

The `-mca pml ob1` and `-mca btl ^openib` flags force the use of TCP for MPI communication. This avoids many multiprocessing
issues that Open MPI has with RDMA, which typically result in segmentation faults. Using TCP for MPI does not have a
noticeable performance impact, since most of the heavy communication is done by NCCL, which will use RDMA via RoCE or
InfiniBand if they're available (see [Horovod on GPU](gpus.md)). Notable exceptions to this rule are models that heavily
use `hvd.broadcast()` and `hvd.allgather()` operations. To make those operations use RDMA, read the [Open MPI with RDMA](#open-mpi-with-rdma)
section below.

With the `-x` option you can specify (`-x NCCL_DEBUG=INFO`) or copy (`-x LD_LIBRARY_PATH`) an environment variable to all
the workers.

1. To run on a machine with 4 GPUs:

```bash
$ mpirun -np 4 \
-H localhost:4 \
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-mca pml ob1 -mca btl ^openib \
python train.py
$ horovodrun -np 4 -H localhost:4 python train.py
```

2. To run on 4 machines with 4 GPUs each:

```bash
$ mpirun -np 16 \
-H server1:4,server2:4,server3:4,server4:4 \
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-mca pml ob1 -mca btl ^openib \
python train.py
```

### Open MPI with RDMA

As noted above, using TCP for MPI communication does not significantly affect performance in the majority of cases.
Models that make heavy use of `hvd.broadcast()` and `hvd.allgather()` operations are exceptions to that rule.

The default Open MPI `openib` BTL that provides RDMA functionality does not work well with MPI multithreading. In order to use
RDMA with `openib`, multithreading must be disabled via the `-x HOROVOD_MPI_THREADS_DISABLE=1` option. See the example below:

```bash
$ mpirun -np 16 \
-H server1:4,server2:4,server3:4,server4:4 \
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x HOROVOD_MPI_THREADS_DISABLE=1 -x PATH \
-mca pml ob1 \
python train.py
$ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py
```

Other MPI RDMA implementations may or may not benefit from disabling multithreading, so please consult vendor documentation.
### Failures due to SSH issues

### Hangs due to SSH issues
The host where `horovodrun` is executed must be able to SSH to all other hosts without any prompts.

The host where `mpirun` is executed must be able to SSH to all other hosts without any prompts.

If `mpirun` hangs without any output, verify that you can ssh to every other server without entering a password or
If `horovodrun` fails with a permission error, verify that you can ssh to every other server without entering a password or
answering questions like this:

```
@@ -81,60 +39,3 @@ the `~/.ssh/known_hosts` file using `ssh-keyscan`:
```bash
$ ssh-keyscan -t rsa,dsa server1 server2 > ~/.ssh/known_hosts
```

### Hangs due to non-routed network interfaces

Having network interfaces that are not routed can cause Open MPI to hang. An example of such an interface is `docker0`.

If you see non-routed interfaces (like `docker0`) in the output of `ifconfig`, you should tell Open MPI and NCCL not to
use them via the `-mca btl_tcp_if_exclude <interface>[,<interface>]` and `NCCL_SOCKET_IFNAME=^<interface>[,<interface>]`
parameters.

```bash
$ ifconfig
docker0 Link encap:Ethernet HWaddr 02:42:2d:17:ea:66
inet addr:172.17.0.1 Bcast:0.0.0.0 Mask:255.255.0.0
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)

eth0 Link encap:Ethernet HWaddr 24:8a:07:b3:7d:8b
inet addr:10.0.0.1 Bcast:10.0.0.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:900002410 errors:0 dropped:405 overruns:0 frame:0
TX packets:1521598641 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:376184431726 (350.3 GiB) TX bytes:954933846124 (889.3 GiB)

eth1 Link encap:Ethernet HWaddr 24:8a:07:b3:7d:8a
inet addr:192.168.0.1 Bcast:192.168.0.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:2410141 errors:0 dropped:0 overruns:0 frame:0
TX packets:2312177 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:698398061 (666.0 MiB) TX bytes:458504418 (437.2 MiB)

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:497075633 errors:0 dropped:0 overruns:0 frame:0
TX packets:497075633 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:72680421398 (67.6 GiB) TX bytes:72680421398 (67.6 GiB)
```

For example:

```bash
$ mpirun -np 16 \
-H server1:4,server2:4,server3:4,server4:4 \
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-x NCCL_SOCKET_IFNAME=^lo,docker0 \
-mca pml ob1 -mca btl ^openib \
-mca btl_tcp_if_exclude lo,docker0 \
python train.py
```
6 changes: 3 additions & 3 deletions docs/tensor-fusion.md
@@ -18,17 +18,17 @@ one reduction operation. The algorithm of Tensor Fusion is as follows:
The fusion buffer size can be tweaked using the `HOROVOD_FUSION_THRESHOLD` environment variable:

```bash
$ HOROVOD_FUSION_THRESHOLD=33554432 mpirun -np 4 -x HOROVOD_FUSION_THRESHOLD python train.py
$ HOROVOD_FUSION_THRESHOLD=33554432 horovodrun -np 4 python train.py
```
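For reference, the threshold is given in bytes. A small sketch of the arithmetic behind the value used above:

```bash
# 33554432 bytes = 32 * 1024 * 1024, i.e. a 32 MB fusion buffer.
$ HOROVOD_FUSION_THRESHOLD=$((32 * 1024 * 1024)) horovodrun -np 4 python train.py
```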

Setting the `HOROVOD_FUSION_THRESHOLD` environment variable to zero disables Tensor Fusion:

```bash
$ HOROVOD_FUSION_THRESHOLD=0 mpirun -np 4 -x HOROVOD_FUSION_THRESHOLD python train.py
$ HOROVOD_FUSION_THRESHOLD=0 horovodrun -np 4 python train.py
```

You can tweak time between cycles (defined in milliseconds) using the `HOROVOD_CYCLE_TIME` environment variable:

```bash
$ HOROVOD_CYCLE_TIME=3.5 mpirun -np 4 -x HOROVOD_FUSION_THRESHOLD python train.py
$ HOROVOD_CYCLE_TIME=3.5 horovodrun -np 4 python train.py
```
4 changes: 2 additions & 2 deletions docs/timeline.md
@@ -8,7 +8,7 @@ To record a Horovod Timeline, set the `HOROVOD_TIMELINE` environment variable to
file to be created. This file is only recorded on rank 0, but it contains information about the activity of all workers.

```bash
$ HOROVOD_TIMELINE=/path/to/timeline.json mpirun -np 4 -x HOROVOD_TIMELINE python train.py
$ HOROVOD_TIMELINE=/path/to/timeline.json horovodrun -np 4 python train.py
```

You can then open the timeline file using the `chrome://tracing` facility of the [Chrome](https://www.google.com/chrome/browser/) browser.
@@ -56,5 +56,5 @@ To add cycle markers to the timeline, set the `HOROVOD_TIMELINE_MARK_CYCLES` env

```bash
$ HOROVOD_TIMELINE=/path/to/timeline.json HOROVOD_TIMELINE_MARK_CYCLES=1 \
mpirun -np 4 -x HOROVOD_TIMELINE python train.py
horovodrun -np 4 python train.py
```
22 changes: 0 additions & 22 deletions docs/troubleshooting.md
@@ -220,28 +220,6 @@ For example:
$ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_NCCL_HOME=/path/to/nccl pip install --no-cache-dir horovod
```

### ncclCommInitRank failed: unhandled cuda error

If you see the error message below during training, it means that NCCL is not able to initialize correctly. You can
set the `NCCL_DEBUG` environment variable to `INFO` to have NCCL print debugging information, which may reveal the reason.

```
UnknownError (see above for traceback): ncclCommInitRank failed: unhandled cuda error
[[Node: training/TFOptimizer/DistributedAdadeltaOptimizer_Allreduce/HorovodAllreduce_training_TFOptimizer_gradients_dense_2_BiasAdd_grad_tuple_control_dependency_1_0 = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](training/TFOptimizer/gradients/dense_2/BiasAdd_grad/tuple/control_dependency_1)]]
[[Node: training/TFOptimizer/DistributedAdadeltaOptimizer/update/_94 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_583_training/TFOptimizer/DistributedAdadeltaOptimizer/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
```

For example:

```bash
$ export NCCL_DEBUG=INFO
$ mpirun -np 4 \
-H localhost:4 \
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
python train.py
```
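With `horovodrun`, the same diagnostic can be written using the inline-environment pattern this commit adopts in the tensor-fusion and timeline docs. This is a sketch, assuming `horovodrun` propagates the variable to all workers:

```bash
# Sketch: enable NCCL debug output for a 4-GPU run, mirroring the
# HOROVOD_TIMELINE and HOROVOD_FUSION_THRESHOLD examples elsewhere in this commit.
$ NCCL_DEBUG=INFO horovodrun -np 4 -H localhost:4 python train.py
```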

### ncclAllReduce failed: invalid data type

If you see the error message below during training, it means that Horovod was linked to the wrong version of NCCL
2 changes: 1 addition & 1 deletion horovod/run/run.py
@@ -426,7 +426,7 @@ def run():
if not _is_open_mpi_installed():
raise Exception(
'horovodrun convenience script currently only supports '
'Open MPI.\n'
'Open MPI.\n\n'
'Choose one of:\n'
'1. Install Open MPI 4.0.0+ and re-install Horovod '
'(use --no-cache-dir pip option).\n'
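Since `horovodrun` refuses to start unless Open MPI is installed, a quick way to check what is on the `PATH` before re-installing Horovod (the version string below is illustrative; yours will vary):

```bash
# Verify that Open MPI (4.0.0+, per the message above) is installed.
$ mpirun --version
mpirun (Open MPI) 4.0.0
```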

7 comments on commit 420104b

@kit1980
Contributor

Why delete "Hangs due to non-routed network interfaces" section?
Is it no longer relevant?

@abditag2
Collaborator

@kit1980 when you use the `horovodrun` wrapper, it checks all the hosts and finds common interfaces to use for MPI. I believe that is why @alsrgv removed these instructions.

@kit1980
Contributor

@abditag2 but using mpirun directly is still a supported way to run horovod, right?
This seems to be valuable information...

@abditag2
Collaborator
commented on 420104b, Apr 12, 2019

Correct, mpirun will be supported forever. @alsrgv, should we add it back to our docs?

@alsrgv
Member Author
commented on 420104b, Apr 12, 2019

I guess we could create another doc, mpirun.md, which would have details about running Horovod using Open MPI, and we could link to it from the main README/running.md. Does that sound reasonable, @kit1980?

@kit1980
Contributor

@alsrgv sounds good to me

@alsrgv
Member Author
commented on 420104b, May 17, 2019

This is getting revived in #1084.
