Horovod auto-selection of g++ version (#1199)

* Horovod auto-selection of g++ version

Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>

* Get rid of g++-4.9, use the highest compatible compiler version

Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>

* Bugfixes

Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>

* Patch compiler per plugin

Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>

* Properly extract version from g++-4.8

Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>

* Add a hack to remove offensive compiler options

Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>

* Bugfixes

Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>

* Python 2.7 support

Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>

* Silly bugfix

Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>

* Fix indentation in gpus.rst

Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>

* Mention starting version for tf.version

Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>

* More comments

Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>

* Review comments

Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
alsrgv committed Jul 9, 2019
1 parent 734cdb6 commit d9021529063d815bad199311b8f869ad574cb610
Showing with 263 additions and 134 deletions.
  1. +3 −25 Dockerfile
  2. +1 −25 Dockerfile.test.cpu
  3. +1 −24 Dockerfile.test.gpu
  4. +19 −3 README.rst
  5. +7 −1 docs/gpus.rst
  6. +12 −0 docs/index.rst
  7. +17 −1 docs/summary.rst
  8. +28 −0 horovod/common/util.py
  9. +159 −17 setup.py
  10. +0 −21 test/common.py
  11. +15 −15 test/test_stall.py
  12. +1 −2 test/test_timeline.py
@@ -15,16 +15,10 @@ ENV PYTHON_VERSION=${python}
# Set default shell to /bin/bash
SHELL ["/bin/bash", "-cu"]

# We need gcc-4.9 to build plugins for TensorFlow & PyTorch, which is only available in Ubuntu Xenial
RUN echo deb http://archive.ubuntu.com/ubuntu xenial main universe | tee -a /etc/apt/sources.list

RUN apt-get update && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
build-essential \
cmake \
gcc-4.9 \
g++-4.9 \
gcc-4.9-base \
software-properties-common \
g++-4.8 \
git \
curl \
vim \
@@ -72,28 +66,12 @@ RUN mkdir /tmp/openmpi && \
ldconfig && \
rm -rf /tmp/openmpi

# Pin GCC to 4.9 (priority 200) to compile correctly against TensorFlow, PyTorch, and MXNet.
# Backup existing GCC installation as priority 100, so that it can be recovered later.
RUN update-alternatives --install /usr/bin/gcc gcc $(readlink -f $(which gcc)) 100 && \
update-alternatives --install /usr/bin/x86_64-linux-gnu-gcc x86_64-linux-gnu-gcc $(readlink -f $(which gcc)) 100 && \
update-alternatives --install /usr/bin/g++ g++ $(readlink -f $(which g++)) 100 && \
update-alternatives --install /usr/bin/x86_64-linux-gnu-g++ x86_64-linux-gnu-g++ $(readlink -f $(which g++)) 100
RUN update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.9 200 && \
update-alternatives --install /usr/bin/x86_64-linux-gnu-gcc x86_64-linux-gnu-gcc /usr/bin/gcc-4.9 200 && \
update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-4.9 200 && \
update-alternatives --install /usr/bin/x86_64-linux-gnu-g++ x86_64-linux-gnu-g++ /usr/bin/g++-4.9 200

# Install Horovod, temporarily using CUDA stubs
RUN ldconfig /usr/local/cuda/targets/x86_64-linux/lib/stubs && \
HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITH_MXNET=1 pip install --no-cache-dir horovod && \
HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITH_MXNET=1 \
pip install --no-cache-dir horovod && \
ldconfig

# Remove GCC pinning
RUN update-alternatives --remove gcc /usr/bin/gcc-4.9 && \
update-alternatives --remove x86_64-linux-gnu-gcc /usr/bin/gcc-4.9 && \
update-alternatives --remove g++ /usr/bin/g++-4.9 && \
update-alternatives --remove x86_64-linux-gnu-g++ /usr/bin/g++-4.9

# Install OpenSSH for MPI to communicate between containers
RUN apt-get install -y --no-install-recommends openssh-client openssh-server && \
mkdir -p /var/run/sshd
@@ -16,9 +16,6 @@ ARG PYSPARK_PACKAGE=pyspark==2.4.0
# Set default shell to /bin/bash
SHELL ["/bin/bash", "-cu"]

# We need gcc-4.9 to build plugins for TensorFlow & PyTorch, which is only available in Ubuntu Xenial
RUN echo deb http://archive.ubuntu.com/ubuntu xenial main universe | tee -a /etc/apt/sources.list

# Install essential packages.
RUN apt-get update -qq
RUN apt-get install -y --no-install-recommends \
@@ -27,10 +24,7 @@ RUN apt-get install -y --no-install-recommends \
openssh-client \
git \
build-essential \
gcc-4.9 \
g++-4.9 \
gcc-4.9-base \
software-properties-common
g++-4.8

# Install Python.
RUN apt-get install -y python${PYTHON_VERSION} python${PYTHON_VERSION}-dev
@@ -112,17 +106,6 @@ RUN pip install ${TORCHVISION_PACKAGE} Pillow --no-deps
# Install MXNet.
RUN pip install ${MXNET_PACKAGE}

# Pin GCC to 4.9 (priority 200) to compile correctly against TensorFlow, PyTorch, and MXNet.
# Backup existing GCC installation as priority 100, so that it can be recovered later.
RUN update-alternatives --install /usr/bin/gcc gcc $(readlink -f $(which gcc)) 100 && \
update-alternatives --install /usr/bin/x86_64-linux-gnu-gcc x86_64-linux-gnu-gcc $(readlink -f $(which gcc)) 100 && \
update-alternatives --install /usr/bin/g++ g++ $(readlink -f $(which g++)) 100 && \
update-alternatives --install /usr/bin/x86_64-linux-gnu-g++ x86_64-linux-gnu-g++ $(readlink -f $(which g++)) 100
RUN update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.9 200 && \
update-alternatives --install /usr/bin/x86_64-linux-gnu-gcc x86_64-linux-gnu-gcc /usr/bin/gcc-4.9 200 && \
update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-4.9 200 && \
update-alternatives --install /usr/bin/x86_64-linux-gnu-g++ x86_64-linux-gnu-g++ /usr/bin/g++-4.9 200

# Install Horovod.
RUN if [[ ${MPI_KIND} == "MLSL" ]]; then \
if [ -z "${LD_LIBRARY_PATH:-}" ]; then \
@@ -154,13 +137,6 @@ RUN if [[ ${MPI_KIND} == "MLSL" ]]; then \
pip install -v /horovod/dist/horovod-*.tar.gz; \
fi


# Remove GCC pinning
RUN update-alternatives --remove gcc /usr/bin/gcc-4.9 && \
update-alternatives --remove x86_64-linux-gnu-gcc /usr/bin/gcc-4.9 && \
update-alternatives --remove g++ /usr/bin/g++-4.9 && \
update-alternatives --remove x86_64-linux-gnu-g++ /usr/bin/g++-4.9

# Hack for compatibility of MNIST example with TensorFlow 1.1.0.
RUN if [[ ${TENSORFLOW_PACKAGE} == "tensorflow==1.1.0" ]]; then \
sed -i "s/from tensorflow import keras/from tensorflow.contrib import keras/" /horovod/examples/tensorflow_mnist.py; \
@@ -20,9 +20,6 @@ ARG HOROVOD_MIXED_INSTALL=0
# Set default shell to /bin/bash
SHELL ["/bin/bash", "-cu"]

# We need gcc-4.9 to build plugins for TensorFlow & PyTorch, which is only available in Ubuntu Xenial
RUN echo deb http://archive.ubuntu.com/ubuntu xenial main universe | tee -a /etc/apt/sources.list

# Install essential packages.
RUN apt-get update -qq
RUN apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
@@ -31,10 +28,7 @@ RUN apt-get install -y --allow-downgrades --allow-change-held-packages --no-inst
openssh-client \
git \
build-essential \
gcc-4.9 \
g++-4.9 \
gcc-4.9-base \
software-properties-common \
g++-4.8 \
libcudnn7=${CUDNN_VERSION} \
libnccl2=${NCCL_VERSION_OVERRIDE} \
libnccl-dev=${NCCL_VERSION_OVERRIDE}
@@ -94,29 +88,12 @@ RUN pip install ${TORCHVISION_PACKAGE} Pillow --no-deps
# Install MXNet.
RUN pip install ${MXNET_PACKAGE}

# Pin GCC to 4.9 (priority 200) to compile correctly against TensorFlow, PyTorch, and MXNet.
# Backup existing GCC installation as priority 100, so that it can be recovered later.
RUN update-alternatives --install /usr/bin/gcc gcc $(readlink -f $(which gcc)) 100 && \
update-alternatives --install /usr/bin/x86_64-linux-gnu-gcc x86_64-linux-gnu-gcc $(readlink -f $(which gcc)) 100 && \
update-alternatives --install /usr/bin/g++ g++ $(readlink -f $(which g++)) 100 && \
update-alternatives --install /usr/bin/x86_64-linux-gnu-g++ x86_64-linux-gnu-g++ $(readlink -f $(which g++)) 100
RUN update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.9 200 && \
update-alternatives --install /usr/bin/x86_64-linux-gnu-gcc x86_64-linux-gnu-gcc /usr/bin/gcc-4.9 200 && \
update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-4.9 200 && \
update-alternatives --install /usr/bin/x86_64-linux-gnu-g++ x86_64-linux-gnu-g++ /usr/bin/g++-4.9 200

# Install Horovod.
RUN cd /horovod && python setup.py sdist
RUN ldconfig /usr/local/cuda/targets/x86_64-linux/lib/stubs && \
bash -c "${HOROVOD_BUILD_FLAGS} pip install -v /horovod/dist/horovod-*.tar.gz" && \
ldconfig

# Remove GCC pinning
RUN update-alternatives --remove gcc /usr/bin/gcc-4.9 && \
update-alternatives --remove x86_64-linux-gnu-gcc /usr/bin/gcc-4.9 && \
update-alternatives --remove g++ /usr/bin/g++-4.9 && \
update-alternatives --remove x86_64-linux-gnu-g++ /usr/bin/g++-4.9

# Hack for compatibility of MNIST example with TensorFlow 1.1.0.
RUN if [[ ${TENSORFLOW_PACKAGE} == "tensorflow-gpu==1.1.0" ]]; then \
sed -i "s/from tensorflow import keras/from tensorflow.contrib import keras/" /horovod/examples/tensorflow_mnist.py; \
@@ -83,17 +83,33 @@ To install Horovod:

1. Install `Open MPI <https://www.open-mpi.org/>`_ or another MPI implementation. Learn how to install Open MPI `on this page <https://www.open-mpi.org/faq/?category=building#easy-build>`_.

**Note**: Open MPI 3.1.3 has an issue that may cause hangs. The recommended fix is to
downgrade to Open MPI 3.1.2 or upgrade to Open MPI 4.0.0.
**Note**: Open MPI 3.1.3 has an issue that may cause hangs. The recommended fix is to
downgrade to Open MPI 3.1.2 or upgrade to Open MPI 4.0.0.

2. Install the ``horovod`` pip package.
.. raw:: html

<p/>

2. If you've installed TensorFlow from `PyPI <https://pypi.org/project/tensorflow>`__, make sure that the ``g++-4.8.5`` or ``g++-4.9`` is installed.

If you've installed PyTorch from `PyPI <https://pypi.org/project/torch>`__, make sure that the ``g++-4.9`` or above is installed.

If you've installed either package from `Conda <https://conda.io>`_, make sure that the ``gxx_linux-64`` Conda package is installed.

.. raw:: html

<p/>

3. Install the ``horovod`` pip package.

   .. code-block:: bash

       $ pip install horovod

This basic installation is good for laptops and for getting to know Horovod.

If you're installing Horovod on a server with GPUs, read the `Horovod on GPU <docs/gpus.rst>`_ page.

If you want to use Docker, read the `Horovod in Docker <docs/docker.rst>`_ page.
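
The compiler requirements listed in step 2 can be read as version ranges, and the commit message describes picking "the highest compatible compiler version". The following is a minimal sketch of that idea only; it is not the logic in `setup.py`, which is not shown here. The helper names `parse_version` and `pick_compiler`, the candidate table, and the exact range bounds are assumptions made for illustration.

```python
import re


def parse_version(text):
    """Return the first dotted version number found in text as a tuple of ints, or None."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def pick_compiler(candidates, minimum, maximum=None):
    """Pick the candidate with the highest version in [minimum, maximum).

    candidates maps a compiler name to its reported version string.
    Both the mapping and the bounds are hypothetical inputs for this sketch.
    """
    best = None
    for name, version_text in candidates.items():
        version = parse_version(version_text)
        if version is None or version < minimum:
            continue
        if maximum is not None and version >= maximum:
            continue
        if best is None or version > best[1]:
            best = (name, version)
    return best[0] if best else None


# Hypothetical set of compilers found on a machine.
found = {"g++-4.8": "4.8.5", "g++-4.9": "4.9.3", "g++": "7.4.0"}

# Assumed reading of "g++-4.8.5 or g++-4.9" for TensorFlow from PyPI.
tensorflow_choice = pick_compiler(found, minimum=(4, 8, 5), maximum=(5,))

# Assumed reading of "g++-4.9 or above" for PyTorch from PyPI.
pytorch_choice = pick_compiler(found, minimum=(4, 9))
```

Comparing versions as integer tuples avoids the string-ordering mistake where "4.10" would sort below "4.9".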


@@ -32,7 +32,13 @@ by installing an `nv_peer_memory <https://github.com/Mellanox/nv_peer_memory>`__

**Note**: Open MPI 3.1.3 has an issue that may cause hangs. The recommended fix is to downgrade to Open MPI 3.1.2 or upgrade to Open MPI 4.0.0.

4. Install the ``horovod`` pip package.
4. If you've installed TensorFlow from `PyPI <https://pypi.org/project/tensorflow>`__, make sure that the ``g++-4.8.5`` or ``g++-4.9`` is installed.

If you've installed PyTorch from `PyPI <https://pypi.org/project/torch>`__, make sure that the ``g++-4.9`` or above is installed.

If you've installed either package from `Conda <https://conda.io>`_, make sure that the ``gxx_linux-64`` Conda package is installed.

5. Install the ``horovod`` pip package.

If you have installed NCCL 2 using the ``nccl-<version>.txz`` package, you should specify the path to NCCL 2 using the ``HOROVOD_NCCL_HOME``
environment variable.
@@ -13,6 +13,10 @@ Choose your deep learning framework to learn how to get started with Horovod.
<p>To use Horovod with TensorFlow on your laptop:
<ol>
<li><a href="https://www.open-mpi.org/faq/?category=building#easy-build">Install Open MPI 3.1.2 or 4.0.0</a>, or another MPI implementation. </li>
<li>
If you've installed TensorFlow from <a href="https://pypi.org/project/tensorflow">PyPI</a>, make sure that the <code>g++-4.8.5</code> or <code>g++-4.9</code> is installed.<br/>
If you've installed TensorFlow from <a href="https://conda.io">Conda</a>, make sure that the <code>gxx_linux-64</code> Conda package is installed.
</li>
<li>Install the Horovod pip package: <code>pip install horovod</code></li>
<li>Read <a href="https://horovod.readthedocs.io/en/latest/tensorflow.html">Horovod with TensorFlow</a> for best practices and examples. </li>
</ol>
@@ -25,6 +29,10 @@ Choose your deep learning framework to learn how to get started with Horovod.
<p>To use Horovod with Keras on your laptop:
<ol>
<li><a href="https://www.open-mpi.org/faq/?category=building#easy-build">Install Open MPI 3.1.2 or 4.0.0</a>, or another MPI implementation. </li>
<li>
If you've installed TensorFlow from <a href="https://pypi.org/project/tensorflow">PyPI</a>, make sure that the <code>g++-4.8.5</code> or <code>g++-4.9</code> is installed.<br/>
If you've installed TensorFlow from <a href="https://conda.io">Conda</a>, make sure that the <code>gxx_linux-64</code> Conda package is installed.
</li>
<li>Install the Horovod pip package: <code>pip install horovod</code></li>
<li>Read <a href="https://horovod.readthedocs.io/en/latest/keras.html">Horovod with Keras</a> for best practices and examples. </li>
</ol>
@@ -37,6 +45,10 @@ Choose your deep learning framework to learn how to get started with Horovod.
<p>To use Horovod with PyTorch on your laptop:
<ol>
<li><a href="https://www.open-mpi.org/faq/?category=building#easy-build">Install Open MPI 3.1.2 or 4.0.0</a>, or another MPI implementation. </li>
<li>
If you've installed PyTorch from <a href="https://pypi.org/project/torch">PyPI</a>, make sure that the <code>g++-4.9</code> or above is installed.<br/>
If you've installed PyTorch from <a href="https://conda.io">Conda</a>, make sure that the <code>gxx_linux-64</code> Conda package is installed.
</li>
<li>Install the Horovod pip package: <code>pip install horovod</code></li>
<li>Read <a href="https://horovod.readthedocs.io/en/latest/pytorch.html">Horovod with PyTorch</a> for best practices and examples. </li>
</ol>
@@ -85,14 +85,30 @@ To install Horovod:

**Note**: Open MPI 3.1.3 has an issue that may cause hangs. The recommended fix is to downgrade to Open MPI 3.1.2 or upgrade to Open MPI 4.0.0.

2. Install the ``horovod`` pip package.
.. raw:: html

<p/>

2. If you've installed TensorFlow from `PyPI <https://pypi.org/project/tensorflow>`__, make sure that the ``g++-4.8.5`` or ``g++-4.9`` is installed.

If you've installed PyTorch from `PyPI <https://pypi.org/project/torch>`__, make sure that the ``g++-4.9`` or above is installed.

If you've installed either package from `Conda <https://conda.io>`_, make sure that the ``gxx_linux-64`` Conda package is installed.

.. raw:: html

<p/>

3. Install the ``horovod`` pip package.

   .. code-block:: bash

       $ pip install horovod

This basic installation is good for laptops and for getting to know Horovod.

If you're installing Horovod on a server with GPUs, read `Horovod on GPU <gpus.rst>`_.

If you want to use Docker, read `Horovod in Docker <docker.rst>`_.


@@ -14,6 +14,7 @@
# limitations under the License.
# =============================================================================

from contextlib import contextmanager
import os
import sysconfig

@@ -44,3 +45,30 @@ def check_extension(ext_name, ext_env_var, pkg_path, *args):
        raise ImportError(
            'Extension %s has not been built. If this is not expected, reinstall '
            'Horovod with %s=1 to debug the build error.' % (ext_name, ext_env_var))


@contextmanager
def env(**kwargs):
    # ignore args with None values
    for k in list(kwargs.keys()):
        if kwargs[k] is None:
            del kwargs[k]

    # backup environment
    backup = {}
    for k in kwargs.keys():
        backup[k] = os.environ.get(k)

    # set new values & yield
    for k, v in kwargs.items():
        os.environ[k] = v

    try:
        yield
    finally:
        # restore environment
        for k in kwargs.keys():
            if backup[k] is not None:
                os.environ[k] = backup[k]
            else:
                del os.environ[k]
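
A minimal usage sketch for the `env` context manager added above. It restates the helper in compact form so the block runs on its own. The variable names `HOROVOD_DEMO_CXX` and `HOROVOD_DEMO_FLAGS` are hypothetical and are not taken from this commit.

```python
import os
from contextlib import contextmanager


@contextmanager
def env(**kwargs):
    # Compact restatement of the helper from the diff, so this sketch is self-contained.
    overrides = {k: v for k, v in kwargs.items() if v is not None}
    backup = {k: os.environ.get(k) for k in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        # Restore previous values; remove variables that were unset before.
        for k, old in backup.items():
            if old is not None:
                os.environ[k] = old
            else:
                del os.environ[k]


# Hypothetical variables used only for illustration.
os.environ.pop("HOROVOD_DEMO_CXX", None)
os.environ["HOROVOD_DEMO_FLAGS"] = "-O2"

with env(HOROVOD_DEMO_CXX="g++-4.8", HOROVOD_DEMO_FLAGS=None):
    inside_cxx = os.environ.get("HOROVOD_DEMO_CXX")      # set while the block runs
    inside_flags = os.environ.get("HOROVOD_DEMO_FLAGS")  # None-valued kwarg is ignored

after_cxx = os.environ.get("HOROVOD_DEMO_CXX")      # unset again after the block
after_flags = os.environ.get("HOROVOD_DEMO_FLAGS")  # never touched
```

Because restoration happens in a `finally` clause, the previous environment comes back even if the body raises, which matters when a build step fails partway through.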
