Add -hostfile option to horovodrun (#1243)

* add hostfile option to horovodrun

Signed-off-by: Lin Yuan <apeforest@gmail.com>

* check remote host names

Signed-off-by: Lin Yuan <apeforest@gmail.com>

* add test and docs for -hostfile option

Signed-off-by: Lin Yuan <apeforest@gmail.com>

* address reviewer comment

Signed-off-by: Lin Yuan <apeforest@gmail.com>

* fix lint

Signed-off-by: Lin Yuan <apeforest@gmail.com>

* address reviewer comment

Signed-off-by: Lin Yuan <apeforest@gmail.com>

* address reviewer comment

Signed-off-by: Lin Yuan <apeforest@gmail.com>

* address comment

Signed-off-by: Lin Yuan <apeforest@gmail.com>
apeforest authored and alsrgv committed Aug 3, 2019
1 parent a30cbee commit a72bc96a0f87a8a7c666fff005f1afa21e95b972
Showing with 67 additions and 20 deletions.
  1. +4 −0 .buildkite/gen-pipeline.sh
  2. +33 −6 docs/running.rst
  3. +30 −14 horovod/run/run.py
.buildkite/gen-pipeline.sh
@@ -146,6 +146,10 @@ run_all() {
       run_test "${test}" "${queue}" \
         ":muscle: Test Horovodrun (${test})" \
         "horovodrun -np 2 -H localhost:2 python /horovod/examples/tensorflow_mnist.py"
+      run_test "${test}" "${queue}" \
+        ":muscle: Test Horovodrun (${test})" \
+        "echo 'localhost slots=2' > hostfile" \
+        "horovodrun -np 2 -hostfile hostfile python /horovod/examples/mxnet_mnist.py"
     fi
   fi

docs/running.rst
@@ -4,10 +4,12 @@
 Run Horovod
 ===========

-This page includes examples for Open MPI that use ``horovodrun``. Check your MPI documentation for arguments to the ``mpirun``
+This page includes examples for Open MPI that use ``horovodrun``. Check your
+MPI documentation for arguments to the ``mpirun``
 command on your system.

-Typically one GPU will be allocated per process, so if a server has 4 GPUs, you would run 4 processes. In ``horovodrun``,
+Typically one GPU will be allocated per process, so if a server has 4 GPUs,
+you will run 4 processes. In ``horovodrun``,
 the number of processes is specified with the ``-np`` flag.

 To run on a machine with 4 GPUs:
@@ -22,12 +24,35 @@ To run on 4 machines with 4 GPUs each:

     $ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py

+You can also specify host nodes in a host file. For example:
+
+.. code-block:: bash
+
+    $ cat myhostfile
+    aa slots=2
+    bb slots=2
+    cc slots=2
+
+This example lists the host names (aa, bb, and cc) and how many "slots" there
+are for each.
+Slots indicate how many processes can potentially execute on a node.
+This format is the same as the one used by the
+`mpirun command <https://www.open-mpi.org/doc/v4.0/man1/mpirun.1.php#toc6>`__.
+
+To run on hosts specified in a hostfile:
+
+.. code-block:: bash
+
+    $ horovodrun -np 6 -hostfile myhostfile python train.py
+

 Failures due to SSH issues
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
-The host where ``horovodrun`` is executed must be able to SSH to all other hosts without any prompts.
+The host where ``horovodrun`` is executed must be able to SSH to all other
+hosts without any prompts.

-If ``horovodrun`` fails with permission error, verify that you can ssh to every other server without entering a password or
+If ``horovodrun`` fails with a permission error, verify that you can ssh to
+every other server without entering a password or
 answering questions like this:


@@ -38,7 +63,8 @@ Are you sure you want to continue connecting (yes/no)?``

 To learn more about setting up passwordless authentication, see `this page <http://www.linuxproblem.org/art_9.html>`__.

-To avoid ``The authenticity of host '<hostname> (<ip address>)' can't be established`` prompts, add all the hosts to
+To avoid ``The authenticity of host '<hostname> (<ip address>)' can't be
+established`` prompts, add all the hosts to
 the ``~/.ssh/known_hosts`` file using ``ssh-keyscan``:

 .. code-block:: bash
@@ -49,6 +75,7 @@ the ``~/.ssh/known_hosts`` file using ``ssh-keyscan``:
 Advanced: Run Horovod with Open MPI
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 In some advanced cases you might want fine-grained control over options passed to Open MPI.
-To learn how run Horovod training directly using Open MPI, read `Run Horovod with Open MPI <mpirun.rst>`_.
+To learn how to run Horovod training directly using Open MPI,
+read `Run Horovod with Open MPI <mpirun.rst>`_.

 .. inclusion-marker-end-do-not-remove
horovod/run/run.py
@@ -294,16 +294,25 @@ def parse_args():
     parser.add_argument('-p', '--ssh-port', action="store", dest="ssh_port",
                         type=int, help="SSH port on all the hosts.")

-    parser.add_argument('-H', '--host', action="store", dest="host",
-                        help="To specify the list of host names as well as the "
-                             "number of available slots on each host for "
-                             "training processes using the following format: "
-                             "<hostname>:<number of slots>,... . "
-                             "E.g., host1:2,host2:4,host3:1 "
-                             "indicates that 2 processes can run on "
-                             "host1, 4 processes on host2, and 1 process "
-                             "on host3.")
-
+    host_group = parser.add_argument_group("Use one of the following options "
+                                           "to specify which hosts (nodes) of "
+                                           "the cluster to run on")
+    host_group.add_argument('-H', '--host', action="store", dest="host",
+                            help="To specify the list of host names as well "
+                                 "as the number of available slots on each "
+                                 "host for training processes using the "
+                                 "following format: <hostname>:<number of "
+                                 "slots>,... . E.g., host1:2,host2:4,host3:1 "
+                                 "indicates that 2 processes can run on "
+                                 "host1, 4 processes on host2, and 1 process "
+                                 "on host3.")
+    host_group.add_argument('-hostfile', '--hostfile', action="store",
+                            dest="hostfile",
+                            help="To specify a host file with the list of "
+                                 "host names as well as the number of "
+                                 "available slots on each host. "
+                                 "Each line of the host file is formatted "
+                                 "as <hostname> slots=<number of slots>")
     parser.add_argument('--disable-cache', action="store_true",
                         dest="disable_cache",
                         help="If the flag is not set, horovodrun will perform "
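
The hunk above places `-H` and `-hostfile` in an argument group. An argument group only changes how `--help` lays out the options, so argparse still accepts a command line that passes both. A minimal sketch of one stricter alternative follows, using the standard library's `add_mutually_exclusive_group`. The `build_parser` function and the `horovodrun-sketch` program name are hypothetical stand-ins and are not part of this commit; only the host-related options are reproduced.

```python
import argparse


def build_parser():
    """Hypothetical stand-in for horovodrun's parser (host options only)."""
    parser = argparse.ArgumentParser(prog="horovodrun-sketch")
    parser.add_argument("-np", "--num-proc", dest="np", type=int,
                        required=True)
    # Unlike add_argument_group, a mutually exclusive group makes argparse
    # reject command lines that supply both -H and -hostfile.
    hosts = parser.add_mutually_exclusive_group()
    hosts.add_argument("-H", "--host", dest="host")
    hosts.add_argument("-hostfile", "--hostfile", dest="hostfile")
    return parser


args = build_parser().parse_args(["-np", "6", "-hostfile", "myhostfile"])
print(args.host, args.hostfile)  # -> None myhostfile
```

With this variant, passing `-H` and `-hostfile` together makes argparse print a usage error and exit with status 2, instead of silently preferring one of them.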
@@ -353,6 +362,9 @@ def run():
     if args.host:
         all_host_names = [x for x in
                           [y.split(':')[0] for y in args.host.split(',')]]
+    elif args.hostfile:
+        all_host_names = [x for x in
+                          [line.split()[0] for line in open(args.hostfile)]]
     else:
         all_host_names = []
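
The added branch takes the first whitespace-separated token of every line in the host file. A blank line would therefore raise `IndexError`, and the file handle is never closed explicitly. The following is a more defensive sketch, not the code in this commit. The `parse_hostfile` helper is hypothetical, and two of its behaviors are assumptions of the sketch rather than documented horovodrun behavior: treating `#` as the start of a comment, and defaulting to one slot when `slots=` is absent.

```python
def parse_hostfile(lines):
    """Return a {hostname: slots} mapping, in file order (hypothetical helper)."""
    hosts = {}
    for raw in lines:
        # Assumption: '#' starts a comment; strip it and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # a bare line.split()[0] would raise IndexError here
        fields = line.split()
        slots = 1  # assumption: one slot when no slots=<n> field is given
        for field in fields[1:]:
            if field.startswith("slots="):
                slots = int(field[len("slots="):])
        hosts[fields[0]] = slots
    return hosts


example = ["aa slots=2\n", "\n", "# spare node\n", "bb slots=2\n", "cc slots=2\n"]
hosts = parse_hostfile(example)
print(list(hosts), sum(hosts.values()))  # -> ['aa', 'bb', 'cc'] 6
```

To read a real file, pass an open handle inside a `with open(path) as f:` block so the file is closed deterministically. The slot total (6 here) matches the `-np 6` used in the documentation example for `myhostfile`.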

@@ -392,7 +404,7 @@ def run():
                                parameters_hash)

     remote_host_names = []
-    if args.host:
+    if args.host or args.hostfile:
         if settings.verbose >= 2:
             print("Filtering local host names.")
         remote_host_names = network.filter_local_addresses(all_host_names)
@@ -406,13 +418,17 @@ def run():
             if settings.verbose >= 2:
                 print("SSH was successful into all the remote hosts.")

-        hosts_arg = "-H {hosts}".format(hosts=args.host)
+        if args.host:
+            hosts_arg = "-H {hosts}".format(hosts=args.host)
+        else:
+            hosts_arg = "-hostfile {hostfile}".format(hostfile=args.hostfile)
     else:
-        # if user does not specify any hosts, mpirun by default uses local host.
+        # if neither --host nor --hostfile is specified, localhost will be
+        # used by default
         # There is no need to specify localhost.
         hosts_arg = ""

-    if args.host and len(remote_host_names) > 0:
+    if len(remote_host_names) > 0:
         if settings.verbose >= 2:
             print("Testing interfaces on all the hosts.")
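
The branching in the hunk above can be summarized as a small pure function. This is an illustrative sketch only: `build_hosts_arg` is a hypothetical name, and the commit keeps this logic inline in `run()`. It mirrors the order of the checks in the diff, where `-H` is consulted first and the host file is used only when no host list was given.

```python
def build_hosts_arg(host=None, hostfile=None):
    """Return the host-selection fragment passed on to mpirun (hypothetical helper)."""
    if host:
        return "-H {hosts}".format(hosts=host)
    if hostfile:
        return "-hostfile {hostfile}".format(hostfile=hostfile)
    # Neither option given: mpirun falls back to the local host.
    return ""


print(build_hosts_arg(host="server1:4,server2:4"))  # -> -H server1:4,server2:4
print(build_hosts_arg(hostfile="myhostfile"))       # -> -hostfile myhostfile
```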
