Add -hostfile option to horovodrun #1243

Merged
merged 8 commits into from
Aug 3, 2019
4 changes: 4 additions & 0 deletions .buildkite/gen-pipeline.sh
@@ -146,6 +146,10 @@ run_all() {
run_test "${test}" "${queue}" \
":muscle: Test Horovodrun (${test})" \
"horovodrun -np 2 -H localhost:2 python /horovod/examples/tensorflow_mnist.py"
run_test "${test}" "${queue}" \
":muscle: Test Horovodrun (${test})" \
"echo 'localhost slots=2' > hostfile" \
"horovodrun -np 2 -hostfile hostfile python /horovod/examples/mxnet_mnist.py"
fi
fi

39 changes: 33 additions & 6 deletions docs/running.rst
@@ -4,10 +4,12 @@
Run Horovod
===========

This page includes examples for Open MPI that use ``horovodrun``. Check your MPI documentation for arguments to the ``mpirun``
This page includes examples for Open MPI that use ``horovodrun``. Check your
MPI documentation for arguments to the ``mpirun``
command on your system.

Typically one GPU will be allocated per process, so if a server has 4 GPUs, you would run 4 processes. In ``horovodrun``,
Typically one GPU will be allocated per process, so if a server has 4 GPUs,
you will run 4 processes. In ``horovodrun``,
the number of processes is specified with the ``-np`` flag.

To run on a machine with 4 GPUs:
@@ -22,12 +24,35 @@ To run on 4 machines with 4 GPUs each:

$ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py

You can also specify host nodes in a host file. For example:

.. code-block:: bash

$ cat myhostfile

aa slots=2
bb slots=2
cc slots=2

This example lists the host names (aa, bb, and cc) and how many "slots" there
are for each.
Slots indicate how many processes can potentially execute on a node.
This format is the same as the one used by the
`mpirun command <https://www.open-mpi.org/doc/v4.0/man1/mpirun.1.php#toc6>`__.

To run on hosts specified in a hostfile:

.. code-block:: bash

$ horovodrun -np 6 -hostfile myhostfile python train.py

Failures due to SSH issues
~~~~~~~~~~~~~~~~~~~~~~~~~~
The host where ``horovodrun`` is executed must be able to SSH to all other hosts without any prompts.
The host where ``horovodrun`` is executed must be able to SSH to all other
hosts without any prompts.

If ``horovodrun`` fails with permission error, verify that you can ssh to every other server without entering a password or
If ``horovodrun`` fails with a permission error, verify that you can ssh to
every other server without entering a password or
answering questions like this:


@@ -38,7 +63,8 @@ Are you sure you want to continue connecting (yes/no)?``

To learn more about setting up passwordless authentication, see `this page <http://www.linuxproblem.org/art_9.html>`__.

To avoid ``The authenticity of host '<hostname> (<ip address>)' can't be established`` prompts, add all the hosts to
To avoid ``The authenticity of host '<hostname> (<ip address>)' can't be
established`` prompts, add all the hosts to
the ``~/.ssh/known_hosts`` file using ``ssh-keyscan``:

.. code-block:: bash
@@ -49,6 +75,7 @@ the ``~/.ssh/known_hosts`` file using ``ssh-keyscan``:
Advanced: Run Horovod with Open MPI
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In some advanced cases you might want fine-grained control over options passed to Open MPI.
To learn how run Horovod training directly using Open MPI, read `Run Horovod with Open MPI <mpirun.rst>`_.
To learn how to run Horovod training directly using Open MPI,
read `Run Horovod with Open MPI <mpirun.rst>`_.

.. inclusion-marker-end-do-not-remove
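
To illustrate the hostfile format documented in this file, here is a minimal Python sketch that parses ``<hostname> slots=<n>`` lines and adds up the slots, for example to sanity-check a ``-np`` value. The helper names ``parse_hostfile_text`` and ``total_slots`` are hypothetical and not part of Horovod. Treating a line without ``slots=`` as one slot is an assumption of this sketch, not documented behaviour.

```python
def parse_hostfile_text(text):
    """Return a list of (hostname, slots) pairs parsed from hostfile text."""
    hosts = []
    for raw_line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split('#', 1)[0].strip()
        if not line:
            continue  # skip blank or comment-only lines
        fields = line.split()
        slots = 1  # assumption of this sketch when no slots= field is present
        for field in fields[1:]:
            if field.startswith('slots='):
                slots = int(field[len('slots='):])
        hosts.append((fields[0], slots))
    return hosts


def total_slots(hosts):
    """Sum the slot counts over all parsed hosts."""
    return sum(slots for _, slots in hosts)


# The example hostfile from the documentation above.
example = "aa slots=2\nbb slots=2\ncc slots=2\n"
hosts = parse_hostfile_text(example)
requested_np = 6
fits = requested_np <= total_slots(hosts)
```

With the three-host example, the total is 6 slots, which matches the ``-np 6`` used in the documented command.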
44 changes: 30 additions & 14 deletions horovod/run/run.py
@@ -294,16 +294,25 @@ def parse_args():
parser.add_argument('-p', '--ssh-port', action="store", dest="ssh_port",
type=int, help="SSH port on all the hosts.")

parser.add_argument('-H', '--host', action="store", dest="host",
help="To specify the list of host names as well as the "
"number of available slots on each host for "
"training processes using the following format: "
"<hostname>:<number of slots>,... . "
"E.g., host1:2,host2:4,host3:1 "
"indicates that 2 processes can run on "
"host1, 4 processes on host2, and 1 process "
"on host3.")

host_group = parser.add_argument_group("Use one of the following options "
"to specify which hosts (nodes) of "
"the cluster to run on")
host_group.add_argument('-H', '--host', action="store", dest="host",
help="To specify the list of host names as well "
"as the number of available slots on each "
"host for training processes using the "
"following format: <hostname>:<number of "
"slots>,... . E.g., host1:2,host2:4,host3:1 "
"indicates that 2 processes can run on "
"host1, 4 processes on host2, and 1 process "
"on host3.")
host_group.add_argument('-hostfile', '--hostfile', action="store",
dest="hostfile",
help="To specify a host file with the list of "
"host names as well as the number of "
"available slots on each host. "
"Each line of the host file is formatted "
"as <hostname> slots=<number of slots>")
parser.add_argument('--disable-cache', action="store_true",
dest="disable_cache",
help="If the flag is not set, horovodrun will perform "
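
A note on the grouping in this hunk: ``add_argument_group`` only affects how the help text is laid out, so passing both ``-H`` and ``-hostfile`` is accepted (and, per the later hunks, ``-H`` takes precedence). The sketch below is a hypothetical stand-alone parser, not Horovod's, that contrasts this with ``argparse``'s ``add_mutually_exclusive_group``, which would reject the combination. The ``make_parser`` helper and the ``horovodrun-sketch`` program name are illustrative only.

```python
import argparse


def make_parser(exclusive=False):
    """Build a tiny parser with the two host options from this hunk."""
    parser = argparse.ArgumentParser(prog='horovodrun-sketch')
    if exclusive:
        # Alternative not used in the PR: argparse enforces "at most one".
        group = parser.add_mutually_exclusive_group()
    else:
        # Same approach as the PR: a titled group, used for help output only.
        group = parser.add_argument_group('hosts')
    group.add_argument('-H', '--host', action='store', dest='host')
    group.add_argument('-hostfile', '--hostfile', action='store',
                       dest='hostfile')
    return parser


# Single-dash long option, as used in the documentation examples.
only_file = make_parser().parse_args(['-hostfile', 'myhostfile'])

# A plain argument group accepts both options at once.
both = make_parser().parse_args(['-H', 'a:1', '-hostfile', 'myhostfile'])

# A mutually exclusive group exits with a usage error instead.
try:
    make_parser(exclusive=True).parse_args(
        ['-H', 'a:1', '-hostfile', 'myhostfile'])
    rejected = False
except SystemExit:
    rejected = True
```

Whether silent precedence or a hard error is preferable is a design choice. The sketch only shows that both are available in the standard library.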
@@ -353,6 +362,9 @@ def run():
if args.host:
all_host_names = [x for x in
[y.split(':')[0] for y in args.host.split(',')]]
elif args.hostfile:
all_host_names = [x for x in
[line.split()[0] for line in open(args.hostfile)]]
else:
all_host_names = []
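
One observation on the hostfile branch in this hunk: the list comprehension never closes the file handle, and ``line.split()[0]`` raises ``IndexError`` on a blank line. A more defensive variant could look like the sketch below. ``collect_host_names`` is a hypothetical helper written for illustration, not code from the PR. It assumes blank lines should simply be skipped.

```python
import os
import tempfile


def collect_host_names(host=None, hostfile=None):
    """Return host names from a -H style string or from a hostfile path."""
    if host:
        # "host1:2,host2:4" -> ["host1", "host2"]
        return [entry.split(':')[0] for entry in host.split(',')]
    if hostfile:
        names = []
        with open(hostfile) as f:  # closed automatically on exit
            for line in f:
                fields = line.split()
                if fields:  # skip blank lines instead of raising IndexError
                    names.append(fields[0])
        return names
    return []


# Demonstration with a temporary hostfile that contains a blank line.
with tempfile.NamedTemporaryFile('w', suffix='.hostfile',
                                 delete=False) as tmp:
    tmp.write("aa slots=2\n\nbb slots=2\n")
    tmp_path = tmp.name

from_file = collect_host_names(hostfile=tmp_path)
from_flag = collect_host_names(host="server1:4,server2:4")
os.remove(tmp_path)
```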

@@ -392,7 +404,7 @@ def run():
parameters_hash)

remote_host_names = []
if args.host:
if args.host or args.hostfile:
if settings.verbose >= 2:
print("Filtering local host names.")
remote_host_names = network.filter_local_addresses(all_host_names)
@@ -406,13 +418,17 @@ def run():
if settings.verbose >= 2:
print("SSH was successful into all the remote hosts.")

hosts_arg = "-H {hosts}".format(hosts=args.host)
if args.host:
hosts_arg = "-H {hosts}".format(hosts=args.host)
else:
hosts_arg = "-hostfile {hostfile}".format(hostfile=args.hostfile)
else:
# if user does not specify any hosts, mpirun by default uses local host.
# if neither --host nor --hostfile is specified, localhost will be
# used by default
# There is no need to specify localhost.
hosts_arg = ""

if args.host and len(remote_host_names) > 0:
if len(remote_host_names) > 0:
if settings.verbose >= 2:
print("Testing interfaces on all the hosts.")
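
The host-argument selection in this hunk can be summarized as a small pure function, which makes the precedence easy to see and to test. This is a sketch under assumptions: ``build_hosts_arg`` is a hypothetical name, and the PR itself keeps this logic inline in ``run()``.

```python
def build_hosts_arg(host=None, hostfile=None):
    """Return the host-related fragment for the mpirun command line.

    Mirrors the branching in the hunk: -H wins if both are given, and an
    empty string lets mpirun fall back to the local host.
    """
    if host:
        return "-H {hosts}".format(hosts=host)
    if hostfile:
        return "-hostfile {hostfile}".format(hostfile=hostfile)
    return ""


with_hosts = build_hosts_arg(host="server1:4,server2:4")
with_file = build_hosts_arg(hostfile="myhostfile")
with_both = build_hosts_arg(host="server1:4", hostfile="myhostfile")
with_none = build_hosts_arg()
```

Note that a hostfile path containing spaces would need quoting before being spliced into a shell command. Neither the sketch nor the hunk handles that case.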
