Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added horovodrun. #869

Merged
merged 20 commits into from Mar 18, 2019

Conversation

3 participants
@abditag2
Copy link
Collaborator

abditag2 commented Feb 28, 2019

now, we can run horovod scripts using:

horovodrun -np 2 -H worker-0:1,worker-1:1 model.py

@abditag2 abditag2 requested review from tgaddair and alsrgv Feb 28, 2019

@alsrgv alsrgv referenced this pull request Feb 28, 2019

Closed

Created Horovod.run #858

@alsrgv
Copy link
Collaborator

alsrgv left a comment

Few comments

Show resolved Hide resolved bin/horovodrun
Show resolved Hide resolved bin/horovodrun
Show resolved Hide resolved horovod/spark/task/task_service.py Outdated
Show resolved Hide resolved horovod/run/horovod_task_fn.py Outdated
Show resolved Hide resolved horovod/run/run.py
Show resolved Hide resolved horovod/run/horovod_task_fn.py Outdated
Show resolved Hide resolved horovod/run/horovod_task_fn.py Outdated
Show resolved Hide resolved horovod/run/run.py Outdated
Show resolved Hide resolved horovod/run/run.py Outdated
Show resolved Hide resolved horovod/run/run.py Outdated
Show resolved Hide resolved horovod/run/run.py Outdated
Show resolved Hide resolved horovod/run/util/cache.py Outdated
Show resolved Hide resolved horovod/run/util/cache.py Outdated
Show resolved Hide resolved horovod/run/util/cache.py Outdated
Show resolved Hide resolved horovod/spark/driver/driver_service.py Outdated
Show resolved Hide resolved horovod/spark/util/network.py Outdated
Show resolved Hide resolved horovod/run/run.py Outdated
Show resolved Hide resolved horovod/run/run.py Outdated
@alsrgv
Copy link
Collaborator

alsrgv left a comment

root@test-X11DPG-OT-1:/examples# horovodrun -np 8 echo test
Traceback (most recent call last):
  File "/usr/local/bin/horovodrun", line 21, in <module>
    run.run()
  File "/usr/local/lib/python3.5/dist-packages/horovod/run/run.py", line 331, in run
    [y.split(':')[0] for y in args.host.split(',')]]
AttributeError: 'NoneType' object has no attribute 'split'

If no host specified, we should just run on localhost. We should omit ssh detection in that case, and DO NOT use -H localhost:X since it requires sshd.

Show resolved Hide resolved horovod/run/run.py Outdated
@alsrgv

This comment has been minimized.

Copy link
Collaborator

alsrgv commented Mar 15, 2019

Can we add a test to .travis.yaml? This way we'll run it on every platform that we support.

@alsrgv

This comment has been minimized.

Copy link
Collaborator

alsrgv commented Mar 15, 2019

echo test does not work:

root@test-X11DPG-OT-1:/examples# horovodrun -np 8 -H localhost:8 --ssh-port 12345 echo test
cache is enabled
[test-X11DPG-OT-1:02235] Warning: could not find environment variable "OLDPWD"
[1,0]<stderr>:/usr/bin/python: can't open file 'echo': [Errno 2] No such file or directory
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
[1,1]<stderr>:/usr/bin/python: can't open file 'echo': [Errno 2] No such file or directory
--------------------------------------------------------------------------
mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[63207,1],0]
  Exit code:    2
--------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/bin/horovodrun", line 21, in <module>
    run.run()
  File "/usr/local/lib/python3.5/dist-packages/horovod/run/run.py", line 402, in run
    'mpirun exited with code %d, see the error above.' % exit_code)
Exception: mpirun exited with code 2, see the error above.
@alsrgv

This comment has been minimized.

Copy link
Collaborator

alsrgv commented Mar 15, 2019

Escaping seems not work:

root@test-X11DPG-OT-1:/examples# horovodrun -np 8 -H localhost:8 --ssh-port 12345 python -c 'print("lala")'
cache is enabled
/bin/sh: 1: Syntax error: "(" unexpected
Traceback (most recent call last):
  File "/usr/local/bin/horovodrun", line 21, in <module>
    run.run()
  File "/usr/local/lib/python3.5/dist-packages/horovod/run/run.py", line 402, in run
    'mpirun exited with code %d, see the error above.' % exit_code)
Exception: mpirun exited with code 2, see the error above.
Show resolved Hide resolved horovod/run/run.py Outdated
Show resolved Hide resolved horovod/run/run.py Outdated
Show resolved Hide resolved horovod/run/util/threads.py
Show resolved Hide resolved horovod/run/run.py Outdated
Show resolved Hide resolved horovod/run/run.py Outdated
@alsrgv
Copy link
Collaborator

alsrgv left a comment

Note - we'll need to remove mpirun alias script & mca-params.conf in https://github.com/horovod/horovod/blob/master/Dockerfile#L57, as it conflicts with horovodrun.

Show resolved Hide resolved horovod/run/run.py Outdated
Show resolved Hide resolved horovod/run/run.py Outdated
@abditag2

This comment has been minimized.

Copy link
Collaborator Author

abditag2 commented Mar 15, 2019

echo test does not work:

root@test-X11DPG-OT-1:/examples# horovodrun -np 8 -H localhost:8 --ssh-port 12345 echo test
cache is enabled
[test-X11DPG-OT-1:02235] Warning: could not find environment variable "OLDPWD"
[1,0]<stderr>:/usr/bin/python: can't open file 'echo': [Errno 2] No such file or directory
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
[1,1]<stderr>:/usr/bin/python: can't open file 'echo': [Errno 2] No such file or directory
--------------------------------------------------------------------------
mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[63207,1],0]
  Exit code:    2
--------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/bin/horovodrun", line 21, in <module>
    run.run()
  File "/usr/local/lib/python3.5/dist-packages/horovod/run/run.py", line 402, in run
    'mpirun exited with code %d, see the error above.' % exit_code)
Exception: mpirun exited with code 2, see the error above.

python executable was included in the command. Changed.

Show resolved Hide resolved horovod/run/horovod_task_fn.py Outdated
Show resolved Hide resolved horovod/run/horovod_task_fn.py Outdated
Show resolved Hide resolved horovod/run/run.py Outdated
Show resolved Hide resolved horovod/run/run.py
Show resolved Hide resolved horovod/run/run.py Outdated

abditag2 added some commits Feb 20, 2019

Added horovod run
Signed-off-by: fardin <fardin@uber.com>
Added travis CI test for horovodrun
Signed-off-by: fardin <fardin@uber.com>
renamed files and added quote
Signed-off-by: fardin <fardin@uber.com>
Added quotes around command
Signed-off-by: fardin <fardin@uber.com>
realigned code
Signed-off-by: fardin <fardin@uber.com>
fixed shlex import
Signed-off-by: fardin <fardin@uber.com>
fixed travis CI test
Signed-off-by: fardin <fardin@uber.com>
merged stdout and stderr
Signed-off-by: fardin <fardin@uber.com>
merged stdout and sterr
Signed-off-by: fardin <fardin@uber.com>
print("host output:\n {host_output}".format(
host_output=host_output.getvalue()))
"Launching horovodrun task function was not "
"successful:{host_output}"

This comment has been minimized.

Copy link
@alsrgv

alsrgv Mar 16, 2019

Collaborator

Add \n after :

This comment has been minimized.

Copy link
@abditag2

abditag2 Mar 16, 2019

Author Collaborator

added

host=host_addresses[index]))
print("stderr:\n{stderr}\nstdout:\n{stdout}".format(stdout=stdout,
stderr=stderr))
print("ssh not successful for host {host}:{msg_output}".format(

This comment has been minimized.

Copy link
@alsrgv

alsrgv Mar 16, 2019

Collaborator

Add \n after :

This comment has been minimized.

Copy link
@abditag2

abditag2 Mar 16, 2019

Author Collaborator

added

# --disable-cache flag.
fn_cache = None
if not args.disable_cache:
parameters_hash = str(hash(str(args.host) + str(args.np)))

This comment has been minimized.

Copy link
@alsrgv

alsrgv Mar 16, 2019

Collaborator

Using hash() in Python3 for this purpose won't work because it's randomized: https://stackoverflow.com/a/17192658. We should switch to use consistent hashing mechanism, maybe similar to what we use in host_hash().

We should add a separator between parameters components for cleanliness. Should we add .ssh_port to the hash as well?

This comment has been minimized.

Copy link
@abditag2

abditag2 Mar 16, 2019

Author Collaborator

interesting. Fixed it.


def use_cache():
"""
If decorates a function, and if cache_disabled is set, it will store the

This comment has been minimized.

Copy link
@alsrgv

alsrgv Mar 16, 2019

Collaborator

Should it say "if fn_cache is set"?

This comment has been minimized.

Copy link
@alsrgv

alsrgv Mar 16, 2019

Collaborator

Also, should we move this to cache.py?

This comment has been minimized.

Copy link
@abditag2

abditag2 Mar 16, 2019

Author Collaborator

done

fixed comments
Signed-off-by: fardin <fardin@uber.com>
Show resolved Hide resolved horovod/run/run.py

abditag2 added some commits Mar 16, 2019

fixed horovodrun test
Signed-off-by: fardin <fardin@uber.com>
filtered remote hosts
Signed-off-by: fardin <fardin@uber.com>
fixed style
Signed-off-by: fardin <fardin@uber.com>
Show resolved Hide resolved horovod/run/run.py Outdated
Show resolved Hide resolved horovod/run/run.py Outdated
Show resolved Hide resolved horovod/run/util/cache.py
Show resolved Hide resolved horovod/run/util/network.py Outdated
Show resolved Hide resolved horovod/run/util/network.py Outdated
Show resolved Hide resolved horovod/run/run.py Outdated

abditag2 added some commits Mar 17, 2019

included local host in the ring interface check
Signed-off-by: fardin <fardin@uber.com>
fixed travis
Signed-off-by: fardin <fardin@uber.com>
fixed failing unittest
Signed-off-by: fardin <fardin@uber.com>
encoded params before hashing
Signed-off-by: fardin <fardin@uber.com>
@alsrgv
Copy link
Collaborator

alsrgv left a comment

Can we ignore horovodrun test for MPICH? Currently, it fails with:

[mpiexec@d3ef27906d44] match_arg (utils/args/args.c:159): unrecognized argument allow-run-as-root
[mpiexec@d3ef27906d44] HYDU_parse_array (utils/args/args.c:174): argument matching returned error
[mpiexec@d3ef27906d44] parse_args (ui/mpich/utils.c:1596): error parsing input array
[mpiexec@d3ef27906d44] HYD_uii_mpx_get_parameters (ui/mpich/utils.c:1648): unable to parse user arguments
[mpiexec@d3ef27906d44] main (ui/mpich/mpiexec.c:153): error parsing parameters

Also, could you check MPI version during the run via mpirun --version and if the output does not contain Open MPI, show an error:

horovodrun convenience script currently only supports Open MPI.

Choose one of:
1. Install Open MPI 4.0.0+ and re-install Horovod (use --no-cache-dir pip option).
2. Run distributed training script using the standard way provided by your MPI distribution (usually mpirun, srun, or jsrun).
Show resolved Hide resolved horovod/run/run.py Outdated
Show resolved Hide resolved horovod/run/run.py Outdated
Show resolved Hide resolved horovod/run/run.py Outdated
excluded mpich
Signed-off-by: fardin <fardin@uber.com>
if not _is_open_mpi_installed():
raise Exception(
'horovodrun convenience script currently only supports '
'Open MPI.\n'

This comment has been minimized.

Copy link
@alsrgv

alsrgv Mar 17, 2019

Collaborator

Add one extra newline for visual clarify of paragraphs?

abditag2 added some commits Mar 17, 2019

printed error with mpirun --version
Signed-off-by: fardin <fardin@uber.com>
better exception print
Signed-off-by: fardin <fardin@uber.com>
@alsrgv

alsrgv approved these changes Mar 17, 2019

Copy link
Collaborator

alsrgv left a comment

LGTM

@alsrgv

This comment has been minimized.

Copy link
Collaborator

alsrgv commented Mar 17, 2019

Failing tests are due to #919, good to merge.

@abditag2

This comment has been minimized.

Copy link
Collaborator Author

abditag2 commented Mar 18, 2019

🎆

@tgaddair
Copy link
Collaborator

tgaddair left a comment

Looks good! Just a few suggestions for the help function.

type=int,
help="SSH port on all the workers.")

parser.add_argument('-H', '--host', action="store", dest="host",

This comment has been minimized.

Copy link
@tgaddair

tgaddair Mar 18, 2019

Collaborator

Does this imply that the user doesn't specify the number of processes to run per host as they would in mpirun? In other words, users cannot launch non-homogenous jobs (jobs with differing numbers of GPUs per host)?

This comment has been minimized.

Copy link
@abditag2

abditag2 Mar 18, 2019

Author Collaborator

the host parameter is just similar to mpirun -H arg where they can specify the number of processes per each host e.g.: host-1:2,host-2:2,host-3:4

This comment has been minimized.

Copy link
@alsrgv

alsrgv Mar 18, 2019

Collaborator

We should assume user never heard of mpirun, so we need to convey that he needs to specify number of processes per host as well.

"initial checks and execute them every time.")
parser.add_argument('--horovod-start-timeout', action="store",
dest="start_timeout",
help="Horovodrun has to perform all the checks and and "

This comment has been minimized.

Copy link
@tgaddair

tgaddair Mar 18, 2019

Collaborator

Typo: "and and".

What is the default?

This comment has been minimized.

Copy link
@abditag2

abditag2 Mar 18, 2019

Author Collaborator

it is 600 seconds. It can also be specified using env variable HOROVOD_START_TIMEOUT

This comment has been minimized.

Copy link
@alsrgv

alsrgv Mar 18, 2019

Collaborator

Let's add to help string

type=int,
help="SSH port on all the workers.")

parser.add_argument('-H', '--host', action="store", dest="host",

This comment has been minimized.

Copy link
@tgaddair

tgaddair Mar 18, 2019

Collaborator

--hosts since it takes more than one? Is there any default if not specified (like localhost)?

This comment has been minimized.

Copy link
@abditag2

abditag2 Mar 18, 2019

Author Collaborator

If no hostname is specified, mpirun command uses localhost by default and runs as many processes as given in -np argument on localhost.


def use_cache():
"""
If decorates a function and if fn_cache is set, it will store the

This comment has been minimized.

Copy link
@tgaddair

tgaddair Mar 18, 2019

Collaborator

Typo: If decorates a function -> If used to decorate a function

@abditag2 abditag2 merged commit f3b1186 into horovod:master Mar 18, 2019

1 check failed

continuous-integration/travis-ci/pr The Travis CI build failed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.