Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add -hostfile option to horovodrun #1243

Merged
merged 8 commits into from Aug 3, 2019

Conversation

@apeforest
Copy link
Collaborator

commented Jul 22, 2019

Enable the same --hostfile option in horovodrun as in mpirun.
Users can specify a list of hosts in the hostfile as:

slots=<num_procs>
slots=<num_procs>
...

Tested on a GPU cluster with 8 nodes.

This PR fix #1231

@apeforest apeforest force-pushed the apeforest:feature/horovodrun-hostfile branch 2 times, most recently from 2cc6356 to 495d9fa Jul 22, 2019

@apeforest apeforest force-pushed the apeforest:feature/horovodrun-hostfile branch from 495d9fa to 7069e9c Jul 22, 2019

@alsrgv alsrgv requested review from sblotner and alsrgv Jul 22, 2019

@alsrgv
Copy link
Member

left a comment

LGTM with a couple of comments.
@sblotner, could you take a look at doc wording as well?

Show resolved Hide resolved .buildkite/gen-pipeline.sh Outdated
Show resolved Hide resolved docs/running.rst Outdated

@alsrgv alsrgv requested review from abditag2 and tgaddair Jul 22, 2019

@alsrgv

This comment has been minimized.

Copy link
Member

commented Jul 22, 2019

cc @abditag2 who wrote horovodrun

@sblotner
Copy link
Collaborator

left a comment

Added some edits for the docs

Show resolved Hide resolved docs/running.rst Outdated
Show resolved Hide resolved docs/running.rst
Show resolved Hide resolved docs/running.rst
Show resolved Hide resolved docs/running.rst Outdated
Show resolved Hide resolved docs/running.rst Outdated
Show resolved Hide resolved docs/running.rst Outdated
@abditag2

This comment has been minimized.

Copy link
Collaborator

commented Jul 22, 2019

Please fix or update the comments and docs in the following lines with the new behavior:

# if user does not specify any hosts, mpirun by default uses local host.

^^ Mention that if none of --host or --host-file is given, localhost will be used.

$ cat myhostfile
aa slots=2
bb slots=2
cc slots=2

This comment has been minimized.

Copy link
@abditag2

abditag2 Jul 22, 2019

Collaborator

[optional] I would suggest using the same format of the --host option for the host file as well i.e., each line could be

$cat myhostfile
host1:np1
host2:np2
...

This comment has been minimized.

Copy link
@apeforest

apeforest Jul 30, 2019

Author Collaborator
Show resolved Hide resolved horovod/run/run.py Outdated
@abditag2

This comment has been minimized.

Copy link
Collaborator

commented Jul 22, 2019

@apeforest apeforest force-pushed the apeforest:feature/horovodrun-hostfile branch 3 times, most recently from 427e343 to 06fb269 Jul 30, 2019

@apeforest

This comment has been minimized.

Copy link
Collaborator Author

commented Jul 30, 2019

@abditag2 @sblotner Addressed your comments. Please review again. thanks.

@sblotner
Copy link
Collaborator

left a comment

A few more editorial comments

Show resolved Hide resolved docs/running.rst Outdated
Show resolved Hide resolved docs/running.rst Outdated
Show resolved Hide resolved docs/running.rst Outdated
Show resolved Hide resolved docs/running.rst Outdated

@apeforest apeforest force-pushed the apeforest:feature/horovodrun-hostfile branch from 2016dab to 9026493 Jul 31, 2019

Show resolved Hide resolved docs/running.rst Outdated
@abditag2
Copy link
Collaborator

left a comment

LGTM. Please address @sblotner 's comments.

@apeforest apeforest force-pushed the apeforest:feature/horovodrun-hostfile branch 2 times, most recently from a8057c7 to dfcde12 Jul 31, 2019

@alsrgv

alsrgv approved these changes Aug 2, 2019

Copy link
Member

left a comment

LGTM, one minor comment. Can you rebase on the latest master for the integration tests to pass?

Show resolved Hide resolved docs/running.rst Outdated

apeforest added some commits Jul 22, 2019

add hostfile option to horovodrun
Signed-off-by: Lin Yuan <apeforest@gmail.com>
check remote host names
Signed-off-by: Lin Yuan <apeforest@gmail.com>
add test and docs for -hostfile option
Signed-off-by: Lin Yuan <apeforest@gmail.com>
address reviewer commnet
Signed-off-by: Lin Yuan <apeforest@gmail.com>
fix lint
Signed-off-by: Lin Yuan <apeforest@gmail.com>

apeforest added some commits Jul 31, 2019

address reviewer comment
Signed-off-by: Lin Yuan <apeforest@gmail.com>
address reviewer comment
Signed-off-by: Lin Yuan <apeforest@gmail.com>

@apeforest apeforest force-pushed the apeforest:feature/horovodrun-hostfile branch from 7db46d0 to 77fba3d Aug 2, 2019

address comment
Signed-off-by: Lin Yuan <apeforest@gmail.com>

@apeforest apeforest force-pushed the apeforest:feature/horovodrun-hostfile branch from 77fba3d to 20fd1a3 Aug 2, 2019

@alsrgv

This comment has been minimized.

Copy link
Member

commented Aug 3, 2019

cc @zsh-thu note to merge this fix in your branch

@alsrgv alsrgv merged commit a72bc96 into horovod:master Aug 3, 2019

2 checks passed

DCO DCO
Details
buildkite/horovod/pr Build #876 passed (1 hour, 5 minutes, 19 seconds)
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
4 participants
You can’t perform that action at this time.