Add LSF and jsrun support to horovodrun #1805

Merged: 3 commits, Mar 24, 2020

Conversation

Collaborator

nvcastet commented Mar 19, 2020

Example of running on an LSF cluster (e.g. Summit):
horovodrun python train.py

This change also performs CPU/memory process binding to get the best performance.

Contributors:
@bethune-bryant
@nvcastet

Signed-off-by: Nicolas V Castet <nvcastet@us.ibm.com>

@nvcastet nvcastet requested review from tgaddair and romerojosh Mar 19, 2020
@nvcastet nvcastet force-pushed the nvcastet:support_lsf branch 3 times, most recently from 0b26b39 to 2359d2f Mar 19, 2020
Collaborator

tgaddair left a comment

LGTM! Just a minor nit.

print('Testing interfaces on all the hosts.')
common_intfs = None
# Skipping interface discovery for LSF cluster as it slows down considerably the job start
if not lsf.LSFUtils.using_lsf():


tgaddair Mar 20, 2020

Collaborator

This block is getting pretty heavily nested. Maybe we can pull this out into a utility function _get_common_interfaces, where the first couple lines are a precondition check:

if lsf.LSFUtils.using_lsf():
    return None

What do you think?


nvcastet Mar 20, 2020

Author Collaborator

Yes, I agree; the code will be cleaner. I will do that.

@nvcastet nvcastet force-pushed the nvcastet:support_lsf branch 3 times, most recently from 68b6927 to cd9a522 Mar 23, 2020
nvcastet added 3 commits Mar 11, 2020
Signed-off-by: Nicolas V Castet <nvcastet@us.ibm.com>
@nvcastet nvcastet force-pushed the nvcastet:support_lsf branch from cd9a522 to f168999 Mar 23, 2020
Collaborator

tgaddair left a comment

LGTM! Thanks for the refactor, as well.

@tgaddair tgaddair merged commit 58e7de0 into horovod:master Mar 24, 2020
5 checks passed:
build
build
DCO
buildkite/horovod/pr: Build #2299 passed (1 hour, 13 minutes, 42 seconds)
ppc64le-checks: ppc64le Build/Tests Passed