New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for horovodrun on kubeflow/mpi-job #2199
Conversation
c1095e1
to
8b03038
Compare
Hey @zw0610, can you rebase off of master? It looks like this branch was forked off the v0.19.3 branch. If you apply your change to the HEAD of master, it should resolve the merge conflicts. |
horovod/run/common/util/env.py
Outdated
@@ -45,3 +45,11 @@ def get_env_rank_and_size(): | |||
# Default to rank zero and size one if there are no environment variables | |||
return 0, 1 | |||
|
|||
|
|||
def is_mpi_job(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest renaming to is_kubeflow_mpi
for clarity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
horovod/run/common/util/env.py
Outdated
|
||
def is_mpi_job(): | ||
rsh_agent = os.environ.get('OMPI_MCA_plm_rsh_agent') | ||
if rsh_agent is not None and rsh_agent == '/etc/mpi/kubexec.sh': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can simplify to just:
if rsh_agent == '/etc/mpi/kubexec.sh':
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
horovod/run/driver/driver_service.py
Outdated
index=codec.dumps_base64(index), | ||
driver_addresses=codec.dumps_base64(driver_addresses), | ||
settings=codec.dumps_base64(settings)) | ||
if env_util.is_mpi_job(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like there's an opportunity to make this more generic and reduce code duplication.
First, I would suggest adding a helper function to env.py
:
def get_remote_command(local_command, host, ssh_port_arg=None):
if is_kubeflow_mpi():
return f'/etc/mpi/kubexec.sh {host} {local_command}'
else:
ssh_port_arg = ssh_port_arg or ''
return f'ssh -o StrictHostKeyChecking=no {host} {ssh_port_arg} {local_command}'
Then you can refactor this and the other sections of code:
local_command = '\'{python} -m horovod.run.task_fn {index} {driver_addresses}' \
' {settings}\'' \
.format(
python=sys.executable,
index=codec.dumps_base64(index),
driver_addresses=codec.dumps_base64(driver_addresses),
settings=codec.dumps_base64(settings))
command = get_remote_command(command, host_name, ssh_port_arg)
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed that on the master branch, get_ssh_command
is defined under horovod/runner/util/remote.py
. We can further reduce the duplication by adding a get_remote_command
there. It also reduce the modification to the rest of scripts.
Please take a look at the implementation I just pushed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Yes, I just implemented get_ssh_command
yesterday for a different issue, and thought it would help to go ahead and design it in a way the fit into this PR. I'll take a look.
8b03038
to
7da5c8e
Compare
right, this pr is forked off v0.19.3. Instead of rebasing, I added modifications to the master branch, with your suggestions adopted. |
Signed-off-by: Wang Zhang <zw199006@gmail.com>
7da5c8e
to
75f5253
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Very clean set of changes. Thanks for the PR!
Checklist before submitting
Description
is_mpi_job
is_kubeflow_mpi
underhorovod/run/common/util/env.py
horovod/runner/common/util/env.py
for determining whether the environment was setup by kubeflow/mpi-operator2. for all shell command withssh
, add revision for mpi-job with/etc/mpi/kubexec.sh
get_ssh_command
withget_remote_command
to support other shell tools.Fixes #2198.
Review process to land