
Add support for horovodrun on kubeflow/mpi-job #2199

Merged: 1 commit merged into horovod:master on Aug 25, 2020

Conversation

@zw0610 (Contributor) commented on Aug 24, 2020

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

  1. Add is_kubeflow_mpi (renamed from is_mpi_job) under horovod/runner/common/util/env.py (originally horovod/run/common/util/env.py on the v0.19.3 branch) to determine whether the environment was set up by kubeflow/mpi-operator (sketched below).
  2. For every shell command launched over ssh, add a kubeflow/mpi-job variant that uses /etc/mpi/kubexec.sh instead.
  3. Wrap get_ssh_command with get_remote_command to support other remote shell tools.

Fixes #2198.
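For reference, a minimal sketch of the detection described in item 1, based only on the OMPI_MCA_plm_rsh_agent convention used by kubeflow/mpi-operator; the helper's exact name and location in the merged code follow the review discussion below:

import os


def is_kubeflow_mpi():
    # kubeflow/mpi-operator points Open MPI's rsh agent at its kubectl-based launcher
    # script, so seeing that script in the environment signals an MPIJob setup.
    return os.environ.get('OMPI_MCA_plm_rsh_agent') == '/etc/mpi/kubexec.sh'


if __name__ == '__main__':
    # Prints True inside a kubeflow/mpi-operator launcher pod, False on a plain ssh cluster.
    print(is_kubeflow_mpi())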

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

@zw0610 (Contributor, Author) commented on Aug 24, 2020

/cc @gaocegege @terrytangyuan @carmark

@tgaddair (Collaborator) commented:

Hey @zw0610, can you rebase off of master? It looks like this branch was forked off the v0.19.3 branch. If you apply your change to the HEAD of master, it should resolve the merge conflicts.

@@ -45,3 +45,11 @@ def get_env_rank_and_size():
    # Default to rank zero and size one if there are no environment variables
    return 0, 1


def is_mpi_job():
Collaborator:

I would suggest renaming to is_kubeflow_mpi for clarity.

Contributor (Author):

sure


def is_mpi_job():
    rsh_agent = os.environ.get('OMPI_MCA_plm_rsh_agent')
    if rsh_agent is not None and rsh_agent == '/etc/mpi/kubexec.sh':
Collaborator:

Can simplify to just:

if rsh_agent == '/etc/mpi/kubexec.sh':

Contributor (Author):

done

index=codec.dumps_base64(index),
driver_addresses=codec.dumps_base64(driver_addresses),
settings=codec.dumps_base64(settings))
if env_util.is_mpi_job():
Collaborator:

It looks like there's an opportunity to make this more generic and reduce code duplication.

First, I would suggest adding a helper function to env.py:

def get_remote_command(local_command, host, ssh_port_arg=None):
    if is_kubeflow_mpi():
        return f'/etc/mpi/kubexec.sh {host} {local_command}'
    else:
        ssh_port_arg = ssh_port_arg or ''
        return f'ssh -o StrictHostKeyChecking=no {host} {ssh_port_arg} {local_command}'

Then you can refactor this and the other sections of code:

local_command = '\'{python} -m horovod.run.task_fn {index} {driver_addresses}' \
                    ' {settings}\'' \
                        .format(
                                python=sys.executable,
                                index=codec.dumps_base64(index),
                                driver_addresses=codec.dumps_base64(driver_addresses),
                                settings=codec.dumps_base64(settings))
command = get_remote_command(local_command, host_name, ssh_port_arg)

What do you think?

Contributor (Author):

I noticed that on the master branch, get_ssh_command is defined under horovod/runner/util/remote.py. We can further reduce the duplication by adding a get_remote_command there. It also reduces the modifications to the rest of the scripts.

Please take a look at the implementation I just pushed.
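For readers following along, a rough sketch of what such a wrapper in horovod/runner/util/remote.py could look like; get_ssh_command is stubbed here only to keep the sketch self-contained, and the flags and signatures of the merged helpers may differ:

import os


def get_ssh_command(local_command, host, port=None):
    # Stand-in for the existing helper on master; the real flags may differ.
    port_arg = f'-p {port}' if port is not None else ''
    return f'ssh -o StrictHostKeyChecking=no {host} {port_arg} {local_command}'


def get_remote_command(local_command, host, port=None):
    # Under kubeflow/mpi-operator, remote processes are spawned through kubexec.sh
    # (the same check as is_kubeflow_mpi above); otherwise fall back to plain ssh.
    kubexec = '/etc/mpi/kubexec.sh'
    if os.environ.get('OMPI_MCA_plm_rsh_agent') == kubexec:
        return f'{kubexec} {host} {local_command}'
    return get_ssh_command(local_command, host, port)


# Example: get_remote_command('hostname', 'worker-0') yields an ssh command on a plain
# cluster and a kubexec.sh command inside a kubeflow MPIJob launcher pod.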

Collaborator:

Thanks! Yes, I just implemented get_ssh_command yesterday for a different issue, and thought it would help to go ahead and design it in a way that fits into this PR. I'll take a look.

@zw0610 (Contributor, Author) commented on Aug 25, 2020

> Hey @zw0610, can you rebase off of master? It looks like this branch was forked off the v0.19.3 branch. If you apply your change to the HEAD of master, it should resolve the merge conflicts.

Right, this PR was forked off v0.19.3. Instead of rebasing, I re-applied the changes on top of the master branch, with your suggestions adopted.

Signed-off-by: Wang Zhang <zw199006@gmail.com>
@zw0610 changed the title from "[WIP] Add support for horovodrun on kubeflow/mpi-job" to "Add support for horovodrun on kubeflow/mpi-job" on Aug 25, 2020
@gaocegege left a comment:

/LGTM

@zw0610 Thanks for the PR!

/cc @kuikuikuizzZ

@terrytangyuan (Contributor) left a comment:

LGTM. Thanks!

@tgaddair (Collaborator) left a comment:

LGTM! Very clean set of changes. Thanks for the PR!

@tgaddair merged commit 70ff27d into horovod:master on Aug 25, 2020
2 checks passed
Labels: none yet
Development: successfully merging this pull request may close the linked issue "support other shell tools in gloo run".
4 participants