Add horovod.run.run to make horovod notebook friendly (new impl) #1307

Merged on Oct 24, 2019 (35 commits)

@WeichenXu123 WeichenXu123 commented Aug 15, 2019

Proposed API:

def run(
        func,
        args=(),
        kwargs={},
        np=1,
        hosts=None,
        hostfile=None,
        start_timeout=None,
        ssh_port=None,
        disable_cache=None,
        output_filename=None,
        verbose=None,
        use_gloo=None,
        use_mpi=None)

I launch an HTTP server on the driver side, so that:

  • remote processes can read the function to execute from the server
  • remote processes can put their return values into the server.

I integrated the code into the current run code.
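To make the idea concrete, here is a minimal sketch of such a driver-side key-value HTTP server, not the code in this PR. The names `KVHandler`, `start_server`, `put` and `get` are hypothetical, and the standard-library `pickle` stands in for whatever serializer the real implementation uses (plain `pickle` only handles functions that are importable by name).

```python
import http.client
import pickle
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class KVHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: PUT stores bytes under the URL path, GET reads them back."""

    def do_PUT(self):
        length = int(self.headers.get('Content-Length', 0))
        self.server.store[self.path] = self.rfile.read(length)
        self.send_response(200)
        self.send_header('Content-Length', '0')
        self.end_headers()

    def do_GET(self):
        value = self.server.store.get(self.path)
        if value is None:
            self.send_response(404)
            self.send_header('Content-Length', '0')
            self.end_headers()
            return
        self.send_response(200)
        self.send_header('Content-Length', str(len(value)))
        self.end_headers()
        self.wfile.write(value)

    def log_message(self, fmt, *args):
        pass  # keep notebook output quiet


def start_server():
    """Start the server on a free local port in a background thread."""
    server = HTTPServer(('127.0.0.1', 0), KVHandler)
    server.store = {}
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def put(port, key, obj):
    """Pickle obj and store it under key."""
    conn = http.client.HTTPConnection('127.0.0.1', port)
    conn.request('PUT', '/' + key, body=pickle.dumps(obj))
    conn.getresponse().read()
    conn.close()


def get(port, key):
    """Fetch and unpickle the value stored under key."""
    conn = http.client.HTTPConnection('127.0.0.1', port)
    conn.request('GET', '/' + key)
    resp = conn.getresponse()
    data = resp.read()
    conn.close()
    if resp.status != 200:
        raise KeyError(key)
    return pickle.loads(data)


if __name__ == '__main__':
    server = start_server()
    port = server.server_address[1]
    put(port, 'func', len)             # driver publishes the function
    fn = get(port, 'func')             # a worker reads it ...
    put(port, 'result/0', fn('abc'))   # ... and reports its return value
    print(get(port, 'result/0'))
    server.shutdown()
    server.server_close()
```

In this sketch the driver would read `result/<rank>` for every rank once the job finishes; the key layout is an assumption made for illustration.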

@alsrgv alsrgv requested a review from tgaddair August 28, 2019 21:29
@WeichenXu123 WeichenXu123 changed the title [WIP] Add horovod.run.run_func to make horovod notebook friendly (new impl) Add horovod.run.run_func to make horovod notebook friendly (new impl) Aug 29, 2019
@WeichenXu123 WeichenXu123 force-pushed the issue1176_3 branch 2 times, most recently from 67947d0 to 5618f8e Compare August 29, 2019 16:45
@WeichenXu123
Contributor Author

I updated the PR to address some of the comments.
I will add tests later, but it is ready for a first-pass review.

@tgaddair
Collaborator

Since the rendezvous server has been made more generic, I would move it out of the horovod.run.rendezvous package into either horovod.run or horovod.run.http.

@WeichenXu123 WeichenXu123 changed the title Add horovod.run.run_func to make horovod notebook friendly (new impl) [WIP] Add horovod.run.run_func to make horovod notebook friendly (new impl) Aug 30, 2019
@WeichenXu123
Contributor Author

@tgaddair I added a HorovodArgs structure to encapsulate all the configs.

@WeichenXu123 WeichenXu123 changed the title [WIP] Add horovod.run.run_func to make horovod notebook friendly (new impl) Add horovod.run.run_func to make horovod notebook friendly (new impl) Aug 30, 2019
@WeichenXu123
Contributor Author

@tgaddair The Buildkite system seems to have some issues; builds sometimes fail.

@tgaddair
Collaborator

> @tgaddair Seemingly the buildkite system has some issue. Sometimes building failed.

Hey @WeichenXu123, can you rebase onto master? There was a change made to pytorch-nightly.

Signed-off-by: WeichenXu <weichen.xu@databricks.com>
@WeichenXu123 WeichenXu123 force-pushed the issue1176_3 branch 4 times, most recently from 2e0c976 to 28f6fe0 Compare September 2, 2019 03:03
Signed-off-by: WeichenXu <weichen.xu@databricks.com>
@WeichenXu123
Contributor Author

@tgaddair I addressed most of your comments; only the "ret value" one remains. We first need to resolve the issue I commented on above: #1307 (comment)

Signed-off-by: WeichenXu <weichen.xu@databricks.com>
@WeichenXu123
Contributor Author

@tgaddair I added back the MPICH test; running test_interactiverun.py then raises an error like:

test_interactiverun.py:68:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/local/lib/python3.6/dist-packages/horovod/run/run.py:922: in run
    return _run(hargs)
/usr/local/lib/python3.6/dist-packages/horovod/run/run.py:805: in _run
    _launch_job(args, remote_host_names, settings, common_intfs, command)
/usr/local/lib/python3.6/dist-packages/horovod/run/run.py:832: in _launch_job
    mpi_run(settings, common_intfs, env, command)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

settings = <horovod.run.common.util.settings.Settings object at 0x7f7b47467860>
common_intfs = {'lo'}
env = {'HOME': '/root', 'HOROVOD_CACHE_CAPACITY': '(None,)', 'HOROVOD_CYCLE_TIME': '(None,)', 'HOSTNAME': 'be96015d6083', ...}
command = ['/usr/bin/python', '-m', 'horovod.run.run_task', '127.0.0.1', '50171']

    def mpi_run(settings, common_intfs, env, command):
        mpi_impl_flags = _get_mpi_implementation_flags()
        if mpi_impl_flags is None:
            raise Exception(
>               'horovodrun convenience script does not find an installed OpenMPI.\n\n'
                'Choose one of:\n'
                '1. Install Open MPI 4.0.0+ or IBM Spectrum MPI and re-install Horovod '
                '(use --no-cache-dir pip option).\n'
                '2. Run distributed '
                'training script using the standard way provided by your'
                ' MPI distribution (usually mpirun, srun, or jsrun).\n'
                '3. Use built-in gloo option (horovodrun --gloo ...).')
E           Exception: horovodrun convenience script does not find an installed OpenMPI.
E
E           Choose one of:
E           1. Install Open MPI 4.0.0+ or IBM Spectrum MPI and re-install Horovod (use --no-cache-dir pip option).
E           2. Run distributed training script using the standard way provided by your MPI distribution (usually mpirun, srun, or jsrun).
E           3. Use built-in gloo option (horovodrun --gloo ...).

/usr/local/lib/python3.6/dist-packages/horovod/run/mpi_run.py:65: Exception
______________________ InteractiveRunTests.test_happy_run ______________________

I think the code here needs to be updated:

mpi_impl_flags = _get_mpi_implementation_flags()

Current master only supports Open MPI and IBM Spectrum MPI, and the current master tests do not cover this code path.
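For illustration, a detection step that also recognizes MPICH could look roughly like the sketch below. This is not the body of `_get_mpi_implementation_flags`; `classify_mpi` and `detect_mpi` are hypothetical names, and the marker strings are assumptions about what each implementation prints for `mpirun --version`.

```python
import subprocess


def classify_mpi(version_output):
    """Guess the MPI flavor from the text printed by `mpirun --version`.

    The marker strings are assumptions, not taken from this PR.
    """
    if 'Open MPI' in version_output or 'OpenRTE' in version_output:
        return 'openmpi'
    if 'IBM Spectrum MPI' in version_output:
        return 'spectrum'
    if 'HYDRA' in version_output or 'MPICH' in version_output:
        return 'mpich'
    return None


def detect_mpi(mpirun='mpirun'):
    """Run `mpirun --version` and classify it; None if mpirun cannot be run."""
    try:
        proc = subprocess.run([mpirun, '--version'],
                              stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT,
                              universal_newlines=True,
                              timeout=10)
    except (OSError, subprocess.SubprocessError):
        return None
    return classify_mpi(proc.stdout)
```

Keeping the string matching in a separate pure function means that code path can be unit tested without any MPI installed.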

Signed-off-by: WeichenXu <weichen.xu@databricks.com>
@WeichenXu123
Contributor Author

@tgaddair See my other PR, which runs the MPICH test with the log printed: #1455. It shows that MPICH does not support --allow-run-as-root:

E               RuntimeError: mpirun failed with exit code 255, stdout
E
E               , stderr
E               [mpiexec@e099377ab812] match_arg (utils/args/args.c:159): unrecognized argument allow-run-as-root
E               [mpiexec@e099377ab812] HYDU_parse_array (utils/args/args.c:174): argument matching returned error
E               [mpiexec@e099377ab812] parse_args (ui/mpich/utils.c:1597): error parsing input array
E               [mpiexec@e099377ab812] HYD_uii_mpx_get_parameters (ui/mpich/utils.c:1649): unable to parse user arguments
E               [mpiexec@e099377ab812] main (ui/mpich/mpiexec.c:149): error parsing parameters

/usr/local/lib/python3.6/dist-packages/horovod/run/mpi_run.py:134: RuntimeError
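One way to avoid this failure would be to add the flag only for MPI flavors that accept it. The sketch below is an assumption about how that could be structured, not the command Horovod actually builds; `build_mpirun_command` and the flavor strings are hypothetical.

```python
def build_mpirun_command(mpi_flavor, num_proc, command):
    """Assemble an mpirun argument list for the given (hypothetical) flavor name.

    MPICH's mpiexec rejects --allow-run-as-root, as the log above shows,
    so the flag is left out for that flavor.
    """
    args = ['mpirun']
    if mpi_flavor != 'mpich':
        args.append('--allow-run-as-root')
    args += ['-np', str(num_proc)]
    args += list(command)
    return args


if __name__ == '__main__':
    print(build_mpirun_command('mpich', 2, ['python', 'train.py']))
```

Whether any other Open MPI specific flags would need the same treatment under MPICH is not covered by the log above and would need separate testing.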

Signed-off-by: WeichenXu <weichen.xu@databricks.com>
@tgaddair
Collaborator

Good catch @WeichenXu123. Until we've done more testing with MPICH, how about we skip the interactive run test unless we're using OpenMPI similar to how we do for Spark?

# Seems that spark tests depend on MPI, do not test those when mpi is not available
  local exclude_spark_if_needed=""
  if [[ ${test} != *"mpi"* ]]; then
    exclude_spark_if_needed="| sed 's/[a-z_]*spark[a-z_.]*//g'"
  fi

Signed-off-by: WeichenXu <weichen.xu@databricks.com>
@WeichenXu123
Contributor Author

Ready.
@tgaddair Before merging, let's wait for @mengxr to take a final look.

Collaborator

@tgaddair tgaddair left a comment

LGTM! We'll land once @mengxr takes a look.

Contributor

@mengxr mengxr left a comment

I made one pass over the API and implementation. I don't know why some changes belong to this PR but the run implementation looks good in general.

horovod/run/run.py

hargs = HorovodArgs()

hargs.np = np
Contributor

It would be simpler to define HorovodArgs with namedtuple.

Contributor Author

The HorovodArgs constructor defines many args with a default value of None, and in run we only set a few of them. If we used namedtuple, we would need to write out the complete argument list twice, like:

HorovodArgs = namedtuple('HorovodArgs', ['arg1', 'arg2', ..., 'arg100'])
hargs = HorovodArgs(arg1=XXX, arg2=XXX, ..., arg100=XXX)

but for most of the args we only need to keep the default value None.
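The trade-off can be sketched with a handful of fields. The class and field subset below are hypothetical stand-ins chosen from the proposed `run` signature, not the real HorovodArgs definition.

```python
from collections import namedtuple


class HorovodArgsSketch(object):
    """Hypothetical stand-in: every field starts out as None."""

    def __init__(self):
        self.np = None
        self.hosts = None
        self.hostfile = None
        self.ssh_port = None
        self.verbose = None


# Plain class: set only what is needed, the rest stays None.
hargs = HorovodArgsSketch()
hargs.np = 2

# namedtuple without defaults: the field list appears once in the
# definition and again, in full, at the construction site.
FIELDS = ['np', 'hosts', 'hostfile', 'ssh_port', 'verbose']
ArgsTuple = namedtuple('ArgsTuple', FIELDS)
targs = ArgsTuple(np=2, hosts=None, hostfile=None, ssh_port=None, verbose=None)

# A common idiom that gives every field a None default; the `defaults=`
# keyword of namedtuple only exists from Python 3.7 on.
ArgsTuple.__new__.__defaults__ = (None,) * len(FIELDS)
targs_short = ArgsTuple(np=2)
```

The idiom in the last step shortens construction, but the resulting object is still immutable, which matters if later code assigns to its attributes.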

Collaborator

Unfortunately, in the current design we cannot use namedtuple because the downstream _run function modifies some of the object's values (therefore, we cannot assume immutability). We cannot change the behavior to use namedtuple._replace as it is currently designed because the command line entrypoint uses the same code path, but passes in a parsed arguments object (here @WeichenXu123 is relying on Python's duck typing to get around this interface overloading).

Long term, I think we should consolidate these two code paths (functional vs. command line) to use the same object to represent materialized argument values. But I don't think it's necessary to do that in this PR since it's an implementation detail not exposed to the end user.
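A small sketch of the constraint described above: `_run_sketch` is a hypothetical stand-in for a downstream function that fills in a value on the object it receives, and the value 30 is made up for the example.

```python
import argparse
from collections import namedtuple


def _run_sketch(args):
    """Hypothetical downstream function that mutates the args object in place."""
    if args.start_timeout is None:
        args.start_timeout = 30  # made-up default, for illustration only
    return args.start_timeout


class MutableArgs(object):
    """Plain mutable container, as used by the functional entry point."""

    def __init__(self):
        self.start_timeout = None


FrozenArgs = namedtuple('FrozenArgs', ['start_timeout'])


def try_run(args):
    """Return the timeout, or 'immutable' if the object rejects assignment."""
    try:
        return _run_sketch(args)
    except AttributeError:
        return 'immutable'


if __name__ == '__main__':
    print(try_run(MutableArgs()))                           # plain class
    print(try_run(argparse.Namespace(start_timeout=None)))  # parsed CLI args
    print(try_run(FrozenArgs(start_timeout=None)))          # namedtuple
```

Both the plain class and `argparse.Namespace` satisfy the same duck-typed interface, while the namedtuple raises `AttributeError` on assignment.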

Signed-off-by: WeichenXu <weichen.xu@databricks.com>
Collaborator

@tgaddair tgaddair left a comment

Looks good! Let's merge it.

@tgaddair tgaddair merged commit 9fc256d into horovod:master Oct 24, 2019
@WeichenXu123 WeichenXu123 deleted the issue1176_3 branch October 25, 2019 01:48
jeffdaily pushed a commit to ROCm/horovod that referenced this pull request Nov 27, 2019
Signed-off-by: WeichenXu <weichen.xu@databricks.com>
DelphianCalamity pushed a commit to DelphianCalamity/horovod that referenced this pull request Apr 18, 2020
Signed-off-by: WeichenXu <weichen.xu@databricks.com>