Add `horovod.run.run` to make horovod notebook friendly (new impl) #1307

WeichenXu123 · 2019-08-15T16:03:26Z

Proposed API:

def run(
        func,
        args=(),
        kwargs={},
        np=1,
        hosts=None,
        hostfile=None,
        start_timeout=None,
        ssh_port=None,
        disable_cache=None,
        output_filename=None,
        verbose=None,
        use_gloo=None,
        use_mpi=None)

I launch an HTTP server on the driver side, so that:

remote processes can read the function to execute from the server
remote processes can put their return values into the server.

I integrated the code into the current run code.
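
The driver-side server idea can be sketched with the Python standard library alone. This is a minimal illustration, not the code in this PR: the handler class, the `/func` and `/result/<rank>` paths, and the helper names are all assumptions made for the example, and threads stand in for remote processes. Plain `pickle` is used, which only handles importable functions; functions defined in a notebook would need a serializer such as cloudpickle.

```python
import pickle
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class KVHandler(BaseHTTPRequestHandler):
    """Hypothetical key-value handler: GET reads a key, PUT writes one."""

    def do_GET(self):
        value = self.server.store.get(self.path)
        if value is None:
            self.send_response(404)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Length", str(len(value)))
        self.end_headers()
        self.wfile.write(value)

    def do_PUT(self):
        length = int(self.headers.get("Content-Length", 0))
        self.server.store[self.path] = self.rfile.read(length)
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the demo quiet


# Ignore any proxy settings so requests to 127.0.0.1 stay local.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def worker(port, rank):
    """Stand-in for a remote process: fetch the function, run it, post the result."""
    base = "http://127.0.0.1:%d" % port
    with _opener.open(base + "/func") as resp:
        func, args, kwargs = pickle.loads(resp.read())
    payload = pickle.dumps(func(*args, **kwargs))
    req = urllib.request.Request(base + "/result/%d" % rank, data=payload, method="PUT")
    _opener.open(req).close()


def run_local_demo(func, args=(), kwargs=None, np=2):
    """Serve the pickled function, run np workers, and collect one result per rank."""
    server = HTTPServer(("127.0.0.1", 0), KVHandler)  # port 0: pick a free port
    server.store = {"/func": pickle.dumps((func, args, kwargs or {}))}
    threading.Thread(target=server.serve_forever, daemon=True).start()
    port = server.server_address[1]
    try:
        workers = [threading.Thread(target=worker, args=(port, r)) for r in range(np)]
        for t in workers:
            t.start()
        for t in workers:
            t.join()
        return [pickle.loads(server.store["/result/%d" % r]) for r in range(np)]
    finally:
        server.shutdown()
        server.server_close()


results = run_local_demo(pow, args=(2, 5), np=2)
```

Each worker only needs the server address and port, which matches the shape of the launched command seen later in this thread (`horovod.run.run_task 127.0.0.1 50171`).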

WeichenXu123 · 2019-08-29T16:47:28Z

I updated the PR to address some comments.
I will add tests later, but it is ready for a first-pass review.

tgaddair · 2019-08-29T21:00:28Z

Since the rendezvous server has been made more generic, I would move it out of the horovod.run.rendezvous package into either horovod.run or horovod.run.http.

WeichenXu123 · 2019-08-30T13:34:30Z

@tgaddair I added a HorovodArgs structure to encapsulate all the configs.

WeichenXu123 · 2019-08-30T16:15:00Z

@tgaddair The Buildkite system seems to have an issue; builds sometimes fail.

tgaddair · 2019-08-31T00:00:38Z

> @tgaddair The Buildkite system seems to have an issue; builds sometimes fail.

Hey @WeichenXu123, can you rebase onto master? There was a change made to pytorch-nightly.

WeichenXu123 · 2019-10-17T13:23:28Z

@tgaddair I addressed most of your comments. Only the "ret value" one remains; we first need to resolve the issue I commented on above: #1307 (comment)

WeichenXu123 · 2019-10-19T05:17:03Z

@tgaddair I added back the MPICH test. Running test_interactiverun.py then raises an error like:

test_interactiverun.py:68:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/local/lib/python3.6/dist-packages/horovod/run/run.py:922: in run
    return _run(hargs)
/usr/local/lib/python3.6/dist-packages/horovod/run/run.py:805: in _run
    _launch_job(args, remote_host_names, settings, common_intfs, command)
/usr/local/lib/python3.6/dist-packages/horovod/run/run.py:832: in _launch_job
    mpi_run(settings, common_intfs, env, command)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

settings = <horovod.run.common.util.settings.Settings object at 0x7f7b47467860>
common_intfs = {'lo'}
env = {'HOME': '/root', 'HOROVOD_CACHE_CAPACITY': '(None,)', 'HOROVOD_CYCLE_TIME': '(None,)', 'HOSTNAME': 'be96015d6083', ...}
command = ['/usr/bin/python', '-m', 'horovod.run.run_task', '127.0.0.1', '50171']

    def mpi_run(settings, common_intfs, env, command):
        mpi_impl_flags = _get_mpi_implementation_flags()
        if mpi_impl_flags is None:
            raise Exception(
>               'horovodrun convenience script does not find an installed OpenMPI.\n\n'
                'Choose one of:\n'
                '1. Install Open MPI 4.0.0+ or IBM Spectrum MPI and re-install Horovod '
                '(use --no-cache-dir pip option).\n'
                '2. Run distributed '
                'training script using the standard way provided by your'
                ' MPI distribution (usually mpirun, srun, or jsrun).\n'
                '3. Use built-in gloo option (horovodrun --gloo ...).')
E           Exception: horovodrun convenience script does not find an installed OpenMPI.
E
E           Choose one of:
E           1. Install Open MPI 4.0.0+ or IBM Spectrum MPI and re-install Horovod (use --no-cache-dir pip option).
E           2. Run distributed training script using the standard way provided by your MPI distribution (usually mpirun, srun, or jsrun).
E           3. Use built-in gloo option (horovodrun --gloo ...).

/usr/local/lib/python3.6/dist-packages/horovod/run/mpi_run.py:65: Exception
______________________ InteractiveRunTests.test_happy_run ______________________

I think the code here needs to be updated:

horovod/horovod/run/mpi_run.py

Line 62 in 36a98fe

mpi_impl_flags = _get_mpi_implementation_flags()

Current master only supports Open MPI and IBM Spectrum MPI, and the current master tests do not cover this code path.
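
To make the limitation concrete, here is a hedged sketch of how an MPI implementation might be classified from the text printed by `mpirun --version`. The function name and the marker strings ("Open MPI", "OpenRTE", "IBM Spectrum MPI", and "HYDRA" for MPICH's process manager) are assumptions for illustration; this is not the body of `_get_mpi_implementation_flags`.

```python
def classify_mpi(version_output):
    """Return a label for the MPI implementation, or None if unrecognized.

    Hypothetical helper: the marker strings below are assumptions about what
    each implementation prints for `mpirun --version`.
    """
    if "IBM Spectrum MPI" in version_output:
        return "IBM Spectrum MPI"
    if "Open MPI" in version_output or "OpenRTE" in version_output:
        return "Open MPI"
    if "HYDRA" in version_output:
        return "MPICH"
    return None


# An unrecognized implementation yields None, the case that triggers the
# exception shown in the traceback above.
label = classify_mpi("HYDRA build details:\n    Version: 3.3")
```

Treating MPICH as a recognized case, rather than falling through to None, is what would let the launcher choose flags for it instead of raising.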

WeichenXu123 · 2019-10-20T03:44:21Z

@tgaddair See my other PR, which tests MPICH with the log printed (#1455). It shows that MPICH does not support --allow-run-as-root:

```
E               RuntimeError: mpirun failed with exit code 255, stdout
E
E               , stderr
E               [mpiexec@e099377ab812] match_arg (utils/args/args.c:159): unrecognized argument allow-run-as-root
E               [mpiexec@e099377ab812] HYDU_parse_array (utils/args/args.c:174): argument matching returned error
E               [mpiexec@e099377ab812] parse_args (ui/mpich/utils.c:1597): error parsing input array
E               [mpiexec@e099377ab812] HYD_uii_mpx_get_parameters (ui/mpich/utils.c:1649): unable to parse user arguments
E               [mpiexec@e099377ab812] main (ui/mpich/mpiexec.c:149): error parsing parameters

/usr/local/lib/python3.6/dist-packages/horovod/run/mpi_run.py:134: RuntimeError
```
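
The log above suggests that launcher flags have to depend on the MPI implementation. A minimal sketch, assuming a hypothetical `build_mpirun_command` helper and assuming that `-np` is accepted by both launchers; only the fact that MPICH rejects `--allow-run-as-root` comes from the log.

```python
def build_mpirun_command(impl, np, command):
    """Assemble an mpirun command line, adding flags only where supported.

    Hypothetical helper for illustration; `impl` is a label such as
    "Open MPI" or "MPICH".
    """
    cmd = ["mpirun"]
    if impl == "Open MPI":
        # MPICH's mpiexec fails with "unrecognized argument" on this flag.
        cmd.append("--allow-run-as-root")
    cmd += ["-np", str(np)]
    cmd += list(command)
    return cmd


task = ["/usr/bin/python", "-m", "horovod.run.run_task", "127.0.0.1", "50171"]
openmpi_cmd = build_mpirun_command("Open MPI", 2, task)
mpich_cmd = build_mpirun_command("MPICH", 2, task)
```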

tgaddair · 2019-10-21T17:59:10Z

Good catch @WeichenXu123. Until we've done more testing with MPICH, how about we skip the interactive run test unless we're using OpenMPI similar to how we do for Spark?

# Seems that spark tests depend on MPI, do not test those when mpi is not available
  local exclude_spark_if_needed=""
  if [[ ${test} != *"mpi"* ]]; then
    exclude_spark_if_needed="| sed 's/[a-z_]*spark[a-z_.]*//g'"
  fi
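
For readers less familiar with the sed expression, the same filtering can be sketched in Python. The function names, the file list, and the test configuration names below are made up for the example; only the regular expression and the "mpi" substring check mirror the shell snippet.

```python
import re


def exclude_tests(test_files, keyword):
    """Drop every file name containing `keyword`, like the sed substitution."""
    pattern = r"[a-z_]*" + re.escape(keyword) + r"[a-z_.]*"
    return re.sub(pattern, "", " ".join(test_files)).split()


def select_tests(test_name, test_files):
    """Keep Spark tests only for test configurations whose name contains "mpi"."""
    if "mpi" not in test_name:
        return exclude_tests(test_files, "spark")
    return list(test_files)


files = ["test_run.py", "test_spark.py", "test_spark_keras.py", "test_torch.py"]
gloo_tests = select_tests("test-cpu-gloo", files)
mpi_tests = select_tests("test-cpu-openmpi", files)
```

Passing `"interactiverun"` as the keyword would drop `test_interactiverun.py` in the same way.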

WeichenXu123 · 2019-10-22T04:16:30Z

Ready.
@tgaddair Before merging, let's wait for @mengxr to take a final look.

tgaddair

LGTM! We'll land once @mengxr takes a look.

mengxr

I made one pass over the API and implementation. I don't know why some changes belong to this PR but the run implementation looks good in general.

mengxr · 2019-10-22T19:46:14Z

horovod/run/run.py

+
+    hargs = HorovodArgs()
+
+    hargs.np = np


It would be simpler to use namedtuple to define HorovodArgs.

WeichenXu123 · 2019-10-23

The HorovodArgs constructor defines many args with a default value of None, and run sets only a few of them. If we use namedtuple, we need to write out the complete argument list twice, like:

HorovodArgs = namedtuple('HorovodArgs', ['arg1', 'arg2', ..., 'arg100'])
hargs = HorovodArgs(arg1=XXX, arg2=XXX, ..., arg100=XXX)

but for most of the args we only need to keep the default value None.

mengxr · 2019-10-23

See https://docs.python.org/3/library/collections.html#collections.somenamedtuple._replace.

tgaddair · 2019-10-23

Unfortunately, in the current design we cannot use namedtuple because the downstream _run function modifies some of the object's values (therefore, we cannot assume immutability). We cannot change the behavior to use namedtuple._replace as it is currently designed, because the command line entrypoint uses the same code path but passes in a parsed arguments object (here @WeichenXu123 is relying on Python's duck typing to get around this interface overloading).

Long term, I think we should consolidate these two code paths (functional vs command line) to use the same object to represent materialized argument values. But I don't think it's necessary to do that in this PR since it's an implementation detail not exposed to the end user.
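
The trade-off discussed in this thread can be illustrated with a small sketch. `MutableArgs`, `FrozenArgs`, and `fill_defaults_in_place` are hypothetical stand-ins (for HorovodArgs, a namedtuple variant, and the part of `_run` that mutates its argument); the field names and the timeout value are arbitrary.

```python
import argparse
from collections import namedtuple


class MutableArgs(object):
    """Hypothetical stand-in for HorovodArgs: every field defaults in one place."""

    def __init__(self):
        self.np = 1
        self.hosts = None
        self.start_timeout = None


# namedtuple variant with defaults (this form also works on Python 3.6).
FrozenArgs = namedtuple("FrozenArgs", ["np", "hosts", "start_timeout"])
FrozenArgs.__new__.__defaults__ = (1, None, None)


def fill_defaults_in_place(args):
    """Stand-in for downstream code that mutates the args object it receives."""
    if args.start_timeout is None:
        args.start_timeout = 30
    return args


# Works for the mutable class and, by duck typing, for an argparse.Namespace.
from_api = fill_defaults_in_place(MutableArgs())
from_cli = fill_defaults_in_place(
    argparse.Namespace(np=2, hosts=None, start_timeout=None))

# A namedtuple rejects in-place assignment; _replace returns a new object.
frozen = FrozenArgs(np=4)
try:
    fill_defaults_in_place(frozen)
    mutation_error = None
except AttributeError as exc:
    mutation_error = exc
updated = frozen._replace(start_timeout=30)
```

Because `_replace` returns a new object rather than changing the one passed in, adopting it would require the downstream code to return and thread through the updated value, which is the consolidation deferred above.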

tgaddair

Looks good! Let's merge it.

WeichenXu123 mentioned this pull request Aug 15, 2019

Add horovod.run.run_func to make horovod notebook friendly #1192

Closed

WeichenXu123 force-pushed the issue1176_3 branch 6 times, most recently from 2e30c1f to 8fd518f Compare August 22, 2019 11:12

tgaddair mentioned this pull request Aug 27, 2019

FailedPreconditionError (see above for traceback): Error while reading resource variable training_4/Adadelta/Variable_9 from Container: localhost. #1343

Closed

alsrgv reviewed Aug 28, 2019

alsrgv requested a review from tgaddair August 28, 2019 21:29

WeichenXu123 force-pushed the issue1176_3 branch from 67947d0 to af417a9 Compare August 29, 2019 16:38

WeichenXu123 changed the title ~~[WIP] Add horovod.run.run_func to make horovod notebook friendly (new impl)~~ Add horovod.run.run_func to make horovod notebook friendly (new impl) Aug 29, 2019

WeichenXu123 force-pushed the issue1176_3 branch 2 times, most recently from 67947d0 to 5618f8e Compare August 29, 2019 16:45

WeichenXu123 changed the title ~~Add horovod.run.run_func to make horovod notebook friendly (new impl)~~ [WIP] Add horovod.run.run_func to make horovod notebook friendly (new impl) Aug 30, 2019

WeichenXu123 force-pushed the issue1176_3 branch from e608f7e to 7f95212 Compare August 30, 2019 13:35

WeichenXu123 changed the title ~~[WIP] Add horovod.run.run_func to make horovod notebook friendly (new impl)~~ Add horovod.run.run_func to make horovod notebook friendly (new impl) Aug 30, 2019

WeichenXu123 added 2 commits August 31, 2019 09:41

init pr

75812cd

Signed-off-by: WeichenXu <weichen.xu@databricks.com>

address comments

e5a26fd

Signed-off-by: WeichenXu <weichen.xu@databricks.com>

WeichenXu123 force-pushed the issue1176_3 branch from 4144c78 to 84703f5 Compare August 31, 2019 01:42

update

6875f30

Signed-off-by: WeichenXu <weichen.xu@databricks.com>

WeichenXu123 force-pushed the issue1176_3 branch 4 times, most recently from 2e0c976 to 28f6fe0 Compare September 2, 2019 03:03

address comments

33024ec

Signed-off-by: WeichenXu <weichen.xu@databricks.com>

WeichenXu123 force-pushed the issue1176_3 branch from cf3766b to 33024ec Compare October 17, 2019 13:20

fix gen-pipeline

90e58f2

Signed-off-by: WeichenXu <weichen.xu@databricks.com>

tgaddair reviewed Oct 17, 2019

WeichenXu123 added 3 commits October 19, 2019 10:53

merge master

b86c6f3

restore mpich test

5b48b13

Signed-off-by: WeichenXu <weichen.xu@databricks.com>

add mpi_args

7318d61

Signed-off-by: WeichenXu <weichen.xu@databricks.com>

WeichenXu123 added 2 commits October 19, 2019 15:38

collect all process ret val

6739508

Signed-off-by: WeichenXu <weichen.xu@databricks.com>

add mpich support

78ce593

Signed-off-by: WeichenXu <weichen.xu@databricks.com>

improve test

6e0bd83

Signed-off-by: WeichenXu <weichen.xu@databricks.com>

exclude mpich for run_interactiverun

cb0a152

Signed-off-by: WeichenXu <weichen.xu@databricks.com>

tgaddair approved these changes Oct 22, 2019

mengxr suggested changes Oct 22, 2019

address comments

c8a8643

Signed-off-by: WeichenXu <weichen.xu@databricks.com>

WeichenXu123 force-pushed the issue1176_3 branch from e3c7cb6 to c8a8643 Compare October 23, 2019 12:47

WeichenXu123 added 3 commits October 23, 2019 21:44

fix test

1b23e7d

Signed-off-by: WeichenXu <weichen.xu@databricks.com>

update get_env_rank_and_size

3535ba2

Signed-off-by: WeichenXu <weichen.xu@databricks.com>

fix

7147aab

Signed-off-by: WeichenXu <weichen.xu@databricks.com>

mengxr approved these changes Oct 24, 2019

tgaddair approved these changes Oct 24, 2019

tgaddair merged commit 9fc256d into horovod:master Oct 24, 2019

WeichenXu123 deleted the issue1176_3 branch October 25, 2019 01:48

jeffdaily pushed a commit to ROCm/horovod that referenced this pull request Nov 27, 2019

Add horovod.run.run to make horovod notebook friendly (horovod#1307)

da8f678

Signed-off-by: WeichenXu <weichen.xu@databricks.com>

tgaddair mentioned this pull request Dec 4, 2019

Multi-Worker + Multi-GPU from script instead of command line #1555

Closed

DelphianCalamity pushed a commit to DelphianCalamity/horovod that referenced this pull request Apr 18, 2020

Add horovod.run.run to make horovod notebook friendly (horovod#1307)

d51e9bf

Signed-off-by: WeichenXu <weichen.xu@databricks.com>
