Add `horovod.run.run` to make horovod notebook friendly (new impl) #1307

Merged
merged 35 commits into from Oct 24, 2019

Contributor

WeichenXu123 commented Aug 15, 2019

Proposed API:

def run(
        func,
        args=(),
        kwargs={},
        np=1,
        hosts=None,
        hostfile=None,
        start_timeout=None,
        ssh_port=None,
        disable_cache=None,
        output_filename=None,
        verbose=None,
        use_gloo=None,
        use_mpi=None)

I launch an HTTP server on the driver side, so that:

  • remote processes can read the function to execute from the server
  • remote processes can put their return values onto the server.

I integrated the code into the current run code.
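
For readers who want to see the pattern, here is a minimal sketch of a driver-side HTTP key-value server, assuming a plain in-memory store. It is an illustration only, not the code in this PR: `KVHandler`, `start_server`, `worker`, and `run_demo` are hypothetical names. Standard `pickle` serializes functions by reference, so the sketch ships an importable function (`operator.mul`); a function defined in a notebook would need a by-value serializer such as cloudpickle.

```python
import operator
import pickle
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen


class KVHandler(BaseHTTPRequestHandler):
    """Serve GET/PUT on /<key> against a dict attached to the server."""

    def do_GET(self):
        value = self.server.store.get(self.path)
        if value is None:
            self.send_response(404)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Length", str(len(value)))
        self.end_headers()
        self.wfile.write(value)

    def do_PUT(self):
        length = int(self.headers.get("Content-Length", 0))
        self.server.store[self.path] = self.rfile.read(length)
        self.send_response(200)
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet


def start_server():
    """Start the server on a free local port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), KVHandler)
    server.store = {}
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def worker(port, rank):
    """What a remote process would do: fetch the function, run it, report back."""
    base = "http://127.0.0.1:%d" % port
    with urlopen(base + "/func") as resp:
        func, args, kwargs = pickle.loads(resp.read())
    payload = pickle.dumps(func(*args, **kwargs))
    request = Request(base + "/result/%d" % rank, data=payload, method="PUT")
    with urlopen(request) as resp:
        resp.read()


def run_demo(np=2):
    server = start_server()
    port = server.server_address[1]
    # Driver publishes the function and its arguments.
    server.store["/func"] = pickle.dumps((operator.mul, (6, 7), {}))
    # Workers run in-process here; in a real job they would be remote processes.
    for rank in range(np):
        worker(port, rank)
    results = [pickle.loads(server.store["/result/%d" % r]) for r in range(np)]
    server.shutdown()
    server.server_close()
    return results
```

Calling `run_demo(2)` collects one return value per rank on the driver, which is the behavior the two bullets above describe.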

@alsrgv alsrgv requested a review from tgaddair Aug 28, 2019
@WeichenXu123 WeichenXu123 force-pushed the WeichenXu123:issue1176_3 branch from 67947d0 to af417a9 Aug 29, 2019
@WeichenXu123 WeichenXu123 changed the title [WIP] Add `horovod.run.run_func` to make horovod notebook friendly (new impl) Add `horovod.run.run_func` to make horovod notebook friendly (new impl) Aug 29, 2019
@WeichenXu123 WeichenXu123 force-pushed the WeichenXu123:issue1176_3 branch 2 times, most recently from 67947d0 to 5618f8e Aug 29, 2019
Contributor Author

WeichenXu123 commented Aug 29, 2019

I updated the PR to address some comments.
I will add tests later, but it is ready for a first-pass review.

Collaborator

tgaddair commented Aug 29, 2019

Since the rendezvous server has been made more generic, I would move it out of the horovod.run.rendezvous package into either horovod.run or horovod.run.http.

@WeichenXu123 WeichenXu123 changed the title Add `horovod.run.run_func` to make horovod notebook friendly (new impl) [WIP] Add `horovod.run.run_func` to make horovod notebook friendly (new impl) Aug 30, 2019
Contributor Author

WeichenXu123 commented Aug 30, 2019

@tgaddair I added a HorovodArgs structure to encapsulate all the configs.

@WeichenXu123 WeichenXu123 force-pushed the WeichenXu123:issue1176_3 branch from e608f7e to 7f95212 Aug 30, 2019
@WeichenXu123 WeichenXu123 changed the title [WIP] Add `horovod.run.run_func` to make horovod notebook friendly (new impl) Add `horovod.run.run_func` to make horovod notebook friendly (new impl) Aug 30, 2019
Contributor Author

WeichenXu123 commented Aug 30, 2019

@tgaddair The Buildkite system seems to have some issues; the build sometimes fails.

Collaborator

tgaddair commented Aug 31, 2019

> @tgaddair Seemingly the buildkite system has some issue. Sometimes building failed.

Hey @WeichenXu123, can you rebase onto master? There was a change made to pytorch-nightly.

Signed-off-by: WeichenXu <weichen.xu@databricks.com>
@WeichenXu123 WeichenXu123 force-pushed the WeichenXu123:issue1176_3 branch from 4144c78 to 84703f5 Aug 31, 2019
update
@WeichenXu123 WeichenXu123 force-pushed the WeichenXu123:issue1176_3 branch 4 times, most recently from 2e0c976 to 28f6fe0 Aug 31, 2019
@WeichenXu123 WeichenXu123 force-pushed the WeichenXu123:issue1176_3 branch from cf3766b to 33024ec Oct 17, 2019
Contributor Author

WeichenXu123 commented Oct 17, 2019

@tgaddair I addressed most of your comments, leaving only the one about the "ret value". We first need to address the issue I commented on above: #1307 (comment)

Contributor Author

WeichenXu123 commented Oct 19, 2019

@tgaddair I added back the MPICH test. Then, when running test_interactiverun.py, it raises an error like:

test_interactiverun.py:68:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/local/lib/python3.6/dist-packages/horovod/run/run.py:922: in run
    return _run(hargs)
/usr/local/lib/python3.6/dist-packages/horovod/run/run.py:805: in _run
    _launch_job(args, remote_host_names, settings, common_intfs, command)
/usr/local/lib/python3.6/dist-packages/horovod/run/run.py:832: in _launch_job
    mpi_run(settings, common_intfs, env, command)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

settings = <horovod.run.common.util.settings.Settings object at 0x7f7b47467860>
common_intfs = {'lo'}
env = {'HOME': '/root', 'HOROVOD_CACHE_CAPACITY': '(None,)', 'HOROVOD_CYCLE_TIME': '(None,)', 'HOSTNAME': 'be96015d6083', ...}
command = ['/usr/bin/python', '-m', 'horovod.run.run_task', '127.0.0.1', '50171']

    def mpi_run(settings, common_intfs, env, command):
        mpi_impl_flags = _get_mpi_implementation_flags()
        if mpi_impl_flags is None:
            raise Exception(
>               'horovodrun convenience script does not find an installed OpenMPI.\n\n'
                'Choose one of:\n'
                '1. Install Open MPI 4.0.0+ or IBM Spectrum MPI and re-install Horovod '
                '(use --no-cache-dir pip option).\n'
                '2. Run distributed '
                'training script using the standard way provided by your'
                ' MPI distribution (usually mpirun, srun, or jsrun).\n'
                '3. Use built-in gloo option (horovodrun --gloo ...).')
E           Exception: horovodrun convenience script does not find an installed OpenMPI.
E
E           Choose one of:
E           1. Install Open MPI 4.0.0+ or IBM Spectrum MPI and re-install Horovod (use --no-cache-dir pip option).
E           2. Run distributed training script using the standard way provided by your MPI distribution (usually mpirun, srun, or jsrun).
E           3. Use built-in gloo option (horovodrun --gloo ...).

/usr/local/lib/python3.6/dist-packages/horovod/run/mpi_run.py:65: Exception
______________________ InteractiveRunTests.test_happy_run ______________________

I think the code here needs to be updated:

mpi_impl_flags = _get_mpi_implementation_flags()

Current master only supports Open MPI and IBM Spectrum MPI, and the current master tests do not cover this code path.

Contributor Author

WeichenXu123 commented Oct 20, 2019

@tgaddair See my other PR, which tests MPICH (a version that prints logs): #1455. There we can see that MPICH does not support --allow-run-as-root.

E               RuntimeError: mpirun failed with exit code 255, stdout
E
E               , stderr
E               [mpiexec@e099377ab812] match_arg (utils/args/args.c:159): unrecognized argument allow-run-as-root
E               [mpiexec@e099377ab812] HYDU_parse_array (utils/args/args.c:174): argument matching returned error
E               [mpiexec@e099377ab812] parse_args (ui/mpich/utils.c:1597): error parsing input array
E               [mpiexec@e099377ab812] HYD_uii_mpx_get_parameters (ui/mpich/utils.c:1649): unable to parse user arguments
E               [mpiexec@e099377ab812] main (ui/mpich/mpiexec.c:149): error parsing parameters

/usr/local/lib/python3.6/dist-packages/horovod/run/mpi_run.py:134: RuntimeError
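
Both failures above come down to the launcher not distinguishing MPI implementations before choosing flags. A minimal sketch of one way to do that, assuming the implementation can be recognized from the text printed by `mpirun --version`. The function names, the return labels, and the "HYDRA" marker used for MPICH are assumptions for illustration; this is not Horovod's `_get_mpi_implementation_flags`.

```python
def classify_mpi(version_output):
    """Guess the MPI implementation from `mpirun --version` text."""
    if "Open MPI" in version_output or "OpenRTE" in version_output:
        return "openmpi"
    if "IBM Spectrum MPI" in version_output:
        return "spectrum"
    # Assumption: MPICH's Hydra launcher mentions "HYDRA" in its version text.
    if "HYDRA" in version_output:
        return "mpich"
    return None


def build_mpirun_command(impl, np, command):
    """Assemble an mpirun command line, adding only flags the launcher accepts."""
    if impl is None:
        raise RuntimeError("no supported MPI implementation detected")
    cmd = ["mpirun", "-np", str(np)]
    if impl == "openmpi":
        # The log above shows MPICH rejecting this flag, so it is added
        # for Open MPI only in this sketch.
        cmd.append("--allow-run-as-root")
    return cmd + list(command)
```

Keeping the classification separate from the command assembly means the string matching can be unit-tested without any MPI installation present.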
Collaborator

tgaddair commented Oct 21, 2019

Good catch @WeichenXu123. Until we've done more testing with MPICH, how about we skip the interactive run test unless we're using OpenMPI similar to how we do for Spark?

# Seems that spark tests depend on MPI, do not test those when mpi is not available
local exclude_spark_if_needed=""
if [[ ${test} != *"mpi"* ]]; then
  exclude_spark_if_needed="| sed 's/[a-z_]*spark[a-z_.]*//g'"
fi
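
The same selection logic, extended to the interactive run test, can be sketched in Python. The shell snippet keys off whether the test name contains "mpi", which an MPICH configuration would also match, so the sketch uses a narrower "openmpi" check for the interactive run test. The function name and the example image names are hypothetical and do not come from the pipeline script.

```python
def select_tests(test_image, test_files):
    """Drop test files that the given test image cannot run."""
    selected = []
    for name in test_files:
        # Spark tests need some MPI implementation.
        if "spark" in name and "mpi" not in test_image:
            continue
        # Interactive run is only exercised with Open MPI for now.
        if "interactiverun" in name and "openmpi" not in test_image:
            continue
        selected.append(name)
    return selected
```

For example, `select_tests("test-cpu-mpich-py3_6", [...])` would keep the Spark tests but skip `test_interactiverun.py`.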
Contributor Author

WeichenXu123 commented Oct 22, 2019

Ready.
@tgaddair Before merging, let's wait for @mengxr to take a final look.

Collaborator

tgaddair left a comment

LGTM! We'll land once @mengxr takes a look.

mengxr left a comment

I made one pass over the API and implementation. I don't know why some changes belong to this PR but the run implementation looks good in general.


hargs = HorovodArgs()

hargs.np = np


mengxr Oct 22, 2019

simpler if we use namedtuple to define HorovodArgs.


WeichenXu123 Oct 23, 2019

Author Contributor

The HorovodArgs constructor defines many args with a default value of None, and in run we only set a few of them. If we used namedtuple, we would need to write out the complete argument list twice, like:

HorovodArgs = namedtuple('HorovodArgs', ['arg1', 'arg2', ..., 'arg100'])
hargs = HorovodArgs(arg1=XXX, arg2=XXX, ..., arg100=XXX)

but for most of the args we only need to keep the default value None.
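
As a side note on the duplication concern, a namedtuple can be given all-None defaults so that only the arguments actually set need to be passed. This is a minimal sketch of that alternative, not what the PR does; the field list is a subset borrowed from the proposed `run` signature, and assigning `__new__.__defaults__` is the approach that works on Python 3.6 (Python 3.7+ also accepts a `defaults=` argument to `namedtuple`).

```python
from collections import namedtuple

# Illustrative subset of fields; the real HorovodArgs has many more.
_FIELDS = ["np", "hosts", "hostfile", "start_timeout", "ssh_port", "verbose"]

HorovodArgs = namedtuple("HorovodArgs", _FIELDS)
# Give every field a default of None so callers can set only what they need.
HorovodArgs.__new__.__defaults__ = (None,) * len(_FIELDS)

hargs = HorovodArgs(np=2)
```

Here `hargs.np` is 2 and every other field is None. The instance is still immutable, which matters for code that assigns to the arguments object later.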


tgaddair Oct 23, 2019

Collaborator

Unfortunately, in the current design we cannot use namedtuple because the downstream _run function modifies some of the object's values (so we cannot assume immutability). Nor can we switch to namedtuple._replace as things are currently designed, because the command-line entrypoint uses the same code path but passes in a parsed-arguments object (here @WeichenXu123 is relying on Python's duck typing to get around this interface overloading).

Long term, I think we should consolidate these two code paths (functional vs. command line) to use the same object to represent materialized argument values. But I don't think it's necessary to do that in this PR, since it's an implementation detail not exposed to the end user.
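
The constraint described above can be reproduced in a few lines. In this sketch, `_run` is a hypothetical stand-in that assigns to the object it receives, as the comment says the real one does; `HorovodArgs` is reduced to two attributes, and the parser is a toy, not horovodrun's. A plain class instance and an `argparse.Namespace` both pass through, while a namedtuple fails on the first assignment.

```python
import argparse
from collections import namedtuple


class HorovodArgs(object):
    """Reduced stand-in for the mutable args holder used by the functional API."""

    def __init__(self):
        self.np = None
        self.hosts = None


def _run(args):
    # Hypothetical stand-in: fills in a missing value by mutating `args`.
    if args.hosts is None:
        args.hosts = "localhost:%d" % args.np
    return args.hosts


def demo():
    # Functional path: a plain mutable object.
    hargs = HorovodArgs()
    hargs.np = 2
    from_api = _run(hargs)

    # Command-line path: an argparse.Namespace with the same attribute names.
    parser = argparse.ArgumentParser()
    parser.add_argument("-np", type=int, dest="np")
    parser.add_argument("--hosts")
    from_cli = _run(parser.parse_args(["-np", "2"]))

    # A namedtuple has the same attributes but rejects assignment.
    Frozen = namedtuple("Frozen", ["np", "hosts"])
    try:
        _run(Frozen(np=2, hosts=None))
        frozen_ok = True
    except AttributeError:
        frozen_ok = False
    return from_api, from_cli, frozen_ok
```

Both mutable objects work because `_run` only relies on attribute access, which is the duck typing mentioned above.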

@WeichenXu123 WeichenXu123 force-pushed the WeichenXu123:issue1176_3 branch from e3c7cb6 to c8a8643 Oct 23, 2019
fix test
fix
mengxr approved these changes Oct 24, 2019
Collaborator

tgaddair left a comment

Looks good! Let's merge it.

@tgaddair tgaddair merged commit 9fc256d into horovod:master Oct 24, 2019
2 checks passed
DCO
buildkite/horovod/pr Build #1269 passed (52 minutes, 1 second)
@WeichenXu123 WeichenXu123 deleted the WeichenXu123:issue1176_3 branch Oct 25, 2019
jeffdaily added a commit to ROCmSoftwarePlatform/horovod that referenced this pull request Nov 27, 2019