-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add horovod.run.run
to make horovod notebook friendly (new impl)
#1307
Conversation
2e30c1f
to
8fd518f
Compare
67947d0
to
af417a9
Compare
horovod.run.run_func
to make horovod notebook friendly (new impl)horovod.run.run_func
to make horovod notebook friendly (new impl)
67947d0
to
5618f8e
Compare
I update PR for addressing some comments. |
Since the rendezvous server has been made more generic, I would move it out of the |
horovod.run.run_func
to make horovod notebook friendly (new impl)horovod.run.run_func
to make horovod notebook friendly (new impl)
@tgaddair I add a structure |
e608f7e
to
7f95212
Compare
horovod.run.run_func
to make horovod notebook friendly (new impl)horovod.run.run_func
to make horovod notebook friendly (new impl)
@tgaddair Seemingly the buildkite system has some issue. Sometimes building failed. |
Hey @WeichenXu123, can you rebase onto master? There was a change made to |
Signed-off-by: WeichenXu <weichen.xu@databricks.com>
4144c78
to
84703f5
Compare
2e0c976
to
28f6fe0
Compare
Signed-off-by: WeichenXu <weichen.xu@databricks.com>
cf3766b
to
33024ec
Compare
@tgaddair I address most comments from you, only leave the one "ret value", we need to address the issue first which I comment above #1307 (comment) |
Signed-off-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: WeichenXu <weichen.xu@databricks.com>
@tgaddair I add back the mpich test, then when running
I think the code here need to update horovod/horovod/run/mpi_run.py Line 62 in 36a98fe
Current master only support OpenMPI and IBM spectrum MPI, and current master test do not cover the code path here. |
Signed-off-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: WeichenXu <weichen.xu@databricks.com>
@tgaddair See my another PR test for MPICH (print log version), we can see MPICH do not support
|
Signed-off-by: WeichenXu <weichen.xu@databricks.com>
Good catch @WeichenXu123. Until we've done more testing with MPICH, how about we skip the interactive run test unless we're using OpenMPI similar to how we do for Spark?
|
Signed-off-by: WeichenXu <weichen.xu@databricks.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! We'll land once @mengxr takes a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made one pass over the API and implementation. I don't know why some changes belong to this PR but the run
implementation looks good in general.
|
||
hargs = HorovodArgs() | ||
|
||
hargs.np = np |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
simpler if we use namedtuple
to define HorovodArgs
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here there're many args I defined default value None in class HorovodArgs
constructor. And in run
we only set few arguments. If use namedtuple
, we need to copy the argments list twice completely, like:
HorovodArgs = namedtuple('arg1', 'arg2', ..., 'arg100')
hargs = HorovodArgs(arg1=XXX, arg2=XXX, ..., arg100=XXX)
but most of the args we only need to keep default value None
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, in the current design we cannot use namedtuple
because the downstream _run
function modifies some of the object's values (therefore, we cannot assume immutability). We cannot change the behavior to use namedtuple._replace
as it is currently designed because there command line entrypoint uses the same code path, but passes in parsed arguments object (here @WeichenXu123 is relying on Python's duck typing to get around this interface overloading).
Longterm, I think we should consolidate these two code paths (functional vs command line) to use the same object to represent materialized argument values. But I don't think it's necessary to do that in this PR since it's an implementation detail not exposed to the end user.
Signed-off-by: WeichenXu <weichen.xu@databricks.com>
e3c7cb6
to
c8a8643
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Let's merge it.
Signed-off-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: WeichenXu <weichen.xu@databricks.com>
Proposed API:
I launch a http server in driver side, so that:
I integrate the code into current
run
code.