
Add Join op (with only support for AllReduce and PyTorch for now) #1058

Merged · 6 commits · Nov 1, 2019

Conversation

@kit1980 (Contributor) commented May 7, 2019

This is a very early work-in-progress PR to add hvd.join() as described in #832.

This is mostly boilerplate code; it compiles, and I can call hvd.join() from the PyTorch test (the Join op goes into the queue but does nothing).

The purpose of this PR, for me, is to understand whether I'm going in the right direction.
Some other questions I have (in no particular order):

  1. Should the Join op have a name?
  2. The Join op doesn't have an input and should have a single int output (the last joined rank), right?
  3. Does the Join op need to be implemented separately for CUDA, NCCL, etc., in cuda_operations.cc, nccl_operations.cc, and so on?
  4. In the operation manager, should there be a vector of available Join ops or just one Join op?
  5. The logic for the Join op should track already-joined ranks; should this be similar to the logic in the AllReduce and AllGather ops? Any code pointers/suggestions?
  6. The AllReduce and AllGather ops need to know the already-joined ranks. Is that going to be stored in global state? Any code pointers/suggestions?
  7. Any suggestions on how to make adding the Join op more granular (several smaller PRs)?

@kit1980 (Contributor, Author) commented May 7, 2019

  8. For PyTorch, will both the mpi_lib_v2 and mpi_lib implementations eventually be needed, or only v2?

@alsrgv (Member) left a comment

Thanks for starting this! I left some stylistic feedback. To answer your questions:

  1. I don't think Join op needs a name, but we should make sure only one Join op is enqueued on any given rank. Duplicate name checking may be a way around it.

  2. Yes, it should return last joined rank.

  3. It feels like we need to expand the API of operations to allow for no-ops. All kinds of operations would then have to implement this additional API.

  4. If I understand correctly, you're thinking about bit vectors in accelerated orchestration logic? I think two bits would be sufficient to indicate that some/all of ranks did Join().

5-6. Rank-tracking logic is all done in the orchestration layer. We should implement this for both the bitvector and the standard approach. The logic for both is here; I recommend studying it carefully, as it got a bit complicated over time.

  7. I'd suggest sticking to a single PR for now until it becomes unmanageable (usually it doesn't).

  8. We'd need to eventually support PyTorch, TensorFlow, and Apache MXNet. Legacy PyTorch could be made unsupported if it's too much work.

@kit1980 (Contributor, Author) commented May 31, 2019

Rebased on master, addressed the style comments, and fixed a "Segmentation fault" caused by trying to cache the Join op.

@kit1980 (Contributor, Author) commented Jun 13, 2019

Hi @alsrgv,

I've coded some basic logic for the JoinOp (the code is not in this PR yet): the global state on the coordinator stores the number of ranks that have already joined, and each rank knows whether it has joined itself. The IncrementTensorCount function then takes the number of already-joined ranks into account and sets ready_to_reduce when the number of ranks reducing the tensor equals mpi_size - joined_count. After receiving the reduce response, each rank checks its own joined bit to decide whether it actually needs to perform the reduce operation.
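For illustration, a rough Python sketch of that counting rule (the names increment_tensor_count and message_table mirror the prose above; this is not the actual Horovod C++ code):

# Illustrative sketch only, not the Horovod implementation: the coordinator
# records which ranks requested a reduction of a tensor, and the tensor is
# ready once every rank that has NOT joined has requested it.
def increment_tensor_count(message_table, tensor_name, rank, mpi_size, joined_count):
    requesting_ranks = message_table.setdefault(tensor_name, set())
    requesting_ranks.add(rank)
    return len(requesting_ranks) == mpi_size - joined_count

# Example: 4 ranks, one of which has joined, so 3 requests make it ready.
table = {}
ready = [increment_tensor_count(table, "grad.fc.weight", r, 4, 1) for r in range(3)]
print(ready)  # [False, False, True]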

The problem is that the ranks that actually perform the reduce (those not yet joined) use MPI or NCCL reduce, which expects all ranks to participate. One way to solve this is to make every rank know the list of already-joined ranks and create a new MPI communicator that uses only a subset of all nodes; this doesn't seem really feasible... Another idea is that all joined ranks still participate in the MPI reduce (the Join op's Execute calls MPI reduce), but with zero tensors; in this case, though, the joined nodes need to somehow know the shape of the reduced tensor...

Any suggestions?

@kit1980 (Contributor, Author) commented Jun 18, 2019

An update after my last comment.
Now I'm trying to implement this design:

The coordinator tracks how many ranks have sent a Join request, and each rank knows whether it has Joined.
Then, when sending a Reduce response, if at least one rank has Joined, the coordinator also sends a Join response with information about the reduced tensors (type and size), so there will be many Join responses after one Join request. The Join response is processed by the joined ranks only, and those ranks use the information from the response to fill their tensor_table (the actual content of the tensors will be 0).
Then all the ranks execute the Reduce normally.

@alsrgv, could you review if it makes sense?

@kit1980 (Contributor, Author) commented Jun 18, 2019

Another problem I have is how to keep the Python thread alive after one rank Joins earlier than the others. The wait must happen in the Python thread, not in the background thread, and I'm not sure how to communicate back that the wait is over (i.e., that all ranks have sent Join requests).
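One possible shape for this, sketched in Python with a hypothetical _enqueue_join() standing in for the native call that submits the Join op (purely illustrative; not the design that ended up in the PR):

import threading

class JoinHandle:
    # Bridges the background (communication) thread and the Python thread.
    def __init__(self):
        self._done = threading.Event()
        self.last_joined_rank = None

    def complete(self, last_joined_rank):
        # Called from the background thread once all ranks have joined.
        self.last_joined_rank = last_joined_rank
        self._done.set()

    def wait(self):
        # Called from the Python thread; blocks until complete() fires.
        self._done.wait()
        return self.last_joined_rank

def _enqueue_join(on_complete):
    # Hypothetical stand-in for the native enqueue; a timer simulates the
    # background thread finishing once every rank has joined.
    threading.Timer(0.1, on_complete, args=(0,)).start()

def join():
    handle = JoinHandle()
    _enqueue_join(handle.complete)
    return handle.wait()

print(join())  # blocks briefly, then prints the (simulated) last joined rank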

@kit1980 (Contributor, Author) commented Jun 25, 2019

So I've implemented this design (similar to my previous comment, but with some modifications):

The coordinator tracks how many ranks have already sent Join requests, and each rank knows whether it has Joined.
Calling join() from Python blocks the Python thread for that rank until the Join response, which is sent only once all ranks have Joined.
When sending a Reduce response, if at least one rank has Joined, the coordinator includes tensor type and size information in the Reduce response. The Joined ranks use this information to construct temporary zero-filled data for the MPI AllReduce, while non-Joined ranks execute a normal AllReduce.

This seems to work OK with the one small test I have. If you think this approach makes sense, I'll fix all the TODOs and test/fix more scenarios (fused responses, all data types, cache, ...).
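A toy sketch of the zero-filled participation idea using mpi4py and NumPy (the tensor shape and dtype that the coordinator would forward in its response are hard-coded here; this illustrates the design above, not the Horovod code):

# Every rank participates in the allreduce; ranks that already joined
# contribute a zero tensor of the same shape/dtype, so the sum is unchanged.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

tensor_shape, tensor_dtype = (4,), np.float32   # would come from the response
joined = rank == 0                              # pretend rank 0 ran out of data

if joined:
    send = np.zeros(tensor_shape, dtype=tensor_dtype)
else:
    send = np.full(tensor_shape, float(rank), dtype=tensor_dtype)

recv = np.empty_like(send)
comm.Allreduce(send, recv, op=MPI.SUM)  # joined ranks add nothing to the sum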

@alsrgv (Member) commented Jul 2, 2019

@kit1980, sorry for going dark on you. Your proposed approach sounds good!

@alsrgv (Member) commented Jul 2, 2019

Note that it should work with AllGather, too (send empty slices from joined nodes).
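For illustration, the empty-slice idea with mpi4py's variable-length allgather (a toy sketch that exchanges slice lengths first; not the Horovod implementation):

# Joined ranks contribute a zero-length slice, so the concatenated result
# contains data only from the ranks that have not joined yet.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

joined = rank == 0                          # pretend rank 0 already joined
local = np.empty(0, dtype=np.float64) if joined else np.full(3, rank, dtype=np.float64)

counts = comm.allgather(local.size)         # exchange slice lengths
result = np.empty(sum(counts), dtype=np.float64)
comm.Allgatherv(local, [result, counts])    # joined ranks add nothing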

@kit1980 kit1980 force-pushed the master branch 2 times, most recently from b131034 to 7374086, on July 24, 2019 00:17
@kit1980 kit1980 force-pushed the master branch 2 times, most recently from 084f1c1 to 530e8a9, on July 29, 2019 18:21
@kit1980 (Contributor, Author) commented Jul 29, 2019

Join with AllReduce now passes tests for all data types for MPI and NCCL.
Please review.

@DEKHTIARJonathan (Collaborator) commented Aug 5, 2019

I have a simple question: how is this different from the following?

from mpi4py import MPI
MPI.COMM_WORLD.Barrier()  # Waiting for all MPI processes to sync

The objective looks quite identical to me.

@kit1980 (Contributor, Author) commented Aug 5, 2019

@DEKHTIARJonathan, this is for the case when data is unevenly distributed between workers, so different workers perform different numbers of steps. An MPI barrier can't help here: the workers that still have data will wait forever in optimizer.step() for the workers that have already processed all of their data.
Please see this issue for more details and discussion: #832
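To illustrate the use case, a sketch of the uneven-workload loop that hvd.join() is meant to unblock (the per-rank batch counts and the model are made up; this shows the intended usage, not this PR's internals):

# Each rank gets a different number of batches; the rank that runs out of
# data calls hvd.join(), which keeps servicing the other ranks' allreduces
# (with zero tensors) until every rank has joined.
import torch
import horovod.torch as hvd

hvd.init()
num_batches = 10 + hvd.rank()   # made-up uneven workload per rank

model = torch.nn.Linear(4, 1)
optimizer = hvd.DistributedOptimizer(
    torch.optim.SGD(model.parameters(), lr=0.01),
    named_parameters=model.named_parameters())

for _ in range(num_batches):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 4)).sum()
    loss.backward()
    optimizer.step()            # allreduces gradients across ranks

hvd.join()                      # block until every rank has joined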

@DEKHTIARJonathan (Collaborator) commented:

Oh, I see. Thanks for the pointer. I've never had this use case.

@kit1980 (Contributor, Author) commented Aug 6, 2019

@alsrgv Could you take a look at this?

@alsrgv (Member) left a comment

Left a few comments to help me understand the PR better. Could you rebase on the latest master to resolve conflicts?

@kit1980 kit1980 force-pushed the master branch 3 times, most recently from 0b4c542 to d762c2a, on August 23, 2019 06:37
@kit1980 (Contributor, Author) commented Aug 23, 2019

Rebased on master (which was painful because of recent large code moves with simultaneous logic and formatting changes).

@kit1980 (Contributor, Author) commented Sep 12, 2019

Rebased on master again. Please review.

@tgaddair (Collaborator) left a comment

Looks great, just a few minor changes left.

@@ -211,6 +214,8 @@ class OpContext {
std::shared_ptr<PersistentBuffer>* tensor) = 0;
virtual Status AllocateOutput(TensorShape shape,
std::shared_ptr<Tensor>* tensor) = 0;
virtual Status AllocateZeros(int64_t num_elements, DataType dtype,
Collaborator comment:

Can you add stub implementations of this method to the other OpContext children, including TFOpContext in tensorflow/mpi_ops.cc and MXOpContext in mxnet/adapter.{h,cc}? These implementations should raise an error similar to the one for PyTorch <1.0.

@kit1980 (Contributor, Author):

Done for TF and MXNet.

@tgaddair (Collaborator) commented:

@kit1980 looks like TensorFlow just switched nightly over to 2.0. I'll make a quick update to our test script and then your tests should pass.

@kit1980 (Contributor, Author) commented Oct 31, 2019

@kit1980 looks like TensorFlow just switched nightly over to 2.0. I'll make a quick update to our test script and then your tests should pass.

Thanks, I was trying to understand what's going on - the previous nightly build passed.

@kit1980 (Contributor, Author) commented Nov 1, 2019

All the tests passed.

Sergii Dymchenko added 6 commits October 31, 2019 17:50
Signed-off-by: Sergii Dymchenko <sedymche@microsoft.com> (all 6 commits)
@tgaddair (Collaborator) left a comment

LGTM! Nice job!

@tgaddair tgaddair merged commit ef5804b into horovod:master Nov 1, 2019
@kit1980 (Contributor, Author) commented Nov 9, 2019

Hi @kit1980, I have a question: does Horovod's cache support PyTorch's join function? Once one rank joins, the tensors that hit the cache on the other ranks get put back into tensor_queue_, so Horovod ends up in an infinite loop.

Can you provide a test that exhibits this behavior? I don't understand why "the tensor of other rank hits will be put back into tensor_queue_" would happen. I have test_horovod_join_allreduce for PyTorch that passes OK.

Hi @kit1980, I don't have a PyTorch test here, but if Horovod's cache supports PyTorch's join function, the following situation will occur.
For the ranks that have not joined: training reaches step 1000; tensor1 and tensor2 were cached in a previous step, so they hit the cache.
For the rank that has joined: it has no tensor1 or tensor2.
Then, after the cache's cheap MPI coordination step, because one rank has joined, not all ranks are ready for tensor1 and tensor2. In that case the ranks that have not joined put tensor1 and tensor2 back into tensor_queue_, and the program loops forever.
Of course, this assumes that PyTorch takes the cache code path, so my question is:
1. Does PyTorch execute the cache logic?

Hi @huoliquankai7,
I think I see what you mean now in my tests.
I'm working on fixing this.

@tgaddair (Collaborator) commented:

Hey @huoliquankai7, any update on the TensorFlow PR you mentioned? Now that PyTorch has landed, would be great to get it for TF as well!

@tgaddair (Collaborator) commented:

Hey @kit1980, what is your plan for adding Allgather, Broadcast, TensorFlow, and MXNet support for this operation? No worries if there are no immediate plans, we can add some of this work to the backlog, but just wanted to make sure we don't duplicate our efforts.

@kit1980 (Contributor, Author) commented Nov 12, 2019

Hey @kit1980, what is your plan for adding Allgather, Broadcast, TensorFlow, and MXNet support for this operation? No worries if there are no immediate plans, we can add some of this work to the backlog, but just wanted to make sure we don't duplicate our efforts.

Hi,
I've found a couple of issues with Join and fusion/caching while testing with MNIST; they should be fixed very soon.
After that I'll add support for Allgather and Broadcast, which should be easy.
For TensorFlow and MXNet I'd like someone's help, because I don't really use those.

@Richie-yan (Contributor) commented:

Hey @huoliquankai7, any update on the TensorFlow PR you mentioned? Now that PyTorch has landed, would be great to get it for TF as well!

Hi @tgaddair, since @kit1980's Join and the cache have a problem, I'm waiting for him to solve it; then I'll sort out my changes, and the TensorFlow PR should be put forward soon.

@tgaddair (Collaborator) commented:

Hey @kit1980, any update on the cache issue?

jeffdaily pushed a commit to ROCm/horovod that referenced this pull request Nov 27, 2019
…rovod#1058)

Signed-off-by: Sergii Dymchenko <sedymche@microsoft.com>
@kit1980 (Contributor, Author) commented Dec 4, 2019

Hey @kit1980, any update on the cache issue?

Hi @tgaddair
I added one more special bit to the cache bit vector to decide that communication is needed if one of the ranks joined during this tick.
It mostly works; I still need to fix a couple of issues I found after adding more extensive tests.
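Roughly, the idea can be pictured like this (a toy mpi4py sketch of an extra status bit combined across ranks with a bitwise OR; Horovod's actual cache coordination protocol differs in detail):

# If any rank set the "joined this tick" bit, every rank sees it after the
# bitwise-OR allreduce and falls back to full response coordination instead
# of the cached fast path.
import numpy as np
from mpi4py import MPI

JOINED_THIS_TICK = np.int64(1 << 0)       # illustrative bit position

comm = MPI.COMM_WORLD
joined_this_tick = comm.Get_rank() == 0   # pretend rank 0 just joined

local_flags = np.array([JOINED_THIS_TICK if joined_this_tick else 0], dtype=np.int64)
global_flags = np.zeros_like(local_flags)
comm.Allreduce(local_flags, global_flags, op=MPI.BOR)

if global_flags[0] & JOINED_THIS_TICK:
    print("at least one rank joined this tick: skip the cache fast path")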

@tgaddair (Collaborator) commented Dec 4, 2019

Thanks for the update, @kit1980. When the PR is ready, we should also get @romerojosh to take a look, as he's the creator of the bit cache.

DelphianCalamity pushed a commit to DelphianCalamity/horovod that referenced this pull request Apr 18, 2020
…rovod#1058)

Signed-off-by: Sergii Dymchenko <sedymche@microsoft.com>