
Add barrier call to torch module to support easy synchronization for process sets #3139

Merged
merged 7 commits into master on Oct 6, 2021

Conversation

@Tixxx (Collaborator) commented Aug 29, 2021

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

Add barrier call to torch module to support easy synchronization for process sets. Existing methods that use other APIs are either not well known or not designed to serve as a synchronization point. This also achieves parity with other distributed training frameworks.
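As a single-process analogy for what the new op provides (illustration only, using Python's `threading.Barrier` for threads, not Horovod's implementation): every participant blocks at the barrier until all of them arrive, and only then do all of them proceed.

```python
import threading

# Illustration only (not Horovod code): a barrier is a blocking sync point;
# no participant passes it until every participant has arrived.
NUM_RANKS = 4
barrier = threading.Barrier(NUM_RANKS)
events = []
lock = threading.Lock()

def worker(rank):
    with lock:
        events.append(("before", rank))
    barrier.wait()                    # block until all NUM_RANKS arrive
    with lock:
        events.append(("after", rank))

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_RANKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every "before" event precedes every "after" event.
print(all(tag == "before" for tag, _ in events[:NUM_RANKS]))  # True
```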

This also fixes a minor issue with is_initialized(): horovod_is_initialized returned an atomic bool, which ctypes on macOS interprets as an arbitrary int. It now returns an int instead.
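The ctypes pitfall can be reproduced in isolation (a standalone sketch, not Horovod code): a C bool occupies a single byte, so reading the same memory through an int-sized view also folds in whatever bytes sit next to it, which is why returning a plain int from the C side is the robust fix.

```python
import ctypes

# Four bytes of memory: byte 0 is the C bool's value (true), bytes 1-3 are
# "garbage" that happens to sit next to it.
buf = (ctypes.c_ubyte * 4)(1, 0x42, 0, 0)

# Reading through the right-sized view inspects only the bool's byte ...
as_bool = ctypes.cast(buf, ctypes.POINTER(ctypes.c_bool)).contents.value
# ... while an int-sized view picks up the neighbouring bytes as well.
as_int = ctypes.cast(buf, ctypes.POINTER(ctypes.c_int)).contents.value

print(as_bool)  # True
print(as_int)   # e.g. 16897 (0x4201) on little-endian machines
```

Declaring the foreign function's `restype` as `ctypes.c_int`, paired with a C function that actually returns an int, avoids the size mismatch entirely.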

Fixes #3121 (issue).

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

github-actions bot commented Aug 29, 2021

Unit Test Results

     692 files  ±0       692 suites  ±0   6h 37m 17s ⏱️ ±0s
     701 tests ±0       652 ✔️ ±0       49 💤 ±0  0 ±0 
14 817 runs  ±0  10 350 ✔️ ±0  4 467 💤 ±0  0 ±0 

Results for commit d9afdae. ± Comparison against base commit d9afdae.


github-actions bot commented Aug 29, 2021

Unit Test Results (with flaky tests)

     782 files  ±0       782 suites  ±0   7h 1m 52s ⏱️ ±0s
     701 tests ±0       652 ✔️ ±0       48 💤 ±0  1 ±0 
16 921 runs  ±0  11 675 ✔️ ±0  5 245 💤 ±0  1 ±0 


Results for commit d9afdae. ± Comparison against base commit d9afdae.


@chongxiaoc (Collaborator) left a comment

LGTM

So Lightning has to use this new op to serve as barrier instead of using join()?

@Tixxx (Collaborator, Author) commented Aug 30, 2021

LGTM

So Lightning has to use this new op to serve as barrier instead of using join()?

@chongxiaoc Yes, otherwise we will keep bumping into the error reported in the issue. I can take a look at the Lightning side as well. As a side note, we need to implement join for the other collective operations (allgather, broadcast, etc.) too to fully support join during training.

@chongxiaoc (Collaborator)

@tgaddair @romerojosh Can you take a look?

@maxhgerlach (Collaborator) left a comment

Hi @Tixxx,

this looks like a very useful contribution and would indeed provide a missing feature which is regularly requested.

I left a comment wondering about possible multithreading issues with the current implementation of this PR.

(Review comment on horovod/common/operations.cc; outdated, resolved)
@Tixxx (Collaborator, Author) commented Oct 4, 2021

Hi @Tixxx,

this looks like a very useful contribution and would indeed provide a missing feature which is regularly requested.

I left a comment wondering about possible multithreading issues with the current implementation of this PR.

@maxhgerlach @tgaddair I added a new barrier request to process the op in the background thread. Please take a look. Thanks.

@tgaddair (Collaborator) left a comment

LGTM! @romerojosh can you also take a quick look?

@romerojosh (Collaborator) left a comment

Thanks for making updates @Tixxx. While this looks like it gets the job done, the implementation here looks a bit involved just to implement a barrier operation in the background thread. In torch.distributed, a barrier is implemented by scheduling a 1 element allreduce. If we do something similar, this would greatly simplify the implementation since you can just implement the barrier op directly in Python using existing allreduce operations.

What do you think?
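The alternative described above can be sketched in miniature (a toy illustration with threads; `ToyAllreduce` is a made-up stand-in for a real 1-element collective, not Horovod or c10d code): the reduction's completion is itself the synchronization point.

```python
import threading

class ToyAllreduce:
    """Made-up stand-in for a 1-element allreduce across n workers."""
    def __init__(self, n):
        self.n = n
        self.values = []
        self.result = None
        self.cond = threading.Condition()

    def allreduce(self, value):
        with self.cond:
            self.values.append(value)
            if len(self.values) == self.n:   # last arrival performs the reduction
                self.result = sum(self.values)
                self.cond.notify_all()
            else:
                self.cond.wait_for(lambda: self.result is not None)
            return self.result

def barrier(comm):
    # The reduced value is irrelevant; returning at all means every
    # worker has contributed, i.e. everyone has reached this point.
    comm.allreduce(0)

comm = ToyAllreduce(4)
reached = []

def worker(rank):
    barrier(comm)
    reached.append(rank)

threads = [threading.Thread(target=worker, args=(r,)) for r in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(reached))  # [0, 1, 2, 3]
```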

@Tixxx (Collaborator, Author) commented Oct 4, 2021

Thanks for making updates @Tixxx. While this looks like it gets the job done, the implementation here looks a bit involved just to implement a barrier operation in the background thread. In torch.distributed, a barrier is implemented by scheduling a 1 element allreduce. If we do something similar, this would greatly simplify the implementation since you can just implement the barrier op directly in Python using existing allreduce operations.

What do you think?

@romerojosh You raised a good point. I had thought about it at the beginning, but after going through c10d's code base, it looks like they only use the allreduce approach for the NCCL backend, which I think is due to the lack of a barrier API in NCCL. For other backends that expose a barrier API, they call it directly.
This is where Horovod's separation of the control plane and the computation plane comes in handy: we don't need to rely on backend computation primitives for simple control ops like barrier. Also, a barrier is less complex than an allreduce. For example, in Open MPI, the complexity of MPI_Barrier is almost constant when called on the world communicator. Launching an allreduce just to implement a barrier seems like overkill to me. This might look involved from an implementation perspective, but IMO it's actually simpler at runtime.
Let me know if this makes sense. Thanks!

@romerojosh (Collaborator)

@Tixxx Thanks for the comment. Really, the bulk of the wall time overhead from the barrier would be from the control plane processing, not the actual barrier call. The benefits of a faster barrier op will mostly be hidden by that overhead.
However, as you've already implemented the barrier code in the backend, I'm fine with it. Please do take a look at the generated flatbuffer files though to ensure things are consistent.

@Tixxx (Collaborator, Author) commented Oct 4, 2021

@Tixxx Thanks for the comment. Really, the bulk of the wall time overhead from the barrier would be from the control plane processing, not the actual barrier call. The benefits of a faster barrier op will mostly be hidden by that overhead. However, as you've already implemented the barrier code in the backend, I'm fine with it. Please do take a look at the generated flatbuffer files though to ensure things are consistent.

@romerojosh Thanks a lot for the reminder. I did miss one part in the .fbs file; updated it to reflect the correct schema.

TJ added 7 commits, October 5, 2021 14:30 (each signed off by: TJ <tix@uber.com>)
@Tixxx Tixxx merged commit d9afdae into master Oct 6, 2021
weihanmines pushed a commit to weihanmines/horovod that referenced this pull request Nov 8, 2021
…horovod#3173)

Spark/Lightning: fix the usage of checkpoint callback (horovod#3186)

Signed-off-by: Chongxiao Cao <chongxiaoc@uber.com>

Fix Cometlogger experiment key lost issue (horovod#3184)

* test

Signed-off-by: Peng Zhang <pengz@uber.com>

* test

Signed-off-by: Peng Zhang <pengz@uber.com>

* fix_logger

Signed-off-by: Peng Zhang <pengz@uber.com>

* fix_logger

Signed-off-by: Peng Zhang <pengz@uber.com>

* recreate_loger

Signed-off-by: Peng Zhang <pengz@uber.com>

* fix_var

Signed-off-by: Peng Zhang <pengz@uber.com>

* test

Signed-off-by: Peng Zhang <pengz@uber.com>

* test

Signed-off-by: Peng Zhang <pengz@uber.com>

Updated torch c++ to use new aten api (horovod#3175)

Spark/Keras: remove bare Keras support (horovod#3191)

Make fork PRs publish test change stats (horovod#3185)

Signed-off-by: Enrico Minack <github@enrico.minack.dev>

Support for nccl on cuda 11.4 (horovod#3182)

Signed-off-by: Evan Brossard <evanb@maka-ars.com>

Fix MPICH support (horovod#3148)

* fix MPICH implementation
* enable tests for MPICH and Intel MPI

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>

Increase build timeout to 40m on Buildkite (horovod#3192)

Signed-off-by: Enrico Minack <github@enrico.minack.dev>

Change CMake syntax to be compatible with old versions of CMake (horovod#3196)

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

Reinit every torch test (horovod#3194)

Add barrier call to torch module to support easy synchronization for process sets (horovod#3139)

* Added barrier call to torch module

Signed-off-by: TJ <tix@uber.com>

Bump version to 0.23.0 (horovod#3200)

Signed-off-by: Travis Addair <tgaddair@gmail.com>

Co-authored-by: Max H. Gerlach <git@maxgerlach.de>

Increase Parallel PyTest timeout to 10m (horovod#3198)

* Increase MPI and Gloo Parallel PyTest timeout to 10m

Signed-off-by: Enrico Minack <github@enrico.minack.dev>

Spark/Lightning: don't overwrite model with checkpoint by default (horovod#3201)

The Lightning estimator saves the model by default if there is no checkpoint
callback specified. However, the model is not overwritten with the checkpoint
file in that case.

Signed-off-by: Chongxiao Cao <chongxiaoc@uber.com>

Spark/Lightning: fix checkpoint callback dirpath typo (horovod#3204)

Signed-off-by: Chongxiao Cao <chongxiaoc@uber.com>

Rework events in CI workflows (horovod#3202)

Signed-off-by: Enrico Minack <github@enrico.minack.dev>

Allow for concurrent schedule and master build, document concurrency (horovod#3206)

Signed-off-by: Enrico Minack <github@enrico.minack.dev>

Ray: fix RayExecutor to fail when num_workers=0 and num_hosts=None (horovod#3210)

Signed-off-by: Travis Addair <tgaddair@gmail.com>

add_history_in_lightning_estimator (horovod#3214)

Signed-off-by: Peng Zhang <pengz@uber.com>

Allow buildkite building merge commits on forks (horovod#3215)

Signed-off-by: Enrico Minack <github@enrico.minack.dev>

Fix json output in ci-results.yaml (horovod#3217)

Spark/Lightning: fix history metrics for estimator serialization (horovod#3216)

Save metrics inside the checkpoint dict, which will be loaded with map_location=torch.device('cpu')

Signed-off-by: Peng Zhang <pengz@uber.com>

patch python source files on macCI (horovod#3220)

* patch python source files on macCI

* Trigger build and test CI

Signed-off-by: TJ <tix@uber.com>

Co-authored-by: Enrico Minack <github@enrico.minack.dev>

Updated examples of torch and tf to include mixed precision training (horovod#3222)

* Added mixed precision example for pytorch

* added mixed precision for keras

Signed-off-by: TJ <tix@uber.com>

Job buildkite-heads accesses ci-workflow outputs, add it to the needs (horovod#3225)

Signed-off-by: Enrico Minack <github@enrico.minack.dev>

Fixes race condition for ray scale up down tests (horovod#3205)

Ensure that at least one host from the previous set of hosts has been registered.
Without this, the discovery script will "discover" the new
set of hosts before the current set can register.
This would result in a race condition.
Consider a discovery schedule:
```
discovery_schedule = [
    (10, ['host-1:2']),
    (30, ['host-1:2', 'host-2:1', 'host-3:1']),
    (None, ['host-2:1']),
]
```
The initial set is: ['host-1:2']. Before this is registered in the driver, the discovery script
discovers the set: ['host-1:2', 'host-2:1', 'host-3:1'], and adds ['host-2:1', 'host-3:1'].
However, since ['host-1:2'] has not registered, there is no coordinator to notify the workers.
When host-1 and host-3 are removed, driver.resume will call _activate_workers, which will update the host assignments.
It checks whether the previous and current sets of hosts intersect, and finds that the previous set is ['host-1:2']
and the current set is ['host-2:1'], since there was no notification for the added and removed hosts.
This ensures that the previous set of hosts can register before the current set is discovered.

Signed-off-by: Abin Shahab <ashahab@linkedin.com>

Removed a case of the default mutable argument pitfall (horovod#3227)

Signed-off-by: Naelson Douglas <naelson17@gmail.com>

Updates to TSC members (horovod#3234)

Signed-off-by: Travis Addair <tgaddair@gmail.com>

Add in-place broadcast for TensorFlow (horovod#3128)

* Update comment in FindTensorflow.cmake

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

* Add in-place broadcast_() and broadcast_variables() for TF

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

* Include source files from TF in build to avoid missing symbol errors

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

* Limit build and test to TF 2.6+

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

* Remove source files copied from TensorFlow

The missing symbols are resolved by linking against _pywrap_tensorflow_internal.so,
which was introduced to Horovod with PR horovod#3053.

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

* Fix possible type attribute values for HorovodBroadcastInplace

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

* Add reference variables to test

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

* Update comments, doc strings, changelog

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>

[Elastic Horovod] Fix the bug for ElasticSampler and hvd.elastic.state (horovod#3144)

Co-authored-by: gethinhu <gethinhu@tencent.com>

a better way to handle nccl error under elastic scenario (horovod#3112)

Signed-off-by: guoze.lin <guozelin@tencent.com>

check torch version for mixed precision example (horovod#3238)

Lightning: set limit_train_batches and limit_val_batches (horovod#3237)

Tell the Lightning trainer how many batches a single epoch needs.

Signed-off-by: Chongxiao Cao <chongxiaoc@uber.com>

Spark/Lightning: reduce memory footprint of async dataloader (horovod#3239)

Limit async data loader queue size.

Signed-off-by: Peng Zhang <pengz@uber.com>

Change default fusion threshold from 64MB to 128MB in docs (horovod#3241)

fix the example of pytorch_lightning_mnist.py (horovod#3245)

- remove unused arg parameters
- fix model test issue on GPU

Signed-off-by: Chongxiao Cao <chongxiaoc@uber.com>

CI: use latest pytorch_lightning with torchhead (horovod#3243)

test_gradient_aggregation with real gradient instead of a constant (horovod#3176)

This fixes issue horovod#2664 by performing gradient aggregation with a real gradient instead of a constant.
PR horovod#2647 shifts the gradient allreduce to when the gradient is computed (both through the DistributedOptimizer and through the DistributedGradientTape), which means that this unit test, by design in TF 2.4, doesn't call allreduce in _aggregate_gradients().

Since this unit test provided the gradient as a constant (without actually computing it), the gradient would never be allreduced. The change ensures that a real gradient is computed from a loss function instead of using a constant.

Note: The current loss function intentionally evaluates to zero. A future PR should convert it to a real loss function (e.g. MeanSquaredError) and compute gradients from that to test gradient aggregation.
Signed-off-by: Abin Shahab <ashahab@linkedin.com>
weihanmines pushed a commit to weihanmines/horovod that referenced this pull request Dec 11, 2021
…process sets (horovod#3139)

* Added barrier call to torch module

Signed-off-by: TJ <tix@uber.com>
Signed-off-by: weihanmines <weihan13@amd.com>
weihanmines pushed a commit to weihanmines/horovod that referenced this pull request Dec 11, 2021
- Fixes issue when start_epoch != 0

Signed-off-by: Dinesh Ramasamy <89654805+iitmdinesh@users.noreply.github.com>
Signed-off-by: weihanmines <weihan13@amd.com>

fix torch op handles lazy release which may cause oom in elastic scenario (horovod#3110)

* fix torch op handles lazy release which may cause oom in elastic scenario

Signed-off-by: guoze.lin <guozelin@tencent.com>

* Update mpi_ops.py

Co-authored-by: guoze.lin <guozelin@tencent.com>
Co-authored-by: Travis Addair <tgaddair@gmail.com>
Signed-off-by: weihanmines <weihan13@amd.com>

Added support for extraction of storage options from url. (horovod#3137)

* Added support for extraction of storage options from url.

Signed-off-by: Manjur Ansari <maansar@microsoft.com>

* mock fsspec.utils

Signed-off-by: Manjur Ansari <maansar@microsoft.com>

* Added missing comma

Co-authored-by: Travis Addair <tgaddair@gmail.com>
Signed-off-by: weihanmines <weihan13@amd.com>

Make RayExecutor use the current placement group if one exists (horovod#3134)

Signed-off-by: weihanmines <weihan13@amd.com>

Fix the mapping btw pyspark and numpy (horovod#3146)

Signed-off-by: Haoyang Chen <haoyang@uber.com>
Signed-off-by: weihanmines <weihan13@amd.com>

Add tests for Keras callbacks: MetricAverageCallback, LearningRateScheduleCallback and LearningRateWarmupCallback (horovod#3102)

There were no tests for MetricAverageCallback, LearningRateScheduleCallback and LearningRateWarmupCallback from hvd as noted in horovod#2659. This PR adds testing to verify the callback works.

Signed-off-by: Moses Lee <14leeyuchieh@gmail.com>
Co-authored-by: Moses Lee <molee@molee-ld4.linkedin.biz>
Signed-off-by: weihanmines <weihan13@amd.com>

Split gpu tests in head and non-head versions (horovod#3155)

Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: weihanmines <weihan13@amd.com>

Allow caller to customize the Tensorboard callback (horovod#3153)

* Keras Estimator: Allow user to pass in TensorBoard callback

Signed-off-by: Rich Porter <rich.porter@uber.com>

* Remove callback from other processes on the same machine

Signed-off-by: Rich Porter <rich.porter@uber.com>

* Allow other ranks to profile as well.  Doesn't seem to conflict

Signed-off-by: Rich Porter <rich.porter@uber.com>
Signed-off-by: weihanmines <weihan13@amd.com>

test_torch.py: add explicit join() for testing duplicated name errors (horovod#3159)

For torch nightly >= 1.10.0, we need to add an explicit join() call to avoid
hanging when testing duplicated name errors.

Signed-off-by: Chongxiao Cao <chongxiaoc@uber.com>
Signed-off-by: weihanmines <weihan13@amd.com>

Disable TF2.6.0 XLA support on OSX (horovod#3133)

Related to issue#3132

Signed-off-by: Chongxiao Cao <chongxiaoc@uber.com>
Signed-off-by: weihanmines <weihan13@amd.com>

Fix linking _pywrap_tensorflow_internal.so and re-enable XLA on macOS  (horovod#3173)

Signed-off-by: weihanmines <weihan13@amd.com>


Remove MetricAverageCallback warning on tf >= 2.5 (horovod#3050)

Signed-off-by: Henrique Mendonça <henrique.mendonca@cscs.ch>
Signed-off-by: weihanmines <weihan13@amd.com>

Fix Horovod pyarrow IndexError: list index out of range (horovod#3255)

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: weihanmines <weihan13@amd.com>

Fixing up current CI test failures.  (horovod#3259)

Signed-off-by: Josh Romero <joshr@nvidia.com>
Co-authored-by: Travis Addair <tgaddair@gmail.com>
Co-authored-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: weihanmines <weihan13@amd.com>

Revert "Fix Horovod pyarrow IndexError: list index out of range (horovod#3255)" (horovod#3265)

This reverts commit 3efc229.

Signed-off-by: Travis Addair <tgaddair@gmail.com>
Signed-off-by: weihanmines <weihan13@amd.com>

Debugging for lightning data loader and fix for simple profiler. (horovod#3253)

Add debugging flag for lightning data loader, make async data loader queue size configurable.

Signed-off-by: weihanmines <weihan13@amd.com>

Call process_set._setup in init() to point to the correct native lib path (horovod#3258)

* call setup for common process_set in remote trainers

moved _setup call to init()

Signed-off-by: TJ <tix@uber.com>
Signed-off-by: weihanmines <weihan13@amd.com>

Add support for MXNet async dependency engine. (horovod#3242)

Signed-off-by: Josh Romero <joshr@nvidia.com>
Signed-off-by: weihanmines <weihan13@amd.com>

Successfully merging this pull request may close these issues.

horovod.common.exceptions.HorovodInternalError: Broadcast is not supported with Join at this time.
5 participants