
[TF2.4 - HVD Keras Fix] - Horovod worker desynchronization - Potentially deadlock training TF 2.4 #2647

Merged

Conversation

DEKHTIARJonathan
Collaborator

DEKHTIARJonathan commented Feb 5, 2021

CC: @romerojosh @tgaddair @nluehr

This PR aims to fix a bug with TF2.4 already reported here: tensorflow/tensorflow#46863

Basically this PR does the following:

  • Roll back the allreduce in _aggregate_gradients and make sure the allreduce is executed during gradient computation instead.
  • Yes, this means that in the event the gradient step is skipped (by mixed precision) we will have allreduced for no reason, but that should be very rare, and it is still better than letting workers desynchronize or deadlock.
  • self._aggregated_gradients = True has been moved to gradient computation instead of the allreduce itself. This works around the fact that a gradient tape is used to compute gradients in a custom training loop (CTL); see the sketch below.
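
For readers less familiar with the code paths involved, here is a minimal sketch of a TF 2.x custom training loop (CTL) using DistributedGradientTape, where the allreduce happens when tape.gradient() is called, i.e. during gradient computation, which is where this PR moves the Keras optimizer's allreduce as well. The model, data, and hyper-parameters are placeholders, not code from this PR:

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.SGD(0.01 * hvd.size())
loss_fn = tf.keras.losses.MeanSquaredError()

@tf.function
def train_step(x, y, first_batch):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    # Wrapping the tape makes tape.gradient() return allreduced gradients,
    # so the reduction happens here, at gradient-computation time.
    tape = hvd.DistributedGradientTape(tape)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    if first_batch:
        # Broadcast initial state so all workers start from the same weights.
        hvd.broadcast_variables(model.variables, root_rank=0)
        hvd.broadcast_variables(optimizer.variables(), root_rank=0)
    return loss
```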

@tgaddair: it would be nice to push a new release to address this bug, that way we can unblock TF 2.4 users ;)

Thanks everyone

DEKHTIARJonathan changed the title [TF2.4 - HVD Keras Fix] [TF2.4 - HVD Keras Fix] - Horovod worker desynchronization - Potentially deadlock training TF 2.4 Feb 5, 2021

@tgaddair
Collaborator

tgaddair commented Feb 8, 2021

@DEKHTIARJonathan thanks for the PR! It seems this is breaking some tests; can you look into it?

[1]<stdout>:test_tensorflow2_keras.py::Tf2KerasTests::test_gradient_aggregation FAILED
[1]<stdout>:test_tensorflow2_keras.py::Tf2KerasTests::test_session SKIPPED (Not ...)
[0]<stderr>:[2021-02-05 20:33:38. 76607: W /tmp/pip-req-build-a5836dcu/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
[0]<stderr>:Missing ranks:
[0]<stderr>:0: [PartitionedCall/DistributedRMSprop_Allreduce/cond/then/_5/DistributedRMSprop_Allreduce/cond/HorovodAllreduce_DistributedRMSprop_Allreduce_UnsortedSegmentSum_0]
[0]<stderr>:1: [StatefulPartitionedCall/cond/then/_16/cond/PartitionedCall/DistributedTestingOptimizer_Allreduce/cond/then/_56/DistributedTestingOptimizer_Allreduce/cond/HorovodAllreduce_grads_0]


https://buildkite.com/horovod/horovod/builds/4887#34ba55ce-1e9b-4e4e-a824-efb41fbd7084/216-766

DEKHTIARJonathan force-pushed the jdekhtiar/hvd_keras_tf2.4_fix branch 2 times, most recently from 4051862 to 1e48306 February 10, 2021 20:03
…lly deadlock training TF 2.4

Signed-off-by: DEKHTIARJonathan <jdekhtiar@nvidia.com>
@DEKHTIARJonathan
Collaborator Author

@romerojosh @tgaddair I have completely reworked the PR; we found duplicated calls to allreduce(), and fixing those should address the few remaining bugs.

In addition, for TF 2.4+ I had to skip test_gradient_aggregation() in test/parallel/test_tensorflow2_keras, which no longer makes sense now that we don't allreduce during _aggregate_gradients, so that failing test was expected.

Let's see how the CI comes out.


@tgaddair
Collaborator

Hey @DEKHTIARJonathan, can you explain the skipping of the gradient aggregation test? We still need to support this functionality, even if we don't call _aggregate_gradients explicitly.
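
For context, the functionality referred to here is local gradient aggregation: accumulating gradients over several backward passes before they are applied (and allreduced), which is what test_gradient_aggregation exercises. A hedged usage sketch with the Keras optimizer wrapper, assuming the backward_passes_per_step argument and purely illustrative values:

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Aggregate gradients locally over 4 batches before each (allreduced) update,
# effectively multiplying the batch size per update by 4.
opt = hvd.DistributedOptimizer(
    tf.keras.optimizers.SGD(0.01 * hvd.size()),
    backward_passes_per_step=4,
    average_aggregated_gradients=True,
)

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer=opt, loss="mse")
```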

@DEKHTIARJonathan
Collaborator Author

This PR shifts the gradient allreduce to the point where the gradient is computed (whether through the DistributedOptimizer or through the DistributedGradientTape), which means that in TF 2.4 this unit test, by design, no longer triggers an allreduce in _aggregate_gradients().

Since this unit test provides the gradient as a constant (without actually computing it), the gradient will never be allreduced.
The safest bet for this PR (whose goal is to return to a working state) was to skip this unit test, which conceptually no longer makes sense with the current direction for TF 2.4.

That said, if the use case is still important, I recommend a follow-up PR that does the following:

  • fix test_gradient_aggregation() so it no longer uses tf.constant() for the gradient value and instead actually computes the gradient (a sketch of what that could look like is given after this list);
  • remove the pytest.skip.
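
For illustration, a rough sketch of what the first item could look like: computing the gradient from an actual loss through a tape rather than passing tf.constant values. The variable name and shape below are made up, not taken from the real test:

```python
import tensorflow as tf

var = tf.Variable([[1.0, 2.0], [3.0, 4.0]])

def compute_real_grads():
    """Return a gradient that is actually computed, so the
    allreduce-during-gradient-computation path is exercised."""
    with tf.GradientTape() as tape:
        # Any differentiable function of the variable works; a trivial
        # sum of squares keeps the example small.
        loss = tf.reduce_sum(tf.square(var))
    return tape.gradient(loss, [var])

grads = compute_real_grads()  # replaces e.g. [tf.constant(...)] in the test
```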

Collaborator

tgaddair left a comment


LGTM!

tgaddair merged commit 56cc923 into horovod:master Feb 11, 2021
DEKHTIARJonathan deleted the jdekhtiar/hvd_keras_tf2.4_fix branch February 11, 2021 00:38

@github-actions

Unit Test Results

759 files  +16    759 suites  +16    4h 51m 28s ⏱️ +1m 32s
559 tests  ±0    529 ✔️ -1    29 💤 ±0    1 ❌ +1
15 264 runs  +521    11 614 ✔️ +351    3 649 💤 +169    1 ❌ +1

For more details on these failures, see this check.

Results for commit 56cc923. ± Comparison against base commit c64b1d6.

ashahab added a commit to ashahab/horovod that referenced this pull request Sep 23, 2021
This fixes issue horovod#2664 by performing gradient aggregation with a real gradient instead of a constant.
PR horovod#2647 shifts the gradient allreduce to the point where the gradient is computed (whether through the DistributedOptimizer or through the DistributedGradientTape), which means that this unit test, by design in TF 2.4, doesn't call allreduce in _aggregate_gradients().

Since this unit test provided the gradient as a constant (without actually computing it), the gradient was never allreduced.
This change ensures that a real gradient is computed from a loss function instead of a constant.

Note: the current loss function intentionally evaluates to zero. A future PR should convert it to a real loss function (e.g. MeanSquaredError) and compute gradients from that to test gradient aggregation.
tgaddair pushed a commit that referenced this pull request Oct 30, 2021
…3176)

weihanmines pushed a commit to weihanmines/horovod that referenced this pull request Nov 8, 2021
…horovod#3173)

weihanmines pushed a commit to weihanmines/horovod that referenced this pull request Dec 11, 2021
…orovod#3176)

weihanmines pushed a commit to weihanmines/horovod that referenced this pull request Dec 11, 2021