Elastic mode improvements, MXNet async dependency engine, fixes for latest PyTorch and TensorFlow versions

@tgaddair tgaddair released this 02 Mar 15:57
b089df6

Added

  • Ray: Added elastic keyword parameters to the RayExecutor API, which now supports both static (non-elastic) and elastic Horovod jobs. (#3190)
  • TensorFlow: Added in-place broadcasting of variables. (#3128)
  • Elastic: Added support for resurrecting blacklisted hosts. (#3319)
  • MXNet: Added support for the MXNet async dependency engine. (#3242, #2963)
  • Spark/Lightning: Added history to the Lightning estimator. (#3214)
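
The unified RayExecutor entry above can be illustrated with a minimal sketch. It assumes the elastic keyword parameters are named `min_workers` and `max_workers` and that static jobs use `num_workers`; the helper names `ray_executor_kwargs` and `launch` are hypothetical and not part of Horovod. Check the Horovod on Ray documentation for the exact signature in your version.

```python
from typing import Any, Callable, Dict, Optional


def ray_executor_kwargs(
    elastic: bool,
    num_workers: Optional[int] = None,
    min_workers: Optional[int] = None,
    max_workers: Optional[int] = None,
    use_gpu: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble keyword arguments for RayExecutor.

    The parameter names mirror what we assume RayExecutor accepts; this
    function itself is illustrative and does not exist in Horovod.
    """
    if elastic:
        # Elastic job: the worker count may vary between the two bounds.
        if min_workers is None or min_workers < 1:
            raise ValueError("elastic jobs need min_workers >= 1")
        if max_workers is not None and max_workers < min_workers:
            raise ValueError("max_workers must be >= min_workers")
        kwargs: Dict[str, Any] = {"min_workers": min_workers, "use_gpu": use_gpu}
        if max_workers is not None:
            kwargs["max_workers"] = max_workers
        return kwargs

    # Static (non-elastic) job: a fixed worker count is required.
    if not num_workers:
        raise ValueError("static jobs need num_workers >= 1")
    return {"num_workers": num_workers, "use_gpu": use_gpu}


def launch(train_fn: Callable[[], Any], **kwargs: Any) -> Any:
    """Start a RayExecutor with the given kwargs and run train_fn on it.

    Assumes ray.init(...) has already been called. Horovod is imported
    lazily so the helper above works without Ray or Horovod installed.
    """
    from horovod.ray import RayExecutor

    settings = RayExecutor.create_settings(timeout_s=30)
    executor = RayExecutor(settings, **kwargs)
    executor.start()
    try:
        return executor.run(train_fn)
    finally:
        executor.shutdown()
```

Under these assumptions, `launch(train_fn, **ray_executor_kwargs(elastic=True, min_workers=1, max_workers=4))` would start an elastic job, while `ray_executor_kwargs(elastic=False, num_workers=2)` yields the arguments for a static one.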

Changed

  • Moved to CMake version 3.13 with first-class CUDA language support and re-enabled parallelized builds. Uses a temporary installation of CMake if CMake 3.13 is not found. (#3261, #3371)
  • Moved the released Docker images horovod and horovod-cpu to Ubuntu 20.04 and Python 3.8. (#3393)
  • Spark Estimator: Don't shuffle row groups if the training data must not be shuffled. (#3369)
  • Spark/Lightning: Reduced memory footprint of async dataloader. (#3239)
  • Elastic: Improved handling of NCCL errors in elastic scenarios. (#3112)
  • Spark/Lightning: Do not overwrite model with checkpoint by default. (#3201)
  • Made the checkpoint name optional so that users can save in h5 format. (#3411)

Deprecated

  • Deprecated ElasticRayExecutor APIs in favor of the new RayExecutor API. (#3190)

Removed

  • Spark: Removed the h5py<3 constraint, as it is no longer needed for TensorFlow >2.5.0. (#3301)

Fixed

  • Elastic Spark: Fixed indices in initial task-to-task registration. (#3410)
  • PyTorch: Fixed GIL-related deadlock with PyTorch 1.10.1. (#3352)
  • PyTorch: Fixed finalization of ProcessSetTable. (#3351)
  • Fixed remote trainers to point to the correct shared lib path. (#3258)
  • Fixed imports from tensorflow.python.keras with TensorFlow 2.6.0+. (#3403)
  • Fixed Adasum communicator init logic. (#3379)
  • Lightning: Fixed resume logger. (#3375)
  • Fixed the checkpoint directory structure for PyTorch and PyTorch Lightning. (#3362)
  • Fixed possible integer overflow in multiplication. (#3368)
  • Fixed the pytorch_lightning_mnist.py example. (#3245, #3290)
  • Fixed barrier segmentation fault. (#3313)
  • Fixed hvd.barrier() tensor queue management. (#3300)
  • Fixed PyArrow "list index out of range" IndexError. (#3274)
  • Elastic: Fixed all workers sometimes failing on elastic Horovod failure. (#3264)
  • Spark/Lightning: Fixed setting limit_train_batches and limit_val_batches. (#3237)
  • Elastic: Fixed ElasticSampler and hvd.elastic.state losing some indices of processed samples when nodes dropped. (#3143)
  • Spark/Lightning: Fixed history metrics for estimator serialization. (#3216)
  • Ray: Fixed RayExecutor to fail when num_workers=0 and num_hosts=None. (#3210)
  • Spark/Lightning: Fixed checkpoint callback dirpath typo. (#3204)
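
To make the ElasticSampler entry above more concrete, here is a minimal, framework-free sketch of the bookkeeping involved: each worker records the sample indices it has processed, and those records have to be merged across workers before the remaining indices are re-sharded after a node drops. The `TrackingSampler` class and its methods are hypothetical and written only for illustration; this is not Horovod's implementation.

```python
from typing import Iterable, List, Set


class TrackingSampler:
    """Hypothetical stand-in for an elastic sampler (not Horovod code)."""

    def __init__(self, dataset_size: int, rank: int, num_replicas: int) -> None:
        self.dataset_size = dataset_size
        self.processed: Set[int] = set()
        self.indices: List[int] = []
        self.reset(rank, num_replicas)

    def reset(self, rank: int, num_replicas: int) -> None:
        """Re-shard the not-yet-processed indices over the current workers."""
        remaining = [i for i in range(self.dataset_size) if i not in self.processed]
        # Strided split: worker `rank` takes every num_replicas-th index.
        self.indices = remaining[rank::num_replicas]

    def record(self, indices: Iterable[int]) -> None:
        """Remember indices this worker has finished processing."""
        self.processed.update(indices)

    def merge(self, other_processed: Iterable[int]) -> None:
        """Fold in indices processed by another worker (state sync)."""
        self.processed.update(other_processed)


# Two workers share a dataset of 10 samples.
worker0 = TrackingSampler(dataset_size=10, rank=0, num_replicas=2)
worker1 = TrackingSampler(dataset_size=10, rank=1, num_replicas=2)

# Each worker processes the first two samples of its shard.
worker0.record(worker0.indices[:2])  # indices 0 and 2
worker1.record(worker1.indices[:2])  # indices 1 and 3

# State is synchronized, then worker 1 drops out and worker 0 continues alone.
worker0.merge(worker1.processed)
worker0.reset(rank=0, num_replicas=1)
remaining_after_drop = worker0.indices  # [4, 5, 6, 7, 8, 9]
```

If the `merge` step were skipped in this sketch, worker 0 would not know that indices 1 and 3 had been processed and would schedule them again, which is the kind of lost bookkeeping the fix addresses.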