
Better support for model parallelism, more reduction operations for allreduce (min, max, product), grouped allgather and reducescatter, Petastorm reader-level parallel shuffling, and an NVTabular data loader

Released by @EnricoMi on 13 Oct 12:29 · 79 commits to master since this release · commit c638dce

Added

  • Spark Estimator: Added support for custom data loaders in KerasEstimator. (#3603)
  • Spark Estimator: Added NVTabular data loader for KerasEstimator. (#3603)
  • Spark Estimator: Added gradient accumulation support to Spark torch estimator. (#3681)
  • TensorFlow: Added register_local_var functionality to distributed optimizers and local gradient aggregators. (#3695)
  • TensorFlow: Added support for local variables for BroadcastGlobalVariablesCallback. (#3703)
  • Enabled use of native ncclAvg op for NCCL allreduces. (#3646)
  • Added support for additional reduction operations for allreduce (min, max, product); see the first sketch after this list. (#3660)
  • Added 2D torus allreduce using NCCL. (#3608)
  • Added support for Petastorm reader-level parallel shuffling. (#3665)
  • Added random seed support for Lightning datamodule to generate reproducible data loading outputs. (#3665)
  • Added support for int8 and uint8 allreduce and grouped_allreduce in TensorFlow. (#3649)
  • Added support for batched memory copies in GPUAllgather. (#3590)
  • Added support for batched memory copies in GPUReducescatter. (#3621)
  • Added hvd.grouped_allgather() and hvd.grouped_reducescatter() operations; see the second sketch after this list. (#3594)
  • Added warning messages if output tensor memory allocations fail. (#3594)
  • Added register_local_source and use_generic_names functionality to DistributedGradientTape. (#3628)
  • Added PartialDistributedGradientTape() API for model parallel use cases. (#3643)
  • Spark/Lightning: Added reader_worker_count and reader_pool_type. (#3612)
  • Spark/Lightning: Added transformation_edit_fields and transformation_removed_fields params for EstimatorParams. (#3651)
  • TensorFlow: Added docstring for hvd.grouped_allreduce(). (#3594)
  • ROCm: Enabled alltoall. (#3654)
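
A minimal sketch of the new allreduce reduction operations, assuming the PyTorch frontend and that the ops are exposed as hvd.Min, hvd.Max, and hvd.Product alongside the existing hvd.Sum and hvd.Average:

```python
import torch
import horovod.torch as hvd

hvd.init()
x = torch.tensor([float(hvd.rank() + 1)])

# Element-wise reductions across all ranks using the new ops.
minimum = hvd.allreduce(x, op=hvd.Min)      # smallest value contributed by any rank
maximum = hvd.allreduce(x, op=hvd.Max)      # largest value contributed by any rank
product = hvd.allreduce(x, op=hvd.Product)  # product of all ranks' values
```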
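
Similarly, a hedged sketch of the new grouped collectives, assuming (as with hvd.grouped_allreduce) that each takes a list of tensors and returns a list of results, fusing the whole group into a single operation:

```python
import torch
import horovod.torch as hvd

hvd.init()
# First dimensions are multiples of the number of ranks so that
# reducescatter can hand every rank an equally sized shard.
tensors = [torch.rand(4 * hvd.size(), 3), torch.rand(2 * hvd.size())]

gathered = hvd.grouped_allgather(tensors)       # each output concatenates all ranks' inputs
scattered = hvd.grouped_reducescatter(tensors)  # each output holds this rank's shard of the sum
```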

Changed

  • The default Petastorm reader pool was changed from process to thread for lower memory usage. (#3665)
  • Keras: Support only legacy optimizers in Keras 2.11+; see the sketch after this list. (#3725)
  • Gloo: When negotiating, use gather rather than allgather. (#3633)
  • Use packaging.version instead of distutils version classes. (#3700)
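
A short sketch of the Keras 2.11+ change: pick an optimizer from tf.keras.optimizers.legacy before wrapping it with Horovod's DistributedOptimizer (the learning-rate scaling by hvd.size() is shown only as a common convention, not a requirement):

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# With Keras 2.11+, only the legacy optimizer classes are supported.
opt = tf.keras.optimizers.legacy.SGD(learning_rate=0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
```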

Deprecated

  • Deprecated the shuffle_buffer_size field of EstimatorParams. Use the shuffle parameter to enable or disable shuffling instead, as in the sketch below. (#3665)
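
A hedged sketch of the replacement, using a KerasEstimator with a placeholder store path, model, and column names purely for illustration; the relevant line is passing shuffle instead of the deprecated shuffle_buffer_size:

```python
import tensorflow as tf
from horovod.spark.keras import KerasEstimator
from horovod.spark.common.store import Store

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(2,))])

estimator = KerasEstimator(
    num_proc=2,
    store=Store.create('/tmp/hvd_store'),         # placeholder store location
    model=model,
    optimizer=tf.keras.optimizers.legacy.Adam(),
    loss='mse',
    feature_cols=['features'],                    # placeholder column names
    label_cols=['label'],
    batch_size=32,
    epochs=1,
    shuffle=True,                                 # replaces the deprecated shuffle_buffer_size
)
```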

Removed

  • Build: Removed std::regex use for better cxxabi11 compatibility. (#3584)

Fixed

  • TensorFlow: Fixed the optimizer iteration increments when backward_passes_per_step > 1. (#3631)
  • Fixed FuseResponses() on BATCHED_D2D_PADDING edge cases for Reducescatter and/or ROCm. (#3621)
  • PyTorch: Fixed Reducescatter functions to raise HorovodInternalError rather than RuntimeError. (#3594)
  • PyTorch on GPUs without GPU operations: Fixed grouped allreduce to set CPU device in tensor table. (#3594)
  • Fixed race condition in PyTorch allocation handling. (#3639)
  • Build: Fixed finding nvcc (if not in $PATH) with older versions of CMake. (#3682)
  • Fixed reducescatter() and grouped_reducescatter() to raise clean exceptions for scalar inputs. (#3699)
  • Updated Eigen submodule to fix build on macOS with aarch64. (#3619)
  • Build: Correctly select files in torch/ directory to be hipified. (#3588)
  • Build: Modify regex match for CUDA|ROCm in FindPytorch.cmake. (#3593)
  • Build: Fixed ROCm-specific build failure. (#3630)