
Process sets, XLA support, improved GPU backend

Released by @tgaddair on 06 Oct 17:52 · 66ad6d5 · 275 commits to master since this release

Added

  • Added process sets to concurrently run collective operations on subsets of Horovod processes in TensorFlow, PyTorch, and MXNet. (#2839, #3042, #3043, #3054, #3083, #3090) A process-set sketch follows this list.

  • Added XLA support for Allreduce via tf.function(jit_compile=True). (#3053) An XLA sketch follows this list.

  • Added fused buffer scaling and unpack/pack kernels on GPU. (#2973)

  • Added support for NCCL on CUDA 11.4. (#3182)

  • Added fp16 compression for MXNet. (#2987)

  • Added terminate_on_nan flag to Spark Lightning estimator. (#3088) A sketch follows this list.

  • Added barrier() API to torch module to support simple synchronization among ranks and to achieve parity with PyTorch DDP and similar frameworks. (#3139) A barrier sketch follows this list.

  • Added params for customizing Tensorboard callback. (#3153)

  • Added hvd.cross_rank() for keras. (#3008) A cross_rank sketch follows this list.

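Process sets let a collective run on a subset of ranks instead of the global communicator. A minimal TensorFlow sketch, assuming a four-process job; the membership lists and tensor are illustrative only:

```python
import tensorflow as tf
import horovod.tensorflow as hvd

# Two illustrative process sets; every rank must pass the same list to hvd.init().
even_set = hvd.ProcessSet([0, 2])
odd_set = hvd.ProcessSet([1, 3])
hvd.init(process_sets=[even_set, odd_set])

tensor = tf.ones([4]) * hvd.rank()

# Reduce only within the set this rank belongs to; ranks outside the set do not participate.
my_set = even_set if hvd.rank() % 2 == 0 else odd_set
reduced = hvd.allreduce(tensor, process_set=my_set)
```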
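Allreduce can now run inside an XLA-compiled tf.function. A minimal sketch (the tensor and averaging choice are arbitrary; depending on the build, the Horovod XLA docs may also describe an environment setting needed to enable the XLA ops):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

@tf.function(jit_compile=True)  # compile the step, including the allreduce, with XLA
def allreduce_mean(tensor):
    return hvd.allreduce(tensor, op=hvd.Average)

result = allreduce_mean(tf.constant([1.0, 2.0]) * hvd.rank())
```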
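The new terminate_on_nan flag tells the Spark Lightning estimator to stop fitting when the loss becomes NaN. A rough sketch, assuming the LightningModule, store, DataFrame, and column names are already set up as for an ordinary horovod.spark.lightning run:

```python
from horovod.spark.lightning import TorchEstimator

estimator = TorchEstimator(
    model=lit_module,        # a pl.LightningModule defined elsewhere (assumed)
    store=store,             # a horovod.spark Store on shared storage (assumed)
    feature_cols=['features'],
    label_cols=['label'],
    batch_size=64,
    epochs=5,
    terminate_on_nan=True,   # new in this release (#3088)
)
torch_model = estimator.fit(train_df)  # train_df: a Spark DataFrame (assumed)
```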
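hvd.barrier() blocks until every rank reaches it, which is useful before timing a section or reading something that rank 0 just wrote. A minimal PyTorch sketch:

```python
import horovod.torch as hvd

hvd.init()

if hvd.rank() == 0:
    # e.g. rank 0 prepares a file or checkpoint the other ranks will read
    pass

hvd.barrier()  # all ranks wait here until everyone has arrived
```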
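hvd.cross_rank() reports this worker's rank across hosts, i.e. among the processes that share the same local rank. A small sketch using the horovod.tensorflow.keras binding; the log-directory scheme is illustrative:

```python
import horovod.tensorflow.keras as hvd

hvd.init()

# cross_rank() is 0 on the first node, 1 on the second, and so on,
# for processes with the same local_rank().
log_dir = f"logs/node-{hvd.cross_rank()}/worker-{hvd.local_rank()}"
```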

Changed

  • Implemented more asynchronous dependency handling on GPU. (#2963)

  • Ray: RayExecutor will now use the current placement group instead of always creating a new one. (#3134)

  • Lightning: turned off shuffling for validation dataset. (#2974)


  • Extended hvd.join() to return the last rank that joined. (#3097) A sketch follows this list.
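hvd.join() is used when ranks may finish unequal amounts of work (for example, uneven batch counts); it now also reports which rank joined last, which helps spot stragglers. A minimal PyTorch sketch:

```python
import horovod.torch as hvd

hvd.init()

# ... each rank trains on a possibly different number of batches ...

last_rank = hvd.join()  # blocks until all ranks have joined, then returns the last one
if hvd.rank() == 0:
    print(f"last rank to join: {last_rank}")
```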

Removed

  • Spark/Keras: removed bare Keras support. (#3191)

Fixed

  • Fixed Horovod develop/editable install mode and incremental builds. (#3074)

  • Estimator/Lightning: use lightning datamodule. (#3084)

  • Fixed the Horovod Spark StringType and numpy type mapping issue. (#3146)

  • Fixed bug in Lightning Profiler on Ray. (#3122)

  • Fixed torch op lazy release to prevent OOM in elastic training. (#3110)

  • Lightning: Fixed usage of the checkpoint callback. (#3186)

  • Fixed MPICH support to use Intel MPI's implementation. (#3148)

  • Fixed race condition in PyTorch async dataloader. (#3120)

  • Keras: Fixed error in the LearningRateScheduler callback. (#3135, #3142)