
Introducing Adasum algorithm to do allreduction. #1485

Merged
merged 20 commits into horovod:master from Tixxx:alpha_light_official Nov 25, 2019

Conversation

Tixxx (Contributor) commented Oct 29, 2019

  1. Adasum operations for both the CPU and NCCL builds of Horovod
  2. Framework support in TensorFlow and PyTorch to enable Adasum
  3. A new optimizer for TensorFlow and PyTorch that delivers a more accurate estimate when using Adasum

Main contributors:
Olli Saarikivi (olsaarik)
Vadim Eksarevskiy (vaeksare)
Jaliya Ekanayake (jaliyae)
Todd Mytkowicz (klipto)
Saeed Maleki (saeedmaleki)
Sergii Dymchenko (kit1980)
Tianju Xu (Tixxx)

Tixxx (Contributor, Author) commented Oct 29, 2019

Adasum

What is Adasum

Scaling DNN training to many GPUs usually comes at the cost of degraded convergence. This is because with larger batch sizes the gradients are averaged, so the per-example learning rate becomes smaller. To compensate, the learning rate is typically scaled up, but that can cause the model parameters to diverge. Adasum addresses both issues without introducing any new hyperparameter.

Suppose there are two almost-parallel gradients, g1 and g2, coming from two different GPUs, and they need to be reduced as shown in the figure below. The two common reductions are g1+g2 (the gray vector) and (g1+g2)/2 (the green vector). g1+g2 may cause the model to diverge, since it effectively moves in the direction of g1 or g2 with twice the magnitude of either. Therefore, (g1+g2)/2 is generally the safer and more desirable choice.
[Figure: two almost-parallel gradients g1 and g2, with g1+g2 in gray and (g1+g2)/2 in green]

Now consider the two orthogonal gradients g1 and g2 in the figure below. Since g1 and g2 are in two different dimensions and independent of each other, g1+g2 may not cause divergence.
[Figure: two orthogonal gradients g1 and g2]

Finally, consider the third scenario, where g1 and g2 are neither parallel nor orthogonal, as shown in the figure below. In this case, Adasum projects g2 onto the orthogonal complement of g1 (the pink vector) and adds that to g1 to produce the reduced vector. The final vector then moves in each dimension only as much as g1 or g2 does, and therefore causes no divergence.
[Figure: gradients g1 and g2 that are neither parallel nor orthogonal, with the projection of g2 onto the orthogonal complement of g1 shown in pink]

This idea extends to many gradients as well. Suppose there are 2^n gradients coming from 2^n different GPUs. Adasum inductively takes pairs of gradients and reduces them using the method above until all of them are reduced into one gradient.
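To make this concrete, here is a minimal NumPy sketch of the pairwise rule and the inductive pairing described above. The closed-form expression in adasum_pair is an assumption chosen to match the behavior in the figures (parallel gradients are averaged, orthogonal gradients are added); the C++ code in this PR is the authoritative implementation.

import numpy as np

def adasum_pair(g1, g2):
    # Assumed pairwise rule: interpolates between (g1 + g2) / 2 for parallel
    # gradients and g1 + g2 for orthogonal gradients (assumes non-zero inputs).
    d = float(np.dot(g1, g2))
    return (1.0 - d / (2.0 * np.dot(g1, g1))) * g1 + \
           (1.0 - d / (2.0 * np.dot(g2, g2))) * g2

def adasum_reduce(grads):
    # Inductively reduce 2^n gradients pair by pair until one remains.
    while len(grads) > 1:
        grads = [adasum_pair(grads[i], grads[i + 1])
                 for i in range(0, len(grads), 2)]
    return grads[0]

# Sanity checks matching the figures above:
g = np.array([1.0, 2.0])
print(adasum_pair(g, g))                                        # parallel: (g1 + g2) / 2
print(adasum_pair(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # orthogonal: g1 + g2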

Highlights of code changes

For extensibility, we provide an algorithmic interface that does not depend on any particular communication library. An MPI implementation of Adasum backs the new operations we have added to Horovod. Here is the list of proposed changes:

  • Adasum class in horovod/common/ops/adasum/adasum.h: Algorithmic interface of Adasum which contains the main logic.

  • AdasumMPI class in horovod/common/ops/adasum/adasum_mpi.h and adasum_mpi.cc: An MPI implementation of Adasum algorithm.

  • AdasumMPIAllreduceOp class in horovod/common/ops/adasum_mpi_operations.h and adasum_mpi_operations.cc: A new operation class that inherits from AdasumMPI and Horovod's AllreduceOp. This utilizes the fusion buffer to perform efficient Adasum reductions on CPU when HOROVOD_GPU_ALLREDUCE is set to None.

  • AdasumCudaAllreduceOp class in horovod/common/ops/adasum_cuda_operations.h and adasum_cuda_operations.cc: A new operation class that inherits from AdasumMPI and Horovod's NCCLAllreduce. This is a hierarchical operation that uses NCCL for intra-node averaging and the Adasum algorithm for inter-node reductions (a simplified sketch follows this list). This op requires Horovod to be compiled with HOROVOD_GPU_ALLREDUCE=NCCL.

  • A new response and request type has been introduced in addition to the existing ones:

enum ResponseType { ALLREDUCE = 0, ALLGATHER = 1, BROADCAST = 2, ADASUM = 3, ERROR = 4};

  • A new environment variable, HOROVOD_ADASUM_MPI_CHUNK_SIZE, has been introduced to improve MPI communication efficiency on some platform configurations (e.g., Azure NC-series machines with Intel MPI).
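To illustrate the hierarchical flow of AdasumCudaAllreduceOp referenced above, the following is a hypothetical, simplified NumPy simulation. The function names and data layout are illustrative only; the real operation lives in adasum_cuda_operations.cc and uses NCCL and MPI rather than NumPy.

import numpy as np

def adasum_pair(g1, g2):
    # Same assumed pairwise rule as in the earlier sketch.
    d = float(np.dot(g1, g2))
    return (1.0 - d / (2.0 * np.dot(g1, g1))) * g1 + \
           (1.0 - d / (2.0 * np.dot(g2, g2))) * g2

def hierarchical_adasum(grads_per_node):
    # grads_per_node[i][j] is the gradient from GPU j on node i.
    # Step 1: intra-node averaging (performed with NCCL in the real op).
    node_grads = [np.mean(np.stack(gpu_grads), axis=0) for gpu_grads in grads_per_node]
    # Step 2: inter-node reduction with the pairwise Adasum rule (performed over MPI).
    while len(node_grads) > 1:
        node_grads = [adasum_pair(node_grads[i], node_grads[i + 1])
                      for i in range(0, len(node_grads), 2)]
    return node_grads[0]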

In addition to the above changes in Horovod's common library, we also made a set of changes at the framework layer, for both TensorFlow and PyTorch, to make Adasum easy to use:

  • An enum listing the allreduce operations has been introduced so users can select among Average, Sum, and Adasum. This keeps the API backward compatible and makes it easy to add more ops in the future.

  • An optional op parameter has been added to the DistributedOptimizer and allreduce APIs so users can specify which operation to perform.

  • A new distributed optimizer has been added to both frameworks to support the Adasum algorithm. Because Adasum needs to operate on the full magnitude of the update, the new optimizer reduces the difference between the weights before and after the wrapped optimizer's step, which gives a more accurate estimate (a schematic sketch follows the examples below). When op=hvd.Adasum is specified, the new optimizer is used.

    DistributedOptimizer example for Tensorflow:

    opt = tf.train.AdamOptimizer(0.001)

    opt = hvd.DistributedOptimizer(opt, backward_passes_per_step=5, op=hvd.Adasum)

    Allreduce example for Tensorflow:

    hvd.allreduce(tensor, op=hvd.Adasum)

    DistributedOptimizer example for Pytorch:

    optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)

    optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters(), compression=compression, backward_passes_per_step=5, op=hvd.Adasum)

    Allreduce example for Pytorch:

    hvd.allreduce(tensor, op=hvd.Adasum)
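For intuition, the following is a schematic PyTorch-style sketch of the weight-delta idea behind the new optimizer. It is not the optimizer implementation in this PR; only hvd.allreduce(..., op=hvd.Adasum) is the API being introduced here, and the helper function and its name are illustrative.

import horovod.torch as hvd

def adasum_style_step(optimizer, model):
    # 1. Snapshot the weights before the wrapped optimizer's local step.
    before = [p.detach().clone() for p in model.parameters()]
    # 2. Let the wrapped optimizer take its normal local step.
    optimizer.step()
    # 3. Reduce the resulting weight deltas with Adasum and re-apply them,
    #    rather than reducing the raw gradients.
    for p, w0 in zip(model.parameters(), before):
        delta = p.detach() - w0
        reduced = hvd.allreduce(delta, op=hvd.Adasum)
        p.data.copy_(w0 + reduced)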

Additional notes

  • Adasum ensures correct convergence behavior even with large effective batch sizes.

  • As the number of ranks scales up, the learning rate does not need to be scaled when the Adasum reduction runs on CPU. If Horovod is compiled with HOROVOD_GPU_ALLREDUCE=NCCL, the learning rate should be scaled by the number of GPUs on each node (see the snippet after these notes).

  • PyTorch training in fp16 is not yet supported by this pull request. We are in the process of integrating Apex into the new optimizer to enable full mixed-precision training with Adasum in PyTorch.

  • When Horovod is compiled with HOROVOD_GPU_ALLREDUCE=NCCL and training runs on a single node, reductions are performed only through NCCL averaging; the Adasum algorithm does not take place in this configuration.
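A small TensorFlow-flavored snippet of the learning-rate note above, assuming the same setup as the TensorFlow DistributedOptimizer example earlier in this description (the scaling choice is the only point being illustrated):

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

base_lr = 0.001
# CPU Adasum path: keep base_lr unchanged regardless of the number of ranks.
# NCCL build (HOROVOD_GPU_ALLREDUCE=NCCL): scale by the number of GPUs per node,
# since intra-node reduction is a plain NCCL average.
lr = base_lr * hvd.local_size()

opt = tf.train.AdamOptimizer(lr)
opt = hvd.DistributedOptimizer(opt, op=hvd.Adasum)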

@Tixxx Tixxx force-pushed the Tixxx:alpha_light_official branch 5 times, most recently from c315af2 to a0e2a48 Oct 29, 2019
tgaddair (Collaborator) left a comment

Still going through the code in detail, but wanted to give you all some quick feedback as I was getting things to compile on my local Mac laptop.

@Tixxx Tixxx force-pushed the Tixxx:alpha_light_official branch from 46e00f0 to 731552e Nov 4, 2019
jaliyae commented Nov 5, 2019

@jaliyae is added to the review. #Closed

@Tixxx Tixxx closed this Nov 6, 2019
Tixxx (Contributor, Author) commented Nov 6, 2019

re-opening to trigger CI

@Tixxx Tixxx reopened this Nov 6, 2019
@Tixxx Tixxx force-pushed the Tixxx:alpha_light_official branch 2 times, most recently from afcadd7 to b2bcf91 Nov 6, 2019
tgaddair (Collaborator) commented Nov 15, 2019

Hey @Tixxx, can you rebase off of master? There was a problem with unit tests caused by changes to tf-nightly. The tests were fixed by #1515.

Tixxx added 12 commits Oct 29, 2019
  • made test to be compatible with python27
  • added adasum as an option in pytorch resnet example
  • fixed incorrect lr scaling in examples when using Adasum
  • improved tf test to support tf 2.0
Tixxx added 8 commits Nov 5, 2019
  • …alized.
  • dont run cpu tests if mpi is not available
  • SKip Gloo tests for Adasum
  • …t Horovod is compiled without GPU-ALLREDUCE flag.
  • Fixed mxnet import failure.
@Tixxx Tixxx force-pushed the Tixxx:alpha_light_official branch from 88b0a0d to 08cf4e9 Nov 15, 2019
tgaddair (Collaborator) left a comment

Everything looks good from our end. Let's go ahead and merge this, with plans for a few follow-up PRs:

  • Documentation describing Adasum (could use description from this PR) along with guidance on when to expect it to significantly improve results.
  • TensorFlow 2.0 support.
  • Keras support.
@tgaddair tgaddair merged commit 5fa1d7a into horovod:master Nov 25, 2019
2 checks passed:
  • DCO
  • buildkite/horovod/pr: Build #1385 passed (51 minutes, 51 seconds)
nvcastet (Contributor) commented Nov 25, 2019

@Tixxx Are there any comparative experiment results for Adasum showing the speed and accuracy of the training for some benchmarks?

thyeros commented Nov 26, 2019

@Tixxx does AdaSum guarantee two critical convergence properties of SGD?

  • unbiased gradient estimator
  • bounded variance
saeedmaleki commented Nov 27, 2019

@Tixxx does AdaSum guarantee two critical convergence properties of SGD?

  • unbiased gradient estimator
  • bounded variance

Hi @thyeros,

The expected value of the gradients computed with Adasum is not necessarily the true gradient over all training examples, but it has a positive inner product with it. Note that the loss value decreases if the model is updated in any direction that has a positive inner product with the gradient (provided that the higher-order terms in the Taylor series are negligible).
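For reference, the standard first-order Taylor argument behind this statement (general SGD reasoning, not specific to this PR), written in LaTeX:

L(w - \eta\, d) \;\approx\; L(w) - \eta\, \langle \nabla L(w),\, d \rangle

so for a sufficiently small learning rate \eta, any update direction d with \langle \nabla L(w), d \rangle > 0 decreases the loss.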

The variance of the gradients computed with Adasum is bounded as long as the learning rate is scheduled properly.

saeedmaleki commented Nov 27, 2019

@Tixxx Are there any comparative experiment results for Adasum showing the speed and accuracy of the training for some benchmarks?

Hi @nvcastet,

Yes, we will provide those benchmarks and numbers in the near future.

jeffdaily added a commit to ROCmSoftwarePlatform/horovod that referenced this pull request Nov 27, 2019
* Introducing Adasum algorithm to do allreduction.
@@ -96,6 +97,10 @@ def extension_available(ext_base_name, verbose=False):
    return _check_extension_lambda(
        ext_base_name, available_fn, 'built', verbose) or False

def gpu_available(ext_base_name, verbose=False):
    available_fn = lambda ext: ext._check_has_gpu()

ConeyLiu (Contributor) commented Dec 3, 2019

Hi, I got the following errors with horovodrun:

Checking whether extension tensorflow was running with GPU.
Traceback (most recent call last):
  File "/home/xianyang/opt/miniconda3/lib/python3.7/site-packages/horovod/common/util.py", line 73, in _target_fn
    ext = importlib.import_module('.' + ext_base_name, 'horovod')
  File "/home/xianyang/opt/miniconda3/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/xianyang/opt/miniconda3/lib/python3.7/site-packages/horovod/tensorflow/__init__.py", line 43, in <module>
    has_gpu = gpu_available('tensorflow')
  File "/home/xianyang/opt/miniconda3/lib/python3.7/site-packages/horovod/common/util.py", line 103, in gpu_available
    ext_base_name, available_fn, 'running with GPU', verbose) or False
  File "/home/xianyang/opt/miniconda3/lib/python3.7/site-packages/horovod/common/util.py", line 90, in _check_extension_lambda
    p.start()
  File "/home/xianyang/opt/miniconda3/lib/python3.7/multiprocessing/process.py", line 110, in start
    'daemonic processes are not allowed to have children'
AssertionError: daemonic processes are not allowed to have children
Extension tensorflow was NOT running with GPU.

This could be reproduced with:
horovod.common.util.gpu_available('tensorflow', True)

nvcastet added a commit to nvcastet/horovod that referenced this pull request Dec 10, 2019
Add extra checks for x86 compiler flags and x86 AVX headers/functions

Signed-off-by: Nicolas V Castet <nvcastet@us.ibm.com>