Introducing Adasum algorithm to do allreduction. #1485
Still going through the code in detail, but wanted to give you all some quick feedback as I was getting things to compile on my local Mac laptop.
@jaliyae is added to the review. #Closed
re-opening to trigger CI
1. Adasum operations for both CPU and NCCL build of Horovod
2. Framework support in Tensorflow and Pytorch to enable Adasum
3. A new optimizer added for Tensorflow and Pytorch to deliver more accurate estimation when using Adasum

Main contributors:
Olli Saarikivi (olsaarik)
Vadim Eksarevskiy (vaeksare)
Jaliya Ekanayake (jaliyae)
Todd Mytkowicz (klipto)
Saeed Maleki(saeedmaleki)
Sergii Dymchenko(kit1980)

Signed-off-by: Tix <tix@microsoft.com>
Made test compatible with Python 2.7. Signed-off-by: Tix <tix@microsoft.com>
added adasum as an option in pytorch resnet example Signed-off-by: Tix <tix@microsoft.com>
Fixed incorrect lr scaling in examples when using Adasum. Improved tf test to support tf 2.0. Signed-off-by: Tix <tix@microsoft.com>
…alized. Don't run cpu tests if mpi is not available. Signed-off-by: Tix <tix@microsoft.com>
Skip Gloo tests for Adasum. Signed-off-by: Tix <tix@microsoft.com>
…t Horovod is compiled without GPU-ALLREDUCE flag. Signed-off-by: Tix <tix@microsoft.com>
Fixed mxnet import failure. Signed-off-by: Tix <tix@microsoft.com>
Everything looks good from our end. Let's go ahead and merge this with plans for a few followup PRs:
- Documentation describing Adasum (could use description from this PR) along with guidance on when to expect it to significantly improve results.
- TensorFlow 2.0 support.
- Keras support.
@Tixxx Are there any comparative experiment results for Adasum showing the speed and accuracy of the training for some benchmarks?
@Tixxx does AdaSum guarantee two critical convergence properties of SGD?
Hi @thyeros, the expected value of the gradients computed with Adasum is not necessarily the true gradient over all training examples, but it has a positive inner product with it. Note that the loss value decreases if the model is updated along any direction that has a positive inner product with the gradient (provided that the higher-order terms in the Taylor series are negligible). The variance of the gradients computed with Adasum is bounded as long as the learning rate is scheduled properly.
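To make the inner-product argument concrete, here is a minimal sketch on a toy quadratic loss. It does not implement Adasum; the loss function, the starting point `w0`, the update `direction`, and the two step sizes are all assumptions chosen for illustration. It only shows that a direction with a positive inner product with the gradient reduces the loss for a small step, and that the guarantee can fail once the step is large enough for the higher-order terms to matter.

```python
def loss(w):
    """Toy quadratic loss 0.5 * ||w||^2 (an assumption, not Horovod code)."""
    return 0.5 * sum(x * x for x in w)


def grad(w):
    """Gradient of the toy loss; for 0.5 * ||w||^2 it is w itself."""
    return list(w)


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def step(w, direction, lr):
    """Move the parameters against `direction` with step size `lr`."""
    return [x - lr * d for x, d in zip(w, direction)]


w0 = [3.0, 4.0]
direction = [4.0, -1.0]  # hypothetical update: not the gradient, but aligned with it
inner = dot(grad(w0), direction)

loss_before = loss(w0)
loss_small_step = loss(step(w0, direction, lr=0.1))  # first-order term dominates
loss_large_step = loss(step(w0, direction, lr=2.0))  # second-order term dominates

if __name__ == "__main__":
    print("inner product:", inner)
    print("loss before:", loss_before)
    print("loss after small step:", loss_small_step)
    print("loss after large step:", loss_large_step)
```

The large-step case is why the caveat about negligible higher-order terms, and the remark about scheduling the learning rate, both matter.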
@@ -96,6 +97,10 @@ def extension_available(ext_base_name, verbose=False):
    return _check_extension_lambda(
        ext_base_name, available_fn, 'built', verbose) or False


def gpu_available(ext_base_name, verbose=False):
    available_fn = lambda ext: ext._check_has_gpu()
Hi, I got the following errors with horovodrun:
Checking whether extension tensorflow was running with GPU.
Traceback (most recent call last):
  File "/home/xianyang/opt/miniconda3/lib/python3.7/site-packages/horovod/common/util.py", line 73, in _target_fn
    ext = importlib.import_module('.' + ext_base_name, 'horovod')
  File "/home/xianyang/opt/miniconda3/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/xianyang/opt/miniconda3/lib/python3.7/site-packages/horovod/tensorflow/__init__.py", line 43, in <module>
    has_gpu = gpu_available('tensorflow')
  File "/home/xianyang/opt/miniconda3/lib/python3.7/site-packages/horovod/common/util.py", line 103, in gpu_available
    ext_base_name, available_fn, 'running with GPU', verbose) or False
  File "/home/xianyang/opt/miniconda3/lib/python3.7/site-packages/horovod/common/util.py", line 90, in _check_extension_lambda
    p.start()
  File "/home/xianyang/opt/miniconda3/lib/python3.7/multiprocessing/process.py", line 110, in start
    'daemonic processes are not allowed to have children'
AssertionError: daemonic processes are not allowed to have children
Extension tensorflow was NOT running with GPU.
This could be reproduced with:
horovod.common.util.gpu_available('tensorflow', True)
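One plausible reading of the traceback is that `_check_extension_lambda` calls `p.start()` while the process it is running in is itself daemonic, which Python's `multiprocessing` refuses. The sketch below reproduces that assertion with the standard library only, independent of Horovod. The helper names (`_noop`, `_daemon_body`, `daemon_child_error`) are hypothetical, and it assumes the `fork` start method is available (Linux or macOS) and that Python is not run with `-O`, which would strip the assertion.

```python
import multiprocessing as mp

# Assumption: the "fork" start method exists on this platform (Linux/macOS).
_CTX = mp.get_context("fork")


def _noop():
    """Trivial grandchild target; it is never expected to run in this demo."""


def _daemon_body(conn):
    """Runs inside a daemonic process and tries to start a child of its own."""
    try:
        grandchild = _CTX.Process(target=_noop)
        grandchild.start()  # expected to fail because this process is daemonic
        grandchild.join()
        conn.send("started")
    except AssertionError as exc:
        conn.send(str(exc))
    finally:
        conn.close()


def daemon_child_error(timeout=10.0):
    """Return the message a daemonic process reports when it spawns a child."""
    parent_conn, child_conn = _CTX.Pipe()
    proc = _CTX.Process(target=_daemon_body, args=(child_conn,), daemon=True)
    proc.start()
    child_conn.close()  # the parent only reads from its own end
    message = parent_conn.recv() if parent_conn.poll(timeout) else None
    proc.join(timeout)
    return message


if __name__ == "__main__":
    print(daemon_child_error())
```

If this is indeed the cause, the check would need to avoid spawning a process from a daemonic parent, for example by testing `multiprocessing.current_process().daemon` first; that is a suggestion, not something this PR is known to do.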
Add extra checks for x86 compiler flags and x86 AVX headers/functions Signed-off-by: Nicolas V Castet <nvcastet@us.ibm.com>
@eric-haibin-lin, we will share a perf study soon along with a user guide for AdaSum.
@jaliyae thanks! Looking forward to it.
Main contributors:
Olli Saarikivi (olsaarik)
Vadim Eksarevskiy (vaeksare)
Jaliya Ekanayake (jaliyae)
Todd Mytkowicz (klipto)
Saeed Maleki(saeedmaleki)
Sergii Dymchenko(kit1980)
Tianju Xu(Tixxx)