Add support for gradient_predivide_factor and averaging in Horovod backend. #1949

Merged

tgaddair merged 30 commits into horovod:master from romerojosh:gradient_predivide_pr

Aug 17, 2020

Commits on Aug 14, 2020

Add support for gradient_predivide_factor and averaging in Horovod backend.
```
Signed-off-by: Josh Romero <joshr@nvidia.com>
```
romerojosh committed Aug 14, 2020
Commit 69885d5

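For context, a minimal sketch of what a gradient predivide factor does to an averaged allreduce. It assumes the factor `f` splits the division by the number of workers into a prescale of `1/f` applied before the sum and a postscale of `f/size` applied after it. The function name and the pure-Python stand-in for the allreduce are hypothetical and are not taken from this PR's backend code.

```python
def simulated_average_allreduce(per_rank_grads, gradient_predivide_factor=1.0):
    """Average one gradient value per rank, splitting the division around the sum.

    per_rank_grads: list with one float per simulated worker (assumption:
    a real allreduce would operate on tensors, not Python floats).
    """
    size = len(per_rank_grads)

    # Assumed split: divide by f before the sum, by size / f after it.
    prescale_factor = 1.0 / gradient_predivide_factor
    postscale_factor = gradient_predivide_factor / size

    # Each rank scales its own contribution before the reduction.
    prescaled = [g * prescale_factor for g in per_rank_grads]

    # Stand-in for the sum allreduce across ranks.
    total = sum(prescaled)

    # The reduced value is scaled once more to complete the average.
    return total * postscale_factor
```

Under these assumptions the result equals the plain mean for any nonzero factor. Only the magnitude of the intermediate sum changes, which is the usual motivation for predividing when reducing in low precision.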
Add files to MANIFEST.in. Fix gloo only builds.
```
Signed-off-by: Josh Romero <joshr@nvidia.com>
```
romerojosh committed Aug 14, 2020
Commit 63f1025

Add missing root_rank arg in MXNet code. Change to default -1 instead of 0.
```
Signed-off-by: Josh Romero <joshr@nvidia.com>
```
romerojosh committed Aug 14, 2020
Commit b8b6548

Revert use of MPI_IN_PLACE in ccl_operations. Add horovod_cuda_lib to extension modules in setup.py.
```
Signed-off-by: Josh Romero <joshr@nvidia.com>
```
romerojosh committed Aug 14, 2020
Commit be8d15f

Compile half.cc for gloo builds.
```
Signed-off-by: Josh Romero <joshr@nvidia.com>
```
romerojosh committed Aug 14, 2020
Commit 3bf34f0

Fixes to torch v1 build. Build Horovod CUDA kernels based on framework support. Add cmake to ppc64le test environment.
```
Signed-off-by: Josh Romero <joshr@nvidia.com>
```
romerojosh committed Aug 14, 2020
Commit 394b167

Move postscale_factor modification for average into backend.
```
Signed-off-by: Josh Romero <joshr@nvidia.com>
```
romerojosh committed Aug 14, 2020
Commit 445cf63

Extend backend averaging to Adasum. Fix to ppc64le build.
```
Signed-off-by: Josh Romero <joshr@nvidia.com>
```
romerojosh committed Aug 14, 2020
Commit 1cf2b3f

Extend gradient_predivide_factor support to Keras.
```
Signed-off-by: Josh Romero <joshr@nvidia.com>
```
romerojosh committed Aug 14, 2020
Commit 0daea1a

Fix ppc64le build.
```
Signed-off-by: Josh Romero <joshr@nvidia.com>
```
romerojosh committed Aug 14, 2020
Commit 116e138

Add gradient_predivide_factor to examples. Fix torch optimizer.py.
```
Signed-off-by: Josh Romero <joshr@nvidia.com>
```
romerojosh committed Aug 14, 2020
Commit 788eef8

Raise exception if op != Average and gradient_predivide_factor is set. Addressing some other minor comments.
```
Signed-off-by: Josh Romero <joshr@nvidia.com>
```
romerojosh committed Aug 14, 2020
Commit 96b9aa7

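The check named in this commit title can be sketched as follows. This is only an illustration: the string constants stand in for the real reduction op values, the function name is hypothetical, and it assumes that "set" means any value other than a default of 1.0.

```python
# Hypothetical stand-ins for the reduction op constants.
AVERAGE = "Average"
SUM = "Sum"


def check_gradient_predivide_factor(op, gradient_predivide_factor=1.0):
    """Reject a non-default predivide factor unless the op is Average."""
    # Assumption: a factor of exactly 1.0 counts as "not set".
    if gradient_predivide_factor != 1.0 and op != AVERAGE:
        raise ValueError(
            "gradient_predivide_factor is only supported with op == Average"
        )
```

Raising early keeps a predivide factor from being silently ignored, or from rescaling a sum that the caller did not expect to be rescaled.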
Remove gradient_predivide_factor arg from some examples.
```
Signed-off-by: Josh Romero <joshr@nvidia.com>
```
romerojosh committed Aug 14, 2020
Commit 5217244

Document HOROVOD_BUILD_CUDA_CC_LIST env var.
```
Signed-off-by: Josh Romero <joshr@nvidia.com>
```
romerojosh committed Aug 14, 2020
Commit d641cd9

Cleanup/fixes after rebase.
```
Signed-off-by: Josh Romero <joshr@nvidia.com>
```
romerojosh committed Aug 14, 2020
Commit 346ee50

Update TF allreduce gradient to include pre/postscale factors.
```
Signed-off-by: Josh Romero <joshr@nvidia.com>
```
romerojosh committed Aug 14, 2020
Commit adbc290

Convert prescale and postscale factor args to scalar tensors to maintain double precision accuracy.
```
Signed-off-by: Josh Romero <joshr@nvidia.com>
```
romerojosh committed Aug 14, 2020
Commit c59c324

More robust testing of prescale and postscale factor behavior.
```
Signed-off-by: Josh Romero <joshr@nvidia.com>
```
romerojosh committed Aug 14, 2020
Commit e7f07c5

Fixes after rebase.
```
Signed-off-by: Josh Romero <joshr@nvidia.com>
```
romerojosh committed Aug 14, 2020
Commit cb1efdf

Use size_op() to compute postscale_factor.
```
Signed-off-by: Josh Romero <joshr@nvidia.com>
```
romerojosh committed Aug 14, 2020
Commit 369cf5c

Revert "Use size_op() to compute postscale_factor."
```
This reverts commit f938c63.

Signed-off-by: Josh Romero <joshr@nvidia.com>
```
romerojosh committed Aug 14, 2020
Commit db105fb

Revert "Convert prescale and postscale factor args to scalar tensors to maintain double precision accuracy."
```
This reverts commit bf74af2.

Signed-off-by: Josh Romero <joshr@nvidia.com>
```
romerojosh committed Aug 14, 2020
Commit 9ce7caa

Skip FP64 prescaling/postscaling tests for TensorFlow.
```
Signed-off-by: Josh Romero <joshr@nvidia.com>
```
romerojosh committed Aug 14, 2020
Commit b9437cd

Remove size() usage in Python when computing scaling factors for TF for elastic compatibility.
```
Signed-off-by: Josh Romero <joshr@nvidia.com>
```
romerojosh committed Aug 14, 2020
Commit 04c4d57

Fix __CUDA_ARCH__ usage so half2 specialized kernel is invoked on supported architectures. Invoke nvcc to obtain complete list of supported CCs to use for default compilation.
```
Signed-off-by: Josh Romero <joshr@nvidia.com>
```
romerojosh committed Aug 14, 2020
Commit 4442229

Fix up pre/postscale torch tests for torch 1.12 multiplication behavior.
```
Signed-off-by: Josh Romero <joshr@nvidia.com>
```
romerojosh committed Aug 14, 2020
Commit 4213b43

Update supported compute capability detection.
```
Signed-off-by: Josh Romero <joshr@nvidia.com>
```
romerojosh committed Aug 14, 2020
Commit 38363ce

Fix pre/postscaling tests for MXNet 1.4.
```
Signed-off-by: Josh Romero <joshr@nvidia.com>
```
romerojosh committed Aug 14, 2020
Commit 1247043

Update pre/postscale tests. Deal with HOROVOD_MIXED_INSTALL cases.
```
Signed-off-by: Josh Romero <joshr@nvidia.com>
```
romerojosh committed Aug 14, 2020
Commit 58af1c1

Fix pre/postscale test for PyTorch HOROVOD_MIXED_INSTALL case.
```
Signed-off-by: Josh Romero <joshr@nvidia.com>
```
romerojosh committed Aug 14, 2020
Commit f0bcf58
