Add support for gradient_predivide_factor and averaging in Horovod backend. #1949

Merged 30 commits on Aug 17, 2020

Commits on Aug 14, 2020

  1. Add support for gradient_predivide_factor and averaging in Horovod backend.
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>
    romerojosh committed Aug 14, 2020
    69885d5
  2. Add files to MANIFEST.in. Fix gloo only builds.

    Signed-off-by: Josh Romero <joshr@nvidia.com>
    romerojosh committed Aug 14, 2020
    63f1025
  3. Add missing root_rank arg in MXNet code. Change to default -1 instead of 0.
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>
    romerojosh committed Aug 14, 2020
    b8b6548
  4. Revert use of MPI_IN_PLACE in ccl_operations. Add horovod_cuda_lib to extension modules in setup.py.
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>
    romerojosh committed Aug 14, 2020
    be8d15f
  5. Compile half.cc for gloo builds.

    Signed-off-by: Josh Romero <joshr@nvidia.com>
    romerojosh committed Aug 14, 2020
    3bf34f0
  6. Fixes to torch v1 build. Build Horovod CUDA kernels based on framework support. Add cmake to ppc64le test environment.
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>
    romerojosh committed Aug 14, 2020
    394b167
  7. Move postscale_factor modification for average into backend.

    Signed-off-by: Josh Romero <joshr@nvidia.com>
    romerojosh committed Aug 14, 2020
    445cf63
  8. Extend backend averaging to Adasum. Fix to ppc64le build.

    Signed-off-by: Josh Romero <joshr@nvidia.com>
    romerojosh committed Aug 14, 2020
    1cf2b3f
  9. Extend gradient_predivide_factor support to Keras.

    Signed-off-by: Josh Romero <joshr@nvidia.com>
    romerojosh committed Aug 14, 2020
    0daea1a
  10. Fix ppc64le build.

    Signed-off-by: Josh Romero <joshr@nvidia.com>
    romerojosh committed Aug 14, 2020
    116e138
  11. Add gradient_predivide_factor to examples. Fix torch optimizer.py.

    Signed-off-by: Josh Romero <joshr@nvidia.com>
    romerojosh committed Aug 14, 2020
    788eef8
  12. Raise exception if op != Average and gradient_predivide_factor is set. Addressing some other minor comments.
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>
    romerojosh committed Aug 14, 2020
    96b9aa7
  13. Remove gradient_predivide_factor arg from some examples.

    Signed-off-by: Josh Romero <joshr@nvidia.com>
    romerojosh committed Aug 14, 2020
    5217244
  14. Document HOROVOD_BUILD_CUDA_CC_LIST env var.

    Signed-off-by: Josh Romero <joshr@nvidia.com>
    romerojosh committed Aug 14, 2020
    d641cd9
  15. Cleanup/fixes after rebase.

    Signed-off-by: Josh Romero <joshr@nvidia.com>
    romerojosh committed Aug 14, 2020
    346ee50
  16. Update TF allreduce gradient to include pre/postscale factors.

    Signed-off-by: Josh Romero <joshr@nvidia.com>
    romerojosh committed Aug 14, 2020
    adbc290
  17. Convert prescale and postscale factor args to scalar tensors to maintain double precision accuracy.
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>
    romerojosh committed Aug 14, 2020
    c59c324
  18. More robust testing of prescale and postscale factor behavior.

    Signed-off-by: Josh Romero <joshr@nvidia.com>
    romerojosh committed Aug 14, 2020
    e7f07c5
  19. Fixes after rebase.

    Signed-off-by: Josh Romero <joshr@nvidia.com>
    romerojosh committed Aug 14, 2020
    cb1efdf
  20. Use size_op() to compute postscale_factor.

    Signed-off-by: Josh Romero <joshr@nvidia.com>
    romerojosh committed Aug 14, 2020
    369cf5c
  21. Revert "Use size_op() to compute postscale_factor."

    This reverts commit f938c63.
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>
    romerojosh committed Aug 14, 2020
    db105fb
  22. Revert "Convert prescale and postscale factor args to scalar tensors to maintain double precision accuracy."
    
    This reverts commit bf74af2.
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>
    romerojosh committed Aug 14, 2020
    9ce7caa
  23. Skip FP64 prescaling/postscaling tests for TensorFlow.

    Signed-off-by: Josh Romero <joshr@nvidia.com>
    romerojosh committed Aug 14, 2020
    b9437cd
  24. Remove size() usage in Python when computing scaling factors for TF for elastic compatibility.
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>
    romerojosh committed Aug 14, 2020
    04c4d57
  25. Fix __CUDA_ARCH__ usage so half2 specialized kernel is invoked on supported architectures. Invoke nvcc to obtain complete list of supported CCs to use for default compilation.
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>
    romerojosh committed Aug 14, 2020
    4442229
  26. Fix up pre/postscale torch tests for torch 1.12 multiplication behavior.

    Signed-off-by: Josh Romero <joshr@nvidia.com>
    romerojosh committed Aug 14, 2020
    4213b43
  27. Update supported compute capability detection.

    Signed-off-by: Josh Romero <joshr@nvidia.com>
    romerojosh committed Aug 14, 2020
    38363ce
  28. Fix pre/postscaling tests for MXNet 1.4.

    Signed-off-by: Josh Romero <joshr@nvidia.com>
    romerojosh committed Aug 14, 2020
    1247043
  29. Update pre/postscale tests. Deal with HOROVOD_MIXED_INSTALL cases.

    Signed-off-by: Josh Romero <joshr@nvidia.com>
    romerojosh committed Aug 14, 2020
    58af1c1
  30. Fix pre/postscale test for PyTorch HOROVOD_MIXED_INSTALL case.

    Signed-off-by: Josh Romero <joshr@nvidia.com>
    romerojosh committed Aug 14, 2020
    f0bcf58
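
For readers unfamiliar with the feature named in the PR title, here is a minimal sketch of the arithmetic, not the PR's code. It assumes the usual meaning of gradient_predivide_factor: when averaging, each gradient is scaled by 1.0 / gradient_predivide_factor before the sum and by gradient_predivide_factor / size after it. The function name `simulated_allreduce`, the string-valued `op` argument, and the list standing in for per-worker gradients are all hypothetical; no Horovod API is called.

```python
def simulated_allreduce(per_worker_grads, op="average", gradient_predivide_factor=1.0):
    """Simulate a sum-based allreduce with averaging split around the sum.

    per_worker_grads is a plain list with one gradient value per worker
    (a stand-in for real tensors on real workers).
    """
    # Mirrors the check described in commit 12: a predivide factor only
    # makes sense when the reduction is an average.
    if op != "average" and gradient_predivide_factor != 1.0:
        raise ValueError("gradient_predivide_factor requires op == 'average'")

    size = len(per_worker_grads)
    if op == "average":
        prescale_factor = 1.0 / gradient_predivide_factor
        postscale_factor = gradient_predivide_factor / size
    else:  # plain sum, no scaling
        prescale_factor = 1.0
        postscale_factor = 1.0

    # Scale each contribution before the sum, then scale the summed result.
    summed = sum(g * prescale_factor for g in per_worker_grads)
    return summed * postscale_factor


grads = [1.0, 2.0, 3.0, 6.0]
plain = simulated_allreduce(grads)
split = simulated_allreduce(grads, gradient_predivide_factor=2.0)
print(plain, split)
```

Both calls give the same average; the factor only changes how much of the division happens before the sum, which can help keep intermediate values in range for low-precision types.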