Branch: master
Commits on Sep 10, 2019
  1. Fix error check when initializing Horovod with existing mpi4py communicator. (#1391)

    romerojosh authored and tgaddair committed Sep 10, 2019
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>
Commits on Jun 15, 2019
  1. Adding support for multiple CUDA streams for NCCL operations. (#1128)

    romerojosh authored and alsrgv committed Jun 15, 2019
    * Adding support for multiple CUDA streams for NCCL operations.
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>
    
    * Fix compilation without CUDA or NCCL enabled.
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>
    
    * Updating variable names.
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>
Commits on Apr 11, 2019
  1. Enable faster worker coordination via response caching. (#902)

    romerojosh authored and alsrgv committed Apr 11, 2019
    * Enable faster worker coordination via response caching.
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>
    
    * Adding cache invalidation mechanism to deal with name conflicts.
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>
    
    * Making response order consistent between bypass and non-bypass paths.
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>
    
    * Lock around tensor_table access for autotuner in bypass path. Moving stall check before potential quick return.
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>
    
    * Removing status bit logic from ResponseCache class and applying only in GetCommonCacheAndState function for greater clarity.
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>
    
    * Cleanup after rebase.
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>
    
    * Moving stall check before call to GetCommonCacheAndState.
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>
    
    * Addressing some review comments.
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>
    
    * Adding some methods and additional comments to ResponseCache class.
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>
    
    * Adding cache capacity to parameter manager.
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>
    
    * Addressing some additional comments.
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>
    
    * Adding CacheCoordinator class to encapsulate bit vector communication process.
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>
    
    * Removing unused methods from ResponseCache.
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>
    
    * Adding more comments to ResponseCache class.
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>
    
    * Fixes to should_shut_down handling.
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>
    
    * Consolidating some duplicated code. Added some more comments around cache operations in operations.cc.
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>
    
    * Changing cache capacity treatment in autotuner to a boolean categorical parameter.
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>
    
    * Implemented logic to issue stall warnings for cached messages.
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>
    
    * Addressing comment.
    
    Signed-off-by: Josh Romero <joshr@nvidia.com>