
Add CUDA Allgather implementation to support tensor fusion #1480

Merged: 2 commits merged into horovod:master on Nov 1, 2019

Conversation

@jessebenson (Contributor) commented Oct 29, 2019

Allgather on CUDA tensors does not always work, in particular with tensor fusion.

  1. Build Horovod without any HOROVOD_GPU_* flags.

import torch
import horovod.torch as hvd

hvd.init()

# The following allgather succeeds because it runs on the CPU: the framework
# (PyTorch/TensorFlow/etc.) manually copies the CUDA tensor to the CPU with
# cudaMemcpy and runs the CPU allgather.
tensor = torch.cuda.FloatTensor(4).fill_(hvd.rank())
result = hvd.allgather(tensor)

# Tensor fusion works because the CUDA memory is manually copied to/from
# the CPU, and AllgatherOp uses std::memcpy on host memory for fusion.
tensor1 = torch.cuda.FloatTensor(4).fill_(hvd.rank())
tensor2 = torch.cuda.FloatTensor(4).fill_(hvd.rank())
handle1 = hvd.allgather_async(tensor1)
handle2 = hvd.allgather_async(tensor2)
result1 = hvd.synchronize(handle1)
result2 = hvd.synchronize(handle2)
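The working fused CPU path above can be sketched roughly as follows. This is a minimal, self-contained illustration of memcpy-style fusion (pack several tensors into one buffer, run a single collective, unpack), not Horovod's actual code; `fused_allgather_cpu` and the simulated collective are hypothetical names for this sketch.

```python
import numpy as np

def fused_allgather_cpu(tensors, world_size):
    # Pack all tensors into a single fusion buffer (the std::memcpy step).
    fusion_buffer = np.concatenate([t.ravel() for t in tensors])
    # Stand-in for the real collective: every rank contributes the same
    # buffer here, so "gathering" just repeats it world_size times.
    gathered = np.concatenate([fusion_buffer] * world_size)
    # Unpack: for each tensor, collect every rank's slice of the
    # fusion buffer and concatenate them into that tensor's output.
    outputs = []
    offset = 0
    for size in [t.size for t in tensors]:
        parts = [gathered[r * fusion_buffer.size + offset:
                          r * fusion_buffer.size + offset + size]
                 for r in range(world_size)]
        outputs.append(np.concatenate(parts))
        offset += size
    return outputs
```

With two ranks and two tensors, each output is that tensor's data repeated once per rank, matching allgather semantics.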
  2. Build OpenMPI with CUDA support. Build Horovod with HOROVOD_GPU_ALLREDUCE=MPI and HOROVOD_GPU_ALLGATHER=MPI.

import torch
import horovod.torch as hvd

hvd.init()

# The following allgather fails because CUDA has not been initialized yet.
tensor = torch.cuda.FloatTensor(4).fill_(hvd.rank())
result = hvd.allgather(tensor)

# The following allgather succeeds because MPI_CUDAAllreduce initializes
# CUDA first.
tensor = torch.cuda.FloatTensor(4).fill_(hvd.rank())
dummy = hvd.allreduce(tensor)
result = hvd.allgather(tensor)

# The following allgather with tensor fusion fails: it invokes AllgatherOp,
# which unconditionally calls std::memcpy on the pointers, even when they
# point to memory allocated on a CUDA device.
tensor1 = torch.cuda.FloatTensor(4).fill_(hvd.rank())
tensor2 = torch.cuda.FloatTensor(4).fill_(hvd.rank())
handle1 = hvd.allgather_async(tensor1)
handle2 = hvd.allgather_async(tensor2)
result1 = hvd.synchronize(handle1)
result2 = hvd.synchronize(handle2)
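The failure mode points at the fix: the fusion-buffer copies must dispatch on where the tensor's memory lives instead of unconditionally calling std::memcpy. A minimal Python sketch of that dispatch idea follows; `FakeTensor`, `pack_fusion_buffer`, and the copy stand-ins are hypothetical names for this illustration (the real fix is C++ code using CUDA memory copies):

```python
class FakeTensor:
    """Hypothetical stand-in for a framework tensor with a device tag."""
    def __init__(self, data, device):
        self.data = list(data)
        self.device = device  # "cpu" or "cuda"

copy_log = []

def host_memcpy(dst, dst_off, src):
    # Models std::memcpy on host memory.
    copy_log.append("std::memcpy")
    dst[dst_off:dst_off + len(src)] = src

def device_memcpy(dst, dst_off, src):
    # Models a CUDA memory copy; a real build would use cudaMemcpyAsync
    # on the op's stream.
    copy_log.append("cudaMemcpy")
    dst[dst_off:dst_off + len(src)] = src

def pack_fusion_buffer(tensors):
    # Device-aware packing: choose the copy routine per tensor, rather
    # than always using std::memcpy (which is what broke CUDA fusion).
    buf = [0.0] * sum(len(t.data) for t in tensors)
    offset = 0
    for t in tensors:
        copy = device_memcpy if t.device == "cuda" else host_memcpy
        copy(buf, offset, t.data)
        offset += len(t.data)
    return buf
```

Packing one "cuda" and one "cpu" tensor then uses one copy routine of each kind, while the buffer contents come out the same either way.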

@tgaddair (Collaborator): Very nice! Adding @abditag2, who worked on the original Allgather tensor fusion implementation.

@jessebenson (Contributor, Author): I am also preparing a pull request to add Reducescatter as a first-class operation type. There is a paper on a memory optimization for distributed training: instead of doing 'allreduce -> optimizer', it does 'reducescatter -> optimizer -> allgather'. But to do that, I need reducescatter in Horovod :)

https://arxiv.org/abs/1910.02054
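The pattern in that paper relies on the identity allreduce(x) = allgather(reducescatter(x)): each rank reduces and optimizes only a 1/world_size shard of the parameters, then the updated shards are allgathered. A small NumPy sketch of the data movement (ranks simulated in-process, no real communication; `simulate` and the toy optimizer are hypothetical):

```python
import numpy as np

def simulate(world_size, grads):
    # grads: one gradient vector per rank; length divisible by world_size.
    shard = len(grads[0]) // world_size
    reduced = np.sum(grads, axis=0)
    # reducescatter: rank r ends up with only the sum over shard r.
    shards = [reduced[r * shard:(r + 1) * shard] for r in range(world_size)]
    # Optimizer step runs per shard -- each rank updates only its
    # 1/world_size slice, which is where the memory saving comes from.
    updated = [s * 0.5 for s in shards]  # toy "optimizer": scale by 0.5
    # allgather: every rank reassembles the full updated vector.
    full = np.concatenate(updated)
    # Equivalent to allreduce followed by the same update on every rank.
    assert np.allclose(full, reduced * 0.5)
    return full
```

With two ranks each contributing a gradient of ones, the reduced value is 2 everywhere and the toy update halves it back to 1 on every element.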

Signed-off-by: Jesse Benson (AI) <jesseb@microsoft.com>
@jessebenson (Contributor, Author) commented Nov 1, 2019

I rebased on the latest master commit and added a CUDAOpContext class so that CUDAAllreduce and CUDAAllgather share the CUDA init logic.
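The shared-init idea can be sketched as a small context object that every CUDA op consults before touching the device, so the lazy-initialization logic lives in one place. A hypothetical Python sketch (the real CUDAOpContext is C++ inside Horovod; class and method names here are illustrative only):

```python
class CudaOpContext:
    """Hypothetical sketch of shared lazy CUDA init for collective ops."""
    def __init__(self, init_fn):
        self._init_fn = init_fn       # e.g. sets the device, creates streams
        self._initialized = set()     # device ids already initialized

    def ensure_initialized(self, device):
        # Both the allreduce and allgather ops would call this before
        # running, so neither needs its own copy of the init logic.
        if device not in self._initialized:
            self._init_fn(device)
            self._initialized.add(device)

calls = []
ctx = CudaOpContext(calls.append)
ctx.ensure_initialized(0)   # first op on device 0 triggers init
ctx.ensure_initialized(0)   # second op reuses the existing init
```

After both calls, the init function has run exactly once for device 0.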

@tgaddair (Collaborator) left a comment:


LGTM! Nice work!

@tgaddair tgaddair merged commit 2fa7cc7 into horovod:master Nov 1, 2019
@jessebenson jessebenson deleted the cuda-allgather branch November 1, 2019 23:59
jeffdaily pushed a commit to ROCm/horovod that referenced this pull request Nov 27, 2019
DelphianCalamity pushed a commit to DelphianCalamity/horovod that referenced this pull request Apr 18, 2020