
Add CUDA Allgather implementation to support tensor fusion #1480

Merged: 2 commits merged into horovod:master on Nov 1, 2019

Conversation

@jessebenson (Contributor) commented Oct 29, 2019

Allgather on CUDA tensors does not always work, in particular with tensor fusion.

  1. Build Horovod without any HOROVOD_GPU_* flags.

import torch
import horovod.torch as hvd

hvd.init()

# The following allgather succeeds because it runs on the CPU: the framework
# (PyTorch/TensorFlow/etc.) manually copies the CUDA tensor to the CPU with
# cudaMemcpy and runs the CPU allgather.
tensor = torch.cuda.FloatTensor(4).fill_(hvd.rank())
result = hvd.allgather(tensor)

# Tensor fusion works because the CUDA memory is manually copied to/from
# the CPU, and AllgatherOp uses std::memcpy on host memory for fusion.
tensor1 = torch.cuda.FloatTensor(4).fill_(hvd.rank())
tensor2 = torch.cuda.FloatTensor(4).fill_(hvd.rank())
handle1 = hvd.allgather_async(tensor1)
handle2 = hvd.allgather_async(tensor2)
result1 = hvd.synchronize(handle1)
result2 = hvd.synchronize(handle2)
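The working fused CPU path above can be sketched roughly as follows. This is a minimal, self-contained illustration of memcpy-style fusion (pack several tensors into one buffer, run a single collective, unpack), not Horovod's actual code; `fused_allgather_cpu` and the simulated collective are hypothetical names for this sketch.

```python
import numpy as np

def fused_allgather_cpu(tensors, world_size):
    # Pack all tensors into a single fusion buffer (the std::memcpy step).
    fusion_buffer = np.concatenate([t.ravel() for t in tensors])
    # Stand-in for the real collective: every rank contributes the same
    # buffer here, so "gathering" just repeats it world_size times.
    gathered = np.concatenate([fusion_buffer] * world_size)
    # Unpack: for each tensor, collect every rank's slice of the
    # fusion buffer and concatenate them into that tensor's output.
    outputs = []
    offset = 0
    for size in [t.size for t in tensors]:
        parts = [gathered[r * fusion_buffer.size + offset:
                          r * fusion_buffer.size + offset + size]
                 for r in range(world_size)]
        outputs.append(np.concatenate(parts))
        offset += size
    return outputs
```

With two ranks and two tensors, each output is that tensor's data repeated once per rank, matching allgather semantics.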
  2. Build OpenMPI with CUDA support. Build Horovod with HOROVOD_GPU_ALLREDUCE=MPI and HOROVOD_GPU_ALLGATHER=MPI.

import torch
import horovod.torch as hvd

hvd.init()

# The following allgather fails because CUDA has not been initialized yet.
tensor = torch.cuda.FloatTensor(4).fill_(hvd.rank())
result = hvd.allgather(tensor)

# The following allgather succeeds because MPI_CUDAAllreduce initializes
# CUDA first.
tensor = torch.cuda.FloatTensor(4).fill_(hvd.rank())
dummy = hvd.allreduce(tensor)
result = hvd.allgather(tensor)

# The following allgather with tensor fusion fails: it invokes AllgatherOp,
# which unconditionally calls std::memcpy on the pointers, even when they
# point to memory allocated on a CUDA device.
tensor1 = torch.cuda.FloatTensor(4).fill_(hvd.rank())
tensor2 = torch.cuda.FloatTensor(4).fill_(hvd.rank())
handle1 = hvd.allgather_async(tensor1)
handle2 = hvd.allgather_async(tensor2)
result1 = hvd.synchronize(handle1)
result2 = hvd.synchronize(handle2)
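The failure mode points at the fix: the fusion-buffer copies must dispatch on where the tensor's memory lives instead of unconditionally calling std::memcpy. A minimal Python sketch of that dispatch idea follows; `FakeTensor`, `pack_fusion_buffer`, and the copy stand-ins are hypothetical names for this illustration (the real fix is C++ code using CUDA memory copies):

```python
class FakeTensor:
    """Hypothetical stand-in for a framework tensor with a device tag."""
    def __init__(self, data, device):
        self.data = list(data)
        self.device = device  # "cpu" or "cuda"

copy_log = []

def host_memcpy(dst, dst_off, src):
    # Models std::memcpy on host memory.
    copy_log.append("std::memcpy")
    dst[dst_off:dst_off + len(src)] = src

def device_memcpy(dst, dst_off, src):
    # Models a CUDA memory copy; a real build would use cudaMemcpyAsync
    # on the op's stream.
    copy_log.append("cudaMemcpy")
    dst[dst_off:dst_off + len(src)] = src

def pack_fusion_buffer(tensors):
    # Device-aware packing: choose the copy routine per tensor, rather
    # than always using std::memcpy (which is what broke CUDA fusion).
    buf = [0.0] * sum(len(t.data) for t in tensors)
    offset = 0
    for t in tensors:
        copy = device_memcpy if t.device == "cuda" else host_memcpy
        copy(buf, offset, t.data)
        offset += len(t.data)
    return buf
```

Packing one "cuda" and one "cpu" tensor then uses one copy routine of each kind, while the buffer contents come out the same either way.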

@tgaddair (Collaborator): Very nice! Adding @abditag2, who worked on the original Allgather tensor fusion implementation.

@jessebenson (Contributor, Author): I am also preparing a pull request to add Reducescatter as a first-class operation type. There is a paper on a memory optimization for distributed training: instead of doing 'allreduce -> optimizer', it does 'reducescatter -> optimizer -> allgather'. But to do that, I need reducescatter in Horovod :)

https://arxiv.org/abs/1910.02054
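The pattern in that paper relies on the identity allreduce(x) = allgather(reducescatter(x)): each rank reduces and optimizes only a 1/world_size shard of the parameters, then the updated shards are allgathered. A small NumPy sketch of the data movement (ranks simulated in-process, no real communication; `simulate` and the toy optimizer are hypothetical):

```python
import numpy as np

def simulate(world_size, grads):
    # grads: one gradient vector per rank; length divisible by world_size.
    shard = len(grads[0]) // world_size
    reduced = np.sum(grads, axis=0)
    # reducescatter: rank r ends up with only the sum over shard r.
    shards = [reduced[r * shard:(r + 1) * shard] for r in range(world_size)]
    # Optimizer step runs per shard -- each rank updates only its
    # 1/world_size slice, which is where the memory saving comes from.
    updated = [s * 0.5 for s in shards]  # toy "optimizer": scale by 0.5
    # allgather: every rank reassembles the full updated vector.
    full = np.concatenate(updated)
    # Equivalent to allreduce followed by the same update on every rank.
    assert np.allclose(full, reduced * 0.5)
    return full
```

With two ranks each contributing a gradient of ones, the reduced value is 2 everywhere and the toy update halves it back to 1 on every element.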

Signed-off-by: Jesse Benson (AI) <jesseb@microsoft.com>
@jessebenson (Contributor, Author) commented Nov 1, 2019

I rebased on the latest master commit and added a CUDAOpContext class so that CUDAAllreduce and CUDAAllgather share the CUDA init logic.
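The shared-init idea can be sketched as a small context object that every CUDA op consults before touching the device, so the lazy-initialization logic lives in one place. A hypothetical Python sketch (the real CUDAOpContext is C++ inside Horovod; class and method names here are illustrative only):

```python
class CudaOpContext:
    """Hypothetical sketch of shared lazy CUDA init for collective ops."""
    def __init__(self, init_fn):
        self._init_fn = init_fn       # e.g. sets the device, creates streams
        self._initialized = set()     # device ids already initialized

    def ensure_initialized(self, device):
        # Both the allreduce and allgather ops would call this before
        # running, so neither needs its own copy of the init logic.
        if device not in self._initialized:
            self._init_fn(device)
            self._initialized.add(device)

calls = []
ctx = CudaOpContext(calls.append)
ctx.ensure_initialized(0)   # first op on device 0 triggers init
ctx.ensure_initialized(0)   # second op reuses the existing init
```

After both calls, the init function has run exactly once for device 0.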

@tgaddair (Collaborator) left a comment:


LGTM! Nice work!

@tgaddair tgaddair merged commit 2fa7cc7 into horovod:master Nov 1, 2019
@jessebenson jessebenson deleted the cuda-allgather branch November 1, 2019 23:59
jeffdaily pushed a commit to ROCm/horovod that referenced this pull request Nov 27, 2019
DelphianCalamity pushed a commit to DelphianCalamity/horovod that referenced this pull request Apr 18, 2020