
Adasum Full GPU Ring-based Allreduce #1760

Closed · wants to merge 2 commits

Conversation

@vaeksare commented Mar 2, 2020

This PR adds a new Adasum op that is performed entirely on the GPU within a node. It mimics the NCCL ring allreduce using a custom algorithm built on CUDA-aware MPI send/receive primitives. On machines that can support it, it allows much higher throughput than the other Adasum operations.

Note that the current version only works on DGX1-like machines with 8 GPUs. It could be extended to other configurations in the future, but this is the main use case where the two currently existing Adasum modes (CPU and Hierarchical) both have large drawbacks.

To use it, Horovod should be compiled with HOROVOD_GPU_ALLREDUCE=MPI and op=hvd.Adasum should be passed to the optimizer/allreduce calls. A CUDA-aware MPI implementation is required (we used Open MPI with UCX).

@nvcastet (Collaborator) commented Mar 4, 2020

@vaeksare Thanks for the PR!
What limits this new op to DGX1-like machines?
Could you share some performance numbers comparing the Adasum ops with each other and with a regular NCCL allreduce op?

@vaeksare (Author) commented Mar 5, 2020

@nvcastet thanks for taking a look!
The ring building algorithm using for the Allreduce is specialized for a DGX1 in terms of number of GPUs per ring and number of rings used. It will always build 4 rings currently, 2 "fat" rings and 2 "skinny" rings with half the capacity. This is only ideal on an 8-GPU hybrid cube-mesh interconnection that a DGX1 has, in which each GPU is connected via NVLink to 4 other GPUs, and some of the NVLinks have double the effective bandwidth of others. Additionally, the rings are hardcoded to have 8 GPUs in each. In order to support varying configurations, we would need to to have a new algorithm to dynamically build the rings based on the exact configuration. Right now, the rings are built dynamically but there are always 4 of them and it's assumed the NVLink configuration follow the restrains mentioned above.

In terms of rough performance numbers, running pytorch_synthetic_benchmark.py, the throughput results are approximately the following (all tested on a single DGX1):

NCCL Allreduce: 310 images/sec/gpu (all averaging)
Ring-based Adasum: 290 images/sec/gpu (all Adasum operation)
Hierarchical "Adasum": 310 images/sec/gpu (note that because this is single node, this will actually just run NCCL Averaging since it only does Adasum cross-node. So this carries none of the Adasum convergence benefits)
CPU Adasum: 100 images/sec/gpu (all Adasum operation. This is so slow because it's not able to utilize GPU P2P communication through NVLinks).

@vaeksare (Author) commented Mar 5, 2020

It also seems like the CI builds are currently failing with a rather strange error (not being able to find cmake, I think?). Do you happen to have any insights into why that could be the case? It builds fine locally for me.

@nvcastet (Collaborator) commented Mar 9, 2020

@vaeksare You may want to rebase your branch. We fixed some CI build issues recently.
@vaeksare All the logic to calculate how many rings to use to best utilize the hardware topology is already nicely done inside NCCL. From what I understood, you would need ncclAllReduce to support a custom kernel for the reduction op, instead of just avg/sum/min/max, to implement your algorithm?
@sjeaugey @romerojosh Do you know if there is a way to customize the reduce op of a ncclAllReduce?

@sjeaugey commented Mar 9, 2020

The reduction operation is at the very core of the NCCL CUDA kernels, and we generate kernels for each operation/datatype, because we need to properly unroll and merge it with the rest of the algorithm. We looked at ways to use user-defined operations but it was either hard to use or very slow.

As far as I understand, the goal here is to scale values down as we sum, to avoid reaching the limits of the format, and end up with the average instead of the sum. Is that right?

@vaeksare (Author) commented Mar 9, 2020

@sjeaugey @nvcastet Scaling down is not quite accurate; what we do is actually quite a bit more complex. Instead of summing the two vectors, they are effectively projected onto each other (through a dot product and norm calculations). This interpolates between a sum and an average depending on how parallel or orthogonal the vectors are. For more details, see #1485.
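
For reference, the two-vector reduction is roughly the following (notation mine; see #1485 for the authoritative definition):

$$
\mathrm{Adasum}(g_1, g_2) = \left(1 - \frac{g_1 \cdot g_2}{2\,\lVert g_1 \rVert^2}\right) g_1 + \left(1 - \frac{g_1 \cdot g_2}{2\,\lVert g_2 \rVert^2}\right) g_2
$$

so orthogonal gradients are summed and identical gradients are averaged, matching the interpolation described above.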

The big issue with this is that the operation is not associative. We have looked quite closely at NCCL internals before and had some discussions, and we don't believe there is any reasonable way to implement this with existing NCCL operations: the way the Allreduce is coded in NCCL, it cannot support a non-associative custom op without losing all of its performance.

It is definitely true that we could write ring formation based on how NCCL does it, but I believe this would still require largely rewriting that portion of NCCL inside our Adasum ops, which was outside the scope of this immediate work.

I will rebase shortly for the CI.

@romerojosh (Collaborator) left a comment

Thanks for the PR @vaeksare! I just took a first rough pass on the code and left a few comments to start out.


find_package(CUDA)

list(APPEND CUDA_NVCC_FLAGS "--compiler-options -fPIC -D_FORCE_INLINES -arch=sm_60")
Collaborator:

Can we generalize these flags to build for more than just sm_60? Might be good to expose this also as a build option so a user can specify their own build flags.


template<typename T, typename TACC>
__global__
void CudaDotProductKernel(int count, const T* a, const T* b, TACC* out) {
Collaborator:

Would it be possible to use cuBLAS functions for this dot product operation, like cublasDotEx or, similarly, cublasNrm2Ex?
https://docs.nvidia.com/cuda/cublas/index.html#cublas-dotEx
https://docs.nvidia.com/cuda/cublas/index.html#cublas-nrm2Ex

The use of atomicAdd in this kernel will introduce non-determinism (which may be a concern), while these APIs are deterministic. There are also Thrust APIs for reductions and inner products, but the docs suggest those are also non-deterministic.
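
For reference, a minimal sketch (not the PR's code) of what that might look like, assuming float buffers and host-side results:

```cpp
#include <cublas_v2.h>

// Sketch only: computes the dot product and squared norms the Adasum
// reduction needs, using deterministic cuBLAS routines. Error checking is
// omitted; fp16 inputs would need the corresponding cudaDataType values.
void AdasumDotAndNorms(cublasHandle_t handle, int n,
                       const float* d_a, const float* d_b,
                       float* dot, float* norm_a_sq, float* norm_b_sq) {
  cublasDotEx(handle, n,
              d_a, CUDA_R_32F, 1,
              d_b, CUDA_R_32F, 1,
              dot, CUDA_R_32F, CUDA_R_32F);
  float norm_a = 0.f, norm_b = 0.f;
  cublasNrm2Ex(handle, n, d_a, CUDA_R_32F, 1, &norm_a, CUDA_R_32F, CUDA_R_32F);
  cublasNrm2Ex(handle, n, d_b, CUDA_R_32F, 1, &norm_b, CUDA_R_32F, CUDA_R_32F);
  *norm_a_sq = norm_a * norm_a;
  *norm_b_sq = norm_b * norm_b;
}
```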

Author:

I tried using the cuBLAS operations for this previously, but found that the performance was actually a lot lower than with the handwritten kernel (~20-25% slower). I don't believe it's worth the trade-off.

Collaborator:

I'd expect using individual cuBLAS calls to be slower than the fused approach you have, but the benefit is determinism. Is the 20-25% slowdown end-to-end, or just from timing this particular set of operations? @tgaddair, do you have any comments on this regarding deterministic operations?

Collaborator:

In general, my preference is to prefer performance over determinism by default, but provide a means for users to enforce stricter determinism guarantees if desired. So we could toggle between the fused and unfused ops with an environment variable or horovodrun arg, for example. I feel this would be consistent with the way we handle similar cases (e.g., tensor fusion). Does that sound reasonable to you all?
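
For example, something along these lines (HOROVOD_ADASUM_DETERMINISTIC is a hypothetical name here, just to illustrate the toggle):

```cpp
#include <cstdlib>
#include <cstring>

// Sketch only: choose the deterministic (unfused) path when the hypothetical
// HOROVOD_ADASUM_DETERMINISTIC variable is set to "1"; otherwise use the
// faster fused kernels that rely on atomicAdd.
static bool UseDeterministicAdasumKernels() {
  const char* value = std::getenv("HOROVOD_ADASUM_DETERMINISTIC");
  return value != nullptr && std::strcmp(value, "1") == 0;
}
```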

Collaborator:

I think that sounds reasonable. A second deterministic version of this fused kernel approach can be written. Instead of atomic adds, the first kernel should write intermediate block sums to a temporary array. Then a second kernel can be launched to sum the temporary results in the array.
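
A rough sketch of that two-pass approach (not the PR's code; with a fixed launch configuration the summation order, and hence the result, is the same on every run):

```cuda
// Sketch only: deterministic dot product in two kernel launches.
// Pass 1 writes one partial sum per block; pass 2 reduces the partials in a
// single block, so no atomicAdd is needed and the summation order is fixed.
template <typename T, typename TACC, int kThreads = 256>
__global__ void BlockDotKernel(int count, const T* a, const T* b,
                               TACC* block_sums) {
  __shared__ TACC shared[kThreads];
  TACC sum = TACC(0);
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < count;
       i += gridDim.x * blockDim.x) {
    sum += static_cast<TACC>(a[i]) * static_cast<TACC>(b[i]);
  }
  shared[threadIdx.x] = sum;
  __syncthreads();
  for (int stride = kThreads / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride) shared[threadIdx.x] += shared[threadIdx.x + stride];
    __syncthreads();
  }
  if (threadIdx.x == 0) block_sums[blockIdx.x] = shared[0];
}

template <typename TACC, int kThreads = 256>
__global__ void FinalSumKernel(int num_blocks, const TACC* block_sums, TACC* out) {
  __shared__ TACC shared[kThreads];
  TACC sum = TACC(0);
  for (int i = threadIdx.x; i < num_blocks; i += kThreads) sum += block_sums[i];
  shared[threadIdx.x] = sum;
  __syncthreads();
  for (int stride = kThreads / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride) shared[threadIdx.x] += shared[threadIdx.x + stride];
    __syncthreads();
  }
  if (threadIdx.x == 0) *out = shared[0];
}

// Launch, e.g.:
//   BlockDotKernel<float, double><<<num_blocks, 256, 0, stream>>>(count, a, b, block_sums);
//   FinalSumKernel<double><<<1, 256, 0, stream>>>(num_blocks, block_sums, out);
```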

Author:

I actually had this implemented exactly as you described originally (I think the part below is a leftover from me forgetting to remove all of that code). The 20-25% slowdown I observed with it was on the PyTorch synthetic benchmark example (ResNet-50 with batch size 32), so it was a significant slowdown in an end-to-end scenario, albeit a very communication-bound one. I am not sure how significant a concern this non-determinism is, but I could add back the deterministic path. I mostly removed it because I think the slowdown is too much for any real use case, and it adds a rather large amount of code.

horovod/common/ops/cuda/adasum_cuda_kernels.cu (outdated)
warnings.warn('Adasum reduction does not currently support GPU reduction using MPI. Tensors '
'are copied to CPU memory instead. To use Adasum for GPU reduction, please '
'compile Horovod with HOROVOD_GPU_ALLREDUCE=NCCL.')
if horovod_local_size != 8:
Collaborator:

Is there a reason this is limited to exactly 8 GPUs? It seems like most of the inter-GPU communication is carried out via MPI_Isend/MPI_Irecv and thus should generalize to any number of GPUs. I think there would be lower performance if the GPUs aren't P2P connected, but there isn't a guarantee that all 8-GPU systems have a DGX1-like topology.

Author:

No real good reason. The primary motivation was that this is only intended to be used on DGX1-like architectures, and checking the number of GPUs is much simpler than checking for the exact architecture. So this seemed better than no check at all, while being much cleaner than adding convoluted checks for the exact topology.

Collaborator:

I'd suggest having a default (but maybe slower) path that works on any number of GPUs. You could print a warning stating that the current design is optimized for DGX1-like topologies with 8 GPUs.

horovod/common/ops/cuda/adasum_cuda_kernels.cu (outdated)

template<typename T, typename TACC>
__global__
void CudaScaleAddKernel(int count, T* a, const T* b, TACC a_coeff, TACC b_coeff) {
Collaborator:

Similar to the above, could this operation be expressed with Thrust using thrust::transform? https://thrust.github.io/doc/group__transformations_ga68a3ba7d332887f1332ca3bc04453792.html#ga68a3ba7d332887f1332ca3bc04453792
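
For reference, a minimal sketch (not the PR's code) of the same scale-and-add expressed with thrust::transform:

```cpp
#include <thrust/device_ptr.h>
#include <thrust/transform.h>

// Sketch only: computes a[i] = a_coeff * a[i] + b_coeff * b[i], mirroring
// what a CudaScaleAddKernel would do, via thrust::transform.
template <typename T, typename TACC>
struct ScaleAdd {
  TACC a_coeff, b_coeff;
  __host__ __device__ T operator()(const T& a, const T& b) const {
    return static_cast<T>(a_coeff * static_cast<TACC>(a) +
                          b_coeff * static_cast<TACC>(b));
  }
};

template <typename T, typename TACC>
void ScaleAddThrust(int count, T* a, const T* b, TACC a_coeff, TACC b_coeff) {
  thrust::device_ptr<T> a_ptr(a);
  thrust::device_ptr<const T> b_ptr(b);
  // In-place: the output is written back into a. In practice this would run
  // on the op's CUDA stream via an execution policy.
  thrust::transform(a_ptr, a_ptr + count, b_ptr, a_ptr,
                    ScaleAdd<T, TACC>{a_coeff, b_coeff});
}
```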

Collaborator:

I looked again and I do not think this kernel is used either, so it should be removed.


template<typename T, typename TACC>
__global__
void CudaSingleAdasumKernel(int count, T* a, const T* b, TACC* out) {
@vaeksare (Author) commented:
Hi @romerojosh, thanks for the initial feedback! Regarding using cuBLAS and Thrust: I experimented with them but found the performance to be generally worse than writing the operations directly. And given that the operations don't add too much code, I think it's worth implementing them manually.
Seems like rebasing did not fix the build error. Let me investigate further.

stale bot commented Nov 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label on Nov 6, 2020
stale bot closed this on Nov 13, 2020
@liuyunfeng2016 commented:

@ashbhandare @tgaddair @sblotner @vaeksare @romerojosh @sjeaugey
Hello, I found that hvd.AdaSum is incorrect in Horovod 0.21.0; it should be hvd.Adasum.
I also have some questions about using Adasum: how do I choose between the three modes (pure CPU, Ring, and Hierarchical)? I found HOROVOD_HIERARCHICAL_ALLREDUCE in the paper.
My environment is 8 Tesla V100s per node. How can I make the most of its capability? I found that performance deteriorates by 10x when Adasum is used.
The following are my build parameters:

ENV HOROVOD_WITHOUT_GLOO=1
ENV HOROVOD_GPU_OPERATIONS='NCCL'
ENV HOROVOD_CPU_OPERATIONS='MPI'
ENV HOROVOD_WITH_PYTORCH=1
ENV HOROVOD_NCCL_INCLUDE=/usr/include
ENV HOROVOD_NCCL_LIB=/usr/lib/x86_64-linux-gnu
ENV HOROVOD_MPICXX_SHOW="/usr/local/openmpi/bin/mpicxx -show"
ENV HOROVOD_WITH_TENSORFLOW=1
ENV HOROVOD_CUDA_HOME="/usr/local/cuda"
ENV TENSORFLOW_VERSION=1.15
