Adding CudaMallocSync support when using CUDA version >= 11.2 #4026
Conversation
…e to compile using cudaMallocAsync and cudaFreeAsync with an immediate cudaDeviceSynchronize if CUDART_VERSION is at least 11.2, or falls back to normal cudaMalloc and cudaFree on older versions
Can one of the admins verify this patch?
OK to test
Pushed the clang-format fix.
Co-authored-by: Damien L-G <dalg24+github@gmail.com>
Adding #error to impl deallocate
Adding a call to cudaDeviceSynchronize before cudaFreeAsync to assure users that cudaFreeAsync does not introduce any unintended asynchronous behavior
Retest this please.
Retest this please.
@masterleinad Testing now
Retest this please.
Update: I am working on one small change regarding a small "bug" (which has been fixed in the most recent CUDA release). Hoping to push it for review and thoughts today.
A CUDA bug came up when I was running the unit tests, and it requires a small addition to CudaSpace to avoid. I am adding it here to this PR for thoughts. The bug: if you request an allocation very close to the numerical limit of size_t, internally this number is rounded up to SIZE_MAX+1 (i.e. it wraps to 0), and the result is used by code that does not check for 0, causing a segfault. It was reported that a fix will land in CUDA 11.4. This came up from KokkosCore_UnitTest_Cuda2, where one test allocates close to the size_t limit.

The check does reduce the readability of this section of CudaSpace with an extra if-else statement, and after much thought I am unsure whether it should be included. The case that triggers the bug seems unlikely (requesting close to SIZE_MAX bytes, or a small negative value converted to size_t, which is the same case). I can revert to the original version if these additions are unnecessary. @crtrott @maxpkatz
The indentation needs to be fixed:
core/src/Cuda/Kokkos_CudaSpace.cpp (outdated):

  }
  else {
    error_code = cudaErrorInvalidValue;
  }
#else
Since this is a performance improvement, I would fall back to the original approach, i.e. using cudaMalloc, if the bug is triggered. Also, we should add a FIXME with an explanation so that we can remove the restriction eventually.
I agree, I will make these changes now
@masterleinad Thanks for your comments, I added these changes
…to absorb the error if triggered
Please also wrap cudaDeviceSynchronize as suggested below.
Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
Adding CUDA_SAFE_CALL wrapper to sync Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
Retest this please
Are there any more tests that should be run, or any more change requests?
Hi @dalg24, would you be able to check out the current state for approval? Thanks!
Hi guys,
I did some performance checking, and for small allocations this seems to be slower.
I ran this code, which does a bunch of random-size allocations, deletes them, and repeats that in a loop:
#include <Kokkos_Core.hpp>
#include <cmath>
#include <cstdlib>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    int N        = argc > 1 ? atoi(argv[1]) : 1000;
    int R        = argc > 2 ? atoi(argv[2]) : 10;
    int MAX_SIZE = argc > 3 ? atoi(argv[3]) : 10000000;
    double** ptrs = new double*[R];
    srand(5123);
    Kokkos::Timer timer;
    for (int i = 0; i < N; i++) {
      for (int r = 0; r < R; r++) {
        int size = rand() % MAX_SIZE;
        ptrs[r]  = (double*)Kokkos::kokkos_malloc<>(size * 8);
      }
      for (int r = 0; r < R; r++) Kokkos::kokkos_free<>(ptrs[r]);
    }
    printf("%lf\n", R * N / timer.seconds());
  }
  Kokkos::finalize();
}
At the end it spits out an allocation/deallocation rate.
With develop, for N=1000, R=10, and MAX_SIZE=100 I got around 70k/s, while with this PR I get only 54k/s.
With MAX_SIZE=10,000,000 I get 875/s and 899/s respectively (i.e. the new code is faster).
With MAX_SIZE=10,000 I still get 67k vs 53k.
With MAX_SIZE=100,000, however, it's 18k vs 52k.
With MAX_SIZE=1,000,000 it's 4k vs 9k.
Note: size is in number of doubles. So it looks like below ~100 kB cudaMalloc might be faster, while above that the async path is faster. We probably should have a switchover in the code. The current one is also a bit weird: what is the business with std::numeric_limits<size_t>::max there? Even that limit minus 1000 is unrealistic; not even Summit has that much memory on the entire machine, let alone a single node.
@crtrott I ran your test case and saw similar results on a V100 with CUDA 11.2. I think switching to cudaMalloc for <100 kB and cudaMallocAsync above that is a great idea; I can add that in for review. As for the std::numeric_limits<size_t>::max if statement: it is there because of a bug in cudaMallocAsync where requesting a size close to SIZE_MAX segfaults. It came up after running one of the Kokkos CUDA unit tests (the specific test requests auto alloc_size = std::numeric_limits<size_t>::max() - 42;), so I included the size check since the case appears in the unit tests. The check can easily be taken out; I agree a user would already have to be in dangerous territory to trigger it (requesting an unrealistic size, a negative int, etc.).
@crtrott Hi Christian, I added a threshold on the requested allocation size based on testing I ran. I found that having any request smaller than 5 kB use cudaMalloc and anything larger use cudaMallocAsync was the best mix for the test case. I also removed the "bug check" if statement (detailed above), but could easily put it back in.
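A minimal sketch of such a size-threshold dispatch, assuming the 5 kB cutoff described above. The constant, enum, and function names here are illustrative placeholders, not the actual Kokkos internals.

```cpp
#include <cstddef>

// Illustrative cutoff; the 5 kB value comes from the testing above.
constexpr size_t kAsyncThreshold = 5 * 1024;

enum class AllocPath { MallocSync, MallocAsync };

// Hypothetical helper: small requests take the classic cudaMalloc route,
// larger ones the pool-backed cudaMallocAsync route.
AllocPath choose_path(size_t bytes) {
  return bytes < kAsyncThreshold ? AllocPath::MallocSync
                                 : AllocPath::MallocAsync;
}
```

The benchmark above suggests the crossover is somewhere between ~10 kB and ~100 kB elements-wise, so the exact threshold is a tuning choice rather than a hard constant.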
#3981
Certain memory allocation patterns can take advantage of cudaMallocAsync in CUDA 11.2 as opposed to cudaMalloc. There is an immediate cudaDeviceSynchronize() call after cudaMallocAsync in this implementation so that we can take advantage of the implicit pool allocator without introducing unintended asynchronous behavior. cudaFree has likewise been replaced by cudaFreeAsync (followed by a cudaDeviceSynchronize()) when the ifdef confirms CUDA >= 11.2, so that the memory is returned to the pool.
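The flow described above can be sketched as follows. The CUDA runtime calls are replaced with host-side stubs so the control flow compiles and runs without a GPU; in the real implementation these would be cudaMallocAsync, cudaFreeAsync, and cudaDeviceSynchronize, guarded by CUDART_VERSION >= 11020.

```cpp
#include <cstdlib>

// Host-side stand-ins for the CUDA runtime calls (assumptions for
// illustration only; they are not the real CUDA API).
static void* stub_malloc_async(size_t n) { return std::malloc(n); }
static void  stub_free_async(void* p) { std::free(p); }
static void  stub_device_synchronize() { /* no-op stand-in */ }

// Allocate from the (implicit) pool, then synchronize immediately so
// callers see no new asynchronous behavior.
void* pool_allocate(size_t n) {
  void* p = stub_malloc_async(n);
  stub_device_synchronize();
  return p;
}

// Synchronize first so outstanding work on the pointer has finished,
// then return the memory to the pool.
void pool_deallocate(void* p) {
  stub_device_synchronize();
  stub_free_async(p);
}
```

The point of the immediate synchronize on both paths is that callers keep the blocking semantics of cudaMalloc/cudaFree while the allocator still benefits from pool reuse.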