
Adding CudaMallocAsync support when using CUDA version >= 11.2 #4026

Merged: 16 commits merged into kokkos:develop on Jul 1, 2021

Conversation

matt-stack
Contributor

#3981

Certain memory allocation patterns can benefit from cudaMallocAsync (available since CUDA 11.2) instead of cudaMalloc. In this implementation, an immediate cudaDeviceSynchronize() follows each cudaMallocAsync call, so we take advantage of the implicit pool allocator without introducing unintended asynchronous behavior. cudaFree has likewise been replaced by cudaFreeAsync when the ifdef confirms CUDA >= 11.2, so that the memory is returned to the pool, again followed by a cudaDeviceSynchronize().
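
For illustration, a minimal sketch of the pattern described above (not the actual Kokkos implementation; the function names and simplified error handling are assumptions):

#include <cuda_runtime.h>

// Sketch of the allocate path: pool allocation via cudaMallocAsync, followed
// immediately by a device synchronization so no asynchrony leaks to callers.
void* example_allocate(size_t bytes) {
  void* ptr = nullptr;
#if defined(CUDART_VERSION) && (CUDART_VERSION >= 11020)
  cudaMallocAsync(&ptr, bytes, 0);  // default stream, implicit memory pool
  cudaDeviceSynchronize();
#else
  cudaMalloc(&ptr, bytes);
#endif
  return ptr;
}

// Sketch of the deallocate path: return the memory to the pool, then synchronize.
void example_deallocate(void* ptr) {
#if defined(CUDART_VERSION) && (CUDART_VERSION >= 11020)
  cudaFreeAsync(ptr, 0);
  cudaDeviceSynchronize();
#else
  cudaFree(ptr);
#endif
}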

matt-stack and others added 2 commits May 13, 2021 16:27
…e to compile using cudaMallocAsync and cudaFreeAsync with an immediate cudaDeviceSynchronize if CUDART_VERSION is greater than 11.2, or uses normal cudaMalloc and cudaFree if using an older version
@dalg24-jenkins
Collaborator

Can one of the admins verify this patch?

@Rombur
Member

Rombur commented May 13, 2021

OK to test

@crtrott
Member

crtrott commented May 13, 2021

Pushed the clang-format fix.

core/src/Cuda/Kokkos_CudaSpace.cpp (review comments: outdated, resolved)
matt-stack and others added 3 commits May 14, 2021 08:03
Co-authored-by: Damien L-G <dalg24+github@gmail.com>
Adding #error to impl deallocate
Adding a call to cudaDeviceSynchronize before cudaFreeAsync to assure users that cudaFreeAsync does not add any unintended asynchronous behavior
@matt-stack matt-stack requested a review from dalg24 May 24, 2021 15:12
@masterleinad
Contributor

Retest this please.

@masterleinad
Contributor

Retest this please.

@matt-stack
Contributor Author

@masterleinad Testing now

@masterleinad
Contributor

Retest this please.

@matt-stack
Contributor Author

Update: I am working on one small change to handle a small "bug" (one that has been fixed in the most recent CUDA release). I hope to push it for review and thoughts today.

@matt-stack
Contributor Author

A CUDA bug came up while I was running the unit tests, and it requires a small addition to CudaSpace to work around; I am adding it to this PR for thoughts. The bug: if you request a size very close to the numerical limit of size_t, the value is internally rounded up to SIZE_MAX + 1, which wraps to 0 and is then consumed by code that does not check for 0, causing a segfault. It was reported that the fix will be in CUDA 11.4.

This came up from KokkosCore_UnitTest_Cuda2, where one test allocates
auto arg_alloc_size = std::numeric_limits<size_t>::max() - 42;
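
For intuition, a standalone sketch of the wrap-around; the 4096-byte rounding granularity is an assumed example, not the value CUDA uses internally:

#include <cstdio>
#include <limits>

int main() {
  // Rounding a near-SIZE_MAX request up to an allocation granularity
  // overflows to 0 in unsigned arithmetic, which is the value the
  // driver-side code then fails to check for.
  const size_t request     = std::numeric_limits<size_t>::max() - 42;
  const size_t granularity = 4096;  // assumed example granularity
  const size_t rounded     = (request + granularity - 1) / granularity * granularity;
  std::printf("rounded request = %zu\n", rounded);  // prints 0
  return 0;
}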

The check does reduce the readability of this section of CudaSpace by adding an extra if-else, and after much thought I am not sure whether it should be included. The case that triggers the bug seems unlikely (requesting a size_t close to SIZE_MAX, or a small negative value that wraps to the same thing). I can revert to the original version if these additions are unnecessary. @crtrott @maxpkatz

@masterleinad
Contributor

The indentation needs to be fixed:

./scripts/docker/check_format_cpp.sh
diff --git a/core/src/Cuda/Kokkos_CudaSpace.cpp b/core/src/Cuda/Kokkos_CudaSpace.cpp
index 6123e93cf..9b8aaef68 100644
--- a/core/src/Cuda/Kokkos_CudaSpace.cpp
+++ b/core/src/Cuda/Kokkos_CudaSpace.cpp
@@ -226,11 +226,10 @@ void *CudaSpace::impl_allocate(
 #error CUDART_VERSION undefined!
 #elif (CUDART_VERSION >= 11020)
   cudaError_t error_code;
-  if ( (size_t)arg_alloc_size < (std::numeric_limits<size_t>::max() - 1000) ) {
+  if ((size_t)arg_alloc_size < (std::numeric_limits<size_t>::max() - 1000)) {
     error_code = cudaMallocAsync(&ptr, arg_alloc_size, 0);
     cudaDeviceSynchronize();
-  }
-  else {
+  } else {
     error_code = cudaErrorInvalidValue;
   }
 #else
script returned exit code 1

Comment on lines 232 to 236:

  }
  else {
    error_code = cudaErrorInvalidValue;
  }
#else
Contributor

Since this is a performance improvement, I would fall back to the original approach, i.e. using cudaMalloc, if the bug is triggered. Also, we should add a FIXME with an explanation so that we can remove the restriction eventually.
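
Sketched, that fallback might look like the following; the wrapper function and the exact size margin are illustrative assumptions, not the merged code:

#include <cuda_runtime.h>
#include <limits>

void* allocate_with_fallback(size_t arg_alloc_size) {
  void* ptr = nullptr;
#if defined(CUDART_VERSION) && (CUDART_VERSION >= 11020)
  // FIXME: cudaMallocAsync before CUDA 11.4 can segfault for requests close
  // to SIZE_MAX; fall back to cudaMalloc there and drop this check once the
  // fixed CUDA version is the minimum supported one.
  if (arg_alloc_size < std::numeric_limits<size_t>::max() - 1000) {
    cudaMallocAsync(&ptr, arg_alloc_size, 0);
    cudaDeviceSynchronize();
  } else {
    cudaMalloc(&ptr, arg_alloc_size);
  }
#else
  cudaMalloc(&ptr, arg_alloc_size);
#endif
  return ptr;
}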

Contributor Author

I agree, I will make these changes now

Contributor Author

@masterleinad Thanks for your comments, I added these changes

@masterleinad masterleinad left a comment (Contributor)

Please also wrap cudaDeviceSynchronize as suggested below.
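
The intent of that request, sketched with a local error-checking macro (CHECK_CUDA below is a stand-in for Kokkos' own wrapper macro, not the project's actual name for it):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Stand-in for the project's CUDA error-checking wrapper macro.
#define CHECK_CUDA(call)                                          \
  do {                                                            \
    cudaError_t err = (call);                                     \
    if (err != cudaSuccess) {                                     \
      std::fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                   cudaGetErrorString(err), __FILE__, __LINE__);  \
      std::abort();                                               \
    }                                                             \
  } while (0)

void example_deallocate_checked(void* ptr) {
#if defined(CUDART_VERSION) && (CUDART_VERSION >= 11020)
  // Sync placement follows the discussion above: synchronize, free into the
  // pool, then synchronize again; every call is wrapped for error checking.
  CHECK_CUDA(cudaDeviceSynchronize());
  CHECK_CUDA(cudaFreeAsync(ptr, 0));
  CHECK_CUDA(cudaDeviceSynchronize());
#else
  CHECK_CUDA(cudaFree(ptr));
#endif
}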

core/src/Cuda/Kokkos_CudaSpace.cpp (review comments: outdated, resolved)
matt-stack and others added 3 commits June 2, 2021 12:47
Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
Adding CUDA_SAFE_CALL wrapper to sync

Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
@crtrott
Member

crtrott commented Jun 4, 2021

Retest this please

@matt-stack
Contributor Author

Are there any more tests that should be run, or any further change requests?

@matt-stack
Contributor Author

Hi @dalg24, would you be able to check out this current state for approval? Thanks!

@crtrott crtrott left a comment (Member)

Hi guys,

I did some performance checking and for small allocations this seems to be slower.

I ran this code, which performs a batch of randomly sized allocations, frees them, and repeats that in a loop:

#include <Kokkos_Core.hpp>
#include <cmath>
#include <cstdio>
#include <cstdlib>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    int N        = argc > 1 ? atoi(argv[1]) : 1000;      // outer iterations
    int R        = argc > 2 ? atoi(argv[2]) : 10;        // allocations per iteration
    int MAX_SIZE = argc > 3 ? atoi(argv[3]) : 10000000;  // max size, in doubles

    double** ptrs = new double*[R];
    srand(5123);
    Kokkos::Timer timer;
    for (int i = 0; i < N; i++) {
      // Allocate R buffers of random size, then free them all.
      for (int r = 0; r < R; r++) {
        int size = rand() % MAX_SIZE;
        ptrs[r]  = (double*)Kokkos::kokkos_malloc<>(size * 8);
      }
      for (int r = 0; r < R; r++) Kokkos::kokkos_free<>(ptrs[r]);
    }
    // Allocation/deallocation pairs per second.
    printf("%lf\n", R * N / timer.seconds());
  }
  Kokkos::finalize();
}

At the end it spits out an allocation/deallocation rate.
With develop, for N=1000, R=10, and MAX_SIZE=100 I got around 70k/s, while with this PR I get only 54k/s.
With MAX_SIZE=10,000,000 I get 875/s and 899/s respectively (i.e., the new code is faster).

With MAX_SIZE=10,000 I still get 67k vs 53k.
With MAX_SIZE=100,000, however, it's 18k vs 52k.
With MAX_SIZE=1,000,000 it's 4k vs 9k.

Note that the size is in number of doubles. So it looks like below roughly 100 kB cudaMalloc might be faster, while above that the async path is faster. We should probably add a switchover in the code. The current check is also a bit weird: what is the business with std::numeric_limits<size_t>::max() there? Even minus 1000, that limit is unrealistic; not even Summit has that much memory on the entire machine, let alone a single node.

@matt-stack
Contributor Author

@crtrott I ran your test case and saw similar results on a V100 with CUDA 11.2. I think a switch to cudaMalloc below ~100 kB and cudaMallocAsync above is a great idea; I can add that in for review. As for the std::numeric_limits<size_t>::max() if statement: it is there because of a bug in cudaMallocAsync where requesting a size close to SIZE_MAX causes a segfault. It came up while running one of the Kokkos CUDA unit tests (the specific request: auto alloc_size = std::numeric_limits<size_t>::max() - 42;), so I included the size check since the case appears in the unit tests. The check can easily be taken out; I agree a user would already have to be in dangerous territory to trigger it (requesting an unrealistic size, passing a negative int, etc.).

@matt-stack
Contributor Author

@crtrott Hi Christian, I added a threshold on the requested allocation size based on the testing I ran: requests smaller than 5 kB use cudaMalloc and larger ones use cudaMallocAsync, which was the best mix for the test case. I also removed the "bug check" if statement (detailed above), but could easily put it back in.
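
A sketch of that threshold dispatch (the 5 kB constant is the one mentioned above; the wrapper function and surrounding plumbing are simplified assumptions):

#include <cuda_runtime.h>

void* allocate_with_threshold(size_t bytes) {
  void* ptr = nullptr;
#if defined(CUDART_VERSION) && (CUDART_VERSION >= 11020)
  // Small requests are cheaper through cudaMalloc; larger ones benefit from
  // the pool allocator behind cudaMallocAsync. Cutoff per the testing above.
  constexpr size_t kAsyncThreshold = 5000;  // ~5 kB
  if (bytes < kAsyncThreshold) {
    cudaMalloc(&ptr, bytes);
  } else {
    cudaMallocAsync(&ptr, bytes, 0);
    cudaDeviceSynchronize();
  }
#else
  cudaMalloc(&ptr, bytes);
#endif
  return ptr;
}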

core/src/Cuda/Kokkos_CudaSpace.cpp (review comment: outdated, resolved)
@crtrott crtrott merged commit 607bf89 into kokkos:develop Jul 1, 2021
@ajpowelsnl ajpowelsnl added the InDevelop label (Enhancement, fix, etc. has been merged into the develop branch) on Jul 20, 2021
@masterleinad masterleinad mentioned this pull request Aug 10, 2023