Adding CudaMallocSync support when using CUDA version >= 11.2 #4026
Conversation
…e to compile using cudaMallocAsync and cudaFreeAsync with an immediate cudaDeviceSynchronize if CUDART_VERSION is at least 11.2, or falls back to normal cudaMalloc and cudaFree on older versions
Can one of the admins verify this patch?
OK to test
Pushed the clang-format fix.
Co-authored-by: Damien L-G <dalg24+github@gmail.com>
Adding #error to impl deallocate
Adding a call to cudaDeviceSynchronize before cudaFreeAsync to assure users that cudaFreeAsync does not introduce any unintended asynchronous behavior
Retest this please.
Retest this please.
@masterleinad Testing now
Retest this please.
Update: I am working on one small change regarding a small "bug" (which has been fixed in the most recent CUDA release). Hoping to push it for review and thoughts today.
A CUDA bug came up when I was running the unit tests, and it requires a small addition to CudaSpace to avoid. I am adding it here to this PR for thoughts. The bug: if you request an allocation very close to the numerical limit of size_t, internally this number is rounded up to SIZE_MAX+1 (i.e. it wraps to 0), and the result is used by code that does not check for 0, causing a segfault. It was reported that a fix will land in CUDA 11.4. This came up from KokkosCore_UnitTest_Cuda2, where one test allocates close to the size_t limit.

The check does reduce the readability of this section of CudaSpace with an extra if-else statement, and after much thought I am unsure whether it should be included. The case that triggers the bug seems unlikely (requesting close to SIZE_MAX bytes, or a small negative value converted to size_t, which is the same case). I can revert to the original version if these additions are unnecessary. @crtrott @maxpkatz
The indentation needs to be fixed:
core/src/Cuda/Kokkos_CudaSpace.cpp (outdated):

  }
  else {
    error_code = cudaErrorInvalidValue;
  }
#else
Since this is a performance improvement, I would fall back to the original approach, i.e. using cudaMalloc, if the bug is triggered. Also, we should add a FIXME with an explanation so that we can remove the restriction eventually.
I agree, I will make these changes now
@masterleinad Thanks for your comments, I added these changes
…to absorb the error if triggered
Please also wrap cudaDeviceSynchronize as suggested below.
Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
Adding CUDA_SAFE_CALL wrapper to sync Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
Retest this please
Are there any more tests that should be run, or any more change requests?
Hi @dalg24, would you be able to check out the current state for approval? Thanks!
Hi guys,
I did some performance checking, and for small allocations this seems to be slower.
I ran this code, which does a bunch of random-size allocations, deletes them, and repeats that in a loop:
#include <Kokkos_Core.hpp>
#include <cmath>
#include <cstdlib>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    int N        = argc > 1 ? atoi(argv[1]) : 1000;
    int R        = argc > 2 ? atoi(argv[2]) : 10;
    int MAX_SIZE = argc > 3 ? atoi(argv[3]) : 10000000;
    double** ptrs = new double*[R];
    srand(5123);
    Kokkos::Timer timer;
    for (int i = 0; i < N; i++) {
      for (int r = 0; r < R; r++) {
        int size = rand() % MAX_SIZE;
        ptrs[r]  = (double*)Kokkos::kokkos_malloc<>(size * 8);
      }
      for (int r = 0; r < R; r++) Kokkos::kokkos_free<>(ptrs[r]);
    }
    printf("%lf\n", R * N / timer.seconds());
  }
  Kokkos::finalize();
}
At the end it spits out an allocation/deallocation rate.
With develop, for N=1000, R=10, and MAX_SIZE=100 I got around 70k/s, while with this PR I get only 54k/s.
With MAX_SIZE=10,000,000 I get 875/s and 899/s respectively (i.e. the new code is faster).
With MAX_SIZE=10,000 I still get 67k vs 53k.
With MAX_SIZE=100,000, however, it's 18k vs 52k.
With MAX_SIZE=1,000,000 it's 4k vs 9k.
Note: size is in number of doubles. So it looks like below ~100 kB cudaMalloc might be faster, while above that the async path is faster. We probably should have a switchover in the code. The current one is also a bit weird: what is the business with std::numeric_limits<size_t>::max there? Even that limit minus 1000 is unrealistic; not even Summit has that much memory on the entire machine, let alone a single node.
@crtrott I ran your test case and saw similar results on a V100 with CUDA 11.2. I think switching to cudaMalloc for <100 kB and cudaMallocAsync above that is a great idea; I can add that in for review. As for the std::numeric_limits<size_t>::max if statement: it is there because of a bug in cudaMallocAsync where requesting a size close to SIZE_MAX segfaults. It came up after running one of the Kokkos CUDA unit tests (the specific test requests auto alloc_size = std::numeric_limits<size_t>::max() - 42;), so I included the size check since the case appears in the unit tests. The check can easily be taken out; I agree a user would already have to be in dangerous territory to trigger it (requesting an unrealistic size, a negative int, etc.).
@crtrott Hi Christian, I added a threshold on the requested allocation size based on testing I ran. I found that having any request smaller than 5 kB use cudaMalloc and anything larger use cudaMallocAsync was the best mix for the test case. I also removed the "bug check" if statement (detailed above), but could easily put it back in.
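A minimal sketch of such a size-threshold dispatch, assuming the 5 kB cutoff described above. The constant, enum, and function names here are illustrative placeholders, not the actual Kokkos internals.

```cpp
#include <cstddef>

// Illustrative cutoff; the 5 kB value comes from the testing above.
constexpr size_t kAsyncThreshold = 5 * 1024;

enum class AllocPath { MallocSync, MallocAsync };

// Hypothetical helper: small requests take the classic cudaMalloc route,
// larger ones the pool-backed cudaMallocAsync route.
AllocPath choose_path(size_t bytes) {
  return bytes < kAsyncThreshold ? AllocPath::MallocSync
                                 : AllocPath::MallocAsync;
}
```

The benchmark above suggests the crossover is somewhere between ~10 kB and ~100 kB elements-wise, so the exact threshold is a tuning choice rather than a hard constant.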
#3981
Certain memory allocation patterns can take advantage of cudaMallocAsync in CUDA 11.2 as opposed to cudaMalloc. There is an immediate cudaDeviceSynchronize() call after cudaMallocAsync in this implementation so that we can take advantage of the implicit pool allocator without introducing unintended asynchronous behavior. cudaFree has likewise been replaced by cudaFreeAsync (followed by a cudaDeviceSynchronize()) when the ifdef confirms CUDA >= 11.2, so that the memory is returned to the pool.
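The flow described above can be sketched as follows. The CUDA runtime calls are replaced with host-side stubs so the control flow compiles and runs without a GPU; in the real implementation these would be cudaMallocAsync, cudaFreeAsync, and cudaDeviceSynchronize, guarded by CUDART_VERSION >= 11020.

```cpp
#include <cstdlib>

// Host-side stand-ins for the CUDA runtime calls (assumptions for
// illustration only; they are not the real CUDA API).
static void* stub_malloc_async(size_t n) { return std::malloc(n); }
static void  stub_free_async(void* p) { std::free(p); }
static void  stub_device_synchronize() { /* no-op stand-in */ }

// Allocate from the (implicit) pool, then synchronize immediately so
// callers see no new asynchronous behavior.
void* pool_allocate(size_t n) {
  void* p = stub_malloc_async(n);
  stub_device_synchronize();
  return p;
}

// Synchronize first so outstanding work on the pointer has finished,
// then return the memory to the pool.
void pool_deallocate(void* p) {
  stub_device_synchronize();
  stub_free_async(p);
}
```

The point of the immediate synchronize on both paths is that callers keep the blocking semantics of cudaMalloc/cudaFree while the allocator still benefits from pool reuse.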