-
Notifications
You must be signed in to change notification settings - Fork 412
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Memory Pool size setting for cudaAsyncMalloc #7032
base: develop
Are you sure you want to change the base?
Conversation
- 1GiB for 32 bit - 16 GiB for 64 bit - removed comparison of NULL with Ox0 - reapplied clang-format
core/src/Cuda/Kokkos_CudaSpace.cpp
Outdated
// Not permitted | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably print the problem.
// Not permitted | |
return false; | |
std::cerr << "KOKKOS_CUDA_MEMPOOL_SIZE couldn't be parsed properly!\n" | |
// Not permitted | |
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I can change it to that. Currently what will happen is that false will be returned but the error_code
will still be cudaSuccess
and the error is dealt with as an exception on line 320. The logic is that if this routine returns false either a CUDA API failed or the parsing failed. The failure of the CUDA API can be determined by looking at error_code
However I am happy to accommodate whichever way you prefer to promulgate the error to the user. I can put the error here, or at the point of raising the exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a brief report here about not being able to parse units in af41845
core/src/Cuda/Kokkos_CudaSpace.cpp
Outdated
|
||
requested_size *= factor; | ||
size_t n_bytes = static_cast<size_t>(std::ceil(requested_size)); | ||
if (!(n_bytes > 0)) return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
size_t
is unsigned so this equivalent to
if (!(n_bytes > 0)) return false; | |
if (n_bytes == 0) return false; |
but do we need this check then or would you rather wan to check that requested_size
is not greater than the largest number representable by size_t
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I was giving the possibility of a -ve original amount in the double. But you are right - after the case it should be an unsigned number. I guess if the amount is -ve we'd get a bad cast exception. I should maybe check the - some other way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be resolved in: 7e6b25a
core/src/Cuda/Kokkos_CudaSpace.cpp
Outdated
std::cout << "Initializing Default Memory Pool for device " << device_id | ||
<< "\n"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the comment here? It seems empty
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant for you to avoid printing to std::cout
.
std::cout << "Initializing Default Memory Pool for device " << device_id | |
<< "\n"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK.
Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
Accepting suggestion from @masterleinad re printing info message. Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
Accepting change from @masterleinad Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
Accepting suggested change from @masterleinad Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
Accepting change fuggested by @masterleinad Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
- Formatting (spaces etc. accepted on page) - Formatting required size (check for nonpositivity or being too big) - Remaining issue: error_code propagation strategy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @masterleinad
I accepted the cosmetic changes, and pushed changes to deal with the some error reporting. One real issue remains I guess which is with the 'safe call' v.s. my return fals approach. The latter one is designed to return false either if there is a parse error or if there is a Cuda error. The Cuda error is reflected in the value of error_code
. Let me know how you want to deal with that. I can do whichever way is best for you.
core/src/Cuda/Kokkos_CudaSpace.cpp
Outdated
|
||
requested_size *= factor; | ||
size_t n_bytes = static_cast<size_t>(std::ceil(requested_size)); | ||
if (!(n_bytes > 0)) return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I was giving the possibility of a -ve original amount in the double. But you are right - after the case it should be an unsigned number. I guess if the amount is -ve we'd get a bad cast exception. I should maybe check the - some other way.
core/src/Cuda/Kokkos_CudaSpace.cpp
Outdated
|
||
requested_size *= factor; | ||
size_t n_bytes = static_cast<size_t>(std::ceil(requested_size)); | ||
if (!(n_bytes > 0)) return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be resolved in: 7e6b25a
core/src/Cuda/Kokkos_CudaSpace.cpp
Outdated
// Not permitted | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a brief report here about not being able to parse units in af41845
for (size_t num : sizes) { | ||
inner_loop_timer.reset(); | ||
for (int i = 0; i < iters; i++) { | ||
Kokkos::View<float *, MemorySpace> a("unlabeled", num); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kokkos::View<float *, MemorySpace> a("unlabeled", num); | |
Kokkos::View<float *, MemorySpace> a(Kokkos::view_alloc(Kokkos::WithoutInitializing, "unlabeled"), num); |
You don't want to measure initialization here but only allocation, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. That is a good suggestion. I should accept this and redraw my graph.
inner_loop_times.push_back(std::make_pair<>( | ||
num * sizeof(float), inner_loop_time / static_cast<double>(iters))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inner_loop_times.push_back(std::make_pair<>( | |
num * sizeof(float), inner_loop_time / static_cast<double>(iters))); | |
inner_loop_times.emplace_back( | |
num * sizeof(float), inner_loop_time / static_cast<double>(iters)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah... Maybe... I think push_back is just as good in this instance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just more concise. 🙂
// | ||
std::pair<double, double> test(bool up) { | ||
int iters = 50; | ||
size_t minimum = 8 / sizeof(float); // 64K |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the smallest allocation will use 8 bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. originally I was going for more. I see 'comment-rot' there. The comment about 64K is not true. I wanted to start there but others suggested 8 bytes. I will fix.
// Check the env var for reporting | ||
char *env_string = getenv("KOKKOS_CUDA_MEMPOOL_SIZE"); | ||
std::cout << "Async Malloc Benchmark: KOKKOS_CUDA_MEMPOOL_SIZE is "; | ||
|
||
if (env_string == nullptr) | ||
std::cout << "not set,"; | ||
else | ||
std::cout << " " << env_string << ","; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Check the env var for reporting | |
char *env_string = getenv("KOKKOS_CUDA_MEMPOOL_SIZE"); | |
std::cout << "Async Malloc Benchmark: KOKKOS_CUDA_MEMPOOL_SIZE is "; | |
if (env_string == nullptr) | |
std::cout << "not set,"; | |
else | |
std::cout << " " << env_string << ","; | |
#ifdef KOKKOS_ENABLE_CUDA | |
// Check the env var for reporting | |
char *env_string = getenv("KOKKOS_CUDA_MEMPOOL_SIZE"); | |
std::cout << "Async Malloc Benchmark: KOKKOS_CUDA_MEMPOOL_SIZE is "; | |
if (env_string == nullptr) | |
std::cout << "not set,"; | |
else | |
std::cout << ' ' << env_string << ','; | |
#endif |
to be more explicit that this only makes sense if we are actually testing with the Cuda
backend and we might want to add similar logic for other backends.
// Now we allocate the pool size amount + a little more (64 bytes is arbitrary | ||
// it just needs to be more than the retention size. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you know that this will not overflow? Maybe take min(n_bytes + 64, std::numeric_limits<size_t>::max())
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch... Thanks!
|
||
// At this point requested_size should be appropriate | ||
// neither too big nor negative. | ||
size_t n_bytes = static_cast<size_t>(std::ceil(requested_size)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be larger than the largest representable size_t
, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the other hand if I floor it we may be short. I can set the check on requested size to be < max
// Now we free the memory, and our poolsize amount will be retained in the | ||
// pool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Now we free the memory, and our poolsize amount will be retained in the | |
// pool | |
// Now we free the memory, and our pool size amount will be retained in the | |
// pool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
// Always use the Async. Don't use the 40K Lower limit - may well be arch | ||
// dependent. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should check cudaDevAttrMemoryPoolsSupported
before going here, see kokkos/kokkos-core-wiki#525 (comment), but probably not in this pull request.
// if the API calls all returned cudaSuccess, the strtod() could | ||
// have thrown an exception which we caught. In this case we | ||
// throw a dumb exception here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// if the API calls all returned cudaSuccess, the strtod() could | |
// have thrown an exception which we caught. In this case we | |
// throw a dumb exception here | |
// if the API calls all returned cudaSuccess, parsing KOKKOS_CUDA_MEMPOOL_SIZE | |
// could have failed. In this case, we throw a dumb exception here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Blocking for now: I am not convinced that this is the right programmatic path forward for this for a couple reasons.
- I am not comfortable adding a bunch of backend specific environment variables and command line options for Kokkos - this becomes very fast untestable in particular if people than start to ask for all the other options of stuff like the default memory pool.
- I am not convinced that making the default allocators behave more like actual memory pools is the right thing to do. Generally, I want them to be just "system allocators".
- People would need to understand the implications of what that threshold does and my gut feeling is that in isolation it just becomes a "make it larger and then allocations are fast" without folks realizing what that means (e.g. that this is a fixed overhead on the GPU potentially causing issues in multi-tenant cases).
- I think we should expose memory pool functionality as memory spaces which advertise that they are backed by a memory pool and which you pass around. These things can then have all the typical memory pool property and we could map it to backend specific ones if available instead of writing our own impl.
We may wanna organize a meeting for a larger discussion. |
Hi Christian, Thanks for your comments and reasoning. I will reach out to you on the slack or direct email, and we can organize a meeting. Best wishes, B |
A patch to allow setting the pool size for cudaMallocAsync
Implementation
The first time cudaAsyncMalloc is called (in void* impl_alloc_common() in Kokkos_CudaSpace.cpp)
we check the environment variable KOKKOS_CUDA_MEMPOOL_SIZE . We overallocate this by 64 bytes (an arbitrary amount, just to get us over the size), then set some properties
on the device default mempool to retain KOKKOS_CUDA_MEMPOOL_SIZE of memory after an async free. Subsequent allocations are
faster (by between 1 to 2 orders of magnitude) for sufficiently large sized chunks of memory which fit in the Pool.
Efficacy
A benchmark test has been placed in kokkos/benchmarks/async_test which can be built using the 'Makefile' setup.
To execute it, export the KOKKOS_CUDA_MEMPOOL_SIZE and run
async_alloc.cuda
. The utility will range through allocations from 8B to 16GB.And collect timing of allocating (and freeing) a Kokkos::View.
The
-d
flag toasync_alloc.cuda
can be used to specify cycling downwards i.e. from 16GB to 8B.The attached PDF shows the benchmark times, sweeping up from 0 to 8GB sizes, with various mempool settings show the gains from the async allocator from allocation sizes of 512KB upwards
This data is from an Ada L40S GPU. Other GPU architecture benchmarks are work in progress just now.
AsyncAllocUp.pdf