Memory Pool size setting for cudaAsyncMalloc #7032

Open: wants to merge 13 commits into base: develop

Contributor

@bjoo bjoo commented May 24, 2024

A patch to allow setting the pool size for cudaMallocAsync

Implementation

The first time cudaMallocAsync is called (in void* impl_alloc_common() in Kokkos_CudaSpace.cpp),
we check the environment variable KOKKOS_CUDA_MEMPOOL_SIZE. We allocate that amount plus 64 bytes (an arbitrary margin, just enough to exceed the requested size), then set properties
on the device's default mempool so that it retains KOKKOS_CUDA_MEMPOOL_SIZE bytes of memory after an async free. Subsequent allocations are
faster (by one to two orders of magnitude) for sufficiently large chunks of memory that fit in the pool.

Efficacy

A benchmark has been placed in kokkos/benchmarks/async_test, which can be built using the 'Makefile' setup.
To run it, export KOKKOS_CUDA_MEMPOOL_SIZE and run async_alloc.cuda. The utility ranges through allocations from 8B to 16GB
and collects the time taken to allocate (and free) a Kokkos::View.

The -d flag to async_alloc.cuda makes the sweep run downwards, i.e. from 16GB to 8B.

The attached PDF shows the benchmark times, sweeping up from 0 to 8GB, for various mempool settings. The async allocator shows gains for allocation sizes from 512KB upwards:

  • at about 16MiB the allocator with an unspecified (0) pool size becomes as expensive as cudaMalloc, and beyond that it is worse
  • using a pool maintains an advantage of one to two orders of magnitude, depending on the allocation size
  • after 4GB the allocation efficiency with the 4.2GB pool starts to deteriorate as we run out of pool space.

This data is from an Ada L40S GPU. Benchmarks on other GPU architectures are in progress.
AsyncAllocUp.pdf

bjoo added 3 commits May 24, 2024 13:58
- 1GiB for 32 bit
- 16 GiB for 64 bit
- removed comparison of NULL with 0x0
- reapplied clang-format
Comment on lines 223 to 224
// Not permitted
return false;
Contributor

This should probably print the problem.

Suggested change
- // Not permitted
- return false;
+ std::cerr << "KOKKOS_CUDA_MEMPOOL_SIZE couldn't be parsed properly!\n";
+ // Not permitted
+ return false;

Contributor Author

So I can change it to that. Currently, false is returned but error_code is still cudaSuccess, and the error is dealt with as an exception on line 320. The logic is that this routine returns false if either a CUDA API call failed or the parsing failed. A CUDA API failure can be identified by looking at error_code.

However I am happy to accommodate whichever way you prefer to promulgate the error to the user. I can put the error here, or at the point of raising the exception.

Contributor Author

Added a brief report here about not being able to parse units in af41845
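
The thread above turns on a single bool return value covering two different failures. One possible way to keep them apart is sketched below. It is not code from this PR: PoolInitStatus, PoolInitResult, classify and describe are hypothetical names, and a plain int stands in for cudaError_t (with 0 playing the role of cudaSuccess) so that the sketch builds without the CUDA toolkit.

```cpp
#include <string>

// Hypothetical names throughout; this is not code from the PR.
// `int` stands in for cudaError_t so the sketch builds without the CUDA
// toolkit; 0 plays the role of cudaSuccess.
enum class PoolInitStatus { Ok, ParseError, CudaError };

struct PoolInitResult {
  PoolInitStatus status;
  int error_code;  // meaningful only when status == CudaError
};

// Classify the outcome of the two steps: parsing the environment variable
// and calling the CUDA API. A parse failure takes precedence.
inline PoolInitResult classify(bool parsed_ok, int cuda_error_code) {
  if (!parsed_ok) return {PoolInitStatus::ParseError, 0};
  if (cuda_error_code != 0) {
    return {PoolInitStatus::CudaError, cuda_error_code};
  }
  return {PoolInitStatus::Ok, 0};
}

// Build the message a caller could attach to its exception.
inline std::string describe(const PoolInitResult& r) {
  switch (r.status) {
    case PoolInitStatus::Ok:
      return "ok";
    case PoolInitStatus::ParseError:
      return "KOKKOS_CUDA_MEMPOOL_SIZE couldn't be parsed properly!";
    case PoolInitStatus::CudaError:
      return "CUDA API call failed with error code " +
             std::to_string(r.error_code);
  }
  return "unknown";
}
```

With a result type like this, the message can be produced at the point where the exception is raised, and the caller no longer has to infer a parse failure from error_code still being cudaSuccess.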

core/src/Cuda/Kokkos_CudaSpace.cpp (outdated, resolved)

requested_size *= factor;
size_t n_bytes = static_cast<size_t>(std::ceil(requested_size));
if (!(n_bytes > 0)) return false;
Contributor

size_t is unsigned, so this is equivalent to

Suggested change
- if (!(n_bytes > 0)) return false;
+ if (n_bytes == 0) return false;

but do we need this check then, or would you rather want to check that requested_size is not greater than the largest number representable by size_t?

Contributor Author

So I was allowing for the possibility of a negative original amount in the double. But you are right: after the cast it should be an unsigned number. I guess if the amount is negative we'd get a bad cast exception. I should maybe check the sign some other way.

Contributor Author

Should be resolved in: 7e6b25a
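
For illustration, the sign check discussed in this thread can be done while the value is still a double, before any cast. The sketch below is a guess at the shape of such a parser, not the code in 7e6b25a: parse_pool_size is a hypothetical name, and the accepted unit suffixes (K, M, G as powers of 1024, or no suffix for bytes) are an assumption. Note that std::strtod does not throw; it reports failure through its end pointer and return value.

```cpp
#include <cctype>
#include <cmath>
#include <cstdlib>

// Hypothetical helper, not the PR's code. Parses strings such as "4.2G"
// into a byte count held in a double. The accepted suffixes (K, M, G as
// powers of 1024, or none for bytes) are an assumption for illustration.
// Returns false for null, unparseable, non-finite, or non-positive input.
inline bool parse_pool_size(const char* text, double& bytes_out) {
  if (text == nullptr) return false;

  char* end = nullptr;
  const double requested_size = std::strtod(text, &end);
  if (end == text) return false;  // no number found at the start
  // Check the sign while the value is still a double.
  if (!std::isfinite(requested_size) || requested_size <= 0.0) return false;

  double factor = 1.0;
  if (*end != '\0') {
    switch (std::toupper(static_cast<unsigned char>(*end))) {
      case 'K': factor = 1024.0; break;
      case 'M': factor = 1024.0 * 1024.0; break;
      case 'G': factor = 1024.0 * 1024.0 * 1024.0; break;
      default: return false;  // unknown unit
    }
    ++end;
    if (*end != '\0') return false;  // trailing characters after the unit
  }

  bytes_out = requested_size * factor;
  return true;
}
```

Under these assumptions, "4G" parses to 4 * 1024^3 bytes, while "-1G", "lots" and "4X" are all rejected before any conversion to size_t takes place.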

core/src/Cuda/Kokkos_CudaSpace.cpp (outdated, resolved)
Comment on lines 200 to 201
std::cout << "Initializing Default Memory Pool for device " << device_id
<< "\n";
Contributor

Contributor Author

What is the comment here? It seems empty

Contributor

I meant for you to avoid printing to std::cout.

Suggested change
- std::cout << "Initializing Default Memory Pool for device " << device_id
- << "\n";

Contributor Author

OK.

core/src/Cuda/Kokkos_CudaSpace.cpp (resolved)
core/src/Cuda/Kokkos_CudaSpace.cpp (outdated, resolved)
core/src/Cuda/Kokkos_CudaSpace.cpp (outdated, resolved)
core/src/Cuda/Kokkos_CudaSpace.cpp (outdated, resolved)
bjoo and others added 7 commits May 27, 2024 13:29
Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
Accepting suggestion from @masterleinad re printing info message.

Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
Accepting change from @masterleinad

Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
Accepting suggested change from @masterleinad

Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
Accepting change suggested by @masterleinad

Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
- Formatting (spaces etc. accepted on page)
- Formatting required size (check for nonpositivity or being too big)
- Remaining issue: error_code propagation strategy
Contributor Author

@bjoo bjoo left a comment

Hi @masterleinad
I accepted the cosmetic changes and pushed changes to deal with some of the error reporting. One real issue remains, I guess: the 'safe call' vs. my return-false approach. The latter is designed to return false if there is either a parse error or a CUDA error. The CUDA error is reflected in the value of error_code. Let me know how you want to deal with that. I can do whichever way is best for you.

benchmarks/async_alloc/async_alloc.cpp (outdated, resolved)
for (size_t num : sizes) {
inner_loop_timer.reset();
for (int i = 0; i < iters; i++) {
Kokkos::View<float *, MemorySpace> a("unlabeled", num);
Contributor

Suggested change
- Kokkos::View<float *, MemorySpace> a("unlabeled", num);
+ Kokkos::View<float *, MemorySpace> a(Kokkos::view_alloc(Kokkos::WithoutInitializing, "unlabeled"), num);

You don't want to measure initialization here but only allocation, right?

Contributor Author

Right. That is a good suggestion. I should accept this and redraw my graph.

Comment on lines +51 to +52
inner_loop_times.push_back(std::make_pair<>(
num * sizeof(float), inner_loop_time / static_cast<double>(iters)));
Contributor

Suggested change
- inner_loop_times.push_back(std::make_pair<>(
- num * sizeof(float), inner_loop_time / static_cast<double>(iters)));
+ inner_loop_times.emplace_back(
+ num * sizeof(float), inner_loop_time / static_cast<double>(iters));

Contributor Author

Yeah... Maybe... I think push_back is just as good in this instance?

Contributor

Just more concise. 🙂
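
A small side-by-side sketch of the two forms discussed in this thread, to show that they store the same element. The element type of inner_loop_times is not visible in the quoted lines, so the Timing alias (a pair of byte count and seconds per iteration) is an assumption, and record_push and record_emplace are hypothetical wrappers.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Assumed element type: (bytes allocated, seconds per iteration).
using Timing = std::pair<std::size_t, double>;

// Build a temporary pair, then move it into the vector.
inline void record_push(std::vector<Timing>& times, std::size_t num,
                        double inner_loop_time, int iters) {
  times.push_back(std::make_pair(
      num * sizeof(float), inner_loop_time / static_cast<double>(iters)));
}

// Forward the two values and construct the pair inside the vector.
inline void record_emplace(std::vector<Timing>& times, std::size_t num,
                           double inner_loop_time, int iters) {
  times.emplace_back(num * sizeof(float),
                     inner_loop_time / static_cast<double>(iters));
}
```

Both functions leave the vector in the same state; for a small pair like this the difference is mostly one of brevity.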

//
std::pair<double, double> test(bool up) {
int iters = 50;
size_t minimum = 8 / sizeof(float); // 64K
Contributor

So the smallest allocation will use 8 bytes.

Contributor Author

Yes. Originally I was going for more. I see 'comment-rot' there. The comment about 64K is not true. I wanted to start there but others suggested 8 bytes. I will fix.
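
To make the units explicit: minimum in the quoted line is an element count, so 8 / sizeof(float) is 2 floats, i.e. 8 bytes when sizeof(float) is 4. The sketch below shows one way the list of sizes for the sweep might be built. It is a hypothetical reconstruction: make_sizes is not a function from the benchmark, and the doubling step between sizes is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical reconstruction of how the list of sizes for the benchmark
// sweep might be built; the doubling step is an assumption.
// Returns element counts (not bytes) for a View of float.
inline std::vector<std::size_t> make_sizes(std::size_t min_bytes,
                                           std::size_t max_bytes, bool up) {
  std::vector<std::size_t> sizes;
  if (min_bytes == 0 || min_bytes > max_bytes) return sizes;
  for (std::size_t bytes = min_bytes;; bytes *= 2) {
    sizes.push_back(bytes / sizeof(float));
    // Stop before a doubling that would pass max_bytes (or overflow).
    if (bytes > max_bytes / 2) break;
  }
  // Sweep downwards when requested, as with the benchmark's -d flag.
  if (!up) std::reverse(sizes.begin(), sizes.end());
  return sizes;
}
```

With 4-byte floats, make_sizes(8, 64, true) yields the element counts 2, 4, 8, 16, which correspond to 8, 16, 32 and 64 bytes.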

Comment on lines 68 to 75
// Check the env var for reporting
char *env_string = getenv("KOKKOS_CUDA_MEMPOOL_SIZE");
std::cout << "Async Malloc Benchmark: KOKKOS_CUDA_MEMPOOL_SIZE is ";

if (env_string == nullptr)
std::cout << "not set,";
else
std::cout << " " << env_string << ",";
Contributor

Suggested change
- // Check the env var for reporting
- char *env_string = getenv("KOKKOS_CUDA_MEMPOOL_SIZE");
- std::cout << "Async Malloc Benchmark: KOKKOS_CUDA_MEMPOOL_SIZE is ";
- if (env_string == nullptr)
- std::cout << "not set,";
- else
- std::cout << " " << env_string << ",";
+ #ifdef KOKKOS_ENABLE_CUDA
+ // Check the env var for reporting
+ char *env_string = getenv("KOKKOS_CUDA_MEMPOOL_SIZE");
+ std::cout << "Async Malloc Benchmark: KOKKOS_CUDA_MEMPOOL_SIZE is ";
+ if (env_string == nullptr)
+ std::cout << "not set,";
+ else
+ std::cout << ' ' << env_string << ',';
+ #endif

to be more explicit that this only makes sense if we are actually testing with the Cuda backend and we might want to add similar logic for other backends.

Comment on lines +287 to +288
// Now we allocate the pool size amount + a little more (64 bytes is arbitrary
// it just needs to be more than the retention size.
Contributor

How do you know that this will not overflow? Maybe take min(n_bytes + 64, std::numeric_limits<size_t>::max())?

Contributor Author

Good catch... Thanks!
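
One detail worth spelling out: unsigned addition wraps around, so n_bytes + 64 has already overflowed by the time it is passed to min, and the comparison has to be made against the remaining headroom instead. A minimal sketch, with add_saturating as a hypothetical helper name:

```cpp
#include <cstddef>
#include <limits>

// Hypothetical helper: add `padding` to `n_bytes`, clamping at the largest
// size_t instead of wrapping around.
inline std::size_t add_saturating(std::size_t n_bytes, std::size_t padding) {
  const std::size_t max = std::numeric_limits<std::size_t>::max();
  // Compare against the remaining headroom; computing n_bytes + padding
  // first would already have wrapped by the time it is compared.
  if (n_bytes > max - padding) return max;
  return n_bytes + padding;
}
```

A call such as add_saturating(n_bytes, 64) would then give the over-allocation size. Whether clamping or rejecting the request is the better response to an oversized value is a separate design choice.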


// At this point requested_size should be appropriate
// neither too big nor negative.
size_t n_bytes = static_cast<size_t>(std::ceil(requested_size));
Contributor

This could be larger than the largest representable size_t, though.

Contributor Author

On the other hand if I floor it we may be short. I can set the check on requested size to be < max
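
A sketch of what such a range check could look like, keeping the ceil so the pool is never short. In C++, converting a floating-point value that does not fit the target integer type is undefined behavior rather than an exception, so the test has to come before the cast. to_size_t_checked is a hypothetical name. The limit is written as 2 raised to the number of bits in size_t because size_t's maximum itself is not exactly representable as a double on 64-bit targets.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>

// Hypothetical helper: round `requested_size` up and convert it to size_t,
// returning false when the result would not be representable.
inline bool to_size_t_checked(double requested_size, std::size_t& n_bytes) {
  if (!std::isfinite(requested_size) || requested_size < 0.0) return false;
  const double rounded = std::ceil(requested_size);
  // 2^digits is the first value past the size_t range and is exactly
  // representable as a double, unlike size_t's maximum on 64-bit targets.
  const double limit =
      std::ldexp(1.0, std::numeric_limits<std::size_t>::digits);
  if (rounded >= limit) return false;
  n_bytes = static_cast<std::size_t>(rounded);
  return true;
}
```

For example, 1023.2 converts to 1024, while -1.0 and 1e30 are rejected without ever reaching the cast.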

Comment on lines +293 to +294
// Now we free the memory, and our poolsize amount will be retained in the
// pool
Contributor

Suggested change
- // Now we free the memory, and our poolsize amount will be retained in the
- // pool
+ // Now we free the memory, and our pool size amount will be retained in the
+ // pool

Contributor Author

OK

Comment on lines +317 to +318
// Always use the Async. Don't use the 40K Lower limit - may well be arch
// dependent.
Contributor

We should check cudaDevAttrMemoryPoolsSupported before going here, see kokkos/kokkos-core-wiki#525 (comment), but probably not in this pull request.

Comment on lines +335 to +337
// if the API calls all returned cudaSuccess, the strtod() could
// have thrown an exception which we caught. In this case we
// throw a dumb exception here
Contributor

Suggested change
- // if the API calls all returned cudaSuccess, the strtod() could
- // have thrown an exception which we caught. In this case we
- // throw a dumb exception here
+ // if the API calls all returned cudaSuccess, parsing KOKKOS_CUDA_MEMPOOL_SIZE
+ // could have failed. In this case, we throw a dumb exception here

Member

@crtrott crtrott left a comment

Blocking for now: I am not convinced that this is the right programmatic path forward for this for a couple reasons.

  1. I am not comfortable adding a bunch of backend-specific environment variables and command-line options for Kokkos. This very quickly becomes untestable, in particular if people then start to ask for all the other options, such as those of the default memory pool.
  2. I am not convinced that making the default allocators behave more like actual memory pools is the right thing to do. Generally, I want them to be just "system allocators".
  3. People would need to understand the implications of what that threshold does, and my gut feeling is that in isolation it just becomes a "make it larger and then allocations are fast" knob without folks realizing what that means (e.g. that this is a fixed overhead on the GPU, potentially causing issues in multi-tenant cases).
  4. I think we should expose memory pool functionality as memory spaces which advertise that they are backed by a memory pool and which you pass around. These things can then have all the typical memory pool properties, and we could map them to backend-specific ones if available instead of writing our own impl.

Member

crtrott commented May 28, 2024

We may wanna organize a meeting for a larger discussion.

Contributor Author

bjoo commented May 29, 2024

Hi Christian, Thanks for your comments and reasoning. I will reach out to you on the slack or direct email, and we can organize a meeting. Best wishes, B

3 participants