Memory Pool size setting for cudaAsyncMalloc #7032

bjoo · 2024-05-24T14:58:42Z

A patch to allow setting the pool size for cudaMallocAsync

Implementation

The first time cudaMallocAsync is called (in void* impl_alloc_common() in Kokkos_CudaSpace.cpp),
we check the environment variable KOKKOS_CUDA_MEMPOOL_SIZE. We allocate that amount plus 64 bytes (an arbitrary amount, just to get us over the size), then set properties
on the device default mempool so that it retains KOKKOS_CUDA_MEMPOOL_SIZE of memory after an async free. Subsequent allocations are
faster (by 1 to 2 orders of magnitude) for sufficiently large chunks of memory that fit in the pool.

Efficacy

A benchmark test has been placed in kokkos/benchmarks/async_test, which can be built using the 'Makefile' setup.
To execute it, export KOKKOS_CUDA_MEMPOOL_SIZE and run async_alloc.cuda. The utility ranges through allocation sizes from 8B to 16GB
and collects the time taken to allocate (and free) a Kokkos::View.

The -d flag to async_alloc.cuda can be used to specify cycling downwards, i.e. from 16GB to 8B.
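As a rough illustration of the sweep described above, the list of sizes could be built as follows. This is only a sketch: make_sizes is a hypothetical helper, not the benchmark's code, and the doubling step between sizes is an assumption.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper: allocation sizes in bytes from min_bytes to max_bytes.
// Doubling between steps is an assumption of this sketch.
std::vector<std::uint64_t> make_sizes(std::uint64_t min_bytes,
                                      std::uint64_t max_bytes, bool up) {
  std::vector<std::uint64_t> sizes;
  if (min_bytes == 0) return sizes;  // doubling zero would never terminate
  for (std::uint64_t s = min_bytes; s <= max_bytes; s *= 2) {
    sizes.push_back(s);
    if (s > max_bytes / 2) break;  // stop before the doubling could overflow
  }
  // The -d style downward sweep is the same list in reverse order.
  if (!up) std::reverse(sizes.begin(), sizes.end());
  return sizes;
}
```

Calling make_sizes(8, 16ULL << 30, true) gives the upward sweep from 8 bytes to 16 GiB, and passing false gives the downward one.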

The attached PDF shows the benchmark times, sweeping up from 0 to 8GB sizes with various mempool settings. It shows the gains from the async allocator for allocation sizes of 512KB upwards:

- At about 16MiB, the allocator with an unspecified (0) pool size becomes as expensive as cudaMalloc, and in fact becomes worse.
- Using a pool maintains an advantage of 1 to 2 orders of magnitude, depending on the allocation size.
- After 4GB, the allocation efficiency with the 4.2GB pool starts to deteriorate as we run out of pool space.

This data is from an Ada L40S GPU. Benchmarks on other GPU architectures are work in progress.
AsyncAllocUp.pdf

masterleinad · 2024-05-24T20:09:03Z

core/src/Cuda/Kokkos_CudaSpace.cpp

+        // Not permitted
+        return false;


This should probably print the problem.

Suggested change

-        // Not permitted
-        return false;
+        std::cerr << "KOKKOS_CUDA_MEMPOOL_SIZE couldn't be parsed properly!\n";
+        // Not permitted
+        return false;

bjoo

So I can change it to that. Currently, false is returned but error_code is still cudaSuccess, and the error is dealt with as an exception on line 320. The logic is that this routine returns false if either a CUDA API call failed or the parsing failed. A CUDA API failure can be detected by looking at error_code.

However, I am happy to accommodate whichever way you prefer to propagate the error to the user. I can report the error here, or at the point of raising the exception.

bjoo · 2024-05-27T18:11:30Z

Added a brief report here about not being able to parse units in af41845
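For illustration, here is a minimal sketch of parsing a size string and reporting what went wrong on std::cerr. It is not the code from the PR: parse_pool_size is a hypothetical name, and the accepted suffixes (B, K, M, G as powers of 1024) are an assumption. Note that std::strtod signals "no number found" through its end pointer rather than by throwing.

```cpp
#include <cstdlib>
#include <iostream>
#include <optional>
#include <string>

// Hypothetical parser for strings such as "64", "512M" or "4.2G".
// The suffix set and the 1024-based factors are assumptions of this sketch.
std::optional<double> parse_pool_size(const std::string& text) {
  const char* begin = text.c_str();
  char* end         = nullptr;
  const double value = std::strtod(begin, &end);
  if (end == begin) {  // strtod consumed nothing: no leading number
    std::cerr << "KOKKOS_CUDA_MEMPOOL_SIZE: no number in '" << text << "'\n";
    return std::nullopt;
  }

  const std::string unit(end);  // whatever follows the number
  double factor = 1.0;
  if (unit.empty() || unit == "B") {
    factor = 1.0;
  } else if (unit == "K") {
    factor = 1024.0;
  } else if (unit == "M") {
    factor = 1024.0 * 1024.0;
  } else if (unit == "G") {
    factor = 1024.0 * 1024.0 * 1024.0;
  } else {
    std::cerr << "KOKKOS_CUDA_MEMPOOL_SIZE: unknown unit '" << unit << "'\n";
    return std::nullopt;
  }
  return value * factor;  // range checks are left to the caller
}
```

Returning an empty optional keeps the parse failure separate from any CUDA error code, which is one way to address the "false but cudaSuccess" ambiguity discussed above.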

masterleinad · 2024-05-24T20:15:19Z

core/src/Cuda/Kokkos_CudaSpace.cpp

+
+  requested_size *= factor;
+  size_t n_bytes = static_cast<size_t>(std::ceil(requested_size));
+  if (!(n_bytes > 0)) return false;


size_t is unsigned, so this is equivalent to

Suggested change

-  if (!(n_bytes > 0)) return false;
+  if (n_bytes == 0) return false;

but do we need this check at all, or would you rather want to check that requested_size is not greater than the largest number representable by size_t?

bjoo · 2024-05-27T17:25:15Z

So I was allowing for the possibility of a negative original amount in the double. But you are right: after the cast it should be an unsigned number. I guess if the amount is negative we'd get a bad cast exception. I should maybe check for the negative value some other way.

bjoo · 2024-05-27T18:03:04Z

Should be resolved in: 7e6b25a

masterleinad · 2024-05-24T20:16:34Z

core/src/Cuda/Kokkos_CudaSpace.cpp

+  std::cout << "Initializing Default Memory Pool for device " << device_id
+            << "\n";


bjoo

What is the comment here? It seems empty.

masterleinad

I meant for you to avoid printing to std::cout.

Suggested change

-  std::cout << "Initializing Default Memory Pool for device " << device_id
-            << "\n";

bjoo

Hi @masterleinad
I accepted the cosmetic changes and pushed changes to deal with some of the error reporting. One real issue remains, I guess, which is the 'safe call' vs. my return false approach. The latter is designed to return false if there is either a parse error or a CUDA error. The CUDA error is reflected in the value of error_code. Let me know how you want to deal with that; I can do whichever way is best for you.

masterleinad · 2024-05-28T14:29:57Z

benchmarks/async_alloc/async_alloc.cpp

+  for (size_t num : sizes) {
+    inner_loop_timer.reset();
+    for (int i = 0; i < iters; i++) {
+      Kokkos::View<float *, MemorySpace> a("unlabeled", num);


Suggested change

-      Kokkos::View<float *, MemorySpace> a("unlabeled", num);
+      Kokkos::View<float *, MemorySpace> a(Kokkos::view_alloc(Kokkos::WithoutInitializing, "unlabeled"), num);

You don't want to measure initialization here but only allocation, right?

bjoo

Right. That is a good suggestion. I should accept this and redraw my graph.

masterleinad · 2024-05-28T14:38:05Z

benchmarks/async_alloc/async_alloc.cpp

+    inner_loop_times.push_back(std::make_pair<>(
+        num * sizeof(float), inner_loop_time / static_cast<double>(iters)));


Suggested change

-    inner_loop_times.push_back(std::make_pair<>(
-        num * sizeof(float), inner_loop_time / static_cast<double>(iters)));
+    inner_loop_times.emplace_back(
+        num * sizeof(float), inner_loop_time / static_cast<double>(iters));

bjoo

Yeah... Maybe... I think push_back is just as good in this instance?

masterleinad

Just more concise. 🙂
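A small sketch of the two spellings side by side, assuming the container holds std::pair<std::size_t, double> (the element type is not shown in the hunk, and record_both is a hypothetical helper). Both calls append the same element; emplace_back simply forwards the two values to the pair constructor.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Assumed element type: (bytes allocated, seconds per iteration).
using Sample = std::pair<std::size_t, double>;

// Hypothetical helper: records the same sample with both spellings.
std::vector<Sample> record_both(std::size_t num, double total_time, int iters) {
  std::vector<Sample> samples;
  const double per_iter = total_time / static_cast<double>(iters);
  // Builds a temporary pair, then moves it into the vector.
  samples.push_back(std::make_pair(num * sizeof(float), per_iter));
  // Constructs the pair in place from the two arguments.
  samples.emplace_back(num * sizeof(float), per_iter);
  return samples;
}
```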

masterleinad · 2024-05-28T14:40:41Z

benchmarks/async_alloc/async_alloc.cpp

+//
+std::pair<double, double> test(bool up) {
+  int iters      = 50;
+  size_t minimum = 8 / sizeof(float);  // 64K


So the smallest allocation will use 8 bytes.

bjoo

Yes. Originally I was going for more. I see 'comment-rot' there: the comment about 64K is not true. I wanted to start there, but others suggested 8 bytes. I will fix it.

masterleinad · 2024-05-28T14:44:09Z

benchmarks/async_alloc/async_alloc.cpp

+  // Check the env var for reporting
+  char *env_string = getenv("KOKKOS_CUDA_MEMPOOL_SIZE");
+  std::cout << "Async Malloc Benchmark: KOKKOS_CUDA_MEMPOOL_SIZE is ";
+
+  if (env_string == nullptr)
+    std::cout << "not set,";
+  else
+    std::cout << " " << env_string << ",";


Suggested change

-  // Check the env var for reporting
-  char *env_string = getenv("KOKKOS_CUDA_MEMPOOL_SIZE");
-  std::cout << "Async Malloc Benchmark: KOKKOS_CUDA_MEMPOOL_SIZE is ";
-
-  if (env_string == nullptr)
-    std::cout << "not set,";
-  else
-    std::cout << " " << env_string << ",";
+#ifdef KOKKOS_ENABLE_CUDA
+  // Check the env var for reporting
+  char *env_string = getenv("KOKKOS_CUDA_MEMPOOL_SIZE");
+  std::cout << "Async Malloc Benchmark: KOKKOS_CUDA_MEMPOOL_SIZE is ";
+
+  if (env_string == nullptr)
+    std::cout << "not set,";
+  else
+    std::cout << ' ' << env_string << ',';
+#endif

to be more explicit that this only makes sense if we are actually testing with the Cuda backend, and we might want to add similar logic for other backends.

masterleinad · 2024-05-28T15:12:55Z

core/src/Cuda/Kokkos_CudaSpace.cpp

+  // Now we allocate the pool size amount + a little more (64 bytes is arbitrary
+  // it just needs to be more than the retention size.


How do you know that this will not overflow? Maybe take min(n_bytes + 64, std::numeric_limits<size_t>::max())?

bjoo

Good catch... Thanks!
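One detail worth noting: unsigned arithmetic wraps around, so n_bytes + 64 has already wrapped by the time min() sees it. A sketch of a clamped addition that compares before adding is shown below; saturating_add is a hypothetical helper name, not something from the PR.

```cpp
#include <cstddef>
#include <limits>

// Hypothetical helper: n_bytes + extra, clamped to the largest size_t.
// The comparison happens before the addition, so the sum can never wrap.
constexpr std::size_t saturating_add(std::size_t n_bytes, std::size_t extra) {
  constexpr std::size_t max = std::numeric_limits<std::size_t>::max();
  return (n_bytes > max - extra) ? max : n_bytes + extra;
}
```

With this, the padded allocation size would be saturating_add(n_bytes, 64).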

masterleinad · 2024-05-28T15:13:43Z

core/src/Cuda/Kokkos_CudaSpace.cpp

+
+  // At this point requested_size should be appropriate
+  // neither too big nor negative.
+  size_t n_bytes = static_cast<size_t>(std::ceil(requested_size));


This could be larger than the largest representable size_t, though.

bjoo

On the other hand, if I floor it we may be short. I can set the check on requested_size to be < max.
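A sketch of how the range check could be done before the cast, keeping the ceil. Converting a double that is negative or too large to size_t is undefined behavior in C++ rather than an exception, so the checks have to run on the double. to_byte_count is a hypothetical helper, not the PR's code.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <optional>

// Hypothetical helper: round a byte count up and convert it to size_t,
// or return an empty optional if the result would not be representable.
std::optional<std::size_t> to_byte_count(double requested_size) {
  // Reject NaN, infinities, zero and negative values up front.
  if (!std::isfinite(requested_size) || requested_size <= 0.0) {
    return std::nullopt;
  }
  const double rounded = std::ceil(requested_size);
  // 2^digits is one past the largest size_t and is exactly representable
  // as a double, unlike the maximum value itself on 64-bit platforms.
  const double limit =
      std::ldexp(1.0, std::numeric_limits<std::size_t>::digits);
  if (rounded >= limit) return std::nullopt;
  return static_cast<std::size_t>(rounded);
}
```

Comparing against 2^digits with >= avoids relying on static_cast<double>(max), which rounds up to 2^64 for a 64-bit size_t.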

masterleinad · 2024-05-28T15:14:06Z

core/src/Cuda/Kokkos_CudaSpace.cpp

+  // Now we free the memory, and our poolsize amount will be retained in the
+  // pool


Suggested change

-  // Now we free the memory, and our poolsize amount will be retained in the
-  // pool
+  // Now we free the memory, and our pool size amount will be retained in the
+  // pool

masterleinad · 2024-05-28T15:17:09Z

core/src/Cuda/Kokkos_CudaSpace.cpp

+  // Always use the Async. Don't use the 40K Lower limit - may well be arch
+  // dependent.


We should check cudaDevAttrMemoryPoolsSupported before going here, see kokkos/kokkos-core-wiki#525 (comment), but probably not in this pull request.

masterleinad · 2024-05-28T15:20:00Z

core/src/Cuda/Kokkos_CudaSpace.cpp

+        // if the API calls all returned cudaSuccess, the strtod() could
+        // have thrown an exception which we caught. In this case we
+        // throw a dumb exception here


Suggested change

-        // if the API calls all returned cudaSuccess, the strtod() could
-        // have thrown an exception which we caught. In this case we
-        // throw a dumb exception here
+        // if the API calls all returned cudaSuccess, parsing KOKKOS_CUDA_MEMPOOL_SIZE
+        // could have failed. In this case, we throw a dumb exception here

crtrott

Blocking for now: I am not convinced that this is the right programmatic path forward for this for a couple reasons.

I am not comfortable adding a bunch of backend-specific environment variables and command line options for Kokkos. This very quickly becomes untestable, in particular if people then start to ask for all the other options of things like the default memory pool.
I am not convinced that making the default allocators behave more like actual memory pools is the right thing to do. Generally, I want them to be just "system allocators".
People would need to understand the implications of what that threshold does and my gut feeling is that in isolation it just becomes a "make it larger and then allocations are fast" without folks realizing what that means (e.g. that this is a fixed overhead on the GPU potentially causing issues in multi-tenant cases).
I think we should expose memory pool functionality as memory spaces which advertise that they are backed by a memory pool and which you pass around. These can then have all the typical memory pool properties, and we could map them to backend-specific ones if available instead of writing our own implementation.

crtrott · 2024-05-28T20:14:50Z

We may wanna organize a meeting for a larger discussion.

bjoo · 2024-05-29T13:12:59Z

Hi Christian, Thanks for your comments and reasoning. I will reach out to you on the slack or direct email, and we can organize a meeting. Best wishes, B

bjoo added 3 commits May 24, 2024 13:58

Modifications for cudaMallocAsync PR

8db8d73

Made async alloc work with default memory space

cc00fb4

Made maximum contingent on precision

cd0a2eb

- 1GiB for 32 bit - 16 GiB for 64 bit - removed comparison of NULL with Ox0 - reapplied clang-format

masterleinad reviewed May 24, 2024

bjoo and others added 7 commits May 27, 2024 13:29

Update core/src/Cuda/Kokkos_CudaSpace.cpp

4136a5a

Co-authored-by: Daniel Arndt <arndtd@ornl.gov>

Update core/src/Cuda/Kokkos_CudaSpace.cpp

7126181

Accepting suggestion from @masterleinad re printing info message. Co-authored-by: Daniel Arndt <arndtd@ornl.gov>

Update core/src/Cuda/Kokkos_CudaSpace.cpp

987f4df

Accepting change from @masterleinad Co-authored-by: Daniel Arndt <arndtd@ornl.gov>

Update core/src/Cuda/Kokkos_CudaSpace.cpp

105d45e

Accepting suggested change from @masterleinad Co-authored-by: Daniel Arndt <arndtd@ornl.gov>

Update core/src/Cuda/Kokkos_CudaSpace.cpp

cdf688c

Accepting change fuggested by @masterleinad Co-authored-by: Daniel Arndt <arndtd@ornl.gov>

Changes from review by @masterleinad

7e6b25a

- Formatting (spaces etc. accepted on page) - Formatting required size (check for nonpositivity or being too big) - Remaining issue: error_code propagation strategy

Added extra error reporting

af41845

bjoo commented May 27, 2024

masterleinad reviewed May 28, 2024

crtrott requested changes May 28, 2024

bjoo added 3 commits June 3, 2024 15:52

Changed name of non-CUDA exe

d683e81

Addressed further comments from @masterleinad

85d4173

Removed a print

9653174
