Implement pool of USM IndirectKernelMemory #4596
Conversation
Can one of the admins verify this patch?
Add to whitelist
Force-pushed from db89995 to 2eda84c
Retest this please.
Force-pushed from 2eda84c to c95a7ee
I'm marking this as a draft until the following issues are addressed:
// TODO 0x1440 = 5184, arbitrary, larger than largest encountered kernel.
usm_mem.reserve(0x1440);
I'm not quite sure that we need to preallocate memory. Do you have a good reason for that?
As discussed offline, I think we need to see some performance figures for typical cases, in case repeated reallocation hurts performance.
This might hurt performance in a bad case; we should profile & check.
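For context, here is a minimal sketch of what a grow-only reserve() could look like, assuming the allocation lives in device USM. This is illustrative only, not the actual USMObjectMem code; UsmScratch and its members are made-up names, and synchronization before freeing the old allocation is omitted. With such a scheme, preallocating once above the largest expected kernel size means no launch pays for a reallocation:

```cpp
#include <sycl/sycl.hpp>
#include <cstddef>

// Hypothetical grow-only USM scratch buffer (illustrative sketch).
class UsmScratch {
 public:
  explicit UsmScratch(sycl::queue q) : m_q(q) {}

  ~UsmScratch() {
    if (m_data) sycl::free(m_data, m_q);
  }

  void reserve(std::size_t n) {
    if (n <= m_capacity) return;  // already big enough: no reallocation
    // A real implementation must first ensure no in-flight kernel still
    // reads from the old allocation before freeing it.
    if (m_data) sycl::free(m_data, m_q);
    m_data     = sycl::malloc_device(n, m_q);  // device USM allocation
    m_capacity = n;
  }

  void* data() const { return m_data; }

 private:
  sycl::queue m_q;
  void*       m_data     = nullptr;
  std::size_t m_capacity = 0;
};
```

Whether the up-front reserve is worth it depends on how often functor sizes actually exceed previously seen sizes, which is what the profiling suggested above would show.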
Force-pushed from bf7b20e to a0ef3e4
No longer need IndirectReducerMem
Kernels associated with functors which rely on IndirectKernelMemory must wait for the memcpy to finish. Previously (prior to the USM pool), this was implemented as a host-side fence on the copy operation. A temporary solution was to expose the copy event (m_copy_event) as a public member of USMObjectMem, but this wasn't ideal.

Now SYCLFunctionWrapper has a method which returns the memcpy event, as does USMObjectMem. For trivially copyable kernels, the returned event is a default-constructed one (which is immediately 'ready') and so won't incur a wait. Note that SYCLFunctionWrapper now has the associated USMObjectMem as a member; as such, the 'Storage' argument to register_event is superfluous and should probably be removed.
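As a rough illustration of that design (simplified, made-up names rather than the exact Kokkos interface): the wrapper for trivially copyable functors can hand back a default-constructed sycl::event, which is immediately complete, while the indirect-memory path returns the event of the q.memcpy that staged the functor bytes in device USM:

```cpp
#include <sycl/sycl.hpp>
#include <type_traits>

// Illustrative sketch only; not the Kokkos SYCLFunctionWrapper API.
template <typename Functor,
          bool TriviallyCopyable = std::is_trivially_copyable_v<Functor>>
class FunctionWrapperSketch;

// Trivially copyable functors are passed by value; no copy to USM is needed.
template <typename Functor>
class FunctionWrapperSketch<Functor, true> {
  Functor m_functor;

 public:
  FunctionWrapperSketch(const Functor& f, sycl::queue&) : m_functor(f) {}
  const Functor& get_functor() const { return m_functor; }
  // Default-constructed event: already complete, so waiting on it is free.
  sycl::event get_copy_event() const { return {}; }
};

// Other functors are staged in device USM memory first.
template <typename Functor>
class FunctionWrapperSketch<Functor, false> {
  Functor*    m_device_copy;
  sycl::event m_copy_event;

 public:
  FunctionWrapperSketch(const Functor& f, sycl::queue& q)
      : m_device_copy(sycl::malloc_device<Functor>(1, q)),
        m_copy_event(q.memcpy(m_device_copy, &f, sizeof(Functor))) {}
  // The returned reference points to device memory and is only meaningful
  // inside a kernel. (Cleanup of the allocation is omitted for brevity.)
  const Functor& get_functor() const { return *m_device_copy; }
  // Kernels using this functor make their command group depend on this event.
  sycl::event get_copy_event() const { return m_copy_event; }
};
```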
Looks already pretty good to me. I see better performance with this already (storing the kernel in device memory), but we should evaluate together with #4627 for the best default storage location (possibly device-dependent).
@@ -94,9 +95,13 @@ class Kokkos::Impl::ParallelFor<FunctorType, Kokkos::RangePolicy<Traits...>,
  FunctorWrapperRangePolicyParallelFor<Functor, Policy> f{policy.begin(),
                                                          functor};
  sycl::range<1> range(policy.end() - policy.begin());
  cgh.depends_on(memcpy_events);
Do we still need a std::vector of events here?
No, not really. I just changed it for consistency with the other versions of sycl_direct_launch. I can revert if you prefer.
I think I would prefer just passing the single event directly instead of creating a one-element std::vector, if possible.
No problem, I'll change this & the other occurrences.
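For reference, a small self-contained toy example (not Kokkos code) showing that sycl::handler::depends_on accepts a single event directly, so the one-element std::vector isn't needed:

```cpp
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  int* data      = sycl::malloc_device<int>(64, q);
  int  host[64]  = {};

  // The copy returns a single event...
  sycl::event memcpy_event = q.memcpy(data, host, sizeof(host));

  q.submit([&](sycl::handler& cgh) {
    // ...and depends_on takes it directly; no need to wrap it in a
    // std::vector<sycl::event> with one element.
    cgh.depends_on(memcpy_event);
    cgh.parallel_for(sycl::range<1>(64), [=](sycl::id<1> i) { data[i] += 1; });
  });

  q.wait();
  sycl::free(data, q);
  return 0;
}
```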
});
// This barrier would prevent q.memcpy for subsequent kernels from being
// brought forward in time.
// q.submit_barrier(std::vector<sycl::event>{parallel_for_event});
I think this should be reverted in the end (at the moment it doesn't matter too much because we create queues as in-order as a workaround for other issues).
We should make this (at least the part that doesn't involve improving how we deal with the memory events) a priority for the release.
I'm keen to get this out the door too. I think, functionally, the only remaining thing we discussed was to check the impact of 1) pre-allocation size (if any) and 2) pool size. I can do that shortly.
@joeatodd Any news here? The window for the next release is about to close.
Based on some tests with LAMMPS, I think that a relatively small pool size (2-4) is sufficient. There's a slight performance improvement moving from pool size 2 to pool size 4, but none beyond that. Given the modest memory footprint of 4 USM allocations of a few kilobytes, I think we should opt for 4. @masterleinad are you happy for me to simply hardcode the pool size to 4 for now & rebase?
Yes, that sounds good to me.
The Jenkins pipeline is still failing and I'm not really sure why. Any help appreciated!
- no loop in reductions
- no references allowed in kernels
- rename atomic_wrapping_fetch_inc
Looks mostly good to me.
Looks OK to me.
This PR implements a round-robin pool of USM device allocations to be used as IndirectKernelMem. This avoids significant host runtime overhead associated with patterns in which each kernel submission has to wait for the q.memcpy of its functor into the single IndirectKernelMem allocation before the next copy can be issued. With the USM pool, these operations can all be queued by Kokkos on the host before the first q.memcpy finishes. This gave us around a 5% performance improvement for a LAMMPS Tersoff run with 6.5M particles on 2 GPUs, and significantly improved the scalability when moving to more nodes.

Notes

For now, this is implemented for IndirectKernelMem but not IndirectReducerMem. It might be worth investigating whether the pool would also improve performance there.

At present, the memcpy event is stored as a public member of USMObjectMem. This should probably be private, with an API to access it. @masterleinad suggests that the event should be returned from copy_from. I am happy with either/both.
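A rough sketch of the round-robin idea described above, with made-up names (StagingSlot, IndirectKernelMemPool) and the bookkeeping for safe slot reuse omitted; the pool size of 4 follows the discussion earlier in this thread:

```cpp
#include <sycl/sycl.hpp>
#include <array>
#include <cstddef>

// Illustrative sketch only; not the actual Kokkos classes.
struct StagingSlot {
  void*       data     = nullptr;
  std::size_t capacity = 0;
  sycl::event copy_event;  // completion of the last memcpy into this slot
};

class IndirectKernelMemPool {
  static constexpr std::size_t pool_size = 4;  // value chosen in the discussion above
  sycl::queue m_q;
  std::array<StagingSlot, pool_size> m_slots;
  std::size_t m_next = 0;

 public:
  explicit IndirectKernelMemPool(sycl::queue q) : m_q(q) {}

  ~IndirectKernelMemPool() {
    for (auto& s : m_slots)
      if (s.data) sycl::free(s.data, m_q);
  }

  // Copy 'bytes' bytes of functor data into the next slot in rotation and
  // remember the copy event; the caller makes its kernel depend on it.
  StagingSlot& stage(const void* src, std::size_t bytes) {
    StagingSlot& slot = m_slots[m_next];
    m_next = (m_next + 1) % pool_size;  // round-robin
    // Before reusing a slot, a real implementation must wait for the
    // kernel(s) that read from it; that bookkeeping is omitted here.
    if (bytes > slot.capacity) {
      if (slot.data) sycl::free(slot.data, m_q);
      slot.data     = sycl::malloc_device(bytes, m_q);
      slot.capacity = bytes;
    }
    slot.copy_event = m_q.memcpy(slot.data, src, bytes);
    return slot;
  }
};
```

Because each slot carries its own copy event, a kernel only needs to depend on the event of the slot its functor was staged in, rather than fencing on the host, which is what allows several memcpy/submit pairs to be queued back to back.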