Implement pool of USM IndirectKernelMemory #4596
Conversation
Can one of the admins verify this patch?
Add to whitelist
Force-pushed from db89995 to 2eda84c
Retest this please.
Force-pushed from 2eda84c to c95a7ee
I'm marking this as a draft until the following issues are addressed:
// TODO 0x1440 = 5184, arbitrary, larger than largest encountered kernel.
usm_mem.reserve(0x1440);
I'm not quite sure that we need to preallocate memory. Do you have a good reason for that?
As discussed offline, I think we need to see some performance figures for typical cases, in case repeated reallocation hurts performance.
This might hurt performance in a bad case; we should profile & check.
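For context, here is a minimal sketch of what a grow-only reserve() could look like, assuming the allocation lives in device USM. This is illustrative only, not the actual USMObjectMem code; UsmScratch and its members are made-up names, and synchronization before freeing the old allocation is omitted. With such a scheme, preallocating once above the largest expected kernel size means no launch pays for a reallocation:

```cpp
#include <sycl/sycl.hpp>
#include <cstddef>

// Hypothetical grow-only USM scratch buffer (illustrative sketch).
class UsmScratch {
 public:
  explicit UsmScratch(sycl::queue q) : m_q(q) {}

  ~UsmScratch() {
    if (m_data) sycl::free(m_data, m_q);
  }

  void reserve(std::size_t n) {
    if (n <= m_capacity) return;  // already big enough: no reallocation
    // A real implementation must first ensure no in-flight kernel still
    // reads from the old allocation before freeing it.
    if (m_data) sycl::free(m_data, m_q);
    m_data     = sycl::malloc_device(n, m_q);  // device USM allocation
    m_capacity = n;
  }

  void* data() const { return m_data; }

 private:
  sycl::queue m_q;
  void*       m_data     = nullptr;
  std::size_t m_capacity = 0;
};
```

Whether the up-front reserve is worth it depends on how often functor sizes actually exceed previously seen sizes, which is what the profiling suggested above would show.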
Force-pushed from bf7b20e to a0ef3e4
No longer need IndirectReducerMem
Kernels associated with functors which rely on IndirectKernelMemory must wait for the memcpy to finish. Previously (prior to the USM pool), this was implemented as a host-side fence on the copy operation. A temporary solution was to expose the copy event (m_copy_event) as a public member of USMObjectMem, but this wasn't ideal.

Now SYCLFunctionWrapper has a method which returns the memcpy event, as does USMObjectMem. For trivially copyable kernels, the returned event is a default-constructed one (which is immediately 'ready') and so won't incur a wait. Note that SYCLFunctionWrapper now has the associated USMObjectMem as a member; as such, the 'Storage' argument to register_event is superfluous and should probably be removed.
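As a rough illustration of that design (simplified, made-up names rather than the exact Kokkos interface): the wrapper for trivially copyable functors can hand back a default-constructed sycl::event, which is immediately complete, while the indirect-memory path returns the event of the q.memcpy that staged the functor bytes in device USM:

```cpp
#include <sycl/sycl.hpp>
#include <type_traits>

// Illustrative sketch only; not the Kokkos SYCLFunctionWrapper API.
template <typename Functor,
          bool TriviallyCopyable = std::is_trivially_copyable_v<Functor>>
class FunctionWrapperSketch;

// Trivially copyable functors are passed by value; no copy to USM is needed.
template <typename Functor>
class FunctionWrapperSketch<Functor, true> {
  Functor m_functor;

 public:
  FunctionWrapperSketch(const Functor& f, sycl::queue&) : m_functor(f) {}
  const Functor& get_functor() const { return m_functor; }
  // Default-constructed event: already complete, so waiting on it is free.
  sycl::event get_copy_event() const { return {}; }
};

// Other functors are staged in device USM memory first.
template <typename Functor>
class FunctionWrapperSketch<Functor, false> {
  Functor*    m_device_copy;
  sycl::event m_copy_event;

 public:
  FunctionWrapperSketch(const Functor& f, sycl::queue& q)
      : m_device_copy(sycl::malloc_device<Functor>(1, q)),
        m_copy_event(q.memcpy(m_device_copy, &f, sizeof(Functor))) {}
  // The returned reference points to device memory and is only meaningful
  // inside a kernel. (Cleanup of the allocation is omitted for brevity.)
  const Functor& get_functor() const { return *m_device_copy; }
  // Kernels using this functor make their command group depend on this event.
  sycl::event get_copy_event() const { return m_copy_event; }
};
```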
Looks already pretty good to me. I see better performance with this already (storing the kernel in device memory), but we should evaluate together with #4627 for the best default storage location (possibly device-dependent).
@@ -94,9 +95,13 @@ class Kokkos::Impl::ParallelFor<FunctorType, Kokkos::RangePolicy<Traits...>,
  FunctorWrapperRangePolicyParallelFor<Functor, Policy> f{policy.begin(),
                                                          functor};
  sycl::range<1> range(policy.end() - policy.begin());
  cgh.depends_on(memcpy_events);
Do we still need a std::vector of events here?
No, not really. I just changed it for consistency with the other versions of sycl_direct_launch. I can revert if you prefer.
I think I would prefer just passing the single event directly instead of creating a one-element std::vector, if possible.
No problem, I'll change this & the other occurrences.
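For reference, a small self-contained toy example (not Kokkos code) showing that sycl::handler::depends_on accepts a single event directly, so the one-element std::vector isn't needed:

```cpp
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  int* data      = sycl::malloc_device<int>(64, q);
  int  host[64]  = {};

  // The copy returns a single event...
  sycl::event memcpy_event = q.memcpy(data, host, sizeof(host));

  q.submit([&](sycl::handler& cgh) {
    // ...and depends_on takes it directly; no need to wrap it in a
    // std::vector<sycl::event> with one element.
    cgh.depends_on(memcpy_event);
    cgh.parallel_for(sycl::range<1>(64), [=](sycl::id<1> i) { data[i] += 1; });
  });

  q.wait();
  sycl::free(data, q);
  return 0;
}
```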
});
// This barrier would prevent q.memcpy for subsequent kernels from being
// brought forward in time.
// q.submit_barrier(std::vector<sycl::event>{parallel_for_event});
I think this should be reverted in the end (at the moment it doesn't matter too much because we create queues as in-order as a workaround for other issues).
We should make this (at least the part that doesn't involve improving how we deal with the memory events) a priority for the release.
I'm keen to get this out the door too. I think, functionally, the only remaining thing we discussed was to check the impact of 1) pre-allocation size (if any) and 2) pool size. I can do that shortly.
@joeatodd Any news here? The window for the next release is about to close.
Based on some tests with LAMMPS, I think that a relatively small pool size (2-4) is sufficient. There's a slight performance improvement moving from pool size 2 to pool size 4, but none beyond that. Given the modest memory footprint of 4 USM allocations of a few kilobytes, I think we should opt for 4. @masterleinad are you happy for me to simply hardcode the pool size to 4 for now & rebase?
Yes, that sounds good to me.
The Jenkins pipeline is still failing and I'm not really sure why. Any help appreciated!
- no loop in reductions
- no references allowed in kernels
- rename atomic_wrapping_fetch_inc
Looks mostly good to me.
Looks OK to me.
This PR implements a round-robin pool of USM device allocations to be used as IndirectKernelMem. This avoids significant host runtime overhead associated with patterns in which each kernel submission has to wait for the q.memcpy of its functor into the single IndirectKernelMem allocation before the next copy can be issued. With the USM pool, these operations can all be queued by Kokkos on the host before the first q.memcpy finishes. This gave us around a 5% performance improvement for a LAMMPS Tersoff run with 6.5M particles on 2 GPUs, and significantly improved the scalability when moving to more nodes.

Notes

For now, this is implemented for IndirectKernelMem but not IndirectReducerMem. It might be worth investigating whether the pool would also improve performance there.

At present, the memcpy event is stored as a public member of USMObjectMem. This should probably be private, with an API to access it. @masterleinad suggests that the event should be returned from copy_from. I am happy with either/both.
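A rough sketch of the round-robin idea described above, with made-up names (StagingSlot, IndirectKernelMemPool) and the bookkeeping for safe slot reuse omitted; the pool size of 4 follows the discussion earlier in this thread:

```cpp
#include <sycl/sycl.hpp>
#include <array>
#include <cstddef>

// Illustrative sketch only; not the actual Kokkos classes.
struct StagingSlot {
  void*       data     = nullptr;
  std::size_t capacity = 0;
  sycl::event copy_event;  // completion of the last memcpy into this slot
};

class IndirectKernelMemPool {
  static constexpr std::size_t pool_size = 4;  // value chosen in the discussion above
  sycl::queue m_q;
  std::array<StagingSlot, pool_size> m_slots;
  std::size_t m_next = 0;

 public:
  explicit IndirectKernelMemPool(sycl::queue q) : m_q(q) {}

  ~IndirectKernelMemPool() {
    for (auto& s : m_slots)
      if (s.data) sycl::free(s.data, m_q);
  }

  // Copy 'bytes' bytes of functor data into the next slot in rotation and
  // remember the copy event; the caller makes its kernel depend on it.
  StagingSlot& stage(const void* src, std::size_t bytes) {
    StagingSlot& slot = m_slots[m_next];
    m_next = (m_next + 1) % pool_size;  // round-robin
    // Before reusing a slot, a real implementation must wait for the
    // kernel(s) that read from it; that bookkeeping is omitted here.
    if (bytes > slot.capacity) {
      if (slot.data) sycl::free(slot.data, m_q);
      slot.data     = sycl::malloc_device(bytes, m_q);
      slot.capacity = bytes;
    }
    slot.copy_event = m_q.memcpy(slot.data, src, bytes);
    return slot;
  }
};
```

Because each slot carries its own copy event, a kernel only needs to depend on the event of the slot its functor was staged in, rather than fencing on the host, which is what allows several memcpy/submit pairs to be queued back to back.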