
Implement pool of USM IndirectKernelMemory #4596

Merged · 27 commits into kokkos:develop · Feb 9, 2022
Conversation

@joeatodd (Contributor) commented Dec 7, 2021:

This PR implements a round-robin pool of USM device allocations to be used as IndirectKernelMem. This avoids significant host runtime overhead associated with patterns like:

  • Submit q.memcpy for Kernel A's functor to IndirectKernelMem
  • Wait for q.memcpy
  • Submit Kernel A
  • Wait for Kernel A
  • Submit q.memcpy for Kernel B's functor
  • etc...

With the USM pool, these operations can all be queued by Kokkos on the host before the first q.memcpy finishes. This gave us around a 5% performance improvement for a LAMMPS Tersoff run with 6.5M particles on 2 GPUs, and significantly improved the scalability when moving to more nodes.
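The pool mechanics are simple; below is a minimal host-side sketch (not the actual Kokkos code — class and member names are hypothetical, and a plain std::vector<std::byte> stands in for a sycl::malloc_device allocation):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of a round-robin pool of per-kernel allocations.
// In the real backend each buffer would be a USM device allocation.
class IndirectKernelMemPool {
 public:
  IndirectKernelMemPool(std::size_t pool_size, std::size_t initial_bytes)
      : m_buffers(pool_size, std::vector<std::byte>(initial_bytes)) {}

  // Each kernel submission grabs the next buffer in round-robin order, so
  // the q.memcpy for kernel B can be queued while kernel A (using a
  // different buffer) is still in flight.
  std::vector<std::byte>& next_buffer(std::size_t required_bytes) {
    std::vector<std::byte>& buf = m_buffers[m_next];
    m_next = (m_next + 1) % m_buffers.size();
    if (buf.size() < required_bytes) buf.resize(required_bytes);  // grow lazily
    return buf;
  }

 private:
  std::vector<std::vector<std::byte>> m_buffers;
  std::size_t m_next = 0;
};
```

With a pool of N buffers, up to N kernels' functor copies can be queued before a buffer is reused.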

Notes

For now, this is implemented for IndirectKernelMem but not IndirectReducerMem. It might be worth investigating whether a pool would also improve performance there.

At present, the memcpy event is stored as a public member of USMObjectMem. This should probably be made private behind an API. @masterleinad suggests that the event should be returned from copy_from. I am happy with either/both.

@dalg24-jenkins (Collaborator):

Can one of the admins verify this patch?

@dalg24 (Member) commented Dec 7, 2021:

Add to whitelist

@masterleinad (Contributor):

Retest this please.

@joeatodd (Contributor, Author) commented Dec 9, 2021:

I'm marking this as a draft until the following issues are addressed:

  • Pick a better default allocation size, and handle the case where a kernel doesn't fit (reallocate a larger buffer).
  • Copy synchronously to a local staging buffer to avoid issues when the functor goes out of scope before q.memcpy completes.
  • Better handling of m_copy_event.
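The second item can be illustrated with a small sketch (hypothetical names; std::async and host vectors stand in for q.memcpy and USM buffers): the functor is first copied synchronously into a pool-owned staging buffer, so the asynchronous transfer never reads the caller's functor, which may already be out of scope.

```cpp
#include <cstring>
#include <future>
#include <vector>

struct Functor {  // stand-in for a user functor passed to parallel_for
  int data[8];
};

// Copy the functor synchronously into the pool-owned staging buffer, then
// launch the asynchronous transfer from that buffer. After this returns,
// the caller's functor may safely go out of scope.
std::future<void> stage_and_copy(const Functor& f, std::vector<char>& staging,
                                 std::vector<char>& device) {
  staging.resize(sizeof(Functor));
  std::memcpy(staging.data(), &f, sizeof(Functor));  // synchronous: safe
  // The async "memcpy" reads only the staging buffer, never `f`.
  return std::async(std::launch::async, [&staging, &device] {
    device.assign(staging.begin(), staging.end());
  });
}
```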

@joeatodd joeatodd marked this pull request as draft December 9, 2021 20:20
Comment on lines 157 to 158
// TODO: 0x1440 = 5184; arbitrary, larger than the largest kernel encountered.
usm_mem.reserve(0x1440);
Contributor:
I'm not quite sure why we preallocate memory here. Do you have a good reason for that?

Contributor Author:

As discussed offline, I think we need to see some performance figures for typical cases, in case repeated reallocation hurts performance.

No longer need IndirectReducerMem
Kernels associated with functors which rely on IndirectKernelMemory
must wait for the memcpy to finish. Previously (prior to USM Pool),
this was implemented as a host-side fence on the copy operation.

A temporary solution was to expose the copy event (m_copy_event) as a
public member of USMObjectMem, but this wasn't ideal. Now both
SYCLFunctionWrapper and USMObjectMem provide a method which returns the
memcpy event.

For trivially copyable kernels, the returned event is a
default-constructed one (which is immediately 'ready') and so won't
incur a wait.

Note that SYCLFunctionWrapper now has the associated USMObjectMem as a
member, so the 'Storage' argument to register_event is superfluous and
should probably be removed.
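The scheme described above can be sketched as follows (a hypothetical mock, with std::shared_future<void> standing in for sycl::event): trivially copyable functors need no staging copy, so the wrapper hands back an already-ready event and a dependent kernel never waits.

```cpp
#include <future>
#include <type_traits>

// An "event" that is ready immediately, mimicking a default-constructed
// sycl::event.
inline std::shared_future<void> ready_event() {
  std::promise<void> p;
  p.set_value();
  return p.get_future().share();
}

// Sketch of the wrapper's accessor: for trivially copyable functors no
// memcpy was issued, so the returned event incurs no wait; otherwise the
// event of the staging copy is returned.
template <typename Functor>
std::shared_future<void> get_copy_event(std::shared_future<void> memcpy_event) {
  if constexpr (std::is_trivially_copyable_v<Functor>)
    return ready_event();
  else
    return memcpy_event;
}
```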
@masterleinad (Contributor) left a comment:

This already looks pretty good to me. I see better performance with it (storing the kernel in device memory), but we should evaluate it together with #4627 to pick the best default storage location (possibly device-dependent).

@@ -94,9 +95,13 @@ class Kokkos::Impl::ParallelFor<FunctorType, Kokkos::RangePolicy<Traits...>,
FunctorWrapperRangePolicyParallelFor<Functor, Policy> f{policy.begin(),
functor};
sycl::range<1> range(policy.end() - policy.begin());
cgh.depends_on(memcpy_events);
Contributor:
Do we still need a std::vector of events here?

Contributor Author:

No, not really. I just changed it for consistency with the other versions of sycl_direct_launch. I can revert if you prefer.

Contributor:

I think I would prefer just passing the single event directly instead of creating a one-element std::vector if possible.

Contributor Author:

No problem, I'll change this and the other occurrences.

});
// This barrier would prevent q.memcpy for subsequent kernels from being
// brought forward in time.
// q.submit_barrier(std::vector<sycl::event>{parallel_for_event});
Contributor:
I think this should be reverted in the end (at the moment it doesn't matter too much because we create queues as in-order as a workaround for other issues).

@masterleinad (Contributor):

We should make this a priority for the release (at least the part that doesn't involve improving how we deal with the memory events).

@joeatodd (Contributor, Author):

> We should make this a priority for the release (at least the part that doesn't involve improving how we deal with the memory events).

I'm keen to get this out the door too. Functionally, I think the only remaining things we discussed were checking the impact of 1) the pre-allocation size (if any) and 2) the pool size. I can do that shortly.

@masterleinad (Contributor):

@joeatodd Any news here? The window for the next release is about to close.

@joeatodd (Contributor, Author) commented Feb 2, 2022:

Based on some tests with LAMMPS, I think that a relatively small pool size (2-4) is sufficient. There's a slight performance improvement moving from pool size 2 to pool size 4, but none beyond that. I think given the modest memory footprint of 4 USM allocations of a few kilobytes, we should opt for 4. @masterleinad are you happy for me to simply hardcode the pool size to 4 for now & rebase?

@masterleinad (Contributor):

> @masterleinad are you happy for me to simply hardcode the pool size to 4 for now & rebase?

Yes, that sounds good to me.

@joeatodd joeatodd marked this pull request as ready for review February 2, 2022 14:57
@joeatodd (Contributor, Author) commented Feb 7, 2022:

The Jenkins pipeline is still failing and I'm not really sure why. Any help appreciated!

- no loop in reductions
- no references allowed in kernels
- rename atomic_wrapping_fetch_inc
@masterleinad (Contributor) left a comment:

Looks mostly good to me.

@masterleinad (Contributor) left a comment:

Looks OK to me.

@dalg24 dalg24 merged commit 441ad2c into kokkos:develop Feb 9, 2022