-
Notifications
You must be signed in to change notification settings - Fork 436
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Team- and thread-level sort, sort_by_key #5317
Conversation
@brian-kelley Have you considered using CUB for the CUDA back-end? CUB should always come with your CUDA installation. |
@mhoemmen Yes, the reasons I didn't use CUB is that its two block-level sorts (Radix and Merge) both require There is still a slight performance hit - when we tried to publish a KokkosKernels paper, I ran some experiments against CUB BlockRadixSort. On V100, this implementation sorted a bunch of 256-element int arrays (each one started in global, was loaded to shared or registers, and was written back to global after sorting) about 9% slower than BlockRadixSort. |
71b050c
to
83822f9
Compare
Add sort functions that can be called from device, and exploit team and thread level parallelism. The new functions use bitonic sort, which is good for this because it's in-place but highly parallel (when sorting N items, N/2 pairs are compared at once). It's also comparison-based, so there are versions that can accept an arbitrary comparison functor (operator()(a, b) returns true if key a goes before key b). sort_by_key is the same, but in addition to the keys it takes a values view of the same length. The pairs keys(i) and values(i) are all sorted according to the key. This is useful for sorting CRS matrices for example.
Looks like testing had a random failure in |
- Use existing swap function and binary less-than predicate - Add FIXMEs about adding ceiling power-of-2 utility (used several places)
(generic across team-level and thread-level using templates)
@masterleinad Thanks for the review - I just pushed all the suggestions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is pretty good. But I'd like us to put it into Experimental, and then hopefully by Kokkos 4.1 we have the more capable execution resource handles, so instead of having:
sort_team(TeamHandle, ...)
sort_thread(TeamHandle, ...)
we simply have:
sort(TeamHandle, ...)
sort(ThreadHandle, ...)
sort(InlineHandle, ...)
Agree about |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can still avoid some duplicated code. Otherwise, this looks good to me.
- in Sort, put kokkos includes together - in NestedSort, include Kokkos_Core.hpp so that it can be used standalone - in NestedSort, move #includes inside include guard
All the suggestions so far are pushed now |
|
Looks like the test machine fetnat03 has a full disk. |
This last round of testing exposed a little issue with Does this deserve its own issue? Somewhat related to #4685 , which was resolved by the suggestion to always use AUTO for team size of dummy policies. |
- remove code duplication (only doing compare+swap in one place) - fix team size < 32 on OMPTarget
- Don't duplicate logic to randomly generate offsets/keys - Use standard library to generate randoms on host for offsets
when EXPECT_EQ, EXPECT_TRUE tests fail.
@dalg24 I just pushed those suggestions (except I didn't change anything about the prefix sum) |
@crtrott Could you give this another review? I moved the functions to Kokkos::Experimental. |
@@ -35,6 +35,7 @@ foreach(Tag Threads;Serial;OpenMP;Cuda;HPX;HIP;SYCL;OpenMPTarget) | |||
"#include <Test${Tag}_Category.hpp>\n" | |||
"#include <TestRandomCommon.hpp>\n" | |||
"#include <TestSortCommon.hpp>\n" | |||
"#include <TestNestedSort.hpp>\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it not a separate cpp source file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer generating multiple source file but this do not need to be handled here if you open an issue for it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just open that additional issue, we also should try what happens if you would use this at the top level (i.e. with an execution space instance ...)
@@ -35,6 +35,7 @@ foreach(Tag Threads;Serial;OpenMP;Cuda;HPX;HIP;SYCL;OpenMPTarget) | |||
"#include <Test${Tag}_Category.hpp>\n" | |||
"#include <TestRandomCommon.hpp>\n" | |||
"#include <TestSortCommon.hpp>\n" | |||
"#include <TestNestedSort.hpp>\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you open an issue so we can split this and not end up with a single huge object file.
(Addresses #645)
Add sort functions that can be called from device,
and exploit team and thread level parallelism. The new functions
use bitonic sort, which is good for this because it's in-place but
highly parallel (when sorting N items, N/2 pairs are compared at once).
It's also comparison-based, so there are versions that can accept an
arbitrary comparison functor (
bool operator()(a, b)
returns true if key agoes before key b, otherwise false). Any key type that can be copied in device code will work, as long as there is
operator<
for it (or you provide a comparator). So things like Kokkos::pair (see #4487) are fine.sort_by_key is the same, but in addition to the keys it takes a values
view of the same length. The pairs keys(i) and values(i) are all sorted
according to the key. This is useful for sorting CRS matrices for
example.
The interfaces to sort_[team/thread] and sort_by_key[team/thread] are designed to be similar to Thrust, except they take Kokkos::Views and not iterators.
The new function signatures (8 of them, but all implemented in terms of 2 Impl:: functions):
BTW, there are 2 full sort implementations now since TeamVectorRange() and ThreadVectorRange() are functions, not types that could be templated on. But once the generic ranges and team/thread handles get added, all the functions could be in terms of 1 implementation.
I tested this myself on architectures PASCAL61, VEGA908, INTEL_XEHP, and made sure that all 8 functions are covered by the new tests.