Team- and thread-level sort, sort_by_key #5317

brian-kelley · 2022-08-08T23:02:49Z

(Addresses #645)

Add sort functions that can be called from device code and that exploit
team- and thread-level parallelism. The new functions use bitonic sort,
which suits this use case because it is in-place yet highly parallel
(when sorting N items, up to N/2 pairs are compared at once).
It is also comparison-based, so there are versions that accept an
arbitrary comparison functor (bool operator()(a, b) returns true if key a
goes before key b, otherwise false). Any key type that can be copied in device code will work, as long as it has operator< (or you provide a comparator). So things like Kokkos::pair (see #4487) are fine.
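
To make the comparison pattern concrete, here is a minimal serial, host-side sketch of one bitonic network variant that handles arbitrary lengths by skipping any comparison whose partner index falls past the end. It is not the code in this PR: the name `bitonic_sort_sketch` and the use of `std::vector` are assumptions made for illustration. In a team- or thread-level version, each inner loop over `i` is the step that would run in parallel, since the pairs it touches are disjoint.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical host-only helper (not part of Kokkos).
// comp(a, b) must return true if a goes before b.
template <class T, class Comparator = std::less<T>>
void bitonic_sort_sketch(std::vector<T>& a, Comparator comp = Comparator()) {
  const std::size_t n = a.size();

  // Compare elements i and l (only when i < l < n) and put them in order.
  // Skipping l >= n behaves like padding the array with "largest" elements.
  auto compare_swap = [&](std::size_t i, std::size_t l) {
    if (l > i && l < n && comp(a[l], a[i])) std::swap(a[i], a[l]);
  };

  // k is the size of the blocks being merged; it runs up to the smallest
  // power of two that is >= n.
  for (std::size_t k = 2; k / 2 < n; k *= 2) {
    // First step of each merge: compare i with its mirror inside the block.
    for (std::size_t i = 0; i < n; ++i) compare_swap(i, i ^ (k - 1));
    // Remaining steps: compare elements j apart, halving j each time.
    for (std::size_t j = k / 4; j > 0; j /= 2)
      for (std::size_t i = 0; i < n; ++i) compare_swap(i, i ^ j);
  }
}
```

Calling `bitonic_sort_sketch(v)` sorts ascending with `operator<`; passing `std::greater<int>()` as the second argument sorts descending.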

sort_by_key works the same way, but in addition to the keys it takes a
values view of the same length. Each pair (keys(i), values(i)) is moved
together, ordered by key. This is useful for sorting CRS matrices, for
example.
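
As a rough illustration of the by-key behavior, the sketch below sorts one row of a CRS-like matrix on the host, treating column indices as keys and matrix entries as values. `sort_by_key_sketch` is a hypothetical name and the `std::vector` arguments stand in for views; this is an assumption-laden sketch, not the PR's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical host-only helper (not part of Kokkos).
// Sorts keys ascending and applies the same swaps to values.
template <class Key, class Value>
void sort_by_key_sketch(std::vector<Key>& keys, std::vector<Value>& values) {
  assert(keys.size() == values.size());
  const std::size_t n = keys.size();

  // Whenever two keys are exchanged, exchange the matching values too.
  auto compare_swap = [&](std::size_t i, std::size_t l) {
    if (l > i && l < n && keys[l] < keys[i]) {
      std::swap(keys[i], keys[l]);
      std::swap(values[i], values[l]);
    }
  };

  // Same bitonic network shape as a keys-only sort.
  for (std::size_t k = 2; k / 2 < n; k *= 2) {
    for (std::size_t i = 0; i < n; ++i) compare_swap(i, i ^ (k - 1));
    for (std::size_t j = k / 4; j > 0; j /= 2)
      for (std::size_t i = 0; i < n; ++i) compare_swap(i, i ^ j);
  }
}
```

For example, column indices `{7, 2, 5, 0}` with entries `{70.0, 20.0, 50.0, 0.0}` come out as `{0, 2, 5, 7}` and `{0.0, 20.0, 50.0, 70.0}`. Bitonic sort is not stable, so with repeated keys the relative order of their values is not guaranteed.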

The interfaces to sort_[team/thread] and sort_by_key_[team/thread] are designed to be similar to Thrust, except they take Kokkos::Views and not iterators.

The new function signatures (8 of them, but all implemented in terms of 2 Impl:: functions):

template <class TeamMember, class ViewType>
KOKKOS_INLINE_FUNCTION void sort_team(const TeamMember& t, const ViewType& view);

template <class TeamMember, class ViewType, class Comparator>
KOKKOS_INLINE_FUNCTION void sort_team(const TeamMember& t, const ViewType& view, const Comparator& comp);

template <class TeamMember, class KeyViewType, class ValueViewType>
KOKKOS_INLINE_FUNCTION void sort_by_key_team(const TeamMember& t, const KeyViewType& keyView, const ValueViewType& valueView);

template <class TeamMember, class KeyViewType, class ValueViewType, class Comparator>
KOKKOS_INLINE_FUNCTION void sort_by_key_team(const TeamMember& t, const KeyViewType& keyView, const ValueViewType& valueView, const Comparator& comp);

template <class TeamMember, class ViewType>
KOKKOS_INLINE_FUNCTION void sort_thread(const TeamMember& t, const ViewType& view);

template <class TeamMember, class ViewType, class Comparator>
KOKKOS_INLINE_FUNCTION void sort_thread(const TeamMember& t, const ViewType& view, const Comparator& comp);

template <class TeamMember, class KeyViewType, class ValueViewType>
KOKKOS_INLINE_FUNCTION void sort_by_key_thread(const TeamMember& t, const KeyViewType& keyView, const ValueViewType& valueView);

template <class TeamMember, class KeyViewType, class ValueViewType, class Comparator>
KOKKOS_INLINE_FUNCTION void sort_by_key_thread(const TeamMember& t, const KeyViewType& keyView, const ValueViewType& valueView, const Comparator& comp);
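
For the overloads that take a Comparator, the functor only needs a call operator that returns true when its first argument goes before the second. The sketch below shows that shape with a hypothetical `BySecondThenFirst` functor on `std::pair<int, int>` (standing in for something like Kokkos::pair), and uses `std::sort` on the host purely as a stand-in for the device-side call. In real device code the call operator would also need a Kokkos function annotation such as KOKKOS_INLINE_FUNCTION, which is left out here to keep the example host-only.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical comparator: order pairs by .second, break ties by .first.
// It is a strict ordering: comp(a, a) is always false.
struct BySecondThenFirst {
  bool operator()(const std::pair<int, int>& a,
                  const std::pair<int, int>& b) const {
    if (a.second != b.second) return a.second < b.second;
    return a.first < b.first;
  }
};

// Host stand-in for a call shaped like sort_team(t, view, comp).
inline void sort_pairs_sketch(std::vector<std::pair<int, int>>& v) {
  std::sort(v.begin(), v.end(), BySecondThenFirst());
}
```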

BTW, there are 2 full sort implementations now since TeamVectorRange() and ThreadVectorRange() are functions, not types that could be templated on. But once the generic ranges and team/thread handles get added, all the functions could be in terms of 1 implementation.

I tested this myself on architectures PASCAL61, VEGA908, INTEL_XEHP, and made sure that all 8 functions are covered by the new tests.

mhoemmen · 2022-08-09T14:48:39Z

@brian-kelley Have you considered using CUB for the CUDA back-end? CUB should always come with your CUDA installation.

brian-kelley · 2022-08-09T16:20:24Z

@mhoemmen Yes, the reason I didn't use CUB is that its two block-level sorts (Radix and Merge) both require __shared__ temporary storage, and both require the input to be partitioned across threads with a fixed number of items per thread (so the lengths have some small-ish upper bound based on hardware limitations). This bitonic sort doesn't need extra space, and any team/thread size can sort an array of any length (and the data can be any View, any layout, shared or global). So if the Kokkos user wants to use every byte of available shared memory for other stuff, they can still use this.

There is still a slight performance hit - when we tried to publish a KokkosKernels paper, I ran some experiments against CUB BlockRadixSort. On V100, this implementation sorted a bunch of 256-element int arrays (each one started in global, was loaded to shared or registers, and was written back to global after sorting) about 9% slower than BlockRadixSort.

brian-kelley · 2022-08-10T17:05:12Z

Looks like testing had a random failure in cuda.debug_pin_um_to_host

brian-kelley · 2022-08-10T23:43:11Z

@masterleinad Thanks for the review - I just pushed all the suggestions.

crtrott

I think this is pretty good. But I'd like us to put it into Experimental, and then hopefully by Kokkos 4.1 we have the more capable execution resource handles, so instead of having:

sort_team(TeamHandle, ...)
sort_thread(TeamHandle, ...)

we simply have:

sort(TeamHandle, ...)
sort(ThreadHandle, ...)
sort(InlineHandle, ...)
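
One way to picture this single-entry-point design is a `sort` template that is generic over the handle type, where the handle only supplies the loop that visits indices. Everything in the sketch below is hypothetical: the three handle structs, their `for_each` member, and the `sketch` namespace are placeholders invented here, not an existing Kokkos API, and all three run serially on the host. The reverse loop in `ThreadHandle` is only there to show that each step of the network does not depend on iteration order.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

namespace sketch {

// Placeholder handles: each one only knows how to visit indices [0, n).
struct TeamHandle {
  template <class F>
  void for_each(std::size_t n, F f) const {
    for (std::size_t i = 0; i < n; ++i) f(i);
  }
};
struct ThreadHandle {
  template <class F>
  void for_each(std::size_t n, F f) const {
    for (std::size_t i = n; i-- > 0;) f(i);  // reverse order on purpose
  }
};
struct InlineHandle {
  template <class F>
  void for_each(std::size_t n, F f) const {
    for (std::size_t i = 0; i < n; ++i) f(i);
  }
};

// Single implementation shared by every handle type.
template <class Handle, class T>
void sort(const Handle& h, std::vector<T>& a) {
  const std::size_t n = a.size();
  // One network step: each index i is paired with i ^ mask.
  auto step = [&](std::size_t mask) {
    h.for_each(n, [&](std::size_t i) {
      const std::size_t l = i ^ mask;
      if (l > i && l < n && a[l] < a[i]) std::swap(a[i], a[l]);
    });
  };
  for (std::size_t k = 2; k / 2 < n; k *= 2) {
    step(k - 1);
    for (std::size_t j = k / 4; j > 0; j /= 2) step(j);
  }
}

}  // namespace sketch
```

With this shape, `sketch::sort(sketch::TeamHandle{}, v)` and `sketch::sort(sketch::ThreadHandle{}, v)` share one implementation, and the level of parallelism is chosen by the argument type instead of the function name.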

dalg24 · 2022-08-11T19:40:52Z

> I think this is pretty good. But I'd like us to put it into Experimental, and then hopefully by Kokkos 4.1 we have the more capable execution resource handles, so instead of having:
> sort_team(TeamHandle, ...)
> sort_thread(TeamHandle, ...)
> we simply have:
> sort(TeamHandle, ...)
> sort(ThreadHandle, ...)
> sort(InlineHandle, ...)

Agree about Experimental::. Maybe put it in a separate header.

masterleinad

I think we can still avoid some duplicated code. Otherwise, this looks good to me.

brian-kelley · 2022-08-11T20:25:51Z

All the suggestions so far are pushed now

masterleinad · 2022-08-12T18:18:05Z

27: [ RUN      ] openmptarget.NestedSort
27: OpenMPTarget backend requires a minimum of 32 threads per team.

brian-kelley · 2022-08-12T20:36:14Z

Looks like the test machine fetnat03 has a full disk.

brian-kelley · 2022-08-15T14:57:11Z

This last round of testing exposed a small issue with TeamPolicy<OpenMPTarget>: vector_length_max() returned 32, but with a value that high Kokkos::AUTO can't also satisfy the requirement that the team size is at least 32.

Does this deserve its own issue? It's somewhat related to #4685, which was resolved by the suggestion to always use AUTO for the team size of dummy policies.

brian-kelley · 2022-08-18T23:47:36Z

@dalg24 I just pushed those suggestions (except I didn't change anything about the prefix sum)

brian-kelley · 2022-08-23T15:41:28Z

@crtrott Could you give this another review? I moved the functions to Kokkos::Experimental.

dalg24 · 2022-08-31T20:33:56Z

algorithms/unit_tests/CMakeLists.txt

@@ -35,6 +35,7 @@ foreach(Tag Threads;Serial;OpenMP;Cuda;HPX;HIP;SYCL;OpenMPTarget)
      "#include <Test${Tag}_Category.hpp>\n"
      "#include <TestRandomCommon.hpp>\n"
      "#include <TestSortCommon.hpp>\n"
+      "#include <TestNestedSort.hpp>\n"


Why is it not a separate cpp source file?

I would prefer generating multiple source files, but this does not need to be handled here if you open an issue for it.

crtrott

Just open that additional issue. We should also try what happens if you use this at the top level (i.e. with an execution space instance ...)

crtrott · 2022-08-31T20:37:25Z

algorithms/unit_tests/CMakeLists.txt

@@ -35,6 +35,7 @@ foreach(Tag Threads;Serial;OpenMP;Cuda;HPX;HIP;SYCL;OpenMPTarget)
      "#include <Test${Tag}_Category.hpp>\n"
      "#include <TestRandomCommon.hpp>\n"
      "#include <TestSortCommon.hpp>\n"
+      "#include <TestNestedSort.hpp>\n"


Can you open an issue so we can split this and not end up with a single huge object file?

brian-kelley added the Enhancement label (Improve existing capability; will potentially require voting) Aug 8, 2022

brian-kelley requested review from PhilMiller and masterleinad August 8, 2022 23:02

brian-kelley self-assigned this Aug 8, 2022

brian-kelley mentioned this pull request Aug 8, 2022

[do not close] NGA/task 3: tracking sorting task #5071

Closed

3 tasks

brian-kelley force-pushed the Do645 branch 2 times, most recently from 71b050c to 83822f9 Compare August 9, 2022 20:03

brian-kelley force-pushed the Do645 branch from 83822f9 to 1f946f9 Compare August 9, 2022 21:34

masterleinad reviewed Aug 10, 2022

brian-kelley added 2 commits August 10, 2022 16:58

Small updates to nested sort

91531a5

- Use existing swap function and binary less-than predicate
- Add FIXMEs about adding ceiling power-of-2 utility (used several places)

Refactor nested-parallelism sort to need 1 impl only

a86eb7f

(generic across team-level and thread-level using templates)

sort_thread tests: Fix OOB access on idle threads

6ae55a9

dalg24 reviewed Aug 11, 2022

algorithms/unit_tests/TestSort.hpp (outdated, resolved)

dalg24 reviewed Aug 11, 2022

algorithms/src/Kokkos_Sort.hpp (two threads, both outdated and resolved)

crtrott requested changes Aug 11, 2022

brian-kelley added 2 commits August 11, 2022 13:43

Use KOKKOS_FUNCTION in place of INLINE, FORCEINLINE

294b7f6

Move team/thread sort to Experimental::, new header

6ea1250

masterleinad reviewed Aug 11, 2022

algorithms/src/Kokkos_NestedSort.hpp (outdated, resolved)

Fix Sort/NestedSort includes

745cfd8

- in Sort, put Kokkos includes together
- in NestedSort, include Kokkos_Core.hpp so that it can be used standalone
- in NestedSort, move #includes inside include guard

masterleinad approved these changes Aug 11, 2022

brian-kelley force-pushed the Do645 branch from 6a270cd to fd40491 Compare August 15, 2022 14:50

brian-kelley requested review from dalg24 and crtrott August 16, 2022 15:24

Nested bitonic sort: small changes

cb2eaff

- remove code duplication (only doing compare+swap in one place)
- fix team size < 32 on OMPTarget

brian-kelley force-pushed the Do645 branch from fd40491 to cb2eaff Compare August 17, 2022 18:02

dalg24 reviewed Aug 17, 2022

brian-kelley added 4 commits August 18, 2022 15:47

Pass tag by value

9d67e0b

NestedSort test cleanup

67bb790

- Don't duplicate logic to randomly generate offsets/keys
- Use standard library to generate randoms on host for offsets

Nested sort test: add error messages

7994904

when EXPECT_EQ, EXPECT_TRUE tests fail.

Move nested sort tests into a separate file

35faf09

Move sort test includes, revert whitespace changes

0cc7917

dalg24 reviewed Aug 31, 2022

algorithms/unit_tests/TestNestedSort.hpp (resolved)

crtrott approved these changes Aug 31, 2022

crtrott merged commit e50a7be into kokkos:develop Aug 31, 2022

ajpowelsnl mentioned this pull request Mar 22, 2023

Changelog 4.0.0/team thread sort #6007

Closed
