Conversation

@Pennycook (Contributor) commented Apr 26, 2023:

Enables the following functions to be used with tangle_group and opportunistic_group arguments (a minimal usage sketch follows the list):

  • group_barrier
  • group_broadcast
  • any_of_group
  • all_of_group
  • none_of_group
  • reduce_over_group
  • exclusive_scan_over_group
  • inclusive_scan_over_group
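
As a minimal sketch of how one of these algorithms might now be invoked (this is not code from the PR's diff; the get_tangle_group entry point is assumed from the sycl_ext_oneapi_non_uniform_groups extension):

```cpp
#include <sycl/sycl.hpp>

namespace syclex = sycl::ext::oneapi::experimental;

int main() {
  sycl::queue Q;
  int *Out = sycl::malloc_shared<int>(32, Q);
  Q.parallel_for(sycl::nd_range<1>{32, 32}, [=](sycl::nd_item<1> It) {
     auto SG = It.get_sub_group();
     if (SG.get_local_id()[0] % 2 == 0) {
       // All work-items that take this branch together form the tangle.
       auto Tangle = syclex::get_tangle_group(SG);
       // Each member contributes 1, so the result is the tangle's size.
       Out[It.get_global_id()] =
           sycl::reduce_over_group(Tangle, 1, sycl::plus<int>());
     } else {
       Out[It.get_global_id()] = 0;
     }
   }).wait();
  sycl::free(Out, Q);
  return 0;
}
```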

A few quick notes to reviewers:

  1. This implementation avoids using a mask for most operations by leveraging the fact that it is undefined behavior to use a tangle group or opportunistic group in control flow that does not match the control flow at the point of construction. I think it is safe to call the NonUniform intrinsics directly, because they are already control-flow-aware. (A sketch illustrating this follows the notes.)

  2. In a few places, I've deliberately duplicated the implementation across tangle group and opportunistic group even though they're the same. I've done this primarily in an attempt to simplify @JackAKirk's efforts to merge in his CUDA implementation, because I expect that there may be some cases where the CUDA implementations of these groups do diverge. If this turns out not to be true, we can tidy things up afterwards.

  3. In general, tangle and opportunistic group are not the same thing. But I expect their behavior will be identical on all of the SPIR-V implementations that we're targeting.
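
To illustrate note 1, here is a sketch under an assumed kernel context (not code from this patch; `It`, `Condition`, and the `syclex::get_tangle_group` entry point are assumptions based on the non-uniform groups extension):

```cpp
auto SG = It.get_sub_group();
if (Condition) {
  // The tangle is formed by every work-item for which Condition holds.
  auto Tangle = syclex::get_tangle_group(SG);
  sycl::group_barrier(Tangle); // OK: control flow matches construction

  if (SG.get_local_id()[0] == 0) {
    // Undefined behavior: only a subset of the tangle's members reaches
    // this point, so the control flow no longer matches construction.
    // That guarantee is what lets the implementation call the SPIR-V
    // NonUniform intrinsics directly without computing an explicit mask.
    sycl::group_barrier(Tangle);
  }
}
```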

Signed-off-by: John Pennycook <john.pennycook@intel.com>
@Pennycook Pennycook requested a review from a team as a code owner April 26, 2023 17:59
@@ -0,0 +1,137 @@
// RUN: %clangxx -fsycl -fsycl-device-code-split=per_kernel -fsycl-targets=%sycl_triple %s -o %t.out
Contributor commented:

Please add a comment why this per-kernel split is necessary.

Contributor Author replied:

@steffenlarsen suggested I add this to one of the other tests. Steffen, could you please advise on what the comment should say here?

Contributor replied:

Since we're only testing with a single sub-group size in this test, I think it will be okay to not have the option. The reason we needed it in the other test was because it would try with different sub-group sizes, which isn't currently split correctly so the binaries would have potentially invalid kernels together with valid ones, potentially causing build failures when launching valid kernels.
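
For illustration only (the shape of the other test is an assumption, not something visible in this thread), the problem arises when one translation unit contains kernels with different required sub-group sizes; `Q` and `R` here are a hypothetical queue and nd_range:

```cpp
// Without -fsycl-device-code-split=per_kernel, both kernels land in a
// single device binary. A device that supports only one of the two sizes
// may then fail to build the whole binary, even if the application only
// launches the kernel whose size it does support.
Q.parallel_for(R, [=](sycl::nd_item<1> It)
                      [[sycl::reqd_sub_group_size(16)]] { /* ... */ });
Q.parallel_for(R, [=](sycl::nd_item<1> It)
                      [[sycl::reqd_sub_group_size(32)]] { /* ... */ });
```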

Contributor Author replied:

Ah, ok. Thanks for the explanation. Removed the flag in 92c0b10.

@aelovikov-intel (Contributor) left a comment:

LGTM, but I'd really like other reviewers to look into it as well.

Commit pushed: "This is consistent with the tangle_group tests, and may fix the error on Windows."
@Pennycook (Contributor Author) commented:

All of the tests passed except for tangle_group_algorithms.cpp on Windows. We had something similar with the original tangle_group.cpp, so I'm trying the same fix -- -fno-sycl-early-optimizations -- to see if it works.

@steffenlarsen, if you still have direct access to a Windows box, would you mind taking a quick look at the failing test as well?
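
For context, the attempted fix simply adds the flag to the test's RUN line (a sketch based on the RUN line quoted earlier; the rest of the invocation is assumed unchanged):

```cpp
// RUN: %clangxx -fsycl -fno-sycl-early-optimizations -fsycl-targets=%sycl_triple %s -o %t.out
```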

@Pennycook (Contributor Author) commented:

It still fails even with -fno-sycl-early-optimizations, so I'm stumped.

Review thread on the new test (diff context below):

// robust test, but choosing an arbitrary work-item (i.e. rather
// than the leader) should test an implementation's ability to handle
// arbitrary group membership.
if (OriginalLID == ArbitraryItem) {
@JackAKirk (Contributor) commented Apr 27, 2023:

Since you only have a single thread per group, is this going to properly test the group implementations for the Intel case? In the CUDA backend it wouldn't for the reduce_over_group case. Also, in the CUDA implementation the reduce algorithm behaves differently depending on whether OpportunisticGroup.get_local_range() equals 2^n (where n is a positive, non-zero integer), does not equal a power of two, equals 1 (as in this test currently, the more trivial case), or equals 32 (a full warp), making four different cases in total.

But I could add these cases later if need be.
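
A rough illustration of the four size classes described above (this is not the actual CUDA implementation, just a sketch of the case split):

```cpp
unsigned N = OpportunisticGroup.get_local_range()[0]; // members in this group
if (N == 32) {
  // Full warp: a plain warp-wide reduction applies.
} else if (N == 1) {
  // Trivial: the single member's own value is the result
  // (the case this test currently exercises).
} else if ((N & (N - 1)) == 0) {
  // Power of two (2, 4, 8, 16): e.g. a butterfly reduction over log2(N) steps.
} else {
  // Non-power-of-two: needs a more general masked reduction.
}
```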

Contributor Author replied:

This isn't going to test every path, but I couldn't think of a good way to do that reliably. The semantics of opportunistic groups are (deliberately) really weird.

Even if we added a case where we picked a power of 2 (say, 8) work-items and had them all take the same branch, the specification doesn't require all 8 of those work-items to end up in the same opportunistic group. The specification only requires that all the work-items who encounter the constructor "together" (furious hand waving) form an opportunistic group. There's no way to query which work-items end up in which group, or how many groups are formed. A single work-item executing the branch was the only case I could think of with predictable, portable behavior.

Ideally, we'd probably want to somehow work out which work-items were split into which opportunistic groups, and then dynamically determine what the algorithm results should be given the partitioning that actually happened at runtime. But I couldn't think of a good way to do that. If we can figure out a good way to write that test, we should definitely add it.

I agree that adding some backend-specific tests would be a good idea, too.
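
For reference, a hedged sketch of the predictable single-work-item case described above; the `this_kernel::get_opportunistic_group` entry point is an assumption based on the extension proposal, and `Value` is a hypothetical per-item input (`OriginalLID` and `ArbitraryItem` come from the quoted test):

```cpp
if (OriginalLID == ArbitraryItem) {
  // Exactly one work-item reaches this point, so the opportunistic group
  // it forms has exactly one member and every algorithm result becomes
  // predictable: reductions return the member's own value, broadcasts are
  // identities, any_of/all_of reduce to the member's predicate, and so on.
  auto OG = syclex::this_kernel::get_opportunistic_group();
  assert(OG.get_local_range()[0] == 1);
  int R = sycl::reduce_over_group(OG, Value, sycl::plus<int>()); // R == Value
}
```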

@JackAKirk (Contributor) left a comment:

LGTM from the CUDA backend point of view.

@Pennycook (Contributor Author) commented:

@steffenlarsen - Is there anybody else you think should review this?

@steffenlarsen steffenlarsen merged commit 29e629e into intel:sycl May 2, 2023
@Pennycook Pennycook deleted the tangle_and_opportunistic_algorithms branch May 2, 2023 14:13