Conversation

@Pennycook (Contributor) commented Apr 26, 2023:

Enables the following functions to be used with tangle_group and opportunistic_group arguments (a minimal usage sketch follows the list):

  • group_barrier
  • group_broadcast
  • any_of_group
  • all_of_group
  • none_of_group
  • reduce_over_group
  • exclusive_scan_over_group
  • inclusive_scan_over_group
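
As a minimal sketch of how one of these algorithms might now be invoked (this is not code from the PR's diff; the get_tangle_group entry point is assumed from the sycl_ext_oneapi_non_uniform_groups extension):

```cpp
#include <sycl/sycl.hpp>

namespace syclex = sycl::ext::oneapi::experimental;

int main() {
  sycl::queue Q;
  int *Out = sycl::malloc_shared<int>(32, Q);
  Q.parallel_for(sycl::nd_range<1>{32, 32}, [=](sycl::nd_item<1> It) {
     auto SG = It.get_sub_group();
     if (SG.get_local_id()[0] % 2 == 0) {
       // All work-items that take this branch together form the tangle.
       auto Tangle = syclex::get_tangle_group(SG);
       // Each member contributes 1, so the result is the tangle's size.
       Out[It.get_global_id()] =
           sycl::reduce_over_group(Tangle, 1, sycl::plus<int>());
     } else {
       Out[It.get_global_id()] = 0;
     }
   }).wait();
  sycl::free(Out, Q);
  return 0;
}
```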

A few quick notes to reviewers:

  1. This implementation avoids using a mask for most operations by leveraging the fact that it is undefined behavior to use a tangle group or opportunistic group in control flow that does not match the control flow at the point of construction. I think it is safe to call the NonUniform intrinsics directly, because they are already control-flow-aware. (A sketch illustrating this follows the notes.)

  2. In a few places, I've deliberately duplicated the implementation across tangle group and opportunistic group even though they're the same. I've done this primarily in an attempt to simplify @JackAKirk's efforts to merge in his CUDA implementation, because I expect that there may be some cases where the CUDA implementations of these groups do diverge. If this turns out not to be true, we can tidy things up afterwards.

  3. In general, tangle and opportunistic group are not the same thing. But I expect their behavior will be identical on all of the SPIR-V implementations that we're targeting.
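
To illustrate note 1, here is a sketch under an assumed kernel context (not code from this patch; `It`, `Condition`, and the `syclex::get_tangle_group` entry point are assumptions based on the non-uniform groups extension):

```cpp
auto SG = It.get_sub_group();
if (Condition) {
  // The tangle is formed by every work-item for which Condition holds.
  auto Tangle = syclex::get_tangle_group(SG);
  sycl::group_barrier(Tangle); // OK: control flow matches construction

  if (SG.get_local_id()[0] == 0) {
    // Undefined behavior: only a subset of the tangle's members reaches
    // this point, so the control flow no longer matches construction.
    // That guarantee is what lets the implementation call the SPIR-V
    // NonUniform intrinsics directly without computing an explicit mask.
    sycl::group_barrier(Tangle);
  }
}
```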

Signed-off-by: John Pennycook <john.pennycook@intel.com>
@Pennycook Pennycook requested a review from a team as a code owner April 26, 2023 17:59
@@ -0,0 +1,137 @@
// RUN: %clangxx -fsycl -fsycl-device-code-split=per_kernel -fsycl-targets=%sycl_triple %s -o %t.out
Contributor commented:

Please add a comment why this per-kernel split is necessary.

Contributor Author replied:

@steffenlarsen suggested I add this to one of the other tests. Steffen, could you please advise on what the comment should say here?

Contributor replied:

Since we're only testing with a single sub-group size in this test, I think it will be okay to not have the option. The reason we needed it in the other test was because it would try with different sub-group sizes, which isn't currently split correctly so the binaries would have potentially invalid kernels together with valid ones, potentially causing build failures when launching valid kernels.
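
For illustration only (the shape of the other test is an assumption, not something visible in this thread), the problem arises when one translation unit contains kernels with different required sub-group sizes; `Q` and `R` here are a hypothetical queue and nd_range:

```cpp
// Without -fsycl-device-code-split=per_kernel, both kernels land in a
// single device binary. A device that supports only one of the two sizes
// may then fail to build the whole binary, even if the application only
// launches the kernel whose size it does support.
Q.parallel_for(R, [=](sycl::nd_item<1> It)
                      [[sycl::reqd_sub_group_size(16)]] { /* ... */ });
Q.parallel_for(R, [=](sycl::nd_item<1> It)
                      [[sycl::reqd_sub_group_size(32)]] { /* ... */ });
```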

Contributor Author replied:

Ah, ok. Thanks for the explanation. Removed the flag in 92c0b10.

@aelovikov-intel (Contributor) left a comment:

LGTM, but I'd really like other reviewers to look into it as well.

Commit pushed: "This is consistent with the tangle_group tests, and may fix the error on Windows."
@Pennycook (Contributor Author) commented:

All of the tests passed except for tangle_group_algorithms.cpp on Windows. We had something similar with the original tangle_group.cpp, so I'm trying the same fix -- -fno-sycl-early-optimizations -- to see if it works.

@steffenlarsen, if you still have direct access to a Windows box, would you mind taking a quick look at the failing test as well?
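
For context, the attempted fix simply adds the flag to the test's RUN line (a sketch based on the RUN line quoted earlier; the rest of the invocation is assumed unchanged):

```cpp
// RUN: %clangxx -fsycl -fno-sycl-early-optimizations -fsycl-targets=%sycl_triple %s -o %t.out
```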

@Pennycook (Contributor Author) commented:

It still fails even with -fno-sycl-early-optimizations, so I'm stumped.

Review thread on the new test (diff context below):

// robust test, but choosing an arbitrary work-item (i.e. rather
// than the leader) should test an implementation's ability to handle
// arbitrary group membership.
if (OriginalLID == ArbitraryItem) {
@JackAKirk (Contributor) commented Apr 27, 2023:

Since you only have a single thread per group, is this going to properly test the group implementations for the Intel case? In the CUDA backend it wouldn't for the reduce_over_group case. Also, in the CUDA implementation the reduce algorithm behaves differently depending on whether OpportunisticGroup.get_local_range() equals 2^n (where n is a positive, non-zero integer), does not equal a power of two, equals 1 (as in this test currently, the more trivial case), or equals 32 (a full warp), making four different cases in total.

But I could add these cases later if need be.
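
A rough illustration of the four size classes described above (this is not the actual CUDA implementation, just a sketch of the case split):

```cpp
unsigned N = OpportunisticGroup.get_local_range()[0]; // members in this group
if (N == 32) {
  // Full warp: a plain warp-wide reduction applies.
} else if (N == 1) {
  // Trivial: the single member's own value is the result
  // (the case this test currently exercises).
} else if ((N & (N - 1)) == 0) {
  // Power of two (2, 4, 8, 16): e.g. a butterfly reduction over log2(N) steps.
} else {
  // Non-power-of-two: needs a more general masked reduction.
}
```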

Contributor Author replied:

This isn't going to test every path, but I couldn't think of a good way to do that reliably. The semantics of opportunistic groups are (deliberately) really weird.

Even if we added a case where we picked a power of 2 (say, 8) work-items and had them all take the same branch, the specification doesn't require all 8 of those work-items to end up in the same opportunistic group. The specification only requires that all the work-items who encounter the constructor "together" (furious hand waving) form an opportunistic group. There's no way to query which work-items end up in which group, or how many groups are formed. A single work-item executing the branch was the only case I could think of with predictable, portable behavior.

Ideally, we'd probably want to somehow work out which work-items were split into which opportunistic groups, and then dynamically determine what the algorithm results should be given the partitioning that actually happened at runtime. But I couldn't think of a good way to do that. If we can figure out a good way to write that test, we should definitely add it.

I agree that adding some backend-specific tests would be a good idea, too.
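
For reference, a hedged sketch of the predictable single-work-item case described above; the `this_kernel::get_opportunistic_group` entry point is an assumption based on the extension proposal, and `Value` is a hypothetical per-item input (`OriginalLID` and `ArbitraryItem` come from the quoted test):

```cpp
if (OriginalLID == ArbitraryItem) {
  // Exactly one work-item reaches this point, so the opportunistic group
  // it forms has exactly one member and every algorithm result becomes
  // predictable: reductions return the member's own value, broadcasts are
  // identities, any_of/all_of reduce to the member's predicate, and so on.
  auto OG = syclex::this_kernel::get_opportunistic_group();
  assert(OG.get_local_range()[0] == 1);
  int R = sycl::reduce_over_group(OG, Value, sycl::plus<int>()); // R == Value
}
```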

@JackAKirk (Contributor) left a comment:

LGTM from the CUDA backend point of view.

@Pennycook (Contributor Author) commented:

@steffenlarsen - Is there anybody else you think should review this?

@steffenlarsen steffenlarsen merged commit 29e629e into intel:sycl May 2, 2023
@Pennycook Pennycook deleted the tangle_and_opportunistic_algorithms branch May 2, 2023 14:13