[SYCL] Add tangle/opportunistic algorithms #9220

Conversation
Enables the following functions to be used with `tangle_group` and `opportunistic_group` arguments:

- `group_barrier`
- `group_broadcast`
- `any_of_group`
- `all_of_group`
- `none_of_group`
- `reduce_over_group`
- `exclusive_scan_over_group`
- `inclusive_scan_over_group`

Signed-off-by: John Pennycook <john.pennycook@intel.com>
sycl/test-e2e/NonUniformGroups/opportunistic_group_algorithms.cpp (Outdated)
@@ -0,0 +1,137 @@
// RUN: %clangxx -fsycl -fsycl-device-code-split=per_kernel -fsycl-targets=%sycl_triple %s -o %t.out
Please add a comment why this per-kernel split is necessary.
@steffenlarsen suggested I add this to one of the other tests. Steffen, could you please advise on what the comment should say here?
Since we're only testing with a single sub-group size in this test, I think it will be okay not to have the option. The reason we needed it in the other test was that it would try different sub-group sizes, which aren't currently split correctly, so the binaries could contain potentially invalid kernels alongside valid ones, potentially causing build failures when launching the valid kernels.
Ah, ok. Thanks for the explanation. Removed the flag in 92c0b10.
LGTM, but I'd really like other reviewers to look into it as well.
This is consistent with the tangle_group tests, and may fix the error on Windows.
All of the tests passed except for tangle_group_algorithms.cpp on Windows. We had something similar with the original tangle_group.cpp, so I'm trying the same fix -- @steffenlarsen, if you still have direct access to a Windows box, would you mind taking a quick look at the failing test as well?
It still fails even with
// robust test, but choosing an arbitrary work-item (i.e. rather
// than the leader) should test an implementation's ability to handle
// arbitrary group membership.
if (OriginalLID == ArbitraryItem) {
Since you only have a single work-item per group, is this going to properly test the group implementations for the Intel case? In the CUDA backend it wouldn't for the reduce_over_group case. In the CUDA implementation, the reduce algorithm also behaves differently depending on whether OpportunisticGroup.get_local_range() equals 2^n for some positive non-zero n, does not equal a power of two, equals 1 (the more trivial case, as in this test currently), or equals 32 (a full warp) -- four different cases in total.
But I could add these cases later if needs be.
This isn't going to test every path, but I couldn't think of a good way to do that reliably. The semantics of opportunistic groups are (deliberately) really weird.
Even if we added a case where we picked a power of 2 (say, 8) work-items and had them all take the same branch, the specification doesn't require all 8 of those work-items to end up in the same opportunistic group. The specification only requires that all the work-items who encounter the constructor "together" (furious hand waving) form an opportunistic group. There's no way to query which work-items end up in which group, or how many groups are formed. A single work-item executing the branch was the only case I could think of with predictable, portable behavior.
Ideally, we'd probably want to somehow work out which work-items were split into which opportunistic groups, and then dynamically determine what the algorithm results should be given the partitioning that actually happened at runtime. But I couldn't think of a good way to do that. If we can figure out a good way to write that test, we should definitely add it.
I agree that adding some backend-specific tests would be a good idea, too.
LGTM with respect to cuda backend point of view.
@steffenlarsen - Is there anybody else you think should review this?
Enables the following functions to be used with tangle_group and opportunistic_group arguments:
A few quick notes to reviewers:
This implementation leverages the fact that it is undefined behavior to use a tangle group or opportunistic group in control flow that does not match the control flow at the point of construction to avoid using a mask for most operations. I think it is safe to call the NonUniform intrinsics directly, because they are already control-flow-aware.
In a few places, I've deliberately duplicated the implementation across tangle group and opportunistic group even though they're the same. I've done this primarily in an attempt to simplify @JackAKirk's efforts to merge in his CUDA implementation, because I expect that there may be some cases where the CUDA implementations of these groups do diverge. If this turns out not to be true, we can tidy things up afterwards.
In general, tangle and opportunistic group are not the same thing. But I expect their behavior will be identical on all of the SPIR-V implementations that we're targeting.