[SYCL][CUDA] Implement root group barrier #14828
Conversation
sycl::group_barrier(root);

if (it.get_group(0) % 2 == 0) {
    X += sycl::sin(X);
I don't see that we ever check either X or Y. Should we? Otherwise, why do we need this? How do we make sure it ran?
I wanted to explicitly delay some of the work-groups by giving them more work to do, because I've seen this test pass when an insufficient barrier was used. For instance, on the CUDA backend a work-group-wide barrier would be enough for it to pass, and that is not correct. I think this test should introduce some work-group divergence to check that we actually perform a GPU-wide barrier.

> How do we make sure it ran?

X and Y are declared as volatile, and my understanding is that this prevents the compiler from optimizing them away.
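For reference, a minimal sketch of the kernel structure this describes; the accumulator names, the extra-work loop, and the way `root` is obtained are assumptions, not the exact test code:

```c++
// Sketch only: assumes `it` is the nd_item and `root` is the root-group
// handle obtained from the root-group extension used by this test.
volatile float X = 1.0f;
volatile float Y = 1.0f;

// Delay every other work-group with extra work, so that a barrier weaker
// than a device-wide one (e.g. a plain work-group barrier) would let the
// fast groups run ahead and the test would fail.
if (it.get_group(0) % 2 == 0) {
  for (int i = 0; i < 1000; ++i)
    X += sycl::sin(X);
}

// The device-wide barrier under test.
sycl::group_barrier(root);

// Divergent work after the barrier as well; volatile keeps the compiler
// from optimizing the dummy accumulators away.
if (it.get_group(0) % 2 != 0) {
  for (int i = 0; i < 1000; ++i)
    Y += sycl::cos(Y);
}
```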
Right, can we have a comment explaining it? Otherwise we risk that this code will just be removed in the future by someone thinking it's not required.
LGTM.
LGTM
@intel/llvm-gatekeepers can you merge this, please?
This PR adds an algorithm for doing a GPU-wide barrier in the CUDA backend.
Rough outline of the algorithm:
- The 0th thread from each work-group performs `atomic.add(1)` on a global counter.
- That thread then does an `ld.acquire` in a loop until the loaded value is equal to the total amount of work-groups.
- All threads of the work-group then meet at a `barrier.sync`.

One caveat to this is that there is no initialization of the atomic start value. So if we call this barrier several times in a kernel, on the second iteration the start value will already contain the result from the previous barrier. That's why we actually spin the while loop while `current value % totalWgroups != 0`.
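A minimal CUDA C++ sketch of that scheme is below. It is not the code the backend emits: the PTX-level `atom.add`/`ld.acquire` sequence is approximated with `atomicAdd`, volatile loads and `__threadfence()`, and the counter and function names are made up for illustration:

```c++
// Hypothetical illustration of the barrier scheme described above.
__device__ unsigned arrival_counter; // zero-initialized once, never reset between barriers

__device__ void root_group_barrier(unsigned total_wgroups) {
  if (threadIdx.x == 0) {
    __threadfence();                  // release side: make prior writes visible device-wide
    atomicAdd(&arrival_counter, 1u);  // step 1: announce this work-group's arrival
    // Step 2: spin until every work-group has arrived. Since the counter is
    // never re-initialized, after k barriers it holds k * total_wgroups, so
    // the exit condition is "% total_wgroups == 0" rather than "== total_wgroups".
    while (*(volatile unsigned *)&arrival_counter % total_wgroups != 0)
      ;
    __threadfence();                  // acquire side: order the spin loads before later accesses
  }
  __syncthreads();                    // step 3: rest of the work-group waits for thread 0
}
```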