-
Notifications
You must be signed in to change notification settings - Fork 407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SYCL: Improve and simplify parallel_scan implementation #6064
SYCL: Improve and simplify parallel_scan implementation #6064
Conversation
e3c9427
to
6cc246a
Compare
3354928
to
e07be73
Compare
Requires #6065. |
2144d9b
to
bdaa12c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AS far as I can tell looks good.
(global_range + max_subgroup_size - 1) / max_subgroup_size; | ||
|
||
const auto local_range = sg.get_local_range()[0]; | ||
const int local_range = sg.get_local_range()[0]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did SYCL change so that you need int now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it's not necessary to do that but I have seen that SYCL is very sensitive to 64-bit index calculations, and local indices surely will never require that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to be clear, auto
deduces size_t
, not int
. I agree we don't need that much of an index range.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not like the complexity added with the "lambda factory" but I will merge since others seems that it is fine.
The SYCL build passed. Ignoring the rest. |
This pull request changes the previous recursive
parallel_scan
in theSYCL
backend to a two-pass one as we use forCuda
andHIP
andSYCL
reductions. This simplifies the code (also making it more uniform) and reduces the memory footprint (since we only need to store intermediate results for all items and group scans but not recursive group scans).On the way, I made sure that all local operations operate on indices of type
int
avoiding 64-bit index operations.A second improvement is switching to an auto-detection of the work group size as we do for reductions by querying a dummy kernel for the maximum group size.
Finally, this fixes a couple of unit tests that were failing with
SYCL+Cuda
since #5707.