SYCL: Improve and simplify parallel_scan implementation #6064

Merged: 2 commits, May 9, 2023

Contributor

@masterleinad masterleinad commented Apr 18, 2023

This pull request replaces the previous recursive parallel_scan in the SYCL backend with a two-pass implementation, as we already use for Cuda and HIP and for SYCL reductions. This simplifies the code (and makes it more uniform) and reduces the memory footprint, since we only need to store intermediate results for all items and the group scans, not for recursive group scans.
Along the way, I made sure that all local operations use indices of type int, avoiding 64-bit index operations.
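
For readers unfamiliar with the two-pass idea, here is a minimal host-side sketch in plain C++. It is not the code from this pull request and does not use SYCL or Kokkos; the function name `two_pass_inclusive_scan` and the `group_size` parameter are hypothetical, and the groups are processed sequentially purely for illustration. It assumes a sum scan and `int`-sized inputs.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper (not Kokkos code): inclusive prefix sum computed in two
// passes over fixed-size groups. Indices are plain int, as in the description.
std::vector<long> two_pass_inclusive_scan(const std::vector<long>& in,
                                          int group_size) {
  assert(group_size > 0);
  const int n        = static_cast<int>(in.size());
  const int n_groups = (n + group_size - 1) / group_size;
  std::vector<long> out(in.size());
  std::vector<long> group_offsets(n_groups);

  // Pass 1: scan each group independently and record the group's total.
  for (int g = 0; g < n_groups; ++g) {
    const int begin = g * group_size;
    const int end   = std::min(begin + group_size, n);
    long running    = 0;
    for (int i = begin; i < end; ++i) {
      running += in[i];
      out[i] = running;
    }
    group_offsets[g] = running;
  }

  // Between the passes: an exclusive scan over the group totals turns each
  // entry into the offset that its group still has to add.
  long offset = 0;
  for (int g = 0; g < n_groups; ++g) {
    const long total = group_offsets[g];
    group_offsets[g] = offset;
    offset += total;
  }

  // Pass 2: add each group's offset to its locally scanned values.
  for (int g = 0; g < n_groups; ++g) {
    const int begin = g * group_size;
    const int end   = std::min(begin + group_size, n);
    for (int i = begin; i < end; ++i) out[i] += group_offsets[g];
  }
  return out;
}
```

The only extra storage in this sketch is one value per item and one value per group, which roughly mirrors the memory argument made above; a device implementation would run the groups in parallel.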

A second improvement is switching to auto-detection of the work group size, as we do for reductions, by querying a dummy kernel for the maximum group size.

Finally, this fixes a couple of unit tests that were failing with SYCL+Cuda since #5707.

@masterleinad masterleinad force-pushed the sycl_improve_parallel_scan_new branch from e3c9427 to 6cc246a Compare April 18, 2023 17:18
@masterleinad masterleinad force-pushed the sycl_improve_parallel_scan_new branch from 3354928 to e07be73 Compare April 20, 2023 21:03
@masterleinad masterleinad changed the title [WIP] SYCL: Improve and simplify parallel_scan implementation SYCL: Improve and simplify parallel_scan implementation Apr 20, 2023
@masterleinad masterleinad marked this pull request as ready for review April 20, 2023 21:16
@masterleinad
Contributor Author

Requires #6065.

@masterleinad masterleinad force-pushed the sycl_improve_parallel_scan_new branch from 2144d9b to bdaa12c Compare April 28, 2023 18:16
Member

@crtrott crtrott left a comment

As far as I can tell, this looks good.

(global_range + max_subgroup_size - 1) / max_subgroup_size;

- const auto local_range = sg.get_local_range()[0];
+ const int local_range = sg.get_local_range()[0];
Member

Did SYCL change so that you need int now?

Contributor Author

No, it's not necessary to do that but I have seen that SYCL is very sensitive to 64-bit index calculations, and local indices surely will never require that.

Contributor

Just to be clear, auto deduces size_t, not int. I agree we don't need that much of an index range.
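
To make the deduction point concrete, here is a small sketch in plain C++. `MockRange` and `MockSubGroup` are hypothetical stand-ins for the SYCL range and sub-group types (assuming, as stated above, that indexing the range yields `std::size_t`), and the value 32 is an arbitrary illustrative sub-group size.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical stand-in for a one-dimensional SYCL range: indexing it yields
// std::size_t.
struct MockRange {
  std::size_t extent;
  std::size_t operator[](int) const { return extent; }
};

// Hypothetical stand-in for a sub-group; 32 is an arbitrary example size.
struct MockSubGroup {
  MockRange get_local_range() const { return MockRange{32}; }
};

inline int local_range_as_int(const MockSubGroup& sg) {
  // With auto, the variable takes the 64-bit unsigned type of the expression.
  const auto deduced = sg.get_local_range()[0];
  static_assert(
      std::is_same<decltype(deduced), const std::size_t>::value,
      "auto deduces std::size_t here, not int");

  // Spelling out int keeps later local index arithmetic in 32 bits.
  const int narrowed = static_cast<int>(deduced);
  return narrowed;
}
```

The explicit cast is only there to make the narrowing visible; the changed line in the diff relies on the implicit conversion instead.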

@masterleinad masterleinad requested a review from nliber May 5, 2023 21:37
Member

@dalg24 dalg24 left a comment

I do not like the complexity added with the "lambda factory", but I will merge since others seem to think it is fine.

@dalg24
Member

dalg24 commented May 9, 2023

The SYCL build passed. Ignoring the rest.

@dalg24 dalg24 merged commit 6ede773 into kokkos:develop May 9, 2023
25 of 26 checks passed
@masterleinad masterleinad mentioned this pull request May 25, 2023
5 participants