Improve performance for SYCL parallel_reduce #3732

Merged
merged 4 commits into kokkos:develop on Feb 4, 2021

Conversation

masterleinad (Contributor) commented on Jan 19, 2021

  • let each thread load two values (see the sketch after this list)
  • use a better memory access pattern
  • let the last block do the final reduction

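As an illustration of the first two bullets (not the actual Kokkos implementation): a minimal standalone SYCL sketch in which each work-item accumulates values_per_thread elements spaced a work-group apart, so that neighbouring work-items read neighbouring addresses, before the usual tree reduction in local memory. All names here (partial_sums, in, out, n_wgroups) are made up for the sketch; in and out are assumed to be USM pointers.

// Sketch only: plain SYCL 2020, not Kokkos code. Writes one partial sum per
// work-group into out[]; a second pass (or recursive call) would combine them,
// and the caller is expected to wait on the queue.
#include <sycl/sycl.hpp>

void partial_sums(sycl::queue& q, const float* in, float* out, size_t size) {
  constexpr size_t wgroup_size        = 128;
  constexpr size_t values_per_thread  = 2;
  const size_t elements_per_group     = wgroup_size * values_per_thread;
  const size_t n_wgroups = (size + elements_per_group - 1) / elements_per_group;

  q.submit([&](sycl::handler& cgh) {
    sycl::local_accessor<float, 1> local_mem(sycl::range<1>(wgroup_size), cgh);
    cgh.parallel_for(
        sycl::nd_range<1>(sycl::range<1>(n_wgroups * wgroup_size),
                          sycl::range<1>(wgroup_size)),
        [=](sycl::nd_item<1> item) {
          const size_t local_id  = item.get_local_id(0);
          const size_t group_id  = item.get_group(0);
          const size_t global_id = group_id * elements_per_group + local_id;

          // Each work-item loads values_per_thread elements spaced wgroup_size
          // apart, so consecutive work-items touch consecutive addresses.
          const size_t upper_bound =
              global_id + values_per_thread * wgroup_size < size
                  ? global_id + values_per_thread * wgroup_size
                  : size;
          float partial = 0.f;
          for (size_t i = global_id; i < upper_bound; i += wgroup_size)
            partial += in[i];

          // Standard tree reduction of the per-item partials in local memory.
          local_mem[local_id] = partial;
          for (size_t stride = wgroup_size / 2; stride > 0; stride /= 2) {
            sycl::group_barrier(item.get_group());
            if (local_id < stride)
              local_mem[local_id] += local_mem[local_id + stride];
          }
          if (local_id == 0) out[group_id] = local_mem[0];
        });
  });
}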
masterleinad force-pushed the sycl_reduce_performance branch 2 times, most recently from 7c46954 to d238a47, on January 20, 2021 at 20:23
masterleinad changed the title from "[WIP] Improve performance for SYCL parallel_reduce" to "Improve performance for SYCL parallel_reduce" on Jan 27, 2021
masterleinad (Contributor, Author):

Benchmarks showed better performance for the recursive implementation, so I dropped the commit that lets the last block do the final reduction. A backup of that implementation can be found at https://github.com/masterleinad/kokkos/tree/sycl_reduce_performance_backup.

masterleinad force-pushed the sycl_reduce_performance branch 2 times, most recently from 19e1c92 to 1f32a62, on January 31, 2021 at 22:17
dalg24 (Member) left a comment

It is hard to parse what is going on in this reduction algorithm :/

const typename Policy::index_type upper_bound =
    (global_id + values_per_thread * wgroup_size) < size
        ? global_id + values_per_thread * wgroup_size
        : size;
dalg24 (Member):

std::min might be more readable than this ternary op
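For illustration, the quoted expression rewritten with std::min (assuming it is usable in the device code here, which is exactly what the reply below sets out to check):

// Hypothetical rewrite of the quoted fragment; the template argument is spelled
// out so both arguments of std::min have the same type.
const typename Policy::index_type upper_bound =
    std::min<typename Policy::index_type>(
        global_id + values_per_thread * wgroup_size, size);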

masterleinad (Contributor, Author):

Yes, I wasn't quite sure if that's supported on the device. I can check.

masterleinad (Contributor, Author):

> It is hard to parse what is going on in this reduction algorithm :/

I thought I didn't change it too much, but I'm happy to add some more comments.

kokkos deleted a comment from dalg24 on Feb 3, 2021
const auto init_size = std::max<std::size_t>(
    ((size + values_per_thread - 1) / values_per_thread + wgroup_size - 1) /
        wgroup_size,
    1);
const auto value_count =
    FunctorValueTraits<ReducerTypeFwd, WorkTagFwd>::value_count(
        selected_reducer);
const auto results_ptr =
    static_cast<pointer_type>(Experimental::SYCLSharedUSMSpace().allocate(
        "SYCL parallel_reduce result storage",
        sizeof(*m_result_ptr) * std::max(value_count, 1u) * init_size));
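To make the quoted formula concrete: init_size is the number of work-groups once each thread handles values_per_thread elements, i.e. ceil(ceil(size / values_per_thread) / wgroup_size). For example, with size = 10000, values_per_thread = 2, and wgroup_size = 128 this gives ceil(5000 / 128) = 40 partial results, and the std::max keeps at least one slot for empty ranges.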
masterleinad (Contributor, Author):

Yes, sizeof(value_type) should be the same. Let me try that.

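The deleted comment itself is not visible, but going by this reply it presumably suggested allocating in terms of sizeof(value_type) rather than sizeof(*m_result_ptr). Purely as an illustration, that change to the quoted allocation would read:

// Illustrative only; per the reply above, sizeof(value_type) and
// sizeof(*m_result_ptr) should be the same here.
const auto results_ptr =
    static_cast<pointer_type>(Experimental::SYCLSharedUSMSpace().allocate(
        "SYCL parallel_reduce result storage",
        sizeof(value_type) * std::max(value_count, 1u) * init_size));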
masterleinad added the Blocks Promotion label (Overview issue for release-blocking bugs) on Feb 3, 2021
crtrott (Member) left a comment

I think the values-per-thread approach is good for dot products but bad for physics codes with reductions.

const auto init_size =
    std::max<std::size_t>((size + wgroup_size - 1) / wgroup_size, 1);
constexpr size_t wgroup_size       = 128;
constexpr size_t values_per_thread = 2;
crtrott (Member):

This values_per_thread is probably right for something like a dot product, but probably wrong for LAMMPS, where the reduction is over rather beefy kernels that often don't have enough parallelism (think 30k atoms, each looping serially over its neighbors and then doing a reduction over all of them). Not sure if you care right now.
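As an illustration using the constants quoted above: 30,000 reduction iterations with values_per_thread = 2 and wgroup_size = 128 amount to only about 15,000 active work-items (roughly 118 work-groups of 128), which may not be enough to keep a large GPU busy when each iteration is already a heavy serial loop.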

dalg24 merged commit fe74423 into kokkos:develop on Feb 4, 2021
Labels: Blocks Promotion (Overview issue for release-blocking bugs)
3 participants