Deduce workgroup size for SYCL parallel_reduce RangePolicy #5227

Merged
merged 3 commits into from
Aug 18, 2022

masterleinad
Contributor

The crucial new lines of code are:

        auto dummy_reduction_lambda =
            reduction_lambda_factory({1, cgh}, num_teams_done, nullptr);

        static sycl::kernel kernel = [&] {
          sycl::kernel_id functor_kernel_id =
              sycl::get_kernel_id<decltype(dummy_reduction_lambda)>();
          auto kernel_bundle =
              sycl::get_kernel_bundle<sycl::bundle_state::executable>(
                  q.get_context(), std::vector{functor_kernel_id});
          return kernel_bundle.get_kernel(functor_kernel_id);
        }();
        auto multiple = kernel.get_info<sycl::info::kernel_device_specific::
                                            preferred_work_group_size_multiple>(
            q.get_device());
        auto max =
            kernel
                .get_info<sycl::info::kernel_device_specific::work_group_size>(
                    q.get_device());
        const size_t wgroup_size =
            static_cast<size_t>(max / multiple) * multiple;

We have a similar approach already for TeamPolicy to find a suitable vector length/subgroup size.
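
To make the last step of the snippet concrete: the work-group size is the largest multiple of the preferred multiple that does not exceed the kernel's maximum work-group size. A minimal host-only sketch in plain C++ (no SYCL required); the function name `round_down_to_multiple` and the sample values are illustrative assumptions, not part of the PR:

```cpp
#include <cassert>
#include <cstddef>

// Largest multiple of `multiple` that is <= `max`.
// Mirrors `(max / multiple) * multiple` from the snippet above.
// Hypothetical helper name; requires multiple > 0.
inline std::size_t round_down_to_multiple(std::size_t max,
                                          std::size_t multiple) {
  assert(multiple > 0);
  return (max / multiple) * multiple;  // integer division truncates
}
```

For example, a maximum of 1000 with a preferred multiple of 32 gives 992, and 1024 stays 1024. The result is 0 whenever the maximum is smaller than the multiple.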

@masterleinad
Contributor Author

The diff is best viewed without whitespace changes.

@masterleinad force-pushed the sycl_deduce_wgroup_size_reduce branch from 929cebe to 3297b04 on July 18, 2022 19:26
@masterleinad marked this pull request as ready for review on July 19, 2022 03:08
Member

@Rombur Rombur left a comment

It looks fine but the code could use more comments.

sycl::access::target::local>
num_teams_done(1, cgh);

auto dummy_reduction_lambda =
Member

Why is it called dummy?

Contributor Author

Because it gets passed dummy memory pointers.
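
A host-only sketch of why a dummy instance is enough: every closure returned by the same factory has the same type, so an instance built from placeholder arguments can stand in for the real one wherever only `decltype` is needed. The factory and names below are hypothetical and use plain C++ rather than the SYCL types from the PR:

```cpp
#include <cassert>
#include <type_traits>

// Returns true if the dummy and the real closure share one type
// and the real closure did its work.
inline bool dummy_and_real_share_type() {
  // Hypothetical factory: builds the "kernel" closure for a results pointer.
  auto factory = [](int* results) {
    return [results](int i) {
      if (results) results[i] = 2 * i;
    };
  };

  auto dummy = factory(nullptr);  // never invoked; only its type matters

  int storage[4] = {0, 0, 0, 0};
  auto real = factory(storage);   // same closure type, real pointer
  real(3);

  return std::is_same<decltype(dummy), decltype(real)>::value &&
         storage[3] == 6;
}
```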

sycl::accessor<unsigned int, 1, sycl::access::mode::read_write,
sycl::access::target::local>
num_teams_done,
sycl::device_ptr<value_type> results_ptr) mutable {
Member

Not getting -Wshadow warnings?

Contributor Author

Apparently not. 🙂
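
For readers unfamiliar with the concern: the lambda parameter `num_teams_done` has the same name as a variable in the enclosing scope, which is the kind of pattern `-Wshadow` is meant to flag. A minimal plain C++ sketch of the situation (names are illustrative); whether a particular compiler and version actually warns here is not something this example asserts:

```cpp
#include <cassert>

inline int shadow_demo() {
  int value = 1;  // variable in the enclosing scope

  // The parameter `value` hides the outer `value` inside the lambda body.
  auto f = [&](int value) {
    return value * 10;  // refers to the parameter, not the outer variable
  };

  return f(2) + value;  // outer `value` is still 1 here
}
```

Calling `shadow_demo()` yields 21: the body sees the argument 2, while the outer variable keeps its value.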

Member

Why does this need to be mutable?

Comment on lines +448 to +455
        static sycl::kernel kernel = [&] {
          sycl::kernel_id functor_kernel_id =
              sycl::get_kernel_id<decltype(dummy_reduction_lambda)>();
          auto kernel_bundle =
              sycl::get_kernel_bundle<sycl::bundle_state::executable>(
                  q.get_context(), std::vector{functor_kernel_id});
          return kernel_bundle.get_kernel(functor_kernel_id);
        }();
Member

You already have this code twice in Kokkos_SYCL_Parallel_Team.hpp.
Did you consider defining some kind of helper function for it?
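
One possible shape for such a helper, sketched in plain C++ so it runs without a SYCL toolchain. `KernelInfo`, `cached_kernel_info`, `lookup_count`, the functor types, and the placeholder values are all assumptions for illustration; a real version would presumably perform the `sycl::get_kernel_id` / `sycl::get_kernel_bundle` lookup inside the lambda. The point is that a function-local static in a template is initialized once per instantiated type:

```cpp
#include <cassert>
#include <cstddef>

// Counts how often the (stand-in) expensive lookup actually runs.
inline int& lookup_count() {
  static int n = 0;
  return n;
}

// Hypothetical result of a per-kernel query.
struct KernelInfo {
  std::size_t max_wg_size;
  std::size_t preferred_multiple;
};

// Hypothetical functor types standing in for two different kernels.
struct FunctorA {};
struct FunctorB {};

template <class Functor>
const KernelInfo& cached_kernel_info() {
  // Immediately invoked lambda: runs on the first call for each Functor type.
  static const KernelInfo info = [] {
    ++lookup_count();
    return KernelInfo{1024, 32};  // placeholder values, not real device data
  }();
  return info;
}
```

Calling `cached_kernel_info<FunctorA>()` twice performs the lookup once; a first call with `FunctorB` performs it again for that type. A caveat of this pattern is that anything captured by the initializing lambda is only read during that first call.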

Contributor Author

Not sure if it's worth it. In my opinion, this is pretty concise.

Member

I will not approve unless you refactor.

@masterleinad
Contributor Author

The CUDA-9.2-NVCC failure is unrelated.

@crtrott self-assigned this on Aug 17, 2022
Member

@crtrott crtrott left a comment

Why is that thing mutable?

@masterleinad
Contributor Author

@crtrott I dropped the mutable specifier for the lambdas.
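
As background to the question: `mutable` on a lambda makes its call operator non-const, which is only needed when the body modifies by-value captures. A small plain C++ sketch (function names are illustrative, not from the PR):

```cpp
#include <cassert>

// Without `mutable`, operator() is const: by-value captures are read-only.
inline int read_only_capture(int x) {
  auto f = [x] { return x + 1; };
  return f();
}

// With `mutable`, the lambda may modify its own copy of the capture.
inline int modified_capture(int x) {
  auto g = [x]() mutable {
    x += 1;
    return x;
  };
  g();
  return g();  // second call sees the copy already modified by the first
}
```

If a lambda body never writes to its by-value captures, as in `read_only_capture`, the specifier has no effect and can be dropped.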

@crtrott merged commit 82834e3 into kokkos:develop on Aug 18, 2022
@masterleinad mentioned this pull request on Sep 7, 2022