
Use alternative SYCL parallel_reduce implementation #3671

Merged: 2 commits merged into kokkos:develop on Jan 19, 2021

Conversation

masterleinad
Contributor

This implementation tries to side-step some issues with the Intel implementation and should also make it easier to enable all the Reduce tests.

@masterleinad
Contributor Author

Retest this please.

@masterleinad
Contributor Author

Retest this please.

Member

@crtrott crtrott left a comment


Ok, why does the Intel reduction not work for us? What is the issue? Did we report that to Intel? And should we try the trick in CUDA where we have a "done counter" so that the last work group can do the final reduction instead of doing another kernel? Also, can't the final be applied by the last thread that writes out the final result instead of launching another kernel?

core/src/SYCL/Kokkos_SYCL_Parallel_Reduce.hpp (outdated)
const typename Policy::index_type id =
static_cast<typename Policy::index_type>(item.get_id()) +
policy.begin();
if constexpr (std::is_same<WorkTag, void>::value)
Member

Ah I LOVE this. Wish we could do things like that everywhere already!!
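
For context, the excerpt above is the compile-time tag-dispatch idiom that this comment is praising. A minimal sketch of the pattern, with placeholder names (call_functor, Functor) that are assumptions rather than the actual Kokkos_SYCL_Parallel_Reduce.hpp code:

```cpp
#include <type_traits>

// Sketch only: `if constexpr` discards the untaken branch at compile time,
// so the functor needs to provide just the call operator that matches its
// work tag (untagged when WorkTag is void, tagged otherwise).
template <class Functor, class WorkTag, class IndexType, class ValueType>
void call_functor(const Functor& functor, IndexType id, ValueType& value) {
  if constexpr (std::is_same<WorkTag, void>::value)
    functor(id, value);             // operator()(index, value&)
  else
    functor(WorkTag(), id, value);  // operator()(Tag, index, value&)
}
```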

@masterleinad
Contributor Author

> Ok, why does the Intel reduction not work for us? What is the issue?

We have some problems where the global sizes are not divisible by the local workgroup sizes, despite us setting local sizes to 1 in all the places we have access to. Also, debugging while supporting the whole parallel_reduce interface was easier this way. If necessary, I can see whether we can go back to the Intel implementation once the simple case is fixed.

> Did we report that to Intel?

@nliber should have reported it to Intel.

> And should we try the trick in CUDA where we have a "done counter" so that the last work group can do the final reduction instead of doing another kernel? Also, can't the final be applied by the last thread that writes out the final result instead of launching another kernel?

For me, all tests are now passing so I can look into optimizing it a little bit.
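
For reference, a minimal standard-C++ sketch of the "done counter" idea discussed above. This is purely illustrative (plain threads and std::atomic rather than the Kokkos CUDA or SYCL code, and all names are made up): each group reduces its own chunk into a partial result and atomically bumps a counter, and whichever group arrives last combines the partials, so no second pass or kernel launch is needed.

```cpp
#include <atomic>
#include <cassert>
#include <numeric>
#include <thread>
#include <vector>

int main() {
  const int num_groups = 8;
  const int n = 1000;
  std::vector<int> data(n, 1);

  std::vector<long> partials(num_groups, 0);  // one partial result per group
  std::atomic<int> done_count{0};             // the "done counter"
  long final_result = 0;

  auto group_work = [&](int g) {
    // Each "work group" reduces its contiguous chunk.
    const int begin = g * n / num_groups;
    const int end = (g + 1) * n / num_groups;
    partials[g] = std::accumulate(data.begin() + begin, data.begin() + end, 0L);

    // Whoever increments the counter last does the final reduction; the
    // acquire/release ordering makes all partial results visible to it.
    if (done_count.fetch_add(1, std::memory_order_acq_rel) == num_groups - 1)
      final_result = std::accumulate(partials.begin(), partials.end(), 0L);
  };

  std::vector<std::thread> groups;
  for (int g = 0; g < num_groups; ++g) groups.emplace_back(group_work, g);
  for (auto& t : groups) t.join();

  assert(final_result == n);
  return 0;
}
```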

@masterleinad masterleinad marked this pull request as ready for review December 21, 2020 23:06
@masterleinad masterleinad changed the title [WIP] Use alternative SYCL parallel_reduce implementation Use alternative SYCL parallel_reduce implementation Dec 21, 2020
@masterleinad
Contributor Author

I am happy with the current status.


} // namespace

TEST(TEST_CATEGORY, reduce_device_view_range_policy) {
#ifdef KOKKOS_ENABLE_SYCL
int N = 100 * 1024 * 1024;
Member

Why did you reduce the size?

Contributor Author

I was running out of memory while allocating.

core/unit_test/TestCXX11.hpp (outdated)
@@ -141,7 +141,7 @@ class RuntimeReduceFunctor {
   void operator()(size_type iwork, ScalarType dst[]) const {
     const size_type tmp[3] = {1, iwork + 1, nwork - iwork};

-    for (size_type i = 0; i < value_count; ++i) {
+    for (size_type i = 0; i < static_cast<size_type>(value_count); ++i) {
Member

I would have changed the declaration and updated init and join. (Not blocking)
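
A hypothetical sketch of that alternative, under the assumption that the test functor's members look roughly like the excerpt above (the loop body and exact member types here are guesses, not the actual TestCXX11.hpp code): widen the declaration of value_count to the loop's size_type instead of casting inside operator(), and keep init and join consistent with it.

```cpp
#include <cstddef>

template <typename ScalarType>
struct RuntimeReduceFunctorSketch {
  using size_type = std::size_t;  // assumed; stands in for the test's size_type

  size_type nwork;
  size_type value_count;  // declared as size_type, so no cast is needed below

  void operator()(size_type iwork, ScalarType dst[]) const {
    const size_type tmp[3] = {1, iwork + 1, nwork - iwork};
    for (size_type i = 0; i < value_count; ++i) dst[i] += tmp[i % 3];
  }
  // init(ScalarType dst[]) and join(ScalarType dst[], const ScalarType src[])
  // would be updated to loop over size_type as well.
};
```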

for (member_type i = m_policy.begin(); i < e; ++i) m_functor(i, update);
}
template <typename T>
struct HasJoin {
Member

What is the issue with ReduceFunctorHasJoin?

Contributor Author

I had (more) trouble with the current detection mechanism, so I only require a specific signature here. More concretely, the current implementation doesn't work if there are multiple overloads.

Member

Did you consider updating ReduceFunctorHasJoin?

Contributor Author

I can't do that since ReduceFunctorHasJoin only detects whether there is any overload; it doesn't check for a specific signature.
Since we expect a certain signature anyway, I could try asking for a specific signature in all the places where ReduceFunctorHasJoin is used and see whether that works. I would prefer a different pull request for that, though.
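
To make the distinction concrete, here is a hypothetical sketch (not the actual ReduceFunctorHasJoin or HasJoin code, and the pointer-based signature is only for illustration) of a detection trait that asks for one specific join signature. Because the check is expression-based, it keeps working when join has several overloads, whereas forming &T::join for an overload set is ambiguous.

```cpp
#include <type_traits>
#include <utility>

// True iff T has a const member join callable as
// join(ValueType volatile*, const ValueType volatile*).
template <typename T, typename ValueType, typename = void>
struct has_join_with_signature : std::false_type {};

template <typename T, typename ValueType>
struct has_join_with_signature<
    T, ValueType,
    std::void_t<decltype(std::declval<const T&>().join(
        std::declval<ValueType volatile*>(),
        std::declval<const ValueType volatile*>()))>> : std::true_type {};

// A functor with two join overloads: &Functor::join would be ambiguous,
// but the signature-specific check above is not.
struct Functor {
  void join(double volatile* dst, const double volatile* src) const {
    *dst += *src;
  }
  void join(float volatile* dst, const float volatile* src) const {
    *dst += *src;
  }
};

static_assert(has_join_with_signature<Functor, double>::value,
              "the double overload is found");
static_assert(!has_join_with_signature<Functor, long>::value,
              "no matching overload for long");
```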

const auto results_ptr =
static_cast<pointer_type>(Experimental::SYCLSharedUSMSpace().allocate(
"SYCL parallel_reduce result storage",
sizeof(*m_result_ptr) * std::max(value_count, 1u) * init_size));
Member

Can value_count actually be zero?

Contributor Author

Yes, we check that in:

for (unsigned count = 0; count < CountLimit; ++count) {
  result_type result("result", count);
  result_host_type host_result = Kokkos::create_mirror(result);
  // Test result to host pointer:
  std::string str("TestKernelReduce");
  if (count % 2 == 0) {
    Kokkos::parallel_reduce(nw, functor_type(nw, count),
                            host_result.data());
  } else {
    Kokkos::parallel_reduce(str, nw, functor_type(nw, count),
                            host_result.data());
  }
  for (unsigned j = 0; j < count; ++j) {
    const uint64_t correct = 0 == j % 3 ? nw : nsum;
    ASSERT_EQ(host_result(j), (ScalarType)correct);
    host_result(j) = 0;
  }
}

@masterleinad masterleinad force-pushed the sycl_reduce_alternative_new branch 2 times, most recently from e3d5e35 to fbe6b50 Compare December 29, 2020 21:53
@masterleinad
Contributor Author

@crtrott @nliber Any chance that we can move forward with this?

@crtrott
Member

crtrott commented Jan 13, 2021

I looked over it and it looks ok. All the change comments that come to mind are obsolete, given that we need a fundamental algorithmic overhaul to make this not be really slow, i.e., to not use an iterative approach.

@masterleinad
Contributor Author

> I looked over it and it looks ok. All the change comments that come to mind are obsolete, given that we need a fundamental algorithmic overhaul to make this not be really slow, i.e., to not use an iterative approach.

Yes, for me the point here really is to get the functionality implemented. We can look into performance later.

@masterleinad
Contributor Author

@crtrott needs to approve formally after already approving in a comment. This could use more reviews, in particular by @nliber. Ready from my side.

Member

@crtrott crtrott left a comment


Approving with the comment that this is way suboptimal and needs major revisions down the line.

@masterleinad masterleinad force-pushed the sycl_reduce_alternative_new branch 3 times, most recently from ce9b0dc to fef7e22 Compare January 19, 2021 16:53
@dalg24 dalg24 merged commit 9d090c5 into kokkos:develop Jan 19, 2021