Use alternative SYCL parallel_reduce implementation #3671
Conversation
Retest this please.

Force-pushed from 49aaecf to f728e8f.

Retest this please.
Ok, why does the Intel reduction not work for us? What is the issue? Did we report that to Intel? And should we try the trick in CUDA where we have a "done counter" so that the last work group can do the final reduction instead of doing another kernel? Also can't the final be applied by the last thread who writes out the final result instead of launching another kernel?
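For context, the CUDA-style "done counter" trick mentioned above can be sketched in plain C++, with `std::thread` standing in for work groups (all names here are illustrative, not Kokkos or CUDA API): every group writes its partial result and atomically increments a counter, and the group that observes the final count performs the last combine, avoiding a second kernel launch.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Sketch of the "done counter" pattern: the last work group to finish
// does the final reduction in place of a follow-up kernel.
int reduce_with_done_counter(const std::vector<int>& data, int num_groups) {
  std::vector<int> partial(num_groups, 0);
  std::atomic<int> done{0};
  std::atomic<int> result{0};
  const int chunk = static_cast<int>(data.size()) / num_groups;

  std::vector<std::thread> groups;
  for (int g = 0; g < num_groups; ++g) {
    groups.emplace_back([&, g] {
      // Each "work group" reduces its own chunk.
      int sum = 0;
      for (int i = g * chunk; i < (g + 1) * chunk; ++i) sum += data[i];
      partial[g] = sum;
      // The atomic increment both counts finished groups and, via its
      // seq_cst ordering, makes the partial[] writes visible to the
      // last group, which performs the final combine.
      if (done.fetch_add(1) + 1 == num_groups) {
        int total = 0;
        for (int p : partial) total += p;
        result.store(total);
      }
    });
  }
  for (auto& t : groups) t.join();
  return result.load();
}
```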
```cpp
const typename Policy::index_type id =
    static_cast<typename Policy::index_type>(item.get_id()) +
    policy.begin();
if constexpr (std::is_same<WorkTag, void>::value)
```
Ah, I LOVE this. Wish we could do things like that everywhere already!!

We have some problems where the global sizes are not divisible by the local workgroup sizes, despite us setting local sizes to 1 in all the places we have access to. Also, debugging while supporting the whole parallel_reduce interface was easier this way. If necessary, I can see if we can go back to the Intel implementation once the simple case is fixed.
@nliber Should have reported to Intel.
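The `if constexpr` tag dispatch shown in the snippet above can be illustrated with a minimal standalone sketch (the names `run`, `MyTag`, and the functors are made up for illustration; this is not the Kokkos implementation): one code path invokes the functor with or without a work tag, with no SFINAE overloads needed.

```cpp
#include <type_traits>

struct MyTag {};

// A functor that expects a leading work tag.
struct TaggedFunctor {
  void operator()(MyTag, int i, int& sum) const { sum += 2 * i; }
};
// A functor with no work tag.
struct PlainFunctor {
  void operator()(int i, int& sum) const { sum += i; }
};

// C++17 `if constexpr` discards the non-taken branch at compile time,
// so neither functor needs to provide the other call signature.
template <class WorkTag, class Functor>
int run(const Functor& f, int n) {
  int sum = 0;
  for (int i = 0; i < n; ++i) {
    if constexpr (std::is_same<WorkTag, void>::value)
      f(i, sum);             // untagged call
    else
      f(WorkTag{}, i, sum);  // tagged call
  }
  return sum;
}
```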
For me, all tests are now passing so I can look into optimizing it a little bit.

Force-pushed from 9982d1f to 0ebdc77.

I am happy with the current status.
```cpp
}  // namespace

TEST(TEST_CATEGORY, reduce_device_view_range_policy) {
#ifdef KOKKOS_ENABLE_SYCL
  int N = 100 * 1024 * 1024;
```
Why did you reduce the size?
I was running out of memory while allocating.
```diff
@@ -141,7 +141,7 @@ class RuntimeReduceFunctor {
   void operator()(size_type iwork, ScalarType dst[]) const {
     const size_type tmp[3] = {1, iwork + 1, nwork - iwork};

-    for (size_type i = 0; i < value_count; ++i) {
+    for (size_type i = 0; i < static_cast<size_type>(value_count); ++i) {
```
I would have changed the declaration and updated `init` and `join`. (Not blocking)
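The reviewer's alternative, changing the member's declared type so the loops need no cast, might look roughly like this (a simplified stand-in, not the actual Kokkos functor; `size_type` and the member layout are assumptions for illustration):

```cpp
#include <cstddef>

using size_type = std::size_t;

// With value_count declared as the loop's unsigned type up front,
// operator(), init, and join can all compare without a static_cast.
struct RuntimeReduceFunctor {
  size_type value_count;  // previously a signed type, forcing casts in loops

  void init(long dst[]) const {
    for (size_type i = 0; i < value_count; ++i) dst[i] = 0;
  }
  void join(long dst[], const long src[]) const {
    for (size_type i = 0; i < value_count; ++i) dst[i] += src[i];
  }
};
```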
```cpp
  for (member_type i = m_policy.begin(); i < e; ++i) m_functor(i, update);
}

template <typename T>
struct HasJoin {
```
What is the issue with `ReduceFunctorHasJoin`?
I had (more) trouble with the current detection mechanism and only require a specific signature here. More concretely, the current implementation doesn't work if there are multiple overloads.
Did you consider updating `ReduceFunctorHasJoin`?
I can't do that since `ReduceFunctorHasJoin` detects whether there is any overload; it doesn't check for a specific signature. Since we expect a certain signature anyway, I could try asking for a specific signature in all the places where `ReduceFunctorHasJoin` is used instead. I would prefer a different pull request for that, though.
```cpp
const auto results_ptr =
    static_cast<pointer_type>(Experimental::SYCLSharedUSMSpace().allocate(
        "SYCL parallel_reduce result storage",
        sizeof(*m_result_ptr) * std::max(value_count, 1u) * init_size));
```
Can `value_count` actually be zero?
Yes, we check that in `kokkos/core/unit_test/TestReduce.hpp`, lines 441 to 461 at 8c24107:
```cpp
for (unsigned count = 0; count < CountLimit; ++count) {
  result_type result("result", count);
  result_host_type host_result = Kokkos::create_mirror(result);

  // Test result to host pointer:
  std::string str("TestKernelReduce");
  if (count % 2 == 0) {
    Kokkos::parallel_reduce(nw, functor_type(nw, count),
                            host_result.data());
  } else {
    Kokkos::parallel_reduce(str, nw, functor_type(nw, count),
                            host_result.data());
  }

  for (unsigned j = 0; j < count; ++j) {
    const uint64_t correct = 0 == j % 3 ? nw : nsum;
    ASSERT_EQ(host_result(j), (ScalarType)correct);
    host_result(j) = 0;
  }
}
```
Force-pushed from e3d5e35 to fbe6b50.
I looked over it and it looks OK. All the change comments that come to mind are obsolete, given that we need a fundamental algorithmic overhaul to make this not be really slow, i.e., move away from the iterative approach.
Yes, for me the point here really is to get the functionality implemented. We can look into performance later.
Approving, with the comment that this is way suboptimal and needs major revisions down the line.
Force-pushed from ce9b0dc to fef7e22.

Force-pushed from fef7e22 to aa4c886.
This implementation tries to side-step some issues with the Intel implementation and should also make it easier to enable all the Reduce tests.