Use alternative SYCL parallel_reduce implementation #3671
Conversation
Retest this please.

Force-pushed from 49aaecf to f728e8f.

Retest this please.
Ok, why does the Intel reduction not work for us? What is the issue? Did we report that to Intel? And should we try the trick in CUDA where we have a "done counter" so that the last work group can do the final reduction instead of doing another kernel? Also can't the final be applied by the last thread who writes out the final result instead of launching another kernel?
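For context, the CUDA-style "done counter" trick mentioned above can be sketched in plain C++, with `std::thread` standing in for work groups (all names here are illustrative, not Kokkos or CUDA API): every group writes its partial result and atomically increments a counter, and the group that observes the final count performs the last combine, avoiding a second kernel launch.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Sketch of the "done counter" pattern: the last work group to finish
// does the final reduction in place of a follow-up kernel.
int reduce_with_done_counter(const std::vector<int>& data, int num_groups) {
  std::vector<int> partial(num_groups, 0);
  std::atomic<int> done{0};
  std::atomic<int> result{0};
  const int chunk = static_cast<int>(data.size()) / num_groups;

  std::vector<std::thread> groups;
  for (int g = 0; g < num_groups; ++g) {
    groups.emplace_back([&, g] {
      // Each "work group" reduces its own chunk.
      int sum = 0;
      for (int i = g * chunk; i < (g + 1) * chunk; ++i) sum += data[i];
      partial[g] = sum;
      // The atomic increment both counts finished groups and, via its
      // seq_cst ordering, makes the partial[] writes visible to the
      // last group, which performs the final combine.
      if (done.fetch_add(1) + 1 == num_groups) {
        int total = 0;
        for (int p : partial) total += p;
        result.store(total);
      }
    });
  }
  for (auto& t : groups) t.join();
  return result.load();
}
```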
```cpp
const typename Policy::index_type id =
    static_cast<typename Policy::index_type>(item.get_id()) +
    policy.begin();
if constexpr (std::is_same<WorkTag, void>::value)
```
Ah, I LOVE this. Wish we could do things like that everywhere already!!

We have some problems where the global sizes are not divisible by the local workgroup sizes, despite us setting local sizes to 1 in all the places we have access to. Also, debugging while supporting the whole parallel_reduce interface was easier this way. If necessary, I can see if we can go back to the Intel implementation once the simple case is fixed.
@nliber Should have reported to Intel.
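The `if constexpr` tag dispatch shown in the snippet above can be illustrated with a minimal standalone sketch (the names `run`, `MyTag`, and the functors are made up for illustration; this is not the Kokkos implementation): one code path invokes the functor with or without a work tag, with no SFINAE overloads needed.

```cpp
#include <type_traits>

struct MyTag {};

// A functor that expects a leading work tag.
struct TaggedFunctor {
  void operator()(MyTag, int i, int& sum) const { sum += 2 * i; }
};
// A functor with no work tag.
struct PlainFunctor {
  void operator()(int i, int& sum) const { sum += i; }
};

// C++17 `if constexpr` discards the non-taken branch at compile time,
// so neither functor needs to provide the other call signature.
template <class WorkTag, class Functor>
int run(const Functor& f, int n) {
  int sum = 0;
  for (int i = 0; i < n; ++i) {
    if constexpr (std::is_same<WorkTag, void>::value)
      f(i, sum);             // untagged call
    else
      f(WorkTag{}, i, sum);  // tagged call
  }
  return sum;
}
```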
For me, all tests are now passing so I can look into optimizing it a little bit.

Force-pushed from 9982d1f to 0ebdc77.

I am happy with the current status.
```cpp
}  // namespace

TEST(TEST_CATEGORY, reduce_device_view_range_policy) {
#ifdef KOKKOS_ENABLE_SYCL
  int N = 100 * 1024 * 1024;
```
Why did you reduce the size?
I was running out of memory while allocating.
```diff
@@ -141,7 +141,7 @@ class RuntimeReduceFunctor {
   void operator()(size_type iwork, ScalarType dst[]) const {
     const size_type tmp[3] = {1, iwork + 1, nwork - iwork};

-    for (size_type i = 0; i < value_count; ++i) {
+    for (size_type i = 0; i < static_cast<size_type>(value_count); ++i) {
```
I would have changed the declaration and updated `init` and `join`. (Not blocking)
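The reviewer's alternative, changing the member's declared type so the loops need no cast, might look roughly like this (a simplified stand-in, not the actual Kokkos functor; `size_type` and the member layout are assumptions for illustration):

```cpp
#include <cstddef>

using size_type = std::size_t;

// With value_count declared as the loop's unsigned type up front,
// operator(), init, and join can all compare without a static_cast.
struct RuntimeReduceFunctor {
  size_type value_count;  // previously a signed type, forcing casts in loops

  void init(long dst[]) const {
    for (size_type i = 0; i < value_count; ++i) dst[i] = 0;
  }
  void join(long dst[], const long src[]) const {
    for (size_type i = 0; i < value_count; ++i) dst[i] += src[i];
  }
};
```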
```cpp
  for (member_type i = m_policy.begin(); i < e; ++i) m_functor(i, update);
}

template <typename T>
struct HasJoin {
```
What is the issue with `ReduceFunctorHasJoin`?
I had (more) trouble with the current detection mechanism and only require a specific signature here. More concretely, the current implementation doesn't work if there are multiple overloads.
Did you consider updating `ReduceFunctorHasJoin`?
I can't do that since `ReduceFunctorHasJoin` detects whether there is any overload; it doesn't check for a specific signature. Since we expect a certain signature anyway, I could try asking for a specific signature in all the places where `ReduceFunctorHasJoin` is used instead. I would prefer a different pull request for that, though.
```cpp
const auto results_ptr =
    static_cast<pointer_type>(Experimental::SYCLSharedUSMSpace().allocate(
        "SYCL parallel_reduce result storage",
        sizeof(*m_result_ptr) * std::max(value_count, 1u) * init_size));
```
Can `value_count` actually be zero?
Yes, we check that in `kokkos/core/unit_test/TestReduce.hpp`, lines 441 to 461 at 8c24107:
```cpp
for (unsigned count = 0; count < CountLimit; ++count) {
  result_type result("result", count);
  result_host_type host_result = Kokkos::create_mirror(result);

  // Test result to host pointer:
  std::string str("TestKernelReduce");
  if (count % 2 == 0) {
    Kokkos::parallel_reduce(nw, functor_type(nw, count),
                            host_result.data());
  } else {
    Kokkos::parallel_reduce(str, nw, functor_type(nw, count),
                            host_result.data());
  }

  for (unsigned j = 0; j < count; ++j) {
    const uint64_t correct = 0 == j % 3 ? nw : nsum;
    ASSERT_EQ(host_result(j), (ScalarType)correct);
    host_result(j) = 0;
  }
}
```
Force-pushed from e3d5e35 to fbe6b50.
I looked over it and it looks OK. All the change comments that come to mind are obsolete, given that we need a fundamental algorithmic overhaul to make this not be really slow, i.e., move away from the iterative approach.
Yes, for me the point here really is to get the functionality implemented. We can look into performance later.
Approving, with the comment that this is way suboptimal and needs major revisions down the line.
Force-pushed from ce9b0dc to fef7e22.

Force-pushed from fef7e22 to aa4c886.
This implementation tries to side-step some issues with the Intel implementation and should also make it easier to enable all the Reduce tests.