Changes for indirect launch of SYCL parallel reduce #3511
Which of these commits contains the changes you really want? I am happy to help clean this up.
27f31e1
to
bf368ed
You will need to fix the indentation.
Apparently, the indentation is still not correct. Are you using
After #3480 has been merged, this should be rebased.
Hmmm... There are a lot of unrelated changes and conflicting files here. Which of the commits do you actually care about? (Maybe just rebase onto upstream/develop.)
9a4e0c2
to
be75ca3
I am mostly fine with this pull request, but I am a little concerned about memory being used after the underlying allocation is freed, and about whether we can or want to provide some kind of safeguard mechanism.
core/unit_test/incremental/Test11a_ParallelFor_TeamThreadRange.hpp
core/unit_test/incremental/Test11b_ParallelFor_TeamVectorRange.hpp
core/unit_test/incremental/Test11c_ParallelFor_ThreadVectorRange.hpp
Overall this looks OK to me after rebasing. I leave it up to you whether or not to change the return type of memcpy_from to std::unique_ptr<T>.
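For readers unfamiliar with the suggestion, here is a minimal sketch of what a copy-back helper returning std::unique_ptr<T> could look like. It is not the code from this pull request: FakeDeviceBuffer and memcpy_from_sketch are hypothetical names, and a plain std::memcpy stands in for the SYCL copy so the example builds without a SYCL toolchain.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <type_traits>

// Hypothetical stand-in for device-side storage. A real SYCL backend would
// copy with sycl::queue::memcpy and wait on the returned event instead.
struct FakeDeviceBuffer {
  const void* data;
  std::size_t size;
};

// Copies sizeof(T) bytes out of the buffer into a freshly allocated T.
// Returning std::unique_ptr<T> makes the ownership of the host copy explicit:
// it is released when the caller's pointer goes out of scope.
template <typename T>
std::unique_ptr<T> memcpy_from_sketch(const FakeDeviceBuffer& src) {
  static_assert(std::is_trivially_copyable_v<T>,
                "byte-wise copies are only valid for trivially copyable types");
  static_assert(std::is_default_constructible_v<T>,
                "the host copy is default-constructed before being filled");
  assert(src.data != nullptr && src.size >= sizeof(T));
  auto result = std::make_unique<T>();
  std::memcpy(result.get(), src.data, sizeof(T));
  return result;
}
```

With this shape the caller never frees the host copy by hand, and the copy stays valid even if the source storage is modified or released afterwards.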
This looks good to me, but you still need to rebase and fix the formatting.
9f8f832
to
9e04ea2
Still looks OK to me. Just some comments. We also need someone else looking at this.
Looks largely good. Some questions, and one request - assuming it doesn't have any drawbacks I have not thought about: allocate the memory via memory spaces so the tools can track it.
sycl::malloc(sizeof(*m_result_ptr), q, sycl::usm::alloc::shared));
using ReductionResultMem =
    Experimental::Impl::SYCLInternal::ReductionResultMem;
ReductionResultMem& reductionResultMem = instance.m_reductionResultMem;
In hindsight, and with respect to #3671, I am not quite sure if we want to treat the result ptr the same as the kernels. In particular, I am not sure if we really want to store the result in shared rather than device space, but I am fine with addressing that later.
Hmmm... Looks like something went wrong here. Is this rebased on top of
Yeah, I don't think that worked. Rebasing again now...
19f3b11
to
5868550
Retest this please.
@@ -190,7 +187,7 @@ class Kokkos::Impl::ParallelFor<FunctorType, Kokkos::MDRangePolicy<Traits...>,
       return {global_sizes, local_sizes};
     }
     if constexpr (Policy::rank == 6) {
-      // id0,id1 encoded within first index; id2,id3 to second index; id4,id5 to
+      // id0,id1 encoded within first index; id2,id3 to second index; id4,id5
-      // id0,id1 encoded within first index; id2,id3 to second index; id4,id5
+      // id0,id1 encoded within first index; id2,id3 to second index; id4,id5 to
Fixed
There is still a conflict here.
5318c77
to
5587569
memory. Using device memory for it.
Delete redundant ParallelFor
Fix up parallel_for
Because of issues under SYCL+CUDA (possibly with swapping sycl::queue), made USMObjectMem non-movable and added a .reset method.
Due to suspected issues with sycl::queue assignment under SYCL+CUDA, inside USMObjectMem made m_q optional and added appropriate checks: either checking m_q directly or m_data, because "m_data || !m_q" is a class invariant
Improved the documentation of USMObjectMem class invariants
Removed extra assert from USMObjectMem::Deleter
Simplified USMObjectMem::reserve
More improvements to USMObjectMem invariant documentation
Removed gdb files from .gitignore
Minor fix to comments
Clarified asserts
WIP: in USMObjectMem, added a fence() function to replace calls to memcopied.wait() for dealing with synchronous errors reported by exceptions.
Generalized the internal fence() function in USMObjectMem
In USMObjectMem, replaced throw with throw_runtime_exception
Added FIXME_SYCL for future work
Minor changes for P/R 3511
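The commit messages above describe a non-movable holder with an optional queue, a .reset method, and the class invariant "m_data || !m_q". The following is only a sketch of that pattern under stated assumptions, not the actual USMObjectMem: ObjectMemSketch and FakeQueue are hypothetical names, std::malloc stands in for a USM allocation, and the has_queue() and data() accessors are added purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <optional>
#include <type_traits>

// Hypothetical stand-in for sycl::queue so the sketch builds without SYCL.
struct FakeQueue {
  int id;
};

class ObjectMemSketch {
 public:
  ObjectMemSketch() = default;
  ~ObjectMemSketch() { reset(); }

  // Non-copyable and non-movable: the stored queue is never assigned over.
  ObjectMemSketch(const ObjectMemSketch&) = delete;
  ObjectMemSketch& operator=(const ObjectMemSketch&) = delete;
  ObjectMemSketch(ObjectMemSketch&&) = delete;
  ObjectMemSketch& operator=(ObjectMemSketch&&) = delete;

  // Ensures at least n bytes are available. The queue argument is only
  // stored when a new allocation is made; an existing buffer is reused.
  void* reserve(const FakeQueue& q, std::size_t n) {
    assert(n > 0);
    if (n > m_capacity) {
      reset();
      m_data = std::malloc(n);
      if (m_data == nullptr) return nullptr;  // both members stay empty
      m_capacity = n;
      m_q.emplace(q);  // constructs in place instead of assigning
    }
    assert(invariant());
    return m_data;
  }

  // Frees the buffer and drops the queue, restoring the empty state.
  void reset() {
    std::free(m_data);
    m_data = nullptr;
    m_capacity = 0;
    m_q.reset();
    assert(invariant());
  }

  bool has_queue() const { return m_q.has_value(); }
  void* data() const { return m_data; }

 private:
  // The invariant quoted in the commit message: "m_data || !m_q".
  bool invariant() const { return m_data != nullptr || !m_q.has_value(); }

  void* m_data = nullptr;
  std::size_t m_capacity = 0;
  std::optional<FakeQueue> m_q;
};

static_assert(!std::is_move_constructible_v<ObjectMemSketch>,
              "the sketch is intentionally non-movable");
```

Because the queue lives in a std::optional, rebinding it goes through reset() followed by emplace(), so the queue object itself is never the target of an assignment.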
5587569
to
cad3a89
These are updates for SYCL parallel reduce.