Fix global fence in Kokkos::resize(DynRankView) #6184
Conversation
Force-pushed from c099b85 to 6635026.
@masterleinad thanks for the PR! I have a build in progress and will update when I have results.
```diff
@@ -2299,9 +2299,10 @@ inline void impl_resize(const Impl::ViewCtorProp<ViewCtorArgs...>& arg_prop,
   if constexpr (alloc_prop_input::has_execution_space)
     Kokkos::Impl::DynRankViewRemap<drview_type, drview_type>(
         Impl::get_property<Impl::ExecutionSpaceTag>(prop_copy), v_resized, v);
-  else
+  else {
     Kokkos::Impl::DynRankViewRemap<drview_type, drview_type>(v_resized, v);
```
If we're thinking about this like allocation + `deep_copy`, don't we need a global fence before this remap call, and then only a local fence after? We can do the allocation and copy stream-wise and fence them locally, but we don't know what stream may have been modifying the data in `v` before it got passed in.
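The ordering being proposed here could be sketched as follows (an illustrative sketch, not the actual Kokkos implementation; `exec` stands for the execution space instance extracted from the view-constructor properties, and `allocate_like` is a hypothetical helper):

```cpp
// Sketch of the proposed fencing order for resize ~ allocation + deep_copy:
Kokkos::fence();   // global: wait for whatever stream was last writing to v
auto v_resized = allocate_like(v, new_extents);          // allocate new buffer
Kokkos::Impl::DynRankViewRemap<drview_type, drview_type>(
    exec, v_resized, v);  // copy old data, enqueued on exec's stream
exec.fence();      // local: only wait for the remap itself to finish
```

The point of the sketch is that the global fence protects against writers on *other* streams, while the trailing fence only needs to cover the remap kernel.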
This matches what we are doing for regular Views. Allocation using the default stream is blocking (`cudaMalloc` or `cudaDeviceSynchronize`) anyway.
I prefer thinking about other changes (to be done consistently across View types) in a different pull request and getting this issue fixed first.
Regardless, we still need to ensure here that we're globally synchronizing before launching the remap: a blocking `cudaMalloc` doesn't guarantee completion of a `parallel_for` launched on an explicit stream.
I opened #6186 so we can improve this in a follow-up for all overloads.
@masterleinad these changes resolved the failing MueLu tests in Cuda+UVM builds:
Let me know if there is another round of changes based on @PhilMiller's feedback that needs a retest.
Force-pushed from 9b7f163 to a360df2.
```cpp
template <typename SharedMemorySpace>
void test_dyn_rank_view_resize() {
  int n = 1000000;
```
Does it need to be that big, or would n = 10, 100, or 1000 also do?
I am having trouble reproducing the failure even with the current form, even though it matches the failing MueLu case. I can only see the missing fence in the kernel logger, so this is more of a best-effort test case.
I can see that we need n large enough; otherwise the kernel is done before we launch. One suggestion for better reproducibility: do the host loop in reverse, because CUDA does tend to schedule in order.
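The reverse-loop suggestion could look like this (a sketch; `host_view` is an assumed host-accessible view filled before the device kernel runs). Filling backwards means element 0, which an in-order device kernel reads first, is written last on the host, widening the window in which a missing fence can be observed:

```cpp
// Sketch: fill in reverse so the elements the device reads first
// are the ones the host writes last.
for (int i = n - 1; i >= 0; --i) host_view(i) = i + 1;
```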
```cpp
Kokkos::fence();

for (int i = 0; i < 2 * n; ++i) ASSERT_EQ(device_view(i), i + 1);
```
Please add a comment (in code) on the intent of the test. Out of context, one could expect the test to be doing other things, like asserting that the `DynRankView` has the right size, that the data was preserved, that the data in excess is initialized, etc.
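For illustration, the extra assertions alluded to here might look like the following (a sketch; `v` and the mirror `h` are hypothetical names, not part of the PR's test):

```cpp
// Sketch of additional properties a resize test could check:
Kokkos::resize(v, 2 * n);
ASSERT_EQ(v.extent(0), 2 * n);  // the resized view has the right extent
auto h = Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace{}, v);
for (int i = 0; i < n; ++i)
  ASSERT_EQ(h(i), i + 1);       // the original data was preserved
```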
Force-pushed from 53bf1f7 to c3ad3f8.
Retest this please
Not blocking, but I would be interested in the reverse host loop idea, to check whether it catches the race more reproducibly.
Force-pushed from bdbae62 to bd9ad31.
Why did you add to this file?
Did you consider other options?
There are basically three files that add tests for `DynRankView` (`TestDynViewAPI_generic.hpp`, `TestDynViewAPI_rank12345.hpp`, and `TestDynViewAPI_rank67.hpp`) and test its API. I didn't want to add to the gigantic `TestDynViewAPI.hpp`, and it's enough to test a one-dimensional View, so I added it to `TestDynViewAPI_rank12345.hpp` to avoid adding another file.
I would have added a new file but I am not blocking.
I confirmed the added fence resolves the failures in the MueLu unit tests in Cuda+UVM builds; approving based on this, thanks @masterleinad. (I don't have strong opinions regarding the unit test.)
@PhilMiller brought up some important points which will be followed up through issue #6186
* Fix global fence in Kokkos::resize(DynRankView)
* Add test case
* Comment on the intent of the test and guard for existence of SharedSpace
* Guard execution of test differently
* Loop in reverse order on the host to increase chances for detecting a missing fence
* Test with 1k elements, improve name of the test
Fixes #6165. Still need a test, though.
@ndellingwood Can you please confirm that this resolves the failing Trilinos/MueLu tests?