
Fix global fence in Kokkos::resize(DynRankView) #6184

Merged
merged 6 commits into kokkos:develop on Jun 8, 2023

Conversation

masterleinad
Contributor

Fixes #6165. Still need a test, though.
@ndellingwood Can you please confirm that this resolves the failing Trilinos/MueLu tests?

@ndellingwood
Contributor

@masterleinad thanks for the PR! I have a build in progress, will update when I have results

@@ -2299,9 +2299,10 @@ inline void impl_resize(const Impl::ViewCtorProp<ViewCtorArgs...>& arg_prop,
   if constexpr (alloc_prop_input::has_execution_space)
     Kokkos::Impl::DynRankViewRemap<drview_type, drview_type>(
         Impl::get_property<Impl::ExecutionSpaceTag>(prop_copy), v_resized, v);
-  else
+  else {
     Kokkos::Impl::DynRankViewRemap<drview_type, drview_type>(v_resized, v);
@PhilMiller
Contributor
If we're thinking about this like allocation + deep_copy, don't we need a global fence before this remap call, and then only a local fence after? We can do the allocation and copy stream-wise and fence them locally, but we don't know what stream may have been modifying the data in v before it got passed in.
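To make that pattern concrete, here is a minimal sketch of what is being described, for the branch that does receive an execution space instance (exec stands in for the instance obtained from the view-constructor properties; the fence labels are made up and this is not the actual Kokkos implementation):

    // Global fence: wait for whatever stream may still be writing to v.
    Kokkos::fence("Kokkos::resize(DynRankView): pre-remap fence");
    // Remap the old contents into the resized view on the given instance.
    Kokkos::Impl::DynRankViewRemap<drview_type, drview_type>(exec, v_resized, v);
    // Local fence: only the instance that ran the remap has to finish.
    exec.fence("Kokkos::resize(DynRankView): post-remap fence");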

@masterleinad
Contributor Author

This matches what we are doing for regular Views. Allocation using the default stream is blocking (cudaMalloc or cudaDeviceSynchronize) anyway.

I prefer thinking about other changes (to be done consistently across View types) in a different pull request and getting this issue fixed first.

@PhilMiller
Contributor

Regardless, we still need to ensure here that we're globally synchronizing before launching the remap - blocking cudaMalloc doesn't guarantee completion of a parallel_for launched on an explicit stream.
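A minimal sketch of the scenario being described (the view name, element type, and sizes are illustrative, not taken from the failing MueLu test): a kernel launched on an explicit stream can still be running when resize() starts its remap, even though the allocation inside resize() went through a blocking cudaMalloc on the default stream.

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    Kokkos::Cuda exec(stream);  // execution space instance bound to a user stream

    int n = 1000000;
    Kokkos::DynRankView<int, Kokkos::CudaUVMSpace> v("v", n);
    // Fill v asynchronously on the explicit stream.
    Kokkos::parallel_for(
        Kokkos::RangePolicy<Kokkos::Cuda>(exec, 0, n),
        KOKKOS_LAMBDA(int i) { v(i) = i + 1; });
    // Without a global fence inside resize(), the remap may read v before the
    // kernel above has finished writing it.
    Kokkos::resize(v, 2 * n);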

@masterleinad
Contributor Author

I opened #6186 so we can improve this in a follow-up for all overloads.

@ndellingwood
Contributor

@masterleinad these changes resolved the failing MueLu tests in Cuda+UVM builds:

Test project /home/ndellin/trilinos/Trilinos/Build/Weaver-Cuda11-nightly-cmplx/packages/muelu/test/unit_tests
    Start 5: MueLu_UnitTestsIntrepid2Tpetra_MPI_1
1/1 Test #5: MueLu_UnitTestsIntrepid2Tpetra_MPI_1 ...   Passed   45.76 sec

100% tests passed, 0 tests failed out of 1

Label Time Summary:
MueLu    =  45.76 sec*proc (1 test)

Total Test time (real) =  45.84 sec
bash-4.4$ ctest -R MueLu_UnitTestsIntrepid2Tpetra_MPI_4
Test project /home/ndellin/trilinos/Trilinos/Build/Weaver-Cuda11-nightly-cmplx/packages/muelu/test/unit_tests
    Start 6: MueLu_UnitTestsIntrepid2Tpetra_MPI_4
1/1 Test #6: MueLu_UnitTestsIntrepid2Tpetra_MPI_4 ...   Passed  142.07 sec

100% tests passed, 0 tests failed out of 1

Label Time Summary:
MueLu    = 568.28 sec*proc (1 test)

Total Test time (real) = 142.17 sec
bash-4.4$ ctest -R MueLu_UnitTestsTpetra_MPI_1
Test project /home/ndellin/trilinos/Trilinos/Build/Weaver-Cuda11-nightly-cmplx/packages/muelu/test/unit_tests
    Start 1: MueLu_UnitTestsTpetra_MPI_1
1/1 Test #1: MueLu_UnitTestsTpetra_MPI_1 ......   Passed   27.70 sec

100% tests passed, 0 tests failed out of 1

Label Time Summary:
MueLu    =  27.70 sec*proc (1 test)

Total Test time (real) =  27.72 sec

Let me know if there is another round of changes based on @PhilMiller's feedback that needs a retest.

@masterleinad masterleinad marked this pull request as ready for review June 6, 2023 15:30
@masterleinad masterleinad added this to the Release 4.1 milestone Jun 6, 2023

template <typename SharedMemorySpace>
void test_dyn_rank_view_resize() {
  int n = 1000000;
Member

Does it need to be that big, or would n = 10, 100, or 1000 also do?

@masterleinad
Contributor Author

I am having trouble reproducing the failure even with the current form, even though it matches the failing MueLu case.
I can only see the missing fence with the kernel logger, so this is more of a best-effort test case.

@crtrott
Member
I can see that we need it large enough; otherwise the kernel is done before we launch the next operation. One suggestion for better reproducibility: do the host loop in reverse, because CUDA does tend to schedule in order.


Kokkos::fence();

for (int i = 0; i < 2 * n; ++i) ASSERT_EQ(device_view(i), i + 1);
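For concreteness, a sketch of the reverse-order variant suggested above, applied to the assertion loop just shown (the merged test may differ): since the fill kernel tends to complete lower indices first, checking the highest indices first maximizes the chance of catching stale data when the fence is missing.

    // Check in reverse: the last elements are the most likely to still be
    // unwritten if the fill kernel has not been fenced.
    for (int i = 2 * n - 1; i >= 0; --i) ASSERT_EQ(device_view(i), i + 1);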
Member
Please add a comment (in code) on the intent of the test.
Out of context, one could expect this test to be doing other things, like asserting that the DynRankView has the right size, that the data was preserved, that the excess data is initialized, etc.
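As an illustration only (not the wording that ended up in the merged test), such an intent comment could read:

    // Intent: this test only exercises the fence inside Kokkos::resize(DynRankView).
    // A device kernel fills the view asynchronously; resize() must wait for that
    // kernel before remapping the old data into the new allocation, and the
    // host-side ASSERTs afterwards verify that the values survived. Checking
    // extents, rank, or initialization of the new elements is left to other tests.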

@Rombur
Member

Rombur commented Jun 7, 2023

Retest this please

@masterleinad masterleinad requested a review from dalg24 June 7, 2023 15:00
@crtrott crtrott left a comment
Member

Not blocking, but I would be interested in the reverse host loop idea, to check whether it catches the race more reproducibly.



containers/unit_tests/TestDynViewAPI_rank12345.hpp
Member

Why did you add to this file?
Did you consider other options?

@masterleinad
Contributor Author

There are basically three files that test the DynRankView API (TestDynViewAPI_generic.hpp, TestDynViewAPI_rank12345.hpp, and TestDynViewAPI_rank67.hpp). I didn't want to add to the gigantic TestDynViewAPI.hpp, and since it's enough to test a one-dimensional View, I added the test to TestDynViewAPI_rank12345.hpp to avoid adding another file.

Member

I would have added a new file but I am not blocking.

@masterleinad masterleinad mentioned this pull request Jun 7, 2023
@ndellingwood ndellingwood left a comment
Contributor

I confirmed the added fence resolves the failures in the MueLu unit tests in Cuda+UVM builds; approving based on this, thanks @masterleinad. (I don't have strong opinions regarding the unit test.)

@PhilMiller brought up some important points which will be followed up through issue #6186

@masterleinad
Contributor Author

The MSVC failure looks spurious.

@masterleinad masterleinad added the Blocks Promotion Overview issue for release-blocking bugs label Jun 8, 2023
@masterleinad masterleinad requested a review from dalg24 June 8, 2023 15:55
@dalg24 dalg24 merged commit a406372 into kokkos:develop Jun 8, 2023
27 of 28 checks passed
nliber pushed a commit to nliber/kokkos that referenced this pull request Jun 22, 2023
* Fix global fence in Kokkos::resize(DynRankView)

* Add test case

* Comment on the intent of the test and guard for existence of SharedSpace

* Guard execution of test differently

* Loop in reverse order on the host to increase chances for detecting a missing fence

* Test with 1k elements, improve name of the test
Labels
Blocks Promotion Overview issue for release-blocking bugs
Development

Successfully merging this pull request may close these issues.

Trilinos nightly test failures in MueLu tests with Cuda builds, UVM enabled
6 participants