Fix global fence in Kokkos::resize(DynRankView) #6184
Conversation
Force-pushed from c099b85 to 6635026.
@masterleinad thanks for the PR! I have a build in progress and will update when I have results.
```diff
@@ -2299,9 +2299,10 @@ inline void impl_resize(const Impl::ViewCtorProp<ViewCtorArgs...>& arg_prop,
   if constexpr (alloc_prop_input::has_execution_space)
     Kokkos::Impl::DynRankViewRemap<drview_type, drview_type>(
         Impl::get_property<Impl::ExecutionSpaceTag>(prop_copy), v_resized, v);
-  else
+  else {
     Kokkos::Impl::DynRankViewRemap<drview_type, drview_type>(v_resized, v);
```
If we're thinking about this like allocation + `deep_copy`, don't we need a global fence before this remap call, and then only a local fence after? We can do the allocation and copy stream-wise and fence them locally, but we don't know what stream may have been modifying the data in `v` before it got passed in.
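The ordering being proposed here could be sketched as follows (an illustrative sketch, not the actual Kokkos implementation; `exec` stands for the execution space instance extracted from the view-constructor properties, and `allocate_like` is a hypothetical helper):

```cpp
// Sketch of the proposed fencing order for resize ~ allocation + deep_copy:
Kokkos::fence();   // global: wait for whatever stream was last writing to v
auto v_resized = allocate_like(v, new_extents);          // allocate new buffer
Kokkos::Impl::DynRankViewRemap<drview_type, drview_type>(
    exec, v_resized, v);  // copy old data, enqueued on exec's stream
exec.fence();      // local: only wait for the remap itself to finish
```

The point of the sketch is that the global fence protects against writers on *other* streams, while the trailing fence only needs to cover the remap kernel.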
This matches what we are doing for regular Views. Allocation using the default stream is blocking (`cudaMalloc` or `cudaDeviceSynchronize`) anyway.
I prefer thinking about other changes (to be done consistently across View types) in a different pull request and getting this issue fixed first.
Regardless, we still need to ensure here that we're globally synchronizing before launching the remap: a blocking `cudaMalloc` doesn't guarantee completion of a `parallel_for` launched on an explicit stream.
I opened #6186 so we can improve this in a follow-up for all overloads.
@masterleinad these changes resolved the failing MueLu tests in Cuda+UVM builds:
Let me know if there is another round of changes based on @PhilMiller's feedback that needs a retest.
Force-pushed from 9b7f163 to a360df2.
```cpp
template <typename SharedMemorySpace>
void test_dyn_rank_view_resize() {
  int n = 1000000;
```
Does it need to be that big, or would n = 10, 100, or 1000 also do?
I am having trouble reproducing the failure even with the current form, even though it matches the failing MueLu case. I can only see the missing fence in the kernel logger, so this is more of a best-effort test case.
I can see that we need n large enough; otherwise the kernel is done before we launch. One suggestion for better reproducibility: do the host loop in reverse, because CUDA does tend to schedule in order.
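The reverse-loop suggestion could look like this (a sketch; `host_view` is an assumed host-accessible view filled before the device kernel runs). Filling backwards means element 0, which an in-order device kernel reads first, is written last on the host, widening the window in which a missing fence can be observed:

```cpp
// Sketch: fill in reverse so the elements the device reads first
// are the ones the host writes last.
for (int i = n - 1; i >= 0; --i) host_view(i) = i + 1;
```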
```cpp
Kokkos::fence();

for (int i = 0; i < 2 * n; ++i) ASSERT_EQ(device_view(i), i + 1);
```
Please add a comment (in code) on the intent of the test. Out of context, one could expect the test to be doing other things, like asserting that the `DynRankView` has the right size, that the data was preserved, that the data in excess is initialized, etc.
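For illustration, the extra assertions alluded to here might look like the following (a sketch; `v` and the mirror `h` are hypothetical names, not part of the PR's test):

```cpp
// Sketch of additional properties a resize test could check:
Kokkos::resize(v, 2 * n);
ASSERT_EQ(v.extent(0), 2 * n);  // the resized view has the right extent
auto h = Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace{}, v);
for (int i = 0; i < n; ++i)
  ASSERT_EQ(h(i), i + 1);       // the original data was preserved
```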
Force-pushed from 53bf1f7 to c3ad3f8.
Retest this please
Not blocking, but I would be interested in the reverse host loop idea, to check whether it catches the race more reproducibly.
Force-pushed from bdbae62 to bd9ad31.
Why did you add to this file?
Did you consider other options?
There are basically three files that add tests for `DynRankView` (`TestDynViewAPI_generic.hpp`, `TestDynViewAPI_rank12345.hpp`, and `TestDynViewAPI_rank67.hpp`) and test its API. I didn't want to add to the gigantic `TestDynViewAPI.hpp`, and it's enough to test a one-dimensional View, so I added it to `TestDynViewAPI_rank12345.hpp` to avoid adding another file.
I would have added a new file but I am not blocking.
I confirmed the added fence resolves the failures in the MueLu unit tests in Cuda+UVM builds; approving based on this, thanks @masterleinad. (I don't have strong opinions regarding the unit test.)
@PhilMiller brought up some important points which will be followed up through issue #6186
* Fix global fence in Kokkos::resize(DynRankView)
* Add test case
* Comment on the intent of the test and guard for existence of SharedSpace
* Guard execution of test differently
* Loop in reverse order on the host to increase chances for detecting a missing fence
* Test with 1k elements, improve name of the test
Fixes #6165. Still need a test, though.
@ndellingwood Can you please confirm that this resolves the failing Trilinos/MueLu tests?