Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

view(uvm): don't fence during view's allocation (#6005) #6008

Merged
merged 1 commit into from
May 4, 2023

Conversation

romintomasetti
Copy link
Contributor

@romintomasetti romintomasetti commented Mar 23, 2023

This PR is a fix for

Basically, when CudaUVM is enabled, there is a fence before and after the view's allocation (whether the space instance was given via arg_prop or not).

However, the fence was not performed on the space instance the user provided (if any).

According to the "poll" below, it was decided to

enforce Kokkos semantics: Don't ever fence for the view initialization when an execution space instance is given

Therefore, any additional fence that was added for CudaUVM in view's allocation has been removed. Note that fencing still occurs in the backend implementations.

@dalg24-jenkins
Copy link
Collaborator

Can one of the admins verify this patch?

containers/src/Kokkos_DynRankView.hpp Outdated Show resolved Hide resolved
containers/src/Kokkos_DynRankView.hpp Outdated Show resolved Hide resolved
core/src/Kokkos_View.hpp Outdated Show resolved Hide resolved
core/src/Kokkos_View.hpp Outdated Show resolved Hide resolved
@romintomasetti romintomasetti force-pushed the cuda-uvm-space-instance-fence branch 3 times, most recently from d26e6cc to e6d2ee2 Compare March 23, 2023 14:57
core/src/Kokkos_View.hpp Outdated Show resolved Hide resolved
containers/src/Kokkos_DynRankView.hpp Outdated Show resolved Hide resolved
@romintomasetti romintomasetti force-pushed the cuda-uvm-space-instance-fence branch 3 times, most recently from ad29115 to 33bed43 Compare March 24, 2023 08:40
@masterleinad
Copy link
Contributor

The developer meeting asked for more clarification of what we are discussing:

Consider

Kokkos::Cuda exec_space;
Kokkos::View<int*, Kokkos::CudaUVMSpace> view(Kokkos::view_alloc(exec_space, "view"), 10);
view[0] = 2;

should this code work or do we require the user to fence?
What about

Kokkos::Experimental::HPX exec_space;
Kokkos::View<int*, Kokkos::HostSpace> view(Kokkos::view_alloc(exec_space, "view"), 10);
view[0] = 2;

? Currently, we are fencing the default execution space instance(!) before allocation and after (potential initialization) only for Kokkos::CudaUVMSpace. Note that fences for the allocation itself are in the backend implementations.
I would think that we can get rid of the first fence for sure. Can we also get rid of the second one in case the user provided an execution space instance? Do we care to check for Kokkos::WithoutInitializing for this fence additionally?

@dalg24
Copy link
Member

dalg24 commented Mar 29, 2023

The developer meeting asked for more clarification of what we are discussing:

Consider

Kokkos::Cuda exec_space;
Kokkos::View<int*, Kokkos::CudaUVMSpace> view(Kokkos::view_alloc(exec_space, "view"), 10);
view[0] = 2;

should this code work or do we require the user to fence? What about

Kokkos::Experimental::HPX exec_space;
Kokkos::View<int*, Kokkos::HostSpace> view(Kokkos::view_alloc(exec_space, "view"), 10);
view[0] = 2;

In my opinion, yes both need fences.

@romintomasetti
Copy link
Contributor Author

The developer meeting asked for more clarification of what we are discussing:

Consider

Kokkos::Cuda exec_space;
Kokkos::View<int*, Kokkos::CudaUVMSpace> view(Kokkos::view_alloc(exec_space, "view"), 10);
view[0] = 2;

should this code work or do we require the user to fence? What about

Kokkos::Experimental::HPX exec_space;
Kokkos::View<int*, Kokkos::HostSpace> view(Kokkos::view_alloc(exec_space, "view"), 10);
view[0] = 2;

? Currently, we are fencing the default execution space instance(!) before allocation and after (potential initialization) only for Kokkos::CudaUVMSpace. Note that fences for the allocation itself are in the backend implementations. I would think that we can get rid of the first fence for sure. Can we also get rid of the second one in case the user provided an execution space instance? Do we care to check for Kokkos::WithoutInitializing for this fence additionally?

Hi @dalg24 @masterleinad !

We tried to think about this a bit more. Here is a table that show what we have in mind. Note the introduction of the opt-in Kokkos::Async flag, that allows the user to say "I'm good with async".

In our opinion, the behavior for a shared memory space should be similar to any other one.

space instance Kokkos::view_alloc expected behavior
no space instance Use default space instance, and fence it at the end of Kokkos::View constructor (view is 'ready')
no space instance Kokkos::Async Use default space instance, don't fence it at the end of the constructor (view might not be 'ready' on host, but is ready on the FIFO queue of the default space instance)
space instance provided Use provided space instance and fence it at the end of the constructor (view is 'ready')
space instance provided Kokkos::Async Use provided space instance and don't fence it (view might not be 'ready' on host but is ready on the FIFO queue of the provided space instance)

@Rombur
Copy link
Member

Rombur commented Mar 31, 2023

Note the introduction of the opt-in Kokkos::Async flag, that allows the user to say "I'm good with async".

There is no need for this flag. If you give an execution space , you agree that the operation may be asynchronous. If you don't provide an execution space. we should fence for you. If you provide an execution space, then it's your responsibility.

@masterleinad
Copy link
Contributor

masterleinad commented Mar 31, 2023

Let's just do a simple poll that might decide pretty much already how we want to proceed:

  • enforce Kokkos semantics: Don't ever fence for the view initialization when an execution space instance is given 🚀
  • fence in some cases even if an execution space instance is given, e.g. for UVM allocations 🎉

@romintomasetti
Copy link
Contributor Author

Hi @masterleinad ! 👋

I guess your poll is unanimous: Kokkos semantics should always be enforced, which in this case implies this rule:

Don't ever fence for the view initialization when an execution space instance is given

If you agree, I'll make the changes.

@masterleinad
Copy link
Contributor

If you agree, I'll make the changes.

Yes, just delete the UVM fences.

@romintomasetti
Copy link
Contributor Author

@masterleinad I guess the reason for failure of GCC-8.4.0 is the same as before. Do I need to worry about the failure of SYCL-OneAPI?

@masterleinad
Copy link
Contributor

@masterleinad I guess the reason for failure of GCC-8.4.0 is the same as before. Do I need to worry about the failure of SYCL-OneAPI?

This doesn't really touch SYCL and we have seen the failures elsewhere. I wouldn't worry about it (but we might decide to merge a fix first and rerun CI just to make sure).

@romintomasetti
Copy link
Contributor Author

@masterleinad The last action showed only GCC-8.4.0 failing. Should we proceed with the PR?

@masterleinad
Copy link
Contributor

@masterleinad The last action showed only GCC-8.4.0 failing. Should we proceed with the PR?

I already approved it. So you need to bug other people (possibly someone with the power to merge it). 🙂

@romintomasetti
Copy link
Contributor Author

Indeed, sorry @masterleinad ! I guess I need to bug @dalg24 and @PhilMiller 📨

@maartenarnst
Copy link
Contributor

Hi @dalg24 and @PhilMiller, would you have a moment to look at this PR, and, if you agree with the changes, approve? Many thanks in advance!

@dalg24
Copy link
Member

dalg24 commented May 1, 2023

Make sure you update the title line and the description of this PR to reflect the latest version.

//! @name Some tests are skipped for @c CudaUVM memory space.
///@{
#ifdef KOKKOS_ENABLE_CUDA
#define GTEST_SKIP_IF_CUDAUVM_MEMORY_SPACE \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other discussion aside, standard practice when defining a macro like this is to wrap the contents in do { ... } while (false) so that the use sites have to treat it like a statement with a trailing ;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is now

#ifdef KOKKOS_ENABLE_CUDA
#define GTEST_SKIP_IF_CUDAUVM_MEMORY_SPACE                            \
  if constexpr (std::is_same_v<typename TEST_EXECSPACE::memory_space, \
                               Kokkos::CudaUVMSpace>)                 \
    GTEST_SKIP() << "skipping since CudaUVMSpace requires additional fences"
#else
#define GTEST_SKIP_IF_CUDAUVM_MEMORY_SPACE do {} while(false)
#endif

such that the caller must add the trailing ;. Is that what you meant? :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't agree with Phil's suggestion. This only apply to macro that take arguments.
I would prefer to revert but I am not blocking on this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also preferred how it was before. I'll revert.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dalg24 I think this is the last thread. If you agree with all the changes, this PR is good to go 🚀

@romintomasetti romintomasetti force-pushed the cuda-uvm-space-instance-fence branch from ef07355 to e395d5f Compare May 2, 2023 08:04
@romintomasetti romintomasetti changed the title view(uvm): fence if need in allocation (#6005) view(uvm): don't fence during view's allocation (#6005) May 2, 2023
@romintomasetti romintomasetti force-pushed the cuda-uvm-space-instance-fence branch 5 times, most recently from cb17816 to 0a47643 Compare May 2, 2023 16:07
Comment on lines +178 to +179
Impl::cuda_device_synchronize(
"Kokkos::Cuda: backend fence after async malloc");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@masterleinad This is just a comment about the clang-format. Do you think it would make sense (not in this PR of course) to lower the penalty of excess characters? (See https://clang.llvm.org/docs/ClangFormatStyleOptions.html#penaltyexcesscharacter)

I mean, sometimes a line with 87 characters might be nicer than the same line split weirdly across 2 lines 😄 This is just a question, not suggesting to do anything but gathering your opinion on this 😄

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to update the clang-format version we are using to something more recent in the near future.
This will trigger changes throughout the code base. When we do so would be a time where we would consider changing the clang-format configuration. If that's something you are interested in you are welcome to explore options but this needs to be done in an issue elsewhere.
A word of warning, this is a time sink and at times can get controversial.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • If I'm not mistaken, the PenaltyExcessCharacter is available from clang-format 3.7, and you're requiring version 8.0 in scripts/apply-clang-format. So I guess the option could be readily used.
  • Instead of modifying the whole code base when the clang format changes (either version or options), I think you should use git-clang-format to check for style only in modified files. This would avoid committing purely stylistic changes. That would avoid messing up the git history.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We allow disabling clang-format when appropriate. You will find all sort of examples throughout the code base such as

// clang-format off
// Numeric distinguished value traits
KOKKOS_IMPL_SPECIALIZE_NUMERIC_TRAIT(infinity, __float128, __float128, HUGE_VALQ)
KOKKOS_IMPL_SPECIALIZE_NUMERIC_TRAIT(finite_min, __float128, __float128, -FLT128_MAX)
KOKKOS_IMPL_SPECIALIZE_NUMERIC_TRAIT(finite_max, __float128, __float128, FLT128_MAX)
KOKKOS_IMPL_SPECIALIZE_NUMERIC_TRAIT(epsilon, __float128, __float128, FLT128_EPSILON)
KOKKOS_IMPL_SPECIALIZE_NUMERIC_TRAIT(round_error, __float128, __float128, static_cast<__float128>(0.5))
KOKKOS_IMPL_SPECIALIZE_NUMERIC_TRAIT(norm_min, __float128, __float128, FLT128_MIN)
KOKKOS_IMPL_SPECIALIZE_NUMERIC_TRAIT(denorm_min, __float128, __float128, FLT128_DENORM_MIN)
KOKKOS_IMPL_SPECIALIZE_NUMERIC_TRAIT(quiet_NaN, __float128, __float128, __builtin_nanq(""))
KOKKOS_IMPL_SPECIALIZE_NUMERIC_TRAIT(signaling_NaN, __float128, __float128, __builtin_nansq(""))
// Numeric characteristics traits
KOKKOS_IMPL_SPECIALIZE_NUMERIC_TRAIT(digits, __float128, int, FLT128_MANT_DIG)
KOKKOS_IMPL_SPECIALIZE_NUMERIC_TRAIT(digits10, __float128, int, FLT128_DIG)
KOKKOS_IMPL_SPECIALIZE_NUMERIC_TRAIT(max_digits10, __float128, int, 36)
KOKKOS_IMPL_SPECIALIZE_NUMERIC_TRAIT(radix, __float128, int, 2)
KOKKOS_IMPL_SPECIALIZE_NUMERIC_TRAIT(min_exponent, __float128, int, FLT128_MIN_EXP)
KOKKOS_IMPL_SPECIALIZE_NUMERIC_TRAIT(max_exponent, __float128, int, FLT128_MAX_EXP)
KOKKOS_IMPL_SPECIALIZE_NUMERIC_TRAIT(min_exponent10, __float128, int, FLT128_MIN_10_EXP)
KOKKOS_IMPL_SPECIALIZE_NUMERIC_TRAIT(max_exponent10, __float128, int, FLT128_MAX_10_EXP)
// clang-format on
or
// clang-format off
#pragma acc parallel loop gang(static:chunk_size) vector copyin(functor) async(async_arg)
// clang-format on

In principle you are right but in practice I think it would mean that the entire file would be reformatted as soon as someone touches it and this "spreads" the noise across the history of the repo instead of having one clean "break".

Comment on lines 165 to 166
(event.descriptor().find(fence_event_message_for_zeromemset(
TEST_EXECSPACE{})) != std::string::npos))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is unnecessary. Please undo.

containers/unit_tests/TestWithoutInitializing.hpp Outdated Show resolved Hide resolved
Comment on lines 33 to 29
/**
* @name Some tests are skipped for @c CudaUVM memory space.
*
* @todo To be revised according to the future of @c KOKKOS_ENABLE_CUDA_UVM .
*/
///@{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not requiring changes but these are 5 lines of comment (not even counting the closing comment) when one single line would do and we do not really use Doxygen so chances are it would not even generate correct documentation after other edit the file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've used /// instead of /** ... */. Tell me if it's good for you ;)

@dalg24
Copy link
Member

dalg24 commented May 3, 2023

(almost there)

@romintomasetti romintomasetti force-pushed the cuda-uvm-space-instance-fence branch from 0a47643 to 5c2d948 Compare May 4, 2023 06:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants