-
Notifications
You must be signed in to change notification settings - Fork 407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Disable memset on A64FX and use parallel init instead #4884
Conversation
@@ -1298,9 +1298,11 @@ inline std::enable_if_t< | |||
contiguous_fill_or_memset( | |||
const ExecutionSpace& exec_space, const View<DT, DP...>& dst, | |||
typename ViewTraits<DT, DP...>::const_value_type& value) { | |||
#ifndef KOKKOS_ARCH_A64FX |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment here and below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you really need to disable it here, too? This is also used for Kokkos::deep_copy(view, 0)
. Changing core/src/impl/Kokkos_ViewMapping.hpp
should be enough to avoid calling memset in the View initialization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it will screw up withoutviewinit followed by deep_copy(a,0) later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that the usual case? If you specify Kokkos::WithoutInitializing
you wouldn't normally run deep_copy(a,0)
later but fill the View in another way, right?
My thinking is that you would normally call deep_copy(a,0)
for a VIew that has already been used, or am I wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know of a number of apps where that doesn't happen. Basically you know that your normal algorithm will anyway reset the data because you call the thing multiple times (like timesteps or other types of iterations). So you tell yourself: I gonna reset this thing anyway, so during allocation (or more often than not resizing as part of load balancing etc.) you avoid initialization since you know you will overwrite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you think we should do it here as well, that's fine with me.
This fixes performance issues related to 1st touch which do not seem to pop up on X86 On A64FX we get wrong first touch with memset, leading to bad performance. Kokkos stream benchmark only get 180GB/s vs 600GB/s when doing parallel set zero. This is with GCC.
93dbaa3
to
eeb76ed
Compare
Failure (timeout in Clang+CUDA build) is unrelated |
This fixes performance issues related to 1st touch which do not seem to pop up on X86
On A64FX we get wrong first touch with memset, leading to bad performance.
Kokkos stream benchmark only get 180GB/s vs 600GB/s when doing parallel set zero.
This is with GCC.
We need to check other archs/OS combos where we may need to disable memset.