Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable memset on A64FX and use parallel init instead #4884

Merged
merged 1 commit into from
Mar 24, 2022

Conversation

crtrott
Copy link
Member

@crtrott crtrott commented Mar 17, 2022

This fixes performance issues related to 1st touch which do not seem to pop up on X86

On A64FX we get wrong first touch with memset, leading to bad performance.
Kokkos stream benchmark only get 180GB/s vs 600GB/s when doing parallel set zero.
This is with GCC.

We need to check other archs/OS combos where we may need to disable memset.

@@ -1298,9 +1298,11 @@ inline std::enable_if_t<
contiguous_fill_or_memset(
const ExecutionSpace& exec_space, const View<DT, DP...>& dst,
typename ViewTraits<DT, DP...>::const_value_type& value) {
#ifndef KOKKOS_ARCH_A64FX
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment here and below

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you really need to disable it here, too? This is also used for Kokkos::deep_copy(view, 0). Changing core/src/impl/Kokkos_ViewMapping.hpp should be enough to avoid calling memset in the View initialization.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it will screw up withoutviewinit followed by deep_copy(a,0) later

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that the usual case? If you specify Kokkos::WithoutInitializing you wouldn't normally run deep_copy(a,0) later but fill the View in another way, right?
My thinking is that you would normally call deep_copy(a,0) for a VIew that has already been used, or am I wrong?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know of a number of apps where that doesn't happen. Basically you know that your normal algorithm will anyway reset the data because you call the thing multiple times (like timesteps or other types of iterations). So you tell yourself: I gonna reset this thing anyway, so during allocation (or more often than not resizing as part of load balancing etc.) you avoid initialization since you know you will overwrite.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you think we should do it here as well, that's fine with me.

This fixes performance issues related to 1st touch which do not seem to pop up on X86

On A64FX we get wrong first touch with memset, leading to bad performance.
Kokkos stream benchmark only get 180GB/s vs 600GB/s when doing parallel set zero.
This is with GCC.
@dalg24
Copy link
Member

dalg24 commented Mar 23, 2022

Failure (timeout in Clang+CUDA build) is unrelated
@masterleinad do you accept Christian's answer?

@dalg24 dalg24 merged commit 7c2deab into kokkos:develop Mar 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants