optimize min/max atomics with early exit on no-op case #3265

Merged (15 commits) on Aug 26, 2020

jeffhammond

Based on the prototype written by @dhollman in https://godbolt.org/z/zqP8ox

Fixes #3144

jeffhammond commented Aug 10, 2020

@dalg24 asked about performance (#3190 (review)).

The early-exit change reduces the cost of atomic min/max by ~5x with 1 thread and ~6x with 4 threads (on a 4-core Intel Core i7 Kaby Lake processor). One can estimate the benefit for a real workload by considering how often min/max has no effect for a given distribution of inputs.

1 thread

================ int
Time for 100% min replacements: 0.0504914
Time for 100% max replacements: 0.0506939
Time for 100% max early exits: 0.00871919
Time for 100% min early exits: 0.00901207
================ long
Time for 100% min replacements: 0.0497887
Time for 100% max replacements: 0.0518098
Time for 100% max early exits: 0.00974867
Time for 100% min early exits: 0.00890822
================ long long
Time for 100% min replacements: 0.0506316
Time for 100% max replacements: 0.0515252
Time for 100% max early exits: 0.00905691
Time for 100% min early exits: 0.00921661
================ unsigned int
Time for 100% min replacements: 0.0515325
Time for 100% max replacements: 0.0561518
Time for 100% max early exits: 0.00576655
Time for 100% min early exits: 0.00636291
================ unsigned long
Time for 100% min replacements: 0.0499094
Time for 100% max replacements: 0.0525061
Time for 100% max early exits: 0.00946185
Time for 100% min early exits: 0.00904288
================ unsigned long long
Time for 100% min replacements: 0.0512696
Time for 100% max replacements: 0.0515732
Time for 100% max early exits: 0.0091282
Time for 100% min early exits: 0.00928072
================ float
Time for 100% min replacements: 0.0543137
Time for 100% max replacements: 0.0557416
Time for 100% max early exits: 0.0100482
Time for 100% min early exits: 0.00990205
================ double
Time for 100% min replacements: 0.0554962
Time for 100% max replacements: 0.0568094
Time for 100% max early exits: 0.0105707
Time for 100% min early exits: 0.00993827

4 threads

================ int
Time for 100% min replacements: 0.0132574
Time for 100% max replacements: 0.0133464
Time for 100% max early exits: 0.0021481
Time for 100% min early exits: 0.00215593
================ long
Time for 100% min replacements: 0.0133953
Time for 100% max replacements: 0.0138049
Time for 100% max early exits: 0.00282516
Time for 100% min early exits: 0.00365484
================ long long
Time for 100% min replacements: 0.0134711
Time for 100% max replacements: 0.0133278
Time for 100% max early exits: 0.00296979
Time for 100% min early exits: 0.00286541
================ unsigned int
Time for 100% min replacements: 0.0139581
Time for 100% max replacements: 0.0147479
Time for 100% max early exits: 0.00194901
Time for 100% min early exits: 0.00158964
================ unsigned long
Time for 100% min replacements: 0.0135064
Time for 100% max replacements: 0.0134934
Time for 100% max early exits: 0.00316169
Time for 100% min early exits: 0.00280566
================ unsigned long long
Time for 100% min replacements: 0.0135957
Time for 100% max replacements: 0.0133858
Time for 100% max early exits: 0.00281782
Time for 100% min early exits: 0.00305754
================ float
Time for 100% min replacements: 0.0146303
Time for 100% max replacements: 0.0144835
Time for 100% max early exits: 0.00248436
Time for 100% min early exits: 0.00248833
================ double
Time for 100% min replacements: 0.0149186
Time for 100% max replacements: 0.0147723
Time for 100% max early exits: 0.00295521
Time for 100% min early exits: 0.00290403
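
As a rough illustration of the estimate described above, the sketch below mixes the two measured extremes linearly. It assumes (this is not stated in the PR) that a call which performs a replacement costs the same with and without the early-exit check, that per-call costs simply add up, and that `p_noop` is the fraction of calls for which min/max has no effect. The names `expected_time` and `expected_speedup` are hypothetical.

```cpp
#include <cassert>

// Hypothetical helper: estimated run time when a fraction p_noop of the
// calls take the early exit and the rest perform a replacement.
// t_replace: measured time for 100% replacements
// t_exit:    measured time for 100% early exits
inline double expected_time(double t_replace, double t_exit, double p_noop) {
  assert(p_noop >= 0.0 && p_noop <= 1.0);
  return p_noop * t_exit + (1.0 - p_noop) * t_replace;
}

// Hypothetical helper: estimated speedup relative to a version without the
// early exit, ASSUMING that version always pays t_replace per run.
inline double expected_speedup(double t_replace, double t_exit, double p_noop) {
  return t_replace / expected_time(t_replace, t_exit, p_noop);
}
```

With the 1-thread `int` numbers above (0.0504914 for min replacements, 0.00901207 for min early exits), a workload in which half of the calls are no-ops would come out at roughly 1.7x under these assumptions. Real workloads may deviate because of contention and branch prediction.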

@jeffhammond jeffhammond marked this pull request as ready for review August 10, 2020 22:53
@jeffhammond

@dhollman Did I apply your suggested implementation properly?

If this passes, we can clean up the history afterwards.

crtrott commented Aug 11, 2020

I pushed the clang-format here. We are going to rewrite history to fold it into the prior commit if this works and otherwise passes testing. By the way, it's 5x and 6x respectively :-)


dalg24 commented Aug 11, 2020

(Failure on Jenkins is not related to this PR)

@calewis calewis left a comment

LGTM, but don't merge just on my review.

@crtrott crtrott left a comment

Typo in the Makefile.

dalg24 commented Aug 13, 2020

I decided to default the length to 1'000'000 rather than waiting for #3279 to land.

diff --git a/core/perf_test/CMakeLists.txt b/core/perf_test/CMakeLists.txt
index 49741ed8..3b7d154f 100644
--- a/core/perf_test/CMakeLists.txt
+++ b/core/perf_test/CMakeLists.txt
@@ -93,6 +93,7 @@ KOKKOS_ADD_EXECUTABLE_AND_TEST(
   PerformanceTest_Atomic_MinMax
   SOURCES test_atomic_minmax_simple.cpp
   CATEGORIES PERFORMANCE
+  ARGS 1000000
 )

 KOKKOS_ADD_EXECUTABLE_AND_TEST(

@@ -165,6 +202,7 @@ KOKKOS_INLINE_FUNCTION T atomic_fetch_oper(
} oldval, assume, newval;

oldval.t = *dest;
if (check_early_exit(Oper{}, oldval.t, val)) return oldval.t;
Member

Needs to be in the while loop

Author

Why? The purpose of this check is to determine if there is no need to even attempt an update.

I assumed load atomicity is architectural here, so if you are seeing issues on PowerPC, the fix is not to put the check in the loop but to add a relaxed atomic load of oldval.

Member

I am not asking you to change anything. @crtrott mentioned yesterday there might be an issue with the current implementation. I commented as a reminder for us.

Author

I am fine with not doing anything, but I would like to understand why anyone thinks this check belongs in the loop.

Yeah, @crtrott, I'm not sure why the early-exit check needs to be in the while loop. Probably the fewer branches in the CAS loop itself, the better.

Author

desul/desul#13 has our discussion of this topic. It may be easier to have it over there.
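
For readers following this thread, here is a minimal sketch of the two placements under discussion, written against `std::atomic` rather than the Kokkos internals. The function names are hypothetical, and the relaxed memory order is an assumption made for illustration only; the merged code in Kokkos_Atomic_Generic.hpp may well differ.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical sketch (not the Kokkos code): fetch-max where the no-op test
// runs ONCE, before the CAS loop, using a relaxed atomic load.
template <class T>
T fetch_max_check_once(std::atomic<T>& dest, T val) {
  T old = dest.load(std::memory_order_relaxed);
  if (!(old < val)) return old;  // early exit: max(old, val) == old
  // On failure compare_exchange_weak refreshes `old`; the new value is then
  // recomputed from it, so a retry may write back an unchanged value.
  while (!dest.compare_exchange_weak(old, (old < val) ? val : old,
                                     std::memory_order_relaxed)) {
  }
  return old;
}

// Hypothetical sketch: the same operation with the no-op test INSIDE the
// loop, so it is re-evaluated after every failed CAS.
template <class T>
T fetch_max_check_in_loop(std::atomic<T>& dest, T val) {
  T old = dest.load(std::memory_order_relaxed);
  while (old < val &&
         !dest.compare_exchange_weak(old, val, std::memory_order_relaxed)) {
  }
  return old;
}

// Usage example: both variants return the value seen before the operation.
inline void usage_example() {
  std::atomic<int> x{5};
  assert(fetch_max_check_once(x, 3) == 5);     // no-op, takes the early exit
  assert(fetch_max_check_in_loop(x, 9) == 5);  // replacement
  assert(x.load() == 9);
}
```

In this sketch both variants leave the same final value in `dest`. They differ only in what happens after a failed CAS: the first may write back an unchanged value, while the second tests the no-op condition again at the cost of an extra branch in the loop.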

@dalg24 dalg24 requested a review from crtrott August 22, 2020 00:32

@dhollman dhollman left a comment

LGTM

@dalg24 dalg24 merged commit bbd4c99 into kokkos:develop Aug 26, 2020
@jeffhammond jeffhammond deleted the issue-3144-new branch July 29, 2021 07:36