-
Notifications
You must be signed in to change notification settings - Fork 10.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[OpenMP][FIX] Remove unsound omp_get_thread_limit deduplication #79524
Conversation
@llvm/pr-subscribers-llvm-transforms Author: Matt (MattPD) ChangesThe deduplication of the calls to However, now that we have Thus, removing Here's a simple example:
GCC-compiled binary execution: https://gcc.godbolt.org/z/Pjv3TWoTq
Clang/LLVM-compiled binary execution: https://clang.godbolt.org/z/zdPbrdMPn
By my reading of the OpenMP spec GCC does the right thing here; cf. <https://www.openmp.org/spec-html/5.2/openmpse12.html#x34-330002.4>: The common subexpression elimination (CSE) of the second call to Compiling with
OMP170 has the following explanation: https://openmp.llvm.org/remarks/OMP170.html > This optimization remark indicates that a call to an OpenMP runtime call was replaced with the result of an existing one. This occurs when the compiler knows that the result of a runtime call is immutable. Removing duplicate calls is done by replacing all calls to that function with the result of the first call. This cannot be done automatically by the compiler because the implementations of the OpenMP runtime calls live in a separate library the compiler cannot see. At the same time I do not believe we have an analysis checking whether this precondition holds here: "This occurs when the compiler knows that the result of a runtime call is immutable." AFAICT, such analysis doesn't appear to exist in the original patch introducing deduplication, either: The fix is to remove it from
As a result, we're no longer unsoundly deduplicating the OpenMP runtime call Full diff: https://github.com/llvm/llvm-project/pull/79524.diff 2 Files Affected:
diff --git a/llvm/lib/Transforms/IPO/OpenMPOpt.cpp b/llvm/lib/Transforms/IPO/OpenMPOpt.cpp
index 4176d561363fbd9..77ca36d64029f09 100644
--- a/llvm/lib/Transforms/IPO/OpenMPOpt.cpp
+++ b/llvm/lib/Transforms/IPO/OpenMPOpt.cpp
@@ -1471,7 +1471,6 @@ struct OpenMPOpt {
OMPRTL_omp_get_num_threads,
OMPRTL_omp_in_parallel,
OMPRTL_omp_get_cancellation,
- OMPRTL_omp_get_thread_limit,
OMPRTL_omp_get_supported_active_levels,
OMPRTL_omp_get_level,
OMPRTL_omp_get_ancestor_thread_num,
diff --git a/llvm/test/Transforms/OpenMP/deduplication_soundness.ll b/llvm/test/Transforms/OpenMP/deduplication_soundness.ll
new file mode 100644
index 000000000000000..71034751ea7bbf1
--- /dev/null
+++ b/llvm/test/Transforms/OpenMP/deduplication_soundness.ll
@@ -0,0 +1,70 @@
+; RUN: opt -passes=openmp-opt-cgscc -S < %s | FileCheck %s
+
+target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
+
+declare i32 @printf(ptr noundef, ...)
+declare i32 @omp_get_thread_limit()
+; Function Attrs: nounwind
+declare void @__kmpc_set_thread_limit(ptr, i32, i32)
+; Function Attrs: nounwind
+declare i32 @__kmpc_global_thread_num(ptr)
+; Function Attrs: nounwind
+declare noalias ptr @__kmpc_omp_task_alloc(ptr, i32, i32, i64, i64, ptr)
+; Function Attrs: nounwind
+declare void @__kmpc_omp_task_complete_if0(ptr, i32, ptr)
+; Function Attrs: nounwind
+declare void @__kmpc_omp_task_begin_if0(ptr, i32, ptr)
+
+%struct.ident_t = type { i32, i32, i32, i32, ptr }
+
+@.str = private unnamed_addr constant [28 x i8] c"\0A1:target thread_limit: %d\0A\00", align 1
+@0 = private unnamed_addr constant [23 x i8] c";unknown;unknown;0;0;;\00", align 1
+@1 = private unnamed_addr constant %struct.ident_t { i32 0, i32 2, i32 0, i32 22, ptr @0 }, align 8
+@.str.1 = private unnamed_addr constant [28 x i8] c"\0A2:target thread_limit: %d\0A\00", align 1
+
+define dso_local i32 @main() local_unnamed_addr {
+; CHECK-LABEL: define {{[^@]+}}@main
+; CHECK-NEXT: entry:
+; CHECK: %call.i.i.i = call i32 @omp_get_thread_limit()
+; CHECK-NEXT: %call1.i.i.i = call i32 (ptr, ...) @printf(ptr noundef nonnull dereferenceable(1) @.str, i32 noundef %call.i.i.i)
+; CHECK: %call.i.i.i1 = call i32 @omp_get_thread_limit()
+; CHECK-NEXT: %call1.i.i.i2 = call i32 (ptr, ...) @printf(ptr noundef nonnull dereferenceable(1) @.str.1, i32 noundef %call.i.i.i1)
+entry:
+ %0 = call i32 @__kmpc_global_thread_num(ptr nonnull @1)
+ %1 = call ptr @__kmpc_omp_task_alloc(ptr nonnull @1, i32 %0, i32 1, i64 40, i64 0, ptr nonnull @.omp_task_entry.)
+ call void @__kmpc_omp_task_begin_if0(ptr nonnull @1, i32 %0, ptr %1)
+ call void @__kmpc_set_thread_limit(ptr nonnull @1, i32 %0, i32 4)
+ %call.i.i.i = call i32 @omp_get_thread_limit()
+ %call1.i.i.i = call i32 (ptr, ...) @printf(ptr noundef nonnull dereferenceable(1) @.str, i32 noundef %call.i.i.i)
+ call void @__kmpc_omp_task_complete_if0(ptr nonnull @1, i32 %0, ptr %1)
+ %2 = call ptr @__kmpc_omp_task_alloc(ptr nonnull @1, i32 %0, i32 1, i64 40, i64 0, ptr nonnull @.omp_task_entry..3)
+ call void @__kmpc_omp_task_begin_if0(ptr nonnull @1, i32 %0, ptr %2)
+ call void @__kmpc_set_thread_limit(ptr nonnull @1, i32 %0, i32 3)
+ %call.i.i.i1 = call i32 @omp_get_thread_limit()
+ %call1.i.i.i2 = call i32 (ptr, ...) @printf(ptr noundef nonnull dereferenceable(1) @.str.1, i32 noundef %call.i.i.i1)
+ call void @__kmpc_omp_task_complete_if0(ptr nonnull @1, i32 %0, ptr %2)
+ ret i32 0
+}
+
+define internal noundef i32 @.omp_task_entry.(i32 noundef %0, ptr noalias nocapture noundef readonly %1) {
+entry:
+ tail call void @__kmpc_set_thread_limit(ptr nonnull @1, i32 %0, i32 4)
+ %call.i.i = tail call i32 @omp_get_thread_limit()
+ %call1.i.i = tail call i32 (ptr, ...) @printf(ptr noundef nonnull dereferenceable(1) @.str, i32 noundef %call.i.i)
+ ret i32 0
+}
+
+define internal noundef i32 @.omp_task_entry..3(i32 noundef %0, ptr noalias nocapture noundef readonly %1) {
+entry:
+ tail call void @__kmpc_set_thread_limit(ptr nonnull @1, i32 %0, i32 3)
+ %call.i.i = tail call i32 @omp_get_thread_limit()
+ %call1.i.i = tail call i32 (ptr, ...) @printf(ptr noundef nonnull dereferenceable(1) @.str.1, i32 noundef %call.i.i)
+ ret i32 0
+}
+
+attributes #1 = { alwaysinline norecurse nounwind uwtable }
+attributes #3 = { alwaysinline nounwind uwtable }
+
+!llvm.module.flags = !{!0}
+
+!0 = !{i32 7, !"openmp", i32 51}
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems reasonable to me. Some nits about the added test.
@@ -0,0 +1,70 @@ | |||
; RUN: opt -passes=openmp-opt-cgscc -S < %s | FileCheck %s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a test like this I'd recommend generating the check lines with llvm/utils/update_test_checks.py --function-signature
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I'd prefer this too, if possible; will see if there's a way to avoid generating more check lines than I already have (in the past had to use UTC_ARGS
for in certain cases, but perhaps won't be needed here).
Edit: UTC_ARGS: --function main --scrub-attributes --filter "@omp_get_thread_limit|@use"
does it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
@@ -0,0 +1,70 @@ | |||
; RUN: opt -passes=openmp-opt-cgscc -S < %s | FileCheck %s | |||
|
|||
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clean up this test somewhat? I don't think we need the data layout or attributes like noundef
and dso_local
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
|
||
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128" | ||
|
||
declare i32 @printf(ptr noundef, ...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just make this some external call instead of printf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
Let me know if any further changes may be required
The deduplication of the calls to `omp_get_thread_limit` used to be legal when originally added in <llvm@e28936f#diff-de101c82aff66b2bda2d1f53fde3dde7b0d370f14f1ff37b7919ce38531230dfR123>, as the result (thread_limit) was immutable. However, now that we have `thread_limit` clause, we no longer have immutability; therefore `omp_get_thread_limit()` is not a deduplicable runtime call. Thus, removing `omp_get_thread_limit` from the `DeduplicableRuntimeCallIDs` array. Here's a simple example: ``` int main() { { printf("\n1:target thread_limit: %d\n", omp_get_thread_limit()); } { printf("\n2:target thread_limit: %d\n", omp_get_thread_limit()); } return 0; } ``` GCC-compiled binary execution: https://gcc.godbolt.org/z/Pjv3TWoTq ``` 1:target thread_limit: 4 2:target thread_limit: 3 ``` Clang/LLVM-compiled binary execution: https://clang.godbolt.org/z/zdPbrdMPn ``` 1:target thread_limit: 4 2:target thread_limit: 4 ``` By my reading of the OpenMP spec GCC does the right thing here; cf. <https://www.openmp.org/spec-html/5.2/openmpse12.html#x34-330002.4>: > If a target construct with a thread_limit clause is encountered, the thread-limit-var ICV from the data environment of the generated initial task is instead set to an implementation defined value between one and the value specified in the clause. The common subexpression elimination (CSE) of the second call to `omp_get_thread_limit` by LLVM does not seem to be correct, as it's not an available expression at any program point(s) (in the scope of the clause in question) after the second target construct with a `thread_limit` clause is encountered. Compiling with `-Rpass=openmp-opt -Rpass-analysis=openmp-opt -Rpass-missed=openmp-opt` we have: https://clang.godbolt.org/z/G7dfhP7jh ``` <source>:8:42: remark: OpenMP runtime call omp_get_thread_limit deduplicated. [OMP170] [-Rpass=openmp-opt] 8 | printf("\n1:target thread_limit: %d\n",omp_get_thread_limit()); | ^ ``` OMP170 has the following explanation: https://openmp.llvm.org/remarks/OMP170.html > This optimization remark indicates that a call to an OpenMP runtime call was replaced with the result of an existing one. This occurs when the compiler knows that the result of a runtime call is immutable. Removing duplicate calls is done by replacing all calls to that function with the result of the first call. This cannot be done automatically by the compiler because the implementations of the OpenMP runtime calls live in a separate library the compiler cannot see. This optimization will trigger for known OpenMP runtime calls whose return value will not change. At the same time I do not believe we have an analysis checking whether this precondition holds here: "This occurs when the compiler knows that the result of a runtime call is immutable." AFAICT, such analysis doesn't appear to exist in the original patch introducing deduplication, either: - llvm@9548b74 - https://reviews.llvm.org/D69930 The fix is to remove it from `DeduplicableRuntimeCallIDs`, effectively reverting the addition in this commit (noting that `omp_get_max_threads` is not present in `DeduplicableRuntimeCallIDs`, so it's possible this addition was incorrect in the first place): - [OpenMP][Opt] Annotate known runtime functions and deduplicate more, - llvm@e28936f#diff-de101c82aff66b2bda2d1f53fde3dde7b0d370f14f1ff37b7919ce38531230dfR123 As a result, we're no longer unsoundly deduplicating the OpenMP runtime call `omp_get_thread_limit` as illustrated by the test case: Note the (correctly) repeated `call i32 @omp_get_thread_limit()`.
1c7d5b4
to
4167b66
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LG overall. Shouldn't need the nonnull
, nounwind
, or noundef
attributes either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I meant all of them but just pointed out that one for reference.
Co-authored-by: Joseph Huber <huberjn@outlook.com>
154721e
to
31efd7d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good enough, need me to merge this for you?
Sounds great! Yes, please, I appreciate it (as well as the useful feedback), thanks! :-) |
The deduplication of the calls to
omp_get_thread_limit
used to be legal when originally added in e28936f#diff-de101c82aff66b2bda2d1f53fde3dde7b0d370f14f1ff37b7919ce38531230dfR123, as the result (thread_limit) was immutable.However, now that we have
thread_limit
clause, we no longer have immutability; thereforeomp_get_thread_limit()
is not a deduplicable runtime call.Thus, removing
omp_get_thread_limit
from theDeduplicableRuntimeCallIDs
array.Here's a simple example:
GCC-compiled binary execution: https://gcc.godbolt.org/z/Pjv3TWoTq
Clang/LLVM-compiled binary execution: https://clang.godbolt.org/z/zdPbrdMPn
By my reading of the OpenMP spec GCC does the right thing here; cf. https://www.openmp.org/spec-html/5.2/openmpse12.html#x34-330002.4:
The common subexpression elimination (CSE) of the second call to
omp_get_thread_limit
by LLVM does not seem to be correct, as it's not an available expression at any program point(s) (in the scope of the clause in question) after the second target construct with athread_limit
clause is encountered.Compiling with
-Rpass=openmp-opt -Rpass-analysis=openmp-opt -Rpass-missed=openmp-opt
we have:https://clang.godbolt.org/z/G7dfhP7jh
OMP170 has the following explanation: https://openmp.llvm.org/remarks/OMP170.html
At the same time I do not believe we have an analysis checking whether this precondition holds here: "This occurs when the compiler knows that the result of a runtime call is immutable."
AFAICT, such analysis doesn't appear to exist in the original patch introducing deduplication, either:
The fix is to remove it from
DeduplicableRuntimeCallIDs
, effectively reverting the addition in this commit (noting thatomp_get_max_threads
is not present inDeduplicableRuntimeCallIDs
, so it's possible this addition was incorrect in the first place):As a result, we're no longer unsoundly deduplicating the OpenMP runtime call
omp_get_thread_limit
as illustrated by the test case: Note the (correctly) repeatedcall i32 @omp_get_thread_limit()
.