Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[OpenMP][FIX] Remove unsound omp_get_thread_limit deduplication #79524

Merged
merged 2 commits into from
Feb 22, 2024

Conversation

MattPD
Copy link
Member

@MattPD MattPD commented Jan 25, 2024

The deduplication of the calls to omp_get_thread_limit used to be legal when originally added in e28936f#diff-de101c82aff66b2bda2d1f53fde3dde7b0d370f14f1ff37b7919ce38531230dfR123, as the result (thread_limit) was immutable.

However, now that we have thread_limit clause, we no longer have immutability; therefore omp_get_thread_limit() is not a deduplicable runtime call.

Thus, removing omp_get_thread_limit from the DeduplicableRuntimeCallIDs array.

Here's a simple example:

#include <omp.h>
#include <stdio.h>

int main()
{
#pragma omp target thread_limit(4)
{
printf("\n1:target thread_limit: %d\n", omp_get_thread_limit());
}

#pragma omp target thread_limit(3)
{
printf("\n2:target thread_limit: %d\n", omp_get_thread_limit());
}
return 0;
}

GCC-compiled binary execution: https://gcc.godbolt.org/z/Pjv3TWoTq

1:target thread_limit: 4
2:target thread_limit: 3

Clang/LLVM-compiled binary execution: https://clang.godbolt.org/z/zdPbrdMPn

1:target thread_limit: 4
2:target thread_limit: 4

By my reading of the OpenMP spec GCC does the right thing here; cf. https://www.openmp.org/spec-html/5.2/openmpse12.html#x34-330002.4:

If a target construct with a thread_limit clause is encountered, the thread-limit-var ICV from the data environment of the generated initial task is instead set to an implementation defined value between one and the value specified in the clause.

The common subexpression elimination (CSE) of the second call to omp_get_thread_limit by LLVM does not seem to be correct, as it's not an available expression at any program point(s) (in the scope of the clause in question) after the second target construct with a thread_limit clause is encountered.

Compiling with -Rpass=openmp-opt -Rpass-analysis=openmp-opt -Rpass-missed=openmp-opt we have:
https://clang.godbolt.org/z/G7dfhP7jh

<source>:8:42: remark: OpenMP runtime call omp_get_thread_limit deduplicated. [OMP170] [-Rpass=openmp-opt]
8 | printf("\n1:target thread_limit: %d\n",omp_get_thread_limit());
| ^

OMP170 has the following explanation: https://openmp.llvm.org/remarks/OMP170.html

This optimization remark indicates that a call to an OpenMP runtime call was replaced with the result of an existing one. This occurs when the compiler knows that the result of a runtime call is immutable. Removing duplicate calls is done by replacing all calls to that function with the result of the first call. This cannot be done automatically by the compiler because the implementations of the OpenMP runtime calls live in a separate library the compiler cannot see.
This optimization will trigger for known OpenMP runtime calls whose return value will not change.

At the same time I do not believe we have an analysis checking whether this precondition holds here: "This occurs when the compiler knows that the result of a runtime call is immutable."

AFAICT, such analysis doesn't appear to exist in the original patch introducing deduplication, either:

The fix is to remove it from DeduplicableRuntimeCallIDs, effectively reverting the addition in this commit (noting that omp_get_max_threads is not present in DeduplicableRuntimeCallIDs, so it's possible this addition was incorrect in the first place):

As a result, we're no longer unsoundly deduplicating the OpenMP runtime call omp_get_thread_limit as illustrated by the test case: Note the (correctly) repeated call i32 @omp_get_thread_limit().

@llvmbot llvmbot added llvm:transforms clang:openmp OpenMP related changes to Clang labels Jan 25, 2024
@llvmbot
Copy link
Collaborator

llvmbot commented Jan 25, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Matt (MattPD)

Changes

The deduplication of the calls to omp_get_thread_limit used to be legal when originally added in <e28936f#diff-de101c82aff66b2bda2d1f53fde3dde7b0d370f14f1ff37b7919ce38531230dfR123>, as the result (thread_limit) was immutable.

However, now that we have thread_limit clause, we no longer have immutability; therefore omp_get_thread_limit() is not a deduplicable runtime call.

Thus, removing omp_get_thread_limit from the DeduplicableRuntimeCallIDs array.

Here's a simple example:

#include &lt;omp.h&gt;
#include &lt;stdio.h&gt;

int main()
{
#pragma omp target thread_limit(4)
{
printf("\n1:target thread_limit: %d\n", omp_get_thread_limit());
}

#pragma omp target thread_limit(3)
{
printf("\n2:target thread_limit: %d\n", omp_get_thread_limit());
}
return 0;
}

GCC-compiled binary execution: https://gcc.godbolt.org/z/Pjv3TWoTq

1:target thread_limit: 4
2:target thread_limit: 3

Clang/LLVM-compiled binary execution: https://clang.godbolt.org/z/zdPbrdMPn

1:target thread_limit: 4
2:target thread_limit: 4

By my reading of the OpenMP spec GCC does the right thing here; cf. <https://www.openmp.org/spec-html/5.2/openmpse12.html#x34-330002.4>:
> If a target construct with a thread_limit clause is encountered, the thread-limit-var ICV from the data environment of the generated initial task is instead set to an implementation defined value between one and the value specified in the clause.

The common subexpression elimination (CSE) of the second call to omp_get_thread_limit by LLVM does not seem to be correct, as it's not an available expression at any program point(s) (in the scope of the clause in question) after the second target construct with a thread_limit clause is encountered.

Compiling with -Rpass=openmp-opt -Rpass-analysis=openmp-opt -Rpass-missed=openmp-opt we have:
https://clang.godbolt.org/z/G7dfhP7jh

&lt;source&gt;:8:42: remark: OpenMP runtime call omp_get_thread_limit deduplicated. [OMP170] [-Rpass=openmp-opt]
8 | printf("\n1:target thread_limit: %d\n",omp_get_thread_limit());
| ^

OMP170 has the following explanation: https://openmp.llvm.org/remarks/OMP170.html

> This optimization remark indicates that a call to an OpenMP runtime call was replaced with the result of an existing one. This occurs when the compiler knows that the result of a runtime call is immutable. Removing duplicate calls is done by replacing all calls to that function with the result of the first call. This cannot be done automatically by the compiler because the implementations of the OpenMP runtime calls live in a separate library the compiler cannot see.
This optimization will trigger for known OpenMP runtime calls whose return value will not change.

At the same time I do not believe we have an analysis checking whether this precondition holds here: "This occurs when the compiler knows that the result of a runtime call is immutable."

AFAICT, such analysis doesn't appear to exist in the original patch introducing deduplication, either:

The fix is to remove it from DeduplicableRuntimeCallIDs, effectively reverting the addition in this commit (noting that omp_get_max_threads is not present in DeduplicableRuntimeCallIDs, so it's possible this addition was incorrect in the first place):

As a result, we're no longer unsoundly deduplicating the OpenMP runtime call omp_get_thread_limit as illustrated by the test case: Note the (correctly) repeated call i32 @<!-- -->omp_get_thread_limit().


Full diff: https://github.com/llvm/llvm-project/pull/79524.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/IPO/OpenMPOpt.cpp (-1)
  • (added) llvm/test/Transforms/OpenMP/deduplication_soundness.ll (+70)
diff --git a/llvm/lib/Transforms/IPO/OpenMPOpt.cpp b/llvm/lib/Transforms/IPO/OpenMPOpt.cpp
index 4176d561363fbd9..77ca36d64029f09 100644
--- a/llvm/lib/Transforms/IPO/OpenMPOpt.cpp
+++ b/llvm/lib/Transforms/IPO/OpenMPOpt.cpp
@@ -1471,7 +1471,6 @@ struct OpenMPOpt {
         OMPRTL_omp_get_num_threads,
         OMPRTL_omp_in_parallel,
         OMPRTL_omp_get_cancellation,
-        OMPRTL_omp_get_thread_limit,
         OMPRTL_omp_get_supported_active_levels,
         OMPRTL_omp_get_level,
         OMPRTL_omp_get_ancestor_thread_num,
diff --git a/llvm/test/Transforms/OpenMP/deduplication_soundness.ll b/llvm/test/Transforms/OpenMP/deduplication_soundness.ll
new file mode 100644
index 000000000000000..71034751ea7bbf1
--- /dev/null
+++ b/llvm/test/Transforms/OpenMP/deduplication_soundness.ll
@@ -0,0 +1,70 @@
+; RUN: opt -passes=openmp-opt-cgscc -S < %s | FileCheck %s
+
+target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
+
+declare i32 @printf(ptr noundef, ...)
+declare i32 @omp_get_thread_limit()
+; Function Attrs: nounwind
+declare void @__kmpc_set_thread_limit(ptr, i32, i32)
+; Function Attrs: nounwind
+declare i32 @__kmpc_global_thread_num(ptr)
+; Function Attrs: nounwind
+declare noalias ptr @__kmpc_omp_task_alloc(ptr, i32, i32, i64, i64, ptr)
+; Function Attrs: nounwind
+declare void @__kmpc_omp_task_complete_if0(ptr, i32, ptr)
+; Function Attrs: nounwind
+declare void @__kmpc_omp_task_begin_if0(ptr, i32, ptr)
+
+%struct.ident_t = type { i32, i32, i32, i32, ptr }
+
+@.str = private unnamed_addr constant [28 x i8] c"\0A1:target thread_limit: %d\0A\00", align 1
+@0 = private unnamed_addr constant [23 x i8] c";unknown;unknown;0;0;;\00", align 1
+@1 = private unnamed_addr constant %struct.ident_t { i32 0, i32 2, i32 0, i32 22, ptr @0 }, align 8
+@.str.1 = private unnamed_addr constant [28 x i8] c"\0A2:target thread_limit: %d\0A\00", align 1
+
+define dso_local i32 @main() local_unnamed_addr {
+; CHECK-LABEL: define {{[^@]+}}@main
+; CHECK-NEXT:  entry:
+; CHECK: %call.i.i.i = call i32 @omp_get_thread_limit()
+; CHECK-NEXT: %call1.i.i.i = call i32 (ptr, ...) @printf(ptr noundef nonnull dereferenceable(1) @.str, i32 noundef %call.i.i.i)
+; CHECK: %call.i.i.i1 = call i32 @omp_get_thread_limit()
+; CHECK-NEXT: %call1.i.i.i2 = call i32 (ptr, ...) @printf(ptr noundef nonnull dereferenceable(1) @.str.1, i32 noundef %call.i.i.i1)
+entry:
+  %0 = call i32 @__kmpc_global_thread_num(ptr nonnull @1)
+  %1 = call ptr @__kmpc_omp_task_alloc(ptr nonnull @1, i32 %0, i32 1, i64 40, i64 0, ptr nonnull @.omp_task_entry.)
+  call void @__kmpc_omp_task_begin_if0(ptr nonnull @1, i32 %0, ptr %1)
+  call void @__kmpc_set_thread_limit(ptr nonnull @1, i32 %0, i32 4)
+  %call.i.i.i = call i32 @omp_get_thread_limit()
+  %call1.i.i.i = call i32 (ptr, ...) @printf(ptr noundef nonnull dereferenceable(1) @.str, i32 noundef %call.i.i.i)
+  call void @__kmpc_omp_task_complete_if0(ptr nonnull @1, i32 %0, ptr %1)
+  %2 = call ptr @__kmpc_omp_task_alloc(ptr nonnull @1, i32 %0, i32 1, i64 40, i64 0, ptr nonnull @.omp_task_entry..3)
+  call void @__kmpc_omp_task_begin_if0(ptr nonnull @1, i32 %0, ptr %2)
+  call void @__kmpc_set_thread_limit(ptr nonnull @1, i32 %0, i32 3)
+  %call.i.i.i1 = call i32 @omp_get_thread_limit()
+  %call1.i.i.i2 = call i32 (ptr, ...) @printf(ptr noundef nonnull dereferenceable(1) @.str.1, i32 noundef %call.i.i.i1)
+  call void @__kmpc_omp_task_complete_if0(ptr nonnull @1, i32 %0, ptr %2)
+  ret i32 0
+}
+
+define internal noundef i32 @.omp_task_entry.(i32 noundef %0, ptr noalias nocapture noundef readonly %1) {
+entry:
+  tail call void @__kmpc_set_thread_limit(ptr nonnull @1, i32 %0, i32 4)
+  %call.i.i = tail call i32 @omp_get_thread_limit()
+  %call1.i.i = tail call i32 (ptr, ...) @printf(ptr noundef nonnull dereferenceable(1) @.str, i32 noundef %call.i.i)
+  ret i32 0
+}
+
+define internal noundef i32 @.omp_task_entry..3(i32 noundef %0, ptr noalias nocapture noundef readonly %1) {
+entry:
+  tail call void @__kmpc_set_thread_limit(ptr nonnull @1, i32 %0, i32 3)
+  %call.i.i = tail call i32 @omp_get_thread_limit()
+  %call1.i.i = tail call i32 (ptr, ...) @printf(ptr noundef nonnull dereferenceable(1) @.str.1, i32 noundef %call.i.i)
+  ret i32 0
+}
+
+attributes #1 = { alwaysinline norecurse nounwind uwtable }
+attributes #3 = { alwaysinline nounwind uwtable }
+
+!llvm.module.flags = !{!0}
+
+!0 = !{i32 7, !"openmp", i32 51}

Copy link
Contributor

@jhuber6 jhuber6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable to me. Some nits about the added test.

@@ -0,0 +1,70 @@
; RUN: opt -passes=openmp-opt-cgscc -S < %s | FileCheck %s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a test like this I'd recommend generating the check lines with llvm/utils/update_test_checks.py --function-signature.

Copy link
Member Author

@MattPD MattPD Feb 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I'd prefer this too, if possible; will see if there's a way to avoid generating more check lines than I already have (in the past had to use UTC_ARGS for in certain cases, but perhaps won't be needed here).

Edit: UTC_ARGS: --function main --scrub-attributes --filter "@omp_get_thread_limit|@use" does it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@@ -0,0 +1,70 @@
; RUN: opt -passes=openmp-opt-cgscc -S < %s | FileCheck %s

target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clean up this test somewhat? I don't think we need the data layout or attributes like noundef and dso_local.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!


target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"

declare i32 @printf(ptr noundef, ...)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just make this some external call instead of printf.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Let me know if any further changes may be required

The deduplication of the calls to `omp_get_thread_limit` used to be legal when originally added in <llvm@e28936f#diff-de101c82aff66b2bda2d1f53fde3dde7b0d370f14f1ff37b7919ce38531230dfR123>, as the result (thread_limit) was immutable.

However, now that we have `thread_limit` clause, we no longer have immutability; therefore `omp_get_thread_limit()` is not a deduplicable runtime call.

Thus, removing `omp_get_thread_limit` from the `DeduplicableRuntimeCallIDs` array.

Here's a simple example:
```

int main()
{
{
printf("\n1:target thread_limit: %d\n", omp_get_thread_limit());
}

{
printf("\n2:target thread_limit: %d\n", omp_get_thread_limit());
}
return 0;
}
```

GCC-compiled binary execution: https://gcc.godbolt.org/z/Pjv3TWoTq
```
1:target thread_limit: 4
2:target thread_limit: 3
```

Clang/LLVM-compiled binary execution: https://clang.godbolt.org/z/zdPbrdMPn
```
1:target thread_limit: 4
2:target thread_limit: 4
```

By my reading of the OpenMP spec GCC does the right thing here; cf. <https://www.openmp.org/spec-html/5.2/openmpse12.html#x34-330002.4>:
> If a target construct with a thread_limit clause is encountered, the thread-limit-var ICV from the data environment of the generated initial task is instead set to an implementation defined value between one and the value specified in the clause.

The common subexpression elimination (CSE) of the second call to `omp_get_thread_limit` by LLVM does not seem to be correct, as it's not an available expression at any program point(s) (in the scope of the clause in question) after the second target construct with a `thread_limit` clause is encountered.

Compiling with `-Rpass=openmp-opt -Rpass-analysis=openmp-opt -Rpass-missed=openmp-opt` we have:
https://clang.godbolt.org/z/G7dfhP7jh
```
<source>:8:42: remark: OpenMP runtime call omp_get_thread_limit deduplicated. [OMP170] [-Rpass=openmp-opt]
8 | printf("\n1:target thread_limit: %d\n",omp_get_thread_limit());
| ^
```

OMP170 has the following explanation: https://openmp.llvm.org/remarks/OMP170.html

> This optimization remark indicates that a call to an OpenMP runtime call was replaced with the result of an existing one. This occurs when the compiler knows that the result of a runtime call is immutable. Removing duplicate calls is done by replacing all calls to that function with the result of the first call. This cannot be done automatically by the compiler because the implementations of the OpenMP runtime calls live in a separate library the compiler cannot see.
This optimization will trigger for known OpenMP runtime calls whose return value will not change.

At the same time I do not believe we have an analysis checking whether this precondition holds here: "This occurs when the compiler knows that the result of a runtime call is immutable."

AFAICT, such analysis doesn't appear to exist in the original patch introducing deduplication, either:

- llvm@9548b74
- https://reviews.llvm.org/D69930

The fix is to remove it from `DeduplicableRuntimeCallIDs`, effectively reverting the addition in this commit (noting that `omp_get_max_threads` is not present in `DeduplicableRuntimeCallIDs`, so it's possible this addition was incorrect in the first place):

- [OpenMP][Opt] Annotate known runtime functions and deduplicate more,
- llvm@e28936f#diff-de101c82aff66b2bda2d1f53fde3dde7b0d370f14f1ff37b7919ce38531230dfR123

As a result, we're no longer unsoundly deduplicating the OpenMP runtime call `omp_get_thread_limit` as illustrated by the test case: Note the (correctly) repeated `call i32 @omp_get_thread_limit()`.
@MattPD MattPD force-pushed the no_get_thread_limit_deduplication branch from 1c7d5b4 to 4167b66 Compare February 21, 2024 02:56
Copy link
Contributor

@jhuber6 jhuber6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG overall. Shouldn't need the nonnull, nounwind, or noundef attributes either.

llvm/test/Transforms/OpenMP/deduplication_soundness.ll Outdated Show resolved Hide resolved
Copy link
Contributor

@jhuber6 jhuber6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I meant all of them but just pointed out that one for reference.

Co-authored-by: Joseph Huber <huberjn@outlook.com>
@MattPD MattPD force-pushed the no_get_thread_limit_deduplication branch from 154721e to 31efd7d Compare February 21, 2024 22:29
Copy link
Contributor

@jhuber6 jhuber6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good enough, need me to merge this for you?

@MattPD
Copy link
Member Author

MattPD commented Feb 21, 2024

Sounds great! Yes, please, I appreciate it (as well as the useful feedback), thanks! :-)

@jhuber6 jhuber6 merged commit 88e31f6 into llvm:main Feb 22, 2024
3 of 4 checks passed
@MattPD MattPD deleted the no_get_thread_limit_deduplication branch February 22, 2024 17:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:openmp OpenMP related changes to Clang llvm:transforms
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants