
[OpenMP][MLIR][OMPIRBuilder] Add a small optional constant alloca raise function pass to finalize, utilised in convertTarget #78818

Merged 4 commits on Feb 23, 2024

Conversation

agozillon
Contributor

This patch seeks to add a mechanism for raising constant-sized (not ConstantExpr or runtime/dynamic-sized) allocations into the entry block of selected functions that have been inserted into a list for processing. The processing occurs during the finalize call, after the OutlinedInfo regions have completed. It is currently only utilised by createOutlinedFunction, which is triggered for TargetOp generation in the OpenMP MLIR dialect lowering to LLVM-IR.

This is currently required for target kernels generated by createOutlinedFunction, to prevent subsequent optimisation passes from performing unintended, malformed optimisations on AMD kernels (I am unsure whether it occurs for other vendors). If allocas are generated inside the kernel, outside of the entry block, and are subsequently passed to a function, required instructions can be erased or manipulated in a way that causes the kernel to run into an HSA access error.

This fix is related to a series of problems found in: #74603

This problem currently presents itself primarily for Flang's HLFIR AssignOp, when it is used with a scalar temporary constant on the RHS and a descriptor type on the LHS. The op will generate a call to a runtime function, wrap the RHS temporary in a newly allocated descriptor (an LLVM struct), and pass both the LHS and RHS descriptors into the runtime call. This code is currently embedded into the middle of the target region in the user entry block, which means the allocas are also embedded in the middle, and that seems to pose issues when later passes are executed. The issue may also appear for other HLFIR operations, or for unrelated operations that generate allocas as a by-product, but for the moment this one test case is the only scenario in which I've found the problem.

Perhaps this is not the appropriate fix; I am very open to other suggestions. I've tried a few others (at varying levels of the flang/mlir compiler flow), but this one is the smallest and least intrusive change set. Two alternatives come to mind (neither fully explored; I experimented a little with blocks for the former, but it had a few issues I'd need to think through):

  • Having a proper alloca-only block (or region) generated for TargetOps that we could merge into the entry block generated by convertTarget's createOutlinedFunction.
  • Diverging a little from Clang's current target generation and using the CodeExtractor to generate the user code as an outlined function invoked from the kernel we make, with our kernel arguments passed into it, similar to the current parallel generation. I am not sure how well this would intermingle with the existing parallel generation that's layered in, though.

Both of these methods seem like quite a divergence from the current status quo, and I am not entirely sure that is merited for the small test this change aims to fix.

@llvmbot
Collaborator

llvmbot commented Jan 20, 2024

@llvm/pr-subscribers-flang-openmp
@llvm/pr-subscribers-mlir

@llvm/pr-subscribers-mlir-llvm

Author: None (agozillon)

Changes



Full diff: https://github.com/llvm/llvm-project/pull/78818.diff

3 Files Affected:

  • (modified) llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h (+12)
  • (modified) llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp (+52)
  • (added) mlir/test/Target/LLVMIR/omptarget-constant-alloca-raise.mlir (+43)
diff --git a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
index 669104307fa0e23..e7be143695be1ac 100644
--- a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
+++ b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
@@ -1495,6 +1495,11 @@ class OpenMPIRBuilder {
   /// Collection of regions that need to be outlined during finalization.
   SmallVector<OutlineInfo, 16> OutlineInfos;
 
+  /// A collection of candidate target functions that's constant allocas will
+  /// attempt to be raised on a call of finalize after all currently enqueued
+  /// outline info's have been processed.
+  SmallVector<llvm::Function *, 16> ConstantAllocaRaiseCandidates;
+
   /// Collection of owned canonical loop objects that eventually need to be
   /// free'd.
   std::forward_list<CanonicalLoopInfo> LoopInfos;
@@ -1502,6 +1507,13 @@ class OpenMPIRBuilder {
   /// Add a new region that will be outlined later.
   void addOutlineInfo(OutlineInfo &&OI) { OutlineInfos.emplace_back(OI); }
 
+  /// Add a function that's constant allocas will attempt to be raised on a
+  /// call of finalize after all currently enqueued outline info's have been
+  /// processed.
+  void addConstantAllocaRaiseCandidates(Function *F) {
+    ConstantAllocaRaiseCandidates.emplace_back(F);
+  }
+
   /// An ordered map of auto-generated variables to their unique names.
   /// It stores variables with the following names: 1) ".gomp_critical_user_" +
   /// <critical_section_name> + ".var" for "omp critical" directives; 2)
diff --git a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
index f6cf358119fb715..15ba5118e7be07e 100644
--- a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
+++ b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
@@ -633,6 +633,30 @@ Function *OpenMPIRBuilder::getOrCreateRuntimeFunctionPtr(RuntimeFunction FnID) {
 
 void OpenMPIRBuilder::initialize() { initializeTypes(M); }
 
+static void raiseUserConstantDataAllocasToEntryBlock(IRBuilderBase &Builder,
+                                                     Function *Function) {
+  BasicBlock &EntryBlock = Function->getEntryBlock();
+  Instruction *MoveLocInst = EntryBlock.getFirstNonPHI();
+
+  // Loop over blocks looking for allocas, skip the entry block allocas here are
+  // in the appropriate place.
+  for (auto Block = std::next(Function->begin(), 1); Block != Function->end();
+       Block++) {
+    for (auto Inst = Block->getReverseIterator()->begin();
+         Inst != Block->getReverseIterator()->end();) {
+      if (auto *AllocaInst =
+              llvm::dyn_cast_if_present<llvm::AllocaInst>(Inst)) {
+        Inst++;
+        if (!isa<ConstantData>(AllocaInst->getArraySize()))
+          continue;
+        AllocaInst->moveBeforePreserving(MoveLocInst);
+      } else {
+        Inst++;
+      }
+    }
+  }
+}
+
 void OpenMPIRBuilder::finalize(Function *Fn) {
   SmallPtrSet<BasicBlock *, 32> ParallelRegionBlockSet;
   SmallVector<BasicBlock *, 32> Blocks;
@@ -728,6 +752,28 @@ void OpenMPIRBuilder::finalize(Function *Fn) {
   // Remove work items that have been completed.
   OutlineInfos = std::move(DeferredOutlines);
 
+  // The createTarget functions embeds user written code into
+  // the target region which may inject allocas which need to
+  // be moved to the entry block of our target or risk malformed
+  // optimisations by later passes, this is only relevant for
+  // the device pass which appears to be a little more delicate
+  // when it comes to optimisations (however, we do not block on
+  // that here, it's up to the inserter to the list to do so).
+  // This notably has to occur after the OutlinedInfo candidates
+  // have been extracted so we have an end product that will not
+  // be implicitly adversely affected by any raises unless
+  // intentionally appended to the list.
+  // NOTE: This only does so for ConstantData, it could be extended
+  // to ConstantExpr's with further effort, however, they should
+  // largely be folded when they get here. Extending it to runtime
+  // defined/read+writeable allocation sizes would be non-trivial
+  // (need to factor in movement of any stores to variables the
+  // allocation size depends on, as well as the usual loads,
+  // otherwise it'll yield the wrong result after movement) and
+  // likely be more suitable as an LLVM optimisation pass.
+  for (Function *F : ConstantAllocaRaiseCandidates)
+    raiseUserConstantDataAllocasToEntryBlock(Builder, F);
+
   EmitMetadataErrorReportFunctionTy &&ErrorReportFn =
       [](EmitMetadataErrorKind Kind,
          const TargetRegionEntryInfo &EntryInfo) -> void {
@@ -5033,6 +5079,12 @@ static Function *createOutlinedFunction(
 
   BasicBlock *UserCodeEntryBB = Builder.GetInsertBlock();
 
+  // As we embed the user code in the middle of our target region after we
+  // generate entry code, we must move what allocas we can into the entry
+  // block to avoid possible breaking optimisations for device
+  if (OMPBuilder.Config.isTargetDevice())
+    OMPBuilder.addConstantAllocaRaiseCandidates(Func);
+
   // Insert target deinit call in the device compilation pass.
   Builder.restoreIP(CBFunc(Builder.saveIP(), Builder.saveIP()));
   if (OMPBuilder.Config.isTargetDevice())
diff --git a/mlir/test/Target/LLVMIR/omptarget-constant-alloca-raise.mlir b/mlir/test/Target/LLVMIR/omptarget-constant-alloca-raise.mlir
new file mode 100644
index 000000000000000..be521fbe1f01a6d
--- /dev/null
+++ b/mlir/test/Target/LLVMIR/omptarget-constant-alloca-raise.mlir
@@ -0,0 +1,43 @@
+// RUN: mlir-translate -mlir-to-llvmir %s | FileCheck %s
+
+// A small condensed version of a problem requiring constant alloca raising in
+// Target Region Entries for user injected code, found in an issue in the Flang
+// compiler. Certain LLVM IR optimisation passes will perform runtime breaking 
+// transformations on allocations not found to be in the entry block, current
+// OpenMP dialect lowering of TargetOp's will inject user allocations after
+// compiler generated entry code, in a separate block, this test checks that
+// a small function which attempts to raise some of these (specifically 
+// constant sized) allocations performs its task reasonably in these 
+// scenarios. 
+
+module attributes {omp.is_target_device = true} {
+  llvm.func @_QQmain() attributes {omp.declare_target = #omp.declaretarget<device_type = (host), capture_clause = (to)>} {
+    %1 = llvm.mlir.constant(1 : i64) : i64
+    %2 = llvm.alloca %1 x !llvm.struct<(ptr)> : (i64) -> !llvm.ptr
+    %3 = omp.map_info var_ptr(%2 : !llvm.ptr, !llvm.struct<(ptr)>) map_clauses(tofrom) capture(ByRef) -> !llvm.ptr
+    omp.target map_entries(%3 -> %arg0 : !llvm.ptr) {
+    ^bb0(%arg0: !llvm.ptr):
+      %4 = llvm.mlir.constant(1 : i32) : i32
+      %5 = llvm.alloca %4 x !llvm.struct<(ptr)> {alignment = 8 : i64} : (i32) -> !llvm.ptr
+      %6 = llvm.mlir.constant(50 : i32) : i32
+      %7 = llvm.mlir.constant(1 : i64) : i64
+      %8 = llvm.alloca %7 x i32 : (i64) -> !llvm.ptr
+      llvm.store %6, %8 : i32, !llvm.ptr
+      %9 = llvm.mlir.undef : !llvm.struct<(ptr)>
+      %10 = llvm.insertvalue %8, %9[0] : !llvm.struct<(ptr)> 
+      llvm.store %10, %5 : !llvm.struct<(ptr)>, !llvm.ptr
+      %88 = llvm.call @_ExternalCall(%arg0, %5) : (!llvm.ptr, !llvm.ptr) -> !llvm.struct<()>
+      omp.terminator
+    }
+    llvm.return
+  }
+  llvm.func @_ExternalCall(!llvm.ptr, !llvm.ptr) -> !llvm.struct<()>
+}
+
+// CHECK:      define weak_odr protected void @{{.*}}QQmain_l{{.*}}({{.*}}, {{.*}}) {
+// CHECK-NEXT: entry:
+// CHECK-NEXT:  %[[MOVED_ALLOCA1:.*]] = alloca { ptr }, align 8
+// CHECK-NEXT:  %[[MOVED_ALLOCA2:.*]] = alloca i32, i64 1, align 4
+// CHECK-NEXT:  %[[MAP_ARG_ALLOCA:.*]] = alloca ptr, align 8
+
+// CHECK: user_code.entry:                                  ; preds = %entry

@kiranchandramohan
Contributor

Could you share the HLFIR IR that needs this change?

@agozillon
Contributor Author

Could you share the HLFIR IR that needs this change?

Sure, here you go:

module attributes {dlti.dl_spec = #dlti.dl_spec<#dlti.dl_entry<i1, dense<8> : vector<2xi64>>, #dlti.dl_entry<!llvm.ptr, dense<64> : vector<4xi64>>, #dlti.dl_entry<f16, dense<16> : vector<2xi64>>, #dlti.dl_entry<i32, dense<32> : vector<2xi64>>, #dlti.dl_entry<i16, dense<16> : vector<2xi64>>, #dlti.dl_entry<i8, dense<8> : vector<2xi64>>, #dlti.dl_entry<!llvm.ptr<272>, dense<64> : vector<4xi64>>, #dlti.dl_entry<!llvm.ptr<271>, dense<32> : vector<4xi64>>, #dlti.dl_entry<!llvm.ptr<270>, dense<32> : vector<4xi64>>, #dlti.dl_entry<f128, dense<128> : vector<2xi64>>, #dlti.dl_entry<f64, dense<64> : vector<2xi64>>, #dlti.dl_entry<f80, dense<128> : vector<2xi64>>, #dlti.dl_entry<i128, dense<128> : vector<2xi64>>, #dlti.dl_entry<i64, dense<64> : vector<2xi64>>, #dlti.dl_entry<"dlti.stack_alignment", 128 : i64>, #dlti.dl_entry<"dlti.endianness", "little">>, fir.defaultkind = "a1c4d8i4l4r4", fir.kindmap = "", llvm.data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", llvm.target_triple = "x86_64-unknown-linux-gnu", omp.is_gpu = false, omp.is_target_device = false, omp.requires = #omp<clause_requires none>, omp.version = #omp.version<version = 11>} {
  func.func @_QQmain() attributes {fir.bindc_name = "main"} {
    %0 = fir.alloca !fir.box<!fir.heap<i32>> {bindc_name = "test", uniq_name = "_QFEtest"}
    %1 = fir.zero_bits !fir.heap<i32>
    %2 = fir.embox %1 : (!fir.heap<i32>) -> !fir.box<!fir.heap<i32>>
    fir.store %2 to %0 : !fir.ref<!fir.box<!fir.heap<i32>>>
    %3:2 = hlfir.declare %0 {fortran_attrs = #fir.var_attrs<allocatable>, uniq_name = "_QFEtest"} : (!fir.ref<!fir.box<!fir.heap<i32>>>) -> (!fir.ref<!fir.box<!fir.heap<i32>>>, !fir.ref<!fir.box<!fir.heap<i32>>>)
    %4 = fir.allocmem i32 {fir.must_be_heap = true, uniq_name = "_QFEtest.alloc"}
    %5 = fir.embox %4 : (!fir.heap<i32>) -> !fir.box<!fir.heap<i32>>
    fir.store %5 to %3#1 : !fir.ref<!fir.box<!fir.heap<i32>>>
    %6 = fir.load %3#1 : !fir.ref<!fir.box<!fir.heap<i32>>>
    %7 = fir.box_offset %3#1 base_addr : (!fir.ref<!fir.box<!fir.heap<i32>>>) -> !fir.llvm_ptr<!fir.ref<i32>>
    %8 = fir.box_addr %6 : (!fir.box<!fir.heap<i32>>) -> !fir.heap<i32>
    %9 = omp.map_info var_ptr(%8 : !fir.heap<i32>, i32) var_ptr_ptr(%7 : !fir.llvm_ptr<!fir.ref<i32>>) map_clauses(tofrom) capture(ByRef) -> !fir.heap<i32> {name = "test"}
    %10 = omp.map_info var_ptr(%3#1 : !fir.ref<!fir.box<!fir.heap<i32>>>, !fir.box<!fir.heap<i32>>) map_clauses(tofrom) capture(ByRef) members(%9 : !fir.heap<i32>) -> !fir.ref<!fir.box<!fir.heap<i32>>> {name = "test"}
    omp.target map_entries(%10 -> %arg0 : !fir.ref<!fir.box<!fir.heap<i32>>>) {
    ^bb0(%arg0: !fir.ref<!fir.box<!fir.heap<i32>>>):
      %20:2 = hlfir.declare %arg0 {fortran_attrs = #fir.var_attrs<allocatable>, uniq_name = "_QFEtest"} : (!fir.ref<!fir.box<!fir.heap<i32>>>) -> (!fir.ref<!fir.box<!fir.heap<i32>>>, !fir.ref<!fir.box<!fir.heap<i32>>>)
      %c50_i32 = arith.constant 50 : i32
      hlfir.assign %c50_i32 to %20#0 realloc : i32, !fir.ref<!fir.box<!fir.heap<i32>>>
      omp.terminator
    }
    %c6_i32 = arith.constant 6 : i32
    %11 = fir.address_of(@_QQclXf0a9a187466a9dd5699bdae5c0b697a3) : !fir.ref<!fir.char<1,81>>
    %12 = fir.convert %11 : (!fir.ref<!fir.char<1,81>>) -> !fir.ref<i8>
    %c9_i32 = arith.constant 9 : i32
    %13 = fir.call @_FortranAioBeginExternalListOutput(%c6_i32, %12, %c9_i32) fastmath<contract> : (i32, !fir.ref<i8>, i32) -> !fir.ref<i8>
    %14 = fir.load %3#0 : !fir.ref<!fir.box<!fir.heap<i32>>>
    %15 = fir.box_addr %14 : (!fir.box<!fir.heap<i32>>) -> !fir.heap<i32>
    %16 = fir.load %15 : !fir.heap<i32>
    %c2_i32 = arith.constant 2 : i32
    %17 = arith.addi %16, %c2_i32 : i32
    %18 = fir.call @_FortranAioOutputInteger32(%13, %17) fastmath<contract> : (!fir.ref<i8>, i32) -> i1
    %19 = fir.call @_FortranAioEndIoStatement(%13) fastmath<contract> : (!fir.ref<i8>) -> i32
    return
  }
  fir.global linkonce @_QQclXf0a9a187466a9dd5699bdae5c0b697a3 constant : !fir.char<1,81> {
    %0 = fir.string_lit "/home/agozillo/git/flang-dev/work-dir/declare-target-map/single-value-alloca.f90\00"(81) : !fir.char<1,81>
    fir.has_value %0 : !fir.char<1,81>
  }
  func.func private @_FortranAioBeginExternalListOutput(i32, !fir.ref<i8>, i32) -> !fir.ref<i8> attributes {fir.io, fir.runtime}
  func.func private @_FortranAioOutputInteger32(!fir.ref<i8>, i32) -> i1 attributes {fir.io, fir.runtime}
  func.func private @_FortranAioEndIoStatement(!fir.ref<i8>) -> i32 attributes {fir.io, fir.runtime}
  fir.global @_QQEnvironmentDefaults constant : !fir.ref<tuple<i32, !fir.ref<!fir.array<0xtuple<!fir.ref<i8>, !fir.ref<i8>>>>>> {
    %0 = fir.zero_bits !fir.ref<tuple<i32, !fir.ref<!fir.array<0xtuple<!fir.ref<i8>, !fir.ref<i8>>>>>>
    fir.has_value %0 : !fir.ref<tuple<i32, !fir.ref<!fir.array<0xtuple<!fir.ref<i8>, !fir.ref<i8>>>>>>
  }
}

The relevant location is inside the TargetOp; the rest is largely irrelevant for the device but included for completeness. From my understanding, the HLFIR AssignOp will generate an allocation as part of its lowering (in this particular use case at least; there are different lowerings for it). That said, I wouldn't say the problem lies with the HLFIR operation necessarily; it is more a side effect of how we lower a target region to LLVM-IR (maybe there's a way to make it more AMDGPU-runtime friendly, or a different runtime function for AMDGPU, but I'm doubtful that's the ideal solution). Even if we make sure to raise all AllocaOps to the top (this is done currently when we lower to the LLVM dialect, but TargetOp doesn't generate new blocks, so there is no isolated entry block as such), the problem would still persist: we inevitably embed the user code into a separate block, in between some kernel entry code for the arguments, which branches off either to a fail condition or to the user code block containing the allocations, and then a later pass (not sure which yet, unfortunately) will try to do some magic and end up breaking the generated executable.

@agozillon
Contributor Author

Probably worth noting: this issue doesn't exist for the current variations of target parallel, from what I've found, likely because the CodeExtractor magics the user code into a separate function with an appropriate entry block (my knowledge of the lowering of parallel is still a WIP, so please take it with a grain of salt, and anyone more familiar should feel free to override my statement).

@agozillon
Contributor Author

Small ping for some reviewer attention next week if any time can be spared

@agozillon
Contributor Author

Another small ping for reviewer attention if at all possible please.

Member

@TIFitis TIFitis left a comment


Overall this seems fine to me. However, I'm not familiar with when and why this is required, so please wait for other reviewers as well.

Also, should we have a test for this in the OMPIRBuilderTest?

for (auto Inst = Block->getReverseIterator()->begin();
     Inst != Block->getReverseIterator()->end();) {
  if (auto *AllocaInst =
          llvm::dyn_cast_if_present<llvm::AllocaInst>(Inst)) {
Member


Nit: Remove llvm::

Comment on lines 641 to 642
// Loop over blocks looking for allocas, skip the entry block allocas here are
// in the appropriate place.
Member


The wording of this comment is a little off. Can you please rephrase it.

/// Add a function that's constant allocas will attempt to be raised on a
/// call of finalize after all currently enqueued outline info's have been
/// processed.
void addConstantAllocaRaiseCandidates(Function *F) {
Member


Does this really need to be a separate function?

Contributor Author


I don't think so, if we don't wish it to be; it was primarily to mimic the current style we use for addOutlineInfo, if I recall correctly. Happy to change it!

@agozillon
Contributor Author

Thank you very much for the review, @TIFitis. I'll update this at the beginning of next week, and I'll hold off until we get a second reviewer who can sign off.

@agozillon agozillon force-pushed the alloca-raise-fix-attempt-main branch from 5d93304 to f568ba2 Compare February 5, 2024 14:50
@agozillon
Contributor Author

Updated the PR based on the recent comments (commit) and rebased it on recent upstream changes to make sure it remains functional.

Changes made were:

  • Fixed the badly written comment
  • Removed the unnecessary function
  • Removed the unrequired llvm:: prefix

Still looking for another reviewer for this PR if possible or further discussion.

@agozillon
Contributor Author

It slipped my mind that @TIFitis asked for an OpenMPIRBuilderTest, so in the latest commit I have added one testing the alloca raising that this PR introduces!


github-actions bot commented Feb 6, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@agozillon agozillon force-pushed the alloca-raise-fix-attempt-main branch from b2c79ef to f076f69 Compare February 6, 2024 15:17
@kiranchandramohan
Copy link
Contributor

The relevant location is inside of the TargetOp, the rest is largely irrelevant for device but included for completeness. The HLFIR AssignOp will generate an allocation as part of it's lowering from my understanding (in this particular use case at least, there's different lowering's for it). Although, I wouldn't say the problem lies with the HLFIR operation necessarily, more just a side affect of how we lower to a target region for LLVM-IR being shown I think (maybe there's a way to make it more AMDGPU runtime friendly, or a different runtime function for AMDGPU, but I'm doubtful that's the ideal solution). Even if we make sure to raise all AllocaOp's (when we lower to the LLVM Dialect this is done currently, but TargetOp doesn't generate a new Blocks, so there is no isolated Entry Block as such) to the top this would still persist, as we inevitably embed the user code into a seperate block in-between some kernel entry code for the arguments that will branch off to a fail condition or the user code block containing the allocations, and then a later pass (not sure which yet unfortunately) will try to do some magic and end up breaking the generated executable.

Is your observation that the OpenMPIRBuilder creates a separate alloca block, and that this alloca block becomes the entry block in LLVM IR? So even if we move all the allocas to the entry block in FIR/HLFIR/LLVM dialect MLIR, it still won't be the entry block in LLVM IR?

Is it possible to merge the alloca block into the succeeding block (which is hopefully the entry block at the MLIR dialect level)? And if we have lifted everything to the entry block in MLIR, will this all work out fine?

@agozillon
Contributor Author

agozillon commented Feb 6, 2024

The relevant location is inside of the TargetOp, the rest is largely irrelevant for device but included for completeness. The HLFIR AssignOp will generate an allocation as part of it's lowering from my understanding (in this particular use case at least, there's different lowering's for it). Although, I wouldn't say the problem lies with the HLFIR operation necessarily, more just a side affect of how we lower to a target region for LLVM-IR being shown I think (maybe there's a way to make it more AMDGPU runtime friendly, or a different runtime function for AMDGPU, but I'm doubtful that's the ideal solution). Even if we make sure to raise all AllocaOp's (when we lower to the LLVM Dialect this is done currently, but TargetOp doesn't generate a new Blocks, so there is no isolated Entry Block as such) to the top this would still persist, as we inevitably embed the user code into a seperate block in-between some kernel entry code for the arguments that will branch off to a fail condition or the user code block containing the allocations, and then a later pass (not sure which yet unfortunately) will try to do some magic and end up breaking the generated executable.

Is your observation that the OpenMPIRBuilder creates a separate Alloca block and this Alloca block will become the entry block in LLVM IR. So even if we move all the allocas to the entry block in FIR/HLFIR/LLVM dialect MLIR, it still won't be the entry block in LLVM IR?

That's correct, unfortunately. With our current lowering method (which I believe is just how we traditionally generate the kernels, from what I can gather from code comments), we generate an initial entry point that does some load-in/setup of kernel input, and then performs a libomptarget runtime invocation for initialisation (__kmpc_target_init), before branching into either a fail condition or the user entry code, which is effectively the lowered MLIR code in the region. So any allocas in the region will currently still be emitted outside of the alloca block.
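To make the shape concrete, the emitted kernel looks roughly like this. This is a hand-written sketch, not actual compiler output; the kernel name, block labels, and the elided __kmpc_target_init arguments are all illustrative:

```llvm
define weak_odr protected void @__omp_offloading_QQmain_kernel(ptr %arg) {
entry:                              ; compiler-generated argument load-in/setup
  %arg.addr = alloca ptr, align 8
  store ptr %arg, ptr %arg.addr, align 8
  %exec = call i32 @__kmpc_target_init(...)
  %exec_user_code = icmp eq i32 %exec, -1
  br i1 %exec_user_code, label %user_code.entry, label %worker.exit

user_code.entry:                    ; the lowered MLIR region begins here, so
  %desc = alloca { ptr }, align 8   ; user allocas land in this block, not in
  ...                               ; the function's entry block, until raised
  ret void

worker.exit:                        ; fail/early-exit path
  ret void
}
```

This is why hoisting allocas to the top of the TargetOp region at the MLIR level is not enough: they would still end up in user_code.entry rather than in the LLVM-IR entry block.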

Is it possible to merge the Alloca block to the succeeding block (which is hopefully the entry block at the MLIR dialects)? And if we have lifted in MLIR all to the entry block this will all work out fine?

I thought about doing this, but there are a few issues (some may be simpler to overcome than I think, or maybe my thinking is just off) that might get in the way:

  1. I don't think we can just merge the MLIR-generated block into the front of the newly generated OpenMPIRBuilder target block, as we could end up with out-of-sync operations if anything in the MLIR-generated block depends on any of the input arguments.
  2. The inverse, merging at the bottom of the block, becomes a bit of an issue due to the kmpc_target_init call feeding into a branch, as we can't meld the block directly behind the branch.
  3. We could maybe splice it into the OpenMPIRBuilder-generated entry block, above the call to kmpc_target_init, with the branch below the argument entry code. However, we do not really have the concept of a strict alloca block for TargetOp (or any of the OpenMP region constructs, from what I can tell); we simply prepend allocas to the top of the target region, and this primarily seems to be enforced in the FIR -> LLVM dialect lowering, while the HLFIR/FIR CodeGen seems less strict about it. Enforcing a specific block at the FIR -> LLVM dialect lowering level didn't seem particularly hard when I tried it (but I didn't test it extensively). However, as I recall, after doing this I found that it's not so easy to reliably find this alloca block when we're trying to merge, as we can't guarantee that the first generated LLVM-IR block is the alloca block (at least from my limited testing). So we'd need some way to identify the block for merging.

We also need to be a tad careful when doing this, as we have the nested parallel operations to consider, which also do their own alloca movements and extraction; that's why I opted to defer the raising to the finalize method and restrict it solely to the target kernel entry function.

It's been a little bit since I looked into this in detail now, so hopefully I am recollecting the above well enough! I am happy to look into it again to recollect more details if that's something we'd be interested in doing.

@agozillon
Contributor Author

Just a small ping to check if you're happy with the fix as is @kiranchandramohan or if you'd prefer me to dig into the block merge/splice method to see if it's feasible?

@kiranchandramohan
Contributor

> Just a small ping to check if you're happy with the fix as is @kiranchandramohan or if you'd prefer me to dig into the block merge/splice method to see if it's feasible?

No need to pursue this.

I have a couple of questions inline. Also, would this be better as an LLVM pass?

Comment on lines +20 to +29
```mlir
%4 = llvm.mlir.constant(1 : i32) : i32
%5 = llvm.alloca %4 x !llvm.struct<(ptr)> {alignment = 8 : i64} : (i32) -> !llvm.ptr
%6 = llvm.mlir.constant(50 : i32) : i32
%7 = llvm.mlir.constant(1 : i64) : i64
%8 = llvm.alloca %7 x i32 : (i64) -> !llvm.ptr
llvm.store %6, %8 : i32, !llvm.ptr
%9 = llvm.mlir.undef : !llvm.struct<(ptr)>
%10 = llvm.insertvalue %8, %9[0] : !llvm.struct<(ptr)>
llvm.store %10, %5 : !llvm.struct<(ptr)>, !llvm.ptr
%88 = llvm.call @_ExternalCall(%arg0, %5) : (!llvm.ptr, !llvm.ptr) -> !llvm.struct<()>
```
Contributor

If there is a loop around this block, would this still move the alloca to the entry block? If so, is that still semantically equivalent?

Contributor Author

Thank you for the great question @kiranchandramohan, and I encourage anyone else who has thoughts on possible scenarios this might break to bring them up as well. My LLVM knowledge is still very much a WIP.

In this case it would move the alloca to the entry block whether or not it's in a loop-related series of blocks. From an MLIR perspective, I don't currently think it's possible to have a series of loop blocks around the initial basic block of TargetOp (but I may be wrong on this and would love to know a scenario where this is possible, so I can test it), although it is possible to have blocks chained from it relating to loops as normal. I believe even in the case of omp.parallel/omp.wsloop it's internal to the initial basic block, and this code actively avoids manipulating that kind of nested region: the entirety of the parallel do should be outlined into a separate function by the time we try to raise allocations. The raising aims to happen only inside the kernel entry function (TargetOp) currently; the parallel do has its own allocation raising/sinking that occurs via the CodeExtractor, which we need to avoid tampering with (as that would break it).

From a Fortran perspective (as someone who is still very bad at writing Fortran), I am not sure a user can inject an alloca directly into the TargetOp; from my understanding you can only declare a variable at the top of a function or in a module. So you can't quite do what you can in C++, where, for example, you can declare an integer directly in the OpenMP target region, which will become an alloca.

However, as we do still need to consider this being used in C/C++, where it is possible, let's look at the C/C++ examples below, where I've tried to come up with some breaking scenarios (perhaps my logic isn't quite sound and there are better examples that would cause breakage, in which case please do bring them up!):

Version 1

```c
int test = 0;
#pragma omp target map(tofrom : test)
{
  for (int i = 0; i < 10; i++) {
    int j = 10;
    j += 10;
    test += j;
  }
}
```

Version 2

```c
int test = 0;
#pragma omp target map(tofrom : test)
{
  int j = 10;
  for (int i = 0; i < 10; i++) {
    j += 10;
    test += j;
  }
}
```

In both cases, when we emit the device IR, the alloca for j is moved to the entry block alongside i, and the correct results are achieved, so I believe it is still semantically equivalent. Without the initialization of j to 10 we get the same UB in both cases as well.

The main case where I can see it being an issue (and there may be others, which I would love to hear about if someone can think of them, so I can investigate) is if we have an allocation whose size depends on some loop-dependent variable, making it a dynamic allocation. However, this PR doesn't tackle that case (and doesn't handle constant expressions either; it needs them folded into a constant value currently); it'd be a lot more complex to deal with, and Flang hasn't presented a case where we need to handle that scenario yet (hopefully never). If we get to the point of requiring it, I think it'd require a significantly more complex solution.

However, disclaimer again: I am not the most LLVM-savvy person yet, so anyone reading, please do feel free to mention cases where this may be an issue, or point out if any of the above is incorrect.

Contributor

> From a Fortran perspective (as someone who is still very bad at writing Fortran) I am not sure a user can inject an alloca directly into the TargetOp, from my understanding you can only declare a variable at the top of a function or in a module. So you can't quite do what you can in C++ where as an example you declare an integer directly in the OpenMP target region which will become an alloca.

There is a block construct in Fortran 2008 that can be used for this.

Also, we have passes that can convert Allocs to Allocas.

```cpp
class AllocMemConversion : public mlir::OpRewritePattern<fir::AllocMemOp> {
```

Contributor

Just pinging @jdoerfert if he has any opinion here.

Contributor Author

> From a Fortran perspective (as someone who is still very bad at writing Fortran) I am not sure a user can inject an alloca directly into the TargetOp, from my understanding you can only declare a variable at the top of a function or in a module. So you can't quite do what you can in C++ where as an example you declare an integer directly in the OpenMP target region which will become an alloca.

> There is a block construct in Fortran 2008 that can be used for this.

Thank you Kiran! I can give this a shot then (provided it currently works in a target region; I haven't tested it myself directly, so I am a little interested in that). Sorry that my Fortran knowledge is rather lacking in areas, and thank you for introducing me to new stuff!

> Also, we have passes that can convert Allocs to Allocas.

```cpp
class AllocMemConversion : public mlir::OpRewritePattern<fir::AllocMemOp> {
```

I'll have a look into the pass; it could be quite useful in the future. I did notice it when I was looking around for other possible solutions and it caught my eye, but I wasn't quite sure we needed to look into an extensive solution just yet. While I don't think the pass would solve the issue (perhaps I am incorrect in my thinking, however), it would be a good basis for creating a pass that might.

However, I think if we wanted to do this as a pass at the MLIR level, we'd have to enforce an entry block for target regions, and at that point a pass may be unnecessary: during the lowering to the LLVM dialect I think we already try to move all allocas into the target (or any) entry block, and if we make sure we have, or generate, an explicit entry block at that point, all allocas should hopefully end up naturally in the entry block, at least where possible.
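To illustrate what an enforced entry block might look like (hypothetical MLIR, not the actual lowering output; the operation names and structure are assumptions for illustration):

```mlir
// Illustrative only: an omp.target region whose first block acts as a
// dedicated alloca/entry block, with the user code split into a successor.
omp.target {
^entry:
  %c1 = llvm.mlir.constant(1 : i64) : i64
  %tmp = llvm.alloca %c1 x i32 : (i64) -> !llvm.ptr  // constant-sized allocas
  llvm.br ^body                                      // all raised here
^body:
  // lowered user code, free of constant-sized allocas
  omp.terminator
}
```

With such a block guaranteed to exist, the LLVM-IR translation could splice it directly into the OpenMPIRBuilder-generated entry block instead of needing a post-hoc raise.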

Contributor Author

@agozillon agozillon Feb 20, 2024

I made the following small test case using blocks; it makes some attempt to generate the same behavior as the C++ test case above, with some extras on top:

```fortran
PROGRAM main
    integer, allocatable :: a
    allocate(a)
    a = 5

!$omp target map(tofrom:a)
   sub_call : BLOCK
      INTEGER :: b
      b = a + 2
      CALL Sub(b)

      do i = 1, 10
        BLOCK
           integer :: j
           j = 10
           j = j + 10
           a = a + j
        END BLOCK
        a = a + b
     end do
    END BLOCK sub_call
!$omp end target

print *, a
END PROGRAM main

SUBROUTINE Sub(B)
  INTEGER :: b
  b = b * b
END SUBROUTINE Sub
```

It yields the same answer for both target and regular Fortran host execution (as does moving j outside and into the first block). What is interesting is that when forcing undefined behavior by removing the initial assignment to j, the regular Fortran host code will print a garbage value, whereas the target will return a consistent value, effectively not emitting the addition of j to a in the loop. This happens with or without the alloca raise pass here, so it seems to be a separate inconsistency (and in any case it is undefined behavior, so we're free to do as we wish, I'd imagine).

I did, however, have to use the deprecated FIR flow via -flang-deprecated-no-hlfir (which still uses the alloca raise, so it would still reproduce the relevant inconsistencies), as this little test has unfortunately revealed another compiler crash for allocatables, very similar to the one this PR fixes (it crashes with or without this PR, sadly). So that will be fun to look into; I am unsure whether it's a related or entirely new issue at this point (I will look into it more in the near future).

Comment on lines +5091 to +5095
```cpp
// As we embed the user code in the middle of our target region after we
// generate entry code, we must move what allocas we can into the entry
// block to avoid possible breaking optimisations for device
if (OMPBuilder.Config.isTargetDevice())
  OMPBuilder.ConstantAllocaRaiseCandidates.emplace_back(Func);
```
Contributor

I assume this code is going to run for all device targets. I think you said this is something specific to AMD GPU backends. Do you know whether such transformations apply to other targets as well?

Contributor Author

I can unfortunately only speak to AMD GPU backends, as that's all I've tested the Fortran offload examples I work with on; I don't know enough about the other backends (e.g. NVIDIA) to really comment on them.

I can see about restricting it to the AMD target triple if that's something you'd like me to do to be cautious. I'm not sure how complicated it'd be to do so here, but I imagine it wouldn't be very.

I believe someone did manage to get some of the offloading working for NVIDIA GPUs a while ago, if I recall correctly, but I am not sure how easy that was for them, so I could in theory test that if we'd like.

Contributor

We do not have target specific code here at the moment. So no need to check this here.

@agozillon
Contributor Author

> Just a small ping to check if you're happy with the fix as is @kiranchandramohan or if you'd prefer me to dig into the block merge/splice method to see if it's feasible?

> No need to pursue this.

Sounds good, if it ever becomes necessary to pursue the other option I am more than happy to do so.

> I have a couple of questions inline. Also, would this be better as an LLVM pass?

I would like to avoid it being an LLVM pass for now if at all possible. It's a rather small/trivial piece of functionality that only gets performed on the target/kernel entry function at the moment, and I'd rather other people not start running it (or at least not be able to run it) outside of the intended use case for now. The easier it is to revert, the better, for the time being, in case we have to look at other options. However, if it expands to more complex use cases (e.g. dynamic allocations or non-foldable constants, although if we have to cover those cases I'd lean towards the entry block splicing or an alternative), then it should certainly be moved to an LLVM pass.

Contributor

@kiranchandramohan kiranchandramohan left a comment

I think constant allocas are fine to be hoisted. But as mentioned elsewhere there could be other cases where allocas might cause problems and might require a generic solution.

If there are no concerns expressed in a day or two you can consider submitting this patch and generalize this later on.

@agozillon
Contributor Author

Just a little forewarning: I will land this PR tomorrow, so if anyone has any concerns, now would be the time to bring them forward! We can always revert the PR if there are any afterwards (or any unexpected issues), but bringing them up now would be ideal :-)

@agozillon agozillon merged commit dcf4ca5 into llvm:main Feb 23, 2024
3 of 4 checks passed
@agozillon
Contributor Author

Landed this PR now. If anyone encounters any unexpected behaviour relating to allocas being raised by this PR, please drop a message here or send me a message, and we can revert the PR or make the necessary changes.
