[OpenMP] Remove dependency on `libffi` from offloading runtime #91264

jhuber6 · 2024-05-06T19:58:08Z

Summary:
This patch attempts to remove the dependency on libffi by instead
emitting the host / CPU kernels using an aggregate struct made from the
captured context. This callows us to have a fixed function prototype we
can call directly rather than requiring an extra library to decode the
ABI to call a function with N (non variadic) arguments.

Fixes #88738

llvmbot · 2024-05-06T19:58:42Z

@llvm/pr-subscribers-offload
@llvm/pr-subscribers-clang

@llvm/pr-subscribers-clang-codegen

Author: Joseph Huber (jhuber6)

Changes

Summary:
This patch attempts to remove the dependency on libffi by instead
emitting the host / CPU kernels using an aggregate struct made from the
captured context. This callows us to have a fixed function prototype we
can call directly rather than requiring an extra library to decode the
ABI to call a function with N (non variadic) arguments.

NOTE:
This currently fails for tests using a non-constant value for
num_teams on the CPU. It seems that these use a method called
CGF.EmitScalarExpr(NumTeams) which doesn't seem to work correctly
with the created aggregate struct.

Patch is 1.07 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/91264.diff

38 Files Affected:

(modified) clang/lib/CodeGen/CGOpenMPRuntime.cpp (+6-2)
(modified) clang/lib/CodeGen/CGStmtOpenMP.cpp (+126)
(modified) clang/lib/CodeGen/CodeGenFunction.h (+3)
(modified) clang/test/OpenMP/declare_target_codegen.cpp (+3-3)
(modified) clang/test/OpenMP/declare_target_link_codegen.cpp (+1-1)
(modified) clang/test/OpenMP/distribute_codegen.cpp (+90-76)
(modified) clang/test/OpenMP/distribute_simd_codegen.cpp (+196-160)
(modified) clang/test/OpenMP/openmp_offload_codegen.cpp (+1-1)
(modified) clang/test/OpenMP/target_firstprivate_codegen.cpp (+704-644)
(modified) clang/test/OpenMP/target_ompx_dyn_cgroup_mem_codegen.cpp (+170-102)
(modified) clang/test/OpenMP/target_parallel_codegen.cpp (+264-210)
(modified) clang/test/OpenMP/target_parallel_for_codegen.cpp (+306-240)
(modified) clang/test/OpenMP/target_parallel_for_simd_codegen.cpp (+638-498)
(modified) clang/test/OpenMP/target_parallel_generic_loop_codegen-2.cpp (+48-28)
(modified) clang/test/OpenMP/target_parallel_if_codegen.cpp (+178-106)
(modified) clang/test/OpenMP/target_parallel_num_threads_codegen.cpp (+154-94)
(modified) clang/test/OpenMP/target_private_codegen.cpp (+361-116)
(modified) clang/test/OpenMP/target_reduction_codegen.cpp (+290-106)
(modified) clang/test/OpenMP/target_task_affinity_codegen.cpp (+72-70)
(modified) clang/test/OpenMP/target_teams_codegen.cpp (+402-298)
(modified) clang/test/OpenMP/target_teams_distribute_codegen.cpp (+330-256)
(modified) clang/test/OpenMP/target_teams_distribute_parallel_for_codegen.cpp (+82-60)
(modified) clang/test/OpenMP/target_teams_distribute_parallel_for_firstprivate_codegen.cpp (+160-139)
(modified) clang/test/OpenMP/target_teams_distribute_parallel_for_private_codegen.cpp (+56-41)
(modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_codegen.cpp (+110-80)
(modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_firstprivate_codegen.cpp (+160-139)
(modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_private_codegen.cpp (+56-41)
(modified) clang/test/OpenMP/target_teams_distribute_simd_codegen.cpp (+652-504)
(modified) clang/test/OpenMP/target_teams_generic_loop_codegen-1.cpp (+82-60)
(modified) clang/test/OpenMP/target_teams_generic_loop_private_codegen.cpp (+40-25)
(modified) clang/test/OpenMP/target_teams_map_codegen.cpp (+170-142)
(modified) clang/test/OpenMP/target_teams_num_teams_codegen.cpp (+154-94)
(modified) clang/test/OpenMP/target_teams_thread_limit_codegen.cpp (+164-100)
(modified) clang/test/OpenMP/teams_codegen.cpp (+104-60)
(modified) offload/plugins-nextgen/host/CMakeLists.txt (-13)
(removed) offload/plugins-nextgen/host/dynamic_ffi/ffi.cpp (-75)
(removed) offload/plugins-nextgen/host/dynamic_ffi/ffi.h (-78)
(modified) offload/plugins-nextgen/host/src/rtl.cpp (+23-18)

diff --git a/clang/lib/CodeGen/CGOpenMPRuntime.cpp b/clang/lib/CodeGen/CGOpenMPRuntime.cpp
index e39c7c58d2780e..3cd4bcff2f5852 100644
--- a/clang/lib/CodeGen/CGOpenMPRuntime.cpp
+++ b/clang/lib/CodeGen/CGOpenMPRuntime.cpp
@@ -5932,12 +5932,16 @@ void CGOpenMPRuntime::emitTargetOutlinedFunctionHelper(
 
   CodeGenFunction CGF(CGM, true);
   llvm::OpenMPIRBuilder::FunctionGenCallback &&GenerateOutlinedFunction =
-      [&CGF, &D, &CodeGen](StringRef EntryFnName) {
+      [&CGF, &D, &CodeGen, this](StringRef EntryFnName) {
         const CapturedStmt &CS = *D.getCapturedStmt(OMPD_target);
 
         CGOpenMPTargetRegionInfo CGInfo(CS, CodeGen, EntryFnName);
         CodeGenFunction::CGCapturedStmtRAII CapInfoRAII(CGF, &CGInfo);
-        return CGF.GenerateOpenMPCapturedStmtFunction(CS, D.getBeginLoc());
+        if (CGM.getLangOpts().OpenMPIsTargetDevice && !isGPU())
+          return CGF.GenerateOpenMPCapturedStmtFunctionAggregate(
+              CS, D.getBeginLoc());
+        else
+          return CGF.GenerateOpenMPCapturedStmtFunction(CS, D.getBeginLoc());
       };
 
   OMPBuilder.emitTargetRegionFunction(EntryInfo, GenerateOutlinedFunction,
diff --git a/clang/lib/CodeGen/CGStmtOpenMP.cpp b/clang/lib/CodeGen/CGStmtOpenMP.cpp
index ef3aa3a8e0dc61..b9d27815a8ae24 100644
--- a/clang/lib/CodeGen/CGStmtOpenMP.cpp
+++ b/clang/lib/CodeGen/CGStmtOpenMP.cpp
@@ -613,6 +613,102 @@ static llvm::Function *emitOutlinedFunctionPrologue(
   return F;
 }
 
+static llvm::Function *emitOutlinedFunctionPrologueAggregate(
+    CodeGenFunction &CGF, FunctionArgList &Args,
+    llvm::MapVector<const Decl *, std::pair<const VarDecl *, Address>>
+        &LocalAddrs,
+    llvm::DenseMap<const Decl *, std::pair<const Expr *, llvm::Value *>>
+        &VLASizes,
+    llvm::Value *&CXXThisValue, const CapturedStmt &CS, SourceLocation Loc,
+    StringRef FunctionName) {
+  const CapturedDecl *CD = CS.getCapturedDecl();
+  const RecordDecl *RD = CS.getCapturedRecordDecl();
+
+  CXXThisValue = nullptr;
+  // Build the argument list.
+  CodeGenModule &CGM = CGF.CGM;
+  ASTContext &Ctx = CGM.getContext();
+  Args.append(CD->param_begin(), CD->param_end());
+
+  // Create the function declaration.
+  const CGFunctionInfo &FuncInfo =
+      CGM.getTypes().arrangeBuiltinFunctionDeclaration(Ctx.VoidTy, Args);
+  llvm::FunctionType *FuncLLVMTy = CGM.getTypes().GetFunctionType(FuncInfo);
+
+  auto *F =
+      llvm::Function::Create(FuncLLVMTy, llvm::GlobalValue::InternalLinkage,
+                             FunctionName, &CGM.getModule());
+  CGM.SetInternalFunctionAttributes(CD, F, FuncInfo);
+  if (CD->isNothrow())
+    F->setDoesNotThrow();
+  F->setDoesNotRecurse();
+
+  // Generate the function.
+  CGF.StartFunction(CD, Ctx.VoidTy, F, FuncInfo, Args, Loc, Loc);
+  Address ContextAddr = CGF.GetAddrOfLocalVar(CD->getContextParam());
+  llvm::Value *ContextV = CGF.Builder.CreateLoad(ContextAddr);
+  LValue ContextLV = CGF.MakeNaturalAlignAddrLValue(
+      ContextV, CGM.getContext().getTagDeclType(RD));
+  auto I = CS.captures().begin();
+  for (const FieldDecl *FD : RD->fields()) {
+    LValue FieldLV = CGF.EmitLValueForFieldInitialization(ContextLV, FD);
+    // Do not map arguments if we emit function with non-original types.
+    Address LocalAddr = FieldLV.getAddress(CGF);
+    // If we are capturing a pointer by copy we don't need to do anything, just
+    // use the value that we get from the arguments.
+    if (I->capturesVariableByCopy() && FD->getType()->isAnyPointerType()) {
+      const VarDecl *CurVD = I->getCapturedVar();
+      LocalAddrs.insert({FD, {CurVD, LocalAddr}});
+      ++I;
+      continue;
+    }
+
+    LValue ArgLVal =
+        CGF.MakeAddrLValue(LocalAddr, FD->getType(), AlignmentSource::Decl);
+    if (FD->hasCapturedVLAType()) {
+      llvm::Value *ExprArg = CGF.EmitLoadOfScalar(ArgLVal, I->getLocation());
+      const VariableArrayType *VAT = FD->getCapturedVLAType();
+      VLASizes.try_emplace(FD, VAT->getSizeExpr(), ExprArg);
+    } else if (I->capturesVariable()) {
+      const VarDecl *Var = I->getCapturedVar();
+      QualType VarTy = Var->getType();
+      Address ArgAddr = ArgLVal.getAddress(CGF);
+      if (ArgLVal.getType()->isLValueReferenceType()) {
+        ArgAddr = CGF.EmitLoadOfReference(ArgLVal);
+      } else if (!VarTy->isVariablyModifiedType() || !VarTy->isPointerType()) {
+        assert(ArgLVal.getType()->isPointerType());
+        ArgAddr = CGF.EmitLoadOfPointer(
+            ArgAddr, ArgLVal.getType()->castAs<PointerType>());
+      }
+      LocalAddrs.insert(
+          {FD,
+           {Var, Address(ArgAddr.getBasePointer(), ArgAddr.getElementType(),
+                         Ctx.getDeclAlign(Var))}});
+    } else if (I->capturesVariableByCopy()) {
+      assert(!FD->getType()->isAnyPointerType() &&
+             "Not expecting a captured pointer.");
+      const VarDecl *Var = I->getCapturedVar();
+      Address CopyAddr = CGF.CreateMemTemp(FD->getType(), Ctx.getDeclAlign(FD),
+                                           Var->getName());
+      LValue CopyLVal =
+          CGF.MakeAddrLValue(CopyAddr, FD->getType(), AlignmentSource::Decl);
+
+      RValue ArgRVal = CGF.EmitLoadOfLValue(ArgLVal, I->getLocation());
+      CGF.EmitStoreThroughLValue(ArgRVal, CopyLVal);
+
+      LocalAddrs.insert({FD, {Var, CopyAddr}});
+    } else {
+      // If 'this' is captured, load it into CXXThisValue.
+      assert(I->capturesThis());
+      CXXThisValue = CGF.EmitLoadOfScalar(ArgLVal, I->getLocation());
+      LocalAddrs.insert({FD, {nullptr, ArgLVal.getAddress(CGF)}});
+    }
+    ++I;
+  }
+
+  return F;
+}
+
 llvm::Function *
 CodeGenFunction::GenerateOpenMPCapturedStmtFunction(const CapturedStmt &S,
                                                     SourceLocation Loc) {
@@ -695,6 +791,36 @@ CodeGenFunction::GenerateOpenMPCapturedStmtFunction(const CapturedStmt &S,
   return WrapperF;
 }
 
+llvm::Function *CodeGenFunction::GenerateOpenMPCapturedStmtFunctionAggregate(
+    const CapturedStmt &S, SourceLocation Loc) {
+  assert(
+      CapturedStmtInfo &&
+      "CapturedStmtInfo should be set when generating the captured function");
+  const CapturedDecl *CD = S.getCapturedDecl();
+  // Build the argument list.
+  FunctionArgList Args;
+  llvm::MapVector<const Decl *, std::pair<const VarDecl *, Address>> LocalAddrs;
+  llvm::DenseMap<const Decl *, std::pair<const Expr *, llvm::Value *>> VLASizes;
+  StringRef FunctionName = CapturedStmtInfo->getHelperName();
+  llvm::Function *F = emitOutlinedFunctionPrologueAggregate(
+      *this, Args, LocalAddrs, VLASizes, CXXThisValue, S, Loc, FunctionName);
+  CodeGenFunction::OMPPrivateScope LocalScope(*this);
+  for (const auto &LocalAddrPair : LocalAddrs) {
+    if (LocalAddrPair.second.first) {
+      LocalScope.addPrivate(LocalAddrPair.second.first,
+                            LocalAddrPair.second.second);
+    }
+  }
+  (void)LocalScope.Privatize();
+  for (const auto &VLASizePair : VLASizes)
+    VLASizeMap[VLASizePair.second.first] = VLASizePair.second.second;
+  PGO.assignRegionCounters(GlobalDecl(CD), F);
+  CapturedStmtInfo->EmitBody(*this, CD->getBody());
+  (void)LocalScope.ForceCleanup();
+  FinishFunction(CD->getBodyRBrace());
+  return F;
+}
+
 //===----------------------------------------------------------------------===//
 //                              OpenMP Directive Emission
 //===----------------------------------------------------------------------===//
diff --git a/clang/lib/CodeGen/CodeGenFunction.h b/clang/lib/CodeGen/CodeGenFunction.h
index e1e687af6a781b..4ad4b96767f795 100644
--- a/clang/lib/CodeGen/CodeGenFunction.h
+++ b/clang/lib/CodeGen/CodeGenFunction.h
@@ -3639,6 +3639,9 @@ class CodeGenFunction : public CodeGenTypeCache {
   Address GenerateCapturedStmtArgument(const CapturedStmt &S);
   llvm::Function *GenerateOpenMPCapturedStmtFunction(const CapturedStmt &S,
                                                      SourceLocation Loc);
+  llvm::Function *
+  GenerateOpenMPCapturedStmtFunctionAggregate(const CapturedStmt &S,
+                                              SourceLocation Loc);
   void GenerateOpenMPCapturedVars(const CapturedStmt &S,
                                   SmallVectorImpl<llvm::Value *> &CapturedVars);
   void emitOMPSimpleStore(LValue LVal, RValue RVal, QualType RValTy,
diff --git a/clang/test/OpenMP/declare_target_codegen.cpp b/clang/test/OpenMP/declare_target_codegen.cpp
index ba93772ede3e8e..81116c6617b5bd 100644
--- a/clang/test/OpenMP/declare_target_codegen.cpp
+++ b/clang/test/OpenMP/declare_target_codegen.cpp
@@ -150,7 +150,7 @@ int bar() { return 1 + foo() + bar() + baz1() + baz2(); }
 int maini1() {
   int a;
   static long aa = 32 + bbb + ccc + fff + ggg;
-// CHECK-DAG: define weak_odr protected void @__omp_offloading_{{.*}}maini1{{.*}}_l[[@LINE+1]](ptr {{.*}}, ptr noundef nonnull align {{[0-9]+}} dereferenceable({{[0-9]+}}) %{{.*}}, i64 {{.*}}, i64 {{.*}})
+// CHECK-DAG: define weak_odr protected void @__omp_offloading_{{.*}}maini1{{.*}}_l[[@LINE+1]](ptr {{.*}}, ptr {{.*}})
 #pragma omp target map(tofrom \
                        : a, b)
   {
@@ -163,7 +163,7 @@ int maini1() {
 
 int baz3() { return 2 + baz2(); }
 int baz2() {
-// CHECK-DAG: define weak_odr protected void @__omp_offloading_{{.*}}baz2{{.*}}_l[[@LINE+1]](ptr {{.*}}, i64 {{.*}})
+// CHECK-DAG: define weak_odr protected void @__omp_offloading_{{.*}}baz2{{.*}}_l[[@LINE+1]](ptr {{.*}}, ptr {{.*}})
 #pragma omp target parallel
   ++c;
   return 2 + baz3();
@@ -175,7 +175,7 @@ static __typeof(create) __t_create __attribute__((__weakref__("__create")));
 
 int baz5() {
   bool a;
-// CHECK-DAG: define weak_odr protected void @__omp_offloading_{{.*}}baz5{{.*}}_l[[@LINE+1]](ptr {{.*}}, i64 {{.*}})
+// CHECK-DAG: define weak_odr protected void @__omp_offloading_{{.*}}baz5{{.*}}_l[[@LINE+1]](ptr {{.*}}, ptr {{.*}})
 #pragma omp target
   a = __extension__(void *) & __t_create != 0;
   return a;
diff --git a/clang/test/OpenMP/declare_target_link_codegen.cpp b/clang/test/OpenMP/declare_target_link_codegen.cpp
index 189c9ac59c153c..ba63a4bc543476 100644
--- a/clang/test/OpenMP/declare_target_link_codegen.cpp
+++ b/clang/test/OpenMP/declare_target_link_codegen.cpp
@@ -52,7 +52,7 @@ int maini1() {
   return 0;
 }
 
-// DEVICE: define weak_odr protected void @__omp_offloading_{{.*}}_{{.*}}maini1{{.*}}_l44(ptr {{[^,]+}}, ptr noundef nonnull align {{[0-9]+}} dereferenceable{{[^,]*}}
+// DEVICE: define weak_odr protected void @__omp_offloading_{{.*}}_{{.*}}maini1{{.*}}_l44(ptr {{[^,]+}}, ptr {{[^,]*}}
 // DEVICE: [[C_REF:%.+]] = load ptr, ptr @c_decl_tgt_ref_ptr,
 // DEVICE: [[C:%.+]] = load i32, ptr [[C_REF]],
 // DEVICE: store i32 [[C]], ptr %
diff --git a/clang/test/OpenMP/distribute_codegen.cpp b/clang/test/OpenMP/distribute_codegen.cpp
index 34d14c89fedaed..aaa28980839668 100644
--- a/clang/test/OpenMP/distribute_codegen.cpp
+++ b/clang/test/OpenMP/distribute_codegen.cpp
@@ -1947,19 +1947,18 @@ int fint(void) { return ftemplate<int>(); }
 //
 //
 // CHECK17-LABEL: define {{[^@]+}}@{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z23without_schedule_clausePfS_S_S__l56
-// CHECK17-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], ptr noundef [[A:%.*]], ptr noundef [[B:%.*]], ptr noundef [[C:%.*]], ptr noundef [[D:%.*]]) #[[ATTR0:[0-9]+]] {
+// CHECK17-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], ptr noalias noundef [[__CONTEXT:%.*]]) #[[ATTR0:[0-9]+]] {
 // CHECK17-NEXT:  entry:
 // CHECK17-NEXT:    [[DYN_PTR_ADDR:%.*]] = alloca ptr, align 8
-// CHECK17-NEXT:    [[A_ADDR:%.*]] = alloca ptr, align 8
-// CHECK17-NEXT:    [[B_ADDR:%.*]] = alloca ptr, align 8
-// CHECK17-NEXT:    [[C_ADDR:%.*]] = alloca ptr, align 8
-// CHECK17-NEXT:    [[D_ADDR:%.*]] = alloca ptr, align 8
+// CHECK17-NEXT:    [[__CONTEXT_ADDR:%.*]] = alloca ptr, align 8
 // CHECK17-NEXT:    store ptr [[DYN_PTR]], ptr [[DYN_PTR_ADDR]], align 8
-// CHECK17-NEXT:    store ptr [[A]], ptr [[A_ADDR]], align 8
-// CHECK17-NEXT:    store ptr [[B]], ptr [[B_ADDR]], align 8
-// CHECK17-NEXT:    store ptr [[C]], ptr [[C_ADDR]], align 8
-// CHECK17-NEXT:    store ptr [[D]], ptr [[D_ADDR]], align 8
-// CHECK17-NEXT:    call void (ptr, i32, ptr, ...) @__kmpc_fork_teams(ptr @[[GLOB2:[0-9]+]], i32 4, ptr @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z23without_schedule_clausePfS_S_S__l56.omp_outlined, ptr [[A_ADDR]], ptr [[B_ADDR]], ptr [[C_ADDR]], ptr [[D_ADDR]])
+// CHECK17-NEXT:    store ptr [[__CONTEXT]], ptr [[__CONTEXT_ADDR]], align 8
+// CHECK17-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[__CONTEXT_ADDR]], align 8
+// CHECK17-NEXT:    [[TMP1:%.*]] = getelementptr inbounds [[STRUCT_ANON:%.*]], ptr [[TMP0]], i32 0, i32 0
+// CHECK17-NEXT:    [[TMP2:%.*]] = getelementptr inbounds [[STRUCT_ANON]], ptr [[TMP0]], i32 0, i32 1
+// CHECK17-NEXT:    [[TMP3:%.*]] = getelementptr inbounds [[STRUCT_ANON]], ptr [[TMP0]], i32 0, i32 2
+// CHECK17-NEXT:    [[TMP4:%.*]] = getelementptr inbounds [[STRUCT_ANON]], ptr [[TMP0]], i32 0, i32 3
+// CHECK17-NEXT:    call void (ptr, i32, ptr, ...) @__kmpc_fork_teams(ptr @[[GLOB2:[0-9]+]], i32 4, ptr @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z23without_schedule_clausePfS_S_S__l56.omp_outlined, ptr [[TMP1]], ptr [[TMP2]], ptr [[TMP3]], ptr [[TMP4]])
 // CHECK17-NEXT:    ret void
 //
 //
@@ -2058,19 +2057,18 @@ int fint(void) { return ftemplate<int>(); }
 //
 //
 // CHECK17-LABEL: define {{[^@]+}}@{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z18static_not_chunkedPfS_S_S__l68
-// CHECK17-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], ptr noundef [[A:%.*]], ptr noundef [[B:%.*]], ptr noundef [[C:%.*]], ptr noundef [[D:%.*]]) #[[ATTR0]] {
+// CHECK17-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], ptr noalias noundef [[__CONTEXT:%.*]]) #[[ATTR0]] {
 // CHECK17-NEXT:  entry:
 // CHECK17-NEXT:    [[DYN_PTR_ADDR:%.*]] = alloca ptr, align 8
-// CHECK17-NEXT:    [[A_ADDR:%.*]] = alloca ptr, align 8
-// CHECK17-NEXT:    [[B_ADDR:%.*]] = alloca ptr, align 8
-// CHECK17-NEXT:    [[C_ADDR:%.*]] = alloca ptr, align 8
-// CHECK17-NEXT:    [[D_ADDR:%.*]] = alloca ptr, align 8
+// CHECK17-NEXT:    [[__CONTEXT_ADDR:%.*]] = alloca ptr, align 8
 // CHECK17-NEXT:    store ptr [[DYN_PTR]], ptr [[DYN_PTR_ADDR]], align 8
-// CHECK17-NEXT:    store ptr [[A]], ptr [[A_ADDR]], align 8
-// CHECK17-NEXT:    store ptr [[B]], ptr [[B_ADDR]], align 8
-// CHECK17-NEXT:    store ptr [[C]], ptr [[C_ADDR]], align 8
-// CHECK17-NEXT:    store ptr [[D]], ptr [[D_ADDR]], align 8
-// CHECK17-NEXT:    call void (ptr, i32, ptr, ...) @__kmpc_fork_teams(ptr @[[GLOB2]], i32 4, ptr @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z18static_not_chunkedPfS_S_S__l68.omp_outlined, ptr [[A_ADDR]], ptr [[B_ADDR]], ptr [[C_ADDR]], ptr [[D_ADDR]])
+// CHECK17-NEXT:    store ptr [[__CONTEXT]], ptr [[__CONTEXT_ADDR]], align 8
+// CHECK17-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[__CONTEXT_ADDR]], align 8
+// CHECK17-NEXT:    [[TMP1:%.*]] = getelementptr inbounds [[STRUCT_ANON_0:%.*]], ptr [[TMP0]], i32 0, i32 0
+// CHECK17-NEXT:    [[TMP2:%.*]] = getelementptr inbounds [[STRUCT_ANON_0]], ptr [[TMP0]], i32 0, i32 1
+// CHECK17-NEXT:    [[TMP3:%.*]] = getelementptr inbounds [[STRUCT_ANON_0]], ptr [[TMP0]], i32 0, i32 2
+// CHECK17-NEXT:    [[TMP4:%.*]] = getelementptr inbounds [[STRUCT_ANON_0]], ptr [[TMP0]], i32 0, i32 3
+// CHECK17-NEXT:    call void (ptr, i32, ptr, ...) @__kmpc_fork_teams(ptr @[[GLOB2]], i32 4, ptr @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z18static_not_chunkedPfS_S_S__l68.omp_outlined, ptr [[TMP1]], ptr [[TMP2]], ptr [[TMP3]], ptr [[TMP4]])
 // CHECK17-NEXT:    ret void
 //
 //
@@ -2169,19 +2167,18 @@ int fint(void) { return ftemplate<int>(); }
 //
 //
 // CHECK17-LABEL: define {{[^@]+}}@{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z14static_chunkedPfS_S_S__l80
-// CHECK17-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], ptr noundef [[A:%.*]], ptr noundef [[B:%.*]], ptr noundef [[C:%.*]], ptr noundef [[D:%.*]]) #[[ATTR0]] {
+// CHECK17-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], ptr noalias noundef [[__CONTEXT:%.*]]) #[[ATTR0]] {
 // CHECK17-NEXT:  entry:
 // CHECK17-NEXT:    [[DYN_PTR_ADDR:%.*]] = alloca ptr, align 8
-// CHECK17-NEXT:    [[A_ADDR:%.*]] = alloca ptr, align 8
-// CHECK17-NEXT:    [[B_ADDR:%.*]] = alloca ptr, align 8
-// CHECK17-NEXT:    [[C_ADDR:%.*]] = alloca ptr, align 8
-// CHECK17-NEXT:    [[D_ADDR:%.*]] = alloca ptr, align 8
+// CHECK17-NEXT:    [[__CONTEXT_ADDR:%.*]] = alloca ptr, align 8
 // CHECK17-NEXT:    store ptr [[DYN_PTR]], ptr [[DYN_PTR_ADDR]], align 8
-// CHECK17-NEXT:    store ptr [[A]], ptr [[A_ADDR]], align 8
-// CHECK17-NEXT:    store ptr [[B]], ptr [[B_ADDR]], align 8
-// CHECK17-NEXT:    store ptr [[C]], ptr [[C_ADDR]], align 8
-// CHECK17-NEXT:    store ptr [[D]], ptr [[D_ADDR]], align 8
-// CHECK17-NEXT:    call void (ptr, i32, ptr, ...) @__kmpc_fork_teams(ptr @[[GLOB2]], i32 4, ptr @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z14static_chunkedPfS_S_S__l80.omp_outlined, ptr [[A_ADDR]], ptr [[B_ADDR]], ptr [[C_ADDR]], ptr [[D_ADDR]])
+// CHECK17-NEXT:    store ptr [[__CONTEXT]], ptr [[__CONTEXT_ADDR]], align 8
+// CHECK17-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[__CONTEXT_ADDR]], align 8
+// CHECK17-NEXT:    [[TMP1:%.*]] = getelementptr inbounds [[STRUCT_ANON_1:%.*]], ptr [[TMP0]], i32 0, i32 0
+// CHECK17-NEXT:    [[TMP2:%.*]] = getelementptr inbounds [[STRUCT_ANON_1]], ptr [[TMP0]], i32 0, i32 1
+// CHECK17-NEXT:    [[TMP3:%.*]] = getelementptr inbounds [[STRUCT_ANON_1]], ptr [[TMP0]], i32 0, i32 2
+// CHECK17-NEXT:    [[TMP4:%.*]] = getelementptr inbounds [[STRUCT_ANON_1]], ptr [[TMP0]], i32 0, i32 3
+// CHECK17-NEXT:    call void (ptr, i32, ptr, ...) @__kmpc_fork_teams(ptr @[[GLOB2]], i32 4, ptr @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z14static_chunkedPfS_S_S__l80.omp_outlined, ptr [[TMP1]], ptr [[TMP2]], ptr [[TMP3]], ptr [[TMP4]])
 // CHECK17-NEXT:    ret void
 //
 //
@@ -2297,13 +2294,18 @@ int fint(void) { return ftemplate<int>(); }
 //
 //
 // CHECK17-LABEL: define {{[^@]+}}@{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z12test_precondv_l92
-// CHECK17-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], i64 noundef [[A:%.*]]) #[[ATTR0]] {
+// CHECK17-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], ptr noalias noundef [[__CONTEXT:%.*]]) #[[ATTR0]] {
 // CHECK17-NEXT:  entry:
 // CHECK17-NEXT:    [[DYN_PTR_ADDR:%.*]] = alloca ptr, align 8
-// CHECK17-NEXT:    [[A_ADDR:%.*]] = alloca i64, align 8
+// CHECK17-NEXT:    [[__CONTEXT_ADDR:%.*]] = alloca ptr, align 8
+// CHECK17-NEXT:    [[A:%.*]] = alloca i8, align 1
 // CHECK17-NEXT:    store ptr [[DYN_PTR]], ptr [[DYN_PTR_ADDR]], align 8
-// CHECK17-NEXT:    store i64 [[A]], ptr [[A_ADDR]], align 8
-// CHECK17-NEXT:    call void (ptr, i32, ptr, ...) @__kmpc_fork_teams(ptr @[[GLOB2]], i32 1, ptr @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z12test_precondv_l92.omp_outlined, ptr [[A_ADDR]])
+// CHECK17-NEXT:    store ptr [[__CONTEXT]], ptr [[__CONTEXT_ADDR]], align 8
+// CHECK17-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[__CONTEXT_ADDR]], align 8
+// CHECK17-NEXT:    [[TMP1:%.*]] = getelementptr inbounds [[STRUCT_ANON_2:%.*]], ptr [[TMP0]], i32 0, i32 0
+// CHECK17-NEXT:    [[TMP2:%.*]] = load i8, ptr [[TMP1]], align 1
+// CHECK17-NEXT:    store i8 [[TMP2]], ptr [[A]], align 1
+// CHECK17-NEXT:    call void (ptr, i32, ptr, ...) @__kmpc_fork_teams(ptr @[[GLOB2]], i32 1, ptr @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z12test_precondv_l92.omp_outlined, ptr [[A]])
 // CHECK17-NEXT:    ret void
 //
 //
@@ -2401,13 +2403,18 @@ int fint(void) { return ftemplate<int>(); }
 //
 //
 // CHECK17-LABEL: define {{[^@]+}}@{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z9ftemplateIiET_v_l108
-// CHECK17-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], i64 noundef [[AA:%.*]]) #[[ATTR0]] {
+// CHECK17-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], ptr noalias noundef [[__CONTEXT:%.*]]) #[[ATTR0]] {
 // CHECK17-NEXT:  entry:
 // CHECK17-NEXT:    [[DYN_PTR_ADDR:%.*]] = alloca ptr, align 8
-// CHECK17-NEXT:    [[AA_ADDR:%.*]] = alloca i64, align 8
+// CHECK17-NEXT:    [[__CONTEXT_ADDR:%.*]] = alloca ptr, align 8
+// CHECK17-NEXT:    [[AA:%.*]] = alloca i16, align 2
 // CHECK17-NEXT:    store ptr [[DYN_PTR]], ptr [[DYN_PTR_ADDR]], align 8
-// CHECK17-NEXT:    store i64 [[AA]], ptr [[AA_ADDR]], align 8
-// CHECK17-NEXT:    call void (ptr, i32, ptr, ...) @__kmpc_fork_teams(ptr @[[GLOB2]], i32 1, ptr @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z9ftemplateIiET_v_l108.omp_outlined, ptr [[AA_ADDR]])
+// CHECK17-NEXT:    store ptr [[__CONTEXT]], ptr [[__CONTEXT_ADDR]], align 8
+// C...
[truncated]

jhuber6 · 2024-05-06T20:27:56Z

I'm unsure how to resolve the issue of CGF.EmitScalarExpr(NumTeams) not returning the correct value now. For the following code

#include <stdio.h>
#include <stdlib.h>

int main() {
  int Teams = 10;
#pragma omp target teams distribute parallel for num_teams(Teams)
  for (int i = 0; i < 1; ++i)
    ;

  return 0;
}

I get this LLVM-IR, which suggests that it's reading beyond the expected __context struct.

; Function Attrs: convergent noinline norecurse nounwind optnone uwtable
define weak_odr protected void @__omp_offloading_10302_af886a3_main_l9(ptr noalias noundef %dyn_ptr, ptr noalias noundef %__context) #0 {
entry:
  %dyn_ptr.addr = alloca ptr, align 8
  %__context.addr = alloca ptr, align 8
  %.capture_expr. = alloca i32, align 4
  %0 = call i32 @__kmpc_global_thread_num(ptr @3)
  store ptr %dyn_ptr, ptr %dyn_ptr.addr, align 8
  store ptr %__context, ptr %__context.addr, align 8
  %1 = load ptr, ptr %__context.addr, align 8
  %2 = getelementptr inbounds %struct.anon, ptr %1, i32 0, i32 0
  %3 = load i32, ptr %2, align 4
  store i32 %3, ptr %.capture_expr., align 4
  %4 = load i32, ptr %.capture_expr., align 4
  call void @__kmpc_push_num_teams(ptr @3, i32 %0, i32 %4, i32 0)
  call void (ptr, i32, ptr, ...) @__kmpc_fork_teams(ptr @3, i32 0, ptr @__omp_offloading_10302_af886a3_main_l9.omp_outlined)
  ret void
}

Any idea how to resolve this? I'm assuming the way we do this now is no longer valid somehow because of the struct indirection.

alexey-bataev · 2024-05-07T13:34:17Z

= load i32, ptr %.capture_expr., align 4

Why do you think it reads beyond __context? %2 = getelementptr inbounds %struct.anon, ptr %1, i32 0, i32 0 points to the first element in the __context, if I'm not missing something. If it has the wrong value, looks like it is not written correctly

alexey-bataev · 2024-05-07T13:35:07Z

clang/lib/CodeGen/CGOpenMPRuntime.cpp

+        if (CGM.getLangOpts().OpenMPIsTargetDevice && !isGPU())
+          return CGF.GenerateOpenMPCapturedStmtFunctionAggregate(
+              CS, D.getBeginLoc());
+        else


Suggested change

else

alexey-bataev · 2024-05-07T13:35:58Z

clang/lib/CodeGen/CGStmtOpenMP.cpp

+      LocalAddrs.insert({FD, {Var, CopyAddr}});
+    } else {
+      // If 'this' is captured, load it into CXXThisValue.
+      assert(I->capturesThis());


Add a message

jhuber6 · 2024-05-07T14:06:28Z

= load i32, ptr %.capture_expr., align 4
Why do you think it reads beyond __context? %2 = getelementptr inbounds %struct.anon, ptr %1, i32 0, i32 0 points to the first element in the __context, if I'm not missing something. If it has the wrong value, looks like it is not written correctly

I think I copied the wrong code somehow,

; Function Attrs: convergent noinline norecurse nounwind optnone uwtable
define weak_odr protected void @__omp_offloading_10302_adc9471_main_l10(ptr noalias noundef %dyn_ptr, ptr noalias noundef %__context) #0 {
entry:
  %dyn_ptr.addr = alloca ptr, align 8
  %__context.addr = alloca ptr, align 8
  %Teams = alloca i32, align 4
  %Threads = alloca i32, align 4
  %.capture_expr. = alloca i32, align 4
  %.capture_expr.1 = alloca i32, align 4
  %Teams.casted = alloca i64, align 8
  %Threads.casted = alloca i64, align 8
  %0 = call i32 @__kmpc_global_thread_num(ptr @3)
  store ptr %dyn_ptr, ptr %dyn_ptr.addr, align 8
  store ptr %__context, ptr %__context.addr, align 8
  %1 = load ptr, ptr %__context.addr, align 8
  %2 = getelementptr inbounds %struct.anon, ptr %1, i32 0, i32 0
  %3 = load i32, ptr %2, align 4
  store i32 %3, ptr %Teams, align 4
  %4 = getelementptr inbounds %struct.anon, ptr %1, i32 0, i32 1
  %5 = load i32, ptr %4, align 4
  store i32 %5, ptr %Threads, align 4
  %6 = getelementptr inbounds %struct.anon, ptr %1, i32 0, i32 2
  %7 = load i32, ptr %6, align 4
  store i32 %7, ptr %.capture_expr., align 4
  %8 = getelementptr inbounds %struct.anon, ptr %1, i32 0, i32 3
  %9 = load i32, ptr %8, align 4
  store i32 %9, ptr %.capture_expr.1, align 4
  %10 = load i32, ptr %.capture_expr., align 4
  %11 = load i32, ptr %.capture_expr.1, align 4
  call void @__kmpc_push_num_teams(ptr @3, i32 %0, i32 %10, i32 %11)
  %12 = load i32, ptr %Teams, align 4
  store i32 %12, ptr %Teams.casted, align 4
  %13 = load i64, ptr %Teams.casted, align 8
  %14 = load i32, ptr %Threads, align 4
  store i32 %14, ptr %Threads.casted, align 4
  %15 = load i64, ptr %Threads.casted, align 8
  call void (ptr, i32, ptr, ...) @__kmpc_fork_teams(ptr @3, i32 2, ptr @__omp_offloading_10302_adc9471_main_l10.omp_outlined, i64 %13, i64 %15)
  ret void
}

This is what I get from the corresponding C code.

#include <stdio.h>
#include <assert.h>
#include <stdlib.h>

int main() {
  int Threads = 6;
  int Teams = 10;

  long unsigned s = 0;
#pragma omp target teams distribute parallel for num_teams(Teams)              \
    thread_limit(Threads)
  for (int i = 0; i < Threads * Teams; ++i) {
    assert(Teams == 10);
  }

  return 0;
}

When I compile run it, I get the following. So it warns on some nonsense team value (It will be even more corrupt with other cases, but this was the simplest I could get).

> clang malloc.c -fopenmp -fopenmp-targets=x86_64-pc-linux-gnu                                         
> ./a.out 
OMP: Warning #96: Cannot form a team with 48 threads, using 21 instead.
OMP: Hint Consider unsetting KMP_DEVICE_THREAD_LIMIT (KMP_ALL_THREADS), KMP_TEAMS_THREAD_LIMIT, and OMP_THREAD_LIMIT (if any are set).

The LLVM-IR is confusing to me because it's doing a GEP up to 3, which is suggesting that the Teams / Threads values are appended but the number of arguments. So, in the runtime for this case it states there are 5 arguments, first is the dyn_ptr that's alwasy there, then the two threads / teams. And then these extra threads / teams. This then causes mayhem because only the first two are set to what we expect them to be.

alexey-bataev · 2024-05-07T16:46:55Z

struct.anon

Can you provide full IR dump here?

jhuber6 · 2024-05-07T16:51:19Z

struct.anon
Can you provide full IR dump here?

https://godbolt.org/z/48h5s3W6v

alexey-bataev · 2024-05-07T18:28:27Z

struct.anon
Can you provide full IR dump here?
https://godbolt.org/z/48h5s3W6v

It does not look like the issue of the target code, I don't see any wrong access for __context. Мост probably something wrong with the host code/runtime.

jhuber6 · 2024-05-07T18:29:45Z

struct.anon
Can you provide full IR dump here?
https://godbolt.org/z/48h5s3W6v
It does not look like the issue of the target code, I don't see any wrong access for __context. Мост probably something wrong with the host code/runtime.

Yeah, I think that's correct. Looking at the IR it seems to add the two extra arguments and call them as I'd expect, but for some reason it gets corrupted in the runtime layer. It might be doing something weird with the arguments.

alexey-bataev · 2024-05-07T18:45:25Z

struct.anon
Can you provide full IR dump here?
https://godbolt.org/z/48h5s3W6v
It does not look like the issue of the target code, I don't see any wrong access for __context. Мост probably something wrong with the host code/runtime.
Yeah, I think that's correct. Looking at the IR it seems to add the two extra arguments and call them as I'd expect, but for some reason it gets corrupted in the runtime layer. It might be doing something weird with the arguments.

Maybe. The message is emitted on the host, so there is something wrong with the host code or runtime library.

jhuber6 · 2024-05-07T19:16:33Z

Maybe. The message is emitted on the host, so there is something wrong with the host code or runtime library.

This might be some issue with the host codegen actually. The following modified the runtime to print the arguments as-seen by the runtime.

> clang malloc.c -fopenmp -fopenmp-targets=x86_64-pc-linux-gnu                                                    
> ./a.out 
10
131675107360774
131675107360778
110294760161286
18446744073709551615
10
131675107360774
131675107360778
110294760161286
OMP: Warning #96: Cannot form a team with 48 threads, using 21 instead.
OMP: Hint Consider unsetting KMP_DEVICE_THREAD_LIMIT (KMP_ALL_THREADS), KMP_TEAMS_THREAD_LIMIT, and OMP_THREAD_LIMIT (if any are set).
> clang malloc.c -fopenmp -fopenmp-targets=x86_64-pc-linux-gnu -O3
> ./a.out 
10
6
10
6
18446744073709551615
10
6
10
6

With optimization on, I see what I expect. With -O0 it seems to give me garbage. Looking at the ASM also suggests that only the 0x10 value is written for some reason? https://godbolt.org/z/86hTjjaa8 is the host-IR I get without optimizations.

jhuber6 · 2024-05-07T19:43:19Z

I'm getting the same kind of output on main, but the warning is mysteriously absent. Same results for GPU offloading.

alexey-bataev · 2024-05-07T21:29:28Z

Hmm, hard to tell, need to debug it.

jhuber6 · 2024-05-07T21:46:15Z

Hmm, hard to tell, need to debug it.

Somehow when I print it in the runtime it shows up as garbage, but the actual region seems to get correct values. There shouldn't be anything in-between the arguments I'm printing and the kernel launch however so I'm stumped.

jhuber6 · 2024-05-09T20:37:52Z

I hacked around it in the runtime itself. Obviously this is very OpenMP specific behavior but so was the old method. Passes all tests now.

Summary: We previously had multiple options for this, this patch replaces them with `LIBOMPTARGET_DLOPEN_PLUGINS=` to be a list of plugins to dynamically use. It defaults to everything right now. This ignores the `host` plugin because the `libffi` dependency is going to be removed soon hopefully in llvm#91264.

Summary: We previously had multiple options for this, this patch replaces them with `LIBOMPTARGET_DLOPEN_PLUGINS=` to be a list of plugins to dynamically use. It defaults to everything right now. This ignores the `host` plugin because the `libffi` dependency is going to be removed soon hopefully in #91264.

Summary: This patch attempts to remove the dependency on `libffi` by instead emitting the host / CPU kernels using an aggregate struct made from the captured context. This callows us to have a fixed function prototype we can call directly rather than requiring an extra library to decode the ABI to call a function with N (non variadic) arguments.

jhuber6 · 2024-05-29T21:05:31Z

ping

jhuber6 requested review from alexey-bataev, ggeorgakoudis and jdoerfert May 6, 2024 19:58

llvmbot added clang Clang issues not falling into any other category clang:codegen clang:openmp OpenMP related changes to Clang offload labels May 6, 2024

alexey-bataev reviewed May 7, 2024

View reviewed changes

jhuber6 force-pushed the RemoveFFI branch from d62699d to 3a9ec9d Compare May 9, 2024 20:36

jhuber6 changed the title ~~[WIP][OpenMP] Remove dependency on libffi from offloading runtime~~ [OpenMP] Remove dependency on libffi from offloading runtime May 9, 2024

jhuber6 force-pushed the RemoveFFI branch from 3a9ec9d to 4a3ea3d Compare May 15, 2024 21:25

jhuber6 mentioned this pull request May 22, 2024

[Offload] Rework handling for loading vendor runtimes #93073

Merged

jhuber6 force-pushed the RemoveFFI branch from 4a3ea3d to f0c9fae Compare May 22, 2024 18:50

jhuber6 force-pushed the RemoveFFI branch from f0c9fae to c9adac9 Compare May 29, 2024 21:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[OpenMP] Remove dependency on `libffi` from offloading runtime #91264

[OpenMP] Remove dependency on `libffi` from offloading runtime #91264

jhuber6 commented May 6, 2024 •

edited

Loading

llvmbot commented May 6, 2024 •

edited

Loading

jhuber6 commented May 6, 2024

alexey-bataev commented May 7, 2024

alexey-bataev May 7, 2024

alexey-bataev May 7, 2024

jhuber6 commented May 7, 2024 •

edited

Loading

alexey-bataev commented May 7, 2024

jhuber6 commented May 7, 2024

alexey-bataev commented May 7, 2024

jhuber6 commented May 7, 2024

alexey-bataev commented May 7, 2024

jhuber6 commented May 7, 2024 •

edited

Loading

jhuber6 commented May 7, 2024 •

edited

Loading

alexey-bataev commented May 7, 2024

jhuber6 commented May 7, 2024

jhuber6 commented May 9, 2024

jhuber6 commented May 29, 2024

[OpenMP] Remove dependency on libffi from offloading runtime #91264

Are you sure you want to change the base?

[OpenMP] Remove dependency on libffi from offloading runtime #91264

Conversation

jhuber6 commented May 6, 2024 • edited Loading

llvmbot commented May 6, 2024 • edited Loading

jhuber6 commented May 6, 2024

alexey-bataev commented May 7, 2024

alexey-bataev May 7, 2024

Choose a reason for hiding this comment

alexey-bataev May 7, 2024

Choose a reason for hiding this comment

jhuber6 commented May 7, 2024 • edited Loading

alexey-bataev commented May 7, 2024

jhuber6 commented May 7, 2024

alexey-bataev commented May 7, 2024

jhuber6 commented May 7, 2024

alexey-bataev commented May 7, 2024

jhuber6 commented May 7, 2024 • edited Loading

jhuber6 commented May 7, 2024 • edited Loading

alexey-bataev commented May 7, 2024

jhuber6 commented May 7, 2024

jhuber6 commented May 9, 2024

jhuber6 commented May 29, 2024

[OpenMP] Remove dependency on `libffi` from offloading runtime #91264

[OpenMP] Remove dependency on `libffi` from offloading runtime #91264

jhuber6 commented May 6, 2024 •

edited

Loading

llvmbot commented May 6, 2024 •

edited

Loading

jhuber6 commented May 7, 2024 •

edited

Loading

jhuber6 commented May 7, 2024 •

edited

Loading

jhuber6 commented May 7, 2024 •

edited

Loading