Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[OpenMP] Remove dependency on libffi from offloading runtime #91264

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jhuber6
Copy link
Contributor

@jhuber6 jhuber6 commented May 6, 2024

Summary:
This patch attempts to remove the dependency on libffi by instead
emitting the host / CPU kernels using an aggregate struct made from the
captured context. This callows us to have a fixed function prototype we
can call directly rather than requiring an extra library to decode the
ABI to call a function with N (non variadic) arguments.

Fixes #88738

@llvmbot llvmbot added clang Clang issues not falling into any other category clang:codegen clang:openmp OpenMP related changes to Clang offload labels May 6, 2024
@llvmbot
Copy link
Collaborator

llvmbot commented May 6, 2024

@llvm/pr-subscribers-offload
@llvm/pr-subscribers-clang

@llvm/pr-subscribers-clang-codegen

Author: Joseph Huber (jhuber6)

Changes

Summary:
This patch attempts to remove the dependency on libffi by instead
emitting the host / CPU kernels using an aggregate struct made from the
captured context. This callows us to have a fixed function prototype we
can call directly rather than requiring an extra library to decode the
ABI to call a function with N (non variadic) arguments.

NOTE:
This currently fails for tests using a non-constant value for
num_teams on the CPU. It seems that these use a method called
CGF.EmitScalarExpr(NumTeams) which doesn't seem to work correctly
with the created aggregate struct.


Patch is 1.07 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/91264.diff

38 Files Affected:

  • (modified) clang/lib/CodeGen/CGOpenMPRuntime.cpp (+6-2)
  • (modified) clang/lib/CodeGen/CGStmtOpenMP.cpp (+126)
  • (modified) clang/lib/CodeGen/CodeGenFunction.h (+3)
  • (modified) clang/test/OpenMP/declare_target_codegen.cpp (+3-3)
  • (modified) clang/test/OpenMP/declare_target_link_codegen.cpp (+1-1)
  • (modified) clang/test/OpenMP/distribute_codegen.cpp (+90-76)
  • (modified) clang/test/OpenMP/distribute_simd_codegen.cpp (+196-160)
  • (modified) clang/test/OpenMP/openmp_offload_codegen.cpp (+1-1)
  • (modified) clang/test/OpenMP/target_firstprivate_codegen.cpp (+704-644)
  • (modified) clang/test/OpenMP/target_ompx_dyn_cgroup_mem_codegen.cpp (+170-102)
  • (modified) clang/test/OpenMP/target_parallel_codegen.cpp (+264-210)
  • (modified) clang/test/OpenMP/target_parallel_for_codegen.cpp (+306-240)
  • (modified) clang/test/OpenMP/target_parallel_for_simd_codegen.cpp (+638-498)
  • (modified) clang/test/OpenMP/target_parallel_generic_loop_codegen-2.cpp (+48-28)
  • (modified) clang/test/OpenMP/target_parallel_if_codegen.cpp (+178-106)
  • (modified) clang/test/OpenMP/target_parallel_num_threads_codegen.cpp (+154-94)
  • (modified) clang/test/OpenMP/target_private_codegen.cpp (+361-116)
  • (modified) clang/test/OpenMP/target_reduction_codegen.cpp (+290-106)
  • (modified) clang/test/OpenMP/target_task_affinity_codegen.cpp (+72-70)
  • (modified) clang/test/OpenMP/target_teams_codegen.cpp (+402-298)
  • (modified) clang/test/OpenMP/target_teams_distribute_codegen.cpp (+330-256)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_codegen.cpp (+82-60)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_firstprivate_codegen.cpp (+160-139)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_private_codegen.cpp (+56-41)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_codegen.cpp (+110-80)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_firstprivate_codegen.cpp (+160-139)
  • (modified) clang/test/OpenMP/target_teams_distribute_parallel_for_simd_private_codegen.cpp (+56-41)
  • (modified) clang/test/OpenMP/target_teams_distribute_simd_codegen.cpp (+652-504)
  • (modified) clang/test/OpenMP/target_teams_generic_loop_codegen-1.cpp (+82-60)
  • (modified) clang/test/OpenMP/target_teams_generic_loop_private_codegen.cpp (+40-25)
  • (modified) clang/test/OpenMP/target_teams_map_codegen.cpp (+170-142)
  • (modified) clang/test/OpenMP/target_teams_num_teams_codegen.cpp (+154-94)
  • (modified) clang/test/OpenMP/target_teams_thread_limit_codegen.cpp (+164-100)
  • (modified) clang/test/OpenMP/teams_codegen.cpp (+104-60)
  • (modified) offload/plugins-nextgen/host/CMakeLists.txt (-13)
  • (removed) offload/plugins-nextgen/host/dynamic_ffi/ffi.cpp (-75)
  • (removed) offload/plugins-nextgen/host/dynamic_ffi/ffi.h (-78)
  • (modified) offload/plugins-nextgen/host/src/rtl.cpp (+23-18)
diff --git a/clang/lib/CodeGen/CGOpenMPRuntime.cpp b/clang/lib/CodeGen/CGOpenMPRuntime.cpp
index e39c7c58d2780e..3cd4bcff2f5852 100644
--- a/clang/lib/CodeGen/CGOpenMPRuntime.cpp
+++ b/clang/lib/CodeGen/CGOpenMPRuntime.cpp
@@ -5932,12 +5932,16 @@ void CGOpenMPRuntime::emitTargetOutlinedFunctionHelper(
 
   CodeGenFunction CGF(CGM, true);
   llvm::OpenMPIRBuilder::FunctionGenCallback &&GenerateOutlinedFunction =
-      [&CGF, &D, &CodeGen](StringRef EntryFnName) {
+      [&CGF, &D, &CodeGen, this](StringRef EntryFnName) {
         const CapturedStmt &CS = *D.getCapturedStmt(OMPD_target);
 
         CGOpenMPTargetRegionInfo CGInfo(CS, CodeGen, EntryFnName);
         CodeGenFunction::CGCapturedStmtRAII CapInfoRAII(CGF, &CGInfo);
-        return CGF.GenerateOpenMPCapturedStmtFunction(CS, D.getBeginLoc());
+        if (CGM.getLangOpts().OpenMPIsTargetDevice && !isGPU())
+          return CGF.GenerateOpenMPCapturedStmtFunctionAggregate(
+              CS, D.getBeginLoc());
+        else
+          return CGF.GenerateOpenMPCapturedStmtFunction(CS, D.getBeginLoc());
       };
 
   OMPBuilder.emitTargetRegionFunction(EntryInfo, GenerateOutlinedFunction,
diff --git a/clang/lib/CodeGen/CGStmtOpenMP.cpp b/clang/lib/CodeGen/CGStmtOpenMP.cpp
index ef3aa3a8e0dc61..b9d27815a8ae24 100644
--- a/clang/lib/CodeGen/CGStmtOpenMP.cpp
+++ b/clang/lib/CodeGen/CGStmtOpenMP.cpp
@@ -613,6 +613,102 @@ static llvm::Function *emitOutlinedFunctionPrologue(
   return F;
 }
 
+static llvm::Function *emitOutlinedFunctionPrologueAggregate(
+    CodeGenFunction &CGF, FunctionArgList &Args,
+    llvm::MapVector<const Decl *, std::pair<const VarDecl *, Address>>
+        &LocalAddrs,
+    llvm::DenseMap<const Decl *, std::pair<const Expr *, llvm::Value *>>
+        &VLASizes,
+    llvm::Value *&CXXThisValue, const CapturedStmt &CS, SourceLocation Loc,
+    StringRef FunctionName) {
+  const CapturedDecl *CD = CS.getCapturedDecl();
+  const RecordDecl *RD = CS.getCapturedRecordDecl();
+
+  CXXThisValue = nullptr;
+  // Build the argument list.
+  CodeGenModule &CGM = CGF.CGM;
+  ASTContext &Ctx = CGM.getContext();
+  Args.append(CD->param_begin(), CD->param_end());
+
+  // Create the function declaration.
+  const CGFunctionInfo &FuncInfo =
+      CGM.getTypes().arrangeBuiltinFunctionDeclaration(Ctx.VoidTy, Args);
+  llvm::FunctionType *FuncLLVMTy = CGM.getTypes().GetFunctionType(FuncInfo);
+
+  auto *F =
+      llvm::Function::Create(FuncLLVMTy, llvm::GlobalValue::InternalLinkage,
+                             FunctionName, &CGM.getModule());
+  CGM.SetInternalFunctionAttributes(CD, F, FuncInfo);
+  if (CD->isNothrow())
+    F->setDoesNotThrow();
+  F->setDoesNotRecurse();
+
+  // Generate the function.
+  CGF.StartFunction(CD, Ctx.VoidTy, F, FuncInfo, Args, Loc, Loc);
+  Address ContextAddr = CGF.GetAddrOfLocalVar(CD->getContextParam());
+  llvm::Value *ContextV = CGF.Builder.CreateLoad(ContextAddr);
+  LValue ContextLV = CGF.MakeNaturalAlignAddrLValue(
+      ContextV, CGM.getContext().getTagDeclType(RD));
+  auto I = CS.captures().begin();
+  for (const FieldDecl *FD : RD->fields()) {
+    LValue FieldLV = CGF.EmitLValueForFieldInitialization(ContextLV, FD);
+    // Do not map arguments if we emit function with non-original types.
+    Address LocalAddr = FieldLV.getAddress(CGF);
+    // If we are capturing a pointer by copy we don't need to do anything, just
+    // use the value that we get from the arguments.
+    if (I->capturesVariableByCopy() && FD->getType()->isAnyPointerType()) {
+      const VarDecl *CurVD = I->getCapturedVar();
+      LocalAddrs.insert({FD, {CurVD, LocalAddr}});
+      ++I;
+      continue;
+    }
+
+    LValue ArgLVal =
+        CGF.MakeAddrLValue(LocalAddr, FD->getType(), AlignmentSource::Decl);
+    if (FD->hasCapturedVLAType()) {
+      llvm::Value *ExprArg = CGF.EmitLoadOfScalar(ArgLVal, I->getLocation());
+      const VariableArrayType *VAT = FD->getCapturedVLAType();
+      VLASizes.try_emplace(FD, VAT->getSizeExpr(), ExprArg);
+    } else if (I->capturesVariable()) {
+      const VarDecl *Var = I->getCapturedVar();
+      QualType VarTy = Var->getType();
+      Address ArgAddr = ArgLVal.getAddress(CGF);
+      if (ArgLVal.getType()->isLValueReferenceType()) {
+        ArgAddr = CGF.EmitLoadOfReference(ArgLVal);
+      } else if (!VarTy->isVariablyModifiedType() || !VarTy->isPointerType()) {
+        assert(ArgLVal.getType()->isPointerType());
+        ArgAddr = CGF.EmitLoadOfPointer(
+            ArgAddr, ArgLVal.getType()->castAs<PointerType>());
+      }
+      LocalAddrs.insert(
+          {FD,
+           {Var, Address(ArgAddr.getBasePointer(), ArgAddr.getElementType(),
+                         Ctx.getDeclAlign(Var))}});
+    } else if (I->capturesVariableByCopy()) {
+      assert(!FD->getType()->isAnyPointerType() &&
+             "Not expecting a captured pointer.");
+      const VarDecl *Var = I->getCapturedVar();
+      Address CopyAddr = CGF.CreateMemTemp(FD->getType(), Ctx.getDeclAlign(FD),
+                                           Var->getName());
+      LValue CopyLVal =
+          CGF.MakeAddrLValue(CopyAddr, FD->getType(), AlignmentSource::Decl);
+
+      RValue ArgRVal = CGF.EmitLoadOfLValue(ArgLVal, I->getLocation());
+      CGF.EmitStoreThroughLValue(ArgRVal, CopyLVal);
+
+      LocalAddrs.insert({FD, {Var, CopyAddr}});
+    } else {
+      // If 'this' is captured, load it into CXXThisValue.
+      assert(I->capturesThis());
+      CXXThisValue = CGF.EmitLoadOfScalar(ArgLVal, I->getLocation());
+      LocalAddrs.insert({FD, {nullptr, ArgLVal.getAddress(CGF)}});
+    }
+    ++I;
+  }
+
+  return F;
+}
+
 llvm::Function *
 CodeGenFunction::GenerateOpenMPCapturedStmtFunction(const CapturedStmt &S,
                                                     SourceLocation Loc) {
@@ -695,6 +791,36 @@ CodeGenFunction::GenerateOpenMPCapturedStmtFunction(const CapturedStmt &S,
   return WrapperF;
 }
 
+llvm::Function *CodeGenFunction::GenerateOpenMPCapturedStmtFunctionAggregate(
+    const CapturedStmt &S, SourceLocation Loc) {
+  assert(
+      CapturedStmtInfo &&
+      "CapturedStmtInfo should be set when generating the captured function");
+  const CapturedDecl *CD = S.getCapturedDecl();
+  // Build the argument list.
+  FunctionArgList Args;
+  llvm::MapVector<const Decl *, std::pair<const VarDecl *, Address>> LocalAddrs;
+  llvm::DenseMap<const Decl *, std::pair<const Expr *, llvm::Value *>> VLASizes;
+  StringRef FunctionName = CapturedStmtInfo->getHelperName();
+  llvm::Function *F = emitOutlinedFunctionPrologueAggregate(
+      *this, Args, LocalAddrs, VLASizes, CXXThisValue, S, Loc, FunctionName);
+  CodeGenFunction::OMPPrivateScope LocalScope(*this);
+  for (const auto &LocalAddrPair : LocalAddrs) {
+    if (LocalAddrPair.second.first) {
+      LocalScope.addPrivate(LocalAddrPair.second.first,
+                            LocalAddrPair.second.second);
+    }
+  }
+  (void)LocalScope.Privatize();
+  for (const auto &VLASizePair : VLASizes)
+    VLASizeMap[VLASizePair.second.first] = VLASizePair.second.second;
+  PGO.assignRegionCounters(GlobalDecl(CD), F);
+  CapturedStmtInfo->EmitBody(*this, CD->getBody());
+  (void)LocalScope.ForceCleanup();
+  FinishFunction(CD->getBodyRBrace());
+  return F;
+}
+
 //===----------------------------------------------------------------------===//
 //                              OpenMP Directive Emission
 //===----------------------------------------------------------------------===//
diff --git a/clang/lib/CodeGen/CodeGenFunction.h b/clang/lib/CodeGen/CodeGenFunction.h
index e1e687af6a781b..4ad4b96767f795 100644
--- a/clang/lib/CodeGen/CodeGenFunction.h
+++ b/clang/lib/CodeGen/CodeGenFunction.h
@@ -3639,6 +3639,9 @@ class CodeGenFunction : public CodeGenTypeCache {
   Address GenerateCapturedStmtArgument(const CapturedStmt &S);
   llvm::Function *GenerateOpenMPCapturedStmtFunction(const CapturedStmt &S,
                                                      SourceLocation Loc);
+  llvm::Function *
+  GenerateOpenMPCapturedStmtFunctionAggregate(const CapturedStmt &S,
+                                              SourceLocation Loc);
   void GenerateOpenMPCapturedVars(const CapturedStmt &S,
                                   SmallVectorImpl<llvm::Value *> &CapturedVars);
   void emitOMPSimpleStore(LValue LVal, RValue RVal, QualType RValTy,
diff --git a/clang/test/OpenMP/declare_target_codegen.cpp b/clang/test/OpenMP/declare_target_codegen.cpp
index ba93772ede3e8e..81116c6617b5bd 100644
--- a/clang/test/OpenMP/declare_target_codegen.cpp
+++ b/clang/test/OpenMP/declare_target_codegen.cpp
@@ -150,7 +150,7 @@ int bar() { return 1 + foo() + bar() + baz1() + baz2(); }
 int maini1() {
   int a;
   static long aa = 32 + bbb + ccc + fff + ggg;
-// CHECK-DAG: define weak_odr protected void @__omp_offloading_{{.*}}maini1{{.*}}_l[[@LINE+1]](ptr {{.*}}, ptr noundef nonnull align {{[0-9]+}} dereferenceable({{[0-9]+}}) %{{.*}}, i64 {{.*}}, i64 {{.*}})
+// CHECK-DAG: define weak_odr protected void @__omp_offloading_{{.*}}maini1{{.*}}_l[[@LINE+1]](ptr {{.*}}, ptr {{.*}})
 #pragma omp target map(tofrom \
                        : a, b)
   {
@@ -163,7 +163,7 @@ int maini1() {
 
 int baz3() { return 2 + baz2(); }
 int baz2() {
-// CHECK-DAG: define weak_odr protected void @__omp_offloading_{{.*}}baz2{{.*}}_l[[@LINE+1]](ptr {{.*}}, i64 {{.*}})
+// CHECK-DAG: define weak_odr protected void @__omp_offloading_{{.*}}baz2{{.*}}_l[[@LINE+1]](ptr {{.*}}, ptr {{.*}})
 #pragma omp target parallel
   ++c;
   return 2 + baz3();
@@ -175,7 +175,7 @@ static __typeof(create) __t_create __attribute__((__weakref__("__create")));
 
 int baz5() {
   bool a;
-// CHECK-DAG: define weak_odr protected void @__omp_offloading_{{.*}}baz5{{.*}}_l[[@LINE+1]](ptr {{.*}}, i64 {{.*}})
+// CHECK-DAG: define weak_odr protected void @__omp_offloading_{{.*}}baz5{{.*}}_l[[@LINE+1]](ptr {{.*}}, ptr {{.*}})
 #pragma omp target
   a = __extension__(void *) & __t_create != 0;
   return a;
diff --git a/clang/test/OpenMP/declare_target_link_codegen.cpp b/clang/test/OpenMP/declare_target_link_codegen.cpp
index 189c9ac59c153c..ba63a4bc543476 100644
--- a/clang/test/OpenMP/declare_target_link_codegen.cpp
+++ b/clang/test/OpenMP/declare_target_link_codegen.cpp
@@ -52,7 +52,7 @@ int maini1() {
   return 0;
 }
 
-// DEVICE: define weak_odr protected void @__omp_offloading_{{.*}}_{{.*}}maini1{{.*}}_l44(ptr {{[^,]+}}, ptr noundef nonnull align {{[0-9]+}} dereferenceable{{[^,]*}}
+// DEVICE: define weak_odr protected void @__omp_offloading_{{.*}}_{{.*}}maini1{{.*}}_l44(ptr {{[^,]+}}, ptr {{[^,]*}}
 // DEVICE: [[C_REF:%.+]] = load ptr, ptr @c_decl_tgt_ref_ptr,
 // DEVICE: [[C:%.+]] = load i32, ptr [[C_REF]],
 // DEVICE: store i32 [[C]], ptr %
diff --git a/clang/test/OpenMP/distribute_codegen.cpp b/clang/test/OpenMP/distribute_codegen.cpp
index 34d14c89fedaed..aaa28980839668 100644
--- a/clang/test/OpenMP/distribute_codegen.cpp
+++ b/clang/test/OpenMP/distribute_codegen.cpp
@@ -1947,19 +1947,18 @@ int fint(void) { return ftemplate<int>(); }
 //
 //
 // CHECK17-LABEL: define {{[^@]+}}@{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z23without_schedule_clausePfS_S_S__l56
-// CHECK17-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], ptr noundef [[A:%.*]], ptr noundef [[B:%.*]], ptr noundef [[C:%.*]], ptr noundef [[D:%.*]]) #[[ATTR0:[0-9]+]] {
+// CHECK17-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], ptr noalias noundef [[__CONTEXT:%.*]]) #[[ATTR0:[0-9]+]] {
 // CHECK17-NEXT:  entry:
 // CHECK17-NEXT:    [[DYN_PTR_ADDR:%.*]] = alloca ptr, align 8
-// CHECK17-NEXT:    [[A_ADDR:%.*]] = alloca ptr, align 8
-// CHECK17-NEXT:    [[B_ADDR:%.*]] = alloca ptr, align 8
-// CHECK17-NEXT:    [[C_ADDR:%.*]] = alloca ptr, align 8
-// CHECK17-NEXT:    [[D_ADDR:%.*]] = alloca ptr, align 8
+// CHECK17-NEXT:    [[__CONTEXT_ADDR:%.*]] = alloca ptr, align 8
 // CHECK17-NEXT:    store ptr [[DYN_PTR]], ptr [[DYN_PTR_ADDR]], align 8
-// CHECK17-NEXT:    store ptr [[A]], ptr [[A_ADDR]], align 8
-// CHECK17-NEXT:    store ptr [[B]], ptr [[B_ADDR]], align 8
-// CHECK17-NEXT:    store ptr [[C]], ptr [[C_ADDR]], align 8
-// CHECK17-NEXT:    store ptr [[D]], ptr [[D_ADDR]], align 8
-// CHECK17-NEXT:    call void (ptr, i32, ptr, ...) @__kmpc_fork_teams(ptr @[[GLOB2:[0-9]+]], i32 4, ptr @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z23without_schedule_clausePfS_S_S__l56.omp_outlined, ptr [[A_ADDR]], ptr [[B_ADDR]], ptr [[C_ADDR]], ptr [[D_ADDR]])
+// CHECK17-NEXT:    store ptr [[__CONTEXT]], ptr [[__CONTEXT_ADDR]], align 8
+// CHECK17-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[__CONTEXT_ADDR]], align 8
+// CHECK17-NEXT:    [[TMP1:%.*]] = getelementptr inbounds [[STRUCT_ANON:%.*]], ptr [[TMP0]], i32 0, i32 0
+// CHECK17-NEXT:    [[TMP2:%.*]] = getelementptr inbounds [[STRUCT_ANON]], ptr [[TMP0]], i32 0, i32 1
+// CHECK17-NEXT:    [[TMP3:%.*]] = getelementptr inbounds [[STRUCT_ANON]], ptr [[TMP0]], i32 0, i32 2
+// CHECK17-NEXT:    [[TMP4:%.*]] = getelementptr inbounds [[STRUCT_ANON]], ptr [[TMP0]], i32 0, i32 3
+// CHECK17-NEXT:    call void (ptr, i32, ptr, ...) @__kmpc_fork_teams(ptr @[[GLOB2:[0-9]+]], i32 4, ptr @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z23without_schedule_clausePfS_S_S__l56.omp_outlined, ptr [[TMP1]], ptr [[TMP2]], ptr [[TMP3]], ptr [[TMP4]])
 // CHECK17-NEXT:    ret void
 //
 //
@@ -2058,19 +2057,18 @@ int fint(void) { return ftemplate<int>(); }
 //
 //
 // CHECK17-LABEL: define {{[^@]+}}@{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z18static_not_chunkedPfS_S_S__l68
-// CHECK17-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], ptr noundef [[A:%.*]], ptr noundef [[B:%.*]], ptr noundef [[C:%.*]], ptr noundef [[D:%.*]]) #[[ATTR0]] {
+// CHECK17-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], ptr noalias noundef [[__CONTEXT:%.*]]) #[[ATTR0]] {
 // CHECK17-NEXT:  entry:
 // CHECK17-NEXT:    [[DYN_PTR_ADDR:%.*]] = alloca ptr, align 8
-// CHECK17-NEXT:    [[A_ADDR:%.*]] = alloca ptr, align 8
-// CHECK17-NEXT:    [[B_ADDR:%.*]] = alloca ptr, align 8
-// CHECK17-NEXT:    [[C_ADDR:%.*]] = alloca ptr, align 8
-// CHECK17-NEXT:    [[D_ADDR:%.*]] = alloca ptr, align 8
+// CHECK17-NEXT:    [[__CONTEXT_ADDR:%.*]] = alloca ptr, align 8
 // CHECK17-NEXT:    store ptr [[DYN_PTR]], ptr [[DYN_PTR_ADDR]], align 8
-// CHECK17-NEXT:    store ptr [[A]], ptr [[A_ADDR]], align 8
-// CHECK17-NEXT:    store ptr [[B]], ptr [[B_ADDR]], align 8
-// CHECK17-NEXT:    store ptr [[C]], ptr [[C_ADDR]], align 8
-// CHECK17-NEXT:    store ptr [[D]], ptr [[D_ADDR]], align 8
-// CHECK17-NEXT:    call void (ptr, i32, ptr, ...) @__kmpc_fork_teams(ptr @[[GLOB2]], i32 4, ptr @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z18static_not_chunkedPfS_S_S__l68.omp_outlined, ptr [[A_ADDR]], ptr [[B_ADDR]], ptr [[C_ADDR]], ptr [[D_ADDR]])
+// CHECK17-NEXT:    store ptr [[__CONTEXT]], ptr [[__CONTEXT_ADDR]], align 8
+// CHECK17-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[__CONTEXT_ADDR]], align 8
+// CHECK17-NEXT:    [[TMP1:%.*]] = getelementptr inbounds [[STRUCT_ANON_0:%.*]], ptr [[TMP0]], i32 0, i32 0
+// CHECK17-NEXT:    [[TMP2:%.*]] = getelementptr inbounds [[STRUCT_ANON_0]], ptr [[TMP0]], i32 0, i32 1
+// CHECK17-NEXT:    [[TMP3:%.*]] = getelementptr inbounds [[STRUCT_ANON_0]], ptr [[TMP0]], i32 0, i32 2
+// CHECK17-NEXT:    [[TMP4:%.*]] = getelementptr inbounds [[STRUCT_ANON_0]], ptr [[TMP0]], i32 0, i32 3
+// CHECK17-NEXT:    call void (ptr, i32, ptr, ...) @__kmpc_fork_teams(ptr @[[GLOB2]], i32 4, ptr @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z18static_not_chunkedPfS_S_S__l68.omp_outlined, ptr [[TMP1]], ptr [[TMP2]], ptr [[TMP3]], ptr [[TMP4]])
 // CHECK17-NEXT:    ret void
 //
 //
@@ -2169,19 +2167,18 @@ int fint(void) { return ftemplate<int>(); }
 //
 //
 // CHECK17-LABEL: define {{[^@]+}}@{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z14static_chunkedPfS_S_S__l80
-// CHECK17-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], ptr noundef [[A:%.*]], ptr noundef [[B:%.*]], ptr noundef [[C:%.*]], ptr noundef [[D:%.*]]) #[[ATTR0]] {
+// CHECK17-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], ptr noalias noundef [[__CONTEXT:%.*]]) #[[ATTR0]] {
 // CHECK17-NEXT:  entry:
 // CHECK17-NEXT:    [[DYN_PTR_ADDR:%.*]] = alloca ptr, align 8
-// CHECK17-NEXT:    [[A_ADDR:%.*]] = alloca ptr, align 8
-// CHECK17-NEXT:    [[B_ADDR:%.*]] = alloca ptr, align 8
-// CHECK17-NEXT:    [[C_ADDR:%.*]] = alloca ptr, align 8
-// CHECK17-NEXT:    [[D_ADDR:%.*]] = alloca ptr, align 8
+// CHECK17-NEXT:    [[__CONTEXT_ADDR:%.*]] = alloca ptr, align 8
 // CHECK17-NEXT:    store ptr [[DYN_PTR]], ptr [[DYN_PTR_ADDR]], align 8
-// CHECK17-NEXT:    store ptr [[A]], ptr [[A_ADDR]], align 8
-// CHECK17-NEXT:    store ptr [[B]], ptr [[B_ADDR]], align 8
-// CHECK17-NEXT:    store ptr [[C]], ptr [[C_ADDR]], align 8
-// CHECK17-NEXT:    store ptr [[D]], ptr [[D_ADDR]], align 8
-// CHECK17-NEXT:    call void (ptr, i32, ptr, ...) @__kmpc_fork_teams(ptr @[[GLOB2]], i32 4, ptr @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z14static_chunkedPfS_S_S__l80.omp_outlined, ptr [[A_ADDR]], ptr [[B_ADDR]], ptr [[C_ADDR]], ptr [[D_ADDR]])
+// CHECK17-NEXT:    store ptr [[__CONTEXT]], ptr [[__CONTEXT_ADDR]], align 8
+// CHECK17-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[__CONTEXT_ADDR]], align 8
+// CHECK17-NEXT:    [[TMP1:%.*]] = getelementptr inbounds [[STRUCT_ANON_1:%.*]], ptr [[TMP0]], i32 0, i32 0
+// CHECK17-NEXT:    [[TMP2:%.*]] = getelementptr inbounds [[STRUCT_ANON_1]], ptr [[TMP0]], i32 0, i32 1
+// CHECK17-NEXT:    [[TMP3:%.*]] = getelementptr inbounds [[STRUCT_ANON_1]], ptr [[TMP0]], i32 0, i32 2
+// CHECK17-NEXT:    [[TMP4:%.*]] = getelementptr inbounds [[STRUCT_ANON_1]], ptr [[TMP0]], i32 0, i32 3
+// CHECK17-NEXT:    call void (ptr, i32, ptr, ...) @__kmpc_fork_teams(ptr @[[GLOB2]], i32 4, ptr @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z14static_chunkedPfS_S_S__l80.omp_outlined, ptr [[TMP1]], ptr [[TMP2]], ptr [[TMP3]], ptr [[TMP4]])
 // CHECK17-NEXT:    ret void
 //
 //
@@ -2297,13 +2294,18 @@ int fint(void) { return ftemplate<int>(); }
 //
 //
 // CHECK17-LABEL: define {{[^@]+}}@{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z12test_precondv_l92
-// CHECK17-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], i64 noundef [[A:%.*]]) #[[ATTR0]] {
+// CHECK17-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], ptr noalias noundef [[__CONTEXT:%.*]]) #[[ATTR0]] {
 // CHECK17-NEXT:  entry:
 // CHECK17-NEXT:    [[DYN_PTR_ADDR:%.*]] = alloca ptr, align 8
-// CHECK17-NEXT:    [[A_ADDR:%.*]] = alloca i64, align 8
+// CHECK17-NEXT:    [[__CONTEXT_ADDR:%.*]] = alloca ptr, align 8
+// CHECK17-NEXT:    [[A:%.*]] = alloca i8, align 1
 // CHECK17-NEXT:    store ptr [[DYN_PTR]], ptr [[DYN_PTR_ADDR]], align 8
-// CHECK17-NEXT:    store i64 [[A]], ptr [[A_ADDR]], align 8
-// CHECK17-NEXT:    call void (ptr, i32, ptr, ...) @__kmpc_fork_teams(ptr @[[GLOB2]], i32 1, ptr @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z12test_precondv_l92.omp_outlined, ptr [[A_ADDR]])
+// CHECK17-NEXT:    store ptr [[__CONTEXT]], ptr [[__CONTEXT_ADDR]], align 8
+// CHECK17-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[__CONTEXT_ADDR]], align 8
+// CHECK17-NEXT:    [[TMP1:%.*]] = getelementptr inbounds [[STRUCT_ANON_2:%.*]], ptr [[TMP0]], i32 0, i32 0
+// CHECK17-NEXT:    [[TMP2:%.*]] = load i8, ptr [[TMP1]], align 1
+// CHECK17-NEXT:    store i8 [[TMP2]], ptr [[A]], align 1
+// CHECK17-NEXT:    call void (ptr, i32, ptr, ...) @__kmpc_fork_teams(ptr @[[GLOB2]], i32 1, ptr @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z12test_precondv_l92.omp_outlined, ptr [[A]])
 // CHECK17-NEXT:    ret void
 //
 //
@@ -2401,13 +2403,18 @@ int fint(void) { return ftemplate<int>(); }
 //
 //
 // CHECK17-LABEL: define {{[^@]+}}@{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z9ftemplateIiET_v_l108
-// CHECK17-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], i64 noundef [[AA:%.*]]) #[[ATTR0]] {
+// CHECK17-SAME: (ptr noalias noundef [[DYN_PTR:%.*]], ptr noalias noundef [[__CONTEXT:%.*]]) #[[ATTR0]] {
 // CHECK17-NEXT:  entry:
 // CHECK17-NEXT:    [[DYN_PTR_ADDR:%.*]] = alloca ptr, align 8
-// CHECK17-NEXT:    [[AA_ADDR:%.*]] = alloca i64, align 8
+// CHECK17-NEXT:    [[__CONTEXT_ADDR:%.*]] = alloca ptr, align 8
+// CHECK17-NEXT:    [[AA:%.*]] = alloca i16, align 2
 // CHECK17-NEXT:    store ptr [[DYN_PTR]], ptr [[DYN_PTR_ADDR]], align 8
-// CHECK17-NEXT:    store i64 [[AA]], ptr [[AA_ADDR]], align 8
-// CHECK17-NEXT:    call void (ptr, i32, ptr, ...) @__kmpc_fork_teams(ptr @[[GLOB2]], i32 1, ptr @{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z9ftemplateIiET_v_l108.omp_outlined, ptr [[AA_ADDR]])
+// CHECK17-NEXT:    store ptr [[__CONTEXT]], ptr [[__CONTEXT_ADDR]], align 8
+// C...
[truncated]

@jhuber6
Copy link
Contributor Author

jhuber6 commented May 6, 2024

I'm unsure how to resolve the issue of CGF.EmitScalarExpr(NumTeams) not returning the correct value now. For the following code

#include <stdio.h>
#include <stdlib.h>

int main() {
  int Teams = 10;
#pragma omp target teams distribute parallel for num_teams(Teams)
  for (int i = 0; i < 1; ++i)
    ;

  return 0;
}

I get this LLVM-IR, which suggests that it's reading beyond the expected __context struct.

; Function Attrs: convergent noinline norecurse nounwind optnone uwtable
define weak_odr protected void @__omp_offloading_10302_af886a3_main_l9(ptr noalias noundef %dyn_ptr, ptr noalias noundef %__context) #0 {
entry:
  %dyn_ptr.addr = alloca ptr, align 8
  %__context.addr = alloca ptr, align 8
  %.capture_expr. = alloca i32, align 4
  %0 = call i32 @__kmpc_global_thread_num(ptr @3)
  store ptr %dyn_ptr, ptr %dyn_ptr.addr, align 8
  store ptr %__context, ptr %__context.addr, align 8
  %1 = load ptr, ptr %__context.addr, align 8
  %2 = getelementptr inbounds %struct.anon, ptr %1, i32 0, i32 0
  %3 = load i32, ptr %2, align 4
  store i32 %3, ptr %.capture_expr., align 4
  %4 = load i32, ptr %.capture_expr., align 4
  call void @__kmpc_push_num_teams(ptr @3, i32 %0, i32 %4, i32 0)
  call void (ptr, i32, ptr, ...) @__kmpc_fork_teams(ptr @3, i32 0, ptr @__omp_offloading_10302_af886a3_main_l9.omp_outlined)
  ret void
}

Any idea how to resolve this? I'm assuming the way we do this now is no longer valid somehow because of the struct indirection.

@alexey-bataev
Copy link
Member

= load i32, ptr %.capture_expr., align 4

Why do you think it reads beyond __context? %2 = getelementptr inbounds %struct.anon, ptr %1, i32 0, i32 0 points to the first element in the __context, if I'm not missing something. If it has the wrong value, looks like it is not written correctly

if (CGM.getLangOpts().OpenMPIsTargetDevice && !isGPU())
return CGF.GenerateOpenMPCapturedStmtFunctionAggregate(
CS, D.getBeginLoc());
else
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
else

LocalAddrs.insert({FD, {Var, CopyAddr}});
} else {
// If 'this' is captured, load it into CXXThisValue.
assert(I->capturesThis());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a message

@jhuber6
Copy link
Contributor Author

jhuber6 commented May 7, 2024

= load i32, ptr %.capture_expr., align 4

Why do you think it reads beyond __context? %2 = getelementptr inbounds %struct.anon, ptr %1, i32 0, i32 0 points to the first element in the __context, if I'm not missing something. If it has the wrong value, looks like it is not written correctly

I think I copied the wrong code somehow,

; Function Attrs: convergent noinline norecurse nounwind optnone uwtable
define weak_odr protected void @__omp_offloading_10302_adc9471_main_l10(ptr noalias noundef %dyn_ptr, ptr noalias noundef %__context) #0 {
entry:
  %dyn_ptr.addr = alloca ptr, align 8
  %__context.addr = alloca ptr, align 8
  %Teams = alloca i32, align 4
  %Threads = alloca i32, align 4
  %.capture_expr. = alloca i32, align 4
  %.capture_expr.1 = alloca i32, align 4
  %Teams.casted = alloca i64, align 8
  %Threads.casted = alloca i64, align 8
  %0 = call i32 @__kmpc_global_thread_num(ptr @3)
  store ptr %dyn_ptr, ptr %dyn_ptr.addr, align 8
  store ptr %__context, ptr %__context.addr, align 8
  %1 = load ptr, ptr %__context.addr, align 8
  %2 = getelementptr inbounds %struct.anon, ptr %1, i32 0, i32 0
  %3 = load i32, ptr %2, align 4
  store i32 %3, ptr %Teams, align 4
  %4 = getelementptr inbounds %struct.anon, ptr %1, i32 0, i32 1
  %5 = load i32, ptr %4, align 4
  store i32 %5, ptr %Threads, align 4
  %6 = getelementptr inbounds %struct.anon, ptr %1, i32 0, i32 2
  %7 = load i32, ptr %6, align 4
  store i32 %7, ptr %.capture_expr., align 4
  %8 = getelementptr inbounds %struct.anon, ptr %1, i32 0, i32 3
  %9 = load i32, ptr %8, align 4
  store i32 %9, ptr %.capture_expr.1, align 4
  %10 = load i32, ptr %.capture_expr., align 4
  %11 = load i32, ptr %.capture_expr.1, align 4
  call void @__kmpc_push_num_teams(ptr @3, i32 %0, i32 %10, i32 %11)
  %12 = load i32, ptr %Teams, align 4
  store i32 %12, ptr %Teams.casted, align 4
  %13 = load i64, ptr %Teams.casted, align 8
  %14 = load i32, ptr %Threads, align 4
  store i32 %14, ptr %Threads.casted, align 4
  %15 = load i64, ptr %Threads.casted, align 8
  call void (ptr, i32, ptr, ...) @__kmpc_fork_teams(ptr @3, i32 2, ptr @__omp_offloading_10302_adc9471_main_l10.omp_outlined, i64 %13, i64 %15)
  ret void
}

This is what I get from the corresponding C code.

#include <stdio.h>
#include <assert.h>
#include <stdlib.h>

int main() {
  int Threads = 6;
  int Teams = 10;

  long unsigned s = 0;
#pragma omp target teams distribute parallel for num_teams(Teams)              \
    thread_limit(Threads)
  for (int i = 0; i < Threads * Teams; ++i) {
    assert(Teams == 10);
  }

  return 0;
}

When I compile run it, I get the following. So it warns on some nonsense team value (It will be even more corrupt with other cases, but this was the simplest I could get).

> clang malloc.c -fopenmp -fopenmp-targets=x86_64-pc-linux-gnu                                         
> ./a.out 
OMP: Warning #96: Cannot form a team with 48 threads, using 21 instead.
OMP: Hint Consider unsetting KMP_DEVICE_THREAD_LIMIT (KMP_ALL_THREADS), KMP_TEAMS_THREAD_LIMIT, and OMP_THREAD_LIMIT (if any are set).

The LLVM-IR is confusing to me because it's doing a GEP up to 3, which is suggesting that the Teams / Threads values are appended but the number of arguments. So, in the runtime for this case it states there are 5 arguments, first is the dyn_ptr that's alwasy there, then the two threads / teams. And then these extra threads / teams. This then causes mayhem because only the first two are set to what we expect them to be.

@alexey-bataev
Copy link
Member

struct.anon

Can you provide full IR dump here?

@jhuber6
Copy link
Contributor Author

jhuber6 commented May 7, 2024

struct.anon

Can you provide full IR dump here?

https://godbolt.org/z/48h5s3W6v

@alexey-bataev
Copy link
Member

struct.anon

Can you provide full IR dump here?

https://godbolt.org/z/48h5s3W6v

It does not look like the issue of the target code, I don't see any wrong access for __context. Мост probably something wrong with the host code/runtime.

@jhuber6
Copy link
Contributor Author

jhuber6 commented May 7, 2024

struct.anon

Can you provide full IR dump here?

https://godbolt.org/z/48h5s3W6v

It does not look like the issue of the target code, I don't see any wrong access for __context. Мост probably something wrong with the host code/runtime.

Yeah, I think that's correct. Looking at the IR it seems to add the two extra arguments and call them as I'd expect, but for some reason it gets corrupted in the runtime layer. It might be doing something weird with the arguments.

@alexey-bataev
Copy link
Member

struct.anon

Can you provide full IR dump here?

https://godbolt.org/z/48h5s3W6v

It does not look like the issue of the target code, I don't see any wrong access for __context. Мост probably something wrong with the host code/runtime.

Yeah, I think that's correct. Looking at the IR it seems to add the two extra arguments and call them as I'd expect, but for some reason it gets corrupted in the runtime layer. It might be doing something weird with the arguments.

Maybe. The message is emitted on the host, so there is something wrong with the host code or runtime library.

@jhuber6
Copy link
Contributor Author

jhuber6 commented May 7, 2024

Maybe. The message is emitted on the host, so there is something wrong with the host code or runtime library.

This might be some issue with the host codegen actually. The following modified the runtime to print the arguments as-seen by the runtime.

> clang malloc.c -fopenmp -fopenmp-targets=x86_64-pc-linux-gnu                                                    
> ./a.out 
10
131675107360774
131675107360778
110294760161286
18446744073709551615
10
131675107360774
131675107360778
110294760161286
OMP: Warning #96: Cannot form a team with 48 threads, using 21 instead.
OMP: Hint Consider unsetting KMP_DEVICE_THREAD_LIMIT (KMP_ALL_THREADS), KMP_TEAMS_THREAD_LIMIT, and OMP_THREAD_LIMIT (if any are set).
> clang malloc.c -fopenmp -fopenmp-targets=x86_64-pc-linux-gnu -O3
> ./a.out 
10
6
10
6
18446744073709551615
10
6
10
6

With optimization on, I see what I expect. With -O0 it seems to give me garbage. Looking at the ASM also suggests that only the 0x10 value is written for some reason? https://godbolt.org/z/86hTjjaa8 is the host-IR I get without optimizations.

@jhuber6
Copy link
Contributor Author

jhuber6 commented May 7, 2024

I'm getting the same kind of output on main, but the warning is mysteriously absent. Same results for GPU offloading.

@alexey-bataev
Copy link
Member

Hmm, hard to tell, need to debug it.

@jhuber6
Copy link
Contributor Author

jhuber6 commented May 7, 2024

Hmm, hard to tell, need to debug it.

Somehow when I print it in the runtime it shows up as garbage, but the actual region seems to get correct values. There shouldn't be anything in-between the arguments I'm printing and the kernel launch however so I'm stumped.

@jhuber6 jhuber6 changed the title [WIP][OpenMP] Remove dependency on libffi from offloading runtime [OpenMP] Remove dependency on libffi from offloading runtime May 9, 2024
@jhuber6
Copy link
Contributor Author

jhuber6 commented May 9, 2024

I hacked around it in the runtime itself. Obviously this is very OpenMP specific behavior but so was the old method. Passes all tests now.

jhuber6 added a commit to jhuber6/llvm-project that referenced this pull request May 22, 2024
Summary:
We previously had multiple options for this, this patch replaces them
with `LIBOMPTARGET_DLOPEN_PLUGINS=` to be a list of plugins to
dynamically use. It defaults to everything right now. This ignores the
`host` plugin because the `libffi` dependency is going to be removed
soon hopefully in llvm#91264.
jhuber6 added a commit that referenced this pull request May 22, 2024
Summary:
We previously had multiple options for this, this patch replaces them
with `LIBOMPTARGET_DLOPEN_PLUGINS=` to be a list of plugins to
dynamically use. It defaults to everything right now. This ignores the
`host` plugin because the `libffi` dependency is going to be removed
soon hopefully in #91264.
Summary:
This patch attempts to remove the dependency on `libffi` by instead
emitting the host / CPU kernels using an aggregate struct made from the
captured context. This callows us to have a fixed function prototype we
can call directly rather than requiring an extra library to decode the
ABI to call a function with N (non variadic) arguments.
@jhuber6
Copy link
Contributor Author

jhuber6 commented May 29, 2024

ping

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:codegen clang:openmp OpenMP related changes to Clang clang Clang issues not falling into any other category offload
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[OpenMP] Libomptarget depends on libffi
3 participants