Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HIP] support 128 bit int division #71978

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

yxsamliu
Copy link
Collaborator

Currently nvcc supports 128 bit int division in device code. This patch adds support of 128 bit int division to HIP.

It builds lib functions for 128 bit division in compiler-rt for amdgcn target.

Then links compiler-rt with -mlink-bitcode-file.

It adds support of archive of bitcode to -mlink-bitcode-file.

It adds support of call of lib function in amdgcn backend.

Fixes: #71223

Fixes: SWDEV-426193

@llvmbot llvmbot added clang Clang issues not falling into any other category compiler-rt backend:AMDGPU clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' clang:codegen labels Nov 10, 2023
@llvmbot
Copy link
Collaborator

llvmbot commented Nov 10, 2023

@llvm/pr-subscribers-clang-codegen
@llvm/pr-subscribers-clang
@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-clang-driver

Author: Yaxun (Sam) Liu (yxsamliu)

Changes

Currently nvcc supports 128 bit int division in device code. This patch adds support of 128 bit int division to HIP.

It builds lib functions for 128 bit division in compiler-rt for amdgcn target.

Then links compiler-rt with -mlink-bitcode-file.

It adds support of archive of bitcode to -mlink-bitcode-file.

It adds support of call of lib function in amdgcn backend.

Fixes: #71223

Fixes: SWDEV-426193


Full diff: https://github.com/llvm/llvm-project/pull/71978.diff

9 Files Affected:

  • (modified) clang/lib/CodeGen/CodeGenAction.cpp (+91-12)
  • (modified) clang/lib/Driver/ToolChains/HIPAMD.cpp (+5)
  • (modified) compiler-rt/cmake/Modules/CompilerRTUtils.cmake (+2)
  • (modified) compiler-rt/cmake/base-config-ix.cmake (+5)
  • (modified) compiler-rt/cmake/builtin-config-ix.cmake (+2-1)
  • (modified) compiler-rt/lib/builtins/CMakeLists.txt (+16)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp (+3-1)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+1)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+11-3)
diff --git a/clang/lib/CodeGen/CodeGenAction.cpp b/clang/lib/CodeGen/CodeGenAction.cpp
index a31a271ed77d1ca..f5a3274fbdd2c3e 100644
--- a/clang/lib/CodeGen/CodeGenAction.cpp
+++ b/clang/lib/CodeGen/CodeGenAction.cpp
@@ -41,6 +41,7 @@
 #include "llvm/IRReader/IRReader.h"
 #include "llvm/LTO/LTOBackend.h"
 #include "llvm/Linker/Linker.h"
+#include "llvm/Object/Archive.h"
 #include "llvm/Pass.h"
 #include "llvm/Support/MemoryBuffer.h"
 #include "llvm/Support/SourceMgr.h"
@@ -937,27 +938,105 @@ bool CodeGenAction::loadLinkModules(CompilerInstance &CI) {
 
   for (const CodeGenOptions::BitcodeFileToLink &F :
        CI.getCodeGenOpts().LinkBitcodeFiles) {
-    auto BCBuf = CI.getFileManager().getBufferForFile(F.Filename);
-    if (!BCBuf) {
+
+    auto BCBufOrErr = CI.getFileManager().getBufferForFile(F.Filename);
+    if (!BCBufOrErr) {
       CI.getDiagnostics().Report(diag::err_cannot_open_file)
-          << F.Filename << BCBuf.getError().message();
+          << F.Filename << BCBufOrErr.getError().message();
       LinkModules.clear();
       return true;
     }
 
+    auto &BCBuf = *BCBufOrErr;
+
     Expected<std::unique_ptr<llvm::Module>> ModuleOrErr =
-        getOwningLazyBitcodeModule(std::move(*BCBuf), *VMContext);
-    if (!ModuleOrErr) {
-      handleAllErrors(ModuleOrErr.takeError(), [&](ErrorInfoBase &EIB) {
+        getOwningLazyBitcodeModule(std::move(BCBuf), *VMContext);
+
+    if (ModuleOrErr) {
+      LinkModules.push_back({std::move(ModuleOrErr.get()), F.PropagateAttrs,
+                             F.Internalize, F.LinkFlags});
+      continue;
+    } else {
+      // If parsing as bitcode failed, clear the error and try to parse as an
+      // archive.
+      handleAllErrors(ModuleOrErr.takeError(),
+                      [&](const llvm::ErrorInfoBase &EIB) {});
+
+      Expected<std::unique_ptr<llvm::object::Binary>> BinOrErr =
+          llvm::object::createBinary(BCBuf->getMemBufferRef(), VMContext);
+
+      if (!BinOrErr) {
+        handleAllErrors(BinOrErr.takeError(),
+                        [&](const llvm::ErrorInfoBase &EIB) {
+                          CI.getDiagnostics().Report(diag::err_cannot_open_file)
+                              << F.Filename << EIB.message();
+                        });
+        LinkModules.clear();
+        return true;
+      }
+
+      std::unique_ptr<llvm::object::Binary> &Bin = *BinOrErr;
+
+      if (Bin->isArchive()) {
+        llvm::object::Archive *Archive =
+            llvm::cast<llvm::object::Archive>(Bin.get());
+        Error Err = Error::success();
+
+        for (auto &Child : Archive->children(Err)) {
+          Expected<llvm::MemoryBufferRef> ChildBufOrErr =
+              Child.getMemoryBufferRef();
+          if (!ChildBufOrErr) {
+            handleAllErrors(
+                ChildBufOrErr.takeError(), [&](const llvm::ErrorInfoBase &EIB) {
+                  CI.getDiagnostics().Report(diag::err_cannot_open_file)
+                      << F.Filename << EIB.message();
+                });
+            continue;
+          }
+          auto ChildBuffer = llvm::MemoryBuffer::getMemBufferCopy(
+              ChildBufOrErr->getBuffer(), ChildBufOrErr->getBufferIdentifier());
+
+          if (!ChildBuffer) {
+            handleAllErrors(
+                ChildBufOrErr.takeError(), [&](const llvm::ErrorInfoBase &EIB) {
+                  CI.getDiagnostics().Report(diag::err_cannot_open_file)
+                      << F.Filename << EIB.message();
+                });
+            continue;
+          }
+
+          Expected<std::unique_ptr<llvm::Module>> ChildModuleOrErr =
+              getOwningLazyBitcodeModule(std::move(ChildBuffer), *VMContext);
+          if (!ChildModuleOrErr) {
+            handleAllErrors(
+                ChildModuleOrErr.takeError(),
+                [&](const llvm::ErrorInfoBase &EIB) {
+                  CI.getDiagnostics().Report(diag::err_cannot_open_file)
+                      << F.Filename << EIB.message();
+                });
+            continue;
+          }
+
+          LinkModules.push_back({std::move(ChildModuleOrErr.get()),
+                                 F.PropagateAttrs, F.Internalize, F.LinkFlags});
+        }
+        if (Err) {
+          CI.getDiagnostics().Report(diag::err_cannot_open_file)
+              << F.Filename << toString(std::move(Err));
+          LinkModules.clear();
+          return true;
+        }
+      } else {
+        // It's not an archive, and we failed to parse it as bitcode, so report
+        // an error.
         CI.getDiagnostics().Report(diag::err_cannot_open_file)
-            << F.Filename << EIB.message();
-      });
-      LinkModules.clear();
-      return true;
+            << F.Filename << "Unrecognized file format";
+        LinkModules.clear();
+        return true;
+      }
     }
-    LinkModules.push_back({std::move(ModuleOrErr.get()), F.PropagateAttrs,
-                           F.Internalize, F.LinkFlags});
   }
+
   return false;
 }
 
diff --git a/clang/lib/Driver/ToolChains/HIPAMD.cpp b/clang/lib/Driver/ToolChains/HIPAMD.cpp
index ccb36a6c846c806..2ea3c97136c2272 100644
--- a/clang/lib/Driver/ToolChains/HIPAMD.cpp
+++ b/clang/lib/Driver/ToolChains/HIPAMD.cpp
@@ -21,6 +21,7 @@
 #include "llvm/Support/Alignment.h"
 #include "llvm/Support/FileSystem.h"
 #include "llvm/Support/Path.h"
+#include "llvm/Support/VirtualFileSystem.h"
 #include "llvm/TargetParser/TargetParser.h"
 
 using namespace clang::driver;
@@ -403,6 +404,10 @@ HIPAMDToolChain::getDeviceLibs(const llvm::opt::ArgList &DriverArgs) const {
         BCLibs.emplace_back(AsanRTL, /*ShouldInternalize=*/false);
     }
 
+    auto BuiltinCRT = getCompilerRT(DriverArgs, "builtins");
+    if (getVFS().exists(BuiltinCRT))
+      BCLibs.emplace_back(BuiltinCRT, /*ShouldInternalize=*/false);
+
     // Add the HIP specific bitcode library.
     BCLibs.push_back(RocmInstallation->getHIPPath());
 
diff --git a/compiler-rt/cmake/Modules/CompilerRTUtils.cmake b/compiler-rt/cmake/Modules/CompilerRTUtils.cmake
index 25e7823716fc2f4..d0596a11c26a69a 100644
--- a/compiler-rt/cmake/Modules/CompilerRTUtils.cmake
+++ b/compiler-rt/cmake/Modules/CompilerRTUtils.cmake
@@ -456,6 +456,8 @@ function(get_compiler_rt_target arch variable)
       endif()
     endif()
     set(target "${arch}${triple_suffix}")
+  elseif(${arch} STREQUAL "amdgcn")
+    set(target "amdgcn-amd-amdhsa")
   else()
     set(target "${arch}${triple_suffix}")
   endif()
diff --git a/compiler-rt/cmake/base-config-ix.cmake b/compiler-rt/cmake/base-config-ix.cmake
index 908c8a40278cf0c..54adb48f445d96f 100644
--- a/compiler-rt/cmake/base-config-ix.cmake
+++ b/compiler-rt/cmake/base-config-ix.cmake
@@ -194,6 +194,11 @@ macro(test_targets)
     endif()
   endif()
 
+  set(COMPILER_RT_ENABLE_TARGET_AMDGCN OFF CACHE BOOL "Option to enable AMDGCN in Compiler RT")
+  if (COMPILER_RT_ENABLE_TARGET_AMDGCN)
+    add_default_target_arch("amdgcn")
+  endif()
+
   # Generate the COMPILER_RT_SUPPORTED_ARCH list.
   if(ANDROID)
     # Examine compiler output to determine target architecture.
diff --git a/compiler-rt/cmake/builtin-config-ix.cmake b/compiler-rt/cmake/builtin-config-ix.cmake
index 5ccc5d7a559b2ac..02db2f9ae2264f2 100644
--- a/compiler-rt/cmake/builtin-config-ix.cmake
+++ b/compiler-rt/cmake/builtin-config-ix.cmake
@@ -64,6 +64,7 @@ set(SPARCV9 sparcv9)
 set(WASM32 wasm32)
 set(WASM64 wasm64)
 set(VE ve)
+set(AMDGCN amdgcn)
 
 if(APPLE)
   set(ARM64 arm64 arm64e)
@@ -75,7 +76,7 @@ set(ALL_BUILTIN_SUPPORTED_ARCH
   ${X86} ${X86_64} ${ARM32} ${ARM64} ${AVR}
   ${HEXAGON} ${MIPS32} ${MIPS64} ${PPC32} ${PPC64}
   ${RISCV32} ${RISCV64} ${SPARC} ${SPARCV9}
-  ${WASM32} ${WASM64} ${VE} ${LOONGARCH64})
+  ${WASM32} ${WASM64} ${VE} ${LOONGARCH64} ${AMDGCN})
 
 include(CompilerRTUtils)
 include(CompilerRTDarwinUtils)
diff --git a/compiler-rt/lib/builtins/CMakeLists.txt b/compiler-rt/lib/builtins/CMakeLists.txt
index 360fdb0e99b57be..e7f2b370f50b271 100644
--- a/compiler-rt/lib/builtins/CMakeLists.txt
+++ b/compiler-rt/lib/builtins/CMakeLists.txt
@@ -553,6 +553,13 @@ set(aarch64_SOURCES
   aarch64/fp_mode.c
 )
 
+set(amdgcn_SOURCES
+  divti3.c
+  udivmodti4.c
+  truncdfbf2.c
+  truncsfbf2.c
+)
+
 if(COMPILER_RT_HAS_ASM_SME AND (COMPILER_RT_HAS_AUXV OR COMPILER_RT_BAREMETAL_BUILD))
   list(APPEND aarch64_SOURCES aarch64/sme-abi.S aarch64/sme-abi-init.c)
   message(STATUS "AArch64 SME ABI routines enabled")
@@ -838,6 +845,15 @@ else ()
         list(APPEND BUILTIN_CFLAGS_${arch} -fomit-frame-pointer -DCOMPILER_RT_ARMHF_TARGET)
       endif()
 
+      if (${arch} STREQUAL "amdgcn")
+        list(APPEND BUILTIN_CFLAGS_${arch}
+             --target=amdgcn-amd-amdhsa
+             -emit-llvm
+             -nogpuinc
+             -nogpulib
+             -Xclang -mcode-object-version=none )
+      endif()
+
       # For RISCV32, we must force enable int128 for compiling long
       # double routines.
       if(COMPILER_RT_ENABLE_SOFTWARE_INT128 OR "${arch}" STREQUAL "riscv32")
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index d4df9b6f49eb11c..e28d66fddad5aea 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -501,9 +501,11 @@ AMDGPUTargetLowering::AMDGPUTargetLowering(const TargetMachine &TM,
   setOperationAction(ISD::SELECT, MVT::v12f32, Promote);
   AddPromotedToType(ISD::SELECT, MVT::v12f32, MVT::v12i32);
 
-  // There are no libcalls of any kind.
   for (int I = 0; I < RTLIB::UNKNOWN_LIBCALL; ++I)
     setLibcallName(static_cast<RTLIB::Libcall>(I), nullptr);
+  // Supported compiler-rt libcalls should be enabled in compiler-rt for
+  // amdgcn first then added here.
+  setLibcallName(RTLIB::SDIV_I128, "__divti3");
 
   setSchedulingPreference(Sched::RegPressure);
   setJumpIsExpensive(true);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index 951ed9420594b19..c78a031b624ce82 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -596,6 +596,7 @@ static bool mustPreserveGV(const GlobalValue &GV) {
   if (const Function *F = dyn_cast<Function>(&GV))
     return F->isDeclaration() || F->getName().startswith("__asan_") ||
            F->getName().startswith("__sanitizer_") ||
+           F->getName() == "__divti3" ||
            AMDGPU::isEntryFunctionCC(F->getCallingConv());
 
   GV.removeDeadConstantUsers();
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 4681004d3ba74ff..56425d631291ef6 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -3426,8 +3426,9 @@ SDValue SITargetLowering::LowerCall(CallLoweringInfo &CLI,
                               "unsupported call to variadic function ");
   }
 
-  if (!CLI.CB)
-    report_fatal_error("unsupported libcall legalization");
+  if (!CLI.CB && Callee.getNode()->getOpcode() != ISD::ExternalSymbol)
+    report_fatal_error(
+        "unsupported libcall legalization: Callee function is unknown");
 
   if (IsTailCall && MF.getTarget().Options.GuaranteedTailCallOpt) {
     return lowerUnhandledCall(CLI, InVals,
@@ -3630,10 +3631,17 @@ SDValue SITargetLowering::LowerCall(CallLoweringInfo &CLI,
 
   std::vector<SDValue> Ops;
   Ops.push_back(Chain);
+  bool AddTargetGlobalAddr = true;
+  // Try to find the callee in the current module.
+  if (isa<ExternalSymbolSDNode>(Callee)) {
+    Callee = DAG.getSymbolFunctionGlobalAddress(Callee);
+    AddTargetGlobalAddr = false;
+  }
   Ops.push_back(Callee);
   // Add a redundant copy of the callee global which will not be legalized, as
   // we need direct access to the callee later.
-  if (GlobalAddressSDNode *GSD = dyn_cast<GlobalAddressSDNode>(Callee)) {
+  GlobalAddressSDNode *GSD = dyn_cast<GlobalAddressSDNode>(Callee);
+  if (GSD && AddTargetGlobalAddr) {
     const GlobalValue *GV = GSD->getGlobal();
     Ops.push_back(DAG.getTargetGlobalAddress(GV, DL, MVT::i64));
   } else {

@Artem-B
Copy link
Member

Artem-B commented Nov 10, 2023

Would it be feasible to consider switching to the new offloading driver mode and really link with the library instead? It may be a conveniently isolated use case with little/no existing users that would disrupt.

Copy link
Contributor

@arsenm arsenm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

128-bit division should already work, we have an IR integer division expansion for > 64-bit divides. I think moving towards getting the infrastructure to a place where we can link in compiler-rt binaries is a good thing, but I don't think we're in a position to actually enable that at this time. We still don't have everything necessary to provide object linking, which this seems to rely on

clang/lib/CodeGen/CodeGenAction.cpp Outdated Show resolved Hide resolved
clang/lib/CodeGen/CodeGenAction.cpp Outdated Show resolved Hide resolved
clang/lib/CodeGen/CodeGenAction.cpp Outdated Show resolved Hide resolved
clang/lib/CodeGen/CodeGenAction.cpp Outdated Show resolved Hide resolved
clang/lib/CodeGen/CodeGenAction.cpp Outdated Show resolved Hide resolved
@jhuber6
Copy link
Contributor

jhuber6 commented Nov 13, 2023

Would it be feasible to consider switching to the new offloading driver mode and really link with the library instead? It may be a conveniently isolated use case with little/no existing users that would disrupt.

I've thought a reasonable amount about a compiler-rt for GPUs. Right now it's a little difficult because of the issue of compatibility. We could do the traditional "Build the library N times for N architectures", but I'd like to think of something more intelligent in the future. The use of -mlink-builtin-bitcode handles this by more-or-less forcing correct attributes.

What this patch does is a little interesting though, having the clang driver pick apart archives has always seemed a little weird. We did it in the past for AMD's old handling of static libraries. There's still a lot of that code leftover I want to delete. I really need to sit down and allow HIP to work with the new driver.

I've been messing around with generic IR a bit, and I think what might work is LLVM-IR that intentionally leaves off target specific attributes and then introduce a pass that adds them in if missing before other optimizations are run. Then we may be able to investigate the use of i-funcs to resolve target specific branches once the architecture is known (once being linked in). I think @JonChesterfield was thinking about something to that effect as well.

@Artem-B
Copy link
Member

Artem-B commented Nov 13, 2023

I don't think we're in a position to actually enable that at this time. We still don't have everything necessary to provide object linking, which this seems to rely on

OK. IR it is.

@yxsamliu
Copy link
Collaborator Author

yxsamliu commented Dec 7, 2023

Would it be feasible to consider switching to the new offloading driver mode and really link with the library instead? It may be a conveniently isolated use case with little/no existing users that would disrupt.

It is needed by an important HIP app for which the new driver is not mature enough to compile, therefore we need to support this in the old driver.

Also, this change is about bitcode device libs, which I believe is common between the old and new drivers.

@yxsamliu
Copy link
Collaborator Author

yxsamliu commented Dec 7, 2023

Would it be feasible to consider switching to the new offloading driver mode and really link with the library instead? It may be a conveniently isolated use case with little/no existing users that would disrupt.

I've thought a reasonable amount about a compiler-rt for GPUs. Right now it's a little difficult because of the issue of compatibility. We could do the traditional "Build the library N times for N architectures", but I'd like to think of something more intelligent in the future. The use of -mlink-builtin-bitcode handles this by more-or-less forcing correct attributes.

What this patch does is a little interesting though, having the clang driver pick apart archives has always seemed a little weird. We did it in the past for AMD's old handling of static libraries. There's still a lot of that code leftover I want to delete. I really need to sit down and allow HIP to work with the new driver.

I've been messing around with generic IR a bit, and I think what might work is LLVM-IR that intentionally leaves off target specific attributes and then introduce a pass that adds them in if missing before other optimizations are run. Then we may be able to investigate the use of i-funcs to resolve target specific branches once the architecture is known (once being linked in). I think @JonChesterfield was thinking about something to that effect as well.

This patch compiles compiler-rt as a generic bitcode library which is independent of a specific GPU arch. This is like the ROCm device library and is aligned with the trend for gpu libc. Currently I have to use -mlink-bitcode-file to link it since currently the -fno-gpu-rdc mode does not use lld. I believe in the future it should be linked the same way as the gpu libc for the new driver.

Currently nvcc supports 128 bit int division in device code.
This patch adds support of 128 bit int division to HIP.

It builds lib functions for 128 bit division in compiler-rt
for amdgcn target.

Then links compiler-rt with -mlink-bitcode-file.

It adds support of archive of bitcode to -mlink-bitcode-file.

Fixes: llvm#71223

Fixes: SWDEV-426193
Copy link
Contributor

@arsenm arsenm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we land the infrastructure to allow linking of compiler-rt binaries without the specifics for divide 128?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend:AMDGPU clang:codegen clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' clang Clang issues not falling into any other category compiler-rt
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CUDA][HIP] fails to compile __int128 division
5 participants