Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AMDGPU] Enable OpenCL hostcall printf (WIP) #72556

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

vikramRH
Copy link
Contributor

@vikramRH vikramRH commented Nov 16, 2023

Kindly review top commit here, The builtin specific changes are up for review in a seperate patch (#72554)

Few implementation details,

  1. Hostcall printf is now default for both HIP and OpenCL.
  2. The implementation adds vector processing support both for hostcall and buffered cases. The vector elements are extracted and pushed onto the buffer individually (each alingned to 8 byte boundary)
  3. for OpenCL hostcall case, The format string pointer is addrspace casted to generic address space to be compatible with hostcall device lib functions.

Copy link

github-actions bot commented Nov 16, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@@ -170,20 +173,46 @@ static Value *appendString(IRBuilder<> &Builder, Value *Desc, Value *Arg,
return callAppendStringN(Builder, Desc, Arg, Length, IsLast);
}

static Value *appendVectorArg(IRBuilder<> &Builder, Value *Desc, Value *Arg,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the codegen changes here should be a separate commit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are all still in this PR?

Comment on lines 315 to 316
auto VecArg = dyn_cast<FixedVectorType>(Args[i]->getType());
assert(VecArg && "invalid vector specifier");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cast<> instead of dyn_cast + assert

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 317 to 323
AllocSize = VecArg->getNumElements() * 8;
} else
AllocSize = M->getDataLayout().getTypeAllocSize(Args[i]->getType());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't understand this split vector handling. Just always use getTypeAllocSize?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I extract individual vector elements, expand them to 8 bytes and store them onto the buffer. The "getTypeAllocSize" would not give me the actual occupied size in the buffer in this case.

// have printed a warning. We just rely on undefined behaviour and send the
// argument anyway.
return appendArg(Builder, Desc, Arg, IsLast);
} else if (IsVector) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No else after return

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Zero, Zero, Zero, Zero, Zero, false);
}

auto Val =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Value *

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -406,5 +410,9 @@ TARGET_BUILTIN(__builtin_amdgcn_cvt_pk_fp8_f32, "iffiIb", "nc", "fp8-insts")
TARGET_BUILTIN(__builtin_amdgcn_cvt_sr_bf8_f32, "ifiiIi", "nc", "fp8-insts")
TARGET_BUILTIN(__builtin_amdgcn_cvt_sr_fp8_f32, "ifiiIi", "nc", "fp8-insts")

// OpenCL
LANGBUILTIN(printf, "icC*4.", "fp:0:", ALL_OCL_LANGUAGES)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why this is necessary for the current task. What does it fix in the parsing OpenCL builtins?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although we talked about this offline, the explanation needs to be added here. In fact, the motivation for having this builtin should be added as a comment to the source itself for future reference.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not belong here. This has nothing to do with AMDGPU

if (Ty->getTypeID() == Type::DoubleTyID) {
if (Ty->isFloatingPointTy()) {
if (DL.getTypeAllocSize(Ty) < 8)
Arg = Builder.CreateFPExt(Arg, Builder.getDoubleTy());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This typecast should not be necessary. Default argument promotions in C++ for variadic functions ensure that all floating point arguments are promoted to double. If that is not happening, can you demonstrate with a test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type cast is necessary for types such as _Float16, which is not handled at argument promotion. I have added a test case to show the same

return Builder.CreateBitCast(Arg, Int64Ty);
}

if (isa<PointerType>(Ty)) {
if (!IsBuffered && isa<PointerType>(Ty)) {
return Builder.CreatePtrToInt(Arg, Int64Ty);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are pointers handled in the buffered case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pointer is just pushed onto the buffer. The cast is necessary for the hostcall case to be compatible with device lib functions

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Information like this should be written in comments in the source code itself.

assert(Arg->getType()->isVectorTy() && "incorrent append* function");
auto VectorTy = dyn_cast<FixedVectorType>(Arg->getType());
auto Zero = Builder.getInt64(0);
if (VectorTy) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the argument is ignored if it is not a FixedVectorType?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed this code a little now so that only FixedVectorTypes are handled. This should be okay since the OCL specs specifically say only vectors of length 2,3,4,8 and 16 are supported for printf.

Comment on lines 209 to 230
// If the format specifies a string but the argument is not, the frontend
// will have printed a warning. We just rely on undefined behaviour and send
// the argument anyway.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pure whitespace change. Keeping the original formatting of the comment helps simplify the diff.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bump. Please restore the original comment if there is no change in the actual words.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bump!

@@ -194,6 +226,8 @@ static void locateCStrings(SparseBitVector<8> &BV, StringRef Str) {
SpecPos += 2;
continue;
}
if (Str.find_first_of("v", SpecPos) != StringRef::npos)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this will work as expected. It can clearly match a "v" that occurs after the data type. For example, it is supposed to match "%v2d", but it will also match "%d v". The match should be performed inside the "Spec" substring created below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

"-mprintf-kind=" +
Args.getLastArgValue(options::OPT_mprintf_kind_EQ)));
// Force compiler error on invalid conversion specifiers
CmdArgs.push_back(Args.MakeArgString("-Werror=format-invalid-specifier"));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this necessary here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bump. I am not convinced that we should force errors on invalid specifiers. What is the rationale for that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was just so that we could error out instead of undefined behaviors due to wrong specifiers. I have no preference here and would be okay to change it if you feel so.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this is important. Converting a warning into an error can cause surprises when users build existing applications. It is upto the user to decide if they want this to be an error, which is why it is a command-line option. Even if the spec says undefined behaviour, we should just do something reasonable and accept whatever the user wrote without forcing an error. There is a lot of freedom in deciding what is "something reasonable" ... we could choose to print nothing, or a default value, or a placeholder instead of the actual printf format string, etc.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bump! Remove this command-line flag.

}

// Scan the format string to locate all specifiers, and mark the ones that
// specify a string, i.e, the "%s" specifier with optional '*' characters.
static void locateCStrings(SparseBitVector<8> &BV, StringRef Str) {
static void locateCStringsAndVectors(SparseBitVector<8> &BV,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update the comment above the function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -1,12 +1,68 @@
// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py
// RUN: %clang_cc1 -cl-std=CL1.2 -triple amdgcn-amd-amdhsa -disable-llvm-passes -emit-llvm -o - %s | FileCheck %s
// RUN: %clang_cc1 -cl-std=CL1.2 -triple amdgcn-amd-amdhsa -mprintf-kind=buffered -disable-llvm-passes -emit-llvm -o - %s | FileCheck --check-prefix=CHECK_BUFFERED %s
// RUN: %clang_cc1 -cl-std=CL1.2 -triple amdgcn-amd-amdhsa -mprintf-kind=hostcall -disable-llvm-passes -emit-llvm -o - %s | FileCheck --check-prefix=CHECK_HOSTCALL %s

int printf(__constant const char* st, ...) __attribute__((format(printf, 1, 2)));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be tests that use the "v" modifier inside a format specifier such as "%v2d" and also tests that use outside it, such as "%dv".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only see tests with the "v" correctly used as a vector specifier. What about tests like "%dv", and other cases where either the "v" is wrong, or it's just part of the text being printed? Given that the first attempt at detecting "v" had errors in it, I think it will be good to cover all corner cases where a "v" is actually a vector specifier and and where it is not.

@vikramRH vikramRH marked this pull request as ready for review November 28, 2023 06:11
@llvmbot llvmbot added clang Clang issues not falling into any other category backend:AMDGPU clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' clang:frontend Language frontend issues, e.g. anything involving "Sema" clang:codegen llvm:transforms labels Nov 28, 2023
@llvmbot
Copy link
Collaborator

llvmbot commented Nov 28, 2023

@llvm/pr-subscribers-clang-driver
@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-clang-codegen

@llvm/pr-subscribers-clang

Author: Vikram Hegde (vikramRH)

Changes

Kindly review top commit here, The builtin specific changes are up for review in a seperate patch (#72554)

Few implementation details,

  1. Hostcall printf is now default for both HIP and OpenCL.
  2. The implementation adds vector processing support both for hostcall and buffered cases. The vector elements are extracted and pushed onto the buffer individually (each alingned to 8 byte boundary)
  3. for OpenCL hostcall case, The format string pointer is addrspace casted to generic address space to be compatible with hostcall device lib functions.

Patch is 89.24 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/72556.diff

9 Files Affected:

  • (modified) clang/include/clang/Basic/BuiltinsAMDGPU.def (+8)
  • (modified) clang/lib/AST/Decl.cpp (+7)
  • (modified) clang/lib/Basic/Targets/AMDGPU.cpp (+2)
  • (modified) clang/lib/CodeGen/CGBuiltin.cpp (+7-1)
  • (modified) clang/lib/CodeGen/CGGPUBuiltin.cpp (+37-6)
  • (modified) clang/lib/Driver/ToolChains/Clang.cpp (+10)
  • (modified) clang/test/CodeGenOpenCL/amdgpu-printf.cl (+756-1)
  • (modified) llvm/include/llvm/Transforms/Utils/AMDGPUEmitPrintf.h (+1-1)
  • (modified) llvm/lib/Transforms/Utils/AMDGPUEmitPrintf.cpp (+90-57)
diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index a19c8bd5f219ec6..1799c72806bfdd4 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -21,6 +21,10 @@
 #if defined(BUILTIN) && !defined(TARGET_BUILTIN)
 #   define TARGET_BUILTIN(ID, TYPE, ATTRS, FEATURE) BUILTIN(ID, TYPE, ATTRS)
 #endif
+
+#if defined(BUILTIN) && !defined(LANGBUILTIN)
+#define LANGBUILTIN(ID, TYPE, ATTRS, BUILTIN_LANG) BUILTIN(ID, TYPE, ATTRS)
+#endif
 //===----------------------------------------------------------------------===//
 // SI+ only builtins.
 //===----------------------------------------------------------------------===//
@@ -406,5 +410,9 @@ TARGET_BUILTIN(__builtin_amdgcn_cvt_pk_fp8_f32, "iffiIb", "nc", "fp8-insts")
 TARGET_BUILTIN(__builtin_amdgcn_cvt_sr_bf8_f32, "ifiiIi", "nc", "fp8-insts")
 TARGET_BUILTIN(__builtin_amdgcn_cvt_sr_fp8_f32, "ifiiIi", "nc", "fp8-insts")
 
+// OpenCL
+LANGBUILTIN(printf, "icC*4.", "fp:0:", ALL_OCL_LANGUAGES)
+
 #undef BUILTIN
 #undef TARGET_BUILTIN
+#undef LANGBUILTIN
diff --git a/clang/lib/AST/Decl.cpp b/clang/lib/AST/Decl.cpp
index c5c2edf1bfe3aba..2597422bdd521a0 100644
--- a/clang/lib/AST/Decl.cpp
+++ b/clang/lib/AST/Decl.cpp
@@ -49,6 +49,7 @@
 #include "clang/Basic/SourceLocation.h"
 #include "clang/Basic/SourceManager.h"
 #include "clang/Basic/Specifiers.h"
+#include "clang/Basic/TargetBuiltins.h"
 #include "clang/Basic/TargetCXXABI.h"
 #include "clang/Basic/TargetInfo.h"
 #include "clang/Basic/Visibility.h"
@@ -3598,6 +3599,12 @@ unsigned FunctionDecl::getBuiltinID(bool ConsiderWrapperFunctions) const {
   if (!ConsiderWrapperFunctions && getStorageClass() == SC_Static)
     return 0;
 
+  // AMDGCN implementation supports printf as a builtin
+  // for OpenCL
+  if (Context.getTargetInfo().getTriple().isAMDGCN() &&
+      Context.getLangOpts().OpenCL && BuiltinID == AMDGPU::BIprintf)
+    return BuiltinID;
+
   // OpenCL v1.2 s6.9.f - The library functions defined in
   // the C99 standard headers are not available.
   if (Context.getLangOpts().OpenCL &&
diff --git a/clang/lib/Basic/Targets/AMDGPU.cpp b/clang/lib/Basic/Targets/AMDGPU.cpp
index 409ae32ab424215..307cfa49f54e926 100644
--- a/clang/lib/Basic/Targets/AMDGPU.cpp
+++ b/clang/lib/Basic/Targets/AMDGPU.cpp
@@ -91,6 +91,8 @@ static constexpr Builtin::Info BuiltinInfo[] = {
   {#ID, TYPE, ATTRS, nullptr, HeaderDesc::NO_HEADER, ALL_LANGUAGES},
 #define TARGET_BUILTIN(ID, TYPE, ATTRS, FEATURE)                               \
   {#ID, TYPE, ATTRS, FEATURE, HeaderDesc::NO_HEADER, ALL_LANGUAGES},
+#define LANGBUILTIN(ID, TYPE, ATTRS, LANG)                                     \
+  {#ID, TYPE, ATTRS, nullptr, HeaderDesc::NO_HEADER, LANG},
 #include "clang/Basic/BuiltinsAMDGPU.def"
 };
 
diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp
index 09309a3937fb613..8d51df24c7872b7 100644
--- a/clang/lib/CodeGen/CGBuiltin.cpp
+++ b/clang/lib/CodeGen/CGBuiltin.cpp
@@ -2458,6 +2458,11 @@ RValue CodeGenFunction::EmitBuiltinExpr(const GlobalDecl GD, unsigned BuiltinID,
       &getTarget().getLongDoubleFormat() == &llvm::APFloat::IEEEquad())
     BuiltinID = mutateLongDoubleBuiltin(BuiltinID);
 
+   // Mutate the printf builtin ID so that we use the same CodeGen path for
+   // HIP and OpenCL with AMDGPU targets.
+   if (getTarget().getTriple().isAMDGCN() && BuiltinID == AMDGPU::BIprintf)
+     BuiltinID = Builtin::BIprintf;
+
   // If the builtin has been declared explicitly with an assembler label,
   // disable the specialized emitting below. Ideally we should communicate the
   // rename in IR, or at least avoid generating the intrinsic calls that are
@@ -5617,7 +5622,8 @@ RValue CodeGenFunction::EmitBuiltinExpr(const GlobalDecl GD, unsigned BuiltinID,
         return EmitOpenMPDevicePrintfCallExpr(E);
       if (getTarget().getTriple().isNVPTX())
         return EmitNVPTXDevicePrintfCallExpr(E);
-      if (getTarget().getTriple().isAMDGCN() && getLangOpts().HIP)
+      if (getTarget().getTriple().isAMDGCN() &&
+         (getLangOpts().HIP || getLangOpts().OpenCL))
         return EmitAMDGPUDevicePrintfCallExpr(E);
     }
 
diff --git a/clang/lib/CodeGen/CGGPUBuiltin.cpp b/clang/lib/CodeGen/CGGPUBuiltin.cpp
index de4ee68c0da1e79..4eb2cf826e700fb 100644
--- a/clang/lib/CodeGen/CGGPUBuiltin.cpp
+++ b/clang/lib/CodeGen/CGGPUBuiltin.cpp
@@ -13,7 +13,10 @@
 
 #include "CodeGenFunction.h"
 #include "clang/Basic/Builtins.h"
+#include "clang/Basic/TargetBuiltins.h"
+#include "llvm/Analysis/ValueTracking.h"
 #include "llvm/IR/DataLayout.h"
+#include "llvm/IR/DiagnosticInfo.h"
 #include "llvm/IR/Instruction.h"
 #include "llvm/Support/MathExtras.h"
 #include "llvm/Transforms/Utils/AMDGPUEmitPrintf.h"
@@ -177,10 +180,20 @@ RValue CodeGenFunction::EmitNVPTXDevicePrintfCallExpr(const CallExpr *E) {
       E, this, GetVprintfDeclaration(CGM.getModule()), false);
 }
 
+// Deterimines if an argument is a string
+static bool isString(const clang::Type *argXTy) {
+  if ((argXTy->isPointerType() || argXTy->isConstantArrayType()) &&
+      argXTy->getPointeeOrArrayElementType()->isCharType())
+    return true;
+  else
+    return false;
+}
+
 RValue CodeGenFunction::EmitAMDGPUDevicePrintfCallExpr(const CallExpr *E) {
   assert(getTarget().getTriple().getArch() == llvm::Triple::amdgcn);
   assert(E->getBuiltinCallee() == Builtin::BIprintf ||
-         E->getBuiltinCallee() == Builtin::BI__builtin_printf);
+         E->getBuiltinCallee() == Builtin::BI__builtin_printf ||
+         E->getBuiltinCallee() == AMDGPU::BIprintf);
   assert(E->getNumArgs() >= 1); // printf always has at least one arg.
 
   CallArgList CallArgs;
@@ -188,6 +201,8 @@ RValue CodeGenFunction::EmitAMDGPUDevicePrintfCallExpr(const CallExpr *E) {
                E->getDirectCallee()->getType()->getAs<FunctionProtoType>(),
                E->arguments(), E->getDirectCallee(),
                /* ParamsToSkip = */ 0);
+  llvm::IRBuilder<> IRB(Builder.GetInsertBlock(), Builder.GetInsertPoint());
+  IRB.SetCurrentDebugLocation(Builder.getCurrentDebugLocation());
 
   SmallVector<llvm::Value *, 8> Args;
   for (const auto &A : CallArgs) {
@@ -198,15 +213,31 @@ RValue CodeGenFunction::EmitAMDGPUDevicePrintfCallExpr(const CallExpr *E) {
     }
 
     llvm::Value *Arg = A.getRValue(*this).getScalarVal();
+    if (isString(A.getType().getTypePtr()) && CGM.getLangOpts().OpenCL)
+      Arg = Builder.CreateAddrSpaceCast(Arg, CGM.Int8PtrTy);
     Args.push_back(Arg);
   }
 
-  llvm::IRBuilder<> IRB(Builder.GetInsertBlock(), Builder.GetInsertPoint());
-  IRB.SetCurrentDebugLocation(Builder.getCurrentDebugLocation());
+  auto PFK = CGM.getTarget().getTargetOpts().AMDGPUPrintfKindVal;
+  bool isBuffered =
+       (PFK == clang::TargetOptions::AMDGPUPrintfKind::Buffered);
+
+  StringRef FmtStr;
+  if (llvm::getConstantStringInfo(Args[0], FmtStr)) {
+    if (FmtStr.empty())
+      FmtStr = StringRef("", 1);
+  } else {
+    if (CGM.getLangOpts().OpenCL) {
+      llvm::DiagnosticInfoUnsupported UnsupportedFormatStr(
+          *IRB.GetInsertBlock()->getParent(),
+          "printf format string must be a trivially resolved constant string "
+          "global variable",
+          IRB.getCurrentDebugLocation());
+      IRB.getContext().diagnose(UnsupportedFormatStr);
+    }
+  }
 
-  bool isBuffered = (CGM.getTarget().getTargetOpts().AMDGPUPrintfKindVal ==
-                     clang::TargetOptions::AMDGPUPrintfKind::Buffered);
-  auto Printf = llvm::emitAMDGPUPrintfCall(IRB, Args, isBuffered);
+  auto Printf = llvm::emitAMDGPUPrintfCall(IRB, Args, FmtStr, isBuffered);
   Builder.SetInsertPoint(IRB.GetInsertBlock(), IRB.GetInsertPoint());
   return RValue::get(Printf);
 }
diff --git a/clang/lib/Driver/ToolChains/Clang.cpp b/clang/lib/Driver/ToolChains/Clang.cpp
index b462f5a44057d94..b63c777fd1f158c 100644
--- a/clang/lib/Driver/ToolChains/Clang.cpp
+++ b/clang/lib/Driver/ToolChains/Clang.cpp
@@ -4742,6 +4742,16 @@ void Clang::ConstructJob(Compilation &C, const JobAction &JA,
     Args.ClaimAllArgs(options::OPT_gen_cdb_fragment_path);
   }
 
+  if (TC.getTriple().isAMDGPU() && types::isOpenCL(Input.getType())) {
+    if (Args.getLastArg(options::OPT_mprintf_kind_EQ)) {
+      CmdArgs.push_back(Args.MakeArgString(
+          "-mprintf-kind=" +
+          Args.getLastArgValue(options::OPT_mprintf_kind_EQ)));
+      // Force compiler error on invalid conversion specifiers
+      CmdArgs.push_back(Args.MakeArgString("-Werror=format-invalid-specifier"));
+    }
+  }
+
   if (IsCuda || IsHIP) {
     // We have to pass the triple of the host if compiling for a CUDA/HIP device
     // and vice-versa.
diff --git a/clang/test/CodeGenOpenCL/amdgpu-printf.cl b/clang/test/CodeGenOpenCL/amdgpu-printf.cl
index edf6dbf8657cbe5..9b411f23ceba56a 100644
--- a/clang/test/CodeGenOpenCL/amdgpu-printf.cl
+++ b/clang/test/CodeGenOpenCL/amdgpu-printf.cl
@@ -1,5 +1,8 @@
 // NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py
-// RUN: %clang_cc1 -cl-std=CL1.2 -triple amdgcn-amd-amdhsa -disable-llvm-passes -emit-llvm -o - %s | FileCheck %s
+// RUN: %clang_cc1 -cl-std=CL1.2 -triple amdgcn-amd-amdhsa -mprintf-kind=buffered -disable-llvm-passes -emit-llvm -o - %s | FileCheck --check-prefix=CHECK_BUFFERED %s
+// RUN: %clang_cc1 -cl-std=CL1.2 -triple amdgcn-amd-amdhsa -mprintf-kind=hostcall -disable-llvm-passes -emit-llvm -o - %s | FileCheck --check-prefix=CHECK_HOSTCALL %s
+
+#pragma OPENCL EXTENSION cl_khr_fp16 : enable
 
 int printf(__constant const char* st, ...) __attribute__((format(printf, 1, 2)));
 
@@ -7,6 +10,42 @@ int printf(__constant const char* st, ...) __attribute__((format(printf, 1, 2)))
 // CHECK-NEXT:  entry:
 // CHECK-NEXT:    [[CALL:%.*]] = call i32 (ptr addrspace(4), ...) @printf(ptr addrspace(4) noundef @.str) #[[ATTR4:[0-9]+]]
 // CHECK-NEXT:    ret void
+// CHECK_BUFFERED-LABEL: @test_printf_noargs(
+// CHECK_BUFFERED-NEXT:  entry:
+// CHECK_BUFFERED-NEXT:    [[PRINTF_ALLOC_FN:%.*]] = call ptr addrspace(1) @__printf_alloc(i32 12)
+// CHECK_BUFFERED-NEXT:    [[TMP0:%.*]] = icmp ne ptr addrspace(1) [[PRINTF_ALLOC_FN]], null
+// CHECK_BUFFERED-NEXT:    br i1 [[TMP0]], label [[ARGPUSH_BLOCK:%.*]], label [[END_BLOCK:%.*]]
+// CHECK_BUFFERED:       end.block:
+// CHECK_BUFFERED-NEXT:    [[TMP1:%.*]] = xor i1 [[TMP0]], true
+// CHECK_BUFFERED-NEXT:    [[PRINTF_RESULT:%.*]] = sext i1 [[TMP1]] to i32
+// CHECK_BUFFERED-NEXT:    ret void
+// CHECK_BUFFERED:       argpush.block:
+// CHECK_BUFFERED-NEXT:    store i32 50, ptr addrspace(1) [[PRINTF_ALLOC_FN]], align 4
+// CHECK_BUFFERED-NEXT:    [[TMP2:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[PRINTF_ALLOC_FN]], i32 4
+// CHECK_BUFFERED-NEXT:    store i64 -8529802306755643245, ptr addrspace(1) [[TMP2]], align 8
+// CHECK_BUFFERED-NEXT:    [[TMP3:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[TMP2]], i32 8
+// CHECK_BUFFERED-NEXT:    br label [[END_BLOCK]]
+//
+// CHECK_HOSTCALL-LABEL: @test_printf_noargs(
+// CHECK_HOSTCALL-NEXT:  entry:
+// CHECK_HOSTCALL-NEXT:    [[TMP0:%.*]] = call i64 @__ockl_printf_begin(i64 0)
+// CHECK_HOSTCALL-NEXT:    br i1 icmp eq (ptr addrspacecast (ptr addrspace(4) @.str to ptr), ptr null), label [[STRLEN_JOIN:%.*]], label [[STRLEN_WHILE:%.*]]
+// CHECK_HOSTCALL:       strlen.while:
+// CHECK_HOSTCALL-NEXT:    [[TMP1:%.*]] = phi ptr [ addrspacecast (ptr addrspace(4) @.str to ptr), [[ENTRY:%.*]] ], [ [[TMP2:%.*]], [[STRLEN_WHILE]] ]
+// CHECK_HOSTCALL-NEXT:    [[TMP2]] = getelementptr i8, ptr [[TMP1]], i64 1
+// CHECK_HOSTCALL-NEXT:    [[TMP3:%.*]] = load i8, ptr [[TMP1]], align 1
+// CHECK_HOSTCALL-NEXT:    [[TMP4:%.*]] = icmp eq i8 [[TMP3]], 0
+// CHECK_HOSTCALL-NEXT:    br i1 [[TMP4]], label [[STRLEN_WHILE_DONE:%.*]], label [[STRLEN_WHILE]]
+// CHECK_HOSTCALL:       strlen.while.done:
+// CHECK_HOSTCALL-NEXT:    [[TMP5:%.*]] = ptrtoint ptr [[TMP1]] to i64
+// CHECK_HOSTCALL-NEXT:    [[TMP6:%.*]] = sub i64 [[TMP5]], ptrtoint (ptr addrspacecast (ptr addrspace(4) @.str to ptr) to i64)
+// CHECK_HOSTCALL-NEXT:    [[TMP7:%.*]] = add i64 [[TMP6]], 1
+// CHECK_HOSTCALL-NEXT:    br label [[STRLEN_JOIN]]
+// CHECK_HOSTCALL:       strlen.join:
+// CHECK_HOSTCALL-NEXT:    [[TMP8:%.*]] = phi i64 [ [[TMP7]], [[STRLEN_WHILE_DONE]] ], [ 0, [[ENTRY]] ]
+// CHECK_HOSTCALL-NEXT:    [[TMP9:%.*]] = call i64 @__ockl_printf_append_string_n(i64 [[TMP0]], ptr addrspacecast (ptr addrspace(4) @.str to ptr), i64 [[TMP8]], i32 1)
+// CHECK_HOSTCALL-NEXT:    [[TMP10:%.*]] = trunc i64 [[TMP9]] to i32
+// CHECK_HOSTCALL-NEXT:    ret void
 //
 __kernel void test_printf_noargs() {
     printf("");
@@ -19,6 +58,53 @@ __kernel void test_printf_noargs() {
 // CHECK-NEXT:    [[TMP0:%.*]] = load i32, ptr addrspace(5) [[I_ADDR]], align 4, !tbaa [[TBAA8]]
 // CHECK-NEXT:    [[CALL:%.*]] = call i32 (ptr addrspace(4), ...) @printf(ptr addrspace(4) noundef @.str.1, i32 noundef [[TMP0]]) #[[ATTR4]]
 // CHECK-NEXT:    ret void
+// CHECK_BUFFERED-LABEL: @test_printf_int(
+// CHECK_BUFFERED-NEXT:  entry:
+// CHECK_BUFFERED-NEXT:    [[I_ADDR:%.*]] = alloca i32, align 4, addrspace(5)
+// CHECK_BUFFERED-NEXT:    store i32 [[I:%.*]], ptr addrspace(5) [[I_ADDR]], align 4, !tbaa [[TBAA16:![0-9]+]]
+// CHECK_BUFFERED-NEXT:    [[TMP0:%.*]] = load i32, ptr addrspace(5) [[I_ADDR]], align 4, !tbaa [[TBAA16]]
+// CHECK_BUFFERED-NEXT:    [[PRINTF_ALLOC_FN:%.*]] = call ptr addrspace(1) @__printf_alloc(i32 20)
+// CHECK_BUFFERED-NEXT:    [[TMP1:%.*]] = icmp ne ptr addrspace(1) [[PRINTF_ALLOC_FN]], null
+// CHECK_BUFFERED-NEXT:    br i1 [[TMP1]], label [[ARGPUSH_BLOCK:%.*]], label [[END_BLOCK:%.*]]
+// CHECK_BUFFERED:       end.block:
+// CHECK_BUFFERED-NEXT:    [[TMP2:%.*]] = xor i1 [[TMP1]], true
+// CHECK_BUFFERED-NEXT:    [[PRINTF_RESULT:%.*]] = sext i1 [[TMP2]] to i32
+// CHECK_BUFFERED-NEXT:    ret void
+// CHECK_BUFFERED:       argpush.block:
+// CHECK_BUFFERED-NEXT:    store i32 82, ptr addrspace(1) [[PRINTF_ALLOC_FN]], align 4
+// CHECK_BUFFERED-NEXT:    [[TMP3:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[PRINTF_ALLOC_FN]], i32 4
+// CHECK_BUFFERED-NEXT:    store i64 -2582314622382785113, ptr addrspace(1) [[TMP3]], align 8
+// CHECK_BUFFERED-NEXT:    [[TMP4:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[TMP3]], i32 8
+// CHECK_BUFFERED-NEXT:    [[TMP5:%.*]] = zext i32 [[TMP0]] to i64
+// CHECK_BUFFERED-NEXT:    store i64 [[TMP5]], ptr addrspace(1) [[TMP4]], align 8
+// CHECK_BUFFERED-NEXT:    [[PRINTBUFFNEXTPTR:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[TMP4]], i32 8
+// CHECK_BUFFERED-NEXT:    br label [[END_BLOCK]]
+//
+// CHECK_HOSTCALL-LABEL: @test_printf_int(
+// CHECK_HOSTCALL-NEXT:  entry:
+// CHECK_HOSTCALL-NEXT:    [[I_ADDR:%.*]] = alloca i32, align 4, addrspace(5)
+// CHECK_HOSTCALL-NEXT:    store i32 [[I:%.*]], ptr addrspace(5) [[I_ADDR]], align 4, !tbaa [[TBAA9:![0-9]+]]
+// CHECK_HOSTCALL-NEXT:    [[TMP0:%.*]] = load i32, ptr addrspace(5) [[I_ADDR]], align 4, !tbaa [[TBAA9]]
+// CHECK_HOSTCALL-NEXT:    [[TMP1:%.*]] = call i64 @__ockl_printf_begin(i64 0)
+// CHECK_HOSTCALL-NEXT:    br i1 icmp eq (ptr addrspacecast (ptr addrspace(4) @.str.1 to ptr), ptr null), label [[STRLEN_JOIN:%.*]], label [[STRLEN_WHILE:%.*]]
+// CHECK_HOSTCALL:       strlen.while:
+// CHECK_HOSTCALL-NEXT:    [[TMP2:%.*]] = phi ptr [ addrspacecast (ptr addrspace(4) @.str.1 to ptr), [[ENTRY:%.*]] ], [ [[TMP3:%.*]], [[STRLEN_WHILE]] ]
+// CHECK_HOSTCALL-NEXT:    [[TMP3]] = getelementptr i8, ptr [[TMP2]], i64 1
+// CHECK_HOSTCALL-NEXT:    [[TMP4:%.*]] = load i8, ptr [[TMP2]], align 1
+// CHECK_HOSTCALL-NEXT:    [[TMP5:%.*]] = icmp eq i8 [[TMP4]], 0
+// CHECK_HOSTCALL-NEXT:    br i1 [[TMP5]], label [[STRLEN_WHILE_DONE:%.*]], label [[STRLEN_WHILE]]
+// CHECK_HOSTCALL:       strlen.while.done:
+// CHECK_HOSTCALL-NEXT:    [[TMP6:%.*]] = ptrtoint ptr [[TMP2]] to i64
+// CHECK_HOSTCALL-NEXT:    [[TMP7:%.*]] = sub i64 [[TMP6]], ptrtoint (ptr addrspacecast (ptr addrspace(4) @.str.1 to ptr) to i64)
+// CHECK_HOSTCALL-NEXT:    [[TMP8:%.*]] = add i64 [[TMP7]], 1
+// CHECK_HOSTCALL-NEXT:    br label [[STRLEN_JOIN]]
+// CHECK_HOSTCALL:       strlen.join:
+// CHECK_HOSTCALL-NEXT:    [[TMP9:%.*]] = phi i64 [ [[TMP8]], [[STRLEN_WHILE_DONE]] ], [ 0, [[ENTRY]] ]
+// CHECK_HOSTCALL-NEXT:    [[TMP10:%.*]] = call i64 @__ockl_printf_append_string_n(i64 [[TMP1]], ptr addrspacecast (ptr addrspace(4) @.str.1 to ptr), i64 [[TMP9]], i32 0)
+// CHECK_HOSTCALL-NEXT:    [[TMP11:%.*]] = zext i32 [[TMP0]] to i64
+// CHECK_HOSTCALL-NEXT:    [[TMP12:%.*]] = call i64 @__ockl_printf_append_args(i64 [[TMP10]], i32 1, i64 [[TMP11]], i64 0, i64 0, i64 0, i64 0, i64 0, i64 0, i32 1)
+// CHECK_HOSTCALL-NEXT:    [[TMP13:%.*]] = trunc i64 [[TMP12]] to i32
+// CHECK_HOSTCALL-NEXT:    ret void
 //
 __kernel void test_printf_int(int i) {
     printf("%d", i);
@@ -36,8 +122,677 @@ __kernel void test_printf_int(int i) {
 // CHECK-NEXT:    [[CALL:%.*]] = call i32 (ptr addrspace(4), ...) @printf(ptr addrspace(4) noundef @.str.2, ptr addrspace(5) noundef [[ARRAYDECAY]], i32 noundef [[TMP2]]) #[[ATTR4]]
 // CHECK-NEXT:    call void @llvm.lifetime.end.p5(i64 4, ptr addrspace(5) [[S]]) #[[ATTR5]]
 // CHECK-NEXT:    ret void
+// CHECK_BUFFERED-LABEL: @test_printf_str_int(
+// CHECK_BUFFERED-NEXT:  entry:
+// CHECK_BUFFERED-NEXT:    [[I_ADDR:%.*]] = alloca i32, align 4, addrspace(5)
+// CHECK_BUFFERED-NEXT:    [[S:%.*]] = alloca [4 x i8], align 1, addrspace(5)
+// CHECK_BUFFERED-NEXT:    store i32 [[I:%.*]], ptr addrspace(5) [[I_ADDR]], align 4, !tbaa [[TBAA16]]
+// CHECK_BUFFERED-NEXT:    call void @llvm.lifetime.start.p5(i64 4, ptr addrspace(5) [[S]]) #[[ATTR1:[0-9]+]]
+// CHECK_BUFFERED-NEXT:    call void @llvm.memcpy.p5.p4.i64(ptr addrspace(5) align 1 [[S]], ptr addrspace(4) align 1 @__const.test_printf_str_int.s, i64 4, i1 false)
+// CHECK_BUFFERED-NEXT:    [[ARRAYDECAY:%.*]] = getelementptr inbounds [4 x i8], ptr addrspace(5) [[S]], i64 0, i64 0
+// CHECK_BUFFERED-NEXT:    [[TMP0:%.*]] = load i32, ptr addrspace(5) [[I_ADDR]], align 4, !tbaa [[TBAA16]]
+// CHECK_BUFFERED-NEXT:    [[TMP1:%.*]] = addrspacecast ptr addrspace(5) [[ARRAYDECAY]] to ptr
+// CHECK_BUFFERED-NEXT:    [[TMP2:%.*]] = icmp eq ptr [[TMP1]], null
+// CHECK_BUFFERED-NEXT:    br i1 [[TMP2]], label [[STRLEN_JOIN:%.*]], label [[STRLEN_WHILE:%.*]]
+// CHECK_BUFFERED:       strlen.while:
+// CHECK_BUFFERED-NEXT:    [[TMP3:%.*]] = phi ptr [ [[TMP1]], [[ENTRY:%.*]] ], [ [[TMP4:%.*]], [[STRLEN_WHILE]] ]
+// CHECK_BUFFERED-NEXT:    [[TMP4]] = getelementptr i8, ptr [[TMP3]], i64 1
+// CHECK_BUFFERED-NEXT:    [[TMP5:%.*]] = load i8, ptr [[TMP3]], align 1
+// CHECK_BUFFERED-NEXT:    [[TMP6:%.*]] = icmp eq i8 [[TMP5]], 0
+// CHECK_BUFFERED-NEXT:    br i1 [[TMP6]], label [[STRLEN_WHILE_DONE:%.*]], label [[STRLEN_WHILE]]
+// CHECK_BUFFERED:       strlen.while.done:
+// CHECK_BUFFERED-NEXT:    [[TMP7:%.*]] = ptrtoint ptr [[TMP1]] to i64
+// CHECK_BUFFERED-NEXT:    [[TMP8:%.*]] = ptrtoint ptr [[TMP3]] to i64
+// CHECK_BUFFERED-NEXT:    [[TMP9:%.*]] = sub i64 [[TMP8]], [[TMP7]]
+// CHECK_BUFFERED-NEXT:    [[TMP10:%.*]] = add i64 [[TMP9]], 1
+// CHECK_BUFFERED-NEXT:    br label [[STRLEN_JOIN]]
+// CHECK_BUFFERED:       strlen.join:
+// CHECK_BUFFERED-NEXT:    [[TMP11:%.*]] = phi i64 [ [[TMP10]], [[STRLEN_WHILE_DONE]] ], [ 0, [[ENTRY]] ]
+// CHECK_BUFFERED-NEXT:    [[TMP12:%.*]] = add i64 [[TMP11]], 7
+// CHECK_BUFFERED-NEXT:    [[TMP13:%.*]] = and i64 [[TMP12]], 4294967288
+// CHECK_BUFFERED-NEXT:    [[TMP14:%.*]] = add i64 [[TMP13]], 20
+// CHECK_BUFFERED-NEXT:    [[TMP15:%.*]] = trunc i64 [[TMP14]] to i32
+// CHECK_BUFFERED-NEXT:    [[PRINTF_ALLOC_FN:%.*]] = call ptr addrspace(1) @__printf_alloc(i32 [[TMP15]])
+// CHECK_BUFFERED-NEXT:    [[TMP16:%.*]] = icmp ne ptr addrspace(1) [[PRINTF_ALLOC_FN]], null
+// CHECK_BUFFERED-NEXT:    br i1 [[TMP16]], label [[ARGPUSH_BLOCK:%.*]], label [[END_BLOCK:%.*]]
+// CHECK_BUFFERED:       end.block:
+// CHECK_BUFFERED-NEXT:    [[TMP17:%.*]] = xor i1 [[TMP16]], true
+// CHECK_BUFFERED-NEXT:    [[PRINTF_RESULT:%.*]] = sext i1 [[TMP17]] to i32
+// CHECK_BUFFERED-NEXT:    call void @llvm.lifetime.end.p5(i64 4, ptr addrspace(5) [[S]]) #[[ATTR1]]
+// CHECK_BUFFERED-NEXT:    ret void
+// CHECK_BUFFERED:       argpush.block:
+// CHECK_BUFFERED-NEXT:    [[TMP18:%.*]] = shl i32 [[TMP15]], 2
+// CHECK_BUFFERED-NEXT:    [[TMP19:%.*]] = or i32 [[TMP18]], 2
+// CHECK_BUFFERED-NEXT:    store i32 [[TMP19]], ptr addrspace(1) [[PRINTF_ALLOC_FN]], align 4
+// CHECK_BUFFERED-NEXT:   ...
[truncated]

@vikramRH vikramRH changed the title Enable OpenCL hostcall printf (WIP) [AMDGPU] Enable OpenCL hostcall printf (WIP) Nov 28, 2023
@vikramRH
Copy link
Contributor Author

ping

@arsenm
Copy link
Contributor

arsenm commented Nov 28, 2023

ping

The split up parts are still part of this one PR. Currently you're supposed to create a separate PR for each separate change. The set behavior is to squash all of these together on submit

@vikramRH
Copy link
Contributor Author

@arsenm , apologies for the trouble here. I should have based this out of my earlier commit. currently I do not see a way to base this patch off of my earlier commit and it might get too confusing for other reviewers if I close this and raise another review. would the individual commit details suffice ?

@ssahasra
Copy link
Collaborator

ping

Some comments still need to be addressed.

return Builder.CreateBitCast(Arg, Int64Ty);
}

if (isa<PointerType>(Ty)) {
if (!IsBuffered && isa<PointerType>(Ty)) {
return Builder.CreatePtrToInt(Arg, Int64Ty);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Information like this should be written in comments in the source code itself.

@@ -1,12 +1,68 @@
// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py
// RUN: %clang_cc1 -cl-std=CL1.2 -triple amdgcn-amd-amdhsa -disable-llvm-passes -emit-llvm -o - %s | FileCheck %s
// RUN: %clang_cc1 -cl-std=CL1.2 -triple amdgcn-amd-amdhsa -mprintf-kind=buffered -disable-llvm-passes -emit-llvm -o - %s | FileCheck --check-prefix=CHECK_BUFFERED %s
// RUN: %clang_cc1 -cl-std=CL1.2 -triple amdgcn-amd-amdhsa -mprintf-kind=hostcall -disable-llvm-passes -emit-llvm -o - %s | FileCheck --check-prefix=CHECK_HOSTCALL %s

int printf(__constant const char* st, ...) __attribute__((format(printf, 1, 2)));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only see tests with the "v" correctly used as a vector specifier. What about tests like "%dv", and other cases where either the "v" is wrong, or it's just part of the text being printed? Given that the first attempt at detecting "v" had errors in it, I think it will be good to cover all corner cases where a "v" is actually a vector specifier and and where it is not.

@@ -406,5 +410,9 @@ TARGET_BUILTIN(__builtin_amdgcn_cvt_pk_fp8_f32, "iffiIb", "nc", "fp8-insts")
TARGET_BUILTIN(__builtin_amdgcn_cvt_sr_bf8_f32, "ifiiIi", "nc", "fp8-insts")
TARGET_BUILTIN(__builtin_amdgcn_cvt_sr_fp8_f32, "ifiiIi", "nc", "fp8-insts")

// OpenCL
LANGBUILTIN(printf, "icC*4.", "fp:0:", ALL_OCL_LANGUAGES)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although we talked about this offline, the explanation needs to be added here. In fact, the motivation for having this builtin should be added as a comment to the source itself for future reference.

@@ -198,6 +229,10 @@ static void locateCStrings(SparseBitVector<8> &BV, StringRef Str) {
if (SpecEnd == StringRef::npos)
return;
auto Spec = Str.slice(SpecPos, SpecEnd + 1);

if ((Spec.find_first_of("v")) != StringRef::npos)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just ".find()" should be sufficient?

Copy link
Contributor

@arsenm arsenm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a separate PR open for "Add vector processing support to AMDGPU printf"? I think it's easiest to move this part forward first

@vikramRH
Copy link
Contributor Author

vikramRH commented Dec 4, 2023

Is there a separate PR open for "Add vector processing support to AMDGPU printf"? I think it's easiest to move this part forward first

@arsenm , you are right. I just want to make sure we are good on runtime changes too now since there seems to be a blocker. The changes here are not necessary unless we are okay with runtime changes.

// Scan the format string to locate all specifiers, and mark the ones that
// specify a string, i.e, the "%s" specifier with optional '*' characters.
static void locateCStrings(SparseBitVector<8> &BV, StringRef Str) {
// Scan the format string to locate all specifiers and OCL vectors,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"all specifiers" is enough ... there is no need to say "OCL vectors". The rest of the sentence is the one which correctly says "string or vector".

Comment on lines 209 to 230
// If the format specifies a string but the argument is not, the frontend
// will have printed a warning. We just rely on undefined behaviour and send
// the argument anyway.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bump!

return Builder.CreatePtrToInt(Arg, Int64Ty);
}

llvm_unreachable("unexpected type");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This llvm_unreachable is preferred. It's clear documentation that all supported types have been handled by this point. Each if-block for integer, floating and pointer types should have its own default "return Arg".

if (FmtStr.empty())
FmtStr = StringRef("", 1);
} else {
if (CGM.getLangOpts().OpenCL) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like the wrong place for a diagnostic. For an OpenCL program, shouldn't Sema have already verified that the arguments match the required types, such as "constant address space" for the format string?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The diagnostic should be replaced with an assert() or an llvm_unreachable(). The OpenCL spec says that the format string should be resolvable at compile time, but this is not the right place to check that. By now, the frontend or sema should have rejected the program as ill-formed.

"-mprintf-kind=" +
Args.getLastArgValue(options::OPT_mprintf_kind_EQ)));
// Force compiler error on invalid conversion specifiers
CmdArgs.push_back(Args.MakeArgString("-Werror=format-invalid-specifier"));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bump! Remove this command-line flag.

@vikramRH
Copy link
Contributor Author

The new set of changes adds following changes,

  1. The iteration over vector elements now happens using vector size from the format specifier as reference, this is inline with runtime implementation and helps handling undefined behavior when we have a mismatch.
  2. The error flag "-Werror=format-invalid-specifier" has been removed.

@@ -178,17 +181,29 @@ RValue CodeGenFunction::EmitNVPTXDevicePrintfCallExpr(const CallExpr *E) {
E, this, GetVprintfDeclaration(CGM.getModule()), false);
}

// Deterimines if an argument is a string
static bool isString(const clang::Type *argXTy) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be called "MayBeString()" at best. It's not necessary that a char* argument type is a C-style string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed this, addrspace cast is done during arg processing.

clang/lib/CodeGen/CGGPUBuiltin.cpp Outdated Show resolved Hide resolved
if (FmtStr.empty())
FmtStr = StringRef("", 1);
} else {
if (CGM.getLangOpts().OpenCL) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The diagnostic should be replaced with an assert() or an llvm_unreachable(). The OpenCL spec says that the format string should be resolvable at compile time, but this is not the right place to check that. By now, the frontend or sema should have rejected the program as ill-formed.

@@ -406,5 +410,9 @@ TARGET_BUILTIN(__builtin_amdgcn_cvt_pk_fp8_f32, "iffiIb", "nc", "fp8-insts")
TARGET_BUILTIN(__builtin_amdgcn_cvt_sr_bf8_f32, "ifiiIi", "nc", "fp8-insts")
TARGET_BUILTIN(__builtin_amdgcn_cvt_sr_fp8_f32, "ifiiIi", "nc", "fp8-insts")

// OpenCL
LANGBUILTIN(printf, "icC*4.", "fp:0:", ALL_OCL_LANGUAGES)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not belong here. This has nothing to do with AMDGPU

Comment on lines +3620 to +3624
// AMDGCN implementation supports printf as a builtin
// for OpenCL
if (Context.getTargetInfo().getTriple().isAMDGCN() &&
Context.getLangOpts().OpenCL && BuiltinID == AMDGPU::BIprintf)
return BuiltinID;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not belong here and has nothing to do with AMDGPU

Copy link
Contributor Author

@vikramRH vikramRH Mar 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The signatures of C-printf and OCL printf differ and I dont think generic builtin handling provides a way to register overloaded builtins with "shared" builtin ID's. This needs a new ID. do you have any alternate suggestions here ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this had been clarified earlier too. It's quite imprecise to just say that "signatures differ". Perhaps the following detailed explanation might move the conversatino forward. The problem is that the OpenCL printf expects a format string in the constant address space, which has no representation in Clang builtin. What we do have is the ability to specify an address-space number in the builtin declaration. But this number is target-specific, which makes the whole builtin target-specific. Is there a way around that magic number 4?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only other alternative I see currently is to modify Sema (probably ActOnFunctionDeclarator) so that we map the ocl printf declaration to C printf builtin ID. This would be a really hacky solution and I would prefer this implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping @arsenm

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The builtin specifications are also in terms of the lang address space, not the target address space (this was an ugly compromise to make builtins work at all in OpenCL)

Copy link
Contributor Author

@vikramRH vikramRH Mar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arsenm, thanks for the info. CustomTypeChecking is a valid option. I'm not sure why OpenCL community did not consider this change despite OpenCL specs specifying the details. I could create a separate patch for this (probably folks from OCL community would provide further background). Meanwhile, can this go ahead as an AMDGPU specific workaround for now so that we have the intended feature in place ? (The frontend changes here can be reverted with that follow up patch )

PS :Ah, I see another issue . OpenCL v1.2 s6.9.f states none of the functions defined in C99 headers are available. This would mean std printf is supposed to be treated differently than OpenCL builtins and consequently the builtin IDs assigned to them "need" to be different. If this understanding is correct, moving ahead with using same builtin ID as std printf is not the right way. (probably this is why such an implementation was never considered)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

. Meanwhile, can this go ahead as an AMDGPU specific workaround for now so that we have the intended feature in place

No. That cleanup will never happen.

PS :Ah, I see another issue . OpenCL v1.2 s6.9.f states none of the functions defined in C99 headers are available. This would mean std printf is supposed to be treated differently than OpenCL builtins and consequently the builtin IDs assigned to them "need" to be different

That's not what that means

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was referring to

. This essentially means that even if frontend attaches the printf builtin ID to the decl (even after custom type checks), this would revert.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that check is buggy. The printf declaration doesn't come from a header, but the printf function does exist in 1.2+. It probably needs to special case printf

Comment on lines +2553 to +2556
// Mutate the printf builtin ID so that we use the same CodeGen path for
// HIP and OpenCL with AMDGPU targets.
if (getTarget().getTriple().isAMDGCN() && BuiltinID == AMDGPU::BIprintf)
BuiltinID = Builtin::BIprintf;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should not need to remap builtins

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be removed if you feel so, probably we would need a new case in Expr CodeGen

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all you need is to add the CustomTypeChecking attribute to the printf definition, and then add language specific type checking on the string argument

StringRef FmtStr;
if (llvm::getConstantStringInfo(Args[0], FmtStr)) {
if (FmtStr.empty())
FmtStr = StringRef("", 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is producing an invalid StringRef?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really. This is just to say the format string is not really empty (i.e size = 0) when the user input is an empty format string (a weird corner case)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend:AMDGPU clang:codegen clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' clang:frontend Language frontend issues, e.g. anything involving "Sema" clang Clang issues not falling into any other category llvm:transforms
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants