[AMDGPU] Implement variadic functions by IR lowering #93362

Merged
merged 1 commit on Jun 6, 2024

Conversation

JonChesterfield
Collaborator

This is a mostly target-independent variadic function optimisation and lowering pass. In this initial commit it is enabled only for AMDGPU.

The purpose is to make C-style variadic functions a zero-cost abstraction. They are lowered to equivalent IR which is then amenable to other optimisations. This is inherently somewhat target specific, but much less so than one might expect: the C varargs interface heavily constrains how far ABI designs can diverge.
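
As a rough sketch in C of what the IR-level rewrite amounts to (all names here, such as sum_valist and sum_args, are illustrative and not taken from the patch), the variadic arguments at a call site end up spilled into a frame-local struct whose address is passed in place of the "..." arguments:

// Illustrative only: the pass operates on LLVM IR, not C source, and the real
// layout follows each target's slot size and alignment rules.
struct sum_args { int a, b, c; };       // hypothetical per-call-site argument frame

// The variadic body is rewritten to take a pointer into the packed arguments.
static int sum_valist(int count, void *list) {
  const int *p = (const int *)list;
  int total = 0;
  for (int i = 0; i < count; ++i)
    total += p[i];
  return total;
}

// Before the rewrite the caller was: return sum(3, 1, 2, 3);
int caller(void) {
  struct sum_args args = {1, 2, 3};     // arguments spilled to a stack struct
  return sum_valist(3, &args);          // its address replaces the "..." list
}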

The pass is primarily tested on WebAssembly, because wasm has a straightforward variadic lowering strategy that coincides exactly with what this pass transforms code into, and a struct-passing convention with few cases to check. Adding conventions for further targets is straightforward; they are elided from this patch primarily to simplify the review. Linux X86, AMD64, AArch64 and NVPTX are implemented in other branches.

For targets that already have a va_arg lowering in clang, testing is most efficiently done by checking that clang | opt completely elides the variadic syntax from test cases. The lowering produces a struct for each call site which can be inspected to check that the various alignments and indirections are correct.
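
As a sketch of that testing approach (assuming the pass is exposed to opt as "expand-variadics"; the helper first_i32 and the exact RUN lines are hypothetical, not the tests from this patch), a lit test can pipe clang output through opt and check that no va_list machinery survives:

// Sketch only. -disable-llvm-passes keeps the IR unoptimised (and free of
// optnone attributes) so that opt performs the expansion and folding.
// RUN: %clang_cc1 -triple wasm32-unknown-unknown -O1 -disable-llvm-passes \
// RUN:   -emit-llvm -o - %s | \
// RUN:   opt -S -passes='expand-variadics,default<O1>' | FileCheck %s

// Hypothetical helper: returns its first variadic argument.
static int first_i32(int count, ...) {
  __builtin_va_list ap;
  __builtin_va_start(ap, count);
  int r = __builtin_va_arg(ap, int);
  __builtin_va_end(ap);
  return r;
}

// After expansion, inlining and simplification the va_list machinery should
// be gone and wrapper should reduce to returning its argument.
// CHECK-LABEL: @wrapper(
// CHECK-NOT: va_start
// CHECK-NOT: va_arg
// CHECK: ret i32
int wrapper(int x) { return first_i32(1, x); }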

AMDGPU presently has no variadic support other than some ad hoc printf handling. Combined with the pass being inactive on all other targets, landing this represents a strict increase in capability with zero risk. Testing and refining will continue post-commit.

In addition to the compiler tests included here, a self-contained x64 clang/musl toolchain was built using this lowering instead of the SysV ABI and used to build various C programs such as Lua and libxml2.

@llvmbot added the cmake (Build system in general and CMake in particular), clang (Clang issues not falling into any other category), backend:AMDGPU, backend:WebAssembly, clang:codegen and llvm:transforms labels on May 25, 2024
@llvmbot
Collaborator

llvmbot commented May 25, 2024

@llvm/pr-subscribers-llvm-ir
@llvm/pr-subscribers-libc
@llvm/pr-subscribers-backend-webassembly

@llvm/pr-subscribers-clang

Author: Jon Chesterfield (JonChesterfield)


Patch is 206.77 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/93362.diff

26 Files Affected:

  • (modified) clang/lib/CodeGen/Targets/AMDGPU.cpp (+21-4)
  • (added) clang/test/CodeGen/voidptr-vaarg.c (+478)
  • (added) clang/test/CodeGenCXX/inline-then-fold-variadics.cpp (+180)
  • (modified) llvm/cmake/modules/HandleLLVMOptions.cmake (+1-1)
  • (modified) llvm/include/llvm/InitializePasses.h (+1)
  • (added) llvm/include/llvm/Transforms/IPO/ExpandVariadics.h (+43)
  • (modified) llvm/lib/Passes/PassBuilder.cpp (+1)
  • (modified) llvm/lib/Passes/PassRegistry.def (+1)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def (+4)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+3)
  • (modified) llvm/lib/Transforms/IPO/CMakeLists.txt (+1)
  • (added) llvm/lib/Transforms/IPO/ExpandVariadics.cpp (+1031)
  • (added) llvm/test/CodeGen/AMDGPU/expand-variadic-call.ll (+524)
  • (modified) llvm/test/CodeGen/AMDGPU/llc-pipeline.ll (+5)
  • (modified) llvm/test/CodeGen/AMDGPU/unsupported-calls.ll (-19)
  • (added) llvm/test/CodeGen/WebAssembly/expand-variadic-call.ll (+483)
  • (added) llvm/test/CodeGen/WebAssembly/vararg-frame.ll (+525)
  • (added) llvm/test/Transforms/ExpandVariadics/expand-va-intrinsic-split-linkage.ll (+230)
  • (added) llvm/test/Transforms/ExpandVariadics/expand-va-intrinsic-split-simple.ll (+212)
  • (added) llvm/test/Transforms/ExpandVariadics/indirect-calls.ll (+58)
  • (added) llvm/test/Transforms/ExpandVariadics/intrinsics.ll (+117)
  • (added) llvm/test/Transforms/ExpandVariadics/invoke.ll (+88)
  • (added) llvm/test/Transforms/ExpandVariadics/pass-byval-byref.ll (+148)
  • (added) llvm/test/Transforms/ExpandVariadics/pass-indirect.ll (+58)
  • (added) llvm/test/Transforms/ExpandVariadics/pass-integers.ll (+344)
  • (modified) llvm/utils/gn/secondary/llvm/lib/Transforms/IPO/BUILD.gn (+1)
diff --git a/clang/lib/CodeGen/Targets/AMDGPU.cpp b/clang/lib/CodeGen/Targets/AMDGPU.cpp
index 44e86c0b40f68..47e18535f8fe0 100644
--- a/clang/lib/CodeGen/Targets/AMDGPU.cpp
+++ b/clang/lib/CodeGen/Targets/AMDGPU.cpp
@@ -45,7 +45,7 @@ class AMDGPUABIInfo final : public DefaultABIInfo {
 
   ABIArgInfo classifyReturnType(QualType RetTy) const;
   ABIArgInfo classifyKernelArgumentType(QualType Ty) const;
-  ABIArgInfo classifyArgumentType(QualType Ty, unsigned &NumRegsLeft) const;
+  ABIArgInfo classifyArgumentType(QualType Ty, bool Variadic, unsigned &NumRegsLeft) const;
 
   void computeInfo(CGFunctionInfo &FI) const override;
   Address EmitVAArg(CodeGenFunction &CGF, Address VAListAddr,
@@ -103,19 +103,27 @@ void AMDGPUABIInfo::computeInfo(CGFunctionInfo &FI) const {
   if (!getCXXABI().classifyReturnType(FI))
     FI.getReturnInfo() = classifyReturnType(FI.getReturnType());
 
+  unsigned ArgumentIndex = 0;
+  const unsigned numFixedArguments = FI.getNumRequiredArgs();
+
   unsigned NumRegsLeft = MaxNumRegsForArgsRet;
   for (auto &Arg : FI.arguments()) {
     if (CC == llvm::CallingConv::AMDGPU_KERNEL) {
       Arg.info = classifyKernelArgumentType(Arg.type);
     } else {
-      Arg.info = classifyArgumentType(Arg.type, NumRegsLeft);
+      bool FixedArgument = ArgumentIndex++ < numFixedArguments;
+      Arg.info = classifyArgumentType(Arg.type, !FixedArgument, NumRegsLeft);
     }
   }
 }
 
 Address AMDGPUABIInfo::EmitVAArg(CodeGenFunction &CGF, Address VAListAddr,
-                                 QualType Ty) const {
-  llvm_unreachable("AMDGPU does not support varargs");
+                                 QualType Ty) const { 
+  const bool IsIndirect = false;
+  const bool AllowHigherAlign = false;
+  return emitVoidPtrVAArg(CGF, VAListAddr, Ty, IsIndirect,
+                          getContext().getTypeInfoInChars(Ty),
+                          CharUnits::fromQuantity(4), AllowHigherAlign);
 }
 
 ABIArgInfo AMDGPUABIInfo::classifyReturnType(QualType RetTy) const {
@@ -198,11 +206,20 @@ ABIArgInfo AMDGPUABIInfo::classifyKernelArgumentType(QualType Ty) const {
 }
 
 ABIArgInfo AMDGPUABIInfo::classifyArgumentType(QualType Ty,
+                                               bool Variadic,
                                                unsigned &NumRegsLeft) const {
   assert(NumRegsLeft <= MaxNumRegsForArgsRet && "register estimate underflow");
 
   Ty = useFirstFieldIfTransparentUnion(Ty);
 
+  if (Variadic) {
+      return ABIArgInfo::getDirect(/*T=*/nullptr,
+                                   /*Offset=*/0,
+                                   /*Padding=*/nullptr,
+                                   /*CanBeFlattened=*/false,
+                                   /*Align=*/0);
+    }
+  
   if (isAggregateTypeForABI(Ty)) {
     // Records with non-trivial destructors/copy-constructors should not be
     // passed by value.
diff --git a/clang/test/CodeGen/voidptr-vaarg.c b/clang/test/CodeGen/voidptr-vaarg.c
new file mode 100644
index 0000000000000..d023ddf0fb5d2
--- /dev/null
+++ b/clang/test/CodeGen/voidptr-vaarg.c
@@ -0,0 +1,478 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py
+// REQUIRES: webassembly-registered-target
+// RUN: %clang_cc1 -triple wasm32-unknown-unknown -emit-llvm -o - %s | FileCheck %s
+
+// Multiple targets use emitVoidPtrVAArg to lower va_arg instructions in clang
+// PPC is complicated, excluding from this case analysis
+// ForceRightAdjust is false for all non-PPC targets
+// AllowHigherAlign is only false for two Microsoft targets, both of which
+// pass most things by reference.
+//
+// Address emitVoidPtrVAArg(CodeGenFunction &CGF, Address VAListAddr,
+//                          QualType ValueTy, bool IsIndirect,
+//                          TypeInfoChars ValueInfo, CharUnits SlotSizeAndAlign,
+//                          bool AllowHigherAlign, bool ForceRightAdjust =
+//                          false);
+//
+// Target       IsIndirect    SlotSize  AllowHigher ForceRightAdjust
+// ARC          false             four  true        false
+// ARM          varies            four  true        false
+// Mips         false           4 or 8  true        false
+// RISCV        varies        register  true        false
+// PPC elided
+// LoongArch    varies        register  true        false
+// NVPTX WIP
+// AMDGPU WIP
+// X86_32       false             four  true        false
+// X86_64 MS    varies           eight  false       false
+// CSKY         false             four  true        false
+// Webassembly  varies            four  true        false
+// AArch64      false            eight  true        false
+// AArch64 MS   false            eight  false       false
+//
+// Webassembly passes indirectly iff it's an aggregate of multiple values
+// Choosing this as a representative architecture to check IR generation
+// partly because it has a relatively simple variadic calling convention.
+
+// Int, by itself and packed in structs
+// CHECK-LABEL: @raw_int(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[LIST_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4
+// CHECK-NEXT:    store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = load i32, ptr [[ARGP_CUR]], align 4
+// CHECK-NEXT:    ret i32 [[TMP0]]
+//
+int raw_int(__builtin_va_list list) { return __builtin_va_arg(list, int); }
+
+typedef struct {
+  int x;
+} one_int_t;
+
+// CHECK-LABEL: @one_int(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[RETVAL:%.*]] = alloca [[STRUCT_ONE_INT_T:%.*]], align 4
+// CHECK-NEXT:    [[LIST_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4
+// CHECK-NEXT:    store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i32(ptr align 4 [[RETVAL]], ptr align 4 [[ARGP_CUR]], i32 4, i1 false)
+// CHECK-NEXT:    [[COERCE_DIVE:%.*]] = getelementptr inbounds [[STRUCT_ONE_INT_T]], ptr [[RETVAL]], i32 0, i32 0
+// CHECK-NEXT:    [[TMP0:%.*]] = load i32, ptr [[COERCE_DIVE]], align 4
+// CHECK-NEXT:    ret i32 [[TMP0]]
+//
+one_int_t one_int(__builtin_va_list list) {
+  return __builtin_va_arg(list, one_int_t);
+}
+
+typedef struct {
+  int x;
+  int y;
+} two_int_t;
+
+// CHECK-LABEL: @two_int(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[LIST_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4
+// CHECK-NEXT:    store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[ARGP_CUR]], align 4
+// CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i32(ptr align 4 [[AGG_RESULT:%.*]], ptr align 4 [[TMP0]], i32 8, i1 false)
+// CHECK-NEXT:    ret void
+//
+two_int_t two_int(__builtin_va_list list) {
+  return __builtin_va_arg(list, two_int_t);
+}
+
+// Double, by itself and packed in structs
+// CHECK-LABEL: @raw_double(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[LIST_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 7
+// CHECK-NEXT:    [[ARGP_CUR_ALIGNED:%.*]] = call ptr @llvm.ptrmask.p0.i32(ptr [[TMP0]], i32 -8)
+// CHECK-NEXT:    [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR_ALIGNED]], i32 8
+// CHECK-NEXT:    store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[TMP1:%.*]] = load double, ptr [[ARGP_CUR_ALIGNED]], align 8
+// CHECK-NEXT:    ret double [[TMP1]]
+//
+double raw_double(__builtin_va_list list) {
+  return __builtin_va_arg(list, double);
+}
+
+typedef struct {
+  double x;
+} one_double_t;
+
+// CHECK-LABEL: @one_double(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[RETVAL:%.*]] = alloca [[STRUCT_ONE_DOUBLE_T:%.*]], align 8
+// CHECK-NEXT:    [[LIST_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 7
+// CHECK-NEXT:    [[ARGP_CUR_ALIGNED:%.*]] = call ptr @llvm.ptrmask.p0.i32(ptr [[TMP0]], i32 -8)
+// CHECK-NEXT:    [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR_ALIGNED]], i32 8
+// CHECK-NEXT:    store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i32(ptr align 8 [[RETVAL]], ptr align 8 [[ARGP_CUR_ALIGNED]], i32 8, i1 false)
+// CHECK-NEXT:    [[COERCE_DIVE:%.*]] = getelementptr inbounds [[STRUCT_ONE_DOUBLE_T]], ptr [[RETVAL]], i32 0, i32 0
+// CHECK-NEXT:    [[TMP1:%.*]] = load double, ptr [[COERCE_DIVE]], align 8
+// CHECK-NEXT:    ret double [[TMP1]]
+//
+one_double_t one_double(__builtin_va_list list) {
+  return __builtin_va_arg(list, one_double_t);
+}
+
+typedef struct {
+  double x;
+  double y;
+} two_double_t;
+
+// CHECK-LABEL: @two_double(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[LIST_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4
+// CHECK-NEXT:    store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[ARGP_CUR]], align 4
+// CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i32(ptr align 8 [[AGG_RESULT:%.*]], ptr align 8 [[TMP0]], i32 16, i1 false)
+// CHECK-NEXT:    ret void
+//
+two_double_t two_double(__builtin_va_list list) {
+  return __builtin_va_arg(list, two_double_t);
+}
+
+// Scalar smaller than the slot size (C would promote a short to int)
+typedef struct {
+  char x;
+} one_char_t;
+
+// CHECK-LABEL: @one_char(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[RETVAL:%.*]] = alloca [[STRUCT_ONE_CHAR_T:%.*]], align 1
+// CHECK-NEXT:    [[LIST_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4
+// CHECK-NEXT:    store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i32(ptr align 1 [[RETVAL]], ptr align 4 [[ARGP_CUR]], i32 1, i1 false)
+// CHECK-NEXT:    [[COERCE_DIVE:%.*]] = getelementptr inbounds [[STRUCT_ONE_CHAR_T]], ptr [[RETVAL]], i32 0, i32 0
+// CHECK-NEXT:    [[TMP0:%.*]] = load i8, ptr [[COERCE_DIVE]], align 1
+// CHECK-NEXT:    ret i8 [[TMP0]]
+//
+one_char_t one_char(__builtin_va_list list) {
+  return __builtin_va_arg(list, one_char_t);
+}
+
+typedef struct {
+  short x;
+} one_short_t;
+
+// CHECK-LABEL: @one_short(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[RETVAL:%.*]] = alloca [[STRUCT_ONE_SHORT_T:%.*]], align 2
+// CHECK-NEXT:    [[LIST_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4
+// CHECK-NEXT:    store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i32(ptr align 2 [[RETVAL]], ptr align 4 [[ARGP_CUR]], i32 2, i1 false)
+// CHECK-NEXT:    [[COERCE_DIVE:%.*]] = getelementptr inbounds [[STRUCT_ONE_SHORT_T]], ptr [[RETVAL]], i32 0, i32 0
+// CHECK-NEXT:    [[TMP0:%.*]] = load i16, ptr [[COERCE_DIVE]], align 2
+// CHECK-NEXT:    ret i16 [[TMP0]]
+//
+one_short_t one_short(__builtin_va_list list) {
+  return __builtin_va_arg(list, one_short_t);
+}
+
+// Composite smaller than the slot size
+typedef struct {
+  _Alignas(2) char x;
+  char y;
+} char_pair_t;
+
+// CHECK-LABEL: @char_pair(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[LIST_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4
+// CHECK-NEXT:    store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[ARGP_CUR]], align 4
+// CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i32(ptr align 2 [[AGG_RESULT:%.*]], ptr align 2 [[TMP0]], i32 2, i1 false)
+// CHECK-NEXT:    ret void
+//
+char_pair_t char_pair(__builtin_va_list list) {
+  return __builtin_va_arg(list, char_pair_t);
+}
+
+// Empty struct
+typedef struct {
+} empty_t;
+
+// CHECK-LABEL: @empty(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[RETVAL:%.*]] = alloca [[STRUCT_EMPTY_T:%.*]], align 1
+// CHECK-NEXT:    [[LIST_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 0
+// CHECK-NEXT:    store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i32(ptr align 1 [[RETVAL]], ptr align 4 [[ARGP_CUR]], i32 0, i1 false)
+// CHECK-NEXT:    ret void
+//
+empty_t empty(__builtin_va_list list) {
+  return __builtin_va_arg(list, empty_t);
+}
+
+typedef struct {
+  empty_t x;
+  int y;
+} empty_int_t;
+
+// CHECK-LABEL: @empty_int(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[RETVAL:%.*]] = alloca [[STRUCT_EMPTY_INT_T:%.*]], align 4
+// CHECK-NEXT:    [[LIST_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4
+// CHECK-NEXT:    store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i32(ptr align 4 [[RETVAL]], ptr align 4 [[ARGP_CUR]], i32 4, i1 false)
+// CHECK-NEXT:    [[TMP0:%.*]] = load i32, ptr [[RETVAL]], align 4
+// CHECK-NEXT:    ret i32 [[TMP0]]
+//
+empty_int_t empty_int(__builtin_va_list list) {
+  return __builtin_va_arg(list, empty_int_t);
+}
+
+typedef struct {
+  int x;
+  empty_t y;
+} int_empty_t;
+
+// CHECK-LABEL: @int_empty(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[RETVAL:%.*]] = alloca [[STRUCT_INT_EMPTY_T:%.*]], align 4
+// CHECK-NEXT:    [[LIST_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4
+// CHECK-NEXT:    store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i32(ptr align 4 [[RETVAL]], ptr align 4 [[ARGP_CUR]], i32 4, i1 false)
+// CHECK-NEXT:    [[COERCE_DIVE:%.*]] = getelementptr inbounds [[STRUCT_INT_EMPTY_T]], ptr [[RETVAL]], i32 0, i32 0
+// CHECK-NEXT:    [[TMP0:%.*]] = load i32, ptr [[COERCE_DIVE]], align 4
+// CHECK-NEXT:    ret i32 [[TMP0]]
+//
+int_empty_t int_empty(__builtin_va_list list) {
+  return __builtin_va_arg(list, int_empty_t);
+}
+
+// Need multiple va_arg instructions to check the postincrement
+// Using types that are passed directly as the indirect handling
+// is independent of the alignment handling in emitVoidPtrDirectVAArg.
+
+// CHECK-LABEL: @multiple_int(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[LIST_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    [[OUT0_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    [[OUT1_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    [[OUT2_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    store ptr [[OUT0:%.*]], ptr [[OUT0_ADDR]], align 4
+// CHECK-NEXT:    store ptr [[OUT1:%.*]], ptr [[OUT1_ADDR]], align 4
+// CHECK-NEXT:    store ptr [[OUT2:%.*]], ptr [[OUT2_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4
+// CHECK-NEXT:    store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = load i32, ptr [[ARGP_CUR]], align 4
+// CHECK-NEXT:    [[TMP1:%.*]] = load ptr, ptr [[OUT0_ADDR]], align 4
+// CHECK-NEXT:    store i32 [[TMP0]], ptr [[TMP1]], align 4
+// CHECK-NEXT:    [[ARGP_CUR1:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_NEXT2:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR1]], i32 4
+// CHECK-NEXT:    store ptr [[ARGP_NEXT2]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[TMP2:%.*]] = load i32, ptr [[ARGP_CUR1]], align 4
+// CHECK-NEXT:    [[TMP3:%.*]] = load ptr, ptr [[OUT1_ADDR]], align 4
+// CHECK-NEXT:    store i32 [[TMP2]], ptr [[TMP3]], align 4
+// CHECK-NEXT:    [[ARGP_CUR3:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_NEXT4:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR3]], i32 4
+// CHECK-NEXT:    store ptr [[ARGP_NEXT4]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[TMP4:%.*]] = load i32, ptr [[ARGP_CUR3]], align 4
+// CHECK-NEXT:    [[TMP5:%.*]] = load ptr, ptr [[OUT2_ADDR]], align 4
+// CHECK-NEXT:    store i32 [[TMP4]], ptr [[TMP5]], align 4
+// CHECK-NEXT:    ret void
+//
+void multiple_int(__builtin_va_list list, int *out0, int *out1, int *out2) {
+  *out0 = __builtin_va_arg(list, int);
+  *out1 = __builtin_va_arg(list, int);
+  *out2 = __builtin_va_arg(list, int);
+}
+
+// Scalars in structs are an easy way of specifying alignment from C
+// CHECK-LABEL: @increasing_alignment(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[LIST_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    [[OUT0_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    [[OUT1_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    [[OUT2_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    [[OUT3_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    store ptr [[OUT0:%.*]], ptr [[OUT0_ADDR]], align 4
+// CHECK-NEXT:    store ptr [[OUT1:%.*]], ptr [[OUT1_ADDR]], align 4
+// CHECK-NEXT:    store ptr [[OUT2:%.*]], ptr [[OUT2_ADDR]], align 4
+// CHECK-NEXT:    store ptr [[OUT3:%.*]], ptr [[OUT3_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[OUT0_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4
+// CHECK-NEXT:    store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i32(ptr align 1 [[TMP0]], ptr align 4 [[ARGP_CUR]], i32 1, i1 false)
+// CHECK-NEXT:    [[TMP1:%.*]] = load ptr, ptr [[OUT1_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_CUR1:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_NEXT2:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR1]], i32 4
+// CHECK-NEXT:    store ptr [[ARGP_NEXT2]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i32(ptr align 2 [[TMP1]], ptr align 4 [[ARGP_CUR1]], i32 2, i1 false)
+// CHECK-NEXT:    [[ARGP_CUR3:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_NEXT4:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR3]], i32 4
+// CHECK-NEXT:    store ptr [[ARGP_NEXT4]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[TMP2:%.*]] = load i32, ptr [[ARGP_CUR3]], align 4
+// CHECK-NEXT:    [[TMP3:%.*]] = load ptr, ptr [[OUT2_ADDR]], align 4
+// CHECK-NEXT:    store i32 [[TMP2]], ptr [[TMP3]], align 4
+// CHECK-NEXT:    [[ARGP_CUR5:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[TMP4:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR5]], i32 7
+// CHECK-NEXT:    [[ARGP_CUR5_ALIGNED:%.*]] = call ptr @llvm.ptrmask.p0.i32(ptr [[TMP4...
[truncated]

@llvmbot
Collaborator

llvmbot commented May 25, 2024

@llvm/pr-subscribers-llvm-transforms


@llvmbot
Collaborator

llvmbot commented May 25, 2024

@llvm/pr-subscribers-backend-amdgpu

+// CHECK-NEXT:    [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4
+// CHECK-NEXT:    store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i32(ptr align 4 [[RETVAL]], ptr align 4 [[ARGP_CUR]], i32 4, i1 false)
+// CHECK-NEXT:    [[COERCE_DIVE:%.*]] = getelementptr inbounds [[STRUCT_ONE_INT_T]], ptr [[RETVAL]], i32 0, i32 0
+// CHECK-NEXT:    [[TMP0:%.*]] = load i32, ptr [[COERCE_DIVE]], align 4
+// CHECK-NEXT:    ret i32 [[TMP0]]
+//
+one_int_t one_int(__builtin_va_list list) {
+  return __builtin_va_arg(list, one_int_t);
+}
+
+typedef struct {
+  int x;
+  int y;
+} two_int_t;
+
+// CHECK-LABEL: @two_int(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[LIST_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4
+// CHECK-NEXT:    store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[ARGP_CUR]], align 4
+// CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i32(ptr align 4 [[AGG_RESULT:%.*]], ptr align 4 [[TMP0]], i32 8, i1 false)
+// CHECK-NEXT:    ret void
+//
+two_int_t two_int(__builtin_va_list list) {
+  return __builtin_va_arg(list, two_int_t);
+}
+
+// Double, by itself and packed in structs
+// CHECK-LABEL: @raw_double(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[LIST_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 7
+// CHECK-NEXT:    [[ARGP_CUR_ALIGNED:%.*]] = call ptr @llvm.ptrmask.p0.i32(ptr [[TMP0]], i32 -8)
+// CHECK-NEXT:    [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR_ALIGNED]], i32 8
+// CHECK-NEXT:    store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[TMP1:%.*]] = load double, ptr [[ARGP_CUR_ALIGNED]], align 8
+// CHECK-NEXT:    ret double [[TMP1]]
+//
+double raw_double(__builtin_va_list list) {
+  return __builtin_va_arg(list, double);
+}
+
+typedef struct {
+  double x;
+} one_double_t;
+
+// CHECK-LABEL: @one_double(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[RETVAL:%.*]] = alloca [[STRUCT_ONE_DOUBLE_T:%.*]], align 8
+// CHECK-NEXT:    [[LIST_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 7
+// CHECK-NEXT:    [[ARGP_CUR_ALIGNED:%.*]] = call ptr @llvm.ptrmask.p0.i32(ptr [[TMP0]], i32 -8)
+// CHECK-NEXT:    [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR_ALIGNED]], i32 8
+// CHECK-NEXT:    store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i32(ptr align 8 [[RETVAL]], ptr align 8 [[ARGP_CUR_ALIGNED]], i32 8, i1 false)
+// CHECK-NEXT:    [[COERCE_DIVE:%.*]] = getelementptr inbounds [[STRUCT_ONE_DOUBLE_T]], ptr [[RETVAL]], i32 0, i32 0
+// CHECK-NEXT:    [[TMP1:%.*]] = load double, ptr [[COERCE_DIVE]], align 8
+// CHECK-NEXT:    ret double [[TMP1]]
+//
+one_double_t one_double(__builtin_va_list list) {
+  return __builtin_va_arg(list, one_double_t);
+}
+
+typedef struct {
+  double x;
+  double y;
+} two_double_t;
+
+// CHECK-LABEL: @two_double(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[LIST_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4
+// CHECK-NEXT:    store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[ARGP_CUR]], align 4
+// CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i32(ptr align 8 [[AGG_RESULT:%.*]], ptr align 8 [[TMP0]], i32 16, i1 false)
+// CHECK-NEXT:    ret void
+//
+two_double_t two_double(__builtin_va_list list) {
+  return __builtin_va_arg(list, two_double_t);
+}
+
+// Scalar smaller than the slot size (C would promote a short to int)
+typedef struct {
+  char x;
+} one_char_t;
+
+// CHECK-LABEL: @one_char(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[RETVAL:%.*]] = alloca [[STRUCT_ONE_CHAR_T:%.*]], align 1
+// CHECK-NEXT:    [[LIST_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4
+// CHECK-NEXT:    store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i32(ptr align 1 [[RETVAL]], ptr align 4 [[ARGP_CUR]], i32 1, i1 false)
+// CHECK-NEXT:    [[COERCE_DIVE:%.*]] = getelementptr inbounds [[STRUCT_ONE_CHAR_T]], ptr [[RETVAL]], i32 0, i32 0
+// CHECK-NEXT:    [[TMP0:%.*]] = load i8, ptr [[COERCE_DIVE]], align 1
+// CHECK-NEXT:    ret i8 [[TMP0]]
+//
+one_char_t one_char(__builtin_va_list list) {
+  return __builtin_va_arg(list, one_char_t);
+}
+
+typedef struct {
+  short x;
+} one_short_t;
+
+// CHECK-LABEL: @one_short(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[RETVAL:%.*]] = alloca [[STRUCT_ONE_SHORT_T:%.*]], align 2
+// CHECK-NEXT:    [[LIST_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4
+// CHECK-NEXT:    store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i32(ptr align 2 [[RETVAL]], ptr align 4 [[ARGP_CUR]], i32 2, i1 false)
+// CHECK-NEXT:    [[COERCE_DIVE:%.*]] = getelementptr inbounds [[STRUCT_ONE_SHORT_T]], ptr [[RETVAL]], i32 0, i32 0
+// CHECK-NEXT:    [[TMP0:%.*]] = load i16, ptr [[COERCE_DIVE]], align 2
+// CHECK-NEXT:    ret i16 [[TMP0]]
+//
+one_short_t one_short(__builtin_va_list list) {
+  return __builtin_va_arg(list, one_short_t);
+}
+
+// Composite smaller than the slot size
+typedef struct {
+  _Alignas(2) char x;
+  char y;
+} char_pair_t;
+
+// CHECK-LABEL: @char_pair(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[LIST_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4
+// CHECK-NEXT:    store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[ARGP_CUR]], align 4
+// CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i32(ptr align 2 [[AGG_RESULT:%.*]], ptr align 2 [[TMP0]], i32 2, i1 false)
+// CHECK-NEXT:    ret void
+//
+char_pair_t char_pair(__builtin_va_list list) {
+  return __builtin_va_arg(list, char_pair_t);
+}
+
+// Empty struct
+typedef struct {
+} empty_t;
+
+// CHECK-LABEL: @empty(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[RETVAL:%.*]] = alloca [[STRUCT_EMPTY_T:%.*]], align 1
+// CHECK-NEXT:    [[LIST_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 0
+// CHECK-NEXT:    store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i32(ptr align 1 [[RETVAL]], ptr align 4 [[ARGP_CUR]], i32 0, i1 false)
+// CHECK-NEXT:    ret void
+//
+empty_t empty(__builtin_va_list list) {
+  return __builtin_va_arg(list, empty_t);
+}
+
+typedef struct {
+  empty_t x;
+  int y;
+} empty_int_t;
+
+// CHECK-LABEL: @empty_int(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[RETVAL:%.*]] = alloca [[STRUCT_EMPTY_INT_T:%.*]], align 4
+// CHECK-NEXT:    [[LIST_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4
+// CHECK-NEXT:    store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i32(ptr align 4 [[RETVAL]], ptr align 4 [[ARGP_CUR]], i32 4, i1 false)
+// CHECK-NEXT:    [[TMP0:%.*]] = load i32, ptr [[RETVAL]], align 4
+// CHECK-NEXT:    ret i32 [[TMP0]]
+//
+empty_int_t empty_int(__builtin_va_list list) {
+  return __builtin_va_arg(list, empty_int_t);
+}
+
+typedef struct {
+  int x;
+  empty_t y;
+} int_empty_t;
+
+// CHECK-LABEL: @int_empty(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[RETVAL:%.*]] = alloca [[STRUCT_INT_EMPTY_T:%.*]], align 4
+// CHECK-NEXT:    [[LIST_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4
+// CHECK-NEXT:    store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i32(ptr align 4 [[RETVAL]], ptr align 4 [[ARGP_CUR]], i32 4, i1 false)
+// CHECK-NEXT:    [[COERCE_DIVE:%.*]] = getelementptr inbounds [[STRUCT_INT_EMPTY_T]], ptr [[RETVAL]], i32 0, i32 0
+// CHECK-NEXT:    [[TMP0:%.*]] = load i32, ptr [[COERCE_DIVE]], align 4
+// CHECK-NEXT:    ret i32 [[TMP0]]
+//
+int_empty_t int_empty(__builtin_va_list list) {
+  return __builtin_va_arg(list, int_empty_t);
+}
+
+// Need multiple va_arg instructions to check the postincrement
+// Using types that are passed directly as the indirect handling
+// is independent of the alignment handling in emitVoidPtrDirectVAArg.
+
+// CHECK-LABEL: @multiple_int(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[LIST_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    [[OUT0_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    [[OUT1_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    [[OUT2_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    store ptr [[OUT0:%.*]], ptr [[OUT0_ADDR]], align 4
+// CHECK-NEXT:    store ptr [[OUT1:%.*]], ptr [[OUT1_ADDR]], align 4
+// CHECK-NEXT:    store ptr [[OUT2:%.*]], ptr [[OUT2_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4
+// CHECK-NEXT:    store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = load i32, ptr [[ARGP_CUR]], align 4
+// CHECK-NEXT:    [[TMP1:%.*]] = load ptr, ptr [[OUT0_ADDR]], align 4
+// CHECK-NEXT:    store i32 [[TMP0]], ptr [[TMP1]], align 4
+// CHECK-NEXT:    [[ARGP_CUR1:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_NEXT2:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR1]], i32 4
+// CHECK-NEXT:    store ptr [[ARGP_NEXT2]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[TMP2:%.*]] = load i32, ptr [[ARGP_CUR1]], align 4
+// CHECK-NEXT:    [[TMP3:%.*]] = load ptr, ptr [[OUT1_ADDR]], align 4
+// CHECK-NEXT:    store i32 [[TMP2]], ptr [[TMP3]], align 4
+// CHECK-NEXT:    [[ARGP_CUR3:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_NEXT4:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR3]], i32 4
+// CHECK-NEXT:    store ptr [[ARGP_NEXT4]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[TMP4:%.*]] = load i32, ptr [[ARGP_CUR3]], align 4
+// CHECK-NEXT:    [[TMP5:%.*]] = load ptr, ptr [[OUT2_ADDR]], align 4
+// CHECK-NEXT:    store i32 [[TMP4]], ptr [[TMP5]], align 4
+// CHECK-NEXT:    ret void
+//
+void multiple_int(__builtin_va_list list, int *out0, int *out1, int *out2) {
+  *out0 = __builtin_va_arg(list, int);
+  *out1 = __builtin_va_arg(list, int);
+  *out2 = __builtin_va_arg(list, int);
+}
+
+// Scalars in structs are an easy way of specifying alignment from C
+// CHECK-LABEL: @increasing_alignment(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[LIST_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    [[OUT0_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    [[OUT1_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    [[OUT2_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    [[OUT3_ADDR:%.*]] = alloca ptr, align 4
+// CHECK-NEXT:    store ptr [[LIST:%.*]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    store ptr [[OUT0:%.*]], ptr [[OUT0_ADDR]], align 4
+// CHECK-NEXT:    store ptr [[OUT1:%.*]], ptr [[OUT1_ADDR]], align 4
+// CHECK-NEXT:    store ptr [[OUT2:%.*]], ptr [[OUT2_ADDR]], align 4
+// CHECK-NEXT:    store ptr [[OUT3:%.*]], ptr [[OUT3_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[OUT0_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_CUR:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_NEXT:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR]], i32 4
+// CHECK-NEXT:    store ptr [[ARGP_NEXT]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i32(ptr align 1 [[TMP0]], ptr align 4 [[ARGP_CUR]], i32 1, i1 false)
+// CHECK-NEXT:    [[TMP1:%.*]] = load ptr, ptr [[OUT1_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_CUR1:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_NEXT2:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR1]], i32 4
+// CHECK-NEXT:    store ptr [[ARGP_NEXT2]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i32(ptr align 2 [[TMP1]], ptr align 4 [[ARGP_CUR1]], i32 2, i1 false)
+// CHECK-NEXT:    [[ARGP_CUR3:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[ARGP_NEXT4:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR3]], i32 4
+// CHECK-NEXT:    store ptr [[ARGP_NEXT4]], ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[TMP2:%.*]] = load i32, ptr [[ARGP_CUR3]], align 4
+// CHECK-NEXT:    [[TMP3:%.*]] = load ptr, ptr [[OUT2_ADDR]], align 4
+// CHECK-NEXT:    store i32 [[TMP2]], ptr [[TMP3]], align 4
+// CHECK-NEXT:    [[ARGP_CUR5:%.*]] = load ptr, ptr [[LIST_ADDR]], align 4
+// CHECK-NEXT:    [[TMP4:%.*]] = getelementptr inbounds i8, ptr [[ARGP_CUR5]], i32 7
+// CHECK-NEXT:    [[ARGP_CUR5_ALIGNED:%.*]] = call ptr @llvm.ptrmask.p0.i32(ptr [[TMP4...
[truncated]


github-actions bot commented May 25, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@@ -197,12 +206,20 @@ ABIArgInfo AMDGPUABIInfo::classifyKernelArgumentType(QualType Ty) const {
   return ABIArgInfo::getDirect(LTy, 0, nullptr, false);
 }
 
-ABIArgInfo AMDGPUABIInfo::classifyArgumentType(QualType Ty,
+ABIArgInfo AMDGPUABIInfo::classifyArgumentType(QualType Ty, bool Variadic,
                                                unsigned &NumRegsLeft) const {
Collaborator Author

This was subtle. Structs that aren't packed into integers and passed in registers fall through to default handling which sets CanBeFlattened, saying that it's OK to spread the struct across multiple arguments. This is then very difficult to reassemble robustly using the va_arg(x, type) interface - one needs to compute how type is likely to have been spread out across part of the call frame.

Noting that these values aren't being usefully passed in registers anyway, the if (Variadic) {} sets up call instructions that pass values by value (not byval) and declares that every value shall be exactly four byte aligned (including doubles, as that's something Matt suggested for amdgpu some time ago). This means the frame setup implementation and the case analysis for testing are very straightforward.
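
To illustrate the effect (a rough sketch with hypothetical function names, not code from this patch): a struct passed as a fixed argument may still be split or repacked by the usual classification, while the same struct passed through the ellipsis keeps its own type and the four-byte slot alignment, which is what the va_arg side relies on.

typedef struct { int x; int y; } pair_t;

void fixed_arg(pair_t p);       // fixed argument: the ABI may flatten or repack p
void variadic_arg(int n, ...);  // variadic argument: p is passed as-is

void caller(pair_t p) {
  fixed_arg(p);       // classification with CanBeFlattened still applies here
  variadic_arg(1, p); // p lands in the variadic frame as a single pair_t slot,
                      // four-byte aligned per the rule described above
}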

@JonChesterfield
Collaborator Author

Joseph reports a "memory error" from a libc test when running with this patch. This is unfortunate. I haven't reproduced it yet (not that libc passes for me - libc fails with or without this patch). The blast radius for "memory error" on amdgpu is wide, but there is very little amdgpu-specific code in this patch, so it's either something handling addrspacecast incorrectly or an unlucky interaction with something outside of this patch.

My plan is to spin up a separate patch which is the non-amdgpu part of this and hope someone signs off on it - the development overhead of juggling lots of branches is significantly compromising time to solution here. Bringing up x64 / aarch64 / nvptx or similar will, if I'm lucky, uncover a bug in this pass which is causing the libc test failure.

For debugging amdgpu, I'll add more tests around addrspace cast and hope to see a bug in the IR, try to get libc to pass and, in extremis, try to build rocm from source in case the debugger helps.

JonChesterfield force-pushed the jc_varargs_amdgpu branch 2 times, most recently from 682ba92 to db14ca7 on May 28, 2024 at 14:41
// suffice here -Wno-varargs avoids warning second argument to 'va_start' is not
// the last named parameter

// RUN: %clang_cc1 %s -triple wasm32-unknown-unknown -Wno-varargs -O1 -emit-llvm -o - | opt - -S --passes='module(expand-variadics,default<O1>)' --expand-variadics-override=optimize -o - | FileCheck %s
Contributor

Does this need REQUIRES: wasm-registered-target?

Comment on lines 107 to 108
unsigned ArgumentIndex = 0;
const unsigned numFixedArguments = FI.getNumRequiredArgs();
Contributor

Can you split the clang AMDGPU ABI changes into a separate PR? The tests for this are also missing.

Collaborator Author

Yes, I think I can - the ABI change only affects variadic functions, which currently hit a fatal_error anyway - but I think the C to IR tests will succeed as long as nothing calls va_arg and it stops before codegen.

Collaborator Author

Added the test to this PR and also split out #94083. Can land that subpatch first and rebase this for a reduction in complexity.

JonChesterfield added a commit to JonChesterfield/llvm-project that referenced this pull request Jun 6, 2024
This is a mostly-target-independent variadic function optimisation and
lowering pass. It is only enabled for AMDGPU in this initial commit.

The purpose is to make C style variadic functions a zero cost
abstraction. They are lowered to equivalent IR which is then amenable to
other optimisations. This is inherently slightly target specific but
much less so than one might expect - the C varargs interface heavily
constrains the ABI design divergence.

The pass is primarily tested from webassembly. This is because wasm has
a straightforward variadic lowering strategy which coincides exactly
with what this pass transforms code into and a struct passing convention
with few cases to check. Adding further targets conventions is
straightforward and elided from this patch primarily to simplify the
review. Implemented in other branches are Linux X86, AMD64, AArch64 and
NVPTX.

Testing for targets that have existing lowering for va_arg from clang is
most efficiently done by checking that clang | opt completely elides the
variadic syntax from test cases. The lowering produces a struct for each
call site which can be inspected to check the various alignment and
indirections are correct.

AMDGPU presently has no variadic support other than some ad hoc printf
handling. Combined with the pass being inactive on all other targets
landing this represents strict increase in capability with zero risk.
Testing and refining will continue post commit.

In addition to the compiler tests included here, a self contained x64
clang/musl toolchain was constructed using the "lowering" instead of the
systemv ABI and used to build various C programs like lua and libxml2.
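
As a rough illustration of that lowering (hypothetical names; the real pass emits IR, and layout and padding follow the target ABI rather than plain C struct rules):

// Before: an ordinary variadic call.
void sink(int first, ...);
void before(void) { sink(1, 2.0, 3); }

// After (sketch): the trailing arguments are packed into a per-call-site
// struct on the stack and passed by pointer to a fixed-arity form of the callee.
struct sink_args_0 { double a0; int a1; };
void sink_fixed(int first, void *va);

void after(void) {
  struct sink_args_0 args = {2.0, 3};
  sink_fixed(1, &args);
}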
JonChesterfield added a commit to JonChesterfield/llvm-project that referenced this pull request Jun 6, 2024
Pass variadic arguments without changing their type, unlike the fixed
ones.

Fixed arguments are modified to better fit into registers. This patch
leaves those unchanged.

Splitting struct types into individual fields and packing small structs
into integers works well for passing via registers. Variadic arguments
are currently unimplemented in the backend. They're likely to be
implemented as a pointer to stack memory in which case register-themed
optimisations are inapplicable.

Splitting the struct into fields makes it difficult to implement va_arg
robustly. The rules around padding and alignment to inverse the struct
splitting could be constructed, but at high complexity and no particular
advantage.

Passing types as-is means there is a 1:1 correspondence with the type
information va_arg has to work with and the parameter type at the call
site.

This is an ABI change, but as the only functions affected are variadic
ones which are presently a compilation error, not a functional break.
Factored out of the larger llvm#93362 and can land independently.
vedantparanjape-amd pushed a commit to vedantparanjape-amd/llvm-project that referenced this pull request Jun 7, 2024
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Jun 14, 2024
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Jun 14, 2024
jhuber6 added a commit to jhuber6/llvm-project that referenced this pull request Jun 19, 2024
Summary:
This patch implements support for variadic functions for NVPTX targets.
The implementation here mainly follows what was done to implement it for
AMDGPU in llvm#93362.

We change the NVPTX codegen to lower all variadic arguments to functions
by-value. This creates a flattened set of arguments that the IR lowering
pass converts into a struct with the proper alignment.

The behavior of this function was determined by iteratively checking
what the NVCC copmiler generates for its output. See examples like
https://godbolt.org/z/KavfTGY93. I have noted the main methods that
NVIDIA uses to lower variadic functions.

1. All arguments are passed in a pointer to aggregate.
2. The minimum alignment for a plain argument is 4 bytes.
3. Alignment is dictated by the underlying type
4. Structs are flattened and do not have their alignment changed.
5. NVPTX never passes any arguments indirectly, even very large ones.

This patch passes the tests in the `libc` project currently, including
support for `sprintf`.
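
A minimal sketch, assuming the rules listed above hold (names and exact layout are illustrative, not the pass's actual output), of how the variadic portion of a call like f(fmt, 123, 4.5, (pair_t){6, 7}) might be laid out before being passed by pointer:

typedef struct { short lo; short hi; } pair_t;

struct frame_sketch {
  int    a0;    // plain int: occupies a slot with the 4-byte minimum alignment (rule 2)
  double a1;    // alignment follows the underlying type, so it starts on an 8-byte boundary (rule 3)
  short  a2_lo; // pair_t is flattened into its fields without changing their alignment (rule 4)
  short  a2_hi;
};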
jhuber6 added a commit to jhuber6/llvm-project that referenced this pull request Jun 21, 2024
jhuber6 added a commit to jhuber6/llvm-project that referenced this pull request Jun 25, 2024
jhuber6 added a commit to jhuber6/llvm-project that referenced this pull request Jul 1, 2024
Labels: backend:AMDGPU, backend:WebAssembly, clang:codegen, clang (Clang issues not falling into any other category), cmake (Build system in general and CMake in particular), libc, llvm:ir, llvm:transforms