
[WIP] Expand variadic functions in IR #89007

Closed
wants to merge 4 commits

Conversation

JonChesterfield
Collaborator

Rewrite calls to variadic functions into calls to an equivalent non-variadic function.

This makes calls to known variadic functions a zero-cost abstraction. The GPU targets use it as their backend implementation of variadic functions.

For implemented targets, the pass runs on known functions at O1 or above. For targets that use it as codegen, it runs from the backend on known and unknown functions. It interacts well with inlining, so it is scheduled shortly before the inliner.

Relative to the abandoned pull/81058, this is the whole end-to-end transform rather than a subset carved out for easier review. The feedback on that patch has been applied; in particular, this makes no attempt to detect existing va_list functions, so in some cases it will make the IR worse in a fashion that the inliner completely reverts. The ABI abstraction has been reworked to be more orthogonal, which hopefully makes it clear that the switch isn't worth scattering across the target classes.

Marked WIP because vector types in structs on X64 might be misbehaving. Still testing this: it's solid enough to build a working clang and libxml2, should meet the use cases in gpu libc, and I'll put a patch up for the rocm CI, but I'd like more test coverage and probably another architecture implemented before landing.
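
A rough before/after sketch of the rewrite (invented names; the real pass derives the packed buffer layout from the target's va_list ABI rather than using a literal struct like this):

; Before: a call to a known variadic function.
declare void @vararg(...)

define void @caller(i32 %x, double %y) {
  call void (...) @vararg(i32 %x, double %y)
  ret void
}

; After, roughly: the arguments are stored into a stack buffer and a
; va_list-taking equivalent (hypothetical name) is called instead.
declare void @vararg.valist(ptr)

define void @caller.expanded(i32 %x, double %y) {
  %buf = alloca { i32, double }, align 8
  %x.slot = getelementptr inbounds { i32, double }, ptr %buf, i32 0, i32 0
  store i32 %x, ptr %x.slot
  %y.slot = getelementptr inbounds { i32, double }, ptr %buf, i32 0, i32 1
  store double %y, ptr %y.slot
  call void @vararg.valist(ptr %buf)
  ret void
}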

@llvmbot
Collaborator

llvmbot commented Apr 17, 2024

@llvm/pr-subscribers-llvm-ir
@llvm/pr-subscribers-clang
@llvm/pr-subscribers-backend-x86
@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-llvm-transforms

Author: Jon Chesterfield (JonChesterfield)

Changes

Rewrite calls to variadic functions into calls to an equivalent non-variadic function.

This makes calls to known variadic functions a zero-cost abstraction. The GPU targets use it as their backend implementation of variadic functions.

For implemented targets, the pass runs on known functions at O1 or above. For targets that use it as codegen, it runs from the backend on known and unknown functions. It interacts well with inlining, so it is scheduled shortly before the inliner.

Relative to the abandoned pull/81058, this is the whole end-to-end transform rather than a subset carved out for easier review. The feedback on that patch has been applied; in particular, this makes no attempt to detect existing va_list functions, so in some cases it will make the IR worse in a fashion that the inliner completely reverts. The ABI abstraction has been reworked to be more orthogonal, which hopefully makes it clear that the switch isn't worth scattering across the target classes.

Marked WIP because vector types in structs on X64 might be misbehaving. Still testing this: it's solid enough to build a working clang and libxml2, should meet the use cases in gpu libc, and I'll put a patch up for the rocm CI, but I'd like more test coverage and probably another architecture implemented before landing.


Patch is 293.19 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/89007.diff

36 Files Affected:

  • (modified) clang/lib/CodeGen/ABIInfoImpl.cpp (+11-2)
  • (modified) clang/lib/CodeGen/Targets/AMDGPU.cpp (+7-1)
  • (modified) clang/lib/CodeGen/Targets/NVPTX.cpp (+7-1)
  • (added) clang/test/CodeGen/expand-variadic-call.c (+314)
  • (added) clang/test/CodeGen/variadic-wrapper-removal.c (+85)
  • (added) clang/test/CodeGenCXX/inline-then-fold-variadics.cpp (+247)
  • (modified) llvm/include/llvm/InitializePasses.h (+1)
  • (added) llvm/include/llvm/Transforms/IPO/ExpandVariadics.h (+43)
  • (modified) llvm/lib/Passes/PassBuilder.cpp (+1)
  • (modified) llvm/lib/Passes/PassBuilderPipelines.cpp (+4)
  • (modified) llvm/lib/Passes/PassRegistry.def (+1)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def (+1)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+3)
  • (modified) llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp (+4)
  • (modified) llvm/lib/Transforms/IPO/CMakeLists.txt (+1)
  • (added) llvm/lib/Transforms/IPO/ExpandVariadics.cpp (+1056)
  • (added) llvm/test/CodeGen/AMDGPU/expand-variadic-call.ll (+499)
  • (modified) llvm/test/CodeGen/AMDGPU/llc-pipeline.ll (+5)
  • (modified) llvm/test/CodeGen/AMDGPU/unsupported-calls.ll (-19)
  • (added) llvm/test/CodeGen/NVPTX/expand-variadic-call.ll (+468)
  • (added) llvm/test/CodeGen/X86/expand-variadic-call-i386-darwin.ll (+449)
  • (added) llvm/test/CodeGen/X86/expand-variadic-call-i386-linux.ll (+449)
  • (added) llvm/test/CodeGen/X86/expand-variadic-call-i686-msvc.ll (+467)
  • (added) llvm/test/CodeGen/X86/expand-variadic-call-x64-darwin.ll (+688)
  • (added) llvm/test/CodeGen/X86/expand-variadic-call-x64-linux.ll (+688)
  • (modified) llvm/test/Other/new-pm-defaults.ll (+1)
  • (modified) llvm/test/Other/new-pm-thinlto-postlink-defaults.ll (+1)
  • (modified) llvm/test/Other/new-pm-thinlto-postlink-pgo-defaults.ll (+1)
  • (modified) llvm/test/Other/new-pm-thinlto-postlink-samplepgo-defaults.ll (+1-1)
  • (modified) llvm/test/Other/new-pm-thinlto-prelink-defaults.ll (+1)
  • (modified) llvm/test/Other/new-pm-thinlto-prelink-pgo-defaults.ll (+1)
  • (modified) llvm/test/Other/new-pm-thinlto-prelink-samplepgo-defaults.ll (+1)
  • (added) llvm/test/Transforms/ExpandVariadics/expand-va-intrinsic-outliner.ll (+86)
  • (added) llvm/test/Transforms/ExpandVariadics/expand-va-intrinsic-split-linkage.ll (+225)
  • (added) llvm/test/Transforms/ExpandVariadics/expand-va-intrinsic-split-simple.ll (+121)
  • (modified) llvm/utils/gn/secondary/llvm/lib/Transforms/IPO/BUILD.gn (+1)
diff --git a/clang/lib/CodeGen/ABIInfoImpl.cpp b/clang/lib/CodeGen/ABIInfoImpl.cpp
index 3e34d82cb399ba..c4a1829ab8863f 100644
--- a/clang/lib/CodeGen/ABIInfoImpl.cpp
+++ b/clang/lib/CodeGen/ABIInfoImpl.cpp
@@ -154,11 +154,20 @@ llvm::Value *CodeGen::emitRoundPointerUpToAlignment(CodeGenFunction &CGF,
                                                     llvm::Value *Ptr,
                                                     CharUnits Align) {
   // OverflowArgArea = (OverflowArgArea + Align - 1) & -Align;
+  Ptr = CGF.Builder.CreateAddrSpaceCast(Ptr, CGF.AllocaInt8PtrTy,
+                                        Ptr->getName() + ".addrcast");
   llvm::Value *RoundUp = CGF.Builder.CreateConstInBoundsGEP1_32(
       CGF.Builder.getInt8Ty(), Ptr, Align.getQuantity() - 1);
+
+  // ptrmask is sensitive to the bitwidth of the mask
+  unsigned IndexTypeSize =
+      CGF.CGM.getDataLayout().getIndexTypeSizeInBits(RoundUp->getType());
+  llvm::IntegerType *MaskType =
+      llvm::IntegerType::get(CGF.getLLVMContext(), IndexTypeSize);
+
   return CGF.Builder.CreateIntrinsic(
-      llvm::Intrinsic::ptrmask, {CGF.AllocaInt8PtrTy, CGF.IntPtrTy},
-      {RoundUp, llvm::ConstantInt::get(CGF.IntPtrTy, -Align.getQuantity())},
+      llvm::Intrinsic::ptrmask, {CGF.AllocaInt8PtrTy, MaskType},
+      {RoundUp, llvm::ConstantInt::get(MaskType, -Align.getQuantity())},
       nullptr, Ptr->getName() + ".aligned");
 }
 
diff --git a/clang/lib/CodeGen/Targets/AMDGPU.cpp b/clang/lib/CodeGen/Targets/AMDGPU.cpp
index 44e86c0b40f686..e274741a6ee652 100644
--- a/clang/lib/CodeGen/Targets/AMDGPU.cpp
+++ b/clang/lib/CodeGen/Targets/AMDGPU.cpp
@@ -115,7 +115,13 @@ void AMDGPUABIInfo::computeInfo(CGFunctionInfo &FI) const {
 
 Address AMDGPUABIInfo::EmitVAArg(CodeGenFunction &CGF, Address VAListAddr,
                                  QualType Ty) const {
-  llvm_unreachable("AMDGPU does not support varargs");
+  const bool IsIndirect = false;
+  const bool AllowHigherAlign = true;
+  // Would rather not naturally align values
+  // Splitting {char, short} into two separate arguments makes that difficult.
+  return emitVoidPtrVAArg(CGF, VAListAddr, Ty, IsIndirect,
+                          getContext().getTypeInfoInChars(Ty),
+                          CharUnits::fromQuantity(1), AllowHigherAlign);
 }
 
 ABIArgInfo AMDGPUABIInfo::classifyReturnType(QualType RetTy) const {
diff --git a/clang/lib/CodeGen/Targets/NVPTX.cpp b/clang/lib/CodeGen/Targets/NVPTX.cpp
index 7dce5042c3dc20..d6c256e5e6f7b9 100644
--- a/clang/lib/CodeGen/Targets/NVPTX.cpp
+++ b/clang/lib/CodeGen/Targets/NVPTX.cpp
@@ -215,7 +215,13 @@ void NVPTXABIInfo::computeInfo(CGFunctionInfo &FI) const {
 
 Address NVPTXABIInfo::EmitVAArg(CodeGenFunction &CGF, Address VAListAddr,
                                 QualType Ty) const {
-  llvm_unreachable("NVPTX does not support varargs");
+  // TODO: Work out to what extent this correlates with ptx
+  // doubles get passed with 8 byte alignment and C promotes smaller integer
+  // types to int. Printf doesn't really do structs so hard to guess what
+  // the right thing is for that.
+  return emitVoidPtrVAArg(CGF, VAListAddr, Ty, false,
+                          getContext().getTypeInfoInChars(Ty),
+                          CharUnits::fromQuantity(4), true);
 }
 
 void NVPTXTargetCodeGenInfo::setTargetAttributes(
diff --git a/clang/test/CodeGen/expand-variadic-call.c b/clang/test/CodeGen/expand-variadic-call.c
new file mode 100644
index 00000000000000..baff54544207ea
--- /dev/null
+++ b/clang/test/CodeGen/expand-variadic-call.c
@@ -0,0 +1,314 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py
+
+// REQUIRES: x86-registered-target
+// RUN: %clang_cc1 -triple x86_64-unknown-linux-gnu -target-cpu x86-64-v4 -std=c23 -O1 -ffreestanding -mllvm --expand-variadics-override=disable -emit-llvm -o - %s | FileCheck %s
+
+// This test sanity checks calling a variadic function with the expansion transform disabled.
+// The IR test cases {arch}/expand-variadic-call-*.ll correspond to IR generated from this test case.
+
+typedef __builtin_va_list va_list;
+#define va_copy(dest, src) __builtin_va_copy(dest, src)
+#define va_start(ap, ...) __builtin_va_start(ap, 0)
+#define va_end(ap) __builtin_va_end(ap)
+#define va_arg(ap, type) __builtin_va_arg(ap, type)
+
+// 32 bit x86 alignment uses getTypeStackAlign for special cases
+// Whitebox testing.
+// Needs a type >= 16 which is either a simd or a struct containing a simd
+// darwinvectorabi should force 4 bytes
+// linux vectors with align 16/32/64 return that alignment
+
+
+// Might want various copy/end style constructs in a separate test
+
+void vararg(...);
+void valist(va_list);
+
+// CHECK-LABEL: @copy(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[CP:%.*]] = alloca [1 x %struct.__va_list_tag], align 16
+// CHECK-NEXT:    call void @llvm.lifetime.start.p0(i64 24, ptr nonnull [[CP]]) #[[ATTR7:[0-9]+]]
+// CHECK-NEXT:    call void @llvm.va_copy.p0(ptr nonnull [[CP]], ptr [[VA:%.*]])
+// CHECK-NEXT:    call void @valist(ptr noundef nonnull [[CP]]) #[[ATTR8:[0-9]+]]
+// CHECK-NEXT:    call void @llvm.lifetime.end.p0(i64 24, ptr nonnull [[CP]]) #[[ATTR7]]
+// CHECK-NEXT:    ret void
+//
+void copy(va_list va)
+{
+  va_list cp;
+  va_copy(cp, va);
+  valist(cp);
+}
+
+// CHECK-LABEL: @start_once(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[S:%.*]] = alloca [1 x %struct.__va_list_tag], align 16
+// CHECK-NEXT:    call void @llvm.lifetime.start.p0(i64 24, ptr nonnull [[S]]) #[[ATTR7]]
+// CHECK-NEXT:    call void @llvm.va_start.p0(ptr nonnull [[S]])
+// CHECK-NEXT:    call void @valist(ptr noundef nonnull [[S]]) #[[ATTR8]]
+// CHECK-NEXT:    call void @llvm.va_end.p0(ptr [[S]])
+// CHECK-NEXT:    call void @llvm.lifetime.end.p0(i64 24, ptr nonnull [[S]]) #[[ATTR7]]
+// CHECK-NEXT:    ret void
+//
+void start_once(...)
+{
+  va_list s;
+  va_start(s);
+  valist(s);
+  va_end(s);
+}
+
+// CHECK-LABEL: @start_twice(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[S0:%.*]] = alloca [1 x %struct.__va_list_tag], align 16
+// CHECK-NEXT:    [[S1:%.*]] = alloca [1 x %struct.__va_list_tag], align 16
+// CHECK-NEXT:    call void @llvm.lifetime.start.p0(i64 24, ptr nonnull [[S0]]) #[[ATTR7]]
+// CHECK-NEXT:    call void @llvm.lifetime.start.p0(i64 24, ptr nonnull [[S1]]) #[[ATTR7]]
+// CHECK-NEXT:    call void @llvm.va_start.p0(ptr nonnull [[S0]])
+// CHECK-NEXT:    call void @valist(ptr noundef nonnull [[S0]]) #[[ATTR8]]
+// CHECK-NEXT:    call void @llvm.va_end.p0(ptr [[S0]])
+// CHECK-NEXT:    call void @llvm.va_start.p0(ptr nonnull [[S1]])
+// CHECK-NEXT:    call void @valist(ptr noundef nonnull [[S1]]) #[[ATTR8]]
+// CHECK-NEXT:    call void @llvm.va_end.p0(ptr [[S1]])
+// CHECK-NEXT:    call void @llvm.lifetime.end.p0(i64 24, ptr nonnull [[S1]]) #[[ATTR7]]
+// CHECK-NEXT:    call void @llvm.lifetime.end.p0(i64 24, ptr nonnull [[S0]]) #[[ATTR7]]
+// CHECK-NEXT:    ret void
+//
+void start_twice(...)
+{
+  va_list s0,s1;
+  va_start(s0);
+  valist(s0);
+  va_end(s0);
+  va_start(s1);
+  valist(s1);
+  va_end(s1);
+}
+
+// vectors with alignment 16/32/64 are natively aligned on linux x86
+// v32f32 would be a m1024 type, larger than x64 defines at time of writing
+typedef int i32;
+typedef float v4f32 __attribute__((__vector_size__(16), __aligned__(16)));
+typedef float v8f32 __attribute__((__vector_size__(32), __aligned__(32)));
+typedef float v16f32 __attribute__((__vector_size__(64), __aligned__(64)));
+typedef float v32f32 __attribute__((__vector_size__(128), __aligned__(128)));
+
+
+// Pass a single value to wrapped() via vararg(...)
+// CHECK-LABEL: @single_i32(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    tail call void (...) @vararg(i32 noundef [[X:%.*]]) #[[ATTR8]]
+// CHECK-NEXT:    ret void
+//
+void single_i32(i32 x)
+{
+  vararg(x);
+}
+
+
+// CHECK-LABEL: @single_double(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    tail call void (...) @vararg(double noundef [[X:%.*]]) #[[ATTR8]]
+// CHECK-NEXT:    ret void
+//
+void single_double(double x)
+{
+  vararg(x);
+}
+
+// CHECK-LABEL: @single_v4f32(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    tail call void (...) @vararg(<4 x float> noundef [[X:%.*]]) #[[ATTR8]]
+// CHECK-NEXT:    ret void
+//
+void single_v4f32(v4f32 x)
+{
+  vararg(x);
+}
+
+// CHECK-LABEL: @single_v8f32(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    tail call void (...) @vararg(<8 x float> noundef [[X:%.*]]) #[[ATTR8]]
+// CHECK-NEXT:    ret void
+//
+void single_v8f32(v8f32 x)
+{
+  vararg(x);
+}
+
+// CHECK-LABEL: @single_v16f32(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    tail call void (...) @vararg(<16 x float> noundef [[X:%.*]]) #[[ATTR8]]
+// CHECK-NEXT:    ret void
+//
+void single_v16f32(v16f32 x)
+{
+  vararg(x);
+}
+
+// CHECK-LABEL: @single_v32f32(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[INDIRECT_ARG_TEMP:%.*]] = alloca <32 x float>, align 128
+// CHECK-NEXT:    [[X:%.*]] = load <32 x float>, ptr [[TMP0:%.*]], align 128, !tbaa [[TBAA2:![0-9]+]]
+// CHECK-NEXT:    store <32 x float> [[X]], ptr [[INDIRECT_ARG_TEMP]], align 128, !tbaa [[TBAA2]]
+// CHECK-NEXT:    tail call void (...) @vararg(ptr noundef nonnull byval(<32 x float>) align 128 [[INDIRECT_ARG_TEMP]]) #[[ATTR8]]
+// CHECK-NEXT:    ret void
+//
+void single_v32f32(v32f32 x)
+{
+  vararg(x);
+}
+
+
+
+// CHECK-LABEL: @i32_double(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    tail call void (...) @vararg(i32 noundef [[X:%.*]], double noundef [[Y:%.*]]) #[[ATTR8]]
+// CHECK-NEXT:    ret void
+//
+void i32_double(i32 x, double y)
+{
+  vararg(x, y);
+}
+
+// CHECK-LABEL: @double_i32(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    tail call void (...) @vararg(double noundef [[X:%.*]], i32 noundef [[Y:%.*]]) #[[ATTR8]]
+// CHECK-NEXT:    ret void
+//
+void double_i32(double x, i32 y)
+{
+  vararg(x, y);
+}
+
+
+// A struct used by libc variadic tests
+
+typedef struct {
+  char c;
+  short s;
+  int i;
+  long l;
+  float f;
+  double d;
+}  libcS;
+
+// CHECK-LABEL: @i32_libcS(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    tail call void (...) @vararg(i32 noundef [[X:%.*]], ptr noundef nonnull byval([[STRUCT_LIBCS:%.*]]) align 8 [[Y:%.*]]) #[[ATTR8]]
+// CHECK-NEXT:    ret void
+//
+void i32_libcS(i32 x, libcS y)
+{
+  vararg(x, y);
+}
+
+// CHECK-LABEL: @libcS_i32(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    tail call void (...) @vararg(ptr noundef nonnull byval([[STRUCT_LIBCS:%.*]]) align 8 [[X:%.*]], i32 noundef [[Y:%.*]]) #[[ATTR8]]
+// CHECK-NEXT:    ret void
+//
+void libcS_i32(libcS x, i32 y)
+{
+  vararg(x, y);
+}
+
+
+// CHECK-LABEL: @i32_v4f32(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    tail call void (...) @vararg(i32 noundef [[X:%.*]], <4 x float> noundef [[Y:%.*]]) #[[ATTR8]]
+// CHECK-NEXT:    ret void
+//
+void i32_v4f32(i32 x, v4f32 y)
+{
+  vararg(x, y);
+}
+
+// CHECK-LABEL: @v4f32_i32(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    tail call void (...) @vararg(<4 x float> noundef [[X:%.*]], i32 noundef [[Y:%.*]]) #[[ATTR8]]
+// CHECK-NEXT:    ret void
+//
+void v4f32_i32(v4f32 x, i32 y)
+{
+  vararg(x, y);
+}
+
+// CHECK-LABEL: @i32_v8f32(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    tail call void (...) @vararg(i32 noundef [[X:%.*]], <8 x float> noundef [[Y:%.*]]) #[[ATTR8]]
+// CHECK-NEXT:    ret void
+//
+void i32_v8f32(i32 x, v8f32 y)
+{
+  vararg(x, y);
+}
+
+// CHECK-LABEL: @v8f32_i32(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    tail call void (...) @vararg(<8 x float> noundef [[X:%.*]], i32 noundef [[Y:%.*]]) #[[ATTR8]]
+// CHECK-NEXT:    ret void
+//
+void v8f32_i32(v8f32 x, i32 y)
+{
+  vararg(x, y);
+}
+
+// CHECK-LABEL: @i32_v16f32(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    tail call void (...) @vararg(i32 noundef [[X:%.*]], <16 x float> noundef [[Y:%.*]]) #[[ATTR8]]
+// CHECK-NEXT:    ret void
+//
+void i32_v16f32(i32 x, v16f32 y)
+{
+  vararg(x, y);
+}
+
+// CHECK-LABEL: @v16f32_i32(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    tail call void (...) @vararg(<16 x float> noundef [[X:%.*]], i32 noundef [[Y:%.*]]) #[[ATTR8]]
+// CHECK-NEXT:    ret void
+//
+void v16f32_i32(v16f32 x, i32 y)
+{
+  vararg(x, y);
+}
+
+// CHECK-LABEL: @i32_v32f32(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[INDIRECT_ARG_TEMP:%.*]] = alloca <32 x float>, align 128
+// CHECK-NEXT:    [[Y:%.*]] = load <32 x float>, ptr [[TMP0:%.*]], align 128, !tbaa [[TBAA2]]
+// CHECK-NEXT:    store <32 x float> [[Y]], ptr [[INDIRECT_ARG_TEMP]], align 128, !tbaa [[TBAA2]]
+// CHECK-NEXT:    tail call void (...) @vararg(i32 noundef [[X:%.*]], ptr noundef nonnull byval(<32 x float>) align 128 [[INDIRECT_ARG_TEMP]]) #[[ATTR8]]
+// CHECK-NEXT:    ret void
+//
+void i32_v32f32(i32 x, v32f32 y)
+{
+  vararg(x, y);
+}
+
+// CHECK-LABEL: @v32f32_i32(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[INDIRECT_ARG_TEMP:%.*]] = alloca <32 x float>, align 128
+// CHECK-NEXT:    [[X:%.*]] = load <32 x float>, ptr [[TMP0:%.*]], align 128, !tbaa [[TBAA2]]
+// CHECK-NEXT:    store <32 x float> [[X]], ptr [[INDIRECT_ARG_TEMP]], align 128, !tbaa [[TBAA2]]
+// CHECK-NEXT:    tail call void (...) @vararg(ptr noundef nonnull byval(<32 x float>) align 128 [[INDIRECT_ARG_TEMP]], i32 noundef [[Y:%.*]]) #[[ATTR8]]
+// CHECK-NEXT:    ret void
+//
+void v32f32_i32(v32f32 x, i32 y)
+{
+  vararg(x, y);
+}
+
+#if __AMDGPU__ || __NVPTX__
+
+
+void (*volatile vararg_ptr)(...) = &vararg;
+
+void indirect_single_i32(i32 x)
+{
+  vararg_ptr(x);
+}
+
+
+#endif
diff --git a/clang/test/CodeGen/variadic-wrapper-removal.c b/clang/test/CodeGen/variadic-wrapper-removal.c
new file mode 100644
index 00000000000000..13499612f0757f
--- /dev/null
+++ b/clang/test/CodeGen/variadic-wrapper-removal.c
@@ -0,0 +1,85 @@
+// REQUIRES: x86-registered-target
+// RUN: %clang_cc1 -triple i386-unknown-linux-gnu -O1 -emit-llvm -o - %s | opt --passes='module(expand-variadics,inline)' -S | FileCheck %s
+// RUN: %clang_cc1 -triple=x86_64-linux-gnu -O1 -emit-llvm -o - %s | opt --passes='module(expand-variadics,inline)' -S | FileCheck %s
+
+// neither arm arch is implemented yet, leaving it here as a reminder
+// armv6 is a ptr as far as the struct is concerned, but possibly also a [1 x i32] passed by value
+// that seems insistent, maybe leave 32 bit arm alone for now
+// aarch64 is a struct of five things passed using byval memcpy
+
+// R-N: %clang_cc1 -triple=armv6-none--eabi -O1 -emit-llvm -o - %s | opt --passes=expand-variadics -S | FileCheck %s
+// R-N: %clang_cc1 -triple=aarch64-none-linux-gnu -O1 -emit-llvm -o - %s | opt --passes=expand-variadics -S | FileCheck %s
+
+
+
+// expand-variadics rewrites calls to variadic functions into calls to
+// equivalent functions that take a va_list argument. A property of the
+// implementation is that said "equivalent function" may be a pre-existing one.
+// This is equivalent to inlining a sufficiently simple variadic wrapper.
+
+#include <stdarg.h>
+
+typedef int FILE; // close enough for this test
+
+// fprintf is sometimes implemented as a call to vfprintf. That fits the
+// pattern the transform pass recognises - given an implementation of fprintf
+// in the IR module, calls to it can be rewritten into calls into vfprintf.
+
+// CHECK-LABEL: define{{.*}} i32 @fprintf(
+// CHECK-LABEL: define{{.*}} i32 @call_fprintf(
+// CHECK-NOT:   @fprintf
+// CHECK:       @vfprintf
+int vfprintf(FILE *restrict f, const char *restrict fmt, va_list ap);
+int fprintf(FILE *restrict f, const char *restrict fmt, ...)
+{
+  int ret;
+  va_list ap;
+  va_start(ap, fmt);
+  ret = vfprintf(f, fmt, ap);
+  va_end(ap);
+  return ret;
+}
+int call_fprintf(FILE *f)
+{
+  int x = 42;
+  double y = 3.14;
+  return fprintf(f, "int %d dbl %g\n", x, y);
+}
+
+// Void return type is also OK
+
+// CHECK-LABEL: define{{.*}} void @no_result(
+// CHECK-LABEL: define{{.*}} void @call_no_result(
+// CHECK-NOT:   @no_result
+// CHECK:       @vno_result
+void vno_result(const char * fmt, va_list);
+void no_result(const char * fmt, ...)
+{
+  va_list ap;
+  va_start(ap, fmt);
+  vno_result(fmt, ap);
+  va_end(ap);
+}
+void call_no_result(FILE *f)
+{
+  int x = 101;
+  no_result("", x);
+}
+
+// The vaend in the forwarding implementation is optional where it's a no-op
+
+// CHECK-LABEL: define{{.*}} i32 @no_vaend(
+// CHECK-LABEL: define{{.*}} i32 @call_no_vaend(
+// CHECK-NOT:   @no_vaend
+// CHECK:       @vno_vaend
+int vno_vaend(int x, va_list);
+int no_vaend(int x, ...)
+{
+  va_list ap;
+  va_start(ap, x);
+  return vno_vaend(x, ap);
+}
+int call_no_vaend(int x)
+{
+  return no_vaend(x, 10, 2.5f);
+}
diff --git a/clang/test/CodeGenCXX/inline-then-fold-variadics.cpp b/clang/test/CodeGenCXX/inline-then-fold-variadics.cpp
new file mode 100644
index 00000000000000..c7b5863b2632a6
--- /dev/null
+++ b/clang/test/CodeGenCXX/inline-then-fold-variadics.cpp
@@ -0,0 +1,247 @@
+// RUN: %clang_cc1 -triple i386-unknown-linux-gnu -Wno-varargs -O1 -emit-llvm -o - %s | opt --passes=expand-variadics | opt -S -O1 | FileCheck %s --check-prefixes=CHECK,X86Linux
+
+
+// RUN: %clang_cc1 -triple x86_64-linux-gnu -Wno-varargs -O1 -emit-llvm -o - %s | opt --passes=expand-variadics | opt -S -O1 | FileCheck %s --check-prefixes=CHECK,X64SystemV
+
+
+// RUN: %clang_cc1 -triple i386-apple-darwin -Wno-varargs -O1 -emit-llvm -o - %s | opt --passes=expand-variadics | opt -S -O1 | FileCheck %s --check-prefixes=CHECK,X86Darwin
+
+// RUN: %clang_cc1 -triple x86_64-apple-darwin -Wno-varargs -O1 -emit-llvm -o - %s | opt --passes=expand-variadics | opt -S -O1 | FileCheck %s --check-prefixes=CHECK,X64SystemV
+
+
+// The clang test suite has _lots_ of windows related triples in it
+// 'x86_64-pc-windows-msvc|i686-windows-msvc|thumbv7-windows|aarch64-windows|i686-windows|x86_64-windows|x86_64-unknown-windows-msvc|i386-windows-pc|x86_64--windows-msvc|i686--windows-msvc|x86_64-unknown-windows-gnu|i686-unknown-windows-msvc|i686-unknown-windows-gnu|arm64ec-pc-windows-msvc19.20.0|i686-pc-windows-msvc19.14.0|i686-pc-windows|x86_64--windows-gnu|i686--windows-gnu|thumbv7--windows|i386-windows|x86_64-unknown-windows-pc|i686--windows|x86_64--windows|i686-w64-windows-gnu'
+
+// Might be detecting an inconsistency - maybe different alignment
+// Presently failing on an unusual calling convention
+
+// i686 windows emits suboptimal codegen. sroa removes a field from a struct which misaligns a field which blocks store/load forwarding
+// RUN: %clang_cc1 -triple i686-windows-msvc -Wno-varargs -O1 -emit-llvm -o - %s | opt --passes=expand-variadics | opt -S -O1 | FileCheck %s --check-prefixes=CHECK,X86Windows
+
+
+// 64 bit windows va_arg passes most type indirectly but the call instruction uses the types by value
+// ___: %clang_cc1 -triple x86_64-pc-windows-msvc -Wno-varargs -O1 -emit-llvm -o - %s | opt --passes=expand-variadics | opt -S -O1 | FileCheck %s --check-prefixes=CHECK
+
+
+
+// amdgpu emits a sequence of addrspace casts that aren't folded yet
+// todo: match it anyway
+// R-N: %clang_cc1 -triple amdgcn-amd-amdhsa -Wno-varargs -O1 -emit-llvm -o - %s | opt --passes=expand-variadics | opt -S -O1 | FileCheck %s
+
+// Requires the instcombine patch that hasn't landed yet
+// RUN: %clang_cc1 -triple nvptx64-nvidia-cuda -Wno-varargs -O1 -emit-llvm -o - %s | opt --passes=expand-variadics | opt -S -O1 | FileCheck %s
+
+
+
+
+
+// Not yet implemented on arm
+// Also there are various x86 variants that should be in the triple
+
+// Checks for consistency between clang and expand-va-intrinics
+// 1. Use clang to lower va_arg
+// 2. Use expand-variadics to lower the rest of the variadic operations
+// 3. Use opt -O1 to simplify the functions to ret %arg
+// The simplification to ret %arg will fail when the two are not consistent, modulo bugs elsewhere.
+
+#include <stdarg.h>
+
+template <typename X, typename Y>
+static X first(...) {
+  va_list va;
+  __builtin_va_start(va, 0);
+  X r = va_arg(va, X);
+  va_end(va);
+  return r;
+}
+
+template <typename X, typename Y>
+static Y second(...) {
+  va_list va;
+  __builtin_va_start(va, 0);
+  va_arg(va, X);
+  Y r = va_arg(va, Y);
+  va_end(va);
+  return r;
+}
+
+typedef float float4 __attribute__((__vector_size__(16), __aligned__(16)));
+typedef float float8 __attribute__((__vector_size__(32), __aligned__(32)));
+typedef float float16 __attribute__((__vector_size__(64), __aligned__(64)));
+typedef float float32 __attribute__((__vector_size__(128), __aligned__(128)));
+
+
+extern "C"
+{
+// CHECK-LABEL: define{{.*}} i32 @first_i32_i32(i32{{.*}} %x, i32{{.*}} %y)
+// CHECK:       entry:
+// CHECK:       ret i32 %x
+int first_i32_i32(int x, int y)
+{
+  return first<int,in...
[truncated]

@llvmbot
Collaborator

llvmbot commented Apr 17, 2024

@llvm/pr-subscribers-clang-codegen

@@ -154,11 +154,20 @@ llvm::Value *CodeGen::emitRoundPointerUpToAlignment(CodeGenFunction &CGF,
llvm::Value *Ptr,
CharUnits Align) {
// OverflowArgArea = (OverflowArgArea + Align - 1) & -Align;
Ptr = CGF.Builder.CreateAddrSpaceCast(Ptr, CGF.AllocaInt8PtrTy,
Ptr->getName() + ".addrcast");
Collaborator

I can sort of guess why you're doing this, but probably you want to rename the method so it's clear why it's specialized this way.

Collaborator Author

This might not be the right place for it - amdgpu passes a pointer in addrspace(5) which then hits an assert with the current implementation. The va_* intrinsics specify an address space now which should permit a cleaner answer, though I'm seeing p0 used from clang for targets where it shouldn't be. That's on the WIP part.

Contributor

This isn't the right place for it; I would expect this function to preserve whatever address space was passed in.
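
Something like this shape, staying in the incoming address space (a sketch, assuming amdgcn's 32-bit index width for addrspace(5)):

; Round %p up to 16 bytes without leaving addrspace(5). Note the ptrmask mask
; width must match the pointer's index type, which is i32 here rather than
; the module-wide IntPtrTy.
define ptr addrspace(5) @round_up_16(ptr addrspace(5) %p) {
  %biased = getelementptr inbounds i8, ptr addrspace(5) %p, i32 15
  %aligned = call ptr addrspace(5) @llvm.ptrmask.p5.i32(ptr addrspace(5) %biased, i32 -16)
  ret ptr addrspace(5) %aligned
}

declare ptr addrspace(5) @llvm.ptrmask.p5.i32(ptr addrspace(5), i32)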

// varargs can't be tail called anyway
// Not totally convinced this is necessary but dead store elimination
// will discard the stores to the Alloca and pass uninitialised memory along
// instead when the function is marked tailcall
Collaborator

I'll take another shot at explaining this...

From LangRef: "[tail and musttail] imply that the callee does not access allocas from the caller." This applies whether or not it's actually possible to do sibling-call/tail-call optimization on the call. And in fact, "tail" has basically no other semantic implications. So LLVM IR optimizations infer "tail" pretty aggressively.

Because of this definition, alias analysis assumes calls marked "tail" don't read/write allocas. So the DSE behavior is expected.

Collaborator Author

Tail on varargs functions is sort of valid in that the callee doesn't access an alloca. The callee does tend to dig around in the stack of the caller though, so tail calling varargs functions only works out nicely if the callee rearranges the stack frame on behalf of the caller. Which it can do, but usually doesn't.

I think your reading is right that DSE making this inference is reasonable. I think it's also right that a varargs function could be tail called on some implementations, though isn't on the ones considered so far in this patch.
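
A minimal sketch of the hazard (invented names): the pass builds the argument buffer in the caller's frame, so the rewritten call must not keep the tail marker.

; With the call marked tail, alias analysis may assume the callee never
; accesses the caller's allocas, so DSE treats the store as dead and the
; callee would observe uninitialised memory through %buf.
declare void @vararg.valist(ptr)

define void @bad(i32 %x) {
  %buf = alloca i32
  store i32 %x, ptr %buf   ; deleted by DSE under the tail assumption
  tail call void @vararg.valist(ptr %buf)
  ret void
}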

}

ExpandVariadicsPass::ExpandVariadicsPass(OptimizationLevel Level)
: ExpandVariadicsPass(Level == OptimizationLevel::O0
Contributor

I don't think this works; there's a cyclic dependency here on LLVMPasses, which defines this. The standard way to handle this should be when we register the pass in the PassBuilderPipeline.

Collaborator Author

I'll fix this, thanks. Curious that the build succeeds for me and fails for you. I'm not confident our build graph is acyclic but certainly do not want to introduce cycles.

return CGF.Builder.CreateIntrinsic(
llvm::Intrinsic::ptrmask, {CGF.AllocaInt8PtrTy, CGF.IntPtrTy},
{RoundUp, llvm::ConstantInt::get(CGF.IntPtrTy, -Align.getQuantity())},
llvm::Intrinsic::ptrmask, {CGF.AllocaInt8PtrTy, MaskType},
Contributor

This seems like it was bad to begin with; I think this should have just been using Ptr->getType() instead of the CGF types.

Collaborator Author

Yeah, swapping it out for Ptr->getType() is a much better choice. That works exactly as one would hope.

Comment on lines 120 to 121
// Would rather not naturally align values
// Splitting {char, short} into two separate arguments makes that difficult.
Contributor

Not sure what this comment is saying. 4 is the only correct alignment for anything stack related

Collaborator Author

Currently amdgpu takes a struct {char x; short y;} example; and splats it across two registers. Doesn't seem totally good to me but I'm trying to avoid changing the non-variadic calling convention at the same time.

This means foo(example) and bar(char, short) look the same in IR, thus they need to have consistent representations as far as va_arg is concerned. The specific case that caused me trouble here was trying to pass a double with four byte alignment as opposed to eight (that's what the "naturally align values" comment relates to).

I'll see if I can capture this in an amdgpu specific test case and fix up the comment

Contributor

I thought all structs under 8 bytes were packed into i32/i64

Collaborator Author

Nope. At least, not what I got under testing. That sounds like a good idea though.

Collaborator Author:

^ Structs < 8 bytes are packed into integers, but structs larger than that are spread across multiple parameters in a padding-discarding sense. Better testing revealed that one-byte-aligning things does not work out robustly, so I've had to take a different strategy to make all the cases work. However, that does mean everything is 4-byte aligned now.
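
A sketch of the ambiguity under discussion (the lowering is as described above; I haven't re-verified it against trunk):

struct Example { char X; short Y; };
void foo(Example E);       // may be splatted across two registers on amdgpu
void bar(char X, short Y); // already two scalar arguments
// If foo's struct is split per field, the call sites foo({C, S}) and
// bar(C, S) can lower to identical IR, so va_arg must reconstruct either
// form from one consistent layout rule.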

class OptimizationLevel;

enum class ExpandVariadicsMode {
unspecified,
Contributor:

Capitalize and document the fields

// changed to be consistent with Module()'s naming. Implementing as a local
// function here in the meantime to decouple from that process.
Function *getPreexistingDeclaration(Module *M, Intrinsic::ID id,
ArrayRef<Type *> Tys = std::nullopt) {
Contributor:

Suggested change
ArrayRef<Type *> Tys = std::nullopt) {
ArrayRef<Type *> Tys = {}) {

Collaborator Author:

Written to be consistent with Module::getFunction, which uses nullopt. I need to open an issue about this; we can't change getDeclaration to return null on a missing declaration without causing a lot of breakage. I think the right thing is to rename getDeclaration to getOrCreateDeclaration, introduce another function called getPreexistingDeclaration, wait for at least some of the downstream projects to do the rename, then rename getPreexistingDeclaration to getDeclaration and thus be consistent with Module...
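
For reference, a minimal sketch of the interim local helper (assuming the existing Intrinsic::getType and Intrinsic::getName overloads; the overloaded case needs the mangled name):

// Unlike Intrinsic::getDeclaration, return nullptr instead of creating a
// declaration that is missing from the module.
Function *getPreexistingDeclaration(Module *M, Intrinsic::ID Id,
                                    ArrayRef<Type *> Tys = {}) {
  auto *FT = Intrinsic::getType(M->getContext(), Id, Tys);
  return M->getFunction(Tys.empty() ? Intrinsic::getName(Id)
                                    : Intrinsic::getName(Id, Tys, M, FT));
}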


Idxs[2] = ConstantInt::get(I32, 1);
Builder.CreateStore(
ConstantInt::get(I32, 6 * 8 + 8 * 16),
Contributor:

Comment on where these magic expressions were derived?

Collaborator Author:

Oh right. This needs a whole comment block on why this collection of stores is known to create a va_list which is correct on x64, since that's not very obvious. I'll think about how to express that, and probably write down the aarch64 one as well; it's the same sort of construct.
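
For the record, the derivation is a property of the SysV x86-64 ABI rather than of this patch. va_list there is a one-element array of a struct laid out as:

struct VAListTag {         // SysV x86-64 __va_list_tag
  unsigned int GpOffset;   // next unused GP slot in the register save area
  unsigned int FpOffset;   // next unused FP slot in the register save area
  void *OverflowArgArea;   // next stack-passed argument
  void *RegSaveArea;       // start of the register save area
};

Six general purpose argument registers are saved at 8 bytes each and eight vector registers at 16 bytes each, so offsets of 6 * 8 = 48 for GpOffset and 6 * 8 + 8 * 16 = 176 for FpOffset mark every register as already consumed. va_arg then reads solely from OverflowArgArea, which is where this pass materialises the arguments.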


Idxs[2] = ConstantInt::get(I32, 3);
Builder.CreateStore(
ConstantPointerNull::get(PointerType::getUnqual(Ctx)),
Contributor:

Don't use getUnqual; use the explicit address space. Best to query the alloca addrspace?
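
i.e. something like this sketch, where VaListFieldPtr stands in for the GEP from the surrounding code and DL for the module's DataLayout (both names assumed):

PointerType *PtrTy = PointerType::get(Ctx, DL.getAllocaAddrSpace());
Builder.CreateStore(ConstantPointerNull::get(PtrTy), VaListFieldPtr);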

Value *calledOperand = CB->getCalledOperand();
if (F == calledOperand) {
report_fatal_error(
"ExpandVA abi requires eliminating call uses first\n");
Contributor:

Don't think these are supposed to have newlines

Comment on lines +684 to +685
Equivalent->setLinkage(F->getLinkage());
Equivalent->setVisibility(F->getVisibility());
Equivalent->takeName(F);
Contributor:

Can you use copyAttributesFrom?

NF->copyAttributesFrom(&F);
NF->setComdat(F.getComdat()); // beware weak
F.getParent()->getFunctionList().insert(F.getIterator(), NF);
NF->setName(F.getName() + ".valist");
Contributor:

If it's ABI, the new function should retain the original name?

Collaborator Author:

It does; see further down. I'm trying to keep the control flow linear, but could change that.

// the fields to be correct at runtime. Use the native stack alignment instead
// if that's greater as that tends to give better codegen.
Align AllocaAlign = MaxFieldAlign;
if (DL.exceedsNaturalStackAlignment(Align(1024))) {
Contributor:

Where did 1024 come from?

Collaborator Author:

This is really fun. The DataLayout class asserts if you ask for the stack alignment when the layout string didn't specify one (which nvptx often doesn't), and it has no means of testing directly whether the stack alignment is specified; hence the TODO below this line. 1024 is a credible guess at a higher alignment than the stack has naturally. I need a separate patch changing DataLayout, and then to rebase this on top of it.

Collaborator Author:

I think adding the following to DataLayout is reasonable; it mirrors the existing exceedsNaturalStackAlignment:

/// Returns true if the given alignment is below the natural stack alignment.
bool belowNaturalStackAlignment(Align Alignment) const {
  return StackNaturalAlign && (Alignment < *StackNaturalAlign);
}
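
The call site above would then read something like this sketch, with MaxFieldAlign and DL taken from the surrounding pass:

Align AllocaAlign = MaxFieldAlign;
if (DL.belowNaturalStackAlignment(AllocaAlign))
  AllocaAlign = DL.getStackAlignment(); // only reached when the layout string sets one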

unsigned NumArgs = FuncType->getNumParams();

SmallVector<Value *> Args;
Args.assign(CB->arg_begin(), CB->arg_begin() + NumArgs);
Contributor:

Move to constructor?


JonChesterfield commented Apr 17, 2024

Drive-by fixes for some of Matt's comments. Through luck, the enum rename caught a missing line in a .def file for NVPTX, which means the Other/new-pm-thinlto-postlink-samplepgo-defaults.ll style tests need to be patched again. I'm leaving that for now, as I want to check whether the passes run in the order of PassRegistry.def (in which case expand-variadics should probably happen before always-inline); the failing test is a reminder to myself of that.

edit: The mysteriously missing patches were made to the amd-staging branch because I didn't pay enough attention to where git was pointing, resolving that....


github-actions bot commented Apr 18, 2024

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff 78eb2c2cba5ad0f9a36ec1959371527313a76bbe c6c01c5ac8e974d94607cdea37c74194513c7a89 -- clang/test/CodeGen/expand-variadic-call.c clang/test/CodeGen/variadic-wrapper-removal.c clang/test/CodeGenCXX/inline-then-fold-variadics.cpp llvm/include/llvm/Transforms/IPO/ExpandVariadics.h llvm/lib/Transforms/IPO/ExpandVariadics.cpp clang/lib/CodeGen/ABIInfoImpl.cpp clang/lib/CodeGen/Targets/AMDGPU.cpp clang/lib/CodeGen/Targets/NVPTX.cpp llvm/include/llvm/IR/DataLayout.h llvm/include/llvm/InitializePasses.h llvm/lib/Passes/PassBuilder.cpp llvm/lib/Passes/PassBuilderPipelines.cpp llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp
View the diff from clang-format here.
diff --git a/clang/lib/CodeGen/ABIInfoImpl.cpp b/clang/lib/CodeGen/ABIInfoImpl.cpp
index 8e9a239eec..e4412febbe 100644
--- a/clang/lib/CodeGen/ABIInfoImpl.cpp
+++ b/clang/lib/CodeGen/ABIInfoImpl.cpp
@@ -157,7 +157,7 @@ llvm::Value *CodeGen::emitRoundPointerUpToAlignment(CodeGenFunction &CGF,
   llvm::Value *RoundUp = CGF.Builder.CreateConstInBoundsGEP1_32(
       CGF.Builder.getInt8Ty(), Ptr, Align.getQuantity() - 1);
   return CGF.Builder.CreateIntrinsic(
-        llvm::Intrinsic::ptrmask, {Ptr->getType(), CGF.IntPtrTy},
+      llvm::Intrinsic::ptrmask, {Ptr->getType(), CGF.IntPtrTy},
       {RoundUp, llvm::ConstantInt::get(CGF.IntPtrTy, -Align.getQuantity())},
       nullptr, Ptr->getName() + ".aligned");
 }
diff --git a/llvm/include/llvm/Transforms/IPO/ExpandVariadics.h b/llvm/include/llvm/Transforms/IPO/ExpandVariadics.h
index 92df9647d8..80fdb02294 100644
--- a/llvm/include/llvm/Transforms/IPO/ExpandVariadics.h
+++ b/llvm/include/llvm/Transforms/IPO/ExpandVariadics.h
@@ -18,9 +18,9 @@ class OptimizationLevel;
 
 enum class ExpandVariadicsMode {
   Unspecified, // Use the implementation defaults
-  Disable, // Disable the pass entirely
-  Optimize, // Optimise without changing ABI
-  Lowering, // Change variadic calling convention
+  Disable,     // Disable the pass entirely
+  Optimize,    // Optimise without changing ABI
+  Lowering,    // Change variadic calling convention
 };
 
 class ExpandVariadicsPass : public PassInfoMixin<ExpandVariadicsPass> {
diff --git a/llvm/lib/Passes/PassBuilderPipelines.cpp b/llvm/lib/Passes/PassBuilderPipelines.cpp
index 2f3bbc3e56..32dcdc300d 100644
--- a/llvm/lib/Passes/PassBuilderPipelines.cpp
+++ b/llvm/lib/Passes/PassBuilderPipelines.cpp
@@ -1175,10 +1175,9 @@ PassBuilder::buildModuleSimplificationPipeline(OptimizationLevel Level,
 
   // ExpandVariadics interacts well with the function inliner.
   MPM.addPass(ExpandVariadicsPass(Level == OptimizationLevel::O0
-                                  ? ExpandVariadicsMode::Disable
-                                  : ExpandVariadicsMode::Optimize));
+                                      ? ExpandVariadicsMode::Disable
+                                      : ExpandVariadicsMode::Optimize));
 
-  
   MPM.addPass(AlwaysInlinerPass(/*InsertLifetimeIntrinsics=*/true));
 
   if (EnableModuleInliner)

@jdoerfert (Member) left a comment:

I just read over this and left some comments.

// implements getDeclaration which creates one when missing. This should be
// changed to be consistent with Module()'s naming. Implementing as a local
// function here in the meantime to decouple from that process.
Function *getPreexistingDeclaration(Module *M, Intrinsic::ID id,
Member:

Suggested change
Function *getPreexistingDeclaration(Module *M, Intrinsic::ID id,
Function *getPreexistingDeclaration(Module *M, Intrinsic::ID Id,

// The type of a va_list iterator object
virtual Type *vaListType(LLVMContext &Ctx) = 0;

// The type of a va_list as a function argument as lowered by C
Member:

C?


// Whether a valist instance is passed by value or by address
// I.e. does it need to be alloca'ed and stored into, or can
// it be passed directly in a SSA register
Member:

Comments should be sentences (with full stop), here and elsewhere.

// initialised contiguous memory region.
// Return the value to pass as the va_list argument
virtual Value *initializeVAList(LLVMContext &Ctx, IRBuilder<> &Builder,
AllocaInst *, Value * /*buffer*/) = 0;
Member:

No need for the comment. Name all arguments.
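
i.e. presumably something like:

virtual Value *initializeVAList(LLVMContext &Ctx, IRBuilder<> &Builder,
                                AllocaInst *VAList, Value *Buffer) = 0;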

}

Value *initializeVAList(LLVMContext &Ctx, IRBuilder<> &Builder,
AllocaInst * /*va_list*/, Value *buffer) override {
Member:

Again, style: names and capitalization.

Comment on lines +652 to +658
for (User *U : llvm::make_early_inc_range(F->users()))
if (CallBase *CB = dyn_cast<CallBase>(U)) {
Value *calledOperand = CB->getCalledOperand();
if (F == calledOperand) {
report_fatal_error(
"ExpandVA abi requires eliminating call uses first");
}
Member:

Is this a glorified assert?

Style: check variable names here and elsewhere; lots of lowercase first letters.

@@ -157,7 +157,7 @@ llvm::Value *CodeGen::emitRoundPointerUpToAlignment(CodeGenFunction &CGF,
llvm::Value *RoundUp = CGF.Builder.CreateConstInBoundsGEP1_32(
CGF.Builder.getInt8Ty(), Ptr, Align.getQuantity() - 1);
return CGF.Builder.CreateIntrinsic(
llvm::Intrinsic::ptrmask, {Ptr->getType(), CGF.IntPtrTy},
llvm::Intrinsic::ptrmask, {Ptr->getType(), CGF.IntPtrTy},
Contributor:

Spurious whitespace change?

@@ -247,7 +247,7 @@ Address CodeGen::emitMergePHI(CodeGenFunction &CGF, Address Addr1,

bool CodeGen::isEmptyField(ASTContext &Context, const FieldDecl *FD,
bool AllowArrays, bool AsIfNoUniqueAddr) {
if (FD->isUnnamedBitField())
if (FD->isUnnamedBitfield())
Contributor:

Unrelated change pulled in?

@@ -24,6 +24,7 @@ MODULE_PASS("amdgpu-lower-ctor-dtor", AMDGPUCtorDtorLoweringPass())
MODULE_PASS("amdgpu-lower-module-lds", AMDGPULowerModuleLDSPass(*this))
MODULE_PASS("amdgpu-printf-runtime-binding", AMDGPUPrintfRuntimeBindingPass())
MODULE_PASS("amdgpu-unify-metadata", AMDGPUUnifyMetadataPass())
MODULE_PASS("expand-variadics", ExpandVariadicsPass(ExpandVariadicsMode::Lowering))
Contributor:

Shouldn't need to list this in every target's PassRegistry; the generic one should be fine.

@JonChesterfield (Collaborator Author):

I think the comments here have been fed into #93362 successfully; I'll go through the list again to check.


arsenm commented May 31, 2024

I think the comments here have been fed into #93362 successfully; I'll go through the list again to check.

So #93362 is the replacement, and not the sequential next piece? Can we close this one then?

@JonChesterfield JonChesterfield deleted the jc_varargs_pr branch June 6, 2024 10:56