
[SROA] Allow rewriting memcpy depending on tbaa.struct #77597

Open
wants to merge 2 commits into main

Conversation

bjope
Collaborator

@bjope bjope commented Jan 10, 2024

The bugfix in commit 54067c5, related to
#64081, limited the ability of SROA to handle
non-byte-sized types used in aggregates that
are memcpy'd.

The main problem was that the LLVM type used in an alloca
doesn't always reflect whether the stack slot can be used for
multiple types (typically unions). So even if we have, for example,
%p = alloca i6
that stack slot may be used for types other than i6, and it
would be legal, for example, to store an i8 to that stack slot.

Thus, if %p was dereferenced in a memcpy, we needed to consider
that padding bits (seen from the i6 perspective) could also
matter.

The limitation added to SROA in commit 54067c5 resulted
in huge regressions for a downstream target. Since frontends
typically emit a memcpy for an aggregate copy, it is quite normal
to end up with a memcpy that copies padding bits even
when there are no unions or type punning, so the limitation
seems unfortunate in general.

This patch tries to lift the restriction a bit. If the
memcpy is decorated with tbaa.struct metadata, we inspect that
metadata. If the slice used by our new alloca only touches
memory described by a single scalar type according to
the tbaa.struct, then we assume that the type derived for the
new alloca is correct for all accesses made to the stack slot,
and we can replace the memcpy by regular load/store
instructions operating on that type (disregarding any padding
bits).
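
To illustrate the shape of the metadata being consulted (this mirrors the test case added by the patch; the value names %dst/%src are made up), a memcpy of a two-byte struct holding a _BitInt(1) followed by a _BitInt(3) carries one (offset, size, tag) triple per field:

```llvm
; Each tbaa.struct entry is an (offset, size, TBAA access tag) triple.
; Byte 0 holds a lone _BitInt(1) and byte 1 a lone _BitInt(3), so an
; SROA slice for either byte overlaps exactly one scalar entry and the
; memcpy can be rewritten as typed load/store pairs.
call void @llvm.memcpy.p0.p0.i64(ptr %dst, ptr %src, i64 2, i1 false), !tbaa.struct !6

!2 = !{!"_BitInt(1)", !4, i64 0}
!3 = !{!"_BitInt(3)", !4, i64 0}
!4 = !{!"omnipotent char", !5, i64 0}
!5 = !{!"Simple C++ TBAA"}
!6 = !{i64 0, i64 1, !7, i64 1, i64 1, !8}   ; two fields: (0,1,!7) and (1,1,!8)
!7 = !{!2, !2, i64 0}                        ; access tag for the _BitInt(1)
!8 = !{!3, !3, i64 0}                        ; access tag for the _BitInt(3)
```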

… types

The bugfix in commit 54067c5, related to
  llvm#64081
limited the ability of SROA to handle non byte-sized types
when used in aggregates that are memcpy'd.

Adding some test cases involving tbaa.struct annotations
that might allow us to eliminate more memcpy/alloca.
@llvmbot
Collaborator

llvmbot commented Jan 10, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Björn Pettersson (bjope)

Full diff: https://github.com/llvm/llvm-project/pull/77597.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Scalar/SROA.cpp (+36-2)
  • (modified) llvm/test/Transforms/SROA/pr64081.ll (+74)
diff --git a/llvm/lib/Transforms/Scalar/SROA.cpp b/llvm/lib/Transforms/Scalar/SROA.cpp
index 75cddfa16d6db5..4439765f68db0b 100644
--- a/llvm/lib/Transforms/Scalar/SROA.cpp
+++ b/llvm/lib/Transforms/Scalar/SROA.cpp
@@ -3276,6 +3276,39 @@ class AllocaSliceRewriter : public InstVisitor<AllocaSliceRewriter, bool> {
     // memmove with memcpy, and we don't need to worry about all manner of
     // downsides to splitting and transforming the operations.
 
+    // The tbaa.struct is only being explicit about byte padding. Here we assume
+    // that if the derived type used for the NewAI maps to a single scalar type,
+    // as given by the tbaa.struct, then it is safe to assume that we can use
+    // that type when doing the copying even if it include bit padding. If there
+    // for example would be a union of "_BitInt(3)" and "char" types the
+    // tbaa.struct would have multiple entries indicating the different types
+    // (or there wouldn't be any tbaa.struct)..
+    auto IsSingleTypeAccordingToTBAA = [&]() -> bool {
+      // Only consider the case when we have a tbaa.struct.
+      if (!(AATags && AATags.TBAAStruct))
+        return false;
+      MDNode *MD = AATags.TBAAStruct;
+      uint64_t Offset = NewBeginOffset - BeginOffset;
+      unsigned Count = 0;
+      for (size_t i = 0, size = MD->getNumOperands(); i < size; i += 3) {
+        uint64_t InnerOffset =
+          mdconst::extract<ConstantInt>(MD->getOperand(i))->getZExtValue();
+        uint64_t InnerSize =
+          mdconst::extract<ConstantInt>(MD->getOperand(i + 1))->getZExtValue();
+        // Ignore entries that aren't overlapping with our slice.
+        if (InnerOffset + InnerSize <= Offset ||
+            InnerOffset >= Offset + SliceSize)
+          continue;
+        // Only allow a single match (no unions).
+        if (++Count > 1)
+          return false;
+        // Size/offset must match up.
+        if (InnerSize != SliceSize || Offset != InnerOffset)
+          return false;
+      }
+      return Count == 1;
+    };
+
     // If this doesn't map cleanly onto the alloca type, and that type isn't
     // a single value type, just emit a memcpy.
     bool EmitMemCpy =
@@ -3283,8 +3316,9 @@ class AllocaSliceRewriter : public InstVisitor<AllocaSliceRewriter, bool> {
         (BeginOffset > NewAllocaBeginOffset || EndOffset < NewAllocaEndOffset ||
          SliceSize !=
              DL.getTypeStoreSize(NewAI.getAllocatedType()).getFixedValue() ||
-         !DL.typeSizeEqualsStoreSize(NewAI.getAllocatedType()) ||
-         !NewAI.getAllocatedType()->isSingleValueType());
+         !NewAI.getAllocatedType()->isSingleValueType() ||
+         (!DL.typeSizeEqualsStoreSize(NewAI.getAllocatedType()) &&
+          !IsSingleTypeAccordingToTBAA()));
 
     // If we're just going to emit a memcpy, the alloca hasn't changed, and the
     // size hasn't been shrunk based on analysis of the viable range, this is
diff --git a/llvm/test/Transforms/SROA/pr64081.ll b/llvm/test/Transforms/SROA/pr64081.ll
index 4b893842138263..ba83e495f56c27 100644
--- a/llvm/test/Transforms/SROA/pr64081.ll
+++ b/llvm/test/Transforms/SROA/pr64081.ll
@@ -30,3 +30,77 @@ bb:
 declare void @use(ptr)
 
 declare void @llvm.memcpy.p0.p0.i64(ptr noalias nocapture writeonly, ptr noalias nocapture readonly, i64, i1 immarg)
+
+
+; No unions or overlaps in the tbaa.struct. So we can rely on the types
+define void @test2(i3 %x) {
+; CHECK-LABEL: define void @test2(
+; CHECK-SAME: i3 [[X:%.*]]) {
+; CHECK-NEXT:  bb:
+; CHECK-NEXT:    [[RES:%.*]] = alloca [[B:%.*]], align 8
+; CHECK-NEXT:    store i1 true, ptr [[RES]], align 1, !tbaa.struct [[TBAA_STRUCT0:![0-9]+]]
+; CHECK-NEXT:    [[TMP_SROA_2_0_RES_SROA_IDX:%.*]] = getelementptr inbounds i8, ptr [[RES]], i64 1
+; CHECK-NEXT:    store i3 [[X]], ptr [[TMP_SROA_2_0_RES_SROA_IDX]], align 1, !tbaa.struct [[TBAA_STRUCT7:![0-9]+]]
+; CHECK-NEXT:    [[TMP0:%.*]] = call i8 @use(ptr [[RES]])
+; CHECK-NEXT:    ret void
+;
+bb:
+  %res = alloca %B
+  %tmp = alloca %B
+  %tmp.1 = getelementptr i8, ptr %tmp, i64 1
+  store i1 1, ptr %tmp
+  store i3 %x, ptr %tmp.1
+  call void @llvm.memcpy.p0.p0.i64(ptr %res, ptr %tmp, i64 2, i1 false), !tbaa.struct !6
+  call i8 @use(ptr %res)
+  ret void
+}
+
+; Union preventing SROA from removing the memcpy for the first byte.
+define void @test3(i3 %x) {
+; CHECK-LABEL: define void @test3(
+; CHECK-SAME: i3 [[X:%.*]]) {
+; CHECK-NEXT:  bb:
+; CHECK-NEXT:    [[RES:%.*]] = alloca [[B:%.*]], align 8
+; CHECK-NEXT:    [[TMP_SROA_0:%.*]] = alloca i1, align 8
+; CHECK-NEXT:    store i1 true, ptr [[TMP_SROA_0]], align 8
+; CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i64(ptr align 1 [[RES]], ptr align 8 [[TMP_SROA_0]], i64 1, i1 false), !tbaa.struct [[TBAA_STRUCT8:![0-9]+]]
+; CHECK-NEXT:    [[TMP_SROA_2_0_RES_SROA_IDX:%.*]] = getelementptr inbounds i8, ptr [[RES]], i64 1
+; CHECK-NEXT:    store i3 [[X]], ptr [[TMP_SROA_2_0_RES_SROA_IDX]], align 1, !tbaa.struct [[TBAA_STRUCT7]]
+; CHECK-NEXT:    [[TMP0:%.*]] = call i8 @use(ptr [[RES]])
+; CHECK-NEXT:    ret void
+;
+bb:
+  %res = alloca %B
+  %tmp = alloca %B
+  %tmp.1 = getelementptr i8, ptr %tmp, i64 1
+  store i1 1, ptr %tmp
+  store i3 %x, ptr %tmp.1
+  call void @llvm.memcpy.p0.p0.i64(ptr %res, ptr %tmp, i64 2, i1 false), !tbaa.struct !9
+  call i8 @use(ptr %res)
+  ret void
+}
+
+!1 = !{!"_BitInt(7)", !4, i64 0}
+!2 = !{!"_BitInt(1)", !4, i64 0}
+!3 = !{!"_BitInt(3)", !4, i64 0}
+!4 = !{!"omnipotent char", !5, i64 0}
+!5 = !{!"Simple C++ TBAA"}
+!6 = !{i64 0, i64 1, !7, i64 1, i64 1, !8}
+!7 = !{!2, !2, i64 0}
+!8 = !{!3, !3, i64 0}
+!9 = !{i64 0, i64 1, !10, i64 0, i64 1, !7, i64 1, i64 1, !8}
+!10 = !{!1, !1, i64 0}
+
+;.
+; CHECK: [[TBAA_STRUCT0]] = !{i64 0, i64 1, [[META1:![0-9]+]], i64 1, i64 1, [[META5:![0-9]+]]}
+; CHECK: [[META1]] = !{[[META2:![0-9]+]], [[META2]], i64 0}
+; CHECK: [[META2]] = !{!"_BitInt(1)", [[META3:![0-9]+]], i64 0}
+; CHECK: [[META3]] = !{!"omnipotent char", [[META4:![0-9]+]], i64 0}
+; CHECK: [[META4]] = !{!"Simple C++ TBAA"}
+; CHECK: [[META5]] = !{[[META6:![0-9]+]], [[META6]], i64 0}
+; CHECK: [[META6]] = !{!"_BitInt(3)", [[META3]], i64 0}
+; CHECK: [[TBAA_STRUCT7]] = !{i64 0, i64 1, [[META5]]}
+; CHECK: [[TBAA_STRUCT8]] = !{i64 0, i64 1, [[META9:![0-9]+]], i64 0, i64 1, [[META1]], i64 1, i64 1, [[META5]]}
+; CHECK: [[META9]] = !{[[META10:![0-9]+]], [[META10]], i64 0}
+; CHECK: [[META10]] = !{!"_BitInt(7)", [[META3]], i64 0}
+;.


github-actions bot commented Jan 10, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@bjope
Collaborator Author

bjope commented Jan 10, 2024

I guess tbaa.struct is mainly intended to show where padding bytes are inserted. But I don't have any better idea for how to identify the bit padding here other than doing it like this.

@nikic
Contributor

nikic commented Jan 10, 2024

I'm somewhat skeptical about making SROA use TBAA metadata.

I think there may be an alternative fix here: Require that the new alloca type is always "correct" for all uses. This means reverting my change, and instead changing how the SliceTy is determined in: https://github.com/llvm/llvm-project/blob/db74d290b229efb64698996e8647e1cca264d84f/llvm/lib/Transforms/Scalar/SROA.cpp#L4695C1-L4695C1

In particular, we would drop the "If not, can we find an appropriate subtype in the original allocated type?" part of the logic and adjust findCommonType() to work with non-byte-sized integers. I think that would both address the original issue while still producing the desired code in your case?

Though at a high level, we should really just never emit loads and stores with non-byte-sized types in the first place. Clang's use of these for _BitInt violates the SysV ABI, so it needs to be changed anyway. We probably can't outright forbid these in LLVM, but frontends really shouldn't be generating them.

@bjope
Collaborator Author

bjope commented Jan 10, 2024

I'm somewhat skeptical about making SROA use TBAA metadata.

I'm not surprised, and I'm also a bit skeptical. Yet the comments in CGExprAgg that add tbaa.struct on aggregate copies indicate that tbaa.struct is there to aid expansion of memcpy into scalar memory ops. So it's weird if we first add tbaa.struct metadata to help SROA do this kind of optimization and then aren't allowed to use it.

I think there may be an alternative fix here: Require that the new alloca type is always "correct" for all uses. This means reverting my change, and instead changing how the SliceTy is determined in: https://github.com/llvm/llvm-project/blob/db74d290b229efb64698996e8647e1cca264d84f/llvm/lib/Transforms/Scalar/SROA.cpp#L4695C1-L4695C1

I did look a lot at rewritePartition/findCommonType/getTypePartition. But if the frontend has emitted a memcpy in the first place, then it is a bit hard to find a common type that doesn't include the padding bits. As there could be a chain of memcpy calls, one would need to ensure that the memory being copied is initialized by a scalar store, and that no pointers into the memory escape (i.e. that all the final users are loads using the scalar type).

In particular, we would drop the "If not, can we find an appropriate subtype in the original allocated type?" part of the logic and adjust findCommonType() to work with non-byte-sized integers. I think that would both address the original issue while still producing the desired code in your case?

Though at a high-level, we should really just not ever emit loads and stored with non-byte-sized types in the first place. Clang's use of these for _BitInt violates the SysV ABI, so needs to be changed anyway. We probably can't outright forbid these in LLVM, but frontends really shouldn't be generating them.

Right now our target has builtins operating on { i40, i40 }. So basically i40 is a builtin 40-bit type with a store size of 48 bits (which is three bytes given an address unit size of 16 bits).
We could perhaps describe a load as "load i48 followed by trunc to i40" and hope that ISel can lower it to a single i40 load into a 40-bit register. But it is difficult to describe a store as "anyext i40 to i48 followed by store i48" and let ISel figure out that it can use a 40-bit store operation to write the 48 bits.
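
A rough IR sketch of that widened-access idea (hypothetical; the value names are made up, and this is not what any frontend emits today):

```llvm
; Widened load: read the full 48-bit storage unit, then narrow.
%w = load i48, ptr %p, align 1
%v = trunc i48 %w to i40

; Widened store: the awkward direction. The top 8 bits written here are
; padding, and ISel would have to pattern-match this back into a plain
; 40-bit store to avoid widening the actual memory access.
%e = zext i40 %v to i48
store i48 %e, ptr %p, align 1
```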

This has (amazingly) worked well in the past. But now, after 54067c5, when SROA failed to eliminate the aggregate copy via memcpy, it ended up as a complete mess. I got the feeling that, for example, the LoopVectorizer wasn't happy with memcpys remaining inside the loop after SROA; at least one of the larger regressions involved lack of vectorization.

One option that I haven't fully explored yet would be to change our builtin { i40, i40 } type in some way, such as making sure that CGExprAgg isn't using a memcpy in the first place. But then it would still not work in general for user-defined structs etc. And maybe I would need to change the ABI for those builtin types.

@bjope
Collaborator Author

bjope commented Jan 11, 2024

I'll have a look into adjusting findCommonType instead. I might have overcomplicated things when I thought about the impact of memcpys being "users" of the alloca slice.
I'll at least make another attempt to see if we can solve the problem that way instead. (Not sure how to mark this as "planned changes" while investigating that option further.)

@bjope
Collaborator Author

bjope commented Jan 12, 2024

Haven't had much luck in finding a different solution.

Looking at findCommonType, it currently ignores all intrinsics. But given that we want to detect whether there is a memcpy writing to the alloca (or at least to a slice of it), we would need to also consider MemTransferIntrinsics.

My use case kind of looks like this:

  %struct_i40x2 = type { i40, i40 }
  ...
  %sum = alloca %struct_i40x2
  %tmp = alloca %struct_i40x2
  %hi = getelementptr inbounds %struct_i40x2, ptr %sum, i32 0, i32 0
  store i40 0, ptr %hi, align 1, !tbaa !17
  %lo = getelementptr inbounds %struct_i40x2, ptr %sum, i32 0, i32 1
  store i40 0, ptr %lo, align 1, !tbaa !22
  ...
for.body:
  ...
  call void @llvm.memcpy.p0.p0.i32(ptr align 1 %tmp, ptr align 1 %sum, i32 12, i1 false), !tbaa.struct !37
  %7 = load %struct_i40x2, ptr %tmp
  %8 = extractvalue %struct_i40x2 %7, 0
  %9 = extractvalue %struct_i40x2 %7, 1
  %10 = call %struct_i40x2 @llvm.accumulate(i40 %8, i40 %9)
  store %struct_i40x2 %10, ptr %sum

So I guess that when splitting %tmp one would need to know that the data copied from %sum is a pure i40 and not an i40 type that might alias with something with different bit padding. Thus one would need to analyze %sum first to find a common type for the slices. Since %sum is written using the result from the intrinsic call, one needs to be able to rely on the intrinsic operating on the exact type (seeing the struct as two i40 values). But we can't do that in general, if the frontend has just picked one "random" type in case the struct contains unions?!

One could perhaps do something similar to getTypePartition and look at the IR type to find a subtype. But that would require making sure that Clang (and other frontends) adhere to the rule of picking a type that covers all bits when there are unions.

Not quite sure what to do if the proposal using tbaa.struct is rejected.
(I'll probably use that solution downstream as a quick workaround unless there is some obvious fault making it invalid.)

@bjope bjope requested a review from fhahn February 10, 2024 00:47
@brunodf-snps
Contributor

Hi Björn and Nikita,

I recently learned about commit 54067c5 in LLVM-18. Some time ago, we also discovered issue #64081 downstream, in the context of an architecture with 20-bit pointers stored as 4 bytes in memory. The libcxx std::string uses a small-string optimization that overlays a pointer with string data, and in some scenarios this would trigger SROA to copy small-string data as a pointer load/store, which does not copy the complete 4 bytes.

We tried the same fix as commit 54067c5, but like Björn we found severe quality regressions. SROA would fail to clean up the stack accesses for C++ structs/classes containing pointer values, even if the code was benign. I realized then that the replacement of a memcpy by a load/store pair (together with splitting) is critical for SROA's mem2reg promotion: when an alloca is only used by direct loads/stores, the alloca can be removed and the values connected through. In the worst case, the memcpy copies between two alloca regions, and if you don't replace it by a load/store pair, both allocas remain. This is hard to solve in SROA, because you are treating one alloca slice and its uses, but whether it is safe to replace it by a narrower load/store may depend on the accesses to the other region that you are copying from/to.
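
The promotion-blocking situation can be sketched in IR (hypothetical minimal example; %x and the names are made up):

```llvm
; If the memcpy is rewritten as a load/store pair, both allocas become
; trivially promotable and mem2reg removes them entirely. Left as a
; memcpy, both %a and %b must stay in memory.
%a = alloca i32
%b = alloca i32
store i32 %x, ptr %a
call void @llvm.memcpy.p0.p0.i64(ptr %b, ptr %a, i64 4, i1 false)
%v = load i32, ptr %b
```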

At that point, I had the idea that in a situation where the type size does not equal the store size, it would be better to let SROA proceed with replacing the memcpy with a load/store pair, but to convert the alloca slice to a promoted type that represents the entire storage region, and use (non-memory) operations to access the narrow type inside it. I did not continue working on it at that point, and it is not clear what form of promoted type to use (a wider integer? some special struct type with the narrow type as its member?). To illustrate anyway, I created an example where i28 is promoted to i32 and the conversions are done using zext and trunc operations: https://godbolt.org/z/E3xh5cerE It's not clear if this is viable.
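
A minimal sketch of that shape (hypothetical names, assuming an i28 whose storage unit is a full i32):

```llvm
; The alloca slice is kept as the full-storage type i32, so every
; memory access (including any rewritten memcpy) covers all 32 bits,
; padding included.
%slot = alloca i32
store i32 %full, ptr %slot

; The narrow i28 value lives inside the i32 and is accessed with
; non-memory conversions rather than narrow loads/stores.
%w = load i32, ptr %slot
%narrow = trunc i32 %w to i28
%back = zext i28 %narrow to i32
```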

In any case, if a solution for the regressions (based on TBAA metadata or otherwise) would materialize, I would be interested to know.

Regards,
Bruno

@bjope
Collaborator Author

bjope commented Mar 20, 2024

Hi @brunodf-snps. Did you try this patch to see if it helped avoid the regressions?

We have been using this solution downstream. So it is interesting to know if it has value to others as well.

@brunodf-snps
Contributor

Hi @brunodf-snps. Did you try this patch to see if it helped out avoid regressions?

We have been using this solution downstream. So it is interesting to know if it has value to others as well.

I retrieved the application where we saw the most regression and retried it with an LLVM-18-based build, using the number of allocas in the optimized LLVM IR as a measure. Applying/reverting commit 54067c5 has a dramatic effect: 169 versus 2369 allocas. Then I tried this patch, but unfortunately it does not have any effect; it stays at 2369 allocas. I did not investigate why (I probably need a much smaller test case for that). I know that we use -Xclang -new-struct-path-tbaa; perhaps the TBAA metadata format is different.

@brunodf-snps
Contributor

I did not investigate why (I probably need a much smaller test case for that). I know that we do use -Xclang -new-struct-path-tbaa, perhaps the TBAA metadata format is different.

I looked into this a bit, and I see no tbaa.struct metadata on the relevant llvm.memcpy calls in my case. I think this may be because it is a struct with inheritance and base classes.

There seems to be no tbaa.struct metadata generated for a struct with a base: https://godbolt.org/z/dc1K3b4Pv ?
