[LV] Change loops' interleave count computation #70141

Closed
wants to merge 6,003 commits

Conversation

nilanjana87
Contributor

[LV] Change loops' interleave count computation

A set of microbenchmarks in llvm-test-suite (llvm/llvm-test-suite#26), when tested on an AArch64 platform, demonstrates that loop interleaving is beneficial in two cases:
1) when TC > 2 * VW * IC, such that the interleaved, vectorized portion of the loop runs at least twice
2) when TC is an exact multiple of VW * IC, such that there is no epilogue loop to run
where TC = trip count, VW = vectorization width, and IC = interleave count

We change the interleave count computation based on this information, but leave it unchanged when the flag InterleaveSmallLoopScalarReduction is set to true, since that flag handles a special case (https://reviews.llvm.org/D81416).
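
To illustrate the intended behavior, here is a minimal standalone sketch of the proposed clamping rule. The function name, scalar parameter types, and the MaxIC bound are illustrative only; the actual change (in the diff below) operates on `ElementCount` inside LoopVectorize.cpp:

```cpp
#include <algorithm>
#include <bit>
#include <cstdio>

// Sketch of the proposed rule, assuming a known trip count TC, a fixed
// vectorization width VF, and a target-provided upper bound MaxIC.
unsigned clampIC(unsigned TC, unsigned VF, unsigned MaxIC,
                 bool InterleaveSmallLoopScalarReduction = false) {
  if (InterleaveSmallLoopScalarReduction || TC % VF == 0)
    MaxIC = std::min(TC / VF, MaxIC);       // case 2: no epilogue loop needed
  else
    MaxIC = std::min(TC / (VF * 2), MaxIC); // case 1: vector body runs >= 2x
  // Clamp to a power of two that is at least 1 (llvm::bit_floor in
  // llvm/ADT/bit.h is the in-tree equivalent of std::bit_floor).
  return std::bit_floor(std::max(1u, MaxIC));
}

int main() {
  std::printf("%u\n", clampIC(128, 64, 8)); // TC multiple of VF -> IC 2
  std::printf("%u\n", clampIC(129, 64, 8)); // epilogue needed   -> IC 1
}
```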

@llvmbot
Copy link
Collaborator

llvmbot commented Oct 24, 2023

@llvm/pr-subscribers-llvm-transforms

Author: Nilanjana Basu (nilanjana87)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/70141.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+15-6)
  • (added) llvm/test/Transforms/LoopVectorize/AArch64/interleave_count.ll (+106)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index cc17d91d4f43727..5786e8e38cb4974 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -5745,8 +5745,12 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
   }
 
   // If trip count is known or estimated compile time constant, limit the
-  // interleave count to be less than the trip count divided by VF, provided it
-  // is at least 1.
+  // interleave count to be less than the trip count divided by VF * 2,
+  // provided VF is at least 1 and the trip count is not an exact multiple of
+  // VF, such that the vector loop runs at least twice to make interleaving seem
+  // profitable when there is an epilogue loop present. When
+  // InterleaveSmallLoopScalarReduction is true or trip count is an exact
+  // multiple of VF, we allow interleaving even when the vector loop runs once.
   //
   // For scalable vectors we can't know if interleaving is beneficial. It may
   // not be beneficial for small loops if none of the lanes in the second vector
@@ -5755,10 +5759,15 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
   // the InterleaveCount as if vscale is '1', although if some information about
   // the vector is known (e.g. min vector size), we can make a better decision.
   if (BestKnownTC) {
-    MaxInterleaveCount =
-        std::min(*BestKnownTC / VF.getKnownMinValue(), MaxInterleaveCount);
-    // Make sure MaxInterleaveCount is greater than 0.
-    MaxInterleaveCount = std::max(1u, MaxInterleaveCount);
+    if (InterleaveSmallLoopScalarReduction ||
+        (*BestKnownTC % VF.getKnownMinValue() == 0))
+      MaxInterleaveCount =
+          std::min(*BestKnownTC / VF.getKnownMinValue(), MaxInterleaveCount);
+    else
+      MaxInterleaveCount = std::min(*BestKnownTC / (VF.getKnownMinValue() * 2),
+                                    MaxInterleaveCount);
+    // Make sure MaxInterleaveCount is greater than 0 & a power of 2.
+    MaxInterleaveCount = llvm::bit_floor(std::max(1u, MaxInterleaveCount));
   }
 
   assert(MaxInterleaveCount > 0 &&
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/interleave_count.ll b/llvm/test/Transforms/LoopVectorize/AArch64/interleave_count.ll
new file mode 100644
index 000000000000000..501fc334bc8bf48
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/interleave_count.ll
@@ -0,0 +1,106 @@
+; RUN: opt < %s -force-vector-width=64 -O3 -S -pass-remarks=loop-vectorize 2>&1 | FileCheck %s
+
+target triple = "aarch64-linux-gnu"
+
+%pair = type { i8, i8 }
+
+; For a loop with known trip count of 128, when we force VF 64, it should use
+; IC 2, since there is no remainder loop needed when the vector loop runs.
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 64, interleaved count: 2)
+define void @loop_with_tc_128(ptr %p, ptr %q) {
+entry:
+  br label %for.body
+
+for.body:
+  %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
+  %tmp0 = getelementptr %pair, ptr %p, i64 %i, i32 0
+  %tmp1 = load i8, ptr %tmp0, align 1
+  %tmp2 = getelementptr %pair, ptr %p, i64 %i, i32 1
+  %tmp3 = load i8, ptr %tmp2, align 1
+  %add = add i8 %tmp1, %tmp3
+  %qi = getelementptr i8, ptr %q, i64 %i
+  store i8 %add, ptr %qi, align 1
+  %i.next = add nuw nsw i64 %i, 1
+  %cond = icmp eq i64 %i.next, 128
+  br i1 %cond, label %for.end, label %for.body
+
+for.end:
+  ret void
+}
+
+; For a loop with known trip count of 129, when we force VF 64, it should use
+; IC 1, since there may be a remainder loop that needs to run after the vector loop.
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 64, interleaved count: 1)
+define void @loop_with_tc_129(ptr %p, ptr %q) {
+entry:
+  br label %for.body
+
+for.body:
+  %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
+  %tmp0 = getelementptr %pair, ptr %p, i64 %i, i32 0
+  %tmp1 = load i8, ptr %tmp0, align 1
+  %tmp2 = getelementptr %pair, ptr %p, i64 %i, i32 1
+  %tmp3 = load i8, ptr %tmp2, align 1
+  %add = add i8 %tmp1, %tmp3
+  %qi = getelementptr i8, ptr %q, i64 %i
+  store i8 %add, ptr %qi, align 1
+  %i.next = add nuw nsw i64 %i, 1
+  %cond = icmp eq i64 %i.next, 129
+  br i1 %cond, label %for.end, label %for.body
+
+for.end:
+  ret void
+}
+
+; For a loop with unknown trip count but a profile showing an approx TC estimate of 128, 
+; when we force VF 64, it should use IC 2, since chances are high that the remainder loop
+; won't need to run
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 64, interleaved count: 2)
+define void @loop_with_profile_tc_128(ptr %p, ptr %q, i64 %n) {
+entry:
+  br label %for.body
+
+for.body:
+  %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
+  %tmp0 = getelementptr %pair, ptr %p, i64 %i, i32 0
+  %tmp1 = load i8, ptr %tmp0, align 1
+  %tmp2 = getelementptr %pair, ptr %p, i64 %i, i32 1
+  %tmp3 = load i8, ptr %tmp2, align 1
+  %add = add i8 %tmp1, %tmp3
+  %qi = getelementptr i8, ptr %q, i64 %i
+  store i8 %add, ptr %qi, align 1
+  %i.next = add nuw nsw i64 %i, 1
+  %cond = icmp eq i64 %i.next, %n
+  br i1 %cond, label %for.end, label %for.body, !prof !0
+
+for.end:
+  ret void
+}
+
+; For a loop with unknown trip count but a profile showing an approx TC estimate of 129, 
+; when we force VF 64, it should use IC 1, since chances are high that the remainder loop
+; will need to run
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 64, interleaved count: 1)
+define void @loop_with_profile_tc_129(ptr %p, ptr %q, i64 %n) {
+entry:
+  br label %for.body
+
+for.body:
+  %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
+  %tmp0 = getelementptr %pair, ptr %p, i64 %i, i32 0
+  %tmp1 = load i8, ptr %tmp0, align 1
+  %tmp2 = getelementptr %pair, ptr %p, i64 %i, i32 1
+  %tmp3 = load i8, ptr %tmp2, align 1
+  %add = add i8 %tmp1, %tmp3
+  %qi = getelementptr i8, ptr %q, i64 %i
+  store i8 %add, ptr %qi, align 1
+  %i.next = add nuw nsw i64 %i, 1
+  %cond = icmp eq i64 %i.next, %n
+  br i1 %cond, label %for.end, label %for.body, !prof !1
+
+for.end:
+  ret void
+}
+
+!0 = !{!"branch_weights", i32 1, i32 127}
+!1 = !{!"branch_weights", i32 1, i32 128}
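
(A note on the profile metadata above: on the latch branch `br i1 %cond, label %for.end, label %for.body, !prof !0`, the first weight belongs to the exit edge and the second to the back edge. With `!0`, the exit is taken once per 127 back edges, giving an estimated trip count of 127 + 1 = 128; with `!1`, the estimate is 128 + 1 = 129. This is how the `loop_with_profile_tc_*` tests model trip counts of approximately 128 and 129.)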

@nilanjana87
Contributor Author

nilanjana87 commented Oct 26, 2023

The initial tests were added in a separate PR, #70272.


github-actions bot commented Nov 11, 2023

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff 2f4328e6979004fbf531d69a40c2e06d43d3128c 3104a425e31aad4defdeda4e0bb8a8c2d0549703 -- llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
View the diff from clang-format here.
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 5786e8e38c..45f2c442cd 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -3841,8 +3841,7 @@ void InnerLoopVectorizer::fixReduction(VPReductionPHIRecipe *PhiR,
           TTI->preferPredicatedReductionSelect(
               RdxDesc.getOpcode(), PhiTy,
               TargetTransformInfo::ReductionFlags())) {
-        auto *VecRdxPhi =
-            cast<PHINode>(State.get(PhiR, Part));
+        auto *VecRdxPhi = cast<PHINode>(State.get(PhiR, Part));
         VecRdxPhi->setIncomingValueForBlock(VectorLoopLatch, Sel);
       }
     }
@@ -8309,18 +8308,17 @@ VPWidenCallRecipe *VPRecipeBuilder::tryToWidenCall(CallInst *CI,
 
   // Is it beneficial to perform intrinsic call compared to lib call?
   bool ShouldUseVectorIntrinsic =
-      ID && LoopVectorizationPlanner::getDecisionAndClampRange(
-                [&](ElementCount VF) -> bool {
-                  Function *Variant;
-                  // Is it beneficial to perform intrinsic call compared to lib
-                  // call?
-                  InstructionCost CallCost =
-                      CM.getVectorCallCost(CI, VF, &Variant);
-                  InstructionCost IntrinsicCost =
-                      CM.getVectorIntrinsicCost(CI, VF);
-                  return IntrinsicCost <= CallCost;
-                },
-                Range);
+      ID &&
+      LoopVectorizationPlanner::getDecisionAndClampRange(
+          [&](ElementCount VF) -> bool {
+            Function *Variant;
+            // Is it beneficial to perform intrinsic call compared to lib
+            // call?
+            InstructionCost CallCost = CM.getVectorCallCost(CI, VF, &Variant);
+            InstructionCost IntrinsicCost = CM.getVectorIntrinsicCost(CI, VF);
+            return IntrinsicCost <= CallCost;
+          },
+          Range);
   if (ShouldUseVectorIntrinsic)
     return new VPWidenCallRecipe(*CI, make_range(Ops.begin(), Ops.end()), ID);
 

@fhahn
Contributor

fhahn commented Nov 27, 2023

With #70272 landed, could you rebase this one again?

kadircet and others added 12 commits November 27, 2023 17:18
TargetLowering.h shouldn't include any passes and thus pull in
the entire pass infrastructure. Replace the include with forward
declarations.
…vm#72444)

This patch removes the val field from the `MapInfoOp`.

Previously, when lowering `TargetOp`, the bounds information for the
`BoxValues` was also being mapped. Instead, these ops are now cloned
inside the target region to prevent mapping of non-reference-typed
values.
The following pattern fails on recent GCC versions with the -std=c++20 flag
passed and succeeds with -std=c++17. Such behavior is not observed with
Clang 16.0.

```
template <typename T>
struct Foo {
    Foo<T>(int a) {}
};
```

This patch removes the template parameter from the constructor in two
occurrences to make the following command complete successfully:
bazel build -c fastbuild --cxxopt=-std=c++20 --host_cxxopt=-std=c++20
@llvm-project//mlir/...
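
Concretely, the fix drops the redundant template parameter list from the constructor; the corrected form of the pattern above is:

```cpp
template <typename T>
struct Foo {
    Foo(int a) {}  // previously: Foo<T>(int a) {}
};
```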

This patch is similar to https://reviews.llvm.org/D154782

Co-authored-by: Alexander Batashev <a.batashev@partner.samsung.com>
Struct Module::Header is not a POD type. As such, qsort() and
llvm::array_pod_sort() must not be used to sort it. This became an issue
with the new implementation of qsort() in glibc 2.39, which is not
guaranteed to be a stable sort, causing Headers to be re-ordered and
corrupted.

Replace the usage of llvm::array_pod_sort() with std::stable_sort() in
order to fix this issue. The signature of compareModuleHeaders() has to
be modified.

Fixes llvm#73145.
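
As a sketch of the resulting pattern (illustrative, simplified types; not the actual Clang code):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Simplified stand-in for clang's Module::Header, which is non-POD.
struct Header {
  std::string NameAsWritten;
};

// The comparator becomes a bool-returning strict-weak-order predicate.
static bool compareModuleHeaders(const Header &A, const Header &B) {
  return A.NameAsWritten < B.NameAsWritten;
}

void sortHeaders(std::vector<Header> &Headers) {
  // std::stable_sort moves elements via their real special member functions
  // and preserves the order of equal elements, unlike the byte-based
  // qsort()/llvm::array_pod_sort().
  std::stable_sort(Headers.begin(), Headers.end(), compareModuleHeaders);
}
```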
Update regex to _explicitly_ show which exp versions are added.
The previous regex used `exp[^e]` to avoid matching calls like:
`@llvm.experimental.stepvector`.

Note: ArmPL Mappings for scalable types are not yet utilized
(eg, `llvm.exp10.nxv2f64`, `llvm.exp10.nxv4f32`), as `replace-with-veclib`
pass needs improvements.
…lvm#72526)

The code in the CloneInstructionsIntoPredec... function modified by this
patch has a long history that dates back to 2011, see d715ec8.
There, when folding branches, all dbg.value intrinsics seen when folding
would be saved and then re-inserted at the end of whatever was folded. Over
the last 12 years this behaviour has been preserved.

However, IMO it's bad behaviour. If we have:

  inst1
  dbg.value1
  inst2
  dbg.value2

And we fold that sequence into a different block, then we would want the
instructions and variable assignments to appear in the same order. However
because of this old behaviour, the dbg.values are sunk, and we get:

  inst1
  inst2
  dbg.value1
  dbg.value2

This clustering of dbg.values can make assignments to the same variable
invisible, as well as reducing the coverage of other assignments.

This patch relaxes the CloneInstructions... function and allows it to clone
and update dbg.values in-place, causing them to appear in the original
order in the destination block. I've added some extra dbg.values to the
updated test: without the changes to the pass, the dbg.values sink into a
blob ahead of the select. The RemoveDIs code can't cope with this right now
so I've removed the "--try..." flag, restored in a commit to land in a
couple of hours.

(Metadata changes to make the LLVM-IR parser not drop the debug-info for it
being out of date. The RemoveDIs related RUN line has been removed because
it was spuriously passing due to the debug-info being dropped).
…n type of function. (llvm#69724)

Import of a function with `auto` return type that is expanded to a
`SubstTemplateTypeParmType` could fail if the function itself is the
template specialization where the parameter was replaced.
jasonmolenda and others added 21 commits November 27, 2023 17:18
)

This patch is rearranging code a bit to add WatchpointResources to
Process. A WatchpointResource is meant to represent a hardware
watchpoint register in the inferior process. It has an address, a size,
a type, and a list of Watchpoints that are using this
WatchpointResource.

This current patch doesn't add any of the features of
WatchpointResources that make them interesting -- a user asking to watch
a 24 byte object could watch this with three 8 byte WatchpointResources.
Or consider a Watchpoint on 1 byte at 0x1002 and a second Watchpoint on
1 byte at 0x1003: on a 64-bit target, these must both be served by a single
WatchpointResource on the doubleword at 0x1000; if two hardware watchpoint
registers were used to track them separately, one of them may not be
hit. Or if you have one Watchpoint on a variable with a condition set,
and another Watchpoint on that same variable with a command defined or
different condition, or ignorecount, both of those Watchpoints need to
evaluate their criteria/commands when their WatchpointResource has been
hit.

There's a bit of code movement to rearrange things in the direction I'll
need for implementing this feature, so I want to start with reviewing &
landing this mostly NFC patch and we can focus on the algorithmic
choices about how WatchpointResources are shared and handled as they're
triggered, separately.

This patch also stops printing "Watchpoint <n> hit: old value: <x>, new
vlaue: <y>" for Read watchpoints. I could make an argument for print
"Watchpoint <n> hit: current value <x>" but the current output doesn't
make any sense, and the user can print the value if they are
particularly interested. Read watchpoints are used primarily to
understand what code is reading a variable.

This patch adds more fallbacks for how to print the objects being
watched if we have types, instead of assuming they are all integral
values, so a struct will print its elements. As large watchpoints are
added, we'll be doing a lot more of those.

To track the WatchpointSP in the WatchpointResources, I changed the
internal API which took a WatchpointSP and devolved it to a Watchpoint*,
which meant touching several different Process files. I removed the
watchpoint code in ProcessKDP which only reported that watchpoints
aren't supported, the base class does that already.

I haven't yet changed how we receive a watchpoint to identify the
WatchpointResource responsible for the trigger, and identify all
Watchpoints that are using this Resource to evaluate their conditions
etc. This is the same work that a BreakpointSite needs to do when it has
been triggered, where multiple Breakpoints may be at the same address.

There is not yet any printing of the Resources that a Watchpoint is
implemented in terms of ("watchpoint list", or
SBWatchpoint::GetDescription).

"watchpoint set var" and "watchpoint set expression" take a size
argument which was previously 1, 2, 4, or 8 (an enum). I've changed this
to an unsigned int. Most hardware implementations can only watch 1, 2,
4, 8 byte ranges, but with Resources we'll allow a user to ask for
different-sized watchpoints and set them in hardware-expressible terms
soon.

I've annotated areas where I know there is work still needed with
LWP_TODO that I'll be working on once this is landed.

I've tested this on aarch64 macOS, aarch64 Linux, and Intel macOS.

https://discourse.llvm.org/t/rfc-large-watchpoint-support-in-lldb/72116
…lvm#68926)

This PR adds support for the PPA2 fields.

---------

Co-authored-by: Yusra Syeda <yusra.syeda@ibm.com>
The previous optional class would call the destructor on a non-trivially
destructible object regardless of whether it had already been reset. This
patch fixes that by moving the tracking of whether the object exists into
the internal storage class for optional.
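
A minimal sketch of the shape of the fix (illustrative only, not the actual class):

```cpp
#include <new>
#include <utility>

// Illustrative sketch: engagement is tracked next to the storage, so the
// destructor runs at most once and only while a value is present.
template <typename T> class Optional {
  alignas(T) unsigned char Storage[sizeof(T)];
  bool HasValue = false;

public:
  Optional() = default;

  template <typename... Args> void emplace(Args &&...A) {
    reset();
    ::new (static_cast<void *>(Storage)) T(std::forward<Args>(A)...);
    HasValue = true;
  }

  void reset() {
    if (HasValue) { // guard: never run ~T() on an empty slot, or twice
      reinterpret_cast<T *>(Storage)->~T();
      HasValue = false;
    }
  }

  ~Optional() { reset(); }
};
```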
On macOS <vector> was already included via another
header file, but this failed on the CI bots.  I'd
tested on linux earlier in this patch's life when
the headers were differently arranged.
…3439)

In the FreeBSD base system, re-executing the main binary when ASLR is
detected was implemented in the following commits:

* freebsd/freebsd-src@7cafe89f9ce33
* freebsd/freebsd-src@96fe7c8ab0f65
* freebsd/freebsd-src@930a7c2ac67e1
* freebsd/freebsd-src@0a736f0a6aeb0
* freebsd/freebsd-src@4c9a0adad1826

Squash all these to bring them into upstream compiler-rt.

When ASLR is detected to be enabled, this first force-disables ASLR for
the current process, then calls ReExec(). The ReExec() function gets a
FreeBSD specific implementation for finding the path of the executed
program, via the ELF auxiliary vector. This is done without calling into
the regular elf_aux_info(3) function, as that makes use of several
already-intercepted functions.
Summary:
These tests are currently failing; disable them so we can keep the bots
green until we find a better solution. The x64 tests are not the core
target, so this is low priority.
This adds functionality to tbd files similar to what `lipo -extract`
and `lipo -remove` do for binaries.
The "Dim" prefix is a legacy left-over that no longer makes sense, since
we have a very strict "Dimension" vs. "Level" definition for sparse
tensor types and their storage.
…lvm#72666)

This is the first in a planned patch series to teach our vector lowering
how to exploit register boundaries in LMUL>1 types when VLEN is known to
be an exact constant. This corresponds to code compiled by clang with
the -mrvv-vector-bits=zvl option.

For extract_vector_elt, if we have a constant index and a known vlen,
then we can identify which register out of a register group is being
accessed. Given this, we can do a sub-register extract for that
register, and then shift any remaining index.

This results in all constant index extracts becoming m1 operations, and
thus eliminates the complexity concern for explode-vector idioms at high
lmul.
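
For example, assuming VLEN=128 (so one m1 register holds four i32 elements), extracting element 5 from an 8 x i32 m2 vector becomes a subregister extract of the second m1 register of the group, followed by an extract of element 5 - 4 = 1 from that register, which is an m1 operation.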
This sanitizes the string for printing.
std::hash and ADT/Hashing::hash_value are non-deterministic functions whose results might vary across implementation/process/execution. Using xxh3 instead for computing hashes of BinaryFunctions and BinaryBasicBlocks for stale profile matching. (A possible alternative is to use ADT/StableHashing.h, based on FNV hashing, but xxh3 seems to be more popular in LLVM.)

This is to address llvm#65241.
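
For reference, a minimal sketch of hashing a byte buffer with LLVM's xxh3 wrapper (assuming the xxh3_64bits helper from llvm/Support/xxhash.h; the key string is illustrative):

```cpp
#include "llvm/ADT/ArrayRef.h"
#include "llvm/Support/xxhash.h"
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
  // xxh3 implements a fixed, portable algorithm, so the result is stable
  // across processes and standard-library implementations, unlike
  // std::hash or llvm::hash_value.
  const char *Key = "BinaryFunction@main";
  uint64_t H = llvm::xxh3_64bits(llvm::ArrayRef<uint8_t>(
      reinterpret_cast<const uint8_t *>(Key), std::strlen(Key)));
  std::printf("%016llx\n", static_cast<unsigned long long>(H));
}
```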
I recently moved from using vim to lunarvim. I noticed that `gcc` (lol)
wasn't able to comment out lines, because commentstring is not set.

Thanks to Chase Colman for the suggestion in
LunarVim/LunarVim#4394.
This pass isn't used anywhere upstream and thus has no test coverage.
For these reasons, remove it.
This patch is the first part of a two-part change: first the loop interleave count computation is changed, and then the loop interleaving threshold is removed. These tests therefore remain unchanged in the current (first) patch but will be modified in the second. The original PR for these tests is llvm#70272.
@david-arm
Contributor

With #70272 landed, could you rebase this one again?

I think something has gone wrong with the rebase - your patch now has 250 commits.

nicolasvasilache and others added 2 commits November 28, 2023 15:48
…vm#70907)

Unless specified as a "raw" string, Python will try to interpret
backslashes, which means they don't get passed on correctly to the
underlying regex-engine, unless escaped manually (`\\`).

Use raw strings in `llvm/test/lit.cfg.py` to avoid a `SyntaxWarning`:
```
llvm/test/lit.cfg.py:275: SyntaxWarning: invalid escape sequence '\d'
  match = re.search("release (\d+)\.(\d+)", ptxas_out)
```
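
The fix is simply to mark the pattern as a raw string so the backslashes reach the regex engine intact:

```python
match = re.search(r"release (\d+)\.(\d+)", ptxas_out)
```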

Other such warnings present in 17.0.x were already fixed in
7ed0f5b.
@nilanjana87
Contributor Author

I think something has gone wrong with the rebase - your patch now has 250 commits.

Not sure if this was because I had multiple remotes configured, but the rebasing got completely messed up! I have tried several ways to fix it, but it's not going back to the previous state. I'll close this PR and create a new one with the final version.

@nilanjana87
Contributor Author

Apologies for messing up the rebase to a point that I could not easily recover from. I added a clean patch in #73766, which is rebased on the latest code and includes the tests from #70272.

Endilll removed their request for review April 17, 2024 04:31