folds for icmp-of-sum-of-extended-i1 aren't happening in more complex code #73417

scottmcm · 2023-11-26T01:06:30Z

I'm trying to take advantage of the folds from dd31a3b (cc @bcl5980) in Rust's standard library. They work great for single fields, but in more complex cases they don't seem to trigger, leaving suboptimal IR.

For example, if I change Rust's `Ord::cmp` for primitives to use the sext + zext implementation, then I get the following for `<` on a Rust tuple of `(i16, u16)`:

define noundef zeroext i1 @check_lt_direct(i16 noundef %0, i16 noundef %1, i16 noundef %2, i16 noundef %3) unnamed_addr #0 {
start:
  %lhs.i.i = icmp sgt i16 %0, %2
  %rhs.i.i = icmp slt i16 %0, %2
  %self1.i.i = zext i1 %lhs.i.i to i8
  %rhs2.neg.i.i = sext i1 %rhs.i.i to i8
  %diff.i.i = add nsw i8 %rhs2.neg.i.i, %self1.i.i
  %4 = icmp eq i8 %diff.i.i, 0
  %_0.i.i = icmp ult i16 %1, %3
  %5 = icmp slt i8 %diff.i.i, 0
  %_0.0.i = select i1 %4, i1 %_0.i.i, i1 %5
  ret i1 %_0.0.i
}

But it could just be the following (proof: https://alive2.llvm.org/ce/z/4dD_qc):

define noundef zeroext i1 @tgt(i16 noundef %0, i16 noundef %1, i16 noundef %2, i16 noundef %3) unnamed_addr #0 {
start:
  ; No longer needed %lhs.i.i = icmp sgt i16 %0, %2
  ; No longer needed %rhs.i.i = icmp slt i16 %0, %2
  ; No longer needed %self1.i.i = zext i1 %lhs.i.i to i8
  ; No longer needed %rhs2.neg.i.i = sext i1 %rhs.i.i to i8
  ; No longer needed %diff.i.i = add nsw i8 %rhs2.neg.i.i, %self1.i.i
  %4 = icmp eq i16 %0, %2
  %_0.i.i = icmp ult i16 %1, %3
  %5 = icmp slt i16 %0, %2
  %_0.0.i = select i1 %4, i1 %_0.i.i, i1 %5
  ret i1 %_0.0.i
}

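That comes from replacing both checks against %diff.i.i with their simplified forms: icmp eq i8 %diff.i.i, 0 becomes icmp eq i16 %0, %2, and icmp slt i8 %diff.i.i, 0 becomes icmp slt i16 %0, %2.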
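For readers who think in Rust rather than IR, the equivalence can be sketched at the source level. This is only an illustration under assumptions: the function names `cmp_diff`, `lt_via_diff`, and `lt_folded` are hypothetical, and `cmp_diff` is my guess at what the "sext + zext implementation" looks like in source form, not the actual standard library code.

```rust
/// Three-way comparison as zext(a > b) + sext(a < b); yields -1, 0, or 1.
/// (Assumed source-level shape of the "sext + zext" formulation.)
fn cmp_diff(a: i16, b: i16) -> i8 {
    (a > b) as i8 - (a < b) as i8
}

/// `<` on `(i16, u16)` written against the difference, mirroring the
/// `icmp eq i8 %diff, 0` / `icmp slt i8 %diff, 0` pair in the unfolded IR.
fn lt_via_diff(a: (i16, u16), b: (i16, u16)) -> bool {
    let diff = cmp_diff(a.0, b.0);
    if diff == 0 { a.1 < b.1 } else { diff < 0 }
}

/// The same predicate with both checks folded back to direct comparisons
/// on the first field, mirroring the desired IR.
fn lt_folded(a: (i16, u16), b: (i16, u16)) -> bool {
    if a.0 == b.0 { a.1 < b.1 } else { a.0 < b.0 }
}
```

Both functions should agree with the built-in tuple comparison `a < b` for every input; the Alive2 link is the actual proof, this is just a way to poke at it from Rust.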

A simpler change, editing only the icmp eq i8 %diff.i.i, 0, also works (https://alive2.llvm.org/ce/z/gUqUi7):

define noundef zeroext i1 @tgt(i16 noundef %0, i16 noundef %1, i16 noundef %2, i16 noundef %3) unnamed_addr #0 {
start:
  %lhs.i.i = icmp sgt i16 %0, %2
  %rhs.i.i = icmp slt i16 %0, %2
  %self1.i.i = zext i1 %lhs.i.i to i8
  %rhs2.neg.i.i = sext i1 %rhs.i.i to i8
  %diff.i.i = add nsw i8 %rhs2.neg.i.i, %self1.i.i
  %4 = icmp eq i1 %lhs.i.i, %rhs.i.i ; <--
  %_0.i.i = icmp ult i16 %1, %3
  %5 = icmp slt i8 %diff.i.i, 0
  %_0.0.i = select i1 %4, i1 %_0.i.i, i1 %5
  ret i1 %_0.0.i
}

with the other existing folds then able to do their magic to get it down to:

define noundef zeroext i1 @tgt(i16 noundef %0, i16 noundef %1, i16 noundef %2, i16 noundef %3) unnamed_addr #0 {
  %rhs.i.i = icmp slt i16 %0, %2
  %.not = icmp eq i16 %0, %2
  %_0.i.i = icmp ult i16 %1, %3
  %_0.0.i = select i1 %.not, i1 %_0.i.i, i1 %rhs.i.i
  ret i1 %_0.0.i
}
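The smaller fold, turning the equality check on the sum into an equality check on the two i1 values, can be sanity-checked with a four-row truth table. A minimal sketch, assuming hypothetical helper names (`diff`, `eq_zero_original`, `eq_zero_folded`) that model the IR rather than come from any real codebase:

```rust
/// Models `add i8 (zext i1 gt), (sext i1 lt)`:
/// zext maps true to 1, sext maps true to -1.
fn diff(gt: bool, lt: bool) -> i8 {
    gt as i8 + -(lt as i8)
}

/// The check as it appears in the unfolded IR: `icmp eq i8 %diff, 0`.
fn eq_zero_original(gt: bool, lt: bool) -> bool {
    diff(gt, lt) == 0
}

/// The proposed replacement: `icmp eq i1 %lhs, %rhs`.
fn eq_zero_folded(gt: bool, lt: bool) -> bool {
    gt == lt
}
```

The two agree on all four combinations of `gt` and `lt`, including the `(true, true)` row that cannot occur for a real `sgt`/`slt` pair on the same operands, so the fold does not need to know where the two i1 values came from.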

Add `Ord::cmp` for primitives as a `BinOp` in MIR

There are dozens of reasonable ways to implement `Ord::cmp` for integers using comparison, bit-ops, and branches. Those differences are irrelevant at the Rust level, however, so we can make things better by adding `BinOp::Cmp` at the MIR level:

1. Exactly how to implement it is left up to the backends, so LLVM can use whatever pattern its optimizer best recognizes and cranelift can use whichever pattern codegens the fastest.
2. By not inlining those details for every use of `cmp`, we drastically reduce the amount of MIR generated for `derive`d `PartialOrd`, while also making it more amenable to MIR-level optimizations.

Having extremely careful `if` ordering to μoptimize resource usage on Broadwell (rust-lang#63767) is great, but it really feels to me like libcore is the wrong place to put that logic. Similarly, using subtraction [tricks](https://graphics.stanford.edu/~seander/bithacks.html#CopyIntegerSign) (rust-lang#105840) is arguably even nicer, but depends on the optimizer understanding it (llvm/llvm-project#73417) to be practical. Or maybe [bitor is better than add](https://discourse.llvm.org/t/representing-in-ir/67369/2?u=scottmcm)? But maybe only on a future version that [has `or disjoint` support](https://discourse.llvm.org/t/rfc-add-or-disjoint-flag/75036?u=scottmcm)?

And just because one of those forms happens to be good for LLVM, there's no guarantee that it'd be the same form that GCC or Cranelift would rather see -- especially given their very different optimizers. Not to mention that if LLVM gets a spaceship intrinsic -- [which it should](https://rust-lang.zulipchat.com/#narrow/stream/131828-t-compiler/topic/Suboptimal.20inlining.20in.20std.20function.20.60binary_search.60/near/404250586) -- we'll need at least a rustc intrinsic to be able to call it.

As for simplifying it in Rust, we now regularly inline `{integer}::partial_cmp`, but it's quite a large amount of IR. The best way to see that is with rust-lang@8811efa#diff-d134c32d028fbe2bf835fef2df9aca9d13332dd82284ff21ee7ebf717bfa4765R113 -- I added a new pre-codegen MIR test for a simple 3-tuple struct, and this PR changes it from 36 locals and 26 basic blocks down to 24 locals and 8 basic blocks.

Even better, as soon as the construct-`Some`-then-match-it-in-same-BB noise is cleaned up, this'll expose the `Cmp == 0` branches clearly in MIR, so that an InstCombine (rust-lang#105808) can simplify that to just a `BinOp::Eq` and thus fix some of our generated code perf issues. (Tracking that through today's `if a < b { Less } else if a == b { Equal } else { Greater }` would be *much* harder.)

---

r? `@ghost`

But first I should check that perf is ok with this ~~...and my true nemesis, tidy.~~
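To make the "dozens of reasonable ways" point concrete, here is a rough sketch of three source-level formulations of a three-way integer comparison. The function names are hypothetical and none of these is claimed to be what libcore or any backend actually emits; the bitor variant in particular is only my reading of the "bitor is better than add" idea.

```rust
use std::cmp::Ordering;

/// Branchy form, matching the `if a < b { Less } else if a == b ...` shape.
fn cmp_branchy(a: i32, b: i32) -> Ordering {
    if a < b {
        Ordering::Less
    } else if a == b {
        Ordering::Equal
    } else {
        Ordering::Greater
    }
}

/// Subtraction form: zext(a > b) - zext(a < b) gives -1, 0, or 1.
fn cmp_sub(a: i32, b: i32) -> Ordering {
    match (a > b) as i8 - (a < b) as i8 {
        -1 => Ordering::Less,
        0 => Ordering::Equal,
        _ => Ordering::Greater,
    }
}

/// Bitor form: sext(a < b) | zext(a > b). The two operands are never both
/// nonzero for a real comparison, so `|` and `+` give the same result.
fn cmp_bitor(a: i32, b: i32) -> Ordering {
    match -((a < b) as i8) | ((a > b) as i8) {
        -1 => Ordering::Less,
        0 => Ordering::Equal,
        _ => Ordering::Greater,
    }
}
```

All three return the same `Ordering` as `a.cmp(&b)`; they differ only in which IR pattern the optimizer ends up having to recognize.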

github-actions bot added the new issue label Nov 26, 2023

scottmcm mentioned this issue Nov 26, 2023

Add Ord::cmp for primitives as a BinOp in MIR rust-lang/rust#118310

Merged

dtcxzyw added missed-optimization and removed new issue labels Nov 27, 2023

scottmcm mentioned this issue Feb 14, 2024

Micro-optimize Ord::cmp for primitives rust-lang/rust#105840

Draft

dtcxzyw added the llvm:instcombine label Jun 23, 2024
