
Regression in code quality for mul=>shift conversions in multiple backends #53829

Closed
asb opened this issue Feb 14, 2022 · 7 comments

asb commented Feb 14, 2022

I spotted that rG995d400f3a3c: [InstCombine] reduce mul operands based on undemanded high bits caused regressions where code that previously compiled to shifts now selects a multiply instead. I got as far as bisecting to that commit, but not as far as looking at mitigations in any detail, so I'm filing this bug so others don't duplicate effort.

I don't think there's any suggestion the modification is wrong - this kind of issue is quite common with these demanded-bits optimisations.

Take this example from fn2_4 in 20040629-1.c from the GCC torture suite.

clang-generated .ll:

%struct.anon = type { i32 }
@b = dso_local global %struct.anon zeroinitializer, align 4

; Function Attrs: nounwind
define dso_local void @fn2_4(i32 noundef %x) #0 {
entry:
  %x.addr = alloca i32, align 4
  store i32 %x, i32* %x.addr, align 4
  %0 = load i32, i32* %x.addr, align 4
  %bf.load = load i32, i32* getelementptr inbounds (%struct.anon, %struct.anon* @b, i32 0, i32 0), align 4
  %bf.lshr = lshr i32 %bf.load, 6
  %bf.clear = and i32 %bf.lshr, 2047
  %sub = sub i32 %bf.clear, %0
  %bf.load1 = load i32, i32* getelementptr inbounds (%struct.anon, %struct.anon* @b, i32 0, i32 0), align 4
  %bf.value = and i32 %sub, 2047
  %bf.shl = shl i32 %bf.value, 6
  %bf.clear2 = and i32 %bf.load1, -131009
  %bf.set = or i32 %bf.clear2, %bf.shl
  store i32 %bf.set, i32* getelementptr inbounds (%struct.anon, %struct.anon* @b, i32 0, i32 0), align 4
  ret void
}

Result of opt -O2 after 995d400:

; ModuleID = '<stdin>'
source_filename = "<stdin>"

%struct.anon = type { i32 }

@b = dso_local local_unnamed_addr global %struct.anon zeroinitializer, align 4

; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn
define dso_local void @fn2_4(i32 noundef %x) local_unnamed_addr #0 {
entry:
  %bf.load = load i32, i32* getelementptr inbounds (%struct.anon, %struct.anon* @b, i64 0, i32 0), align 4
  %sub1.neg = mul i32 %x, 131008
  %bf.lshr2 = add i32 %bf.load, %sub1.neg
  %bf.shl = and i32 %bf.lshr2, 131008
  %bf.clear2 = and i32 %bf.load, -131009
  %bf.set = or i32 %bf.shl, %bf.clear2
  store i32 %bf.set, i32* getelementptr inbounds (%struct.anon, %struct.anon* @b, i64 0, i32 0), align 4
  ret void
}

attributes #0 = { mustprogress nofree norecurse nosync nounwind willreturn }

Result of opt -O2 before that commit:

; ModuleID = '<stdin>'
source_filename = "<stdin>"

%struct.anon = type { i32 }

@b = dso_local local_unnamed_addr global %struct.anon zeroinitializer, align 4

; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn
define dso_local void @fn2_4(i32 noundef %x) local_unnamed_addr #0 {
entry:
  %bf.load = load i32, i32* getelementptr inbounds (%struct.anon, %struct.anon* @b, i64 0, i32 0), align 4
  %sub1.neg = mul i32 %x, -64
  %bf.lshr2 = add i32 %bf.load, %sub1.neg
  %bf.shl = and i32 %bf.lshr2, 131008
  %bf.clear2 = and i32 %bf.load, -131009
  %bf.set = or i32 %bf.shl, %bf.clear2
  store i32 %bf.set, i32* getelementptr inbounds (%struct.anon, %struct.anon* @b, i64 0, i32 0), align 4
  ret void
}

attributes #0 = { mustprogress nofree norecurse nosync nounwind willreturn }

From a quick look, this results in mul instructions for at least i686/x86_64, RISC-V, and AArch64 when they weren't present before.

CC @spatel-gh @topperc


topperc commented Feb 14, 2022

I assume this is interfering with the negative power of 2 special case lowering in the backend.
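The special case referred to here: in two's-complement arithmetic, multiplying by -(1 << c) is the same as shifting left by c and negating, which backends exploit to avoid a real multiply instruction. A quick sanity check of that identity (not LLVM code; Python modelling i32 as arithmetic mod 2^32):

```python
MASK32 = 0xFFFFFFFF

def mul_neg_pow2(x, c):
    """x * -(1 << c), wrapped to 32 bits."""
    return (x * -(1 << c)) & MASK32

def neg_shl(x, c):
    """-(x << c), wrapped to 32 bits: the shift lowering backends prefer."""
    return (-(x << c)) & MASK32

# The two forms agree for all 32-bit inputs and shift amounts.
for x in [0, 1, 12345, 0xDEADBEEF, 0xFFFFFFFF]:
    for c in [0, 1, 6, 31]:
        assert mul_neg_pow2(x, c) == neg_shl(x, c)
```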

@rotateright

Adding @rotateright - I'm also @spatel-gh, but the other ID is mapped to my LLVM commits.

@rotateright

Removing a few more bits from the example, we have these equivalent representations:

define i32 @src(i32 %x, i32 %y) {
  %m = mul i32 %x, -64
  %a = add i32 %y, %m
  %r = and i32 %a, 131008
  ret i32 %r
}

define i32 @tgt(i32 %x, i32 %y) {
  %m = mul i32 %x, 131008
  %a = add i32 %y, %m
  %r = and i32 %a, 131008
  ret i32 %r
}

https://alive2.llvm.org/ce/z/LD6uds
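The Alive2 link proves the equivalence; the intuition is that the mask 131008 (0x1FFC0) keeps only bits 6..16, and the two multipliers -64 and 131008 differ by exactly 131072 = 2^17, so they agree on every bit the mask keeps. A spot check of @src and @tgt in Python (modelling i32 as mod-2^32 arithmetic):

```python
MASK32 = 0xFFFFFFFF

def src(x, y):
    m = (x * -64) & MASK32       # %m = mul i32 %x, -64
    a = (y + m) & MASK32         # %a = add i32 %y, %m
    return a & 131008            # %r = and i32 %a, 131008

def tgt(x, y):
    m = (x * 131008) & MASK32    # %m = mul i32 %x, 131008
    a = (y + m) & MASK32
    return a & 131008

# The multipliers differ by 2**17, and the mask is below bit 17,
# so the masked results always agree.
assert 131008 - (-64) == 1 << 17
for x in [0, 1, 63, 0x1FFC0, 0xFFFFFFFF]:
    for y in [0, 5, 0xABCDEF01]:
        assert src(x, y) == tgt(x, y)
```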

@rotateright

x86-64:

movl	%esi, %eax
shll	$6, %edi
subl	%edi, %eax
andl	$131008, %eax                   ## imm = 0x1FFC0

vs.

imull	$131008, %edi, %eax             ## imm = 0x1FFC0
addl	%esi, %eax
andl	$131008, %eax                   ## imm = 0x1FFC0

AArch64:

sub	w8, w1, w0, lsl #6
and	w0, w8, #0x1ffc0

vs.

mov	w8, #131008
madd	w8, w0, w8, w1
and	w0, w8, #0x1ffc0

RISC-V with the "m" extension (should we test with any other attributes?):

slli	a0, a0, 6
sub	a0, a1, a0
lui	a1, 32
addi	a1, a1, -64
and	a0, a0, a1

vs.

lui	a2, 32
addi	a2, a2, -64
mul	a0, a0, a2
add	a0, a1, a0
and	a0, a0, a2

@rotateright

More generally, we're inverting a transform that is done by InstCombine's Negator.

This is the fold in DAGCombiner - no target hooks or other constraints:

  // fold (mul x, -(1 << c)) -> -(x << c) or (-x) << c
  if (N1IsConst && !N1IsOpaqueConst && ConstValue1.isNegatedPowerOf2()) {
    unsigned Log2Val = (-ConstValue1).logBase2();
    SDLoc DL(N);
    // FIXME: If the input is something that is easily negated (e.g. a
    // single-use add), we should put the negate there.
    return DAG.getNode(ISD::SUB, DL, VT,
                       DAG.getConstant(0, DL, VT),
                       DAG.getNode(ISD::SHL, DL, VT, N0,
                            DAG.getConstant(Log2Val, DL,
                                      getShiftAmountTy(N0.getValueType()))));
  }


rotateright self-assigned this Feb 18, 2022
@rotateright

I think we can reverse this in SDAG's demanded bits, but there's at least one missing fold exposed.

The existing fold in the previous comment seems too aggressive (it may not always be good to replace a single multiply with shift + negate), but we can probably always profitably fold mul+add/sub --> shl+sub.
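The proposed mul+add --> shl+sub fold rests on the identity y + x * -(1 << c) == y - (x << c) in wrap-around arithmetic. A sketch of that check in Python (mod 2^32, standing in for i32):

```python
MASK32 = 0xFFFFFFFF

def mul_add(x, y, c):
    """y + x * -(1 << c): the multiply form, wrapped to 32 bits."""
    return (y + x * -(1 << c)) & MASK32

def shl_sub(x, y, c):
    """y - (x << c): the cheaper shift form, wrapped to 32 bits."""
    return (y - (x << c)) & MASK32

# Both forms compute the same 32-bit value for any inputs.
for x in [0, 1, 7, 0x80000000, 0xFFFFFFFF]:
    for y in [0, 100, 0xCAFEBABE]:
        for c in [0, 3, 6, 31]:
            assert mul_add(x, y, c) == shl_sub(x, y, c)
```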

rotateright added a commit that referenced this issue Feb 18, 2022
This fold is done in IR:
https://alive2.llvm.org/ce/z/jWyFrP

There is an x86 test that shows an improvement
from the added flexibility of using add (commutative).

The other diffs are presumed neutral.

Note that this could also be folded to an 'xor',
but I'm not sure if that would be universally better
(eg, x86 can convert adds more easily into LEA).

This helps prevent regressions from a potential fold for
issue #53829.
@rotateright

mem-frob pushed a commit to draperlaboratory/hope-llvm-project that referenced this issue Oct 7, 2022
This is a fix for a regression discussed in:
llvm/llvm-project#53829

We cleared more high multiplier bits with 995d400,
but that can lead to worse codegen because we would fail
to recognize the now disguised multiplication by neg-power-of-2
as a shift-left. The problem exists independently of the IR
change in the case that the multiply already had cleared high
bits. We also convert shl+sub into mul+add in instcombine's
negator.

This patch fills in the high-bits to see the shift transform
opportunity. Alive2 attempt to show correctness:
https://alive2.llvm.org/ce/z/GgSKVX

The AArch64, RISCV, and MIPS diffs look like clear wins. The
x86 code requires an extra move register in the minimal examples,
but it's still an improvement to get rid of the multiply on all
CPUs that I am aware of (because multiply is never as fast as a
shift).

There's a potential follow-up noted by the TODO comment. We
should already convert that pattern into shl+add in IR, so
it's probably not common:
https://alive2.llvm.org/ce/z/7QY_Ga

Fixes #53829

Differential Revision: https://reviews.llvm.org/D120216
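The "fills in the high-bits" idea from the commit message can be illustrated concretely: the and only demands bits 6..16, so the multiplier's bits above bit 16 are free. Setting them all turns 131008 back into 0xFFFFFFC0, i.e. -64, a negated power of two that the shift lowering recognizes, and doing so never changes the demanded result bits. A small Python illustration of this (mod-2^32 model, not the actual SDAG code):

```python
MASK32 = 0xFFFFFFFF
DEMANDED = 131008                        # 0x1FFC0: bits 6..16 survive the 'and'
HIGH_BITS = MASK32 & ~((1 << 17) - 1)    # every bit above the demanded ones

mult = 131008
filled = (mult | HIGH_BITS) & MASK32
# Filling the undemanded high bits recovers -64, a negated power of two.
assert filled == (-64) & MASK32          # 0xFFFFFFC0

# HIGH_BITS is a multiple of 2**17, so filling those bits cannot
# affect any bit the mask keeps:
for x in [0, 1, 7, 0xFFFFFFFF]:
    assert ((x * mult) & DEMANDED) == ((x * filled) & DEMANDED)
```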