
Regression in code quality for mul=>shift conversions in multiple backends #53829

Closed
asb opened this issue Feb 14, 2022 · 7 comments

asb commented Feb 14, 2022

I spotted that rG995d400f3a3c: [InstCombine] reduce mul operands based on undemanded high bits caused regressions where code that previously compiled to shifts now selects a multiply instead. I got as far as bisecting to that commit, but not as far as looking at mitigations in any detail, so I'm filing this bug so others don't duplicate effort.

I don't think there's any suggestion the modification is wrong - this kind of issue is quite common with these demanded-bits optimisations.

Take this example from fn2_4 in 20040629-1.c from the GCC torture suite.

clang-generated .ll:

%struct.anon = type { i32 }
@b = dso_local global %struct.anon zeroinitializer, align 4

; Function Attrs: nounwind
define dso_local void @fn2_4(i32 noundef %x) #0 {
entry:
  %x.addr = alloca i32, align 4
  store i32 %x, i32* %x.addr, align 4
  %0 = load i32, i32* %x.addr, align 4
  %bf.load = load i32, i32* getelementptr inbounds (%struct.anon, %struct.anon* @b, i32 0, i32 0), align 4
  %bf.lshr = lshr i32 %bf.load, 6
  %bf.clear = and i32 %bf.lshr, 2047
  %sub = sub i32 %bf.clear, %0
  %bf.load1 = load i32, i32* getelementptr inbounds (%struct.anon, %struct.anon* @b, i32 0, i32 0), align 4
  %bf.value = and i32 %sub, 2047
  %bf.shl = shl i32 %bf.value, 6
  %bf.clear2 = and i32 %bf.load1, -131009
  %bf.set = or i32 %bf.clear2, %bf.shl
  store i32 %bf.set, i32* getelementptr inbounds (%struct.anon, %struct.anon* @b, i32 0, i32 0), align 4
  ret void
}

Result of opt -O2 after 995d400:

; ModuleID = '<stdin>'
source_filename = "<stdin>"

%struct.anon = type { i32 }

@b = dso_local local_unnamed_addr global %struct.anon zeroinitializer, align 4

; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn
define dso_local void @fn2_4(i32 noundef %x) local_unnamed_addr #0 {
entry:
  %bf.load = load i32, i32* getelementptr inbounds (%struct.anon, %struct.anon* @b, i64 0, i32 0), align 4
  %sub1.neg = mul i32 %x, 131008
  %bf.lshr2 = add i32 %bf.load, %sub1.neg
  %bf.shl = and i32 %bf.lshr2, 131008
  %bf.clear2 = and i32 %bf.load, -131009
  %bf.set = or i32 %bf.shl, %bf.clear2
  store i32 %bf.set, i32* getelementptr inbounds (%struct.anon, %struct.anon* @b, i64 0, i32 0), align 4
  ret void
}

attributes #0 = { mustprogress nofree norecurse nosync nounwind willreturn }

Result of opt -O2 before that commit:

; ModuleID = '<stdin>'
source_filename = "<stdin>"

%struct.anon = type { i32 }

@b = dso_local local_unnamed_addr global %struct.anon zeroinitializer, align 4

; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn
define dso_local void @fn2_4(i32 noundef %x) local_unnamed_addr #0 {
entry:
  %bf.load = load i32, i32* getelementptr inbounds (%struct.anon, %struct.anon* @b, i64 0, i32 0), align 4
  %sub1.neg = mul i32 %x, -64
  %bf.lshr2 = add i32 %bf.load, %sub1.neg
  %bf.shl = and i32 %bf.lshr2, 131008
  %bf.clear2 = and i32 %bf.load, -131009
  %bf.set = or i32 %bf.shl, %bf.clear2
  store i32 %bf.set, i32* getelementptr inbounds (%struct.anon, %struct.anon* @b, i64 0, i32 0), align 4
  ret void
}

attributes #0 = { mustprogress nofree norecurse nosync nounwind willreturn }

From a quick look, this results in mul instructions for at least i686/x86_64, RISC-V, and AArch64 when they weren't present before.

CC @spatel-gh @topperc


topperc commented Feb 14, 2022

I assume this is interfering with the negative power of 2 special case lowering in the backend.
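The special case referred to here: in two's-complement arithmetic, multiplying by -(1 << c) is the same as shifting left by c and negating, which backends exploit to avoid a real multiply instruction. A quick sanity check of that identity (not LLVM code; Python modelling i32 as arithmetic mod 2^32):

```python
MASK32 = 0xFFFFFFFF

def mul_neg_pow2(x, c):
    """x * -(1 << c), wrapped to 32 bits."""
    return (x * -(1 << c)) & MASK32

def neg_shl(x, c):
    """-(x << c), wrapped to 32 bits: the shift lowering backends prefer."""
    return (-(x << c)) & MASK32

# The two forms agree for all 32-bit inputs and shift amounts.
for x in [0, 1, 12345, 0xDEADBEEF, 0xFFFFFFFF]:
    for c in [0, 1, 6, 31]:
        assert mul_neg_pow2(x, c) == neg_shl(x, c)
```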

@rotateright

Adding @rotateright - I'm also @spatel-gh, but the other ID is mapped to my LLVM commits.

@rotateright

Removing a few more bits from the example, we have these equivalent representations:

define i32 @src(i32 %x, i32 %y) {
  %m = mul i32 %x, -64
  %a = add i32 %y, %m
  %r = and i32 %a, 131008
  ret i32 %r
}

define i32 @tgt(i32 %x, i32 %y) {
  %m = mul i32 %x, 131008
  %a = add i32 %y, %m
  %r = and i32 %a, 131008
  ret i32 %r
}

https://alive2.llvm.org/ce/z/LD6uds
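The Alive2 link proves the equivalence; the intuition is that the mask 131008 (0x1FFC0) keeps only bits 6..16, and the two multipliers -64 and 131008 differ by exactly 131072 = 2^17, so they agree on every bit the mask keeps. A spot check of @src and @tgt in Python (modelling i32 as mod-2^32 arithmetic):

```python
MASK32 = 0xFFFFFFFF

def src(x, y):
    m = (x * -64) & MASK32       # %m = mul i32 %x, -64
    a = (y + m) & MASK32         # %a = add i32 %y, %m
    return a & 131008            # %r = and i32 %a, 131008

def tgt(x, y):
    m = (x * 131008) & MASK32    # %m = mul i32 %x, 131008
    a = (y + m) & MASK32
    return a & 131008

# The multipliers differ by 2**17, and the mask is below bit 17,
# so the masked results always agree.
assert 131008 - (-64) == 1 << 17
for x in [0, 1, 63, 0x1FFC0, 0xFFFFFFFF]:
    for y in [0, 5, 0xABCDEF01]:
        assert src(x, y) == tgt(x, y)
```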

@rotateright

x86-64:

movl	%esi, %eax
shll	$6, %edi
subl	%edi, %eax
andl	$131008, %eax                   ## imm = 0x1FFC0

vs.

imull	$131008, %edi, %eax             ## imm = 0x1FFC0
addl	%esi, %eax
andl	$131008, %eax                   ## imm = 0x1FFC0

AArch64:

sub	w8, w1, w0, lsl #6
and	w0, w8, #0x1ffc0

vs.

mov	w8, #131008
madd	w8, w0, w8, w1
and	w0, w8, #0x1ffc0

RISC-V with the "m" extension (should we test with any other attributes?):

slli	a0, a0, 6
sub	a0, a1, a0
lui	a1, 32
addi	a1, a1, -64
and	a0, a0, a1

vs.

lui	a2, 32
addi	a2, a2, -64
mul	a0, a0, a2
add	a0, a1, a0
and	a0, a0, a2

@rotateright

More generally, we're inverting a transform that is done by InstCombine's Negator.

This is the fold in DAGCombiner - no target hooks or other constraints:

  // fold (mul x, -(1 << c)) -> -(x << c) or (-x) << c
  if (N1IsConst && !N1IsOpaqueConst && ConstValue1.isNegatedPowerOf2()) {
    unsigned Log2Val = (-ConstValue1).logBase2();
    SDLoc DL(N);
    // FIXME: If the input is something that is easily negated (e.g. a
    // single-use add), we should put the negate there.
    return DAG.getNode(ISD::SUB, DL, VT,
                       DAG.getConstant(0, DL, VT),
                       DAG.getNode(ISD::SHL, DL, VT, N0,
                            DAG.getConstant(Log2Val, DL,
                                      getShiftAmountTy(N0.getValueType()))));
  }


rotateright self-assigned this Feb 18, 2022
@rotateright

I think we can reverse this in SDAG's demanded bits, but there's at least one missing fold exposed.

The existing fold in the previous comment seems too aggressive (it may not always be good to replace a single multiply with shift + negate), but we can probably always profitably fold mul+add/sub --> shl+sub.
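The proposed mul+add --> shl+sub fold rests on the identity y + x * -(1 << c) == y - (x << c) in wrap-around arithmetic. A sketch of that check in Python (mod 2^32, standing in for i32):

```python
MASK32 = 0xFFFFFFFF

def mul_add(x, y, c):
    """y + x * -(1 << c): the multiply form, wrapped to 32 bits."""
    return (y + x * -(1 << c)) & MASK32

def shl_sub(x, y, c):
    """y - (x << c): the cheaper shift form, wrapped to 32 bits."""
    return (y - (x << c)) & MASK32

# Both forms compute the same 32-bit value for any inputs.
for x in [0, 1, 7, 0x80000000, 0xFFFFFFFF]:
    for y in [0, 100, 0xCAFEBABE]:
        for c in [0, 3, 6, 31]:
            assert mul_add(x, y, c) == shl_sub(x, y, c)
```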

rotateright added a commit that referenced this issue Feb 18, 2022
This fold is done in IR:
https://alive2.llvm.org/ce/z/jWyFrP

There is an x86 test that shows an improvement
from the added flexibility of using add (commutative).

The other diffs are presumed neutral.

Note that this could also be folded to an 'xor',
but I'm not sure if that would be universally better
(eg, x86 can convert adds more easily into LEA).

This helps prevent regressions from a potential fold for
issue #53829.
@rotateright

mem-frob pushed a commit to draperlaboratory/hope-llvm-project that referenced this issue Oct 7, 2022
This is a fix for a regression discussed in:
llvm/llvm-project#53829

We cleared more high multiplier bits with 995d400,
but that can lead to worse codegen because we would fail
to recognize the now disguised multiplication by neg-power-of-2
as a shift-left. The problem exists independently of the IR
change in the case that the multiply already had cleared high
bits. We also convert shl+sub into mul+add in instcombine's
negator.

This patch fills in the high-bits to see the shift transform
opportunity. Alive2 attempt to show correctness:
https://alive2.llvm.org/ce/z/GgSKVX

The AArch64, RISCV, and MIPS diffs look like clear wins. The
x86 code requires an extra move register in the minimal examples,
but it's still an improvement to get rid of the multiply on all
CPUs that I am aware of (because multiply is never as fast as a
shift).

There's a potential follow-up noted by the TODO comment. We
should already convert that pattern into shl+add in IR, so
it's probably not common:
https://alive2.llvm.org/ce/z/7QY_Ga

Fixes #53829

Differential Revision: https://reviews.llvm.org/D120216
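The "fills in the high-bits" idea from the commit message can be illustrated concretely: the and only demands bits 6..16, so the multiplier's bits above bit 16 are free. Setting them all turns 131008 back into 0xFFFFFFC0, i.e. -64, a negated power of two that the shift lowering recognizes, and doing so never changes the demanded result bits. A small Python illustration of this (mod-2^32 model, not the actual SDAG code):

```python
MASK32 = 0xFFFFFFFF
DEMANDED = 131008                        # 0x1FFC0: bits 6..16 survive the 'and'
HIGH_BITS = MASK32 & ~((1 << 17) - 1)    # every bit above the demanded ones

mult = 131008
filled = (mult | HIGH_BITS) & MASK32
# Filling the undemanded high bits recovers -64, a negated power of two.
assert filled == (-64) & MASK32          # 0xFFFFFFC0

# HIGH_BITS is a multiple of 2**17, so filling those bits cannot
# affect any bit the mask keeps:
for x in [0, 1, 7, 0xFFFFFFFF]:
    assert ((x * mult) & DEMANDED) == ((x * filled) & DEMANDED)
```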