
[WIP][DAG] Introduce generic shl_add node [NFC] #88791

Open · preames wants to merge 2 commits into main
Conversation


@preames preames commented Apr 15, 2024

Please start with (#89263)

This patch introduces a generic SHL_ADD node in SelectionDAG. This node has the semantics of (add (shl A, ConstantB), C) but performed as a single instruction. This corresponds to instructions on multiple architectures (a construction sketch follows the list):

  • LEA on X86
  • sh1add, sh2add, and sh3add on RISC-V
  • adds on AArch64
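
For illustration, forming x*9 with this node would look roughly like the following sketch (the MVT::i8 shift-amount constant mirrors what the X86 changes in this patch use; other targets would use their usual shift-amount type):

  // x * 9 == (x << 3) + x, expressed as a single SHL_ADD node.
  SDValue ShAmt = DAG.getConstant(3, DL, MVT::i8);
  SDValue Mul9 = DAG.getNode(ISD::SHL_ADD, DL, VT, X, ShAmt, X);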

The initial use case is during expansion of scalar multiplies. The X86 backend has long had the MUL_IMM node to prevent combine loops. As actually used, MUL_IMM only supported the immediates 3, 5, and 9. Not coincidentally, these are exactly the multiplies that can be done via a single LEA.
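
Concretely, the identities behind those immediates are:

  x * 3 == (x << 1) + x   // LEA with scale 2 / sh1add
  x * 5 == (x << 2) + x   // LEA with scale 4 / sh2add
  x * 9 == (x << 3) + x   // LEA with scale 8 / sh3add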

I'm in the process of extending our scalar multiplication code on RISC-V and am going to need a similar construct. Rather than reinventing the wheel, it seems better to introduce a common representation.

In the current form, I am not proposing legalization support or any generic combines to form shl_add. As currently structured, shl_add is only created post-legalization and always corresponds to a legal instruction. (This matches what x86 had before.) Note that the range of valid constant shift amounts differs on each architecture.

I am hoping that I'll be able to restructure some of the x86 code into generic combine infrastructure to share mul expansion, but that is an unproven theory at the moment. Longer term, we may also be able to form shl_add earlier when legal, but that has the challenge of potentially interfering with other desirable transforms.

In the current patch, I am not proposing any flags for shl_add as the current use case doesn't require them, but in principle, I see nothing which prevents us from later supporting nuw, nsw, and disjoint with the obvious meanings.
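
For reference, the "obvious meanings" I have in mind (none of this is in the current patch):

  // nuw/nsw:   neither the implied shl nor the add wraps (unsigned/signed).
  // disjoint:  (shl A, ConstantB) and C share no set bits, so the add
  //            could equally well be emitted as an or.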

llvmbot commented Apr 15, 2024

@llvm/pr-subscribers-backend-risc-v

@llvm/pr-subscribers-backend-x86

Author: Philip Reames (preames)


Full diff: https://github.com/llvm/llvm-project/pull/88791.diff

10 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/ISDOpcodes.h (+7)
  • (modified) llvm/include/llvm/Target/TargetSelectionDAG.td (+5)
  • (modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp (+12)
  • (modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp (+1)
  • (modified) llvm/lib/Target/RISCV/RISCVISelLowering.cpp (+10-8)
  • (modified) llvm/lib/Target/RISCV/RISCVInstrInfoZb.td (+5)
  • (modified) llvm/lib/Target/X86/X86ISelDAGToDAG.cpp (+39-2)
  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+15-23)
  • (modified) llvm/lib/Target/X86/X86ISelLowering.h (-3)
  • (modified) llvm/lib/Target/X86/X86InstrFragments.td (+3-5)
diff --git a/llvm/include/llvm/CodeGen/ISDOpcodes.h b/llvm/include/llvm/CodeGen/ISDOpcodes.h
index 49d51a27e3c0f6..a45e5b26a58198 100644
--- a/llvm/include/llvm/CodeGen/ISDOpcodes.h
+++ b/llvm/include/llvm/CodeGen/ISDOpcodes.h
@@ -710,6 +710,13 @@ enum NodeType {
   FSHL,
   FSHR,
 
+  // Represents (ADD (SHL a, b), c) with the arguments appearing in the order
+  // a, b, c.  'b' must be a constant, and follows the rules for shift amount
+  // types described just above.  This is used soley post-legalization when
+  // lowering MUL to target specific instructions - e.g. LEA on x86 or
+  // sh1add/sh2add/sh3add on RISCV.
+  SHL_ADD,
+
   /// Byte Swap and Counting operators.
   BSWAP,
   CTTZ,
diff --git a/llvm/include/llvm/Target/TargetSelectionDAG.td b/llvm/include/llvm/Target/TargetSelectionDAG.td
index ea3520835fa07d..aeef292d34b85b 100644
--- a/llvm/include/llvm/Target/TargetSelectionDAG.td
+++ b/llvm/include/llvm/Target/TargetSelectionDAG.td
@@ -121,6 +121,10 @@ def SDTIntShiftOp : SDTypeProfile<1, 2, [   // shl, sra, srl
 def SDTIntShiftDOp: SDTypeProfile<1, 3, [   // fshl, fshr
   SDTCisSameAs<0, 1>, SDTCisSameAs<0, 2>, SDTCisInt<0>, SDTCisInt<3>
 ]>;
+def SDTIntShiftAddOp : SDTypeProfile<1, 3, [   // shl_add
+  SDTCisSameAs<0, 1>, SDTCisSameAs<0, 3>, SDTCisInt<0>, SDTCisInt<2>,
+  SDTCisInt<3>
+]>;
 def SDTIntSatNoShOp : SDTypeProfile<1, 2, [   // ssat with no shift
   SDTCisSameAs<0, 1>, SDTCisInt<2>
 ]>;
@@ -411,6 +415,7 @@ def rotl       : SDNode<"ISD::ROTL"      , SDTIntShiftOp>;
 def rotr       : SDNode<"ISD::ROTR"      , SDTIntShiftOp>;
 def fshl       : SDNode<"ISD::FSHL"      , SDTIntShiftDOp>;
 def fshr       : SDNode<"ISD::FSHR"      , SDTIntShiftDOp>;
+def shl_add    : SDNode<"ISD::SHL_ADD"   , SDTIntShiftAddOp>;
 def and        : SDNode<"ISD::AND"       , SDTIntBinOp,
                         [SDNPCommutative, SDNPAssociative]>;
 def or         : SDNode<"ISD::OR"        , SDTIntBinOp,
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
index ca0a95750ba8d8..59385c8bc27925 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
@@ -3521,6 +3521,13 @@ KnownBits SelectionDAG::computeKnownBits(SDValue Op, const APInt &DemandedElts,
     Known = KnownBits::ashr(Known, Known2, /*ShAmtNonZero=*/false,
                             Op->getFlags().hasExact());
     break;
+  case ISD::SHL_ADD:
+    Known = computeKnownBits(Op.getOperand(0), DemandedElts, Depth + 1);
+    Known2 = computeKnownBits(Op.getOperand(1), DemandedElts, Depth + 1);
+    Known = KnownBits::computeForAddSub(true, false, false,
+        KnownBits::shl(Known, Known2),
+        computeKnownBits(Op.getOperand(2), DemandedElts, Depth + 1));
+    break;
   case ISD::FSHL:
   case ISD::FSHR:
     if (ConstantSDNode *C = isConstOrConstSplat(Op.getOperand(2), DemandedElts)) {
@@ -7346,6 +7353,11 @@ SDValue SelectionDAG::getNode(unsigned Opcode, const SDLoc &DL, EVT VT,
     if (N1.getValueType() == VT)
       return N1;
     break;
+  case ISD::SHL_ADD:
+    assert(VT == N1.getValueType() && VT == N3.getValueType());
+    assert(TLI->isTypeLegal(VT) && "Created only post legalize");
+    assert(isa<ConstantSDNode>(N2) && "Constant shift expected");
+    break;
   }
 
   // Memoize node if it doesn't produce a glue result.
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp
index 6691aa41face39..cc9dafcfa0c728 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp
@@ -264,6 +264,7 @@ std::string SDNode::getOperationName(const SelectionDAG *G) const {
   case ISD::SRL:                        return "srl";
   case ISD::ROTL:                       return "rotl";
   case ISD::ROTR:                       return "rotr";
+  case ISD::SHL_ADD:                    return "shl_add";
   case ISD::FSHL:                       return "fshl";
   case ISD::FSHR:                       return "fshr";
   case ISD::FADD:                       return "fadd";
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 27387595164a46..cc64ccbedee92c 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -12789,10 +12789,9 @@ static SDValue transformAddShlImm(SDNode *N, SelectionDAG &DAG,
   SDLoc DL(N);
   SDValue NS = (C0 < C1) ? N0->getOperand(0) : N1->getOperand(0);
   SDValue NL = (C0 > C1) ? N0->getOperand(0) : N1->getOperand(0);
-  SDValue NA0 =
-      DAG.getNode(ISD::SHL, DL, VT, NL, DAG.getConstant(Diff, DL, VT));
-  SDValue NA1 = DAG.getNode(ISD::ADD, DL, VT, NA0, NS);
-  return DAG.getNode(ISD::SHL, DL, VT, NA1, DAG.getConstant(Bits, DL, VT));
+  SDValue SHADD =
+      DAG.getNode(ISD::SHL_ADD, DL, VT, NL, DAG.getConstant(Diff, DL, VT), NS);
+  return DAG.getNode(ISD::SHL, DL, VT, SHADD, DAG.getConstant(Bits, DL, VT));
 }
 
 // Combine a constant select operand into its use:
@@ -13028,14 +13027,17 @@ static SDValue combineAddOfBooleanXor(SDNode *N, SelectionDAG &DAG) {
                      N0.getOperand(0));
 }
 
-static SDValue performADDCombine(SDNode *N, SelectionDAG &DAG,
+static SDValue performADDCombine(SDNode *N,
+                                 TargetLowering::DAGCombinerInfo &DCI,
                                  const RISCVSubtarget &Subtarget) {
+  SelectionDAG &DAG = DCI.DAG;
   if (SDValue V = combineAddOfBooleanXor(N, DAG))
     return V;
   if (SDValue V = transformAddImmMulImm(N, DAG, Subtarget))
     return V;
-  if (SDValue V = transformAddShlImm(N, DAG, Subtarget))
-    return V;
+  if (!DCI.isBeforeLegalize())
+    if (SDValue V = transformAddShlImm(N, DAG, Subtarget))
+      return V;
   if (SDValue V = combineBinOpToReduce(N, DAG, Subtarget))
     return V;
   if (SDValue V = combineBinOpOfExtractToReduceTree(N, DAG, Subtarget))
@@ -15894,7 +15896,7 @@ SDValue RISCVTargetLowering::PerformDAGCombine(SDNode *N,
       return V;
     if (SDValue V = combineToVWMACC(N, DAG, Subtarget))
       return V;
-    return performADDCombine(N, DAG, Subtarget);
+    return performADDCombine(N, DCI, Subtarget);
   }
   case ISD::SUB: {
     if (SDValue V = combineBinOp_VLToVWBinOp_VL(N, DCI, Subtarget))
diff --git a/llvm/lib/Target/RISCV/RISCVInstrInfoZb.td b/llvm/lib/Target/RISCV/RISCVInstrInfoZb.td
index 434b071e628a0e..8837c66d603779 100644
--- a/llvm/lib/Target/RISCV/RISCVInstrInfoZb.td
+++ b/llvm/lib/Target/RISCV/RISCVInstrInfoZb.td
@@ -678,6 +678,8 @@ foreach i = {1,2,3} in {
   defvar shxadd = !cast<Instruction>("SH"#i#"ADD");
   def : Pat<(XLenVT (add_like_non_imm12 (shl GPR:$rs1, (XLenVT i)), GPR:$rs2)),
             (shxadd GPR:$rs1, GPR:$rs2)>;
+  def : Pat<(XLenVT (shl_add GPR:$rs1, (XLenVT i), GPR:$rs2)),
+            (shxadd GPR:$rs1, GPR:$rs2)>;
 
   defvar pat = !cast<ComplexPattern>("sh"#i#"add_op");
   // More complex cases use a ComplexPattern.
@@ -881,6 +883,9 @@ foreach i = {1,2,3} in {
   defvar shxadd = !cast<Instruction>("SH"#i#"ADD");
   def : Pat<(i32 (add_like_non_imm12 (shl GPR:$rs1, (i64 i)), GPR:$rs2)),
             (shxadd GPR:$rs1, GPR:$rs2)>;
+  def : Pat<(i32 (shl_add GPR:$rs1, (i32 i), GPR:$rs2)),
+            (shxadd GPR:$rs1, GPR:$rs2)>;
+
 }
 }
 
diff --git a/llvm/lib/Target/X86/X86ISelDAGToDAG.cpp b/llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
index 4e4241efd63d6b..50e271bde041ad 100644
--- a/llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
+++ b/llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
@@ -2519,7 +2519,6 @@ bool X86DAGToDAGISel::matchAddressRecursively(SDValue N, X86ISelAddressMode &AM,
     if (N.getResNo() != 0) break;
     [[fallthrough]];
   case ISD::MUL:
-  case X86ISD::MUL_IMM:
     // X*[3,5,9] -> X+X*[2,4,8]
     if (AM.BaseType == X86ISelAddressMode::RegBase &&
         AM.Base_Reg.getNode() == nullptr &&
@@ -2551,7 +2550,45 @@ bool X86DAGToDAGISel::matchAddressRecursively(SDValue N, X86ISelAddressMode &AM,
         }
     }
     break;
-
+  case ISD::SHL_ADD: {
+    // X << [1,2,3] + Y (we should never create anything else)
+    auto *CN = cast<ConstantSDNode>(N.getOperand(1));
+    assert(CN->getZExtValue() == 1 || CN->getZExtValue() == 2 ||
+           CN->getZExtValue() == 3);
+    if (AM.BaseType == X86ISelAddressMode::RegBase &&
+        AM.Base_Reg.getNode() == nullptr &&
+        AM.IndexReg.getNode() == nullptr) {
+      AM.Scale = unsigned(2 << (CN->getZExtValue() - 1));
+
+      if (N.getOperand(0) == N.getOperand(2)) {
+        SDValue MulVal = N.getOperand(0);
+        SDValue Reg;
+
+        // Okay, we know that we have a scale by now.  However, if the scaled
+        // value is an add of something and a constant, we can fold the
+        // constant into the disp field here.
+        if (MulVal.getNode()->getOpcode() == ISD::ADD &&
+            N->isOnlyUserOf(MulVal.getNode()) &&
+            isa<ConstantSDNode>(MulVal.getOperand(1))) {
+          Reg = MulVal.getOperand(0);
+          auto *AddVal = cast<ConstantSDNode>(MulVal.getOperand(1));
+          uint64_t Disp = AddVal->getSExtValue() * (AM.Scale + 1);
+          if (foldOffsetIntoAddress(Disp, AM))
+            Reg = N.getOperand(0);
+        } else {
+          Reg = N.getOperand(0);
+        }
+        AM.IndexReg = AM.Base_Reg = Reg;
+        return false;
+      }
+      // TODO: If N.getOperand(2) is a constant, we could try folding
+      // the displacement analogously to the above.
+      AM.IndexReg = N.getOperand(0);
+      AM.Base_Reg = N.getOperand(2);
+      return false;
+    }
+    break;
+  }
   case ISD::SUB: {
     // Given A-B, if A can be completely folded into the address and
     // the index field with the index field unused, use -B as the index.
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index f16a751a166d69..477c368c22e8ea 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -33553,7 +33553,6 @@ const char *X86TargetLowering::getTargetNodeName(unsigned Opcode) const {
   NODE_NAME_CASE(BZHI)
   NODE_NAME_CASE(PDEP)
   NODE_NAME_CASE(PEXT)
-  NODE_NAME_CASE(MUL_IMM)
   NODE_NAME_CASE(MOVMSK)
   NODE_NAME_CASE(PTEST)
   NODE_NAME_CASE(TESTP)
@@ -36845,13 +36844,6 @@ void X86TargetLowering::computeKnownBitsForTargetNode(const SDValue Op,
   Known.resetAll();
   switch (Opc) {
   default: break;
-  case X86ISD::MUL_IMM: {
-    KnownBits Known2;
-    Known = DAG.computeKnownBits(Op.getOperand(1), DemandedElts, Depth + 1);
-    Known2 = DAG.computeKnownBits(Op.getOperand(0), DemandedElts, Depth + 1);
-    Known = KnownBits::mul(Known, Known2);
-    break;
-  }
   case X86ISD::SETCC:
     Known.Zero.setBitsFrom(1);
     break;
@@ -46905,12 +46897,18 @@ static SDValue reduceVMULWidth(SDNode *N, SelectionDAG &DAG,
   return DAG.getNode(ISD::CONCAT_VECTORS, DL, VT, ResLo, ResHi);
 }
 
+static SDValue createMulImm(uint64_t MulAmt, SDValue N, SelectionDAG &DAG,
+                            EVT VT, const SDLoc &DL) {
+  assert(MulAmt == 3 || MulAmt == 5 || MulAmt == 9);
+  SDValue ShAmt = DAG.getConstant(Log2_64(MulAmt-1), DL, MVT::i8);
+  return DAG.getNode(ISD::SHL_ADD, DL, VT, N, ShAmt, N);
+}
+
 static SDValue combineMulSpecial(uint64_t MulAmt, SDNode *N, SelectionDAG &DAG,
                                  EVT VT, const SDLoc &DL) {
 
   auto combineMulShlAddOrSub = [&](int Mult, int Shift, bool isAdd) {
-    SDValue Result = DAG.getNode(X86ISD::MUL_IMM, DL, VT, N->getOperand(0),
-                                 DAG.getConstant(Mult, DL, VT));
+    SDValue Result = createMulImm(Mult, N->getOperand(0), DAG, VT, DL);
     Result = DAG.getNode(ISD::SHL, DL, VT, Result,
                          DAG.getConstant(Shift, DL, MVT::i8));
     Result = DAG.getNode(isAdd ? ISD::ADD : ISD::SUB, DL, VT, Result,
@@ -46919,10 +46917,8 @@ static SDValue combineMulSpecial(uint64_t MulAmt, SDNode *N, SelectionDAG &DAG,
   };
 
   auto combineMulMulAddOrSub = [&](int Mul1, int Mul2, bool isAdd) {
-    SDValue Result = DAG.getNode(X86ISD::MUL_IMM, DL, VT, N->getOperand(0),
-                                 DAG.getConstant(Mul1, DL, VT));
-    Result = DAG.getNode(X86ISD::MUL_IMM, DL, VT, Result,
-                         DAG.getConstant(Mul2, DL, VT));
+    SDValue Result = createMulImm(Mul1, N->getOperand(0), DAG, VT, DL);
+    Result = createMulImm(Mul2, Result, DAG, VT, DL);
     Result = DAG.getNode(isAdd ? ISD::ADD : ISD::SUB, DL, VT, Result,
                          N->getOperand(0));
     return Result;
@@ -46982,9 +46978,8 @@ static SDValue combineMulSpecial(uint64_t MulAmt, SDNode *N, SelectionDAG &DAG,
       unsigned ShiftAmt = Log2_64((MulAmt & (MulAmt - 1)));
       SDValue Shift1 = DAG.getNode(ISD::SHL, DL, VT, N->getOperand(0),
                                    DAG.getConstant(ShiftAmt, DL, MVT::i8));
-      SDValue Shift2 = DAG.getNode(ISD::SHL, DL, VT, N->getOperand(0),
-                                   DAG.getConstant(ScaleShift, DL, MVT::i8));
-      return DAG.getNode(ISD::ADD, DL, VT, Shift1, Shift2);
+      return DAG.getNode(ISD::SHL_ADD, DL, VT, N->getOperand(0),
+                         DAG.getConstant(ScaleShift, DL, MVT::i8), Shift1);
     }
   }
 
@@ -47204,8 +47199,7 @@ static SDValue combineMul(SDNode *N, SelectionDAG &DAG,
   SDValue NewMul = SDValue();
   if (VT == MVT::i64 || VT == MVT::i32) {
     if (AbsMulAmt == 3 || AbsMulAmt == 5 || AbsMulAmt == 9) {
-      NewMul = DAG.getNode(X86ISD::MUL_IMM, DL, VT, N->getOperand(0),
-                           DAG.getConstant(AbsMulAmt, DL, VT));
+      NewMul = createMulImm(AbsMulAmt, N->getOperand(0), DAG, VT, DL);
       if (SignMulAmt < 0)
         NewMul =
             DAG.getNode(ISD::SUB, DL, VT, DAG.getConstant(0, DL, VT), NewMul);
@@ -47243,15 +47237,13 @@ static SDValue combineMul(SDNode *N, SelectionDAG &DAG,
         NewMul = DAG.getNode(ISD::SHL, DL, VT, N->getOperand(0),
                              DAG.getConstant(Log2_64(MulAmt1), DL, MVT::i8));
       else
-        NewMul = DAG.getNode(X86ISD::MUL_IMM, DL, VT, N->getOperand(0),
-                             DAG.getConstant(MulAmt1, DL, VT));
+        NewMul = createMulImm(MulAmt1, N->getOperand(0), DAG, VT, DL);
 
       if (isPowerOf2_64(MulAmt2))
         NewMul = DAG.getNode(ISD::SHL, DL, VT, NewMul,
                              DAG.getConstant(Log2_64(MulAmt2), DL, MVT::i8));
       else
-        NewMul = DAG.getNode(X86ISD::MUL_IMM, DL, VT, NewMul,
-                             DAG.getConstant(MulAmt2, DL, VT));
+        NewMul = NewMul = createMulImm(MulAmt2, NewMul, DAG, VT, DL);
 
       // Negate the result.
       if (SignMulAmt < 0)
diff --git a/llvm/lib/Target/X86/X86ISelLowering.h b/llvm/lib/Target/X86/X86ISelLowering.h
index 0a1e8ca4427314..7c5bfac3308c8e 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.h
+++ b/llvm/lib/Target/X86/X86ISelLowering.h
@@ -417,9 +417,6 @@ namespace llvm {
     PDEP,
     PEXT,
 
-    // X86-specific multiply by immediate.
-    MUL_IMM,
-
     // Vector sign bit extraction.
     MOVMSK,
 
diff --git a/llvm/lib/Target/X86/X86InstrFragments.td b/llvm/lib/Target/X86/X86InstrFragments.td
index f14c7200af968a..faeeccab7dac72 100644
--- a/llvm/lib/Target/X86/X86InstrFragments.td
+++ b/llvm/lib/Target/X86/X86InstrFragments.td
@@ -284,8 +284,6 @@ def X86bzhi   : SDNode<"X86ISD::BZHI",   SDTIntBinOp>;
 def X86pdep   : SDNode<"X86ISD::PDEP",   SDTIntBinOp>;
 def X86pext   : SDNode<"X86ISD::PEXT",   SDTIntBinOp>;
 
-def X86mul_imm : SDNode<"X86ISD::MUL_IMM", SDTIntBinOp>;
-
 def X86DynAlloca : SDNode<"X86ISD::DYN_ALLOCA", SDT_X86DYN_ALLOCA,
                           [SDNPHasChain, SDNPOutGlue]>;
 
@@ -341,11 +339,11 @@ def X86cmpccxadd : SDNode<"X86ISD::CMPCCXADD", SDTX86Cmpccxadd,
 // Define X86-specific addressing mode.
 def addr      : ComplexPattern<iPTR, 5, "selectAddr", [], [SDNPWantParent]>;
 def lea32addr : ComplexPattern<i32, 5, "selectLEAAddr",
-                               [add, sub, mul, X86mul_imm, shl, or, xor, frameindex],
+                               [add, sub, mul, shl_add, shl, or, xor, frameindex],
                                []>;
 // In 64-bit mode 32-bit LEAs can use RIP-relative addressing.
 def lea64_32addr : ComplexPattern<i32, 5, "selectLEA64_32Addr",
-                                  [add, sub, mul, X86mul_imm, shl, or, xor,
+                                  [add, sub, mul, shl_add, shl, or, xor,
                                    frameindex, X86WrapperRIP],
                                   []>;
 
@@ -356,7 +354,7 @@ def tls32baseaddr : ComplexPattern<i32, 5, "selectTLSADDRAddr",
                                [tglobaltlsaddr], []>;
 
 def lea64addr : ComplexPattern<i64, 5, "selectLEAAddr",
-                        [add, sub, mul, X86mul_imm, shl, or, xor, frameindex,
+                        [add, sub, mul, shl_add, shl, or, xor, frameindex,
                          X86WrapperRIP], []>;
 
 def tls64addr : ComplexPattern<i64, 5, "selectTLSADDRAddr",


github-actions bot commented Apr 15, 2024

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff b7b183371b54a2c4b5d2a39c594d3967a7384cb8 33fd866e5fe4711a3dcd906b1f0d3449bf5e7d09 -- llvm/include/llvm/CodeGen/ISDOpcodes.h llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp llvm/lib/Target/RISCV/RISCVISelLowering.cpp llvm/lib/Target/X86/X86ISelDAGToDAG.cpp llvm/lib/Target/X86/X86ISelLowering.cpp llvm/lib/Target/X86/X86ISelLowering.h
View the diff from clang-format here.
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 1748b372e2..749d9f7378 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -36843,7 +36843,8 @@ void X86TargetLowering::computeKnownBitsForTargetNode(const SDValue Op,
 
   Known.resetAll();
   switch (Opc) {
-  default: break;
+  default:
+    break;
   case X86ISD::SETCC:
     Known.Zero.setBitsFrom(1);
     break;


arsenm commented Apr 15, 2024

AMDGPU has these too, sometimes


preames commented Apr 18, 2024

Ping


@RKSimon RKSimon left a comment


No objections from me. Needing MUL_IMM at all has always annoyed me; it's often needed to stop extra combines that would prevent us from matching LEA etc., but that often leads to more complex code than necessary.

There's the possibility we could use SHL_ADD for vector gather/scatter address math (vector index + offset, imm scalar shift amt), but I don't see anything here that would prevent that in the future?

Note - there are a few real clang-format warnings that need cleaning up.

preames added a commit to preames/llvm-project that referenced this pull request Apr 18, 2024
This implements a RISC-V specific version of the SHL_ADD node proposed in
llvm#88791.

If that lands, the infrastructure from this patch should seamlessly switch
over to the generic DAG node.  I'm posting this separately because I've run
out of useful multiply strength reduction work to do without having a way to
represent MUL X, 3/5/9 as a single instruction.

The majority of this change is moving two sets of patterns out of tablegen
and into the post-legalize combine.  The major reason for this is that I
have an upcoming change which needs to reuse the expansion logic, but it
also helps common up some code between zba and the THeadBa variants.

On the test changes, there are a couple of major categories:
* We chose a different lowering for mul x, 25.  The new lowering involves
  one fewer register and the same critical path, so this seems like a win.
* The order of the two multiplies changes in (3,5,9)*(3,5,9) in some cases.
  I don't believe this matters.
* I'm removing the one use restriction on the multiply.  This restriction
  doesn't really make sense to me, and the test changes appear positive.
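
As a sketch of the first point (hypothetical shape, not the exact test
output): 25 factors as 5*5, so the expansion can live in a single register:

  // 25*x == 5*(5*x), one sh2add (or LEA) per step:
  //   t = (x << 2) + x;   // 5*x
  //   r = (t << 2) + t;   // 25*x; t is dead here, so no extra register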

preames commented Apr 18, 2024

Note - there are a few real clang-format warnings that need cleaning up.

Real ones fixed, remaining failure appears to be a misconfiguration in the bot.

dtcxzyw added a commit to dtcxzyw/llvm-codegen-benchmark that referenced this pull request Apr 18, 2024
// a, b, c. 'b' must be a constant, and follows the rules for shift amount
// types described just above. This is used soley post-legalization when
// lowering MUL to target specific instructions - e.g. LEA on x86 or
// sh1add/sh2add/sh3add on RISCV.

Collaborator:
RISCV -> RISC-V

SDValue NA1 = DAG.getNode(ISD::ADD, DL, VT, NA0, NS);
return DAG.getNode(ISD::SHL, DL, VT, NA1, DAG.getConstant(Bits, DL, VT));
SDValue SHADD =
DAG.getNode(ISD::SHL_ADD, DL, VT, NL, DAG.getConstant(Diff, DL, VT), NS);

Collaborator:
Should we use TargetConstant if it's required to be a constant?

preames (Author):
Fixed over in (#89263).

EVT VT, const SDLoc &DL) {
assert(MulAmt == 3 || MulAmt == 5 || MulAmt == 9);
SDValue ShAmt = DAG.getConstant(Log2_64(MulAmt - 1), DL, MVT::i8);
return DAG.getNode(ISD::SHL_ADD, DL, VT, N, ShAmt, N);

Collaborator:
Is this going to cause an issue with poison? We've now increased the use count of N.

preames (Author):
That... is a good question. We probably need to freeze here since we're increasing the number of uses; I had not considered that. Let me add the freeze and see if that influences codegen in practice. If it does, we may need to consider both a SHL_ADD node and a MUL359 node. I'm hoping we don't; let me investigate and report back.
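
To spell out the concern (an illustrative sketch, not code from this patch):

  // If x is undef, each use may independently take any value:
  //   mul x, 3         // every concrete choice a yields 3*a, a multiple of 3
  //   (x << 1) + x     // may evaluate as (a << 1) + b, i.e. 2*a + b
  // freeze(x) pins a single value for both uses, restoring the equivalence.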

preames (Author):
Well, the news here is not good. Adding in Freeze in the x86 backend code causes a whole bunch of regressions that were not obvious on first glance. Interestingly, incorporating the same logic into the RISC-V specific version of this patch (#89263) doesn't seem to expose the same kind of problems - most likely because the usage is much more isolated. #89290 fixes an analogous freeze issue in code already landed, again with no visible code diff.

I think what I'd like to suggest here is that we go ahead and focus review on #89263. Once we land that, I can iterate in tree on the RISC-V specific parts, and then rebase this patch on a fully fleshed through implementation and focus it on the x86 merge. (I clearly need to track something down there.)

(For the record, the issue @dtcxzyw flagged in the RISC-V specific part of this patch doesn't exist in #89263, as I focused on a different subset there. That's probably confusing for reviewers in retrospect, sorry!)

preames (Author):
I investigated these differences further. Net result is one fairly obvious missed optimization, one somewhat complicated but reasonable issue with COPY elimination, and one fundamental issue. I'm going to focus on only the last.

We end up with a situation where an inserted freeze gets hoisted through a chain of computation. This is all correct and fine, but as a side effect of that hoisting, we strip nsw off an add. The net result is that we can't prove a narrow addressing sequence is equivalent to the wider form, and thus fail to be able to fold a constant base offset into the addressing mode.
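
A hypothetical sketch of that interaction (not the actual reproducer):

  %a = add nsw i64 %p, 16
  %f = freeze i64 %a
  ; hoisting the freeze through the add yields:
  %p.fr = freeze i64 %p
  %a2   = add i64 %p.fr, 16   ; nsw dropped, since add nsw can create poison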

I'm a bit stuck on what to do about this case, and need to give this more thought.


dtcxzyw commented Apr 18, 2024

This patch is not an NFC.

Example: https://godbolt.org/z/PMTqjf45M

; llc -mtriple=riscv64 -mattr=+zba test.ll
define i64 @func(i64 signext %0, i32 signext %1) {
entry:
  %2 = zext i32 %1 to i64
  %3 = shl nuw nsw i64 %2, 5
  %4 = shl nsw i64 %0, 3
  %5 = add nsw i64 %3, %4
  ret i64 %5
}

Before:

func:                                   # @func
        sh2add.uw       a0, a1, a0
        slli    a0, a0, 3
        ret

After:

func:                                   # @func
        zext.w  a1, a1
        sh2add  a0, a1, a0
        slli    a0, a0, 3
        ret
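
For reference, the Zba semantics at play here:

  sh2add    rd, rs1, rs2   =>   rd = (rs1 << 2) + rs2
  sh2add.uw rd, rs1, rs2   =>   rd = (zext32(rs1) << 2) + rs2

The .uw form folds the zero-extend into the shift-add; without a shl_add pattern covering it, the zext.w above is emitted separately.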


preames commented Apr 18, 2024

This patch is not an NFC.

Example: https://godbolt.org/z/PMTqjf45M

Looks like maybe I missed a pattern for the .uw variant, will go track this down.

Can I ask how you're finding these? This is the second time you've chimed in with a really useful counterexample on these threads. Are you fuzzing the changes? Much appreciated.

preames added a commit to preames/llvm-project that referenced this pull request Apr 18, 2024
topperc pointed this out in review of llvm#88791, but I believe the problem applies here as well.

Worth noting is that the code I introduced with this bug was mostly
copied from other targets - which also have this bug.

dtcxzyw commented Apr 18, 2024

Can I ask how you're finding these? This is the second time you've chimed in with a really useful counter example on these threads.

These cases are sampled from my opt-benchmark. It is built from some popular real-world libraries/applications. BTW, it has been used to assist in the review of InstCombine-related patches.

Are you fuzzing the changes?

I don't see any value in the fuzzer-generated cases (at least in terms of performance). See also https://llvm.org/docs/InstCombineContributorGuide.html#real-world-usefulness.

@preames preames changed the title [DAG] Introduce generic shl_add node [NFC] [WIP][DAG] Introduce generic shl_add node [NFC] Apr 18, 2024

preames commented Apr 18, 2024

This patch is not an NFC.
Example: https://godbolt.org/z/PMTqjf45M

Looks like maybe I missed a pattern for the .uw variant, will go track this down.

I've confirmed this was a missing pattern, and I have a local patch with a fix, separated out and based on the RISCVISD::SHL_ADD change. I am not going to post it until the RISCVISD::SHL_ADD patch lands, to avoid confusing us all.

preames added a commit that referenced this pull request Apr 22, 2024
…#89290)

topperc pointed this out in review of #88791, but I believe the problem
applies here as well. Worth noting is that the code I introduced with this
bug was mostly copied from other targets - which also have this bug.
preames added a commit that referenced this pull request Apr 22, 2024
…89263)

This implements a RISC-V specific version of the SHL_ADD node proposed in
#88791.

If that lands, the infrastructure from this patch should seamlessly
switch over to the generic DAG node. I'm posting this separately because
I've run out of useful multiply strength reduction work to do without
having a way to represent MUL X, 3/5/9 as a single instruction.

The majority of this change is moving two sets of patterns out of
tablgen and into the post-legalize combine. The major reason for this is
that I have an upcoming change which needs to reuse the expansion logic,
but it also helps common up some code between zba and the THeadBa
variants.

On the test changes, there are a couple of major categories:
* We chose a different lowering for mul x, 25. The new lowering involves
one fewer register and the same critical path, so this seems like a win.
* The order of the two multiplies changes in (3,5,9)*(3,5,9) in some
cases. I don't believe this matters.
* I'm removing the one use restriction on the multiply. This restriction
doesn't really make sense to me, and the test changes appear positive.
preames added a commit that referenced this pull request Apr 23, 2024
…ombine (#89263)"

Changes since original commit:
* Rebase over improved test coverage for theadba
* Revert change to use TargetConstant as it appears to prevent the uimm2
  clause from matching in the XTheadBa patterns.
* Fix an order of operands bug in the THeadBa pattern visible in the new
  test coverage.

Original commit message follows:

This implements a RISC-V specific version of the SHL_ADD node proposed in
#88791.

If that lands, the infrastructure from this patch should seamlessly
switch over to the generic DAG node. I'm posting this separately because
I've run out of useful multiply strength reduction work to do without
having a way to represent MUL X, 3/5/9 as a single instruction.

The majority of this change is moving two sets of patterns out of
tablgen and into the post-legalize combine. The major reason for this is
that I have an upcoming change which needs to reuse the expansion logic,
but it also helps common up some code between zba and the THeadBa
variants.

On the test changes, there are a couple of major categories:
* We chose a different lowering for mul x, 25. The new lowering involves
  one fewer register and the same critical path, so this seems like a win.
* The order of the two multiplies changes in (3,5,9)*(3,5,9) in some
  cases. I don't believe this matters.
* I'm removing the one use restriction on the multiply. This restriction
  doesn't really make sense to me, and the test changes appear positive.
preames added a commit to preames/llvm-project that referenced this pull request Apr 23, 2024
Doing so avoids negative interactions with other combines which don't
know the shl_add is a single instruction.  From the commit log, we've had
several combine loops already.

This was originally posted as part of llvm#88791, where a bug was pointed out.
That bug was fixed by llvm#89789 which hits the same issue from another angle.
To confirm the fix, I included the reduced test case here.
preames added a commit that referenced this pull request May 13, 2024
Doing so avoids negative interactions with other combines which don't
know the shl_add is a single instruction. From the commit log, we've had
several combine loops already.

This was originally posted as part of #88791, where a bug was pointed
out. That bug was fixed by #89789 which hits the same issue from another
angle. To confirm the fix, I included the reduced test case here.