
[WIP][DAG] Introduce generic shl_add node [NFC] #88791

Open · preames wants to merge 2 commits into main
Conversation


@preames preames commented Apr 15, 2024

Please start with (#89263)

This patch introduces a generic SHL_ADD node in SelectionDAG. This node has the semantics of (add (shl A, ConstantB), C) but performed as a single instruction. This corresponds to instructions on multiple architectures (a construction sketch follows the list):

  • LEA on X86
  • sh1add, sh2add, and sh3add on RISC-V
  • adds on AArch64
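
For illustration, forming x*9 with this node would look roughly like the following sketch (the MVT::i8 shift-amount constant mirrors what the X86 changes in this patch use; other targets would use their usual shift-amount type):

  // x * 9 == (x << 3) + x, expressed as a single SHL_ADD node.
  SDValue ShAmt = DAG.getConstant(3, DL, MVT::i8);
  SDValue Mul9 = DAG.getNode(ISD::SHL_ADD, DL, VT, X, ShAmt, X);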

The initial use case is during expansion of scalar multiplies. The X86 backend has long had the MUL_IMM node to prevent combine loops. As actually used, MUL_IMM only supported the immediates 3, 5, and 9. Not coincidentally, these are exactly the multiplies that can be done via a single LEA.
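
Concretely, the identities behind those immediates are:

  x * 3 == (x << 1) + x   // LEA with scale 2 / sh1add
  x * 5 == (x << 2) + x   // LEA with scale 4 / sh2add
  x * 9 == (x << 3) + x   // LEA with scale 8 / sh3add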

I'm in the process of extending our scalar multiplication code on RISC-V and am going to need a similar construct. Rather than reinventing the wheel, it seems better to introduce a common representation.

In the current form, I am not proposing legalization support or any generic combines to form shl_add. As currently structured, shl_add is only created post-legalization and always corresponds to a legal instruction. (This matches what x86 had before.) Note that the range of valid constant shift amounts differs on each architecture.

I am hoping that I'll be able to restructure some of the x86 code into generic combine infrastructure to share mul expansion, but that is an unproven theory at the moment. Longer term, we may also be able to form shl_add earlier when legal, but that has the challenge of potentially interfering with other desirable transforms.

In the current patch, I am not proposing any flags for shl_add as the current use case doesn't require them, but in principle, I see nothing which prevents us from later supporting nuw, nsw, and disjoint with the obvious meanings.
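
For reference, the "obvious meanings" I have in mind (none of this is in the current patch):

  // nuw/nsw:   neither the implied shl nor the add wraps (unsigned/signed).
  // disjoint:  (shl A, ConstantB) and C share no set bits, so the add
  //            could equally well be emitted as an or.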

llvmbot commented Apr 15, 2024

@llvm/pr-subscribers-backend-risc-v

@llvm/pr-subscribers-backend-x86

Author: Philip Reames (preames)


Full diff: https://github.com/llvm/llvm-project/pull/88791.diff

10 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/ISDOpcodes.h (+7)
  • (modified) llvm/include/llvm/Target/TargetSelectionDAG.td (+5)
  • (modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp (+12)
  • (modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp (+1)
  • (modified) llvm/lib/Target/RISCV/RISCVISelLowering.cpp (+10-8)
  • (modified) llvm/lib/Target/RISCV/RISCVInstrInfoZb.td (+5)
  • (modified) llvm/lib/Target/X86/X86ISelDAGToDAG.cpp (+39-2)
  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+15-23)
  • (modified) llvm/lib/Target/X86/X86ISelLowering.h (-3)
  • (modified) llvm/lib/Target/X86/X86InstrFragments.td (+3-5)
diff --git a/llvm/include/llvm/CodeGen/ISDOpcodes.h b/llvm/include/llvm/CodeGen/ISDOpcodes.h
index 49d51a27e3c0f6..a45e5b26a58198 100644
--- a/llvm/include/llvm/CodeGen/ISDOpcodes.h
+++ b/llvm/include/llvm/CodeGen/ISDOpcodes.h
@@ -710,6 +710,13 @@ enum NodeType {
   FSHL,
   FSHR,
 
+  // Represents (ADD (SHL a, b), c) with the arguments appearing in the order
+  // a, b, c.  'b' must be a constant, and follows the rules for shift amount
+  // types described just above.  This is used soley post-legalization when
+  // lowering MUL to target specific instructions - e.g. LEA on x86 or
+  // sh1add/sh2add/sh3add on RISCV.
+  SHL_ADD,
+
   /// Byte Swap and Counting operators.
   BSWAP,
   CTTZ,
diff --git a/llvm/include/llvm/Target/TargetSelectionDAG.td b/llvm/include/llvm/Target/TargetSelectionDAG.td
index ea3520835fa07d..aeef292d34b85b 100644
--- a/llvm/include/llvm/Target/TargetSelectionDAG.td
+++ b/llvm/include/llvm/Target/TargetSelectionDAG.td
@@ -121,6 +121,10 @@ def SDTIntShiftOp : SDTypeProfile<1, 2, [   // shl, sra, srl
 def SDTIntShiftDOp: SDTypeProfile<1, 3, [   // fshl, fshr
   SDTCisSameAs<0, 1>, SDTCisSameAs<0, 2>, SDTCisInt<0>, SDTCisInt<3>
 ]>;
+def SDTIntShiftAddOp : SDTypeProfile<1, 3, [   // shl_add
+  SDTCisSameAs<0, 1>, SDTCisSameAs<0, 3>, SDTCisInt<0>, SDTCisInt<2>,
+  SDTCisInt<3>
+]>;
 def SDTIntSatNoShOp : SDTypeProfile<1, 2, [   // ssat with no shift
   SDTCisSameAs<0, 1>, SDTCisInt<2>
 ]>;
@@ -411,6 +415,7 @@ def rotl       : SDNode<"ISD::ROTL"      , SDTIntShiftOp>;
 def rotr       : SDNode<"ISD::ROTR"      , SDTIntShiftOp>;
 def fshl       : SDNode<"ISD::FSHL"      , SDTIntShiftDOp>;
 def fshr       : SDNode<"ISD::FSHR"      , SDTIntShiftDOp>;
+def shl_add    : SDNode<"ISD::SHL_ADD"   , SDTIntShiftAddOp>;
 def and        : SDNode<"ISD::AND"       , SDTIntBinOp,
                         [SDNPCommutative, SDNPAssociative]>;
 def or         : SDNode<"ISD::OR"        , SDTIntBinOp,
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
index ca0a95750ba8d8..59385c8bc27925 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
@@ -3521,6 +3521,13 @@ KnownBits SelectionDAG::computeKnownBits(SDValue Op, const APInt &DemandedElts,
     Known = KnownBits::ashr(Known, Known2, /*ShAmtNonZero=*/false,
                             Op->getFlags().hasExact());
     break;
+  case ISD::SHL_ADD:
+    Known = computeKnownBits(Op.getOperand(0), DemandedElts, Depth + 1);
+    Known2 = computeKnownBits(Op.getOperand(1), DemandedElts, Depth + 1);
+    Known = KnownBits::computeForAddSub(true, false, false,
+        KnownBits::shl(Known, Known2),
+        computeKnownBits(Op.getOperand(2), DemandedElts, Depth + 1));
+    break;
   case ISD::FSHL:
   case ISD::FSHR:
     if (ConstantSDNode *C = isConstOrConstSplat(Op.getOperand(2), DemandedElts)) {
@@ -7346,6 +7353,11 @@ SDValue SelectionDAG::getNode(unsigned Opcode, const SDLoc &DL, EVT VT,
     if (N1.getValueType() == VT)
       return N1;
     break;
+  case ISD::SHL_ADD:
+    assert(VT == N1.getValueType() && VT == N3.getValueType());
+    assert(TLI->isTypeLegal(VT) && "Created only post legalize");
+    assert(isa<ConstantSDNode>(N2) && "Constant shift expected");
+    break;
   }
 
   // Memoize node if it doesn't produce a glue result.
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp
index 6691aa41face39..cc9dafcfa0c728 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp
@@ -264,6 +264,7 @@ std::string SDNode::getOperationName(const SelectionDAG *G) const {
   case ISD::SRL:                        return "srl";
   case ISD::ROTL:                       return "rotl";
   case ISD::ROTR:                       return "rotr";
+  case ISD::SHL_ADD:                    return "shl_add";
   case ISD::FSHL:                       return "fshl";
   case ISD::FSHR:                       return "fshr";
   case ISD::FADD:                       return "fadd";
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 27387595164a46..cc64ccbedee92c 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -12789,10 +12789,9 @@ static SDValue transformAddShlImm(SDNode *N, SelectionDAG &DAG,
   SDLoc DL(N);
   SDValue NS = (C0 < C1) ? N0->getOperand(0) : N1->getOperand(0);
   SDValue NL = (C0 > C1) ? N0->getOperand(0) : N1->getOperand(0);
-  SDValue NA0 =
-      DAG.getNode(ISD::SHL, DL, VT, NL, DAG.getConstant(Diff, DL, VT));
-  SDValue NA1 = DAG.getNode(ISD::ADD, DL, VT, NA0, NS);
-  return DAG.getNode(ISD::SHL, DL, VT, NA1, DAG.getConstant(Bits, DL, VT));
+  SDValue SHADD =
+      DAG.getNode(ISD::SHL_ADD, DL, VT, NL, DAG.getConstant(Diff, DL, VT), NS);
+  return DAG.getNode(ISD::SHL, DL, VT, SHADD, DAG.getConstant(Bits, DL, VT));
 }
 
 // Combine a constant select operand into its use:
@@ -13028,14 +13027,17 @@ static SDValue combineAddOfBooleanXor(SDNode *N, SelectionDAG &DAG) {
                      N0.getOperand(0));
 }
 
-static SDValue performADDCombine(SDNode *N, SelectionDAG &DAG,
+static SDValue performADDCombine(SDNode *N,
+                                 TargetLowering::DAGCombinerInfo &DCI,
                                  const RISCVSubtarget &Subtarget) {
+  SelectionDAG &DAG = DCI.DAG;
   if (SDValue V = combineAddOfBooleanXor(N, DAG))
     return V;
   if (SDValue V = transformAddImmMulImm(N, DAG, Subtarget))
     return V;
-  if (SDValue V = transformAddShlImm(N, DAG, Subtarget))
-    return V;
+  if (!DCI.isBeforeLegalize())
+    if (SDValue V = transformAddShlImm(N, DAG, Subtarget))
+      return V;
   if (SDValue V = combineBinOpToReduce(N, DAG, Subtarget))
     return V;
   if (SDValue V = combineBinOpOfExtractToReduceTree(N, DAG, Subtarget))
@@ -15894,7 +15896,7 @@ SDValue RISCVTargetLowering::PerformDAGCombine(SDNode *N,
       return V;
     if (SDValue V = combineToVWMACC(N, DAG, Subtarget))
       return V;
-    return performADDCombine(N, DAG, Subtarget);
+    return performADDCombine(N, DCI, Subtarget);
   }
   case ISD::SUB: {
     if (SDValue V = combineBinOp_VLToVWBinOp_VL(N, DCI, Subtarget))
diff --git a/llvm/lib/Target/RISCV/RISCVInstrInfoZb.td b/llvm/lib/Target/RISCV/RISCVInstrInfoZb.td
index 434b071e628a0e..8837c66d603779 100644
--- a/llvm/lib/Target/RISCV/RISCVInstrInfoZb.td
+++ b/llvm/lib/Target/RISCV/RISCVInstrInfoZb.td
@@ -678,6 +678,8 @@ foreach i = {1,2,3} in {
   defvar shxadd = !cast<Instruction>("SH"#i#"ADD");
   def : Pat<(XLenVT (add_like_non_imm12 (shl GPR:$rs1, (XLenVT i)), GPR:$rs2)),
             (shxadd GPR:$rs1, GPR:$rs2)>;
+  def : Pat<(XLenVT (shl_add GPR:$rs1, (XLenVT i), GPR:$rs2)),
+            (shxadd GPR:$rs1, GPR:$rs2)>;
 
   defvar pat = !cast<ComplexPattern>("sh"#i#"add_op");
   // More complex cases use a ComplexPattern.
@@ -881,6 +883,9 @@ foreach i = {1,2,3} in {
   defvar shxadd = !cast<Instruction>("SH"#i#"ADD");
   def : Pat<(i32 (add_like_non_imm12 (shl GPR:$rs1, (i64 i)), GPR:$rs2)),
             (shxadd GPR:$rs1, GPR:$rs2)>;
+  def : Pat<(i32 (shl_add GPR:$rs1, (i32 i), GPR:$rs2)),
+            (shxadd GPR:$rs1, GPR:$rs2)>;
+
 }
 }
 
diff --git a/llvm/lib/Target/X86/X86ISelDAGToDAG.cpp b/llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
index 4e4241efd63d6b..50e271bde041ad 100644
--- a/llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
+++ b/llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
@@ -2519,7 +2519,6 @@ bool X86DAGToDAGISel::matchAddressRecursively(SDValue N, X86ISelAddressMode &AM,
     if (N.getResNo() != 0) break;
     [[fallthrough]];
   case ISD::MUL:
-  case X86ISD::MUL_IMM:
     // X*[3,5,9] -> X+X*[2,4,8]
     if (AM.BaseType == X86ISelAddressMode::RegBase &&
         AM.Base_Reg.getNode() == nullptr &&
@@ -2551,7 +2550,45 @@ bool X86DAGToDAGISel::matchAddressRecursively(SDValue N, X86ISelAddressMode &AM,
         }
     }
     break;
-
+  case ISD::SHL_ADD: {
+    // X << [1,2,3] + Y (we should never create anything else)
+    auto *CN = cast<ConstantSDNode>(N.getOperand(1));
+    assert(CN->getZExtValue() == 1 || CN->getZExtValue() == 2 ||
+           CN->getZExtValue() == 3);
+    if (AM.BaseType == X86ISelAddressMode::RegBase &&
+        AM.Base_Reg.getNode() == nullptr &&
+        AM.IndexReg.getNode() == nullptr) {
+      AM.Scale = unsigned(2 << (CN->getZExtValue() - 1));
+
+      if (N.getOperand(0) == N.getOperand(2)) {
+        SDValue MulVal = N.getOperand(0);
+        SDValue Reg;
+
+        // Okay, we know that we have a scale by now.  However, if the scaled
+        // value is an add of something and a constant, we can fold the
+        // constant into the disp field here.
+        if (MulVal.getNode()->getOpcode() == ISD::ADD &&
+            N->isOnlyUserOf(MulVal.getNode()) &&
+            isa<ConstantSDNode>(MulVal.getOperand(1))) {
+          Reg = MulVal.getOperand(0);
+          auto *AddVal = cast<ConstantSDNode>(MulVal.getOperand(1));
+          uint64_t Disp = AddVal->getSExtValue() * (AM.Scale + 1);
+          if (foldOffsetIntoAddress(Disp, AM))
+            Reg = N.getOperand(0);
+        } else {
+          Reg = N.getOperand(0);
+        }
+        AM.IndexReg = AM.Base_Reg = Reg;
+        return false;
+      }
+      // TODO: If N.getOperand(2) is a constant, we could try folding
+      // the displacement analogously to the above.
+      AM.IndexReg = N.getOperand(0);
+      AM.Base_Reg = N.getOperand(2);
+      return false;
+    }
+    break;
+  }
   case ISD::SUB: {
     // Given A-B, if A can be completely folded into the address and
     // the index field with the index field unused, use -B as the index.
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index f16a751a166d69..477c368c22e8ea 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -33553,7 +33553,6 @@ const char *X86TargetLowering::getTargetNodeName(unsigned Opcode) const {
   NODE_NAME_CASE(BZHI)
   NODE_NAME_CASE(PDEP)
   NODE_NAME_CASE(PEXT)
-  NODE_NAME_CASE(MUL_IMM)
   NODE_NAME_CASE(MOVMSK)
   NODE_NAME_CASE(PTEST)
   NODE_NAME_CASE(TESTP)
@@ -36845,13 +36844,6 @@ void X86TargetLowering::computeKnownBitsForTargetNode(const SDValue Op,
   Known.resetAll();
   switch (Opc) {
   default: break;
-  case X86ISD::MUL_IMM: {
-    KnownBits Known2;
-    Known = DAG.computeKnownBits(Op.getOperand(1), DemandedElts, Depth + 1);
-    Known2 = DAG.computeKnownBits(Op.getOperand(0), DemandedElts, Depth + 1);
-    Known = KnownBits::mul(Known, Known2);
-    break;
-  }
   case X86ISD::SETCC:
     Known.Zero.setBitsFrom(1);
     break;
@@ -46905,12 +46897,18 @@ static SDValue reduceVMULWidth(SDNode *N, SelectionDAG &DAG,
   return DAG.getNode(ISD::CONCAT_VECTORS, DL, VT, ResLo, ResHi);
 }
 
+static SDValue createMulImm(uint64_t MulAmt, SDValue N, SelectionDAG &DAG,
+                            EVT VT, const SDLoc &DL) {
+  assert(MulAmt == 3 || MulAmt == 5 || MulAmt == 9);
+  SDValue ShAmt = DAG.getConstant(Log2_64(MulAmt-1), DL, MVT::i8);
+  return DAG.getNode(ISD::SHL_ADD, DL, VT, N, ShAmt, N);
+}
+
 static SDValue combineMulSpecial(uint64_t MulAmt, SDNode *N, SelectionDAG &DAG,
                                  EVT VT, const SDLoc &DL) {
 
   auto combineMulShlAddOrSub = [&](int Mult, int Shift, bool isAdd) {
-    SDValue Result = DAG.getNode(X86ISD::MUL_IMM, DL, VT, N->getOperand(0),
-                                 DAG.getConstant(Mult, DL, VT));
+    SDValue Result = createMulImm(Mult, N->getOperand(0), DAG, VT, DL);
     Result = DAG.getNode(ISD::SHL, DL, VT, Result,
                          DAG.getConstant(Shift, DL, MVT::i8));
     Result = DAG.getNode(isAdd ? ISD::ADD : ISD::SUB, DL, VT, Result,
@@ -46919,10 +46917,8 @@ static SDValue combineMulSpecial(uint64_t MulAmt, SDNode *N, SelectionDAG &DAG,
   };
 
   auto combineMulMulAddOrSub = [&](int Mul1, int Mul2, bool isAdd) {
-    SDValue Result = DAG.getNode(X86ISD::MUL_IMM, DL, VT, N->getOperand(0),
-                                 DAG.getConstant(Mul1, DL, VT));
-    Result = DAG.getNode(X86ISD::MUL_IMM, DL, VT, Result,
-                         DAG.getConstant(Mul2, DL, VT));
+    SDValue Result = createMulImm(Mul1, N->getOperand(0), DAG, VT, DL);
+    Result = createMulImm(Mul2, Result, DAG, VT, DL);
     Result = DAG.getNode(isAdd ? ISD::ADD : ISD::SUB, DL, VT, Result,
                          N->getOperand(0));
     return Result;
@@ -46982,9 +46978,8 @@ static SDValue combineMulSpecial(uint64_t MulAmt, SDNode *N, SelectionDAG &DAG,
       unsigned ShiftAmt = Log2_64((MulAmt & (MulAmt - 1)));
       SDValue Shift1 = DAG.getNode(ISD::SHL, DL, VT, N->getOperand(0),
                                    DAG.getConstant(ShiftAmt, DL, MVT::i8));
-      SDValue Shift2 = DAG.getNode(ISD::SHL, DL, VT, N->getOperand(0),
-                                   DAG.getConstant(ScaleShift, DL, MVT::i8));
-      return DAG.getNode(ISD::ADD, DL, VT, Shift1, Shift2);
+      return DAG.getNode(ISD::SHL_ADD, DL, VT, N->getOperand(0),
+                         DAG.getConstant(ScaleShift, DL, MVT::i8), Shift1);
     }
   }
 
@@ -47204,8 +47199,7 @@ static SDValue combineMul(SDNode *N, SelectionDAG &DAG,
   SDValue NewMul = SDValue();
   if (VT == MVT::i64 || VT == MVT::i32) {
     if (AbsMulAmt == 3 || AbsMulAmt == 5 || AbsMulAmt == 9) {
-      NewMul = DAG.getNode(X86ISD::MUL_IMM, DL, VT, N->getOperand(0),
-                           DAG.getConstant(AbsMulAmt, DL, VT));
+      NewMul = createMulImm(AbsMulAmt, N->getOperand(0), DAG, VT, DL);
       if (SignMulAmt < 0)
         NewMul =
             DAG.getNode(ISD::SUB, DL, VT, DAG.getConstant(0, DL, VT), NewMul);
@@ -47243,15 +47237,13 @@ static SDValue combineMul(SDNode *N, SelectionDAG &DAG,
         NewMul = DAG.getNode(ISD::SHL, DL, VT, N->getOperand(0),
                              DAG.getConstant(Log2_64(MulAmt1), DL, MVT::i8));
       else
-        NewMul = DAG.getNode(X86ISD::MUL_IMM, DL, VT, N->getOperand(0),
-                             DAG.getConstant(MulAmt1, DL, VT));
+        NewMul = createMulImm(MulAmt1, N->getOperand(0), DAG, VT, DL);
 
       if (isPowerOf2_64(MulAmt2))
         NewMul = DAG.getNode(ISD::SHL, DL, VT, NewMul,
                              DAG.getConstant(Log2_64(MulAmt2), DL, MVT::i8));
       else
-        NewMul = DAG.getNode(X86ISD::MUL_IMM, DL, VT, NewMul,
-                             DAG.getConstant(MulAmt2, DL, VT));
+        NewMul = NewMul = createMulImm(MulAmt2, NewMul, DAG, VT, DL);
 
       // Negate the result.
       if (SignMulAmt < 0)
diff --git a/llvm/lib/Target/X86/X86ISelLowering.h b/llvm/lib/Target/X86/X86ISelLowering.h
index 0a1e8ca4427314..7c5bfac3308c8e 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.h
+++ b/llvm/lib/Target/X86/X86ISelLowering.h
@@ -417,9 +417,6 @@ namespace llvm {
     PDEP,
     PEXT,
 
-    // X86-specific multiply by immediate.
-    MUL_IMM,
-
     // Vector sign bit extraction.
     MOVMSK,
 
diff --git a/llvm/lib/Target/X86/X86InstrFragments.td b/llvm/lib/Target/X86/X86InstrFragments.td
index f14c7200af968a..faeeccab7dac72 100644
--- a/llvm/lib/Target/X86/X86InstrFragments.td
+++ b/llvm/lib/Target/X86/X86InstrFragments.td
@@ -284,8 +284,6 @@ def X86bzhi   : SDNode<"X86ISD::BZHI",   SDTIntBinOp>;
 def X86pdep   : SDNode<"X86ISD::PDEP",   SDTIntBinOp>;
 def X86pext   : SDNode<"X86ISD::PEXT",   SDTIntBinOp>;
 
-def X86mul_imm : SDNode<"X86ISD::MUL_IMM", SDTIntBinOp>;
-
 def X86DynAlloca : SDNode<"X86ISD::DYN_ALLOCA", SDT_X86DYN_ALLOCA,
                           [SDNPHasChain, SDNPOutGlue]>;
 
@@ -341,11 +339,11 @@ def X86cmpccxadd : SDNode<"X86ISD::CMPCCXADD", SDTX86Cmpccxadd,
 // Define X86-specific addressing mode.
 def addr      : ComplexPattern<iPTR, 5, "selectAddr", [], [SDNPWantParent]>;
 def lea32addr : ComplexPattern<i32, 5, "selectLEAAddr",
-                               [add, sub, mul, X86mul_imm, shl, or, xor, frameindex],
+                               [add, sub, mul, shl_add, shl, or, xor, frameindex],
                                []>;
 // In 64-bit mode 32-bit LEAs can use RIP-relative addressing.
 def lea64_32addr : ComplexPattern<i32, 5, "selectLEA64_32Addr",
-                                  [add, sub, mul, X86mul_imm, shl, or, xor,
+                                  [add, sub, mul, shl_add, shl, or, xor,
                                    frameindex, X86WrapperRIP],
                                   []>;
 
@@ -356,7 +354,7 @@ def tls32baseaddr : ComplexPattern<i32, 5, "selectTLSADDRAddr",
                                [tglobaltlsaddr], []>;
 
 def lea64addr : ComplexPattern<i64, 5, "selectLEAAddr",
-                        [add, sub, mul, X86mul_imm, shl, or, xor, frameindex,
+                        [add, sub, mul, shl_add, shl, or, xor, frameindex,
                          X86WrapperRIP], []>;
 
 def tls64addr : ComplexPattern<i64, 5, "selectTLSADDRAddr",


github-actions bot commented Apr 15, 2024

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff b7b183371b54a2c4b5d2a39c594d3967a7384cb8 33fd866e5fe4711a3dcd906b1f0d3449bf5e7d09 -- llvm/include/llvm/CodeGen/ISDOpcodes.h llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp llvm/lib/Target/RISCV/RISCVISelLowering.cpp llvm/lib/Target/X86/X86ISelDAGToDAG.cpp llvm/lib/Target/X86/X86ISelLowering.cpp llvm/lib/Target/X86/X86ISelLowering.h
View the diff from clang-format here.
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 1748b372e2..749d9f7378 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -36843,7 +36843,8 @@ void X86TargetLowering::computeKnownBitsForTargetNode(const SDValue Op,
 
   Known.resetAll();
   switch (Opc) {
-  default: break;
+  default:
+    break;
   case X86ISD::SETCC:
     Known.Zero.setBitsFrom(1);
     break;


arsenm commented Apr 15, 2024

AMDGPU has these too, sometimes


preames commented Apr 18, 2024

Ping


@RKSimon RKSimon left a comment


No objections from me. Needing MUL_IMM at all has always annoyed me; it's often needed to stop extra combines that would prevent us from matching LEA etc., but that often leads to more complex code than necessary.

There's the possibility we could use SHL_ADD for vector gather/scatter address math (vector index + offset, imm scalar shift amt), but I don't see anything here that would prevent that in the future?

Note - there are a few real clang-format warnings that need cleaning up.

preames added a commit to preames/llvm-project that referenced this pull request Apr 18, 2024
This implements a RISC-V specific version of the SHL_ADD node proposed in
llvm#88791.

If that lands, the infrastructure from this patch should seamlessly switch
over to the generic DAG node.  I'm posting this separately because I've run
out of useful multiply strength reduction work to do without having a way to
represent MUL X, 3/5/9 as a single instruction.

The majority of this change is moving two sets of patterns out of tablegen
and into the post-legalize combine.  The major reason for this is that I
have an upcoming change which needs to reuse the expansion logic, but it
also helps common up some code between zba and the THeadBa variants.

On the test changes, there are a couple of major categories:
* We chose a different lowering for mul x, 25.  The new lowering involves
  one fewer register and the same critical path, so this seems like a win.
* The order of the two multiplies changes in (3,5,9)*(3,5,9) in some cases.
  I don't believe this matters.
* I'm removing the one use restriction on the multiply.  This restriction
  doesn't really make sense to me, and the test changes appear positive.
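
As a sketch of the first point (hypothetical shape, not the exact test
output): 25 factors as 5*5, so the expansion can live in a single register:

  // 25*x == 5*(5*x), one sh2add (or LEA) per step:
  //   t = (x << 2) + x;   // 5*x
  //   r = (t << 2) + t;   // 25*x; t is dead here, so no extra register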

preames commented Apr 18, 2024

Note - there are a few real clang-format warnings that need cleaning up.

Real ones fixed, remaining failure appears to be a misconfiguration in the bot.

dtcxzyw added a commit to dtcxzyw/llvm-codegen-benchmark that referenced this pull request Apr 18, 2024
// a, b, c. 'b' must be a constant, and follows the rules for shift amount
// types described just above. This is used soley post-legalization when
// lowering MUL to target specific instructions - e.g. LEA on x86 or
// sh1add/sh2add/sh3add on RISCV.

Collaborator:
RISCV -> RISC-V

SDValue NA1 = DAG.getNode(ISD::ADD, DL, VT, NA0, NS);
return DAG.getNode(ISD::SHL, DL, VT, NA1, DAG.getConstant(Bits, DL, VT));
SDValue SHADD =
DAG.getNode(ISD::SHL_ADD, DL, VT, NL, DAG.getConstant(Diff, DL, VT), NS);

Collaborator:
Should we use TargetConstant if it's required to be a constant?

preames (Author):
Fixed over in (#89263).

EVT VT, const SDLoc &DL) {
assert(MulAmt == 3 || MulAmt == 5 || MulAmt == 9);
SDValue ShAmt = DAG.getConstant(Log2_64(MulAmt - 1), DL, MVT::i8);
return DAG.getNode(ISD::SHL_ADD, DL, VT, N, ShAmt, N);

Collaborator:
Is this going to cause an issue with poison? We've now increased the use count of N.

preames (Author):
That... is a good question. We probably need to freeze here since we're increasing the number of uses; I had not considered that. Let me add the freeze and see if that influences codegen in practice. If it does, we may need to consider both a SHL_ADD node and a MUL359 node. I'm hoping we don't; let me investigate and report back.
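
To spell out the concern (an illustrative sketch, not code from this patch):

  // If x is undef, each use may independently take any value:
  //   mul x, 3         // every concrete choice a yields 3*a, a multiple of 3
  //   (x << 1) + x     // may evaluate as (a << 1) + b, i.e. 2*a + b
  // freeze(x) pins a single value for both uses, restoring the equivalence.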

preames (Author):
Well, the news here is not good. Adding in Freeze in the x86 backend code causes a whole bunch of regressions that were not obvious on first glance. Interestingly, incorporating the same logic into the RISC-V specific version of this patch (#89263) doesn't seem to expose the same kind of problems - most likely because the usage is much more isolated. #89290 fixes an analogous freeze issue in code already landed, again with no visible code diff.

I think what I'd like to suggest here is that we go ahead and focus review on #89263. Once we land that, I can iterate in tree on the RISC-V specific parts, and then rebase this patch on a fully fleshed through implementation and focus it on the x86 merge. (I clearly need to track something down there.)

(For the record, the issue @dtcxzyw flagged in the RISC-V specific part of this patch doesn't exist in #89263, as I focused on a different subset there. That's probably confusing for reviewers in retrospect, sorry!)

preames (Author):
I investigated these differences further. Net result is one fairly obvious missed optimization, one somewhat complicated but reasonable issue with COPY elimination, and one fundamental issue. I'm going to focus on only the last.

We end up with a situation where an inserted freeze gets hoisted through a chain of computation. This is all correct and fine, but as a side effect of that hoisting, we strip nsw off an add. The net result is that we can't prove a narrow addressing sequence is equivalent to the wider form, and thus fail to be able to fold a constant base offset into the addressing mode.
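
A hypothetical sketch of that interaction (not the actual reproducer):

  %a = add nsw i64 %p, 16
  %f = freeze i64 %a
  ; hoisting the freeze through the add yields:
  %p.fr = freeze i64 %p
  %a2   = add i64 %p.fr, 16   ; nsw dropped, since add nsw can create poison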

I'm a bit stuck on what to do about this case, and need to give this more thought.


dtcxzyw commented Apr 18, 2024

This patch is not an NFC.

Example: https://godbolt.org/z/PMTqjf45M

; llc -mtriple=riscv64 -mattr=+zba test.ll
define i64 @func(i64 signext %0, i32 signext %1) {
entry:
  %2 = zext i32 %1 to i64
  %3 = shl nuw nsw i64 %2, 5
  %4 = shl nsw i64 %0, 3
  %5 = add nsw i64 %3, %4
  ret i64 %5
}

Before:

func:                                   # @func
        sh2add.uw       a0, a1, a0
        slli    a0, a0, 3
        ret

After:

func:                                   # @func
        zext.w  a1, a1
        sh2add  a0, a1, a0
        slli    a0, a0, 3
        ret
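
For reference, the Zba semantics at play here:

  sh2add    rd, rs1, rs2   =>   rd = (rs1 << 2) + rs2
  sh2add.uw rd, rs1, rs2   =>   rd = (zext32(rs1) << 2) + rs2

The .uw form folds the zero-extend into the shift-add; without a shl_add pattern covering it, the zext.w above is emitted separately.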


preames commented Apr 18, 2024

This patch is not an NFC.

Example: https://godbolt.org/z/PMTqjf45M

Looks like maybe I missed a pattern for the .uw variant, will go track this down.

Can I ask how you're finding these? This is the second time you've chimed in with a really useful counterexample on these threads. Are you fuzzing the changes? Much appreciated.

preames added a commit to preames/llvm-project that referenced this pull request Apr 18, 2024
topperc pointed this out in review of llvm#88791, but I believe the problem applies here as well.

Worth noting is that the code I introduced with this bug was mostly
copied from other targets - which also have this bug.

dtcxzyw commented Apr 18, 2024

Can I ask how you're finding these? This is the second time you've chimed in with a really useful counter example on these threads.

These cases are sampled from my opt-benchmark. It is built from some popular real-world libraries/applications. BTW, it has been used to assist in the review of InstCombine-related patches.

Are you fuzzing the changes?

I don't see any value in the fuzzer-generated cases (at least in terms of performance). See also https://llvm.org/docs/InstCombineContributorGuide.html#real-world-usefulness.

@preames preames changed the title [DAG] Introduce generic shl_add node [NFC] [WIP][DAG] Introduce generic shl_add node [NFC] Apr 18, 2024

preames commented Apr 18, 2024

This patch is not an NFC.
Example: https://godbolt.org/z/PMTqjf45M

Looks like maybe I missed a pattern for the .uw variant, will go track this down.

I've confirmed this was a missing pattern, and I have a local patch with a fix, separated out and based on the RISCVISD::SHL_ADD change. I am not going to post it until the RISCVISD::SHL_ADD patch lands, to avoid confusing us all.

preames added a commit that referenced this pull request Apr 22, 2024
…#89290)

topperc pointed this out in review of #88791, but I believe the problem
applies here as well. Worth noting is that the code I introduced with this
bug was mostly copied from other targets - which also have this bug.
preames added a commit that referenced this pull request Apr 22, 2024
…89263)

This implements a RISC-V specific version of the SHL_ADD node proposed in
#88791.

If that lands, the infrastructure from this patch should seamlessly
switch over to the generic DAG node. I'm posting this separately because
I've run out of useful multiply strength reduction work to do without
having a way to represent MUL X, 3/5/9 as a single instruction.

The majority of this change is moving two sets of patterns out of
tablgen and into the post-legalize combine. The major reason for this is
that I have an upcoming change which needs to reuse the expansion logic,
but it also helps common up some code between zba and the THeadBa
variants.

On the test changes, there are a couple of major categories:
* We chose a different lowering for mul x, 25. The new lowering involves
one fewer register and the same critical path, so this seems like a win.
* The order of the two multiplies changes in (3,5,9)*(3,5,9) in some
cases. I don't believe this matters.
* I'm removing the one use restriction on the multiply. This restriction
doesn't really make sense to me, and the test changes appear positive.
preames added a commit that referenced this pull request Apr 23, 2024
…ombine (#89263)"

Changes since original commit:
* Rebase over improved test coverage for theadba
* Revert change to use TargetConstant as it appears to prevent the uimm2
  clause from matching in the XTheadBa patterns.
* Fix an order of operands bug in the THeadBa pattern visible in the new
  test coverage.

Original commit message follows:

This implements a RISC-V specific version of the SHL_ADD node proposed in
#88791.

If that lands, the infrastructure from this patch should seamlessly
switch over to the generic DAG node. I'm posting this separately because
I've run out of useful multiply strength reduction work to do without
having a way to represent MUL X, 3/5/9 as a single instruction.

The majority of this change is moving two sets of patterns out of
tablgen and into the post-legalize combine. The major reason for this is
that I have an upcoming change which needs to reuse the expansion logic,
but it also helps common up some code between zba and the THeadBa
variants.

On the test changes, there are a couple of major categories:
* We chose a different lowering for mul x, 25. The new lowering involves
  one fewer register and the same critical path, so this seems like a win.
* The order of the two multiplies changes in (3,5,9)*(3,5,9) in some
  cases. I don't believe this matters.
* I'm removing the one use restriction on the multiply. This restriction
  doesn't really make sense to me, and the test changes appear positive.
preames added a commit to preames/llvm-project that referenced this pull request Apr 23, 2024
Doing so avoids negative interactions with other combines which don't
know the shl_add is a single instruction.  From the commit log, we've had
several combine loops already.

This was originally posted as part of llvm#88791, where a bug was pointed out.
That bug was fixed by llvm#89789 which hits the same issue from another angle.
To confirm the fix, I included the reduced test case here.
preames added a commit that referenced this pull request May 13, 2024
Doing so avoids negative interactions with other combines which don't
know the shl_add is a single instruction. From the commit log, we've had
several combine loops already.

This was originally posted as part of #88791, where a bug was pointed
out. That bug was fixed by #89789 which hits the same issue from another
angle. To confirm the fix, I included the reduced test case here.