Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[LoopIdiomRecognizer] Implement CRC recognition #79295

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

joe-img
Copy link

@joe-img joe-img commented Jan 24, 2024

Recognizes CRC byte loops and replaces them with a table lookup.

Current limitations:

  • Only works on byte loops
    CRC size can be any, but the data is limited to one byte. i.e. a loop with iteration count 8.
  • Only works on single-block loops
    most CRC loops would have been flattened to one block with select instructions this far into the pipeline.

Both limitations were in effort to reduce complexity, especially for a first patch. The code can be fairly easily extended to overcome these limitations.

Implementation details:

  1. Check if the loop looks like CRC and extract some useful information
  2. Execute one iteration of the instruction of the loop to see what happens to our output value
  3. Ensure the output value is predicated on the value of the M/LSB of our input
  4. Construct an expected output value of one iteration of CRC using the extracted information from step one and compare
  5. Construct a lookup table and replace the output value with a lookup

Copy link

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be
notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write
permissions for the repository. In which case you can instead tag reviewers by
name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review
by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate
is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Copy link
Collaborator

llvmbot commented Jan 24, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Joe Faulls (joe-img)

Changes

Recognizes CRC byte loops and replaces them with a table lookup.

Current limitations:

  • Only works on byte loops
    CRC size can be any, but the data is limited to one byte. i.e. a loop with iteration count 8.
  • Only works on single-block loops
    most CRC loops would have been flattened to one block with select instructions this far into the pipeline.

Both limitations were in effort to reduce complexity, especially for a first patch. The code can be fairly easily extended to overcome these limitations.

Implementation details:

  1. Check if the loop looks like CRC and extract some useful information
  2. Execute one iteration of the instruction of the loop to see what happens to our output value
  3. Ensure the output value is predicated on the value of the M/LSB of our input
  4. Construct an expected output value of one iteration of CRC using the extracted information from step one and compare
  5. Construct a lookup table and replace the output value with a lookup

Patch is 46.11 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/79295.diff

3 Files Affected:

  • (modified) llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp (+910-9)
  • (added) llvm/test/Transforms/LoopIdiom/crc/crc.ll (+195)
  • (added) llvm/test/Transforms/LoopIdiom/crc/not-crc.ll (+113)
diff --git a/llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp b/llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
index 3721564890ddb4..f20947daaed8d5 100644
--- a/llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
+++ b/llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
@@ -90,6 +90,8 @@
 #include <algorithm>
 #include <cassert>
 #include <cstdint>
+#include <map>
+#include <sstream>
 #include <utility>
 #include <vector>
 
@@ -135,6 +137,9 @@ static cl::opt<bool> UseLIRCodeSizeHeurs(
              "with -Os/-Oz"),
     cl::init(true), cl::Hidden);
 
+static cl::opt<bool> CRCRecognize("recognize-crc", cl::desc("CRC RECOGNIZE"),
+                                  cl::init(false), cl::Hidden);
+
 namespace {
 
 class LoopIdiomRecognize {
@@ -186,6 +191,15 @@ class LoopIdiomRecognize {
             // handling.
   };
 
+  struct CRCInfo {
+    Value *CRCInput;
+    Value *CRCOutput;
+    Value *DataInput;
+    uint64_t Width;
+    uint64_t Polynomial;
+    bool BitReversed;
+  };
+
   /// \name Countable Loop Idiom Handling
   /// @{
 
@@ -242,6 +256,9 @@ class LoopIdiomRecognize {
 
   bool recognizeShiftUntilBitTest();
   bool recognizeShiftUntilZero();
+  std::optional<CRCInfo> looksLikeCRC(const SCEV *BECount);
+  bool recognizeCRC(const SCEV *BECount);
+  void writeTableBasedCRCOneByte(CRCInfo &CRC);
 
   /// @}
 };
@@ -298,13 +315,8 @@ bool LoopIdiomRecognize::runOnLoop(Loop *L) {
   ApplyCodeSizeHeuristics =
       L->getHeader()->getParent()->hasOptSize() && UseLIRCodeSizeHeurs;
 
-  HasMemset = TLI->has(LibFunc_memset);
-  HasMemsetPattern = TLI->has(LibFunc_memset_pattern16);
-  HasMemcpy = TLI->has(LibFunc_memcpy);
-
-  if (HasMemset || HasMemsetPattern || HasMemcpy)
-    if (SE->hasLoopInvariantBackedgeTakenCount(L))
-      return runOnCountableLoop();
+  if (SE->hasLoopInvariantBackedgeTakenCount(L))
+    return runOnCountableLoop();
 
   return runOnNoncountableLoop();
 }
@@ -329,6 +341,17 @@ bool LoopIdiomRecognize::runOnCountableLoop() {
                     << "] Countable Loop %" << CurLoop->getHeader()->getName()
                     << "\n");
 
+  bool MadeChange = false;
+  if (CRCRecognize)
+    MadeChange |= recognizeCRC(BECount);
+
+  HasMemset = TLI->has(LibFunc_memset);
+  HasMemsetPattern = TLI->has(LibFunc_memset_pattern16);
+  HasMemcpy = TLI->has(LibFunc_memcpy);
+
+  if (!(HasMemset || HasMemsetPattern || HasMemcpy))
+    return MadeChange;
+
   // The following transforms hoist stores/memsets into the loop pre-header.
   // Give up if the loop has instructions that may throw.
   SimpleLoopSafetyInfo SafetyInfo;
@@ -336,8 +359,6 @@ bool LoopIdiomRecognize::runOnCountableLoop() {
   if (SafetyInfo.anyBlockMayThrow())
     return false;
 
-  bool MadeChange = false;
-
   // Scan all the blocks in the loop that are not in subloops.
   for (auto *BB : CurLoop->getBlocks()) {
     // Ignore blocks in subloops.
@@ -2868,3 +2889,883 @@ bool LoopIdiomRecognize::recognizeShiftUntilZero() {
   ++NumShiftUntilZero;
   return MadeChange;
 }
+
+static uint64_t reverseBits(uint64_t Num, unsigned NumBits) {
+  uint64_t Reversed = 0;
+  for (unsigned i = 1; i <= NumBits; i++) {
+    Reversed |= (Num & 1) << (NumBits - i);
+    Num >>= 1;
+  }
+  return Reversed;
+}
+
+class ValueBits {
+  // This is a representation of a value's bits in terms of references to
+  // other values' bits, or 1/0 if the bit is known. This allows symbolic
+  // execution of bitwise instructions without knowing the exact values.
+  //
+  // Example:
+  //
+  // LLVM IR Value i8 %x:
+  // [%x[7], %x[6], %x[5], %x[4], %x[3], %x[2], %x[1], %x[0]]
+  //
+  // %shr = lshr i8 %x, 2
+  // [ 0, 0, %x[7], %x[6], %x[5], %x[4], %x[3], %x[2]]
+  //
+  // %shl = shl i8 %shr, 1
+  // [ 0, %x[7], %x[6], %x[5], %x[4], %x[3], %x[2], 0]
+  //
+  // %xor = xor i8 %shl, 0xb
+  // [ 0, %x[7], %x[6], %x[5], %x[4]^1, %x[3], %x[2]^1, 1]
+public:
+  class ValueBit {
+  public:
+    enum BitType { ONE, ZERO, REF, XOR };
+
+  private:
+    BitType _Type;
+    std::pair<Value *, uint64_t> _BitRef;
+    ValueBit *_LHS;
+    ValueBit *_RHS;
+
+    ValueBit(BitType Type) : _Type(Type) {}
+    ValueBit(BitType Type, std::pair<Value *, uint64_t> BitRef)
+        : _Type(Type), _BitRef(BitRef) {}
+    ValueBit(BitType Type, ValueBit *LHS, ValueBit *RHS)
+        : _Type(Type), _LHS(LHS), _RHS(RHS) {}
+
+  public:
+    static ValueBit *CreateOneBit() { return new ValueBit(BitType::ONE); }
+    static ValueBit *CreateZeroBit() { return new ValueBit(BitType::ZERO); }
+    static ValueBit *CreateRefBit(Value *Ref, uint64_t Offset) {
+      return new ValueBit(BitType::REF, std::make_pair(Ref, Offset));
+    }
+    static ValueBit *CreateXORBit(ValueBit *LHS, ValueBit *RHS) {
+      return new ValueBit(BitType::XOR, LHS, RHS);
+    }
+    inline BitType getType() { return _Type; }
+    bool equals(ValueBit *RHS) {
+      if (_Type != RHS->getType())
+        return false;
+      switch (_Type) {
+      case BitType::ONE:
+      case BitType::ZERO:
+        return true;
+      case BitType::REF:
+        return _BitRef == RHS->_BitRef;
+      case BitType::XOR:
+        return (_LHS->equals(RHS->_LHS) && _RHS->equals(RHS->_RHS)) ||
+               (_LHS->equals(RHS->_RHS) && _RHS->equals(RHS->_LHS));
+      }
+      return false;
+    }
+
+    void print(raw_ostream &OS) {
+      switch (_Type) {
+      case BitType::ONE:
+        OS << "1";
+        break;
+      case BitType::ZERO:
+        OS << "0";
+        break;
+      case BitType::REF:
+        OS << _BitRef.first->getNameOrAsOperand() << "[" << _BitRef.second
+           << "]";
+        break;
+      case BitType::XOR:
+        _LHS->print(OS);
+        OS << "^";
+        _RHS->print(OS);
+        break;
+      }
+    }
+  };
+
+private:
+  uint64_t Size;
+  std::vector<ValueBit *> Bits;
+
+  virtual void _Shl(uint64_t N) {
+    for (; N > 0; N--) {
+      Bits.insert(Bits.begin(), ValueBit::CreateZeroBit());
+      Bits.pop_back();
+    }
+  }
+  virtual void _LShr(uint64_t N) {
+    for (; N > 0; N--) {
+      Bits.insert(Bits.end(), ValueBit::CreateZeroBit());
+      Bits.erase(Bits.begin());
+    }
+  }
+  virtual void _Xor(ValueBits *RHS) {
+    assert(Size == RHS->getSize());
+    for (unsigned I = 0; I < Size; I++) {
+      auto It = Bits.begin() + I;
+      ValueBit *RHSBit = RHS->getBit(I);
+      if (RHSBit->getType() == ValueBit::BitType::ONE) {
+        Bits.erase(It);
+        if ((*It)->getType() == ValueBit::BitType::ZERO) {
+          Bits.insert(It, ValueBit::CreateOneBit());
+        } else if ((*It)->getType() == ValueBit::BitType::ONE) {
+          Bits.insert(It, ValueBit::CreateZeroBit());
+        } else {
+          ValueBit *One = ValueBit::CreateOneBit();
+          Bits.insert(It, ValueBit::CreateXORBit(*It, One));
+        }
+      } else if (RHSBit->getType() != ValueBit::BitType::ZERO) {
+        if ((*It)->getType() == ValueBit::BitType::ZERO) {
+          Bits.erase(It);
+          ValueBit *BitRef = new ValueBit(*RHSBit);
+          Bits.insert(It, BitRef);
+        } else {
+          ValueBit *ItVB = *It;
+          Bits.erase(It);
+          Bits.insert(It, ValueBit::CreateXORBit(ItVB, RHSBit));
+        }
+      }
+    }
+  }
+  virtual void _ZExt(uint64_t ToSize) {
+    assert(ToSize > Size);
+    for (uint64_t I = 0; I < ToSize - Size; I++)
+      Bits.push_back(ValueBit::CreateZeroBit());
+    Size = ToSize;
+  }
+  virtual void _Trunc(uint64_t ToSize) {
+    assert(ToSize < Size);
+    Bits.erase(Bits.begin() + ToSize, Bits.end());
+    Size = ToSize;
+  }
+  virtual void _And(uint64_t RHS) {
+    for (unsigned I = 0; I < Size; I++) {
+      if (!(RHS & 1)) {
+        auto It = Bits.begin() + I;
+        Bits.erase(It);
+        Bits.insert(It, ValueBit::CreateZeroBit());
+      }
+      RHS >>= 1;
+    }
+  }
+
+protected:
+  ValueBits() {}
+
+public:
+  ValueBits(Value *InitialVal, uint64_t BitLength) : Size(BitLength) {
+    for (unsigned i = 0; i < BitLength; i++)
+      Bits.push_back(ValueBit::CreateRefBit(InitialVal, i));
+  }
+  ValueBits(uint64_t InitialVal, uint64_t BitLength) : Size(BitLength) {
+    for (unsigned i = 0; i < BitLength; i++) {
+      if (InitialVal & 0x1)
+        Bits.push_back(ValueBit::CreateOneBit());
+      else
+        Bits.push_back(ValueBit::CreateZeroBit());
+      InitialVal >>= 1;
+    }
+  }
+  uint64_t getSize() { return Size; }
+  ValueBit *getBit(unsigned i) { return Bits[i]; }
+
+  virtual ValueBits *copyBits() { return new ValueBits(*this); }
+
+  static ValueBits *Shl(ValueBits *LHS, uint64_t N) {
+    ValueBits *Shifted = LHS->copyBits();
+    Shifted->_Shl(N);
+    return Shifted;
+  }
+  static ValueBits *LShr(ValueBits *LHS, uint64_t N) {
+    ValueBits *Shifted = LHS->copyBits();
+    Shifted->_LShr(N);
+    return Shifted;
+  }
+  static ValueBits *Xor(ValueBits *LHS, ValueBits *RHS) {
+    ValueBits *Xord = LHS->copyBits();
+    Xord->_Xor(RHS);
+    return Xord;
+  }
+  static ValueBits *ZExt(ValueBits *LHS, uint64_t ToSize) {
+    ValueBits *Zexted = LHS->copyBits();
+    Zexted->_ZExt(ToSize);
+    return Zexted;
+  }
+  static ValueBits *Trunc(ValueBits *LHS, uint64_t N) {
+    ValueBits *Trunced = LHS->copyBits();
+    Trunced->_Trunc(N);
+    return Trunced;
+  }
+  static ValueBits *And(ValueBits *LHS, uint64_t RHS) {
+    ValueBits *Anded = LHS->copyBits();
+    Anded->_And(RHS);
+    return Anded;
+  }
+
+  virtual bool isPredicated() { return false; }
+
+  virtual bool equals(ValueBits *RHS) {
+    if (Size != RHS->getSize())
+      return false;
+
+    for (unsigned I = 0; I < Size; I++)
+      if (!getBit(I)->equals(RHS->getBit(I)))
+        return false;
+
+    return true;
+  }
+
+  virtual void print(raw_ostream &OS) {
+    assert(Size != 0);
+    OS << "[";
+    Bits[Size - 1]->print(OS);
+    for (int i = Size - 2; i >= 0; i--) {
+      OS << " | ";
+      Bits[i]->print(OS);
+    }
+    OS << "]\n";
+  }
+};
+
+inline raw_ostream &operator<<(raw_ostream &OS, ValueBits &VBS) {
+  VBS.print(OS);
+  return OS;
+}
+
+inline raw_ostream &operator<<(raw_ostream &OS, ValueBits::ValueBit &VB) {
+  VB.print(OS);
+  return OS;
+}
+class PredicatedValueBits : public ValueBits {
+  // This would be representitive of select or phi instructions where the bits
+  // would depend on an icmp.
+private:
+  ICmpInst *_Predicate;
+  ValueBits *_IfTrue;
+  ValueBits *_IfFalse;
+
+  void _Shl(uint64_t N) override {
+    _IfTrue = ValueBits::Shl(_IfTrue, N);
+    _IfFalse = ValueBits::Shl(_IfFalse, N);
+  }
+  void _LShr(uint64_t N) override {
+    _IfTrue = ValueBits::LShr(_IfTrue, N);
+    _IfFalse = ValueBits::LShr(_IfFalse, N);
+  }
+  void _ZExt(uint64_t N) override {
+    _IfTrue = ValueBits::ZExt(_IfTrue, N);
+    _IfFalse = ValueBits::ZExt(_IfFalse, N);
+  }
+  void _And(uint64_t N) override {
+    _IfTrue = ValueBits::And(_IfTrue, N);
+    _IfFalse = ValueBits::And(_IfFalse, N);
+  }
+  void _Xor(ValueBits *RHS) override {
+    _IfTrue = ValueBits::Xor(_IfTrue, RHS);
+    _IfFalse = ValueBits::Xor(_IfFalse, RHS);
+  }
+  void _Trunc(uint64_t N) override {
+    _IfTrue = ValueBits::Trunc(_IfTrue, N);
+    _IfFalse = ValueBits::Trunc(_IfFalse, N);
+  }
+
+public:
+  PredicatedValueBits(ICmpInst *Predicate, ValueBits *IfTrue,
+                      ValueBits *IfFalse)
+      : _Predicate(Predicate), _IfTrue(IfTrue), _IfFalse(IfFalse) {}
+
+  ValueBits *copyBits() override { return new PredicatedValueBits(*this); }
+  bool isPredicated() override { return true; }
+  ValueBits *getIfTrue() { return _IfTrue; }
+  ValueBits *getIfFalse() { return _IfFalse; }
+  ICmpInst *getPredicate() { return _Predicate; }
+
+  virtual void print(raw_ostream &OS) override {
+    OS << "Predicate: " << *_Predicate << "\nIf True:\n"
+       << *_IfTrue << "If False:\n"
+       << *_IfFalse;
+  }
+};
+
+// Execute the instructions in a basic block whilst mapping out Values to
+// ValueBits
+static bool symbolicallyExecute(BasicBlock *BB,
+                                std::map<Value *, ValueBits *> &ValueMap) {
+
+  auto getConstantOperand = [](Instruction *I, uint8_t Operand) {
+    ConstantInt *CI = dyn_cast<ConstantInt>(I->getOperand(Operand));
+    if (!CI) {
+      LLVM_DEBUG(dbgs() << DEBUG_TYPE " CRCRegonize: Do not know how to"
+                        << " handle this operation with non-constant operand "
+                        << Operand << ":\n"
+                        << *I << "\n");
+    }
+    return CI;
+  };
+
+  auto getOrCreateValueBits = [&ValueMap](Value *Val) {
+    auto Result = ValueMap.find(Val);
+    ValueBits *LHSBits = nullptr;
+    if (Result == ValueMap.end()) {
+      ConstantInt *CI = dyn_cast<ConstantInt>(Val);
+      if (CI) {
+        LHSBits = new ValueBits(CI->getSExtValue(),
+                                Val->getType()->getScalarSizeInBits());
+      } else {
+        LHSBits = new ValueBits(Val, Val->getType()->getScalarSizeInBits());
+      }
+    } else
+      LHSBits = Result->second;
+    return LHSBits;
+  };
+
+  for (Instruction &I : *BB) {
+    uint64_t BitSize = I.getType()->getScalarSizeInBits();
+    switch (I.getOpcode()) {
+    case Instruction::PHI: {
+      PHINode *PHI = dyn_cast<PHINode>(&I);
+      const BasicBlock *IncomingBlock = nullptr;
+      for (const BasicBlock *Incoming : PHI->blocks()) {
+        if (Incoming != BB) {
+          if (IncomingBlock) {
+            LLVM_DEBUG(dbgs()
+                       << DEBUG_TYPE " CRCRegonize: Do not know how to"
+                       << " handle loop with multiple entries" << I << "\n");
+            return false;
+          }
+          IncomingBlock = Incoming;
+        }
+      }
+      assert(IncomingBlock);
+      ValueMap[&I] =
+          getOrCreateValueBits(PHI->getIncomingValueForBlock(IncomingBlock));
+    } break;
+    case Instruction::Shl: {
+      ConstantInt *CI = getConstantOperand(&I, 1);
+      if (!CI)
+        return false;
+      Value *LHSVal = I.getOperand(0);
+      ValueBits *LHSBits = getOrCreateValueBits(LHSVal);
+      ValueMap[&I] = ValueBits::Shl(LHSBits, CI->getSExtValue());
+    } break;
+    case Instruction::LShr: {
+      ConstantInt *CI = getConstantOperand(&I, 1);
+      if (!CI)
+        return false;
+      Value *LHSVal = I.getOperand(0);
+      ValueBits *LHSBits = getOrCreateValueBits(LHSVal);
+      ValueMap[&I] = ValueBits::LShr(LHSBits, CI->getSExtValue());
+    } break;
+    case Instruction::And: {
+      ConstantInt *CI = getConstantOperand(&I, 1);
+      if (!CI)
+        return false;
+      Value *LHSVal = I.getOperand(0);
+      ValueBits *LHSBits = getOrCreateValueBits(LHSVal);
+      ValueMap[&I] = ValueBits::And(LHSBits, CI->getSExtValue());
+    } break;
+    case Instruction::Xor: {
+      ValueBits *LHSBits = getOrCreateValueBits(I.getOperand(0));
+      ValueBits *RHSBits = getOrCreateValueBits(I.getOperand(1));
+      ValueMap[&I] = ValueBits::Xor(LHSBits, RHSBits);
+    } break;
+    case Instruction::ZExt: {
+      ValueBits *LHSBits = getOrCreateValueBits(I.getOperand(0));
+      ValueMap[&I] = ValueBits::ZExt(LHSBits, BitSize);
+    } break;
+    case Instruction::Trunc: {
+      ValueBits *LHSBits = getOrCreateValueBits(I.getOperand(0));
+      ValueMap[&I] = ValueBits::Trunc(LHSBits, BitSize);
+    } break;
+    case Instruction::Select: {
+      SelectInst *Select = cast<SelectInst>(&I);
+      ICmpInst *Cond = dyn_cast<ICmpInst>(Select->getCondition());
+      if (!Cond) {
+        LLVM_DEBUG(dbgs() << DEBUG_TYPE " CRCRegonize: Do not know how to"
+                          << " handle SelectInst with non-icmp condition: " << I
+                          << "\n");
+        return false;
+      }
+      ValueBits *IfTrue = getOrCreateValueBits(Select->getTrueValue());
+      ValueBits *IfFalse = getOrCreateValueBits(Select->getFalseValue());
+      ValueMap[&I] = new PredicatedValueBits(Cond, IfTrue, IfFalse);
+    } break;
+    default:
+      // If this instruction is not recognized, then just continue. This is
+      // okay because users of this will just reference it by value, which is
+      // conservative.
+      break;
+    }
+  }
+  return true;
+}
+
+void LoopIdiomRecognize::writeTableBasedCRCOneByte(CRCInfo &CRC) {
+  BasicBlock *ExitBB = CurLoop->getExitBlock();
+  IRBuilder<> Builder(ExitBB);
+  Builder.SetInsertPoint(ExitBB->getFirstNonPHI());
+  Type *CRCType = CRC.CRCInput->getType();
+  uint64_t CRCSize = CRCType->getScalarSizeInBits();
+
+  // Construct the CRC table
+  uint64_t CRCTable[256];
+  uint64_t Polynomial = CRC.Polynomial;
+  uint64_t SB = CRC.BitReversed ? 0x1 : (0x1 << (CRCSize - 1));
+  if (CRC.BitReversed)
+    Polynomial = reverseBits(Polynomial, CRCSize);
+  for (uint64_t Dividend = 0; Dividend < 256; Dividend++) {
+    uint64_t CurByte = Dividend;
+    if (!CRC.BitReversed)
+      CurByte <<= CRCSize - 8;
+    for (uint8_t Bit = 0; Bit < 8; Bit++) {
+      if ((CurByte & SB) != 0) {
+        CurByte = CRC.BitReversed ? CurByte >> 1 : CurByte << 1;
+        CurByte = CurByte ^ Polynomial;
+      } else {
+        CurByte = CRC.BitReversed ? CurByte >> 1 : CurByte << 1;
+      }
+    }
+    CRCTable[Dividend] = CurByte;
+  }
+  // To construct a global data array, we need the raw data in bytes.
+  // The calculated table array is an array of 64bit values because we can't
+  // dynamically type it, so we need to truncate the values to the crc size
+  // to avoid padded zeros. Do this by allocating a byte array (of slightly more
+  // than we need to account for overflow) and copying the 64bit values across
+  // aligned correctly
+  uint64_t CRCNumBytes = CRCSize / 8;
+  char *CRCTableData = (char *)malloc(CRCNumBytes * 260);
+  for (int I = 0; I < 256; I++) {
+    *((uint64_t *)(CRCTableData + I * CRCNumBytes)) = CRCTable[I];
+  }
+
+  // Construct and add the table as a global variable
+  ArrayType *TableType = ArrayType::get(CRCType, 256);
+  Constant *ConstantArr = ConstantDataArray::getRaw(
+      StringRef(CRCTableData, CRCNumBytes * 256), 256, CRCType);
+  std::stringstream TableNameSS;
+  TableNameSS << "crctable.i" << CRCSize << "." << CRC.Polynomial;
+  if (CRC.BitReversed)
+    TableNameSS << ".reversed";
+  GlobalVariable *CRCTableGlobal = new GlobalVariable(
+      TableType, true, GlobalVariable::LinkageTypes::PrivateLinkage,
+      ConstantArr, TableNameSS.str());
+  ExitBB->getModule()->insertGlobalVariable(CRCTableGlobal);
+  free(CRCTableData);
+
+  // Construct the IR to load from this table
+  Value *CRCOffset = CRC.CRCInput;
+  if (CRCSize > 8) {
+    // Get the next byte into position and truncate
+    if (!CRC.BitReversed)
+      CRCOffset = Builder.CreateLShr(CRCOffset, CRCSize - 8);
+    CRCOffset = Builder.CreateTrunc(CRCOffset, Builder.getInt8Ty());
+  }
+  if (CRC.DataInput) {
+    // Data size can be more than 8 due to extending
+    Value *Data = CRC.DataInput;
+    if (CRC.DataInput->getType()->getScalarSizeInBits() > 8) {
+      Data = Builder.CreateTrunc(Data, Builder.getInt8Ty());
+    }
+    // Xor the data, offset into the table and load
+    CRCOffset = Builder.CreateXor(CRCOffset, Data);
+  }
+
+  CRCOffset = Builder.CreateZExt(CRCOffset, Builder.getInt32Ty());
+  Value *Gep = Builder.CreateInBoundsGEP(CRCType, CRCTableGlobal, {CRCOffset});
+  Value *CRCRes = Builder.CreateLoad(CRCType, Gep);
+  if (CRCSize > 8) {
+    // Shift out SB used for division and Xor the rest of the crc back in
+    Value *RestOfCRC = CRC.CRCInput;
+    if (CRC.BitReversed)
+      RestOfCRC = Builder.CreateLShr(CRC.CRCInput, 8);
+    else
+      RestOfCRC = Builder.CreateShl(CRC.CRCInput, 8);
+    CRCRes = Builder.CreateXor(RestOfCRC, CRCRes);
+  }
+  for (PHINode &ExitPhi : CurLoop->getExitBlock()->phis()) {
+    if (ExitPhi.getNumIncomingValues() == 1 &&
+        ExitPhi.getIncomingValue(0) == CRC.CRCOutput)
+      ExitPhi.replaceAllUsesWith(CRCRes);
+  }
+}
+
+bool LoopIdiomRecognize::recognizeCRC(const SCEV *BECount) {
+  // Step one: Check if the loop looks like crc, and extract some useful
+  // information for us to check
+  std::optional<CRCInfo> MaybeCRC = looksLikeCRC(BECount);
+  if (!MaybeCRC)
+    return false;
+  CRCInfo CRC = *MaybeCRC;
+
+  uint64_t CRCSize = CRC.CRCInput->getType()->getScalarSizeInBits();
+  LLVM_DEBUG(dbgs() << DEBUG_TYPE " CRCRegonize: Found potential CRCLoop "
+                    << *CurLoop << "\n"
+                    << "Input CRC: " << *CRC.CRCInput << "\n"
+                    << "Output CRC: " << *CRC.CRCOutput << "\n"
+           ...
[truncated]

@joe-img
Copy link
Author

joe-img commented Jan 24, 2024

// Symbolically execute one iteration of the loop to populate a map of
// Value's to their ValueBits, aka a representation of their bits in terms of
// 1's, 0's and references to other values' bits. If these match pre-computed
// crc values, then we can say it's doing crc.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is executing one iteration enough?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Does it make sense?

return false;
}

// If there exists a Data input, ensure the check bit is crc^data.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment is too specific to coremark. Can you make it more general?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at many CRC implementations, I disagree that this is specific to CoreMark. The data is almost always xor'd into the crc. Half of these, coremark included, decide to xor inside the loop.

@mgudim
Copy link
Contributor

mgudim commented Feb 7, 2024

@joe-img I am going to review this properly, but now I am pretty busy with other work. Please ping me in a 2 weeks, if I haven't reviewed before, sorry for the long wait.

@joe-img
Copy link
Author

joe-img commented Feb 21, 2024

@joe-img I am going to review this properly, but now I am pretty busy with other work. Please ping me in a 2 weeks, if I haven't reviewed before, sorry for the long wait.

Ping : )

@mgudim
Copy link
Contributor

mgudim commented Feb 22, 2024

@joe-img I am going to review this properly, but now I am pretty busy with other work. Please ping me in a 2 weeks, if I haven't reviewed before, sorry for the long wait.

Ping : )

Thanks for the reminder. I'm in the process.

@@ -135,6 +137,9 @@ static cl::opt<bool> UseLIRCodeSizeHeurs(
"with -Os/-Oz"),
cl::init(true), cl::Hidden);

static cl::opt<bool> CRCRecognize("recognize-crc", cl::desc("CRC RECOGNIZE"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Description should not be in all caps

static ValueBit *CreateOneBit() { return new ValueBit(BitType::ONE); }
static ValueBit *CreateZeroBit() { return new ValueBit(BitType::ZERO); }
static ValueBit *CreateRefBit(Value *Ref, uint64_t Offset) {
return new ValueBit(BitType::REF, std::make_pair(Ref, Offset));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this memory ever freed? A quick search doesn't show any calls to delete or unique_ptr

if (Incoming != BB) {
if (IncomingBlock) {
LLVM_DEBUG(dbgs()
<< DEBUG_TYPE " CRCRegonize: Do not know how to"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recognize*. This spelling error is repeated in many messages.

// than we need to account for overflow) and copying the 64bit values across
// aligned correctly
uint64_t CRCNumBytes = CRCSize / 8;
char *CRCTableData = (char *)malloc(CRCNumBytes * 260);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use array new and delete.

And use uint8_t or int8_t instead of char. The signedness of char is platform dependent

CRCOffset = Builder.CreateXor(CRCOffset, Data);
}

CRCOffset = Builder.CreateZExt(CRCOffset, Builder.getInt32Ty());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should probably get the index type from DataLayout and not hardcode i32

// (MSB): crc < 0
// (MSB): crc <= -1
// And decide whether the check is checking for existence of 1 or 0
bool checkZero = false;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Capitalize variable name

return std::nullopt;

// Check iteration count is 8
const SCEV *TripCountSCEV =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's less code to use ScalarEvolution::getSmallConstantTripCount, but we already have the backedge count so I don't know what's better.

define dso_local zeroext i8 @crc8_loop(ptr noundef %data, i32 noundef %length) {
entry:
br label %for.cond

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please minimize the tests so that they only show the loop involved without all the extra code.

@mgudim
Copy link
Contributor

mgudim commented Feb 24, 2024

We should think through the high-level plan of how the implementation should work eventually for targets supporting this. At the same time, we can divide this into smaller pieces which has many advantages.

I can suggest the following, but I really would appreciate feedback / approval of others (@kparzysz):

(1) Create new (overloaded) intrinsics for performing a number of steps in polynomial division. Something like:
pdivstep(i8 dividend, i8 divisor, i32 nsteps). This intrinsic returns the result of performing nsteps steps in the polynomial divison of dividend by the divisor. Here it is assumed that the highest degree of polynomial corresponds to the MSB of the number representing it. Actual loop could be using reverse representation, but we can express it by combining with bitreverse intrinsics.

(2) Add functions to PreISelIntrinsicLowering to expand the pdivstep intrinsic. We can expand back into a loop (or unrolled loop) like expandMemCpyAsLoop. Or we can emit a table lookup (like in this MR), or we can expand using clmul intrinsics.
For testing purposes, we can introduce flags to always (if possible) do specified type of expansion. We can start with "expand into a loop" and a "table lookup". We can add expansion using clmuls later.

(3) Implement the recognition part (and generation of pdivstep intrinsic) in LoopIdiomRecognize. As in this MR, the generation of the intrinsic should be controlled by a flag. Also, the idiom should be "polynomial division", CRC is just one application of it. I don't know any other though :-) If the whole plan is approved, we can merge this part first.

(4) Recognize identities on pdiv intrinsics. I think in coremark, the 8-bit ones combine eventually into larger one. This should be after inlining, so probably we can do it in PreISelIntrinsicLowering?

(5) Create a way for targets to specify if pdiv intrinsics should be generated. If yes, how should they be expanded (during PreISelLowering)?

@joe-img
Copy link
Author

joe-img commented Feb 29, 2024

Hi both and thanks for the comments. Sorry, I am working on this PR, I've just had some higher priority items.

pdivstep(i8 dividend, i8 divisor, i32 nsteps). This intrinsic returns the result of performing nsteps steps in the polynomial divison of dividend by the divisor

So the return value is the quotient? Doesn't that complicate the expression of CRC? What we need for CRC is surely something like premstep instead?

Actual loop could be using reverse representation, but we can express it by combining with bitreverse intrinsics.

Would it be easier to have a bit reversed variation of the intrinsic? We'd have to identify these bitreverse intrinsics upon expansion anyway. It would just avoid having to insert and then recognize it.

In general I think this is a good plan, and was the original vision. If this whole plan is approved, would the first MR be just recognition and creation of the division intrinsic? I.E strip the table based lookup from this MR and emit intrinsic but would error in the backend.

@mgudim
Copy link
Contributor

mgudim commented Mar 7, 2024

So the return value is the quotient? Doesn't that complicate the expression of CRC? What we need for CRC is surely something like premstep instead?

Yes, you're right. Actually, maybe we need both? In coremark after we emit crc8 intinsic, after inlining they combine to a larger crc (32 I think). To recognize that, we need to see that after doing 8-bit crc we continue with the quotient that we got, right? My memory is vague on this, so please correct me if I am wrong.

Would it be easier to have a bit reversed variation of the intrinsic? We'd have to identify these bitreverse intrinsics upon expansion anyway. It would just avoid having to insert and then recognize it.

I am OK with that too. I just didn't want to introduce more intrinsics than we need.

would the first MR be just recognition and creation of the division intrinsic?

I think first MR should be just defining the intrinsics.

would the first MR be just recognition and creation of the division intrinsic? I.E strip the table based lookup from this MR

yes.

but would error in the backend.

Only emit intrinsic if a flag is passed so it doesn't break anything. Name of the flag could indicate that at this point the code gen doesn't work yet for the intrinsic. But at the same time you should be able to add tests to your MR.

@mgudim
Copy link
Contributor

mgudim commented Mar 7, 2024

So the return value is the quotient? Doesn't that complicate the expression of CRC? What we need for CRC is surely something like premstep instead?

Yes, you're right. Actually, maybe we need both? In coremark after we emit crc8 intinsic, after inlining they combine to a larger crc (32 I think). To recognize that, we need to see that after doing 8-bit crc we continue with the quotient that we got, right? My memory is vague on this, so please correct me if I am wrong.

Sorry for all the noise. Looks like we only need the remainder after all. The identity is something like:

prem((prem(x, d, k_1) << s) || y), d, k_2) = prem((x << s ) || y, d, k_1 + k_2)

Copy link

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff 182ab1c7034b951433fb8831b67e7758fe61d4e8 9ebb437002f79e51f1ece7e17c06bd821f6ae7b7 -- llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
View the diff from clang-format here.
diff --git a/llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp b/llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
index b4bc8d71db..8bec76d5d3 100644
--- a/llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
+++ b/llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
@@ -2955,7 +2955,7 @@ public:
         _RHS = new ValueBit(*VB._RHS);
       }
     }
-    ValueBit& operator=(const ValueBit &VB) {
+    ValueBit &operator=(const ValueBit &VB) {
       _Type = VB._Type;
       if (_Type == BitType::REF)
         _BitRef = VB._BitRef;
@@ -3272,8 +3272,8 @@ symbolicallyExecute(BasicBlock *BB,
         }
       }
       assert(IncomingBlock);
-      ValueMap.insert({&I,
-          getOrCreateValueBits(PHI->getIncomingValueForBlock(IncomingBlock))});
+      ValueMap.insert({&I, getOrCreateValueBits(
+                               PHI->getIncomingValueForBlock(IncomingBlock))});
     } break;
     case Instruction::Shl: {
       ConstantInt *CI = getConstantOperand(&I, 1);
@@ -3320,8 +3320,8 @@ symbolicallyExecute(BasicBlock *BB,
       }
       auto IfTrue = getOrCreateValueBits(Select->getTrueValue());
       auto IfFalse = getOrCreateValueBits(Select->getFalseValue());
-      ValueMap.insert({&I,
-          std::make_shared<PredicatedValueBits>(Cond, IfTrue, IfFalse)});
+      ValueMap.insert(
+          {&I, std::make_shared<PredicatedValueBits>(Cond, IfTrue, IfFalse)});
     } break;
     default:
       // If this instruction is not recognized, then just continue. This is
@@ -3456,7 +3456,7 @@ bool LoopIdiomRecognize::recognizeCRC(const SCEV *BECount) {
   // any unexpected loop variant operations happening, e.g. additional select
   // logic or shifts, then this will be captured in the ValueBits.
   std::map<Value *, std::shared_ptr<ValueBits>> ValueMap;
-    
+
   if (!symbolicallyExecute(CurLoop->getHeader(), ValueMap))
     return false;
 
@@ -3487,8 +3487,8 @@ bool LoopIdiomRecognize::recognizeCRC(const SCEV *BECount) {
   // whether this is bit reversed CRC
   ICmpInst *ICmp = CRCOutBitsPred->getPredicate();
   CmpInst::Predicate Pred = ICmp->getPredicate();
-  LLVM_DEBUG(dbgs() << DEBUG_TYPE << " CRCRecognize checking to see if " << *ICmp
-                    << " is checking the "
+  LLVM_DEBUG(dbgs() << DEBUG_TYPE << " CRCRecognize checking to see if "
+                    << *ICmp << " is checking the "
                     << (CRC.BitReversed ? "LSB\n" : "MSB\n"));
 
   // Firstly check the LHS is in our map, and RHS is a constant
@@ -3620,7 +3620,8 @@ bool LoopIdiomRecognize::recognizeCRC(const SCEV *BECount) {
   }
   uint64_t GeneratorPolynomial =
       CRC.BitReversed ? reverseBits(CRC.Polynomial, CRCSize) : CRC.Polynomial;
-  PValueBits Polynomial = std::make_shared<ValueBits>(GeneratorPolynomial, CRCSize);
+  PValueBits Polynomial =
+      std::make_shared<ValueBits>(GeneratorPolynomial, CRCSize);
 
   // Case where the MSB/LSB of the data is 0
   PValueBits IfZero = CRC.BitReversed ? ValueBits::LShr(*CRCValueBits, 1)

@joe-img
Copy link
Author

joe-img commented Mar 21, 2024

So the return value is the quotient? Doesn't that complicate the expression of CRC? What we need for CRC is surely something like premstep instead?

Yes, you're right. Actually, maybe we need both? In coremark after we emit crc8 intinsic, after inlining they combine to a larger crc (32 I think). To recognize that, we need to see that after doing 8-bit crc we continue with the quotient that we got, right? My memory is vague on this, so please correct me if I am wrong.

Sorry for all the noise. Looks like we only need the remainder after all. The identity is something like:

prem((prem(x, d, k_1) << s) || y), d, k_2) = prem((x << s ) || y, d, k_1 + k_2)

I'm a bit confused with all those arguments to prem. What do they represent?

Also, what is considered a 'step' in polynomial division?

In CoreMark, the helper function ee_u16 crcu8(ee_u8 data, ee_u16 crc ) is used, but this is still crc16, as the generator polynomial and remainder are 16-bits wide. I think it's called 'crcu8' because it xor's in 8 bits of data.

If we're using a prem intrinsic, then surely it makes most sense for it to be something like prem(divisor, dividend), where it returns the remainder after polynomial division? With this, I see two issues:
Firstly, for crcn, prem would have to take a divisor of bitsize 2n. CRC implementations avoid this by using a linear feedback shift register.

Secondly, I think this complicates the expression of crc. For example, I'm pretty sure that coremark's crcu8 will be represented like:

(crc << 8) ^ prem((((crc & 0xff00) >> 8) ^ (data)) << 16), poly | (1 << 16))

Although crc is the remainder of polynomial division accross some stream of data, the way it's implemented in crc, I don't think matches up very well with a polynomial remainder intrinsic.

We can expand back into a loop (or unrolled loop) like expandMemCpyAsLoop

Is this necessary? If in LoopIdiomRecognize we're recognizing the loop, can we just avoid emitting the intrinsic using the same logic as would be in PreISelIntrinsicLowering to decide whether to lower to loop?

@mgudim
Copy link
Contributor

mgudim commented Apr 9, 2024

Hi @joe-img, I am really sorry, I am still too busy with other things (and I don't know for how long).

This will check to see if a loop looks like CRC, not necessarily
guaranteeing that it is CRC.
This is a representation of a value's bits in terms of references to
other values' bits, or 1/0 if the bit is known. This allows symbolic
execution of bitwise instructions without knowing the exact values.

Example:

LLVM IR Value i8 %x:
[%x[7], %x[6], %x[5], %x[4], %x[3], %x[2], %x[1], %x[0]]

%shr = lshr i8 %x, 2
[ 0, 0, %x[7], %x[6], %x[5], %x[4], %x[3], %x[2]]

%shl = shl i8 %shr, 1
[ 0, %x[7], %x[6], %x[5], %x[4], %x[3], %x[2], 0]

%xor = xor i8 %shl, 0xb
[ 0, %x[7], %x[6], %x[5], %x[4]^1, %x[3], %x[2]^1, 1]
These would be representitive of select or phi instructions where the
bits would depend on an icmp.
The result is a map between llvm Values and their bit representations as
ValueBits.
- Use shared pointers for factory methods

I think the use of shared pointers in `symbolicallyExecute` is the
correct decision due to the filling of a map. I tried for a while to
make this a map of objects, but due to [object
slicing](https://en.wikipedia.org/wiki/Object_slicing) it would not be
possible to store a PredicatedValueBits in this map. (This would require
PredicatedValueBits to be redesigned into ValueBits class, but then the
internal methods get messy. Virtual functions was a clean answer to
this) Additionally, using `unique_ptr` proves difficult due to having to
store it in a map. And unique pointers cannot be copied.  As
`symbolicallyExecute` is the primary user of the factory methods, it
made sense for the factory methods to return a shared pointer instead of
a raw pointer or object and then wrapping this in a shared pointer
within symbolicallyExecute.

- Have XOR bits own their LHS/RHS

Having them as member objects instead of pointers I don't think makes
much sense due to them not being needed for most ValueBit objects, but
they would have to be initialized for all ValueBit instantiations.

- Use object values instead of pointers for bit representation

IMO the cleanest way. They're very lightweight anyway.
getNameOrAsOperand is only available in debug mode, so causes failures
when built in release.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants