Skip to content

Conversation

atrosinenko
Copy link
Contributor

@atrosinenko atrosinenko commented Aug 28, 2025

Refactor MCInstReference class and move it from PAuth gadget scanner to Core.

MCInstReference is a class representing a reference to a constant instruction inside a parent entity - either inside a basic block (which has a reference to its parent function) or directly inside a function (when CFG information is not available).

This patch reapplies #138655 with a fix for iterator usage and multiple minor issues fixed during the second round of review.

Refactor MCInstReference class and move it from PAuth gadget scanner to
Core.

MCInstReference is a class representing a constant reference to an
instruction inside a parent entity - either inside a basic block (which
has a reference to its parent function) or directly inside a function
(when CFG information is not available).
@llvmbot
Copy link
Member

llvmbot commented Aug 28, 2025

@llvm/pr-subscribers-bolt

Author: Anatoly Trosinenko (atrosinenko)

Changes

Refactor MCInstReference class and move it from PAuth gadget scanner to Core.

MCInstReference is a class representing a constant reference to an instruction inside a parent entity - either inside a basic block (which has a reference to its parent function) or directly inside a function (when CFG information is not available).

This patch reapplies #138655 with MCInstReference::getAddress() function introduced and a fix for iterator usage.


Patch is 21.65 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/155846.diff

5 Files Affected:

  • (added) bolt/include/bolt/Core/MCInstUtils.h (+163)
  • (modified) bolt/include/bolt/Passes/PAuthGadgetScanner.h (+1-175)
  • (modified) bolt/lib/Core/CMakeLists.txt (+1)
  • (added) bolt/lib/Core/MCInstUtils.cpp (+104)
  • (modified) bolt/lib/Passes/PAuthGadgetScanner.cpp (+16-55)
diff --git a/bolt/include/bolt/Core/MCInstUtils.h b/bolt/include/bolt/Core/MCInstUtils.h
new file mode 100644
index 0000000000000..a2797c6b31aa2
--- /dev/null
+++ b/bolt/include/bolt/Core/MCInstUtils.h
@@ -0,0 +1,163 @@
+//===- bolt/Core/MCInstUtils.h ----------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef BOLT_CORE_MCINSTUTILS_H
+#define BOLT_CORE_MCINSTUTILS_H
+
+#include "bolt/Core/BinaryBasicBlock.h"
+#include <map>
+#include <variant>
+
+namespace llvm {
+class MCCodeEmitter;
+}
+
+namespace llvm {
+namespace bolt {
+
+class BinaryFunction;
+
+/// MCInstReference represents a reference to a constant MCInst as stored either
+/// in a BinaryFunction (i.e. before a CFG is created), or in a BinaryBasicBlock
+/// (after a CFG is created).
+class MCInstReference {
+public:
+  using nocfg_const_iterator = std::map<uint32_t, MCInst>::const_iterator;
+
+  /// Constructs an empty reference.
+  MCInstReference() : Reference(RefInBB(nullptr, nullptr)) {}
+  /// Constructs a reference to the instruction inside the basic block.
+  MCInstReference(const BinaryBasicBlock *BB, const MCInst *Inst)
+      : Reference(RefInBB(BB, Inst)) {
+    assert(BB && Inst && "Neither BB nor Inst should be nullptr");
+  }
+  /// Constructs a reference to the instruction inside the basic block.
+  MCInstReference(const BinaryBasicBlock *BB, unsigned Index)
+      : Reference(RefInBB(BB, &BB->getInstructionAtIndex(Index))) {
+    assert(BB && "Basic block should not be nullptr");
+  }
+  /// Constructs a reference to the instruction inside the function without
+  /// CFG information.
+  MCInstReference(const BinaryFunction *BF, nocfg_const_iterator It)
+      : Reference(RefInBF(BF, It)) {
+    assert(BF && "Function should not be nullptr");
+  }
+
+  /// Locates an instruction inside a function and returns a reference.
+  static MCInstReference get(const MCInst *Inst, const BinaryFunction &BF);
+
+  bool operator==(const MCInstReference &Other) const {
+    return Reference == Other.Reference;
+  }
+
+  const MCInst &getMCInst() const {
+    assert(!empty() && "Empty reference");
+    if (auto *Ref = tryGetRefInBB())
+      return *Ref->Inst;
+    return getRefInBF().It->second;
+  }
+
+  operator const MCInst &() const { return getMCInst(); }
+
+  bool empty() const {
+    if (auto *Ref = tryGetRefInBB())
+      return Ref->BB == nullptr;
+    return getRefInBF().BF == nullptr;
+  }
+
+  bool hasCFG() const { return !empty() && tryGetRefInBB() != nullptr; }
+
+  const BinaryFunction *getFunction() const {
+    assert(!empty() && "Empty reference");
+    if (auto *Ref = tryGetRefInBB())
+      return Ref->BB->getFunction();
+    return getRefInBF().BF;
+  }
+
+  const BinaryBasicBlock *getBasicBlock() const {
+    assert(!empty() && "Empty reference");
+    if (auto *Ref = tryGetRefInBB())
+      return Ref->BB;
+    return nullptr;
+  }
+
+  // MCCodeEmitter is not thread safe.
+  uint64_t getAddress(const MCCodeEmitter *Emitter = nullptr) const;
+
+  raw_ostream &print(raw_ostream &OS) const;
+
+private:
+  // Two cases are possible:
+  // * functions with CFG reconstructed - a function stores a collection of
+  //   basic blocks, each basic block stores a contiguous vector of MCInst
+  // * functions without CFG - there are no basic blocks created,
+  //   the instructions are directly stored in std::map in BinaryFunction
+  //
+  // In both cases, the direct parent of MCInst is stored together with an
+  // iterator pointing to the instruction.
+
+  // Helper struct: CFG is available, the direct parent is a basic block,
+  // iterator's type is `MCInst *`.
+  struct RefInBB {
+    RefInBB(const BinaryBasicBlock *BB, const MCInst *Inst)
+        : BB(BB), Inst(Inst) {}
+    RefInBB(const RefInBB &Other) = default;
+    RefInBB &operator=(const RefInBB &Other) = default;
+
+    const BinaryBasicBlock *BB;
+    const MCInst *Inst;
+
+    bool operator==(const RefInBB &Other) const {
+      return BB == Other.BB && Inst == Other.Inst;
+    }
+  };
+
+  // Helper struct: CFG is *not* available, the direct parent is a function,
+  // iterator's type is std::map<uint32_t, MCInst>::iterator (the mapped value
+  // is an instruction's offset).
+  struct RefInBF {
+    RefInBF(const BinaryFunction *BF, nocfg_const_iterator It)
+        : BF(BF), It(It) {}
+    RefInBF(const RefInBF &Other) = default;
+    RefInBF &operator=(const RefInBF &Other) = default;
+
+    const BinaryFunction *BF;
+    nocfg_const_iterator It;
+
+    bool operator==(const RefInBF &Other) const {
+      return BF == Other.BF && It->first == Other.It->first;
+    }
+  };
+
+  std::variant<RefInBB, RefInBF> Reference;
+
+  // Utility methods to be used like this:
+  //
+  //     if (auto *Ref = tryGetRefInBB())
+  //       return Ref->doSomething(...);
+  //     return getRefInBF().doSomethingElse(...);
+  const RefInBB *tryGetRefInBB() const {
+    assert(std::get_if<RefInBB>(&Reference) ||
+           std::get_if<RefInBF>(&Reference));
+    return std::get_if<RefInBB>(&Reference);
+  }
+  const RefInBF &getRefInBF() const {
+    assert(std::get_if<RefInBF>(&Reference));
+    return *std::get_if<RefInBF>(&Reference);
+  }
+};
+
+static inline raw_ostream &operator<<(raw_ostream &OS,
+                                      const MCInstReference &Ref) {
+  return Ref.print(OS);
+}
+
+} // namespace bolt
+} // namespace llvm
+
+#endif
diff --git a/bolt/include/bolt/Passes/PAuthGadgetScanner.h b/bolt/include/bolt/Passes/PAuthGadgetScanner.h
index 721fd664a3253..cb865a725d72a 100644
--- a/bolt/include/bolt/Passes/PAuthGadgetScanner.h
+++ b/bolt/include/bolt/Passes/PAuthGadgetScanner.h
@@ -11,187 +11,13 @@
 
 #include "bolt/Core/BinaryContext.h"
 #include "bolt/Core/BinaryFunction.h"
+#include "bolt/Core/MCInstUtils.h"
 #include "bolt/Passes/BinaryPasses.h"
 #include "llvm/Support/raw_ostream.h"
 #include <memory>
 
 namespace llvm {
 namespace bolt {
-
-/// @brief  MCInstReference represents a reference to an MCInst as stored either
-/// in a BinaryFunction (i.e. before a CFG is created), or in a BinaryBasicBlock
-/// (after a CFG is created). It aims to store the necessary information to be
-/// able to find the specific MCInst in either the BinaryFunction or
-/// BinaryBasicBlock data structures later, so that e.g. the InputAddress of
-/// the corresponding instruction can be computed.
-
-struct MCInstInBBReference {
-  BinaryBasicBlock *BB;
-  int64_t BBIndex;
-  MCInstInBBReference(BinaryBasicBlock *BB, int64_t BBIndex)
-      : BB(BB), BBIndex(BBIndex) {}
-  MCInstInBBReference() : BB(nullptr), BBIndex(0) {}
-  static MCInstInBBReference get(const MCInst *Inst, BinaryFunction &BF) {
-    for (BinaryBasicBlock &BB : BF)
-      for (size_t I = 0; I < BB.size(); ++I)
-        if (Inst == &BB.getInstructionAtIndex(I))
-          return MCInstInBBReference(&BB, I);
-    return {};
-  }
-  bool operator==(const MCInstInBBReference &RHS) const {
-    return BB == RHS.BB && BBIndex == RHS.BBIndex;
-  }
-  bool operator<(const MCInstInBBReference &RHS) const {
-    return std::tie(BB, BBIndex) < std::tie(RHS.BB, RHS.BBIndex);
-  }
-  operator MCInst &() const {
-    assert(BB != nullptr);
-    return BB->getInstructionAtIndex(BBIndex);
-  }
-  uint64_t getAddress() const {
-    // 4 bytes per instruction on AArch64.
-    // FIXME: the assumption of 4 byte per instruction needs to be fixed before
-    // this method gets used on any non-AArch64 binaries (but should be fine for
-    // pac-ret analysis, as that is an AArch64-specific feature).
-    return BB->getFunction()->getAddress() + BB->getOffset() + BBIndex * 4;
-  }
-};
-
-raw_ostream &operator<<(raw_ostream &OS, const MCInstInBBReference &);
-
-struct MCInstInBFReference {
-  BinaryFunction *BF;
-  uint64_t Offset;
-  MCInstInBFReference(BinaryFunction *BF, uint64_t Offset)
-      : BF(BF), Offset(Offset) {}
-
-  static MCInstInBFReference get(const MCInst *Inst, BinaryFunction &BF) {
-    for (auto &I : BF.instrs())
-      if (Inst == &I.second)
-        return MCInstInBFReference(&BF, I.first);
-    return {};
-  }
-
-  MCInstInBFReference() : BF(nullptr), Offset(0) {}
-  bool operator==(const MCInstInBFReference &RHS) const {
-    return BF == RHS.BF && Offset == RHS.Offset;
-  }
-  bool operator<(const MCInstInBFReference &RHS) const {
-    return std::tie(BF, Offset) < std::tie(RHS.BF, RHS.Offset);
-  }
-  operator MCInst &() const {
-    assert(BF != nullptr);
-    return *BF->getInstructionAtOffset(Offset);
-  }
-
-  uint64_t getOffset() const { return Offset; }
-
-  uint64_t getAddress() const { return BF->getAddress() + getOffset(); }
-};
-
-raw_ostream &operator<<(raw_ostream &OS, const MCInstInBFReference &);
-
-struct MCInstReference {
-  enum Kind { FunctionParent, BasicBlockParent };
-  Kind ParentKind;
-  union U {
-    MCInstInBBReference BBRef;
-    MCInstInBFReference BFRef;
-    U(MCInstInBBReference BBRef) : BBRef(BBRef) {}
-    U(MCInstInBFReference BFRef) : BFRef(BFRef) {}
-  } U;
-  MCInstReference(MCInstInBBReference BBRef)
-      : ParentKind(BasicBlockParent), U(BBRef) {}
-  MCInstReference(MCInstInBFReference BFRef)
-      : ParentKind(FunctionParent), U(BFRef) {}
-  MCInstReference(BinaryBasicBlock *BB, int64_t BBIndex)
-      : MCInstReference(MCInstInBBReference(BB, BBIndex)) {}
-  MCInstReference(BinaryFunction *BF, uint32_t Offset)
-      : MCInstReference(MCInstInBFReference(BF, Offset)) {}
-
-  static MCInstReference get(const MCInst *Inst, BinaryFunction &BF) {
-    if (BF.hasCFG())
-      return MCInstInBBReference::get(Inst, BF);
-    return MCInstInBFReference::get(Inst, BF);
-  }
-
-  bool operator<(const MCInstReference &RHS) const {
-    if (ParentKind != RHS.ParentKind)
-      return ParentKind < RHS.ParentKind;
-    switch (ParentKind) {
-    case BasicBlockParent:
-      return U.BBRef < RHS.U.BBRef;
-    case FunctionParent:
-      return U.BFRef < RHS.U.BFRef;
-    }
-    llvm_unreachable("");
-  }
-
-  bool operator==(const MCInstReference &RHS) const {
-    if (ParentKind != RHS.ParentKind)
-      return false;
-    switch (ParentKind) {
-    case BasicBlockParent:
-      return U.BBRef == RHS.U.BBRef;
-    case FunctionParent:
-      return U.BFRef == RHS.U.BFRef;
-    }
-    llvm_unreachable("");
-  }
-
-  operator MCInst &() const {
-    switch (ParentKind) {
-    case BasicBlockParent:
-      return U.BBRef;
-    case FunctionParent:
-      return U.BFRef;
-    }
-    llvm_unreachable("");
-  }
-
-  operator bool() const {
-    switch (ParentKind) {
-    case BasicBlockParent:
-      return U.BBRef.BB != nullptr;
-    case FunctionParent:
-      return U.BFRef.BF != nullptr;
-    }
-    llvm_unreachable("");
-  }
-
-  uint64_t getAddress() const {
-    switch (ParentKind) {
-    case BasicBlockParent:
-      return U.BBRef.getAddress();
-    case FunctionParent:
-      return U.BFRef.getAddress();
-    }
-    llvm_unreachable("");
-  }
-
-  BinaryFunction *getFunction() const {
-    switch (ParentKind) {
-    case FunctionParent:
-      return U.BFRef.BF;
-    case BasicBlockParent:
-      return U.BBRef.BB->getFunction();
-    }
-    llvm_unreachable("");
-  }
-
-  BinaryBasicBlock *getBasicBlock() const {
-    switch (ParentKind) {
-    case FunctionParent:
-      return nullptr;
-    case BasicBlockParent:
-      return U.BBRef.BB;
-    }
-    llvm_unreachable("");
-  }
-};
-
-raw_ostream &operator<<(raw_ostream &OS, const MCInstReference &);
-
 namespace PAuthGadgetScanner {
 
 // The report classes are designed to be used in an immutable manner.
diff --git a/bolt/lib/Core/CMakeLists.txt b/bolt/lib/Core/CMakeLists.txt
index fc72dc023c590..58cfcab370f16 100644
--- a/bolt/lib/Core/CMakeLists.txt
+++ b/bolt/lib/Core/CMakeLists.txt
@@ -32,6 +32,7 @@ add_llvm_library(LLVMBOLTCore
   GDBIndex.cpp
   HashUtilities.cpp
   JumpTable.cpp
+  MCInstUtils.cpp
   MCPlusBuilder.cpp
   ParallelUtilities.cpp
   Relocation.cpp
diff --git a/bolt/lib/Core/MCInstUtils.cpp b/bolt/lib/Core/MCInstUtils.cpp
new file mode 100644
index 0000000000000..f721c259e9b28
--- /dev/null
+++ b/bolt/lib/Core/MCInstUtils.cpp
@@ -0,0 +1,104 @@
+//===- bolt/Passes/MCInstUtils.cpp ----------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "bolt/Core/MCInstUtils.h"
+#include "bolt/Core/BinaryBasicBlock.h"
+#include "bolt/Core/BinaryFunction.h"
+#include "llvm/ADT/iterator.h"
+
+#include <type_traits>
+
+using namespace llvm;
+using namespace llvm::bolt;
+
+// It is assumed in a few places that BinaryBasicBlock stores its instructions
+// in a contiguous vector. Give this assumption a name to simplify marking the
+// particular places with static_assert.
+using BasicBlockStorageIsVector =
+    std::is_same<BinaryBasicBlock::const_iterator,
+                 std::vector<MCInst>::const_iterator>;
+
+namespace {
+// Cannot reuse MCPlusBuilder::InstructionIterator because it has to be
+// constructed from a non-const std::map iterator.
+class mapped_mcinst_iterator
+    : public iterator_adaptor_base<mapped_mcinst_iterator,
+                                   MCInstReference::nocfg_const_iterator> {
+public:
+  mapped_mcinst_iterator(MCInstReference::nocfg_const_iterator It)
+      : iterator_adaptor_base(It) {}
+  const MCInst &operator*() const { return this->I->second; }
+};
+} // anonymous namespace
+
+MCInstReference MCInstReference::get(const MCInst *Inst,
+                                     const BinaryFunction &BF) {
+  if (BF.hasCFG()) {
+    for (BinaryBasicBlock &BB : BF)
+      for (MCInst &MI : BB)
+        if (&MI == Inst)
+          return MCInstReference(&BB, Inst);
+    return {};
+  }
+
+  for (auto I = BF.instrs().begin(), E = BF.instrs().end(); I != E; ++I) {
+    if (&I->second == Inst)
+      return MCInstReference(&BF, I);
+  }
+  return {};
+}
+
+uint64_t MCInstReference::getAddress(const MCCodeEmitter *Emitter) const {
+  assert(!empty() && "Taking instruction address by empty reference");
+
+  const BinaryContext &BC = getFunction()->getBinaryContext();
+  if (auto *Ref = tryGetRefInBB()) {
+    static_assert(BasicBlockStorageIsVector::value,
+                  "Cannot use 'const MCInst *' as iterator type");
+    uint64_t AddressOfBB = getFunction()->getAddress() + Ref->BB->getOffset();
+    const MCInst *FirstInstInBB = &*Ref->BB->begin();
+
+    uint64_t OffsetInBB = BC.computeCodeSize(FirstInstInBB, Ref->Inst, Emitter);
+
+    return AddressOfBB + OffsetInBB;
+  }
+
+  auto &Ref = getRefInBF();
+  mapped_mcinst_iterator FirstInstInBF(Ref.BF->instrs().begin());
+  mapped_mcinst_iterator ThisInst(Ref.It);
+
+  uint64_t OffsetInBF = BC.computeCodeSize(FirstInstInBF, ThisInst, Emitter);
+
+  return getFunction()->getAddress() + OffsetInBF;
+}
+
+raw_ostream &MCInstReference::print(raw_ostream &OS) const {
+  if (const RefInBB *Ref = tryGetRefInBB()) {
+    OS << "MCInstBBRef<";
+    if (Ref->BB == nullptr) {
+      OS << "BB:(null)";
+    } else {
+      static_assert(BasicBlockStorageIsVector::value,
+                    "Cannot use pointer arithmetic on 'const MCInst *'");
+      const MCInst *FirstInstInBB = &*Ref->BB->begin();
+      unsigned IndexInBB = Ref->Inst - FirstInstInBB;
+      OS << "BB:" << Ref->BB->getName() << ":" << IndexInBB;
+    }
+    OS << ">";
+    return OS;
+  }
+
+  const RefInBF &Ref = getRefInBF();
+  OS << "MCInstBFRef<";
+  if (Ref.BF == nullptr)
+    OS << "BF:(null)";
+  else
+    OS << "BF:" << Ref.BF->getPrintName() << ":" << Ref.It->first;
+  OS << ">";
+  return OS;
+}
diff --git a/bolt/lib/Passes/PAuthGadgetScanner.cpp b/bolt/lib/Passes/PAuthGadgetScanner.cpp
index 65c84ebc8c4f4..287cbe31df3af 100644
--- a/bolt/lib/Passes/PAuthGadgetScanner.cpp
+++ b/bolt/lib/Passes/PAuthGadgetScanner.cpp
@@ -24,39 +24,6 @@
 
 namespace llvm {
 namespace bolt {
-
-raw_ostream &operator<<(raw_ostream &OS, const MCInstInBBReference &Ref) {
-  OS << "MCInstBBRef<";
-  if (Ref.BB == nullptr)
-    OS << "BB:(null)";
-  else
-    OS << "BB:" << Ref.BB->getName() << ":" << Ref.BBIndex;
-  OS << ">";
-  return OS;
-}
-
-raw_ostream &operator<<(raw_ostream &OS, const MCInstInBFReference &Ref) {
-  OS << "MCInstBFRef<";
-  if (Ref.BF == nullptr)
-    OS << "BF:(null)";
-  else
-    OS << "BF:" << Ref.BF->getPrintName() << ":" << Ref.getOffset();
-  OS << ">";
-  return OS;
-}
-
-raw_ostream &operator<<(raw_ostream &OS, const MCInstReference &Ref) {
-  switch (Ref.ParentKind) {
-  case MCInstReference::BasicBlockParent:
-    OS << Ref.U.BBRef;
-    return OS;
-  case MCInstReference::FunctionParent:
-    OS << Ref.U.BFRef;
-    return OS;
-  }
-  llvm_unreachable("");
-}
-
 namespace PAuthGadgetScanner {
 
 [[maybe_unused]] static void traceInst(const BinaryContext &BC, StringRef Label,
@@ -91,10 +58,10 @@ template <typename T> static void iterateOverInstrs(BinaryFunction &BF, T Fn) {
   if (BF.hasCFG()) {
     for (BinaryBasicBlock &BB : BF)
       for (int64_t I = 0, E = BB.size(); I < E; ++I)
-        Fn(MCInstInBBReference(&BB, I));
+        Fn(MCInstReference(&BB, I));
   } else {
-    for (auto I : BF.instrs())
-      Fn(MCInstInBFReference(&BF, I.first));
+    for (auto I = BF.instrs().begin(), E = BF.instrs().end(); I != E; ++I)
+      Fn(MCInstReference(&BF, I));
   }
 }
 
@@ -566,7 +533,7 @@ class SrcSafetyAnalysis {
     std::vector<MCInstReference> Result;
     for (const MCInst *Inst : lastWritingInsts(S, ClobberedReg)) {
       MCInstReference Ref = MCInstReference::get(Inst, BF);
-      assert(Ref && "Expected Inst to be found");
+      assert(!Ref.empty() && "Expected Inst to be found");
       Result.push_back(Ref);
     }
     return Result;
@@ -1138,7 +1105,7 @@ class DstSafetyAnalysis {
     std::vector<MCInstReference> Result;
     for (const MCInst *Inst : firstLeakingInsts(S, LeakedReg)) {
       MCInstReference Ref = MCInstReference::get(Inst, BF);
-      assert(Ref && "Expected Inst to be found");
+      assert(!Ref.empty() && "Expected Inst to be found");
       Result.push_back(Ref);
     }
     return Result;
@@ -1345,8 +1312,7 @@ static bool shouldAnalyzeTailCallInst(const BinaryContext &BC,
   // (such as isBranch at the time of writing this comment), some don't (such
   // as isCall). For that reason, call MCInstrDesc's methods explicitly when
   // it is important.
-  const MCInstrDesc &Desc =
-      BC.MII->get(static_cast<const MCInst &>(Inst).getOpcode());
+  const MCInstrDesc &Desc = BC.MII->get(Inst.getMCInst().getOpcode());
   // Tail call should be a branch (but not necessarily an indirect one).
   if (!Desc.isBranch())
     return false;
@@ -1541,7 +1507,7 @@ void FunctionAnalysisContext::findUnsafeUses(
       // This is printed as "[message] in function [name], basic block ...,
       // at address ..." when the issue is reported to the user.
       Reports.push_back(make_generic_report(
-          MCInstReference::get(FirstInst, BF),
+          MCInstReference(&BB, FirstInst),
           "Warning: possibly imprecise CFG, the analysis quality may be "
           "degraded in this function. According to BOLT, unreachable code is "
           "found" /* in function [name]... */));
@@ -1711,25 +1677,20 @@ static void printBB(const BinaryContext &BC, const BinaryBasicBlock *BB,
     EndIndex = BB->size() - 1;
   const BinaryFunction *BF = BB->getFunction();
   for (unsigned I = StartIndex; I <= EndIndex; ++I) {
-    // FIXME: this assumes all instructions are 4 bytes in size. This is true
-    // for AArch64, but it might be good to extract this function so it can be
-    // used elsewhere and for other targets too.
-    uint64_t Address = BB->getOffset() + BF->getAddress() + 4 * I;
-    const MCInst &Inst = BB->getInstructionAtIndex(I);
+    MCInstReference Inst(BB, I);
     if (BC.MIB->isCFI(Inst))
       continue;
-    BC.printInstruction(outs(), Inst, Address, BF);
+    BC.printInstruction(outs(), Inst, Inst.getAddress(), BF);
   }
 }
 
 static void reportFoundGadgetInSingleBBSingleRelatedInst(
     raw_ostream &OS, const BinaryC...
[truncated]

@atrosinenko
Copy link
Contributor Author

Ping.

@aaupov
Copy link
Contributor

aaupov commented Sep 15, 2025

Hi @atrosinenko,
Thank you for working on this functionality, I've experienced a similar need to track individual instructions while working on the data flow graph prototype in BOLT.

However, using an instruction pointer can lead to hard-to-track bugs due to BB instructions reallocations (e.g. on insertion/removal) or simply deleting the instruction being referenced. Do you plan to cover that as well?

@atrosinenko
Copy link
Contributor Author

@aaupov This is a good question, especially given that one of the constructors accept a pair of basic block and the index of the instruction, which can mislead the user that at least inserting an instruction after the referenced one is safe. At the very least I have probably to warn the user in the class comment that this reference is invalidated when instructions are inserted/deleted to the containing basic block (IIRC there should be no such issue in the "no-CFG" case with std::map iterators, unless the referenced instruction itself is removed, of course).

On the other hand, when CFG is available I could store a pair of (pointer to basic block, index) instead of (pointer to basic block, pointer to instruction) - by the way, this could be computed in constant time when constructing a reference, as I already rely on the instructions being stored in a vector inside BinaryBasicBlock and there is already an assertion on that. This way, the referenced instruction would still silently change when an instruction is inserted or removed before the originally referenced one, but I wonder if it would be easier to debug - a different instruction would be returned in place of the originally referenced one, but it would be a valid instruction at least, and we can trivially assert on out-of-bounds references (when there are not enough instruction in the BB anymore).

@aaupov
Copy link
Contributor

aaupov commented Sep 17, 2025

@atrosinenko
Using a pointer to the basic block plus an index sounds reasonable (I've used it in a DFG prototype).
In terms of reference invalidation on insertion/removal, we could provide MCInstReferences through a class tracking if references are valid (potentially on a per-BB basis). For future iterations, it would also be nice to have a functionality to fix invalid references, but preventing incorrect usage is my primary concern for now.

@atrosinenko
Copy link
Contributor Author

I would rather postpone non-trivial invalidation tracking until it is actually used - thus switched back to BB+index approach so far (it was actually the approach used by the existing MCInstReference implementation) and added a brief warning to the class description in 2a25f3a.

}

// MCCodeEmitter is not thread safe.
uint64_t getAddress(const MCCodeEmitter *Emitter = nullptr) const;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add documentation for this method. What's the expected address returned?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Writing the documentation made it obvious to me that a name starting with "get" is rather misleading for this method:

  • it does not just retrieve the address from a field of some object, it performs computations every time
  • it does not return "the" address, it returns an approximation that is expected to be good enough for debug printing (and for the addresses reported by gadget scanner to the user, which are, frankly speaking, reported on the best-effort basis)

For that reason, I renamed this method in b068098 and refactored its callers a bit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. To clarify, the expected output is the address of the instruction in the input binary? I.e. it's only valid in the context of instructions that were present in the input?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... My particular use case is "read-only" analysis, but initially I thought getAddress() should work equally well for the "original" and "rewritten" code. Now I see getAddress() can only be as accurate as BinaryFunction::getAddress() and BinaryBasicBlock::getOffset().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, BinaryFunction::getAddress() returns the original address of the function. For the output, we don't know the address until code emission and mapping. If you are interested in output addresses, you can take a look at BinaryFunction::translateInputToOutputAddress() for approximation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for clarification, updated the description in ec6bff9.

/// Constructs an empty reference.
MCInstReference() : Reference(RefInBB(nullptr, /*Index=*/0)) {}
/// Constructs a reference to the instruction inside the basic block.
MCInstReference(const BinaryBasicBlock *BB, const MCInst *Inst)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, if you don't expect to pass/handle nullptr as a pointer, prefer a reference. You might need a special case for an empty reference.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for pointing this out, I replaced a number of pointer-typed arguments with references in 0558f1b and now it should be clearer to the user when empty references can be returned (the privately-used RefInBB and RefInBF helper classes still use pointers, though). Furthermore, I made MCInstReference::get accept both arguments by reference in 5a7414c: while it could have sense to return empty reference for (nullptr, SomeFunction) arguments (but even this usage is slightly questionable), it is definitely strange to pass unrelated instruction and function and expect the MCInstReference::get to silently return an empty reference.

Comment on lines 73 to 77
auto &Ref = getRefInBF();
mapped_mcinst_iterator FirstInstInBF(Ref.BF->instrs().begin());
mapped_mcinst_iterator ThisInst(Ref.It);

uint64_t OffsetInBF = BC.computeCodeSize(FirstInstInBF, ThisInst, Emitter);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the CFG, the offset of the instruction is stored as a key in BF.instrs() map. There's no need to recompute it.

With this comment addressed, the PR should be ready to go.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, missed this opportunity, thanks!

Copy link
Contributor

@maksfb maksfb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a couple of nits. Thanks!


const BinaryContext &BC = getFunction()->getBinaryContext();
if (auto *Ref = tryGetRefInBB()) {
uint64_t AddressOfBB = getFunction()->getAddress() + Ref->BB->getOffset();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
uint64_t AddressOfBB = getFunction()->getAddress() + Ref->BB->getOffset();
const uint64_t AddressOfBB = getFunction()->getAddress() + Ref->BB->getOffset();


// Usage of plain 'const MCInst *' as iterators assumes the instructions
// are stored in a vector, see BasicBlockStorageIsVector.
uint64_t OffsetInBB = BC.computeCodeSize(FirstInstInBB, ThisInst, Emitter);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
uint64_t OffsetInBB = BC.computeCodeSize(FirstInstInBB, ThisInst, Emitter);
const uint64_t OffsetInBB = BC.computeCodeSize(FirstInstInBB, ThisInst, Emitter);

}

auto &Ref = getRefInBF();
uint64_t OffsetInBF = Ref.It->first;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
uint64_t OffsetInBF = Ref.It->first;
const uint64_t OffsetInBF = Ref.It->first;

@atrosinenko atrosinenko merged commit da315a3 into main Sep 30, 2025
9 checks passed
@atrosinenko atrosinenko deleted the users/atrosinenko/retry-bolt-factor-out-mcinstreference branch September 30, 2025 09:46
mahesh-attarde pushed a commit to mahesh-attarde/llvm-project that referenced this pull request Oct 3, 2025
Refactor MCInstReference class and move it from PAuth gadget scanner to
Core.

MCInstReference is a class representing a reference to a constant
instruction inside a parent entity - either inside a basic block (which
has a reference to its parent function) or directly inside a function
(when CFG information is not available).

This patch reapplies llvm#138655 with a fix for iterator usage and multiple
minor issues fixed during the second round of review.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants