Modify llvm-gsymutil lookups to handle overlapping ranges correctly. #72350

clayborg · 2023-11-15T06:08:30Z

llvm-gsymutil allows address ranges to overlap. There was a bug where, if we had debug info for a function with a range like [0x100-0x200) and a symbol at the same start address but with a larger range like [0x100-0x300), a lookup would arbitrarily return information from only one of the two entries, depending on where the binary search of the address table landed. This could cause lookups to fail.

This patch makes sure that lookups find the first address table entry that can match an address, and that we always select the first FunctionInfo that could match. FunctionInfo entries are sorted such that the entries richest in debug info come first, and if two ranges have the same start address, the smaller range comes first and the larger one next. This patch also adds the ability to iterate over all function infos with the same start address so that we always find a range that contains the address.
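The lookup strategy described above can be sketched in isolation. This is a minimal illustration, not the GsymReader code: `Entry`, `findEntry` and `lookupName` are hypothetical names, and the table is a plain sorted vector rather than an encoded GSYM address table. It shows the three steps: binary search, backing up to the first entry with the same start address, then scanning forward for a range that contains the address (a zero size is treated as a match, as in the patch).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for an address table entry and its function info.
struct Entry {
  uint64_t Start;   // Function start address.
  uint32_t Size;    // Size in bytes; 0 means the size is unknown.
  std::string Name;
};

// Returns the index of the first entry whose range contains Addr.
// Table must be sorted by Start, with smaller sizes first on ties.
std::optional<std::size_t> findEntry(const std::vector<Entry> &Table,
                                     uint64_t Addr) {
  assert(std::is_sorted(Table.begin(), Table.end(),
                        [](const Entry &L, const Entry &R) {
                          return L.Start < R.Start;
                        }));
  // Binary search for the first entry that starts after Addr.
  auto It = std::upper_bound(
      Table.begin(), Table.end(), Addr,
      [](uint64_t A, const Entry &E) { return A < E.Start; });
  if (It == Table.begin())
    return std::nullopt; // Addr is below every start address.
  --It;                  // Last entry with Start <= Addr.
  // Back up to the first entry that shares this start address.
  while (It != Table.begin() && std::prev(It)->Start == It->Start)
    --It;
  // Scan forward over entries with the same start address until one
  // contains Addr. A zero size counts as a match.
  const uint64_t First = It->Start;
  for (; It != Table.end() && It->Start == First; ++It)
    if (It->Size == 0 || Addr - It->Start < It->Size)
      return static_cast<std::size_t>(It - Table.begin());
  return std::nullopt;
}

// Convenience wrapper used for illustration only.
std::string lookupName(const std::vector<Entry> &Table, uint64_t Addr) {
  if (auto Idx = findEntry(Table, Addr))
    return Table[*Idx].Name;
  return "<not found>";
}
```

With a table holding "foo" at [0x1000-0x1050) followed by "bar" at [0x1000-0x1100), addresses below 0x1050 resolve to "foo" and the remaining addresses up to 0x10ff resolve to "bar", which mirrors what the unit test in this PR checks.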

Added a unit test for this functionality; it failed prior to this fix and now succeeds.

llvmbot · 2023-11-15T06:08:58Z

@llvm/pr-subscribers-debuginfo

Author: Greg Clayton (clayborg)

Full diff: https://github.com/llvm/llvm-project/pull/72350.diff

3 Files Affected:

(modified) llvm/include/llvm/DebugInfo/GSYM/GsymReader.h (+34)
(modified) llvm/lib/DebugInfo/GSYM/GsymReader.cpp (+53-36)
(modified) llvm/unittests/DebugInfo/GSYM/GSYMTest.cpp (+177)

diff --git a/llvm/include/llvm/DebugInfo/GSYM/GsymReader.h b/llvm/include/llvm/DebugInfo/GSYM/GsymReader.h
index cd4fdfa0e9e37d9..f7258abb707309a 100644
--- a/llvm/include/llvm/DebugInfo/GSYM/GsymReader.h
+++ b/llvm/include/llvm/DebugInfo/GSYM/GsymReader.h
@@ -266,6 +266,18 @@ class GsymReader {
       return std::nullopt;
     if (Iter == End || AddrOffset < *Iter)
       --Iter;
+
+    // GSYM files store the richest information first in the file, so always
+    // back up as much as possible as long as the address offset is the same
+    // as the previous entry.
+    while (Iter != Begin) {
+      auto Prev = Iter - 1;
+      if (*Prev == *Iter)
+        Iter = Prev;
+      else
+        break;
+    }
+
     return std::distance(Begin, Iter);
   }
 
@@ -303,6 +315,28 @@ class GsymReader {
   /// \returns An optional GSYM data offset for the offset of the FunctionInfo
   /// that needs to be decoded.
   std::optional<uint64_t> getAddressInfoOffset(size_t Index) const;
+
+  /// Given an address, find the correct function info data and function
+  /// address.
+  ///
+  /// Binary search the address table and find the matching address info
+  /// and make sure that the function info contains the address. GSYM allows
+  /// functions to overlap, and the most debug info is contained in the first
+  /// entries due to the sorting when GSYM files are created. We can have
+  /// multiple function info that start at the same address only if their
+  /// address range doesn't match. So find the first entry that matches \a Addr
+  /// and iterate forward until we find one that contains the address.
+  ///
+  /// \param[in] Addr A virtual address that matches the original object file
+  /// to lookup.
+  ///
+  /// \param[out] FuncStartAddr A virtual address that is the base address of
+  /// the function that is used for decoding the FunctionInfo.
+  ///
+  /// \returns A valid data extractor on success, or an error if we fail to
+  /// find the address in a function info or correctly decode the data.
+  llvm::Expected<llvm::DataExtractor>
+  getFunctionInfoData(uint64_t Addr, uint64_t &FuncStartAddr) const;
 };
 
 } // namespace gsym
diff --git a/llvm/lib/DebugInfo/GSYM/GsymReader.cpp b/llvm/lib/DebugInfo/GSYM/GsymReader.cpp
index 1fe90ef579a3d91..bc9ba55332347e3 100644
--- a/llvm/lib/DebugInfo/GSYM/GsymReader.cpp
+++ b/llvm/lib/DebugInfo/GSYM/GsymReader.cpp
@@ -253,49 +253,66 @@ GsymReader::getAddressIndex(const uint64_t Addr) const {
 
 }
 
-llvm::Expected<FunctionInfo> GsymReader::getFunctionInfo(uint64_t Addr) const {
-  Expected<uint64_t> AddressIndex = getAddressIndex(Addr);
-  if (!AddressIndex)
-    return AddressIndex.takeError();
-  // Address info offsets size should have been checked in parse().
-  assert(*AddressIndex < AddrInfoOffsets.size());
-  auto AddrInfoOffset = AddrInfoOffsets[*AddressIndex];
-  assert(
-      (Endian == llvm::endianness::big || Endian == llvm::endianness::little) &&
-      "Endian must be either big or little");
-  DataExtractor Data(MemBuffer->getBuffer().substr(AddrInfoOffset),
-                     Endian == llvm::endianness::little, 4);
-  if (std::optional<uint64_t> OptAddr = getAddress(*AddressIndex)) {
-    auto ExpectedFI = FunctionInfo::decode(Data, *OptAddr);
-    if (ExpectedFI) {
-      if (ExpectedFI->Range.contains(Addr) || ExpectedFI->Range.size() == 0)
-        return ExpectedFI;
+llvm::Expected<DataExtractor>
+GsymReader::getFunctionInfoData(uint64_t Addr, uint64_t &FuncAddr) const {
+  Expected<uint64_t> ExpectedAddrIdx = getAddressIndex(Addr);
+  if (!ExpectedAddrIdx)
+    return ExpectedAddrIdx.takeError();
+  const uint64_t FirstAddrIdx = *ExpectedAddrIdx;
+  std::optional<uint64_t> OptFirstAddr = getAddress(FirstAddrIdx);
+  if (!OptFirstAddr)
+    return createStringError(std::errc::invalid_argument,
+                             "failed to extract address[%" PRIu64 "]",
+                             FirstAddrIdx);
+  // The AddrIdx is the first index of the function info entries that match
+  // \a Addr. We need to iterate over all function info objects that start with
+  // the same address until we find a match.
+  const auto FirstAddr = *OptFirstAddr;
+  const size_t NumAddresses = getNumAddresses();
+  assert((Endian == endianness::big || Endian == endianness::little) &&
+         "Endian must be either big or little");
+  for (uint64_t AddrIdx = FirstAddrIdx; AddrIdx < NumAddresses; ++AddrIdx) {
+    // Extract the function address and make sure it matches FirstAddr
+    std::optional<uint64_t> OptFuncAddr = getAddress(AddrIdx);
+    if (!OptFuncAddr)
       return createStringError(std::errc::invalid_argument,
-                                "address 0x%" PRIx64 " is not in GSYM", Addr);
+                               "failed to extract address[%" PRIu64 "]",
+                               AddrIdx);
+    if (*OptFuncAddr != FirstAddr)
+      break; // Done with consecutive function info entries with same address.
+
+    // Address info offsets size should have been checked in parse().
+    auto AddrInfoOffset = AddrInfoOffsets[AddrIdx];
+    DataExtractor Data(MemBuffer->getBuffer().substr(AddrInfoOffset),
+                       Endian == llvm::endianness::little, 4);
+    uint64_t Offset = 0;
+    // Some symbols on Darwin don't have valid sizes. If we run into a symbol
+    // with zero size, then we have found a match.
+    uint32_t FuncSize = Data.getU32(&Offset);
+    if (FuncSize == 0 ||
+        AddressRange(*OptFuncAddr, *OptFuncAddr + FuncSize).contains(Addr)) {
+      FuncAddr = *OptFuncAddr;
+      return Data;
     }
   }
   return createStringError(std::errc::invalid_argument,
-                           "failed to extract address[%" PRIu64 "]",
-                           *AddressIndex);
+                           "address 0x%" PRIx64 " is not in GSYM", Addr);
+}
+
+llvm::Expected<FunctionInfo> GsymReader::getFunctionInfo(uint64_t Addr) const {
+  uint64_t FuncAddr = 0;
+  if (auto ExpectedData = getFunctionInfoData(Addr, FuncAddr))
+    return FunctionInfo::decode(*ExpectedData, FuncAddr);
+  else
+    return ExpectedData.takeError();
 }
 
 llvm::Expected<LookupResult> GsymReader::lookup(uint64_t Addr) const {
-  Expected<uint64_t> AddressIndex = getAddressIndex(Addr);
-  if (!AddressIndex)
-    return AddressIndex.takeError();
-  // Address info offsets size should have been checked in parse().
-  assert(*AddressIndex < AddrInfoOffsets.size());
-  auto AddrInfoOffset = AddrInfoOffsets[*AddressIndex];
-  assert(
-      (Endian == llvm::endianness::big || Endian == llvm::endianness::little) &&
-      "Endian must be either big or little");
-  DataExtractor Data(MemBuffer->getBuffer().substr(AddrInfoOffset),
-                     Endian == llvm::endianness::little, 4);
-  if (std::optional<uint64_t> OptAddr = getAddress(*AddressIndex))
-    return FunctionInfo::lookup(Data, *this, *OptAddr, Addr);
-  return createStringError(std::errc::invalid_argument,
-                           "failed to extract address[%" PRIu64 "]",
-                           *AddressIndex);
+  uint64_t FuncAddr = 0;
+  if (auto ExpectedData = getFunctionInfoData(Addr, FuncAddr))
+    return FunctionInfo::lookup(*ExpectedData, *this, FuncAddr, Addr);
+  else
+    return ExpectedData.takeError();
 }
 
 void GsymReader::dump(raw_ostream &OS) {
diff --git a/llvm/unittests/DebugInfo/GSYM/GSYMTest.cpp b/llvm/unittests/DebugInfo/GSYM/GSYMTest.cpp
index 53de96cc6953c2d..fec75d1ca619ca9 100644
--- a/llvm/unittests/DebugInfo/GSYM/GSYMTest.cpp
+++ b/llvm/unittests/DebugInfo/GSYM/GSYMTest.cpp
@@ -4681,3 +4681,180 @@ TEST(GSYMTest, TestHandlingOfInvalidFileIndexes) {
   for (const auto &Error : ExpectedLogErrors)
     EXPECT_TRUE(errors.find(Error) != std::string::npos);
 }
+
+TEST(GSYMTest, TestLookupsOfOverlappingAndUnequalRanges) {
+  // Test that llvm-gsymutil looks up the correct function info when address
+  // ranges overlap. When functions overlap we always want to pick the first
+  // function info when symbolicating if there are multiple entries with the
+  // same address. Previous to this fix we would just binary search the address
+  // table and pick the first function info that matched the address. After
+  // this fix we now always select the first matching entry whose address range
+  // contains the lookup address to ensure we have the most debug info. We have
+  // seen cases where the debug info would contain a small range and a symbol
+  // would have the same start address but the range was larger and sometimes,
+  // depending on how the binary search of the address table happened, we would
+  // pick these latter entries. We want the first entries because they always
+  // have the most debug info.
+  //
+  // To repro this case, we just make some simple DWARF that has two
+  // overlapping ranges and ensure that any lookups between 0x1000 and 0x104f
+  // match "foo", and any lookups between 0x1050 and 0x10ff match "bar".
+  //
+  // 0x0000000b: DW_TAG_compile_unit
+  //               DW_AT_name	("/tmp/main.cpp")
+  //               DW_AT_language	(DW_LANG_C)
+  //               DW_AT_stmt_list	(0x00000000)
+  //
+  // 0x00000015:   DW_TAG_subprogram
+  //                 DW_AT_name	("foo")
+  //                 DW_AT_low_pc	(0x0000000000001000)
+  //                 DW_AT_high_pc	(0x0000000000001050)
+  //
+  // 0x0000002a:   DW_TAG_subprogram
+  //                 DW_AT_name	("bar")
+  //                 DW_AT_low_pc	(0x0000000000001000)
+  //                 DW_AT_high_pc	(0x0000000000001100)
+  //
+  // 0x0000003f:   NULL
+
+  StringRef yamldata = R"(
+  debug_str:
+    - ''
+    - '/tmp/main.cpp'
+    - foo
+    - bar
+  debug_abbrev:
+    - ID:              0
+      Table:
+        - Code:            0x1
+          Tag:             DW_TAG_compile_unit
+          Children:        DW_CHILDREN_yes
+          Attributes:
+            - Attribute:       DW_AT_name
+              Form:            DW_FORM_strp
+            - Attribute:       DW_AT_language
+              Form:            DW_FORM_udata
+            - Attribute:       DW_AT_stmt_list
+              Form:            DW_FORM_sec_offset
+        - Code:            0x2
+          Tag:             DW_TAG_subprogram
+          Children:        DW_CHILDREN_no
+          Attributes:
+            - Attribute:       DW_AT_name
+              Form:            DW_FORM_strp
+            - Attribute:       DW_AT_low_pc
+              Form:            DW_FORM_addr
+            - Attribute:       DW_AT_high_pc
+              Form:            DW_FORM_addr
+  debug_info:
+    - Length:          0x3C
+      Version:         4
+      AbbrevTableID:   0
+      AbbrOffset:      0x0
+      AddrSize:        8
+      Entries:
+        - AbbrCode:        0x1
+          Values:
+            - Value:           0x1
+            - Value:           0x2
+            - Value:           0x0
+        - AbbrCode:        0x2
+          Values:
+            - Value:           0xF
+            - Value:           0x1000
+            - Value:           0x1050
+        - AbbrCode:        0x2
+          Values:
+            - Value:           0x13
+            - Value:           0x1000
+            - Value:           0x1100
+        - AbbrCode:        0x0
+  debug_line:
+    - Length:          71
+      Version:         2
+      PrologueLength:  36
+      MinInstLength:   1
+      DefaultIsStmt:   1
+      LineBase:        251
+      LineRange:       14
+      OpcodeBase:      13
+      StandardOpcodeLengths: [ 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1 ]
+      IncludeDirs:
+        - '/tmp'
+      Files:
+        - Name:            main.cpp
+          DirIdx:          1
+          ModTime:         0
+          Length:          0
+      Opcodes:
+        - Opcode:          DW_LNS_extended_op
+          ExtLen:          9
+          SubOpcode:       DW_LNE_set_address
+          Data:            4096
+        - Opcode:          DW_LNS_advance_line
+          SData:           9
+          Data:            0
+        - Opcode:          DW_LNS_copy
+          Data:            0
+        - Opcode:          DW_LNS_advance_pc
+          Data:            16
+        - Opcode:          DW_LNS_advance_line
+          SData:           1
+          Data:            0
+        - Opcode:          DW_LNS_copy
+          Data:            0
+        - Opcode:          DW_LNS_advance_line
+          SData:           1
+          Data:            0
+        - Opcode:          DW_LNS_copy
+          Data:            0
+        - Opcode:          DW_LNS_advance_pc
+          Data:            64
+        - Opcode:          DW_LNS_advance_line
+          SData:           1
+          Data:            0
+        - Opcode:          DW_LNS_extended_op
+          ExtLen:          1
+          SubOpcode:       DW_LNE_end_sequence
+          Data:            0
+  )";
+  auto ErrOrSections = DWARFYAML::emitDebugSections(yamldata);
+  ASSERT_THAT_EXPECTED(ErrOrSections, Succeeded());
+  std::unique_ptr<DWARFContext> DwarfContext =
+      DWARFContext::create(*ErrOrSections, 8);
+  ASSERT_TRUE(DwarfContext.get() != nullptr);
+  std::string errors;
+  raw_string_ostream OS(errors);
+  GsymCreator GC;
+  DwarfTransformer DT(*DwarfContext, GC);
+  const uint32_t ThreadCount = 1;
+  ASSERT_THAT_ERROR(DT.convert(ThreadCount, &OS), Succeeded());
+  ASSERT_THAT_ERROR(GC.finalize(OS), Succeeded());
+  OS.flush();
+  SmallString<512> Str;
+  raw_svector_ostream OutStrm(Str);
+  const auto ByteOrder = llvm::endianness::native;
+  FileWriter FW(OutStrm, ByteOrder);
+  ASSERT_THAT_ERROR(GC.encode(FW), Succeeded());
+  Expected<GsymReader> GR = GsymReader::copyBuffer(OutStrm.str());
+  ASSERT_THAT_EXPECTED(GR, Succeeded());
+  // There should be two functions in our GSYM.
+  EXPECT_EQ(GR->getNumAddresses(), 2u);
+  // Verify "foo" is correctly looked up for each of its addresses.
+  for (uint64_t Addr = 0x1000; Addr < 0x1050; ++Addr) {
+    auto ExpFI = GR->getFunctionInfo(Addr);
+    ASSERT_THAT_EXPECTED(ExpFI, Succeeded());
+    ASSERT_EQ(ExpFI->Range, AddressRange(0x1000, 0x1050));
+    StringRef FuncName = GR->getString(ExpFI->Name);
+    EXPECT_EQ(FuncName, "foo");
+  }
+
+  // Verify "bar" is correctly looked up for each of its addresses.
+  for (uint64_t Addr = 0x1050; Addr < 0x1100; ++Addr) {
+    auto ExpFI = GR->getFunctionInfo(Addr);
+    ASSERT_THAT_EXPECTED(ExpFI, Succeeded());
+    ASSERT_EQ(ExpFI->Range, AddressRange(0x1000, 0x1100));
+    StringRef FuncName = GR->getString(ExpFI->Name);
+    EXPECT_EQ(FuncName, "bar");
+  }
+}

dwblaikie · 2023-11-15T16:49:22Z

> we had debug info for a function with a range like [0x100-0x200) and a symbol at the same start address yet with a larger range like [0x100-0x300)

Do you have a test case for where this happens? If clang/llvm are producing debug info like that I'd be concerned/want to look into fixing it.

clayborg · 2023-11-15T18:52:55Z

> > we had debug info for a function with a range like [0x100-0x200) and a symbol at the same start address yet with a larger range like [0x100-0x300)
>
> Do you have a test case for where this happens? If clang/llvm are producing debug info like that I'd be concerned/want to look into fixing it.

This is Haskell debugging related where the compiler is producing debug info with a small sized DW_TAG_subprogram and the symbol table symbol has a much larger range. I don't believe clang supports Haskell. The DW_AT_producer in this case is:

DW_AT_producer    ("The Glorious Glasgow Haskell Compilation System 8.8.3")

dwblaikie · 2023-11-15T19:03:05Z

> > > we had debug info for a function with a range like [0x100-0x200) and a symbol at the same start address yet with a larger range like [0x100-0x300)
> >
> > Do you have a test case for where this happens? If clang/llvm are producing debug info like that I'd be concerned/want to look into fixing it.
>
> This is Haskell debugging related where the compiler is producing debug info with a small sized DW_TAG_subprogram and the symbol table symbol has a much larger range. I don't believe clang supports Haskell. The DW_AT_producer in this case is:
> DW_AT_producer    ("The Glorious Glasgow Haskell Compilation System 8.8.3")

ah, thanks for the details. Yeah, I can see how merging both the ELF symbol table and the DWARF description could end up with differences, possibly due to whether the alignment padding gets included in one description or the other - probably better that it not be included in either, but yeah, such is life.

ayermolo · 2023-11-16T00:17:24Z

llvm/lib/DebugInfo/GSYM/GsymReader.cpp

+      break; // Done with consecutive function info entries with same address.
+
+    // Address info offsets size should have been checked in parse().
+    auto AddrInfoOffset = AddrInfoOffsets[AddrIdx];


I realize it was auto in original code, but can we change it to actual type?

ayermolo · 2023-11-16T00:18:48Z

llvm/include/llvm/DebugInfo/GSYM/GsymReader.h

@@ -266,6 +266,18 @@ class GsymReader {
      return std::nullopt;
    if (Iter == End || AddrOffset < *Iter)
      --Iter;
+
+    // GSYM files store the richest information first in the file, so always


What do you mean by richest?

github-actions · 2023-11-16T22:38:47Z

✅ With the latest revision this PR passed the C/C++ code formatter.

Fixed a case where when we dumped an entire GSYM file we would dump the wrong FunctionInfo for a given address entry. This was because we would lookup a function info by address when dumping a specific address table entry. If the address table had multiple addresses with different ranges, it would always emit the one that was found via lookup by address.
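The dumping problem described above can be illustrated with a small sketch. It is not the GsymReader dump code: `Row`, `lookupByAddress`, `dumpViaLookup` and `dumpByIndex` are hypothetical names, and a linear scan stands in for the real binary search. The point is only that resolving each table row through a lookup by its start address collapses rows that share an address, whereas reading row i directly does not.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for one address table row and its function info.
struct Row {
  uint64_t Start;   // Function start address.
  uint32_t Size;    // Size in bytes; 0 means the size is unknown.
  std::string Name;
};

// Lookup by address: the first row whose range contains Addr.
// A linear scan keeps the sketch short.
const Row *lookupByAddress(const std::vector<Row> &Table, uint64_t Addr) {
  for (const Row &R : Table)
    if (Addr >= R.Start && (R.Size == 0 || Addr - R.Start < R.Size))
      return &R;
  return nullptr;
}

// Problematic dump: each row is re-resolved through its start address, so
// rows that share a start address all print the same function info.
std::vector<std::string> dumpViaLookup(const std::vector<Row> &Table) {
  std::vector<std::string> Out;
  for (const Row &R : Table) {
    const Row *Found = lookupByAddress(Table, R.Start);
    Out.push_back(Found ? Found->Name : "<none>");
  }
  return Out;
}

// Dump by index: print the data that belongs to row i itself.
std::vector<std::string> dumpByIndex(const std::vector<Row> &Table) {
  std::vector<std::string> Out;
  for (const Row &R : Table)
    Out.push_back(R.Name);
  return Out;
}
```

For a table with "foo" at [0x1000-0x1050) and "bar" at [0x1000-0x1100), the first function yields "foo" twice while the second yields "foo" then "bar".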

clayborg requested review from kusmour, jeffreytan81 and ayermolo November 15, 2023 06:08

llvmbot added the debuginfo label Nov 15, 2023

clayborg self-assigned this Nov 15, 2023

ayermolo reviewed Nov 16, 2023

kusmour approved these changes Nov 17, 2023

clayborg added 4 commits November 17, 2023 10:29

Fix comment typos.

22f038b

clang-format

3849c76

clayborg force-pushed the gsym-overlap-lookup branch from 4e1fdb6 to 3849c76 Compare November 17, 2023 18:29

clayborg merged commit 18eefc1 into llvm:main Nov 17, 2023
2 of 3 checks passed
