
Reduce llvm-gsymutil memory usage #91023

Open
wants to merge 10 commits into
base: main

Conversation

@kevinfrei (Contributor) commented May 3, 2024

llvm-gsymutil eats a lot of RAM. On some large binaries it causes OOMs on smaller hardware, consuming well over 64 GB of RAM. This change frees line tables once we're done with them, and frees each DWARFUnit's DIEs when we finish processing that unit, though they may get reconstituted if other units reference them during processing. Once the conversion is complete, all DIEs are freed. These changes reduced peak memory usage by 7-12% in my tests.

The double-checked locking around the creation and freeing of these data structures was tested on a 166-core system. I validated that it reliably malfunctioned without the locks (or with the locks deliberately misordered) and worked correctly with them.
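For readers unfamiliar with the idiom, here is a minimal, self-contained sketch of the double-checked read/write-locking the patch applies to DieArray. It uses std::shared_mutex as a stand-in for llvm::sys::RWMutex, and the names are illustrative rather than taken from the PR:

```cpp
#include <mutex>
#include <shared_mutex>
#include <vector>

struct Entry { int Data; }; // stands in for DWARFDebugInfoEntry

class Unit {
  std::vector<Entry> DieArray;
  mutable std::shared_mutex DieArrayMutex; // plays the role of llvm::sys::RWMutex

public:
  // Double-checked locking: a cheap shared-lock check first, then a re-check
  // under the exclusive lock before doing the expensive parse.
  void extractIfNeeded() {
    {
      std::shared_lock<std::shared_mutex> Read(DieArrayMutex);
      if (!DieArray.empty())
        return; // already parsed
    }
    std::unique_lock<std::shared_mutex> Write(DieArrayMutex);
    if (!DieArray.empty())
      return; // another thread won the race between the two lock acquisitions
    DieArray.push_back(Entry{42}); // the expensive parse happens here
  }

  // Freeing takes the exclusive lock too, so it cannot interleave with a parse.
  void clear() {
    std::unique_lock<std::shared_mutex> Write(DieArrayMutex);
    DieArray = std::vector<Entry>();
  }
};
```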

@kevinfrei marked this pull request as ready for review May 6, 2024 16:13
@llvmbot (Collaborator) commented May 6, 2024

@llvm/pr-subscribers-debuginfo

Author: Kevin Frei (kevinfrei)


Full diff: https://github.com/llvm/llvm-project/pull/91023.diff

3 Files Affected:

  • (modified) llvm/include/llvm/DebugInfo/DWARF/DWARFUnit.h (+7-3)
  • (modified) llvm/lib/DebugInfo/DWARF/DWARFUnit.cpp (+25-7)
  • (modified) llvm/lib/DebugInfo/GSYM/DwarfTransformer.cpp (+14-1)
diff --git a/llvm/include/llvm/DebugInfo/DWARF/DWARFUnit.h b/llvm/include/llvm/DebugInfo/DWARF/DWARFUnit.h
index 80c27aea893123..9614aab8bb9b50 100644
--- a/llvm/include/llvm/DebugInfo/DWARF/DWARFUnit.h
+++ b/llvm/include/llvm/DebugInfo/DWARF/DWARFUnit.h
@@ -22,6 +22,7 @@
 #include "llvm/DebugInfo/DWARF/DWARFLocationExpression.h"
 #include "llvm/DebugInfo/DWARF/DWARFUnitIndex.h"
 #include "llvm/Support/DataExtractor.h"
+#include "llvm/Support/RWMutex.h"
 #include <cassert>
 #include <cstddef>
 #include <cstdint>
@@ -257,6 +258,9 @@ class DWARFUnit {
 
   std::shared_ptr<DWARFUnit> DWO;
 
+  mutable llvm::sys::RWMutex m_cu_die_array_mutex;
+  mutable llvm::sys::RWMutex m_all_die_array_mutex;
+
 protected:
   friend dwarf_linker::parallel::CompileUnit;
 
@@ -566,6 +570,9 @@ class DWARFUnit {
 
   Error tryExtractDIEsIfNeeded(bool CUDieOnly);
 
+  /// clearDIEs - Clear parsed DIEs to keep memory usage low.
+  void clearDIEs(bool KeepCUDie);
+
 private:
   /// Size in bytes of the .debug_info data associated with this compile unit.
   size_t getDebugInfoSize() const {
@@ -581,9 +588,6 @@ class DWARFUnit {
   void extractDIEsToVector(bool AppendCUDie, bool AppendNonCUDIEs,
                            std::vector<DWARFDebugInfoEntry> &DIEs) const;
 
-  /// clearDIEs - Clear parsed DIEs to keep memory usage low.
-  void clearDIEs(bool KeepCUDie);
-
   /// parseDWO - Parses .dwo file for current compile unit. Returns true if
   /// it was actually constructed.
   /// The \p AlternativeLocation specifies an alternative location to get
diff --git a/llvm/lib/DebugInfo/DWARF/DWARFUnit.cpp b/llvm/lib/DebugInfo/DWARF/DWARFUnit.cpp
index bdd04b00f557bd..cc79d9ec7130c9 100644
--- a/llvm/lib/DebugInfo/DWARF/DWARFUnit.cpp
+++ b/llvm/lib/DebugInfo/DWARF/DWARFUnit.cpp
@@ -496,13 +496,29 @@ void DWARFUnit::extractDIEsIfNeeded(bool CUDieOnly) {
 }
 
 Error DWARFUnit::tryExtractDIEsIfNeeded(bool CUDieOnly) {
-  if ((CUDieOnly && !DieArray.empty()) ||
-      DieArray.size() > 1)
-    return Error::success(); // Already parsed.
-
-  bool HasCUDie = !DieArray.empty();
-  extractDIEsToVector(!HasCUDie, !CUDieOnly, DieArray);
-
+  {
+    llvm::sys::ScopedReader Lock(m_cu_die_array_mutex);
+    if ((CUDieOnly && !DieArray.empty()) || DieArray.size() > 1)
+      return Error::success(); // Already parsed.
+  }
+  bool HasCUDie = false;
+  {
+    llvm::sys::ScopedWriter Lock(m_cu_die_array_mutex);
+    if ((CUDieOnly && !DieArray.empty()) || DieArray.size() > 1)
+      return Error::success(); // Already parsed.
+    HasCUDie = !DieArray.empty();
+    extractDIEsToVector(!HasCUDie, !CUDieOnly, DieArray);
+  }
+  {
+    llvm::sys::ScopedReader Lock(m_all_die_array_mutex);
+    if (DieArray.empty())
+      return Error::success();
+
+    // If CU DIE was just parsed, copy several attribute values from it.
+    if (HasCUDie)
+      return Error::success();
+  }
+  llvm::sys::ScopedWriter Lock(m_all_die_array_mutex);
   if (DieArray.empty())
     return Error::success();
 
@@ -658,6 +674,8 @@ void DWARFUnit::clearDIEs(bool KeepCUDie) {
   // It depends on the implementation whether the request is fulfilled.
   // Create a new vector with a small capacity and assign it to the DieArray to
   // have previous contents freed.
+  llvm::sys::ScopedWriter CULock(m_cu_die_array_mutex);
+  llvm::sys::ScopedWriter AllLock(m_all_die_array_mutex);
   DieArray = (KeepCUDie && !DieArray.empty())
                  ? std::vector<DWARFDebugInfoEntry>({DieArray[0]})
                  : std::vector<DWARFDebugInfoEntry>();
diff --git a/llvm/lib/DebugInfo/GSYM/DwarfTransformer.cpp b/llvm/lib/DebugInfo/GSYM/DwarfTransformer.cpp
index 601686fdd3dd51..4a1ed91a7244f2 100644
--- a/llvm/lib/DebugInfo/GSYM/DwarfTransformer.cpp
+++ b/llvm/lib/DebugInfo/GSYM/DwarfTransformer.cpp
@@ -587,6 +587,11 @@ Error DwarfTransformer::convert(uint32_t NumThreads, OutputAggregator &Out) {
       DWARFDie Die = getDie(*CU);
       CUInfo CUI(DICtx, dyn_cast<DWARFCompileUnit>(CU.get()));
       handleDie(Out, CUI, Die);
+      // Release the line table, once we're done.
+      DICtx.clearLineTableForUnit(CU.get());
+      // Free any DIEs that were allocated by the DWARF parser.
+      // If/when they're needed by other CU's, they'll be recreated.
+      CU->clearDIEs(false);
     }
   } else {
     // LLVM Dwarf parser is not thread-safe and we need to parse all DWARF up
@@ -612,11 +617,16 @@ Error DwarfTransformer::convert(uint32_t NumThreads, OutputAggregator &Out) {
       DWARFDie Die = getDie(*CU);
       if (Die) {
         CUInfo CUI(DICtx, dyn_cast<DWARFCompileUnit>(CU.get()));
-        pool.async([this, CUI, &LogMutex, &Out, Die]() mutable {
+        pool.async([this, CUI, &CU, &LogMutex, &Out, Die]() mutable {
           std::string storage;
           raw_string_ostream StrStream(storage);
           OutputAggregator ThreadOut(Out.GetOS() ? &StrStream : nullptr);
           handleDie(ThreadOut, CUI, Die);
+          // Release the line table once we're done.
+          DICtx.clearLineTableForUnit(CU.get());
+          // Free any DIEs that were allocated by the DWARF parser.
+          // If/when they're needed by other CU's, they'll be recreated.
+          CU->clearDIEs(false);
           // Print ThreadLogStorage lines into an actual stream under a lock
           std::lock_guard<std::mutex> guard(LogMutex);
           if (Out.GetOS()) {
@@ -629,6 +639,9 @@ Error DwarfTransformer::convert(uint32_t NumThreads, OutputAggregator &Out) {
     }
     pool.wait();
   }
+  // Now get rid of all the DIEs that may have been recreated
+  for (const auto &CU : DICtx.compile_units())
+    CU->clearDIEs(false);
   size_t FunctionsAddedCount = Gsym.getNumFunctionInfos() - NumBefore;
   Out << "Loaded " << FunctionsAddedCount << " functions from DWARF.\n";
   return Error::success();

@kevinfrei (Author) commented:

@clayborg @jeffreytan81 could one of you please take a look?

@jeffreytan81 (Contributor) left a comment:

Per my understanding, these two mutexes are required to protect any read/write of the DieArray field? If so, there seem to be many more accesses of this field that are not protected.

Resolved review threads: llvm/lib/DebugInfo/GSYM/DwarfTransformer.cpp (2), llvm/include/llvm/DebugInfo/DWARF/DWARFUnit.h (3), llvm/lib/DebugInfo/DWARF/DWARFUnit.cpp (1)
@kevinfrei (Author) commented May 13, 2024:

> Per my understanding, these two mutexes are required to protect any read/write of the DieArray field? If so, there seem to be many more accesses of this field that are not protected.

They're there to protect the allocation and population, or the destruction, of the DieArray, not general reads and writes of that field. There need to be two (one for just the first CU DIE, and one for the whole array) because populating the whole array can trigger a request to populate the CU DIE.

@clayborg (Collaborator) left a comment:

I am fine with the GSYM parts. I will let the LLVM DWARF experts accept the changes in DWARFUnit.h/.cpp

Resolved review thread: llvm/lib/DebugInfo/GSYM/DwarfTransformer.cpp
@ayermolo (Contributor) commented:

> They're there to protect the allocation and population, or the destruction, of the DieArray, not general reads and writes of that field. There need to be two (one for just the first CU DIE, and one for the whole array) as populating the first can trigger a request to populate the rest.

What is the code path for that?

@kevinfrei (Author) commented May 14, 2024:

> What is the code path for that?

Sorry: I wrote that backwards (I've fixed the original comment). Populating the full array can trigger a request to populate the CU DIE through determineStringOffsetsTableContribution, which is why you need the two different locks. The CU DIE is always created before the rest of the list, but you need to lock it separately so that the read lock is free to return just the CU DIE.

@ayermolo (Contributor) commented:

> Sorry: I wrote that backwards (I've fixed the original comment). Populating the full array can trigger a request to populate the CU DIE through determineStringOffsetsTableContribution, which is why you need the two different locks.

Glanced at the code. I think I am missing something. How can the path through determineStringOffsetsTableContribution trigger that? It just accesses the StrOffsets section.

@kevinfrei (Author) commented:

> Glanced at the code. I think I am missing something. How can the path through determineStringOffsetsTableContribution trigger that? It just accesses the StrOffsets section.

Yeah, I didn't see it by looking at that code either, because the recursion occurs inside getUnitDIE() (in the .h file), which calls extractDIEsIfNeeded(...).

@kevinfrei (Author) commented:

ping @ayermolo @dwblaikie This is blocking some DWARF-health tracking work I'm trying to make headway on. The only 'stylistic' adjustment I can see making sense is to have some sort of double-checked-locking abstraction that takes a lock and a couple of lambdas to make the duplicated code cleaner, which I'm happy to do if someone wants it, but I haven't seen anything like that elsewhere.

@ayermolo (Contributor) commented:

OK, I think this two-lock scheme makes sense in my head.
One concern I have is that between the lock acquisitions in tryExtractDIEsIfNeeded, another thread could switch to clearDIEs.

@kevinfrei (Author) commented:

> OK, I think this two-lock scheme makes sense in my head. One concern I have is that between the lock acquisitions in tryExtractDIEsIfNeeded, another thread could switch to clearDIEs.

Oooh! You're right: There's a spot in there where bad things could happen. I'll get that fixed.

@kevinfrei (Author) commented:

@ayermolo I used an RW-lock model to prevent the issue you identified. Good catch!

@kevinfrei (Author) commented:

To alleviate copy-paste problems, I've added a local DoubleCheckedRWLocker helper function that takes the "check if it's already been initialized" lambda and the initializer lambda (and the lock). I feel like the result is a much cleaner version of the original diff, plus it has the third lock to prevent frees from occurring between the single-CU-DIE and full-array initialization code, which is the issue @ayermolo saw. @dwblaikie they tell me you're the one who needs to approve this now...
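The helper itself isn't shown in this thread, so the following is only a hypothetical sketch of what a DoubleCheckedRWLocker-style wrapper could look like, again with std::shared_mutex standing in for llvm::sys::RWMutex; the real helper reportedly also threads error handling through with move semantics, which this sketch omits:

```cpp
#include <mutex>
#include <shared_mutex>

// Run the classic read-check / write-recheck / initialize sequence around one
// RW mutex: AlreadyDone() decides whether work is needed, Init() does it.
template <typename CheckFn, typename InitFn>
void DoubleCheckedRWLocker(std::shared_mutex &M, CheckFn AlreadyDone,
                           InitFn Init) {
  {
    std::shared_lock<std::shared_mutex> Read(M);
    if (AlreadyDone())
      return; // fast path: no work, only a shared lock taken
  }
  std::unique_lock<std::shared_mutex> Write(M);
  if (!AlreadyDone()) // re-check: another thread may have initialized already
    Init();
}

// Usage sketch (illustrative names):
//   DoubleCheckedRWLocker(DieArrayMutex,
//       [&] { return !DieArray.empty(); }, // "already initialized?" check
//       [&] { parseAllDIEs(); });          // initializer
```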

@dwblaikie (Collaborator) commented:

Ah, I thought this was all in gsym, so hadn't given the review much attention.

Are these independent changes - the line table clearing stuff presumably is independent, for instance? I guess the unit DIE clearing is dependent on the extra locking being added?

The extra locking is unfortunate/more difficult to just straight up approve, as we'd previously restricted the multithreading support to the DWARFContext by having distinct contexts - so that non-multithreaded code didn't take the hit or complexity for the multithreaded code. But this new codepath and locking is present for single and multithreaded uses alike, by the looks of it? Not sure what to do about that...

@kevinfrei (Author) commented:

> Are these independent changes - the line table clearing stuff presumably is independent, for instance? I guess the unit DIE clearing is dependent on the extra locking being added?

I'm only doing one "real" thing here, so no, the changes can't be split into smaller, still useful/sensible pieces (except that I could delay the gsymutil change until after the dwarfunit support for releasing the memory is in, which doesn't seem worthwhile)

> The extra locking is unfortunate/more difficult to just straight up approve, as we'd previously restricted the multithreading support to the DWARFContext by having distinct contexts - so that non-multithreaded code didn't take the hit or complexity for the multithreaded code. But this new codepath and locking is present for single and multithreaded uses alike, by the looks of it? Not sure what to do about that...

I spent about three days staring at how to accomplish this in a way that wouldn't increase the cost of single-threaded code, and couldn't come up with anything other than copying the entire class and making a multithreaded 'freeable' duplicate, which seemed like an awful lot of code duplication to save a few hundred CPU cycles in code that is egregiously I/O bound. (I have a 166-core server: gsymutil works fastest when it's run on 6-10 cores; any more swamps the NVMe drive.)

@kevinfrei (Author) commented:

All that said, apparently something broke a bunch of tests (I hadn't run them since my last couple changes) so I'm not blocked on your review just yet :/

@kevinfrei (Author) commented:

@dwblaikie If you could take a look now, I'd appreciate it. Using move semantics to force error handling makes for extra fun with lambdas :)

return ((CUDieOnly && !DieArray.empty()) || DieArray.size() > 1);
};

// Lambda to extract the CU DIE.
@dwblaikie (Collaborator) commented on the snippet above:

Not sure if the comment (& other ones like it above/below on each lambda, etc) helps, since it's mostly captured in the lambda name? (& it's two lines of code, so probably self explanatory anyway?)

How badly does this whole chunk of code format if you inline the lambdas into the DoubleCheckedRWLocker call? (rather than having separate named lambdas to pass to it)

@dwblaikie (Collaborator) commented:

Oh, and usually for a lambda that's only used within its scope, I wouldn't bother with explicit capture lists and would just use [&] by default - but perhaps due to these things happening under locks, etc., it might be worth the extra clarity - I'm not 100% sure, though.

Again, how's this code look if it's all in the one DoubleCheckedRWLocker call, with default capture?

@kevinfrei (Author) commented:

I showed the code to a coworker with a lot more recent LLVM experience, and he suggested inlining the lambdas (which was nearly unreadable in the original attempt) but breaking the code into pieces so you can see the locking at each point (which makes the code more legible). I'm going to get that change up, hopefully by end of day, and then you can tell me which you prefer.

@ayermolo (Contributor) commented Jun 6, 2024:

I vote for readable. :D

I think I remember your internal one, and this one is more readable, at least to me.

@kevinfrei (Author) commented:

> I vote for readable. :D
>
> I think I remember your internal one, and this one is more readable, at least to me.

Yeah, this is better than the previous iteration, but broken up into even smaller pieces it looks downright well thought out. Update forthcoming...

Comment on lines 714 to 716
llvm::sys::ScopedWriter FreeLock(CUDieFreeMutex);
llvm::sys::ScopedWriter CULock(CUDieArrayMutex);
llvm::sys::ScopedWriter AllLock(AllDieArrayMutex);
@kevinfrei (Author) commented:

I believe taking all 3 locks is overkill, as getting the CUDieFreeMutex writer lock will block until the CUDie and AllDie locks are free. Thoughts?

@dwblaikie (Collaborator) commented:

Yeah, sounds about right.

And perhaps once the other two are removed, the code will be basically the same as it was before this patch (except for the two lock acquisitions), and so a lot of the concerns/questions/issues (the newly subdivided functions and names are quite confusing to me, but I've deleted my comments on that so as not to confuse things further if it's all going away anyway) go away?

@kevinfrei (Author) commented Jun 7, 2024:

I can't remove the other two locks. They have to stay, as the freeing mechanism requires them. The third lock (Free) is specifically to deal with the issue that @ayermolo pointed out a couple of revisions ago: freeing could free just the CU DIE while the non-CU DIEs were being initialized, resulting in an invalid state where there is no CU DIE but the non-CU DIEs are parsed. The entire process is kind of a mess, but I need a "CU DIE initialization" RW lock plus a "non-CU DIE initialization" lock as two separate entities, because initializing the non-CU DIEs can trigger recursion (you have to be able to grab the reader lock of the CU DIE init to tell that it's already been initialized, then bail, while still holding the non-CU DIE writer lock).
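To make that re-entrancy argument concrete, here is a simplified two-lock sketch (illustrative names, std::shared_mutex in place of llvm::sys::RWMutex, and without the third "free" lock discussed above). The point is that populating the non-CU DIEs holds the array writer lock while re-checking the CU DIE under its own reader lock, which cannot deadlock because the two mutexes are distinct:

```cpp
#include <mutex>
#include <shared_mutex>
#include <vector>

struct DIE { unsigned Tag; };

class Unit {
  std::vector<DIE> DieArray;
  mutable std::shared_mutex CUDieMutex;  // guards the lone CU DIE (index 0)
  mutable std::shared_mutex AllDieMutex; // guards the rest of the array

  bool hasCUDie() const {
    std::shared_lock<std::shared_mutex> Read(CUDieMutex);
    return !DieArray.empty();
  }

  void parseCUDieIfNeeded() {
    if (hasCUDie())
      return;
    std::unique_lock<std::shared_mutex> Write(CUDieMutex);
    if (DieArray.empty())
      DieArray.push_back(DIE{0x11}); // DW_TAG_compile_unit
  }

public:
  void parseAllDIEs() {
    parseCUDieIfNeeded();
    std::unique_lock<std::shared_mutex> Write(AllDieMutex);
    if (DieArray.size() > 1)
      return; // already fully parsed
    // Populating the non-CU DIEs may need to re-check the CU DIE (the
    // recursion described above); that only takes the CUDieMutex reader
    // lock, so holding the AllDieMutex writer lock here cannot deadlock.
    if (hasCUDie())
      DieArray.push_back(DIE{0x2e}); // e.g. DW_TAG_subprogram
  }
};
```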
