[Object][Archive][NFC] Create all symbolic files objects before calculating offsets. #85229

cjacek · 2024-03-14T14:41:38Z

This is refactoring preparing to move UseECMap computation to the archive writer. We currently require writeArchive caller to pass that. This is not practical for llvm-ar, which currently interprets at most one passed object. For a reliable UseECMap, we need to interpret all symbolic objects: we may have, for example, a list of x86_64 files followed by aarch64 file, which indicates that we should use EC map for x86_64 objects.

This commit interprets symbolic files in a separated pass, which will be a convenient place to implement UseECMap computation in the follow up. I think it also makes accessing the next member for AIX big archive offset computation a bit easier.

…lating offsets. This is refactoring preparing to move UseECMap computation to the archive writer. We currently require writeArchive caller to pass that. This is not practical for llvm-ar, which currently interprets at most one passed object. For a reliable UseECMap, we need to interpret all symbolic objects: we may have, for example, a list of x86_64 files followed by aarch64 file, which indicates that we should use EC map for x86_64 objects. This commit interprets symbolic files in a separated pass, which will be a convenient place to implement UseECMap computation in the follow up. I think it also makes accessing the next member for AIX big archive offset computation a bit easier.

llvmbot · 2024-03-14T14:42:11Z

@llvm/pr-subscribers-llvm-binary-utilities

Author: Jacek Caban (cjacek)

Changes

This is refactoring preparing to move UseECMap computation to the archive writer. We currently require writeArchive caller to pass that. This is not practical for llvm-ar, which currently interprets at most one passed object. For a reliable UseECMap, we need to interpret all symbolic objects: we may have, for example, a list of x86_64 files followed by aarch64 file, which indicates that we should use EC map for x86_64 objects.

This commit interprets symbolic files in a separated pass, which will be a convenient place to implement UseECMap computation in the follow up. I think it also makes accessing the next member for AIX big archive offset computation a bit easier.

Full diff: https://github.com/llvm/llvm-project/pull/85229.diff

1 Files Affected:

(modified) llvm/lib/Object/ArchiveWriter.cpp (+23-35)

diff --git a/llvm/lib/Object/ArchiveWriter.cpp b/llvm/lib/Object/ArchiveWriter.cpp
index aa57e55de70c8d..97d0e4f75be8e0 100644
--- a/llvm/lib/Object/ArchiveWriter.cpp
+++ b/llvm/lib/Object/ArchiveWriter.cpp
@@ -795,23 +795,31 @@ computeMemberData(raw_ostream &StringTable, raw_ostream &SymNames,
       Entry.second = Entry.second > 1 ? 1 : 0;
   }
 
+  std::vector<std::unique_ptr<SymbolicFile>> SymFiles;
+
+  if (NeedSymbols != SymtabWritingMode::NoSymtab || isAIXBigArchive(Kind)) {
+    for (const NewArchiveMember &M : NewMembers) {
+      Expected<std::unique_ptr<SymbolicFile>> SymFileOrErr =
+          getSymbolicFile(M.Buf->getMemBufferRef(), Context);
+      if (!SymFileOrErr)
+        return createFileError(M.MemberName, SymFileOrErr.takeError());
+      SymFiles.push_back(std::move(*SymFileOrErr));
+    }
+  }
+
   // The big archive format needs to know the offset of the previous member
   // header.
   uint64_t PrevOffset = 0;
   uint64_t NextMemHeadPadSize = 0;
-  std::unique_ptr<SymbolicFile> CurSymFile;
-  std::unique_ptr<SymbolicFile> NextSymFile;
-  uint16_t Index = 0;
 
-  for (auto M = NewMembers.begin(); M < NewMembers.end(); ++M) {
+  for (uint32_t Index = 0; Index < NewMembers.size(); ++Index) {
+    const NewArchiveMember *M = &NewMembers[Index];
     std::string Header;
     raw_string_ostream Out(Header);
 
     MemoryBufferRef Buf = M->Buf->getMemBufferRef();
     StringRef Data = Thin ? "" : Buf.getBuffer();
 
-    Index++;
-
     // ld64 expects the members to be 8-byte aligned for 64-bit content and at
     // least 4-byte aligned for 32-bit content.  Opt for the larger encoding
     // uniformly.  This matches the behaviour with cctools and ensures that ld64
@@ -837,29 +845,9 @@ computeMemberData(raw_ostream &StringTable, raw_ostream &SymNames,
           std::move(StringMsg), object::object_error::parse_failed);
     }
 
-    if (NeedSymbols != SymtabWritingMode::NoSymtab || isAIXBigArchive(Kind)) {
-      auto SetNextSymFile = [&NextSymFile,
-                             &Context](MemoryBufferRef Buf,
-                                       StringRef MemberName) -> Error {
-        Expected<std::unique_ptr<SymbolicFile>> SymFileOrErr =
-            getSymbolicFile(Buf, Context);
-        if (!SymFileOrErr)
-          return createFileError(MemberName, SymFileOrErr.takeError());
-        NextSymFile = std::move(*SymFileOrErr);
-        return Error::success();
-      };
-
-      if (M == NewMembers.begin())
-        if (Error Err = SetNextSymFile(Buf, M->MemberName))
-          return std::move(Err);
-
-      CurSymFile = std::move(NextSymFile);
-
-      if ((M + 1) != NewMembers.end())
-        if (Error Err = SetNextSymFile((M + 1)->Buf->getMemBufferRef(),
-                                       (M + 1)->MemberName))
-          return std::move(Err);
-    }
+    std::unique_ptr<SymbolicFile> CurSymFile;
+    if (!SymFiles.empty())
+      CurSymFile = std::move(SymFiles[Index]);
 
     // In the big archive file format, we need to calculate and include the next
     // member offset and previous member offset in the file member header.
@@ -880,13 +868,13 @@ computeMemberData(raw_ostream &StringTable, raw_ostream &SymNames,
 
       // If there is another member file after this, we need to calculate the
       // padding before the header.
-      if ((M + 1) != NewMembers.end()) {
-        uint64_t OffsetToNextMemData = NextOffset +
-                                       sizeof(object::BigArMemHdrType) +
-                                       alignTo((M + 1)->MemberName.size(), 2);
+      if (Index + 1 != SymFiles.size()) {
+        uint64_t OffsetToNextMemData =
+            NextOffset + sizeof(object::BigArMemHdrType) +
+            alignTo(NewMembers[Index + 1].MemberName.size(), 2);
         NextMemHeadPadSize =
             alignToPowerOf2(OffsetToNextMemData,
-                            getMemberAlignment(NextSymFile.get())) -
+                            getMemberAlignment(SymFiles[Index + 1].get())) -
             OffsetToNextMemData;
         NextOffset += NextMemHeadPadSize;
       }
@@ -902,7 +890,7 @@ computeMemberData(raw_ostream &StringTable, raw_ostream &SymNames,
     std::vector<unsigned> Symbols;
     if (NeedSymbols != SymtabWritingMode::NoSymtab) {
       Expected<std::vector<unsigned>> SymbolsOrErr =
-          getSymbols(CurSymFile.get(), Index, SymNames, SymMap);
+          getSymbols(CurSymFile.get(), Index + 1, SymNames, SymMap);
       if (!SymbolsOrErr)
         return createFileError(M->MemberName, SymbolsOrErr.takeError());
       Symbols = std::move(*SymbolsOrErr);

dpaoliello

This is a really great cleanup IMHO, I found the NextSymFile thing very confusing when dealing with this code.

jh7370 · 2024-03-18T09:52:48Z

I think this PR could have done with some eyes from the AIX developer (@diggerlin, I think) and myself as people previously involved (and not all that long ago) with this area of code.

Taking a look at the change, I'm a little concerned about the potential impact on memory usage of this version of the PR. Previously, the code would only need one symbolic file to be in memory at a time, but now it requires all of them. I don't have any numbers, but I wouldn't be surprised in a large archive with many members with lots of symbols (not an unusual thing) if this change has drastically increased memory usage. I think it would be a good idea to do some performance analysis of the before and after of this change for particularly large archives.

cjacek · 2024-03-18T13:46:17Z

Taking a look at the change, I'm a little concerned about the potential impact on memory usage of this version of the PR. Previously, the code would only need one symbolic file to be in memory at a time, but now it requires all of them. I don't have any numbers, but I wouldn't be surprised in a large archive with many members with lots of symbols (not an unusual thing) if this change has drastically increased memory usage. I think it would be a good idea to do some performance analysis of the before and after of this change for particularly large archives.

Please note that symbolic file is stored in MemberData struct, which is returned by computeMemberData. CurSymFile is not destructed in the loop, it's moved to Ret vector instead, so we store all of them in memory with or without this patch. The memory usage impact of this PR is just one additional vector of pointers.

jh7370 · 2024-03-19T08:14:55Z

Taking a look at the change, I'm a little concerned about the potential impact on memory usage of this version of the PR. Previously, the code would only need one symbolic file to be in memory at a time, but now it requires all of them. I don't have any numbers, but I wouldn't be surprised in a large archive with many members with lots of symbols (not an unusual thing) if this change has drastically increased memory usage. I think it would be a good idea to do some performance analysis of the before and after of this change for particularly large archives.

Please note that symbolic file is stored in MemberData struct, which is returned by computeMemberData. CurSymFile is not destructed in the loop, it's moved to Ret vector instead, so we store all of them in memory with or without this patch. The memory usage impact of this PR is just one additional vector of pointers.

Oh, you're right. I looked through this code at least 3 times before posting my comment, looking for anything like this and still didn't see it. Sorry for the noise.

cjacek requested review from mstorsjo, dpaoliello and efriedma-quic March 14, 2024 14:41

llvmbot added the llvm:binary-utilities label Mar 14, 2024

cjacek mentioned this pull request Mar 14, 2024

[llvm-ar][Object][COFF] Add support for EC symbols to llvm-ar. #85230

Merged

dpaoliello approved these changes Mar 14, 2024

View reviewed changes

cjacek merged commit c7fc95b into llvm:main Mar 15, 2024
6 checks passed

cjacek deleted the ar-obj branch March 15, 2024 11:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Object][Archive][NFC] Create all symbolic files objects before calculating offsets. #85229

[Object][Archive][NFC] Create all symbolic files objects before calculating offsets. #85229

cjacek commented Mar 14, 2024

llvmbot commented Mar 14, 2024

dpaoliello left a comment

jh7370 commented Mar 18, 2024

cjacek commented Mar 18, 2024

jh7370 commented Mar 19, 2024

[Object][Archive][NFC] Create all symbolic files objects before calculating offsets. #85229

[Object][Archive][NFC] Create all symbolic files objects before calculating offsets. #85229

Conversation

cjacek commented Mar 14, 2024

llvmbot commented Mar 14, 2024

dpaoliello left a comment

Choose a reason for hiding this comment

jh7370 commented Mar 18, 2024

cjacek commented Mar 18, 2024

jh7370 commented Mar 19, 2024