Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Object][Archive][NFC] Create all symbolic files objects before calculating offsets. #85229

Merged
merged 1 commit into from
Mar 15, 2024

Conversation

cjacek
Copy link
Contributor

@cjacek cjacek commented Mar 14, 2024

This is refactoring preparing to move UseECMap computation to the archive writer. We currently require writeArchive caller to pass that. This is not practical for llvm-ar, which currently interprets at most one passed object. For a reliable UseECMap, we need to interpret all symbolic objects: we may have, for example, a list of x86_64 files followed by aarch64 file, which indicates that we should use EC map for x86_64 objects.

This commit interprets symbolic files in a separated pass, which will be a convenient place to implement UseECMap computation in the follow up. I think it also makes accessing the next member for AIX big archive offset computation a bit easier.

…lating offsets.

This is refactoring preparing to move UseECMap computation to the archive writer.
We currently require writeArchive caller to pass that. This is not practical
for llvm-ar, which currently interprets at most one passed object. For a reliable
UseECMap, we need to interpret all symbolic objects: we may have, for example,
a list of x86_64 files followed by aarch64 file, which indicates that we should
use EC map for x86_64 objects.

This commit interprets symbolic files in a separated pass, which will be a convenient
place to implement UseECMap computation in the follow up. I think it also
makes accessing the next member for AIX big archive offset computation a bit easier.
@llvmbot
Copy link
Collaborator

llvmbot commented Mar 14, 2024

@llvm/pr-subscribers-llvm-binary-utilities

Author: Jacek Caban (cjacek)

Changes

This is refactoring preparing to move UseECMap computation to the archive writer. We currently require writeArchive caller to pass that. This is not practical for llvm-ar, which currently interprets at most one passed object. For a reliable UseECMap, we need to interpret all symbolic objects: we may have, for example, a list of x86_64 files followed by aarch64 file, which indicates that we should use EC map for x86_64 objects.

This commit interprets symbolic files in a separated pass, which will be a convenient place to implement UseECMap computation in the follow up. I think it also makes accessing the next member for AIX big archive offset computation a bit easier.


Full diff: https://github.com/llvm/llvm-project/pull/85229.diff

1 Files Affected:

  • (modified) llvm/lib/Object/ArchiveWriter.cpp (+23-35)
diff --git a/llvm/lib/Object/ArchiveWriter.cpp b/llvm/lib/Object/ArchiveWriter.cpp
index aa57e55de70c8d..97d0e4f75be8e0 100644
--- a/llvm/lib/Object/ArchiveWriter.cpp
+++ b/llvm/lib/Object/ArchiveWriter.cpp
@@ -795,23 +795,31 @@ computeMemberData(raw_ostream &StringTable, raw_ostream &SymNames,
       Entry.second = Entry.second > 1 ? 1 : 0;
   }
 
+  std::vector<std::unique_ptr<SymbolicFile>> SymFiles;
+
+  if (NeedSymbols != SymtabWritingMode::NoSymtab || isAIXBigArchive(Kind)) {
+    for (const NewArchiveMember &M : NewMembers) {
+      Expected<std::unique_ptr<SymbolicFile>> SymFileOrErr =
+          getSymbolicFile(M.Buf->getMemBufferRef(), Context);
+      if (!SymFileOrErr)
+        return createFileError(M.MemberName, SymFileOrErr.takeError());
+      SymFiles.push_back(std::move(*SymFileOrErr));
+    }
+  }
+
   // The big archive format needs to know the offset of the previous member
   // header.
   uint64_t PrevOffset = 0;
   uint64_t NextMemHeadPadSize = 0;
-  std::unique_ptr<SymbolicFile> CurSymFile;
-  std::unique_ptr<SymbolicFile> NextSymFile;
-  uint16_t Index = 0;
 
-  for (auto M = NewMembers.begin(); M < NewMembers.end(); ++M) {
+  for (uint32_t Index = 0; Index < NewMembers.size(); ++Index) {
+    const NewArchiveMember *M = &NewMembers[Index];
     std::string Header;
     raw_string_ostream Out(Header);
 
     MemoryBufferRef Buf = M->Buf->getMemBufferRef();
     StringRef Data = Thin ? "" : Buf.getBuffer();
 
-    Index++;
-
     // ld64 expects the members to be 8-byte aligned for 64-bit content and at
     // least 4-byte aligned for 32-bit content.  Opt for the larger encoding
     // uniformly.  This matches the behaviour with cctools and ensures that ld64
@@ -837,29 +845,9 @@ computeMemberData(raw_ostream &StringTable, raw_ostream &SymNames,
           std::move(StringMsg), object::object_error::parse_failed);
     }
 
-    if (NeedSymbols != SymtabWritingMode::NoSymtab || isAIXBigArchive(Kind)) {
-      auto SetNextSymFile = [&NextSymFile,
-                             &Context](MemoryBufferRef Buf,
-                                       StringRef MemberName) -> Error {
-        Expected<std::unique_ptr<SymbolicFile>> SymFileOrErr =
-            getSymbolicFile(Buf, Context);
-        if (!SymFileOrErr)
-          return createFileError(MemberName, SymFileOrErr.takeError());
-        NextSymFile = std::move(*SymFileOrErr);
-        return Error::success();
-      };
-
-      if (M == NewMembers.begin())
-        if (Error Err = SetNextSymFile(Buf, M->MemberName))
-          return std::move(Err);
-
-      CurSymFile = std::move(NextSymFile);
-
-      if ((M + 1) != NewMembers.end())
-        if (Error Err = SetNextSymFile((M + 1)->Buf->getMemBufferRef(),
-                                       (M + 1)->MemberName))
-          return std::move(Err);
-    }
+    std::unique_ptr<SymbolicFile> CurSymFile;
+    if (!SymFiles.empty())
+      CurSymFile = std::move(SymFiles[Index]);
 
     // In the big archive file format, we need to calculate and include the next
     // member offset and previous member offset in the file member header.
@@ -880,13 +868,13 @@ computeMemberData(raw_ostream &StringTable, raw_ostream &SymNames,
 
       // If there is another member file after this, we need to calculate the
       // padding before the header.
-      if ((M + 1) != NewMembers.end()) {
-        uint64_t OffsetToNextMemData = NextOffset +
-                                       sizeof(object::BigArMemHdrType) +
-                                       alignTo((M + 1)->MemberName.size(), 2);
+      if (Index + 1 != SymFiles.size()) {
+        uint64_t OffsetToNextMemData =
+            NextOffset + sizeof(object::BigArMemHdrType) +
+            alignTo(NewMembers[Index + 1].MemberName.size(), 2);
         NextMemHeadPadSize =
             alignToPowerOf2(OffsetToNextMemData,
-                            getMemberAlignment(NextSymFile.get())) -
+                            getMemberAlignment(SymFiles[Index + 1].get())) -
             OffsetToNextMemData;
         NextOffset += NextMemHeadPadSize;
       }
@@ -902,7 +890,7 @@ computeMemberData(raw_ostream &StringTable, raw_ostream &SymNames,
     std::vector<unsigned> Symbols;
     if (NeedSymbols != SymtabWritingMode::NoSymtab) {
       Expected<std::vector<unsigned>> SymbolsOrErr =
-          getSymbols(CurSymFile.get(), Index, SymNames, SymMap);
+          getSymbols(CurSymFile.get(), Index + 1, SymNames, SymMap);
       if (!SymbolsOrErr)
         return createFileError(M->MemberName, SymbolsOrErr.takeError());
       Symbols = std::move(*SymbolsOrErr);

Copy link
Contributor

@dpaoliello dpaoliello left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really great cleanup IMHO, I found the NextSymFile thing very confusing when dealing with this code.

@cjacek cjacek merged commit c7fc95b into llvm:main Mar 15, 2024
6 checks passed
@cjacek cjacek deleted the ar-obj branch March 15, 2024 11:39
@jh7370
Copy link
Collaborator

jh7370 commented Mar 18, 2024

I think this PR could have done with some eyes from the AIX developer (@diggerlin, I think) and myself as people previously involved (and not all that long ago) with this area of code.

Taking a look at the change, I'm a little concerned about the potential impact on memory usage of this version of the PR. Previously, the code would only need one symbolic file to be in memory at a time, but now it requires all of them. I don't have any numbers, but I wouldn't be surprised in a large archive with many members with lots of symbols (not an unusual thing) if this change has drastically increased memory usage. I think it would be a good idea to do some performance analysis of the before and after of this change for particularly large archives.

@cjacek
Copy link
Contributor Author

cjacek commented Mar 18, 2024

Taking a look at the change, I'm a little concerned about the potential impact on memory usage of this version of the PR. Previously, the code would only need one symbolic file to be in memory at a time, but now it requires all of them. I don't have any numbers, but I wouldn't be surprised in a large archive with many members with lots of symbols (not an unusual thing) if this change has drastically increased memory usage. I think it would be a good idea to do some performance analysis of the before and after of this change for particularly large archives.

Please note that symbolic file is stored in MemberData struct, which is returned by computeMemberData. CurSymFile is not destructed in the loop, it's moved to Ret vector instead, so we store all of them in memory with or without this patch. The memory usage impact of this PR is just one additional vector of pointers.

@jh7370
Copy link
Collaborator

jh7370 commented Mar 19, 2024

Taking a look at the change, I'm a little concerned about the potential impact on memory usage of this version of the PR. Previously, the code would only need one symbolic file to be in memory at a time, but now it requires all of them. I don't have any numbers, but I wouldn't be surprised in a large archive with many members with lots of symbols (not an unusual thing) if this change has drastically increased memory usage. I think it would be a good idea to do some performance analysis of the before and after of this change for particularly large archives.

Please note that symbolic file is stored in MemberData struct, which is returned by computeMemberData. CurSymFile is not destructed in the loop, it's moved to Ret vector instead, so we store all of them in memory with or without this patch. The memory usage impact of this PR is just one additional vector of pointers.

Oh, you're right. I looked through this code at least 3 times before posting my comment, looking for anything like this and still didn't see it. Sorry for the noise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants