Skip to content

Conversation

jayfoad
Copy link
Contributor

@jayfoad jayfoad commented Jan 17, 2025

TableGen writes all output to an in-memory buffer in case the
-write-if-change option is being used. Using raw_string_ostream for this
is inefficient because all the data has to be copied every time the
underlying std::string is resized. Fix this by writing to a custom
raw_ostream which stores the buffered data as the concatenation of a
vector of strings of increasing capacity. Each string in the vector is
never resized beyond its initial capacity to avoid unnecessary copying.

TableGen writes all output to an in-memory buffer in case the
-write-if-change option is being used. Using raw_string_ostream for this
is inefficient because all the data has to be copied every time the
underlying std::string is resized. Fix this by writing to a custom
raw_ostream which stores the buffered data as the concatenation of a
vector of strings of increasing capacity. Each string in the vector is
never resized beyond its initial capacity to avoid unnecessary copying.
@llvmbot
Copy link
Member

llvmbot commented Jan 17, 2025

@llvm/pr-subscribers-tablegen

Author: Jay Foad (jayfoad)

Changes

TableGen writes all output to an in-memory buffer in case the
-write-if-change option is being used. Using raw_string_ostream for this
is inefficient because all the data has to be copied every time the
underlying std::string is resized. Fix this by writing to a custom
raw_ostream which stores the buffered data as the concatenation of a
vector of strings of increasing capacity. Each string in the vector is
never resized beyond its initial capacity to avoid unnecessary copying.


Full diff: https://github.com/llvm/llvm-project/pull/123353.diff

1 Files Affected:

  • (modified) llvm/lib/TableGen/Main.cpp (+49-4)
diff --git a/llvm/lib/TableGen/Main.cpp b/llvm/lib/TableGen/Main.cpp
index 55a99cbfc58acd..88bca04ec19a41 100644
--- a/llvm/lib/TableGen/Main.cpp
+++ b/llvm/lib/TableGen/Main.cpp
@@ -37,6 +37,52 @@
 #include <utility>
 using namespace llvm;
 
+class stringvec_ostream : public raw_ostream {
+  std::vector<std::string> V;
+
+  size_t Pos = 0;
+  uint64_t current_pos() const override { return Pos; }
+
+  void write_impl(const char *Ptr, size_t Size) override {
+    Pos += Size;
+
+    size_t ThisSize = std::min(Size, V.back().capacity() - V.back().size());
+    V.back().append(Ptr, ThisSize);
+    Ptr += ThisSize;
+    Size -= ThisSize;
+
+    if (Size != 0) {
+      size_t NewCapacity = std::max(Size, V.back().capacity() * 2);
+      V.emplace_back();
+      V.back().reserve(NewCapacity);
+      V.back().append(Ptr, Size);
+    }
+  }
+
+public:
+  stringvec_ostream() : V(1) { SetUnbuffered(); }
+
+  friend raw_ostream &operator<<(raw_ostream &OS,
+                                 const stringvec_ostream &RHS) {
+    for (const std::string &S : RHS.V)
+      OS << S;
+    return OS;
+  }
+
+  bool operator==(StringRef RHS) {
+    if (Pos != RHS.size())
+      return false;
+
+    size_t Offset = 0;
+    for (const std::string &S : V) {
+      if (S != RHS.slice(Offset, Offset + S.size()))
+        return false;
+      Offset += S.size();
+    }
+    return true;
+  }
+};
+
 static cl::opt<std::string>
 OutputFilename("o", cl::desc("Output filename"), cl::value_desc("filename"),
                cl::init("-"));
@@ -130,8 +176,7 @@ int llvm::TableGenMain(const char *argv0,
 
   // Write output to memory.
   Timer.startBackendTimer("Backend overall");
-  std::string OutString;
-  raw_string_ostream Out(OutString);
+  stringvec_ostream Out;
   unsigned status = 0;
   // ApplyCallback will return true if it did not apply any callback. In that
   // case, attempt to apply the MainFn.
@@ -158,7 +203,7 @@ int llvm::TableGenMain(const char *argv0,
     // aren't any.
     if (auto ExistingOrErr =
             MemoryBuffer::getFile(OutputFilename, /*IsText=*/true))
-      if (std::move(ExistingOrErr.get())->getBuffer() == OutString)
+      if (Out == std::move(ExistingOrErr.get())->getBuffer())
         WriteFile = false;
   }
   if (WriteFile) {
@@ -167,7 +212,7 @@ int llvm::TableGenMain(const char *argv0,
     if (EC)
       return reportError(argv0, "error opening " + OutputFilename + ": " +
                                     EC.message() + "\n");
-    OutFile.os() << OutString;
+    OutFile.os() << Out;
     if (ErrorsPrinted == 0)
       OutFile.keep();
   }

@jayfoad
Copy link
Contributor Author

jayfoad commented Jan 17, 2025

This is mostly just an RFC to see if the idea makes sense, and if anything like my stringvec_ostream already exists? And if not, where should it live and what should it be called?

@kazutakahirata
Copy link
Contributor

@jayfoad Would it be acceptable to call std::string::reserve with the size of the existing output file? It might seem less elegant to rely on the file to be overwritten, but we read the existing output file for comparison purposes anyway, so we might as well use its size.

if (WriteIfChanged) {
// Only updates the real output file if there are any differences.
// This prevents recompilation of all the files depending on it if there
// aren't any.
if (auto ExistingOrErr =
MemoryBuffer::getFile(OutputFilename, /*IsText=*/true))
if (std::move(ExistingOrErr.get())->getBuffer() == OutString)
WriteFile = false;
}

@jayfoad
Copy link
Contributor Author

jayfoad commented Jan 22, 2025

Would it be acceptable to call std::string::reserve with the size of the existing output file?

That's a clever trick but it only helps if you're using -write-if-change and the file already exists. For example it does not help if you're building LLVM from scratch.

But I guess if you're not using -write-if-change, or the file doesn't already exist, then there is no need for the memory buffer at all and you can write directly to the output file. Is that what you were thinking?

@s-barannikov
Copy link
Contributor

I think we should use timestamp based approach rather than file comparison (see gcc's -MD/-MT/-MF/...).
Not sure how difficult it would be to integrate it with cmake/bazel.

@jayfoad
Copy link
Contributor Author

jayfoad commented Jan 22, 2025

I think we should use timestamp based approach rather than file comparison (see gcc's -MD/-MT/-MF/...). Not sure how difficult it would be to integrate it with cmake/bazel.

Not sure what you're suggesting. Those gcc options are to do with generating dependencies, but we already get this mostly right. E.g. if I touch RISCV.td and rebuild llc, only the RISCV backend gets rebuilt.

-write-if-changed is a separate mechanism that detects when an input .td file has changed, but in a harmless way that does not affect the .inc files generated by tablegen, so other source files that depend on them do not need to be rebuilt.

@s-barannikov
Copy link
Contributor

Not sure what you're suggesting. Those gcc options are to do with generating dependencies, but we already get this mostly right. E.g. if I touch RISCV.td and rebuild llc, only the RISCV backend gets rebuilt.

Ah, right. A dependency-generating option already exists (-d), but it is hidden and I missed it.

@MaskRay
Copy link
Member

MaskRay commented Jan 22, 2025

How much does stringvec_ostream improve the performance?

@kazutakahirata
Copy link
Contributor

But I guess if you're not using -write-if-change, or the file doesn't already exist, then there is no need for the memory buffer at all and you can write directly to the output file. Is that what you were thinking?

No, I wasn't thinking that far ahead, but, yes, we could write directly to the output file.

As @MaskRay asks, I'm also curious about the performance with this PR. If we simply rely on std::string's mechanism to double its buffer (implementation dependent?), each character in the buffer is written to the buffer twice on average. The number of memory allocations should be roughly the same, with or without this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants