Conversation


@ilovepi ilovepi commented Sep 22, 2025

The naive char-by-char lookup performed OK, but we can skip ahead to the
next match, avoiding all the extra hash lookups in the key map. There is
likely a faster method than this, but it's already a 42% win in the
BM_Mustache_StringRendering/Escaped benchmark and an order-of-magnitude
improvement for BM_Mustache_LargeOutputString.

| Benchmark | Before (ns) | After (ns) | Speedup |
| :--- | ---: | ---: | ---: |
| `StringRendering/Escaped` | 29,440,922 | 16,583,603 | ~44% |
| `LargeOutputString` | 15,139,251 | 929,891 | ~94% |
| `HugeArrayIteration` | 102,148,245 | 95,943,960 | ~6% |
| `PartialsRendering` | 308,330,014 | 303,556,563 | ~1.6% |

Unreported benchmarks, like those for parsing, had no significant change.


ilovepi commented Sep 22, 2025

This stack of pull requests is managed by Graphite.


llvmbot commented Sep 22, 2025

@llvm/pr-subscribers-llvm-support

Author: Paul Kirth (ilovepi)

Full diff: https://github.com/llvm/llvm-project/pull/160166.diff

1 file affected:

  • (modified) llvm/lib/Support/Mustache.cpp (+22-8)
diff --git a/llvm/lib/Support/Mustache.cpp b/llvm/lib/Support/Mustache.cpp
index c7cebe6b64fae..911fd5ee7fa01 100644
--- a/llvm/lib/Support/Mustache.cpp
+++ b/llvm/lib/Support/Mustache.cpp
@@ -428,19 +428,32 @@ class EscapeStringStream : public raw_ostream {
 public:
   explicit EscapeStringStream(llvm::raw_ostream &WrappedStream,
                               EscapeMap &Escape)
-      : Escape(Escape), WrappedStream(WrappedStream) {
+      : Escape(Escape), EscapeChars(Escape.keys().begin(), Escape.keys().end()),
+        WrappedStream(WrappedStream) {
     SetUnbuffered();
   }
 
 protected:
   void write_impl(const char *Ptr, size_t Size) override {
-    llvm::StringRef Data(Ptr, Size);
-    for (char C : Data) {
-      auto It = Escape.find(C);
-      if (It != Escape.end())
-        WrappedStream << It->getSecond();
-      else
-        WrappedStream << C;
+    StringRef Data(Ptr, Size);
+    size_t Start = 0;
+    while (Start < Size) {
+      // Find the next character that needs to be escaped.
+      size_t Next = Data.find_first_of(EscapeChars.str(), Start);
+
+      // If no escapable characters are found, write the rest of the string.
+      if (Next == StringRef::npos) {
+        WrappedStream << Data.substr(Start);
+        return;
+      }
+
+      // Write the chunk of text before the escapable character.
+      if (Next > Start)
+        WrappedStream << Data.substr(Start, Next - Start);
+
+      // Look up and write the escaped version of the character.
+      WrappedStream << Escape[Data[Next]];
+      Start = Next + 1;
     }
   }
 
@@ -448,6 +461,7 @@ class EscapeStringStream : public raw_ostream {
 
 private:
   EscapeMap &Escape;
+  SmallString<8> EscapeChars;
   llvm::raw_ostream &WrappedStream;
 };
 


@nikic nikic left a comment


Can't we make EscapeMap an std::array<std::string, 256> instead of DenseMap<char, std::string>? That would make the lookup cheap.
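The array-based table suggested above could look roughly like this; a minimal sketch, assuming a 256-entry array indexed by the raw byte value with unused slots left empty. `EscapeTable`, `makeHtmlEscapes`, and `escape` are hypothetical names, not existing LLVM API:

```cpp
#include <array>
#include <string>

// Hypothetical sketch: a flat array indexed by the byte value replaces the
// DenseMap. Slots with no replacement hold an empty string.
using EscapeTable = std::array<std::string, 256>;

static EscapeTable makeHtmlEscapes() {
  EscapeTable Table{};
  Table[static_cast<unsigned char>('&')] = "&amp;";
  Table[static_cast<unsigned char>('<')] = "&lt;";
  Table[static_cast<unsigned char>('>')] = "&gt;";
  return Table;
}

static std::string escape(const std::string &Input, const EscapeTable &Table) {
  std::string Out;
  for (char C : Input) {
    const std::string &Rep = Table[static_cast<unsigned char>(C)];
    if (Rep.empty())
      Out.push_back(C); // No replacement registered: copy the byte through.
    else
      Out += Rep;
  }
  return Out;
}
```

The lookup is a single array index per character, one load plus an emptiness check, rather than a hash and probe.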


ilovepi commented Sep 22, 2025

> Can't we make `EscapeMap` an `std::array<std::string, 256>` instead of `DenseMap<char, std::string>`? That would make the lookup cheap.

Yes, that's something I'd like to do as a follow up.


ilovepi commented Sep 23, 2025

This patch:

| Benchmark | Baseline (ns) | Experiment (ns) | Speedup |
| :--- | ---: | ---: | ---: |
| `LargeOutputString` | 8,926,576 | 591,254 | ~93% |
| `StringRendering/Escaped` | 18,196,698 | 10,280,591 | ~44% |
| `DeeplyNestedRendering` | 2,799 | 2,474 | ~12% |
| `PartialsRendering` | 211,153,502 | 197,101,139 | ~7% |
| `DeepTraversal` | 4,412,011 | 4,148,482 | ~6% |
| `HugeArrayIteration` | 61,887,053 | 58,737,900 | ~5% |

`std::array<std::string, 256>`:

| Benchmark | Baseline (ns) | Experiment (ns) | Change |
| :--- | ---: | ---: | ---: |
| `StringRendering/Escaped` | 18,196,698 | 16,979,453 | ~7% faster |
| `PartialsRendering` | 211,153,502 | 198,234,189 | ~6% faster |
| `LargeOutputString` | 8,926,576 | 8,423,018 | ~6% faster |
| `HugeArrayIteration` | 61,887,053 | 63,989,131 | ~3% slower |

I didn't try combining them. It's not clear how we'd initialize the list of special escape characters in the stream, unless we assume you can't override them. Besides, we do many fewer lookups now, so I don't know how worth it it is in practice. The Mustache generation is about 20% faster w/ this patch. That part is only a small fraction of the overall execution time, but it did make a difference.

@ilovepi ilovepi changed the base branch from users/ilovepi/mustache-delimiter-find to users/ilovepi/mustache-bench September 23, 2025 02:15
@ilovepi ilovepi force-pushed the users/ilovepi/mustache-escapestream-opt branch from 29e37be to 632536e Compare September 23, 2025 02:16
@ilovepi ilovepi force-pushed the users/ilovepi/mustache-bench branch from a34af38 to 17b25b0 Compare September 23, 2025 02:16

ilovepi commented Sep 23, 2025

Combined:

| Benchmark | Baseline (ns) | Combined (ns) | Change |
| :--- | ---: | ---: | ---: |
| `LargeOutputString` | 8,926,576 | 595,732 | ~93% faster |
| `StringRendering/Escaped` | 18,196,698 | 10,167,501 | ~44% faster |
| `SmallTemplateParsing` | 3,526 | 3,965 | ~12% slower |
| `PartialsRendering` | 211,153,502 | 258,420,352 | ~22% slower |
| `DeepTraversal` | 4,412,011 | 5,847,327 | ~32% slower |
| `HugeArrayIteration` | 61,887,053 | 84,320,500 | ~36% slower |


nikic commented Sep 23, 2025

In these tests, at which point are you constructing the std::array? Is it inside each EscapeStream or once when Template is constructed?

It's possible that std::array wasn't the right suggestion -- maybe the fact that it stores std::string makes it too large. But if you check what find_first_of actually does:

```cpp
std::bitset<1 << CHAR_BIT> CharBits;
for (char C : Chars)
  CharBits.set((unsigned char)C);
```

it will just take that string of characters you pass it and convert it into a bitset. We may as well directly create the bitset instead of creating the char string and then converting it to a bitset on every call.
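Building that bitset once up front, as suggested, could be sketched like this; `makeCharBits` and `findNextEscape` are hypothetical helper names, not LLVM APIs:

```cpp
#include <bitset>
#include <climits>
#include <cstddef>
#include <string>

// Sketch: construct the escape-character bitset a single time instead of
// letting find_first_of rebuild it from the character string on every call.
static std::bitset<1 << CHAR_BIT> makeCharBits(const std::string &Chars) {
  std::bitset<1 << CHAR_BIT> Bits;
  for (char C : Chars)
    Bits.set(static_cast<unsigned char>(C));
  return Bits;
}

// Scan forward to the next escapable character, or npos if none remain.
static std::size_t findNextEscape(const std::string &Data, std::size_t Start,
                                  const std::bitset<1 << CHAR_BIT> &Bits) {
  for (std::size_t I = Start, E = Data.size(); I != E; ++I)
    if (Bits.test(static_cast<unsigned char>(Data[I])))
      return I;
  return std::string::npos;
}
```

The inner loop then tests one bit per byte, with no per-call setup cost.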


ilovepi commented Sep 23, 2025

> In these tests, at which point are you constructing the std::array? Is it inside each EscapeStream or once when Template is constructed?

Once when the template is constructed. I used a function w/ a static variable to provide the default escapes, so I think it's just once for the whole program.

> It's possible that std::array wasn't the right suggestion -- maybe the fact that it stores std::string makes it too large. But if you check what find_first_of actually does:
>
> ```cpp
> std::bitset<1 << CHAR_BIT> CharBits;
> for (char C : Chars)
>   CharBits.set((unsigned char)C);
> ```
>
> it will just take that string of characters you pass it and convert it into a bitset. We may as well directly create the bitset instead of creating the char string and then converting it to a bitset on every call.

Could be. I think the big win is that when we use find_first_of we pass a big stringref to the output stream instead of passing in one string at a time. There's no copy, and the number of iterations where we write to the stream is reduced.

@ilovepi ilovepi force-pushed the users/ilovepi/mustache-bench branch from 17b25b0 to 4404c23 Compare September 25, 2025 21:37
@ilovepi ilovepi force-pushed the users/ilovepi/mustache-escapestream-opt branch from 632536e to bba1a54 Compare September 25, 2025 21:37

nikic commented Sep 25, 2025

> Could be. I think the big win is that when we use find_first_of we pass a big stringref to the output stream instead of passing in one string at a time. There's no copy, and the number of iterations where we write to the stream is reduced.

Oh, I see. I thought the bottleneck here was the escape lookup, not the write to the stream. If the stream is the slow bit, would it make sense to write everything into a SmallString first and then write the full string to the stream?
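The buffer-first idea floated here could be sketched as follows, with `std::string` standing in for `llvm::SmallString` and a backslash marker standing in for the real escape lookup; all names are hypothetical, this is not the patch's code:

```cpp
#include <cstddef>
#include <string>

// Sketch: accumulate the escaped output in a local buffer and hand the
// stream one finished chunk, instead of issuing many small writes.
static std::string renderEscaped(const std::string &Data,
                                 const std::string &EscapeChars) {
  std::string Buffer;
  std::size_t Start = 0;
  while (Start < Data.size()) {
    std::size_t Next = Data.find_first_of(EscapeChars, Start);
    if (Next == std::string::npos) {
      // No escapable characters remain: copy the tail and stop.
      Buffer.append(Data, Start, std::string::npos);
      break;
    }
    // Copy the plain chunk before the escapable character.
    Buffer.append(Data, Start, Next - Start);
    // A real implementation would append the escape replacement here; a
    // backslash marker stands in for that lookup.
    Buffer += "\\";
    Buffer += Data[Next];
    Start = Next + 1;
  }
  return Buffer; // The caller writes Buffer to the stream in one call.
}
```

The stream then sees a single write per `write_impl` call; the trade-off is the extra growth and copy of the buffer itself, which is the question being weighed in the thread.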

@ilovepi ilovepi force-pushed the users/ilovepi/mustache-bench branch 9 times, most recently from 6c45ea6 to 1f83764 Compare September 25, 2025 23:17
@ilovepi ilovepi force-pushed the users/ilovepi/mustache-bench branch 4 times, most recently from 2d7f1b9 to 47a5f24 Compare September 25, 2025 23:36
Base automatically changed from users/ilovepi/mustache-bench to main September 26, 2025 00:09
@ilovepi ilovepi force-pushed the users/ilovepi/mustache-escapestream-opt branch from bba1a54 to 67509f6 Compare September 26, 2025 00:35

ilovepi commented Sep 26, 2025

> Oh, I see. I thought the bottleneck here was the escape lookup, not the write to the stream. If the stream is the slow bit, would it make sense to write everything into a SmallString first and then write the full string to the stream?

Hmm, I think it's sort of a combination. In the char-by-char case, we do a lookup (which may be pretty fast w/ the bitset or std::array) and then write a string into the stream. That's not slow, but if we do it char by char, there's just a bunch of overhead. find_first_of() is suboptimal in that it's recomputing the bitset, but it does let us nicely bound the StringRef to pass to the stream, which I assume is just the one copy from src -> dst. If we have an escape char, we put that in the stream and continue.

If I use a SmallString in the same way, I'll grow it as many times as I'd write escapes out. It seems more direct/efficient to just write it to the stream at that point, but 🤷 that's just my intuition.

I guess we'd save on bitset creation if I more or less inlined find_first_of, but 🤷, it seems way nicer to just use the StringRef API and not worry too much. We've sped up this bit quite a lot, and I don't think it's going to be a bottleneck anymore. I have a separate stack of other Mustache improvements that deal w/ the lack of spec compliance, so it may be worth revisiting the perf issues and any new regressions once that's done. I know a few of the things I did to make the implementation more correct also had some performance implications, like removing redundant parsing and multi-pass algorithms from the original naive implementation.


@nikic nikic left a comment


LGTM. Thanks for the detailed explanations and experiments!

@ilovepi ilovepi merged commit f9065fc into main Sep 26, 2025
7 of 9 checks passed
@ilovepi ilovepi deleted the users/ilovepi/mustache-escapestream-opt branch September 26, 2025 22:46
YixingZhang007 pushed a commit to YixingZhang007/llvm-project that referenced this pull request Sep 27, 2025
mahesh-attarde pushed a commit to mahesh-attarde/llvm-project that referenced this pull request Oct 3, 2025