[llvm] Improve implementation of StringRef::find_last_of and cie #71865
base: main
Conversation
@llvm/pr-subscribers-llvm-adt @llvm/pr-subscribers-llvm-support
Author: None (serge-sans-paille)
Changes: Using a simple LUT implies fewer operations than a bitset, at the expense of slightly more stack usage. Mandatory performance report: https://llvm-compile-time-tracker.com/compare.php?from=3f9d385e5844f2f1f144305037cfc904789c6187&to=86f685e9552acaf8d5db630161e91752348422a1&stat=instructions:u
Full diff: https://github.com/llvm/llvm-project/pull/71865.diff (1 file affected)
diff --git a/llvm/lib/Support/StringRef.cpp b/llvm/lib/Support/StringRef.cpp
index feee47ca693b251..be6e340ff0d5901 100644
--- a/llvm/lib/Support/StringRef.cpp
+++ b/llvm/lib/Support/StringRef.cpp
@@ -13,7 +13,6 @@
#include "llvm/ADT/StringExtras.h"
#include "llvm/ADT/edit_distance.h"
#include "llvm/Support/Error.h"
-#include <bitset>
using namespace llvm;
@@ -236,12 +235,12 @@ size_t StringRef::rfind_insensitive(StringRef Str) const {
/// Note: O(size() + Chars.size())
StringRef::size_type StringRef::find_first_of(StringRef Chars,
size_t From) const {
- std::bitset<1 << CHAR_BIT> CharBits;
+ bool CharBits[1 << CHAR_BIT] = {false};
for (char C : Chars)
- CharBits.set((unsigned char)C);
+ CharBits[(unsigned char)C] = true;
for (size_type i = std::min(From, Length), e = Length; i != e; ++i)
- if (CharBits.test((unsigned char)Data[i]))
+ if (CharBits[(unsigned char)Data[i]])
return i;
return npos;
}
@@ -258,12 +257,12 @@ StringRef::size_type StringRef::find_first_not_of(char C, size_t From) const {
/// Note: O(size() + Chars.size())
StringRef::size_type StringRef::find_first_not_of(StringRef Chars,
size_t From) const {
- std::bitset<1 << CHAR_BIT> CharBits;
+ bool CharBits[1 << CHAR_BIT] = {false};
for (char C : Chars)
- CharBits.set((unsigned char)C);
+ CharBits[(unsigned char)C] = true;
for (size_type i = std::min(From, Length), e = Length; i != e; ++i)
- if (!CharBits.test((unsigned char)Data[i]))
+ if (!CharBits[(unsigned char)Data[i]])
return i;
return npos;
}
@@ -274,12 +273,12 @@ StringRef::size_type StringRef::find_first_not_of(StringRef Chars,
/// Note: O(size() + Chars.size())
StringRef::size_type StringRef::find_last_of(StringRef Chars,
size_t From) const {
- std::bitset<1 << CHAR_BIT> CharBits;
+ bool CharBits[1 << CHAR_BIT] = {false};
for (char C : Chars)
- CharBits.set((unsigned char)C);
+ CharBits[(unsigned char)C] = true;
for (size_type i = std::min(From, Length) - 1, e = -1; i != e; --i)
- if (CharBits.test((unsigned char)Data[i]))
+ if (CharBits[(unsigned char)Data[i]])
return i;
return npos;
}
@@ -299,12 +298,12 @@ StringRef::size_type StringRef::find_last_not_of(char C, size_t From) const {
/// Note: O(size() + Chars.size())
StringRef::size_type StringRef::find_last_not_of(StringRef Chars,
size_t From) const {
- std::bitset<1 << CHAR_BIT> CharBits;
+ bool CharBits[1 << CHAR_BIT] = {false};
for (char C : Chars)
- CharBits.set((unsigned char)C);
+ CharBits[(unsigned char)C] = true;
for (size_type i = std::min(From, Length) - 1, e = -1; i != e; --i)
- if (!CharBits.test((unsigned char)Data[i]))
+ if (!CharBits[(unsigned char)Data[i]])
return i;
return npos;
}
Hello @serge-sans-paille! Could you please explain in a bit more detail why you think this is a good optimization? Fewer instructions retired is not necessarily better if it increases memory pressure. It might be nothing, but before, the whole bitset fit in a single cache line; now it takes 4 cache lines to do the same thing. Again, this might be nothing, but that means 3 extra cache lines being evicted elsewhere to execute this function. Also, I'd like to be convinced by the profile link you've attached, but the difference is so small that it could be normal fluctuation (the 'cycles' stat is also higher). One more point: the previous bit-shifting code in bitset might run better on AArch64 than on x64, so it'd be interesting to see the difference there too. Thanks!
Force-pushed from fd5224f to 16b0bf6.
Force-pushed from 731953e to 2e48f5a.
@aganea I submitted a different approach that will work with any SSE2-powered machine, and it doesn't have the same stack-usage issue as the previous one.
This patch looks good to me. I am not sure we want to implement architecture-specific optimizations like __SSE2__ at the source-code level, but I don't want that to block the landing of this patch. Is there any way you could separate these two patches? Thanks!
EDIT: I guess "this patch" isn't clear. I am referring to the first one that replaces "." with '.', etc.
llvm/lib/Support/StringRef.cpp
Outdated
  } while (Sz);
  return npos;
}
#endif
Can this be abstracted or made out-of-line? I'm wondering about the scalability of HW-specific intrinsics inline (anticipating the incoming #elif defined(ARM64)...).
I don't think using https://github.com/xtensor-stack/xsimd is an option :-) And https://en.cppreference.com/w/cpp/experimental/simd/simd is still not a thing :-/
We already have some bits of SSE2 in clang and llvm. OK to factor this into a function.
Yeah, I'm fine with SSE2; I was just trying to keep the specialized implementation out-of-line. What you have now looks good, I think.
Force-pushed from 2e48f5a to 5ebe6d8.
✅ With the latest revision this PR passed the C/C++ code formatter.
We already have these for critical parts of the code (in the Clang lexer, for instance). I only went that way because this particular function appeared in a profile of a teammate's machine during code indexing (and because it was very entertaining to implement).
I've landed the first commit as 33b5158; it was a no-brainer.
Force-pushed from cc47899 to 6933b73.
llvm/lib/Support/StringRef.cpp
Outdated
#ifdef __SSE2__

StringRef::size_type vectorized_find_last_of_specialized(const char *Data,
                                                         size_t Sz, char C0,
static?
llvm/lib/Support/StringRef.cpp
Outdated
StringRef::size_type vectorized_find_last_of_specialized(const char *Data,
                                                         size_t Sz, char C0,
Suggested change:
-StringRef::size_type vectorized_find_last_of_specialized(const char *Data,
-                                                         size_t Sz, char C0,
+static StringRef::size_type
+vectorized_find_last_of_specialized(const char *Data, size_t Sz, char C0,
llvm/lib/Support/StringRef.cpp
Outdated
  __m128i Needle1 = _mm_set1_epi8(C1);
  do {
    Sz = Sz < 16 ? 0 : Sz - 16;
    __m128i Buffer = _mm_loadu_si128((const __m128i *)(Data + Sz));
This load instruction will generate an out-of-bounds access if strlen(Data) < 15. Do you think you can use the "slow" algorithm when that happens?
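One way to address this review comment is to only take the 16-byte vector path while at least 16 bytes remain, and finish (or run entirely, on short inputs) with a scalar scan. The sketch below is hypothetical (the helper name `find_last_of_2` is mine, not the patch's), assumes SSE2 and the GCC/Clang `__builtin_clz` builtin, and falls back to scalar code elsewhere:

```cpp
#include <cstddef>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

// Hypothetical sketch: never issue a 16-byte load unless 16 bytes remain,
// so _mm_loadu_si128 cannot read before the start of the buffer.
size_t find_last_of_2(const char *Data, size_t Sz, char C0, char C1) {
#ifdef __SSE2__
  const __m128i Needle0 = _mm_set1_epi8(C0);
  const __m128i Needle1 = _mm_set1_epi8(C1);
  while (Sz >= 16) {
    Sz -= 16;
    __m128i Buffer = _mm_loadu_si128((const __m128i *)(Data + Sz));
    int Mask = _mm_movemask_epi8(_mm_or_si128(_mm_cmpeq_epi8(Buffer, Needle0),
                                              _mm_cmpeq_epi8(Buffer, Needle1)));
    // Bit i of Mask corresponds to Data[Sz + i]; take the highest set bit.
    if (Mask)
      return Sz + 31 - __builtin_clz((unsigned)Mask);
  }
#endif
  // Scalar tail, also the portable fallback when SSE2 is unavailable.
  while (Sz--)
    if (Data[Sz] == C0 || Data[Sz] == C1)
      return Sz;
  return (size_t)-1; // npos
}
```

For inputs shorter than 16 bytes the vector loop never executes, which is exactly the "slow algorithm" escape hatch the reviewer asks for.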
llvm/lib/Support/StringRef.cpp
Outdated
@@ -623,8 +657,7 @@ hash_code llvm::hash_value(StringRef S) {
}

unsigned DenseMapInfo<StringRef, void>::getHashValue(StringRef Val) {
-  assert(Val.data() != getEmptyKey().data() &&
-         "Cannot hash the empty key!");
+  assert(Val.data() != getEmptyKey().data() && "Cannot hash the empty key!");
Can you please commit all these NFC formatting changes separately?
llvm/lib/Support/StringRef.cpp
Outdated
@@ -372,7 +402,8 @@ size_t StringRef::count(StringRef Str) const {
  size_t Count = 0;
  size_t Pos = 0;
  size_t N = Str.size();
-  // TODO: For an empty `Str` we return 0 for legacy reasons. Consider changing
+  // TODO: For an empty `Str` we return 0 for legacy reasons. Consider
+  // changing
Text should go with the following line.
Force-pushed from 00f640e to e8ed042.
@aganea / @joker-eph / @kazutakahirata: I actually switched the implementation to something that's almost as efficient (no vector load required) and works across all architectures. (Not to mention it also increases my geekness karma.)
llvm/lib/Support/StringRef.cpp
Outdated
@@ -268,17 +268,47 @@ StringRef::size_type StringRef::find_first_not_of(StringRef Chars,
  return npos;
}

// See https://graphics.stanford.edu/~seander/bithacks.html#ValueInWord
static inline uint64_t haszero(uint64_t v) {
  return ((v)-0x0101010101010101UL) & ~(v) & 0x8080808080808080UL;
ULL instead of UL?
llvm/lib/Support/StringRef.cpp
Outdated
  return ((v)-0x0101010101010101UL) & ~(v) & 0x8080808080808080UL;
}
static inline uint64_t hasvalue(uint64_t x, char n) {
  return haszero((x) ^ (~0UL / 255 * (n)));
ULL
You'll need ~0ULL here, otherwise things won't work as expected. x is a 64-bit value, and the goal of the algorithm is to spread n out to all bytes. The initial algorithm in the link you've provided was written for 32-bit values. With just ~0UL, only the lowest 32 bits of the 64-bit value will be filled out.
This hasn't been addressed yet.
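For reference, here is a hedged sketch of the two bit hacks with the ULL suffixes the reviewers are asking for (an illustration of the technique from the linked bithacks page, not the PR's final code):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// haszero flags every zero byte of v with 0x80 in that byte's position;
// hasvalue reuses it by XORing x with n replicated into all eight bytes.
// Note that ~0ULL / 255 == 0x0101010101010101ULL, the byte-replication
// constant, which is why the ULL suffix matters on 32-bit-long platforms.
inline uint64_t haszero(uint64_t v) {
  return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}
inline uint64_t hasvalue(uint64_t x, char n) {
  return haszero(x ^ (~0ULL / 255 * (uint8_t)n));
}
```

A quick sanity check: loading the bytes "abcdefgh" into a uint64_t, `hasvalue(w, 'c')` is nonzero while `hasvalue(w, 'z')` is zero, regardless of endianness, since the hack only asks "is this byte present anywhere in the word".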
llvm/lib/Support/StringRef.cpp
Outdated
    if (Check)
      return Sz + 7 - llvm::countl_zero(Check) / 8;
  } while (Sz);
  return -1;
return npos
llvm/lib/Support/StringRef.cpp
Outdated
/// find_last_of - Find the last character in the string that is in \arg C,
/// or npos if not found.
///
/// Note: O(size() + Chars.size())
StringRef::size_type StringRef::find_last_of(StringRef Chars,
                                             size_t From) const {
  size_type Sz = std::min(From, Length);

  if (Chars.size() == 2)
Could you please elaborate a bit more on the specific use cases that you hit in clangd that justify this if (Chars.size() == 2)? At least a unit test, along with a little comment about the "why", will help future readers of this patch.
Force-pushed from e8ed042 to cbfb801.
It turns out
StringRef::size_type StringRef::find_last_of(StringRef Chars,
                                             size_t From) const {
  size_type Sz = std::min(From, Length);

  if (Chars.size() == 2) {
Before going further with this PR, I'd like to fully understand which specific call site of find_last_of in clangd is improved by this change, and what user flow triggers it. Is this for .find_last_of("\r\n")? Or for .find_last_of("/\\")? Something else? The situation might change with different/newer CPU architectures; that's why a specific repro would be good to add to the PR description.
You also said above that this function appeared in a teammate's profile during code indexing. Are you able to test before/after this patch and see how the situation is improved? It would be really nice to have numbers along with this PR, so that others can compare.
Yeah, that's the find_last_of("\r\n") part.
Is there any precedent for such a micro-optimization in core libraries? I still feel that this optimization should go in a clangd utility file instead. @dwblaikie @MaskRay @joker-eph Do you have an opinion on this?
It's a tradeoff between impact and cost of maintenance.
Moving it to clangd does not reduce the maintenance burden for the LLVM project, though; are we concerned with StringRef maintenance here?
I would say that, on the contrary, moving it to clangd runs the risk that Bolt or LLDB reimplement the same thing, not knowing clangd has it.
So I'd say either the implementation isn't overly complex for the long-term maintenance here compared to the benefits, or it may not belong in the project at all.
I don't find a mention of the current perf impact on clangd, though. In #71865 (comment) you mention a micro-benchmark, @serge-sans-paille, but not the result, as far as I can see.
Is this still relevant?
Yeah, it probably is. I just switched priorities; I'll do another iteration on that one.
Force-pushed from cbfb801 to 75e87c9.
Code updated, and I ran the following micro benchmark:

which is as stupid as a micro benchmark can be. This patch makes it run twice as fast as the non-patched version.
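The benchmark snippet itself did not survive the page capture. A hypothetical stand-in (the helper name and sizes are mine, and it uses std::string::find_last_of rather than llvm::StringRef, purely to stay self-contained) of what such a "stupid" loop could look like:

```cpp
#include <chrono>
#include <cstdio>
#include <string>

// Hypothetical micro benchmark: repeatedly search a 1 MiB buffer for the
// last occurrence of "\r\n", mirroring the clangd hot path discussed above.
// Returns the found position so the search can't be optimized away.
size_t bench_find_last_of(size_t Iters) {
  std::string Buf(1 << 20, 'x');
  Buf[100] = '\n'; // sole match near the front: worst case for a reverse scan
  size_t Sink = 0;
  auto Start = std::chrono::steady_clock::now();
  for (size_t i = 0; i < Iters; ++i)
    Sink += Buf.find_last_of("\r\n");
  auto End = std::chrono::steady_clock::now();
  std::printf("%zu iters in %lld us\n", Iters,
              (long long)std::chrono::duration_cast<std::chrono::microseconds>(
                  End - Start)
                  .count());
  return Sink / Iters;
}
```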
The build issue seems unrelated (?)
The Windows bot has been pure noise for many weeks.
This is factually incorrect: https://lab.llvm.org/buildbot/#/builders/271
…l case of 2 chars

Almost all usage of StringRef::find_last_of in Clang/LLVM uses a Needle of 2 elements, which can be optimized using a generic vectorized algorithm and a few bit hacks.
Force-pushed from 75e87c9 to 2770064.
Does this micro-optimization adversely affect cases that aren't covered by it (e.g., does checking for length 2 slow down non-length-2 cases perceptibly)?
                  0x7F7F7F7F7F7F7F7FULL);
}
static inline uint64_t hasvalue(uint64_t x, char n) {
  return haszero((x) ^ (~0ULL / 255 * (n)));
Maybe it's just me, but I think haszero((x) ^ ((uint64_t)n * 0x0101010101010101)) makes it clearer that n is just repeated over the entire uint64_t.
  while (Sz >= 8) {
    Sz -= 8;
    uint64_t Buffer = 0;
    std::memcpy((void *)&Buffer, (void *)(Data + Sz), sizeof(Buffer));
Isn't there an already-existing utility to do unaligned loads more explicitly?
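For context on the memcpy idiom the patch uses: a hedged, self-contained sketch of a portable unaligned 64-bit load. (LLVM does ship endian-aware read helpers in llvm/Support/Endian.h; plain memcpy is used here only to keep the example dependency-free, and the helper name is mine.)

```cpp
#include <cstdint>
#include <cstring>

// std::memcpy is the standard-compliant way to read a uint64_t from a
// possibly misaligned address; optimizing compilers lower this to a single
// unaligned load instruction on architectures that support one.
inline uint64_t load_u64_unaligned(const void *P) {
  uint64_t V;
  std::memcpy(&V, P, sizeof(V));
  return V;
}
```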
Using a simple LUT implies fewer operations than a bitset, at the expense of slightly more stack usage.
Mandatory performance report:
https://llvm-compile-time-tracker.com/compare.php?from=3f9d385e5844f2f1f144305037cfc904789c6187&to=86f685e9552acaf8d5db630161e91752348422a1&stat=instructions:u