
[llvm] Improve implementation of StringRef::find_last_of and cie #71865

Open
wants to merge 1 commit into main from perf/find-last-of

Conversation

serge-sans-paille (Collaborator)

Using a simple LUT implies fewer operations than a bitset, at the expense of slightly more stack usage.

Mandatory performance report:

https://llvm-compile-time-tracker.com/compare.php?from=3f9d385e5844f2f1f144305037cfc904789c6187&to=86f685e9552acaf8d5db630161e91752348422a1&stat=instructions:u

@llvmbot (Collaborator) commented Nov 9, 2023

@llvm/pr-subscribers-llvm-adt
@llvm/pr-subscribers-clang-driver
@llvm/pr-subscribers-backend-spir-v
@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-mlir-core
@llvm/pr-subscribers-mlir

@llvm/pr-subscribers-llvm-support

Author: None (serge-sans-paille)

Changes

Using a simple LUT implies fewer operations than a bitset, at the expense of slightly more stack usage.

Mandatory performance report:

https://llvm-compile-time-tracker.com/compare.php?from=3f9d385e5844f2f1f144305037cfc904789c6187&to=86f685e9552acaf8d5db630161e91752348422a1&stat=instructions:u


Full diff: https://github.com/llvm/llvm-project/pull/71865.diff

1 Files Affected:

  • (modified) llvm/lib/Support/StringRef.cpp (+12-13)
diff --git a/llvm/lib/Support/StringRef.cpp b/llvm/lib/Support/StringRef.cpp
index feee47ca693b251..be6e340ff0d5901 100644
--- a/llvm/lib/Support/StringRef.cpp
+++ b/llvm/lib/Support/StringRef.cpp
@@ -13,7 +13,6 @@
 #include "llvm/ADT/StringExtras.h"
 #include "llvm/ADT/edit_distance.h"
 #include "llvm/Support/Error.h"
-#include <bitset>
 
 using namespace llvm;
 
@@ -236,12 +235,12 @@ size_t StringRef::rfind_insensitive(StringRef Str) const {
 /// Note: O(size() + Chars.size())
 StringRef::size_type StringRef::find_first_of(StringRef Chars,
                                               size_t From) const {
-  std::bitset<1 << CHAR_BIT> CharBits;
+  bool CharBits[1 << CHAR_BIT] = {false};
   for (char C : Chars)
-    CharBits.set((unsigned char)C);
+    CharBits[(unsigned)C] = true;
 
   for (size_type i = std::min(From, Length), e = Length; i != e; ++i)
-    if (CharBits.test((unsigned char)Data[i]))
+    if (CharBits[(unsigned char)Data[i]])
       return i;
   return npos;
 }
@@ -258,12 +257,12 @@ StringRef::size_type StringRef::find_first_not_of(char C, size_t From) const {
 /// Note: O(size() + Chars.size())
 StringRef::size_type StringRef::find_first_not_of(StringRef Chars,
                                                   size_t From) const {
-  std::bitset<1 << CHAR_BIT> CharBits;
+  bool CharBits[1 << CHAR_BIT] = {false};
   for (char C : Chars)
-    CharBits.set((unsigned char)C);
+    CharBits[(unsigned)C] = true;
 
   for (size_type i = std::min(From, Length), e = Length; i != e; ++i)
-    if (!CharBits.test((unsigned char)Data[i]))
+    if (!CharBits[(unsigned char)Data[i]])
       return i;
   return npos;
 }
@@ -274,12 +273,12 @@ StringRef::size_type StringRef::find_first_not_of(StringRef Chars,
 /// Note: O(size() + Chars.size())
 StringRef::size_type StringRef::find_last_of(StringRef Chars,
                                              size_t From) const {
-  std::bitset<1 << CHAR_BIT> CharBits;
+  bool CharBits[1 << CHAR_BIT] = {false};
   for (char C : Chars)
-    CharBits.set((unsigned char)C);
+    CharBits[(unsigned)C] = true;
 
   for (size_type i = std::min(From, Length) - 1, e = -1; i != e; --i)
-    if (CharBits.test((unsigned char)Data[i]))
+    if (CharBits[(unsigned char)Data[i]])
       return i;
   return npos;
 }
@@ -299,12 +298,12 @@ StringRef::size_type StringRef::find_last_not_of(char C, size_t From) const {
 /// Note: O(size() + Chars.size())
 StringRef::size_type StringRef::find_last_not_of(StringRef Chars,
                                                  size_t From) const {
-  std::bitset<1 << CHAR_BIT> CharBits;
+  bool CharBits[1 << CHAR_BIT] = {false};
   for (char C : Chars)
-    CharBits.set((unsigned char)C);
+    CharBits[(unsigned)C] = true;
 
   for (size_type i = std::min(From, Length) - 1, e = -1; i != e; --i)
-    if (!CharBits.test((unsigned char)Data[i]))
+    if (!CharBits[(unsigned char)Data[i]])
       return i;
   return npos;
 }

@aganea (Member) commented Nov 10, 2023

Hello @serge-sans-paille! Could you please explain in a bit more detail why you think this is a good optimization? Fewer instructions retired is not necessarily better if it increases memory pressure. It might be nothing, but before this change the whole bitset fit in a single cache line; now it takes 4 cache lines to do the same thing. Again, this might be nothing, but it means 3 extra cache lines being evicted elsewhere to execute this function. Also, I'd like to be convinced by the profile link you've attached, but the difference is so small that it could be normal fluctuation (the 'cycles' stat is also higher). One more point: the previous bit-shifting code in bitset might run better on AArch64 than on x64, so it'd be interesting to see the difference there too. Thanks!
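To make the footprint concern concrete, a small sketch (not code from the PR; the sizes assume CHAR_BIT == 8, sizeof(bool) == 1, 64-byte cache lines, and a typical std::bitset layout):

#include <bitset>
#include <climits>

// One byte per flag: 256 bytes, i.e. four 64-byte cache lines.
static_assert(sizeof(bool[1 << CHAR_BIT]) == (1 << CHAR_BIT), "LUT footprint");
// One bit per flag: 32 bytes on common implementations, half a cache line.
static_assert(sizeof(std::bitset<1 << CHAR_BIT>) <= 64, "bitset footprint");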

@serge-sans-paille serge-sans-paille force-pushed the perf/find-last-of branch 2 times, most recently from fd5224f to 16b0bf6 Compare November 13, 2023 08:00
@llvmbot llvmbot added clang Clang issues not falling into any other category backend:AMDGPU clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' mlir:core MLIR Core Infrastructure mlir backend:SPIR-V labels Nov 13, 2023
@serge-sans-paille serge-sans-paille force-pushed the perf/find-last-of branch 2 times, most recently from 731953e to 2e48f5a Compare November 13, 2023 10:16
@serge-sans-paille (Collaborator, Author)

@aganea I submitted a different approach that works on any SSE2-powered machine, and it doesn't have the same stack-usage issue as the previous one.

@kazutakahirata (Contributor) left a comment

This patch looks good to me. I am not sure if we want to implement architecture-specific optimizations like __SSE2__ at the source-code level. But then I don't want that to block the landing of this patch. Is there any way you could separate these two patches? Thanks!

EDIT: I guess "this patch" isn't clear. I am referring to the first one that replaces "." with '.', etc.
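For context, a hedged illustration of that first change (the call site below is made up, not taken from the diff): when the needle is a single character, the char overload avoids the set-based scan shown in the diff above.

#include "llvm/ADT/StringRef.h"
#include <cassert>

void example() {
  llvm::StringRef Path = "foo/bar/baz.cpp";
  // StringRef overload: goes through the set-based implementation above.
  size_t Before = Path.find_last_of(".");
  // char overload: a plain reverse scan for one character, same result.
  size_t After = Path.find_last_of('.');
  assert(Before == After && Before == 11);
}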

} while (Sz);
return npos;
}
#endif
Collaborator

Can this be abstracted or made out-of-line?
I'm wondering about the scalability of HW-specific intrinsics in-line (anticipating the incoming #elif defined(ARM64)...)

Collaborator Author

I don't think using https://github.com/xtensor-stack/xsimd is an option :-) And https://en.cppreference.com/w/cpp/experimental/simd/simd is still not a thing :-/

We already have some bits of SSE2 in clang and llvm. OK to factor this into a function.

Collaborator

Yeah, I'm fine with SSE2; I was just trying to keep the specialized implementation out-of-line. What you have now looks good, I think.
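For illustration, the out-of-line pattern being discussed could look roughly like this (names are hypothetical and this is only a sketch, not the PR's final shape):

#include "llvm/ADT/StringRef.h"
#include <cstddef>

using llvm::StringRef;

#ifdef __SSE2__
// The intrinsics live in a separate static helper (defined elsewhere in the
// file), so the generic code below stays free of HW-specific details.
static StringRef::size_type findLastOfTwoSpecialized(const char *Data,
                                                      size_t Sz, char C0,
                                                      char C1);
#endif

static StringRef::size_type findLastOfTwo(const char *Data, size_t Sz, char C0,
                                          char C1) {
#ifdef __SSE2__
  return findLastOfTwoSpecialized(Data, Sz, C0, C1);
#else
  // Portable fallback: plain reverse scan.
  for (size_t I = Sz; I-- > 0;)
    if (Data[I] == C0 || Data[I] == C1)
      return I;
  return StringRef::npos;
#endif
}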


github-actions bot commented Nov 15, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@serge-sans-paille (Collaborator, Author)

This patch looks good to me. I am not sure if we want to implement architecture-specific optimizations like __SSE2__ at the source-code level.

We already have these for critical parts of the code (in the clang lexer, for instance). I only went that way because this particular function appeared in a profile of a teammate's machine during code indexing (and because it was very entertaining to implement).

EDIT: I guess "this patch" isn't clear. I am referring to the first one that replaces "." with '.', etc.

I've landed the first commit as 33b5158; it was a no-brainer.

@serge-sans-paille serge-sans-paille force-pushed the perf/find-last-of branch 2 times, most recently from cc47899 to 6933b73 Compare November 15, 2023 14:40
#ifdef __SSE2__

StringRef::size_type vectorized_find_last_of_specialized(const char *Data,
size_t Sz, char C0,
Collaborator

static?

Comment on lines 277 to 278
StringRef::size_type vectorized_find_last_of_specialized(const char *Data,
size_t Sz, char C0,
@joker-eph (Collaborator) Nov 15, 2023

Suggested change
StringRef::size_type vectorized_find_last_of_specialized(const char *Data,
size_t Sz, char C0,
static StringRef::size_type
vectorized_find_last_of_specialized(const char *Data, size_t Sz, char C0,

__m128i Needle1 = _mm_set1_epi8(C1);
do {
Sz = Sz < 16 ? 0 : Sz - 16;
__m128i Buffer = _mm_loadu_si128((const __m128i *)(Data + Sz));
Member

This load instruction will generate an out-of-bounds access if strlen(Data)<15. Do you think you can use the "slow" algorithm when that happens?
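A hedged sketch of one way to address this (illustrative names, not the PR's final code): only take the 16-byte vector path while a full block remains inside the buffer, and let a scalar loop handle the tail.

#include <emmintrin.h>
#include <cstddef>

static size_t findLastOfTwoSSE2(const char *Data, size_t Sz, char C0, char C1) {
  const __m128i Needle0 = _mm_set1_epi8(C0);
  const __m128i Needle1 = _mm_set1_epi8(C1);
  while (Sz >= 16) { // never read before Data[0] or past the original end
    Sz -= 16;
    __m128i Buffer = _mm_loadu_si128((const __m128i *)(Data + Sz));
    __m128i Match = _mm_or_si128(_mm_cmpeq_epi8(Buffer, Needle0),
                                 _mm_cmpeq_epi8(Buffer, Needle1));
    if (int Mask = _mm_movemask_epi8(Match))
      return Sz + 31 - __builtin_clz((unsigned)Mask); // highest matching byte
  }
  for (size_t I = Sz; I-- > 0;) // scalar tail for the last < 16 bytes
    if (Data[I] == C0 || Data[I] == C1)
      return I;
  return (size_t)-1; // i.e. StringRef::npos
}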

@@ -623,8 +657,7 @@ hash_code llvm::hash_value(StringRef S) {
}

unsigned DenseMapInfo<StringRef, void>::getHashValue(StringRef Val) {
assert(Val.data() != getEmptyKey().data() &&
"Cannot hash the empty key!");
assert(Val.data() != getEmptyKey().data() && "Cannot hash the empty key!");
Member

Can you please commit all these NFC formatting changes separately?

@@ -372,7 +402,8 @@ size_t StringRef::count(StringRef Str) const {
size_t Count = 0;
size_t Pos = 0;
size_t N = Str.size();
// TODO: For an empty `Str` we return 0 for legacy reasons. Consider changing
// TODO: For an empty `Str` we return 0 for legacy reasons. Consider
// changing
Member

Text should go with the following line.

@serge-sans-paille serge-sans-paille force-pushed the perf/find-last-of branch 2 times, most recently from 00f640e to e8ed042 Compare November 16, 2023 17:04
@serge-sans-paille (Collaborator, Author)

@aganea / @joker-eph / @kazutakahirata: I actually switched the implementation to something that's almost as efficient (no vector load required) and works across all architectures. (Not to mention it also increases my geekness karma.)

@@ -268,17 +268,47 @@ StringRef::size_type StringRef::find_first_not_of(StringRef Chars,
return npos;
}

// See https://graphics.stanford.edu/~seander/bithacks.html#ValueInWord
static inline uint64_t haszero(uint64_t v) {
return ((v)-0x0101010101010101UL) & ~(v) & 0x8080808080808080UL;
Member

ULL instead of UL?

return ((v)-0x0101010101010101UL) & ~(v) & 0x8080808080808080UL;
}
static inline uint64_t hasvalue(uint64_t x, char n) {
return haszero((x) ^ (~0UL / 255 * (n)));
Member

ULL

Member

You'll need ~0ULL here, otherwise things won't work as expected. x is a 64-bit value, and the goal of the algorithm is to spread n out to all bytes. The initial algorithm in the link you've provided was written for 32-bit values. With just ~0UL, only the lowest 32 bits of the 64-bit value will be filled.
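A small illustration of the width issue (the concrete values assume an LLP64 target such as 64-bit Windows, where unsigned long is 32 bits; on LP64 Linux ~0UL is already 64 bits wide and the two spellings agree):

#include <cstdint>

// On LLP64, ~0UL is 0xFFFFFFFF, so the replication constant only reaches the
// low four bytes once the expression is widened to 64 bits:
//   uint64_t(~0UL / 255 * 'x')  == 0x0000000078787878
// With ~0ULL the needle byte is broadcast into all eight bytes, which is what
// hasvalue() needs to scan a full 64-bit word:
//   uint64_t(~0ULL / 255 * 'x') == 0x7878787878787878
static_assert(~0ULL / 255 == 0x0101010101010101ULL,
              "the 64-bit byte-replication constant hasvalue() relies on");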

Member

This hasn't been addressed yet.

if (Check)
return Sz + 7 - llvm::countl_zero(Check) / 8;
} while (Sz);
return -1;
Member

return npos

/// find_last_of - Find the last character in the string that is in \arg C,
/// or npos if not found.
///
/// Note: O(size() + Chars.size())
StringRef::size_type StringRef::find_last_of(StringRef Chars,
size_t From) const {
size_type Sz = std::min(From, Length);

if (Chars.size() == 2)
Member

Could you please elaborate a bit more on the specific use cases that you hit in clangd that justify this if (Chars.size() == 2)? At least a unit test along with a little comment about the "why" will help future readers of this patch.
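A hedged sketch of the kind of unit test being asked for, in the gtest style used by llvm/unittests/ADT/StringRefTest.cpp (the actual test added to the PR may differ):

#include "llvm/ADT/StringRef.h"
#include "gtest/gtest.h"

TEST(StringRefTest, FindLastOfTwoChars) {
  // Two-character needle: the clangd-motivated find_last_of("\r\n") case.
  llvm::StringRef Lines("foo\r\nbar");
  EXPECT_EQ(4u, Lines.find_last_of("\r\n"));
  // No match returns npos.
  EXPECT_EQ(llvm::StringRef::npos, llvm::StringRef("foobar").find_last_of("\r\n"));
  // A haystack longer than one machine word exercises the vectorized path.
  llvm::StringRef Long("aaaaaaaaaaaaaaaaaaaa\nbbb");
  EXPECT_EQ(20u, Long.find_last_of("\r\n"));
}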

@serge-sans-paille (Collaborator, Author)

It turns out StringRef::find_last_of was missing unit tests; I added them.

StringRef::size_type StringRef::find_last_of(StringRef Chars,
size_t From) const {
size_type Sz = std::min(From, Length);

if (Chars.size() == 2) {
Member

Before going further with this PR, I'd like to fully understand which specific call site of find_last_of in ClangD is improved by this change, and what user flow triggers it. Is this for .find_last_of("\r\n")? Or for .find_last_of("/\\")? Something else? The situation might change with different/newer CPU architectures, which is why a specific repro would be good to add to the PR description.

You also said above that this function appeared on a teammate's profile during code indexing; are you able to test before/after this patch to see how the situation is improved? It would be really nice to have numbers along with this PR, so that others can compare.

Collaborator Author

yeah, that's the find_last_of("\r\n") part

Member

Is there any precedent for such micro-optimizations in core libraries? I still feel that this optimization should go in a ClangD utility file instead. @dwblaikie @MaskRay @joker-eph Do you have an opinion on this?

Collaborator

It's a tradeoff between impact and cost of maintenance.

Moving it to clangd does not reduce the maintenance burden for the LLVM project though; are we concerned with StringRef maintenance here?
I would say that, on the contrary, moving it to clangd runs the risk that Bolt or LLDB reimplement the same thing, not knowing clangd has it.
So I'd say either the implementation isn't overly complex for long-term maintenance here compared to the benefits, or it may not belong to the project?

I don't find a mention of the current perf impact on clangd though.
In #71865 (comment) you mention a micro-benchmark, @serge-sans-paille, but not the result as far as I can see?

@arsenm (Contributor) commented Mar 6, 2024

Is this still relevant?

@serge-sans-paille (Collaborator, Author)

Yeah, it probably is. I just switched priorities; I'll do another iteration on that one.

@serge-sans-paille (Collaborator, Author)

Code updated, and I ran the following micro-benchmark:

$ cat a.cpp
#include "llvm/ADT/StringRef.h"
#include <cstdlib>

int main(int argc, char ** argv) {
  long n = atol(argv[2]);
  char * data = argv[1];
  llvm::StringRef S(data);
  size_t k = 0;
  for (long i = 0; i < n; ++i)
    k+= S.find_last_of("\r\n");
  return k;
}
$ clang++ a.cpp -fPIC -o a -O2 -lLLVM -L$PWD/lib
$ LD_LIBRARY_PATH=$PWD/lib ./a "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA" 10000000

which is as stupid as a micro-benchmark can be. This patch makes it run twice as fast as the non-patched version.

@serge-sans-paille (Collaborator, Author)

the build issue seems unrelated (?)

@arsenm (Contributor) commented Mar 7, 2024

the build issue seems unrelated (?)

The windows bot has been pure noise for many weeks

@joker-eph (Collaborator)

the build issue seems unrelated (?)

The windows bot has been pure noise for many weeks

This is factually incorrect: https://lab.llvm.org/buildbot/#/builders/271
Your experience is likely outdated.

…l case of 2 chars

Almost all uses of StringRef::find_last_of in Clang/LLVM pass a Needle of 2 elements, which can be optimized using a generic vectorized algorithm and a few bit hacks.
@dwblaikie (Collaborator)

Does this micro-optimization adversely affect cases that aren't covered by it (eg: does checking for length 2 slow down non-length-2 cases perceptibly?)

0x7F7F7F7F7F7F7F7FULL);
}
static inline uint64_t hasvalue(uint64_t x, char n) {
return haszero((x) ^ (~0ULL / 255 * (n)));
Collaborator

Maybe it's just me, but I think haszero((x) ^ ((uint64_t)n * 0x0101010101010101)) makes it clearer that n is just repeated over the entire uint64_t.
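Both spellings produce the same constant; a quick compile-time check of the equivalence (a standalone sketch, not part of the patch):

#include <cstdint>

// ~0ULL / 255 is exactly 0x0101010101010101, so multiplying by the
// (zero-extended) needle byte repeats it across the whole 64-bit word.
static_assert(~0ULL / 255 == 0x0101010101010101ULL, "byte-replication constant");
constexpr uint64_t broadcast(char N) {
  return static_cast<unsigned char>(N) * 0x0101010101010101ULL;
}
static_assert(broadcast('\n') == 0x0A0A0A0A0A0A0A0AULL, "each byte holds 0x0A");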

while (Sz >= 8) {
Sz -= 8;
uint64_t Buffer = 0;
std::memcpy((void *)&Buffer, (void *)(Data + Sz), sizeof(Buffer));
Collaborator

Isn't there some already existing utility to do unaligned loads more explicitly?
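One candidate, assuming I am reading llvm/Support/Endian.h correctly (a sketch, not necessarily what the PR should use): read64le performs an unaligned 8-byte load, though note it also pins the byte order to little-endian, whereas the memcpy above reads in host order.

#include "llvm/Support/Endian.h"
#include <cstddef>
#include <cstdint>

static uint64_t loadWordAt(const char *Data, size_t Sz) {
  // Unaligned little-endian load of the eight bytes starting at Data + Sz.
  return llvm::support::endian::read64le(Data + Sz);
}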
