[llvm] Improve implementation of StringRef::find_last_of and cie #71865
base: main
Conversation
@llvm/pr-subscribers-llvm-adt @llvm/pr-subscribers-llvm-support
Author: None (serge-sans-paille)
Changes: Using a simple LUT implies fewer operations than a bitset, at the expense of slightly more stack usage. Mandatory performance report: https://llvm-compile-time-tracker.com/compare.php?from=3f9d385e5844f2f1f144305037cfc904789c6187&to=86f685e9552acaf8d5db630161e91752348422a1&stat=instructions:u
Full diff: https://github.com/llvm/llvm-project/pull/71865.diff (1 file affected)
diff --git a/llvm/lib/Support/StringRef.cpp b/llvm/lib/Support/StringRef.cpp
index feee47ca693b251..be6e340ff0d5901 100644
--- a/llvm/lib/Support/StringRef.cpp
+++ b/llvm/lib/Support/StringRef.cpp
@@ -13,7 +13,6 @@
#include "llvm/ADT/StringExtras.h"
#include "llvm/ADT/edit_distance.h"
#include "llvm/Support/Error.h"
-#include <bitset>
using namespace llvm;
@@ -236,12 +235,12 @@ size_t StringRef::rfind_insensitive(StringRef Str) const {
/// Note: O(size() + Chars.size())
StringRef::size_type StringRef::find_first_of(StringRef Chars,
size_t From) const {
- std::bitset<1 << CHAR_BIT> CharBits;
+ bool CharBits[1 << CHAR_BIT] = {false};
for (char C : Chars)
- CharBits.set((unsigned char)C);
+ CharBits[(unsigned char)C] = true;
for (size_type i = std::min(From, Length), e = Length; i != e; ++i)
- if (CharBits.test((unsigned char)Data[i]))
+ if (CharBits[(unsigned char)Data[i]])
return i;
return npos;
}
@@ -258,12 +257,12 @@ StringRef::size_type StringRef::find_first_not_of(char C, size_t From) const {
/// Note: O(size() + Chars.size())
StringRef::size_type StringRef::find_first_not_of(StringRef Chars,
size_t From) const {
- std::bitset<1 << CHAR_BIT> CharBits;
+ bool CharBits[1 << CHAR_BIT] = {false};
for (char C : Chars)
- CharBits.set((unsigned char)C);
+ CharBits[(unsigned char)C] = true;
for (size_type i = std::min(From, Length), e = Length; i != e; ++i)
- if (!CharBits.test((unsigned char)Data[i]))
+ if (!CharBits[(unsigned char)Data[i]])
return i;
return npos;
}
@@ -274,12 +273,12 @@ StringRef::size_type StringRef::find_first_not_of(StringRef Chars,
/// Note: O(size() + Chars.size())
StringRef::size_type StringRef::find_last_of(StringRef Chars,
size_t From) const {
- std::bitset<1 << CHAR_BIT> CharBits;
+ bool CharBits[1 << CHAR_BIT] = {false};
for (char C : Chars)
- CharBits.set((unsigned char)C);
+ CharBits[(unsigned char)C] = true;
for (size_type i = std::min(From, Length) - 1, e = -1; i != e; --i)
- if (CharBits.test((unsigned char)Data[i]))
+ if (CharBits[(unsigned char)Data[i]])
return i;
return npos;
}
@@ -299,12 +298,12 @@ StringRef::size_type StringRef::find_last_not_of(char C, size_t From) const {
/// Note: O(size() + Chars.size())
StringRef::size_type StringRef::find_last_not_of(StringRef Chars,
size_t From) const {
- std::bitset<1 << CHAR_BIT> CharBits;
+ bool CharBits[1 << CHAR_BIT] = {false};
for (char C : Chars)
- CharBits.set((unsigned char)C);
+ CharBits[(unsigned char)C] = true;
for (size_type i = std::min(From, Length) - 1, e = -1; i != e; --i)
- if (!CharBits.test((unsigned char)Data[i]))
+ if (!CharBits[(unsigned char)Data[i]])
return i;
return npos;
}
Hello @serge-sans-paille! Could you please explain in a bit more detail why you think this is a good optimization? Fewer instructions retired is not necessarily better if it increases memory pressure. It might be nothing, but before, the whole bitset fit in a single cache line; now it takes 4 cache lines to do the same thing. Again, this might be nothing, but that means 3 extra cache lines being evicted elsewhere to execute this function. Also, I'd like to be convinced by the profile link you've attached, but the difference is so small that it could be normal fluctuation (the 'cycles' stat is also higher). One more point: the previous bit-shifting code in bitset might run better on AArch64 than on x64, so it'd be interesting to see the difference there too. Thanks!
Force-pushed from fd5224f to 16b0bf6.
Force-pushed from 731953e to 2e48f5a.
@aganea I submitted a different approach that will work with any SSE2-powered machine, and it doesn't have the same stack-usage issue as the previous one.
This patch looks good to me. I am not sure we want to implement architecture-specific optimizations like __SSE2__ at the source-code level, but I don't want that to block the landing of this patch. Is there any way you could separate these two patches? Thanks!
EDIT: I guess "this patch" isn't clear. I am referring to the first one that replaces "." with '.', etc.
llvm/lib/Support/StringRef.cpp
Outdated
  } while (Sz);
  return npos;
}
#endif
Can this be abstracted or made out-of-line? I'm wondering about the scalability of HW-specific intrinsics inline (anticipating the incoming #elif defined(ARM64)...).
I don't think using https://github.com/xtensor-stack/xsimd is an option :-) And https://en.cppreference.com/w/cpp/experimental/simd/simd is still not a thing :-/
We already have some bits of SSE2 in clang and llvm. OK to factor this into a function.
Yeah, I'm fine with SSE2; I was just trying to keep the specialized implementation out-of-line. What you have now looks good, I think.
Force-pushed from 2e48f5a to 5ebe6d8.
✅ With the latest revision this PR passed the C/C++ code formatter.
We already have these for critical parts of the code (in the Clang lexer, for instance). I only went that way because this particular function appeared in a profile of a teammate's machine during code indexing (and because it was very entertaining to implement).
I've landed the first commit as 33b5158; it was a no-brainer.
Force-pushed from cc47899 to 6933b73.
llvm/lib/Support/StringRef.cpp
Outdated
#ifdef __SSE2__

StringRef::size_type vectorized_find_last_of_specialized(const char *Data,
                                                         size_t Sz, char C0,
static?
llvm/lib/Support/StringRef.cpp
Outdated
StringRef::size_type vectorized_find_last_of_specialized(const char *Data,
                                                         size_t Sz, char C0,
Suggested change:
-StringRef::size_type vectorized_find_last_of_specialized(const char *Data,
-                                                         size_t Sz, char C0,
+static StringRef::size_type
+vectorized_find_last_of_specialized(const char *Data, size_t Sz, char C0,
llvm/lib/Support/StringRef.cpp
Outdated
  __m128i Needle1 = _mm_set1_epi8(C1);
  do {
    Sz = Sz < 16 ? 0 : Sz - 16;
    __m128i Buffer = _mm_loadu_si128((const __m128i *)(Data + Sz));
This load instruction will generate an out-of-bounds access if strlen(Data) < 15. Do you think you can use the "slow" algorithm when that happens?
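One way to address this review comment is to only take the 16-byte vector path while at least 16 bytes remain, and finish (or run entirely, on short inputs) with a scalar scan. The sketch below is hypothetical (the helper name `find_last_of_2` is mine, not the patch's), assumes SSE2 and the GCC/Clang `__builtin_clz` builtin, and falls back to scalar code elsewhere:

```cpp
#include <cstddef>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

// Hypothetical sketch: never issue a 16-byte load unless 16 bytes remain,
// so _mm_loadu_si128 cannot read before the start of the buffer.
size_t find_last_of_2(const char *Data, size_t Sz, char C0, char C1) {
#ifdef __SSE2__
  const __m128i Needle0 = _mm_set1_epi8(C0);
  const __m128i Needle1 = _mm_set1_epi8(C1);
  while (Sz >= 16) {
    Sz -= 16;
    __m128i Buffer = _mm_loadu_si128((const __m128i *)(Data + Sz));
    int Mask = _mm_movemask_epi8(_mm_or_si128(_mm_cmpeq_epi8(Buffer, Needle0),
                                              _mm_cmpeq_epi8(Buffer, Needle1)));
    // Bit i of Mask corresponds to Data[Sz + i]; take the highest set bit.
    if (Mask)
      return Sz + 31 - __builtin_clz((unsigned)Mask);
  }
#endif
  // Scalar tail, also the portable fallback when SSE2 is unavailable.
  while (Sz--)
    if (Data[Sz] == C0 || Data[Sz] == C1)
      return Sz;
  return (size_t)-1; // npos
}
```

For inputs shorter than 16 bytes the vector loop never executes, which is exactly the "slow algorithm" escape hatch the reviewer asks for.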
llvm/lib/Support/StringRef.cpp
Outdated
@@ -623,8 +657,7 @@ hash_code llvm::hash_value(StringRef S) {
}

unsigned DenseMapInfo<StringRef, void>::getHashValue(StringRef Val) {
-  assert(Val.data() != getEmptyKey().data() &&
-         "Cannot hash the empty key!");
+  assert(Val.data() != getEmptyKey().data() && "Cannot hash the empty key!");
Can you please commit all these NFC formatting changes separately?
llvm/lib/Support/StringRef.cpp
Outdated
@@ -372,7 +402,8 @@ size_t StringRef::count(StringRef Str) const {
  size_t Count = 0;
  size_t Pos = 0;
  size_t N = Str.size();
-  // TODO: For an empty `Str` we return 0 for legacy reasons. Consider changing
+  // TODO: For an empty `Str` we return 0 for legacy reasons. Consider
+  // changing
Text should go with the following line.
Force-pushed from 00f640e to e8ed042.
@aganea / @joker-eph / @kazutakahirata: I actually switched the implementation to something that's almost as efficient (no vector load required) and works across all architectures. (Not to mention it also increases my geekness karma.)
llvm/lib/Support/StringRef.cpp
Outdated
@@ -268,17 +268,47 @@ StringRef::size_type StringRef::find_first_not_of(StringRef Chars,
  return npos;
}

// See https://graphics.stanford.edu/~seander/bithacks.html#ValueInWord
static inline uint64_t haszero(uint64_t v) {
  return ((v)-0x0101010101010101UL) & ~(v) & 0x8080808080808080UL;
ULL instead of UL?
llvm/lib/Support/StringRef.cpp
Outdated
  return ((v)-0x0101010101010101UL) & ~(v) & 0x8080808080808080UL;
}
static inline uint64_t hasvalue(uint64_t x, char n) {
  return haszero((x) ^ (~0UL / 255 * (n)));
ULL
You'll need ~0ULL here, otherwise things won't work as expected. x is a 64-bit value, and the goal of the algorithm is to spread n out to all bytes. The initial algorithm in the link you've provided was written for 32-bit values. With just ~0UL, only the lowest 32 bits of the 64-bit value will be filled out.
This hasn't been addressed yet.
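For reference, here is a hedged sketch of the two bit hacks with the ULL suffixes the reviewers are asking for (an illustration of the technique from the linked bithacks page, not the PR's final code):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// haszero flags every zero byte of v with 0x80 in that byte's position;
// hasvalue reuses it by XORing x with n replicated into all eight bytes.
// Note that ~0ULL / 255 == 0x0101010101010101ULL, the byte-replication
// constant, which is why the ULL suffix matters on 32-bit-long platforms.
inline uint64_t haszero(uint64_t v) {
  return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}
inline uint64_t hasvalue(uint64_t x, char n) {
  return haszero(x ^ (~0ULL / 255 * (uint8_t)n));
}
```

A quick sanity check: loading the bytes "abcdefgh" into a uint64_t, `hasvalue(w, 'c')` is nonzero while `hasvalue(w, 'z')` is zero, regardless of endianness, since the hack only asks "is this byte present anywhere in the word".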
llvm/lib/Support/StringRef.cpp
Outdated
    if (Check)
      return Sz + 7 - llvm::countl_zero(Check) / 8;
  } while (Sz);
  return -1;
return npos
llvm/lib/Support/StringRef.cpp
Outdated
/// find_last_of - Find the last character in the string that is in \arg C,
/// or npos if not found.
///
/// Note: O(size() + Chars.size())
StringRef::size_type StringRef::find_last_of(StringRef Chars,
                                             size_t From) const {
  size_type Sz = std::min(From, Length);

  if (Chars.size() == 2)
Could you please elaborate a bit more on the specific use cases that you hit in clangd that justify this if (Chars.size() == 2)? At least a unit test, along with a little comment about the "why", will help future readers of this patch.
Force-pushed from e8ed042 to cbfb801.
It turns out
StringRef::size_type StringRef::find_last_of(StringRef Chars,
                                             size_t From) const {
  size_type Sz = std::min(From, Length);

  if (Chars.size() == 2) {
Before going further with this PR, I'd like to fully understand which specific call site of find_last_of in clangd is improved by this change, and what user flow triggers it. Is this for .find_last_of("\r\n")? Or for .find_last_of("/\\")? Something else? The situation might change with different/newer CPU architectures; that's why a specific repro would be good to add to the PR description.
You also said above that this function appeared in a teammate's profile during code indexing. Are you able to test before/after this patch and see how the situation is improved? It would be really nice to have numbers along with this PR, so that others can compare.
Yeah, that's the find_last_of("\r\n") part.
Is there any precedent for such a micro-optimization in core libraries? I still feel that this optimization should go in a clangd utility file instead. @dwblaikie @MaskRay @joker-eph Do you have an opinion on this?
It's a tradeoff between impact and cost of maintenance.
Moving it to clangd does not reduce the maintenance burden for the LLVM project, though; are we concerned with StringRef maintenance here?
I would say that, on the contrary, moving it to clangd runs the risk that Bolt or LLDB reimplement the same thing, not knowing clangd has it.
So I'd say either the implementation isn't overly complex for the long-term maintenance here compared to the benefits, or it may not belong in the project at all.
I don't find a mention of the current perf impact on clangd, though. In #71865 (comment) you mention a micro-benchmark, @serge-sans-paille, but not the result, as far as I can see.
Is this still relevant?
Yeah, it probably is. I just switched priorities; I'll do another iteration on that one.
Force-pushed from cbfb801 to 75e87c9.
Code updated, and I ran the following micro benchmark:

which is as stupid as a micro benchmark can be. This patch makes it run twice as fast as the non-patched version.
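The benchmark snippet itself did not survive the page capture. A hypothetical stand-in (the helper name and sizes are mine, and it uses std::string::find_last_of rather than llvm::StringRef, purely to stay self-contained) of what such a "stupid" loop could look like:

```cpp
#include <chrono>
#include <cstdio>
#include <string>

// Hypothetical micro benchmark: repeatedly search a 1 MiB buffer for the
// last occurrence of "\r\n", mirroring the clangd hot path discussed above.
// Returns the found position so the search can't be optimized away.
size_t bench_find_last_of(size_t Iters) {
  std::string Buf(1 << 20, 'x');
  Buf[100] = '\n'; // sole match near the front: worst case for a reverse scan
  size_t Sink = 0;
  auto Start = std::chrono::steady_clock::now();
  for (size_t i = 0; i < Iters; ++i)
    Sink += Buf.find_last_of("\r\n");
  auto End = std::chrono::steady_clock::now();
  std::printf("%zu iters in %lld us\n", Iters,
              (long long)std::chrono::duration_cast<std::chrono::microseconds>(
                  End - Start)
                  .count());
  return Sink / Iters;
}
```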
The build issue seems unrelated (?)
The Windows bot has been pure noise for many weeks.
This is factually incorrect: https://lab.llvm.org/buildbot/#/builders/271
…l case of 2 chars

Almost all usage of StringRef::find_last_of in Clang/LLVM uses a Needle of 2 elements, which can be optimized using a generic vectorized algorithm and a few bit hacks.
Force-pushed from 75e87c9 to 2770064.
Does this micro-optimization adversely affect cases that aren't covered by it (e.g., does checking for length 2 slow down non-length-2 cases perceptibly)?
                  0x7F7F7F7F7F7F7F7FULL);
}
static inline uint64_t hasvalue(uint64_t x, char n) {
  return haszero((x) ^ (~0ULL / 255 * (n)));
Maybe it's just me, but I think haszero((x) ^ ((uint64_t)n * 0x0101010101010101)) makes it clearer that n is just repeated over the entire uint64_t.
  while (Sz >= 8) {
    Sz -= 8;
    uint64_t Buffer = 0;
    std::memcpy((void *)&Buffer, (void *)(Data + Sz), sizeof(Buffer));
Isn't there an already-existing utility to do unaligned loads more explicitly?
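For context on the memcpy idiom the patch uses: a hedged, self-contained sketch of a portable unaligned 64-bit load. (LLVM does ship endian-aware read helpers in llvm/Support/Endian.h; plain memcpy is used here only to keep the example dependency-free, and the helper name is mine.)

```cpp
#include <cstdint>
#include <cstring>

// std::memcpy is the standard-compliant way to read a uint64_t from a
// possibly misaligned address; optimizing compilers lower this to a single
// unaligned load instruction on architectures that support one.
inline uint64_t load_u64_unaligned(const void *P) {
  uint64_t V;
  std::memcpy(&V, P, sizeof(V));
  return V;
}
```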
Using a simple LUT implies fewer operations than a bitset, at the expense of slightly more stack usage.
Mandatory performance report:
https://llvm-compile-time-tracker.com/compare.php?from=3f9d385e5844f2f1f144305037cfc904789c6187&to=86f685e9552acaf8d5db630161e91752348422a1&stat=instructions:u