`<xstring>`: `__builtin_wmemcmp` is slow #2289
Comments
@StephanTLavavej tl;dr: We probably won't have to do anything about Since yesterday I got a bit scared that I made a mistake during my benchmarks. I did my due diligence today and re-ran the benchmark on my 5950X with:
Repeated runs of the benchmark produced results that were within <1% of each other.
However, I also gave the benchmark to a colleague with an Intel i7 9700k and it produced the exact same weird
But again this was also with Hyper-Threading, Turbo Boost, etc. enabled as I did initially. I've published my benchmark here: https://github.com/lhecker/stl-issue-2289 P.S: |
…1725) til::equals: At the time of writing wmemcmp() is not an intrinsic for MSVC, but the STL uses it to implement wide string comparisons. This produces 3x the assembly _per_ comparison and increases runtime by 2-3x for strings of medium length (16 characters) and 5x or more for long strings (128 characters or more). See: microsoft/STL#2289 Additionally a number of case insensitive, locale unaware helpers for prefix/suffix comparisons are introduced.
It is indeed: https://godbolt.org/z/Mdr843vxM It was reported as DevCom-1616711, but was closed as the duplicate of this issue. The implementation of Both can be vectorized with SSE2/AVX2 to compare by 16/32 bytes. |
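To make the "compare by 16 bytes" idea concrete, here is a minimal sketch of an equality-only check using SSE2. The helper name `bytes_equal_sse2` and the `SKETCH_HAS_SSE2` macro are made up for illustration and are not part of the STL or the CRT. It only answers "equal or not" (it does not compute the ordering that `wmemcmp` returns), and it falls back to plain `memcmp` when SSE2 is unavailable.

```cpp
#include <cstddef>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1
#else
#define SKETCH_HAS_SSE2 0
#endif

// Hypothetical helper: returns true if the two n-byte ranges are identical.
// n is a byte count, so for wide strings pass length * sizeof(wchar_t).
inline bool bytes_equal_sse2(const void* a, const void* b, std::size_t n) {
    const unsigned char* pa = static_cast<const unsigned char*>(a);
    const unsigned char* pb = static_cast<const unsigned char*>(b);
#if SKETCH_HAS_SSE2
    while (n >= 16) {
        // Unaligned 16-byte loads, then a bytewise equality compare.
        const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pa));
        const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pb));
        // movemask yields one bit per byte; 0xFFFF means all 16 bytes matched.
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) != 0xFFFF) {
            return false;
        }
        pa += 16;
        pb += 16;
        n -= 16;
    }
#endif
    // Tail of fewer than 16 bytes (or everything, without SSE2).
    return n == 0 || std::memcmp(pa, pb, n) == 0;
}
```

An ordering comparison would additionally have to locate the first mismatching element inside the 16-byte block, which this sketch deliberately leaves out.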
DevCom-1616711 is Closed - Fixed, do we still have this issue? |
Given that it doesn't fix the issue for strings >= 16 wide chars (2x slower; 5x slower for 64 wide chars), I feel it should be kept open. Basically, I believe the 1. point in the issue has been addressed, but not 2. and 3. |
Now that I'm testing this again, I believe there's additionally still some funky business going on in regards to inlinability of I suppose this is due to VSO-1332678 / VSO-685462 not being properly resolved yet? The extra copy the compiler makes, when passing 128-bit types on the x64 ABI, is probably messing things up here and the peephole optimizations can't fix this in hindsight if I had to take a guess... In any case, something, somewhere prevents some IMO valuable string related optimizations as the example shows. |
Original report:
Reported by @lhecker to an internal mailing list, quoted with his permission, edited for Markdown:
More analysis from me:
Here's where the STL calls `__builtin_wmemcmp`:
`STL/stl/inc/xstring`, lines 240 to 245 in d8f03cf
This is called by:
`STL/stl/inc/xstring`, lines 564 to 569 in d8f03cf
`STL/stl/inc/xstring`, lines 1441 to 1443 in d8f03cf
`STL/stl/inc/xstring`, lines 1715 to 1719 in d8f03cf
There are a few issues here:
1. `__builtin_wmemcmp` is slower than `wmemcmp` at runtime; that should be reported as a compiler bug.
2. For `wstring`/`wstring_view` relational comparison (`<`/`<=`/`>`/`>=`/`<=>`), we can work around that compiler bug by checking `is_constant_evaluated` and calling `wmemcmp` at runtime. This is less convenient than calling the builtin form unconditionally, but it's worth paying that code complexity for runtime performance (fixing a regression). As usual, compiler bug workarounds should be commented as `TRANSITION`.
3. For `wstring`/`wstring_view` equality comparison (`==`/`!=`), we need to retain a `constexpr`-compatible codepath, but at runtime, we can take advantage of the knowledge that we only need an "equal / non-equal" answer, for which `memcmp` is inherently faster than `wmemcmp` as Leonard measured. `_Traits_equal` is the right place to make this change. We still need to handle user-defined traits, but it should be possible to use `if constexpr` to detect when the traits are `char_traits<wchar_t>` or `char_traits<char16_t>`.
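Points 2 and 3 could look roughly like the sketch below. This is not the STL's code: `in_constant_evaluation`, `sketch_compare`, `sketch_equal`, and `is_bitwise_comparable_traits` are hypothetical names. Two substitutions are assumptions of the sketch: the constant-evaluated branch of `sketch_compare` uses a plain loop where the STL would keep the builtin, and the check goes through the `__builtin_is_constant_evaluated()` compiler builtin (GCC, Clang, MSVC) so that it also builds as C++17; in C++20 code, `std::is_constant_evaluated()` is the standard spelling.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>
#include <string>
#include <type_traits>

// Hypothetical wrapper around the compiler builtin that backs
// std::is_constant_evaluated() (assumption: GCC, Clang, or MSVC).
constexpr bool in_constant_evaluation() noexcept {
    return __builtin_is_constant_evaluated();
}

// Point 2: relational comparison. Constant evaluation takes a
// constexpr-friendly loop; run time calls the library wmemcmp.
constexpr int sketch_compare(const wchar_t* a, const wchar_t* b, std::size_t n) noexcept {
    if (in_constant_evaluation()) {
        for (std::size_t i = 0; i < n; ++i) {
            if (a[i] != b[i]) {
                return a[i] < b[i] ? -1 : 1;
            }
        }
        return 0;
    }
    return std::wmemcmp(a, b, n); // only the sign of the result is meaningful
}

// Point 3: detect the traits for which a raw byte comparison is
// equivalent to element-wise equality.
template <class Traits>
inline constexpr bool is_bitwise_comparable_traits =
    std::is_same_v<Traits, std::char_traits<wchar_t>> ||
    std::is_same_v<Traits, std::char_traits<char16_t>>;

// Equality only needs "equal / not equal", so memcmp suffices at run time.
// User-defined traits always go through Traits::compare.
template <class Traits>
constexpr bool sketch_equal(const typename Traits::char_type* a, std::size_t na,
                            const typename Traits::char_type* b, std::size_t nb) {
    if (na != nb) {
        return false;
    }
    if (na == 0) {
        return true; // also avoids passing null pointers to memcmp
    }
    if constexpr (is_bitwise_comparable_traits<Traits>) {
        if (!in_constant_evaluation()) {
            return std::memcmp(a, b, na * sizeof(typename Traits::char_type)) == 0;
        }
    }
    return Traits::compare(a, b, na) == 0;
}

// Both functions remain usable in constant expressions.
static_assert(sketch_compare(L"abc", L"abd", 3) < 0);
static_assert(sketch_equal<std::char_traits<wchar_t>>(L"abc", 3, L"abc", 3));
```

The early length check in `sketch_equal` mirrors the fact that two strings of different sizes can never be equal, so the byte comparison only ever runs on ranges of the same length.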