Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

<xstring>: __builtin_wmemcmp is slow #2289

Open
StephanTLavavej opened this issue Oct 21, 2021 · 5 comments
Open

<xstring>: __builtin_wmemcmp is slow #2289

StephanTLavavej opened this issue Oct 21, 2021 · 5 comments
Labels
performance Must go faster

Comments

@StephanTLavavej
Copy link
Member

Original report:

Reported by @lhecker to an internal mailing list, quoted with his permission, edited for Markdown:

Related to this it should also be noted that the default implementation for STL wide string comparisons uses __builtin_wmemcmp, which is about 4x (+/- something) slower than good old memcmp.

Here are some benchmarks with 128-byte long strings (meaning 64 chars for wide strings):

  • Run on (32 X 3400 MHz CPU s)
  • CPU Caches:
    • L1 Data 32 KiB (x16)
    • L1 Instruction 32 KiB (x16)
    • L2 Unified 512 KiB (x16)
    • L3 Unified 32768 KiB (x2)
Benchmark Time CPU Iterations
std_string_view 7.84 ns 7.85 ns 89600000
std_wstring_view 32.0 ns 32.1 ns 22400000
standard_wmemcmp 17.8 ns 18.0 ns 37333333
standard_memcmp 6.88 ns 6.80 ns 89600000

Most curiously, as you can see here, is that __builtin_wmemcmp seems to be significantly slower than regular wmemcmp (the former produces rather abstruse assembly). In any case, if you want performance for string equality tests, I can only suggest using memcmp instead of wmemcmp. Our STL should likely replace any use of wmemcmp for such pure equality tests.

More analysis from me:

Here's where the STL calls __builtin_wmemcmp:

STL/stl/inc/xstring

Lines 240 to 245 in d8f03cf

_NODISCARD static _CONSTEXPR17 int compare(_In_reads_(_Count) const _Elem* const _First1,
_In_reads_(_Count) const _Elem* const _First2, const size_t _Count) noexcept /* strengthened */ {
// compare [_First1, _First1 + _Count) with [_First2, ...)
#if _HAS_CXX17
if constexpr (is_same_v<_Elem, wchar_t>) {
return __builtin_wmemcmp(_First1, _First2, _Count);

This is called by:

STL/stl/inc/xstring

Lines 564 to 569 in d8f03cf

template <class _Traits>
constexpr bool _Traits_equal(_In_reads_(_Left_size) const _Traits_ptr_t<_Traits> _Left, const size_t _Left_size,
_In_reads_(_Right_size) const _Traits_ptr_t<_Traits> _Right, const size_t _Right_size) noexcept {
// compare [_Left, _Left + _Left_size) to [_Right, _Right + _Right_size) for equality using _Traits
return _Left_size == _Right_size && _Traits::compare(_Left, _Right, _Left_size) == 0;
}

STL/stl/inc/xstring

Lines 1441 to 1443 in d8f03cf

constexpr bool _Equal(const basic_string_view _Right) const noexcept {
return _Traits_equal<_Traits>(_Mydata, _Mysize, _Right._Mydata, _Right._Mysize);
}

STL/stl/inc/xstring

Lines 1715 to 1719 in d8f03cf

template <class _Elem, class _Traits>
_NODISCARD constexpr bool operator==(
const basic_string_view<_Elem, _Traits> _Lhs, const basic_string_view<_Elem, _Traits> _Rhs) noexcept {
return _Lhs._Equal(_Rhs);
}

There are a few issues here:

  1. If __builtin_wmemcmp is slower than wmemcmp at runtime, that should be reported as a compiler bug.
  2. For wstring/wstring_view relational comparison (</<=/>/>=/<=>), we can work around that compiler bug by checking is_constant_evaluated and calling wmemcmp at runtime. This is less convenient than calling the builtin form unconditionally, but it's worth paying that code complexity for runtime performance (fixing a regression). As usual, compiler bug workarounds should be commented as TRANSITION.
  3. For wstring/wstring_view equality comparison (==/!=), we need to retain a constexpr-compatible codepath, but at runtime, we can take advantage of the knowledge that we only need an "equal / non-equal" answer, for which memcmp is inherently faster than wmemcmp as Leonard measured.
    • I believe that _Traits_equal is the right place to make this change. We still need to handle user-defined traits, but it should be possible to use if constexpr to detect when the traits are char_traits<wchar_t> or char_traits<char16_t>.
@StephanTLavavej StephanTLavavej added the performance Must go faster label Oct 21, 2021
@lhecker
Copy link
Member

lhecker commented Oct 21, 2021

@StephanTLavavej tl;dr: We probably won't have to do anything about __builtin_wmemcmp for now. 🙂 This largely depends on whether others can reproduce the issue with that intrinsic.

Since yesterday I got a bit scared that I made a mistake during my benchmarks. I did my due diligence today and re-ran the benchmark on my 5950X with:

  • All but one CCD disabled
  • Hyper-Threading disabled
  • Core clock fixed at 4 GHz
  • MSVC v143
  • Windows 11 SDK (22000)

Repeated runs of the benchmark produced results that were within <1% of each other.
As such they might be the ones we can rely on for now:

bytes compared -> 4 8 16 32 64 128
std::string_view::operator== 10.70 ns 9.85 ns 10.20 ns 10.50 ns 11.40 ns 12.70 ns
std::wstring_view::operator== 11.00 ns 11.60 ns 12.50 ns 15.50 ns 21.00 ns 39.40 ns
__builtin_wmemcmp 2.27 ns 3.28 ns 5.30 ns 10.10 ns 18.10 ns 37.90 ns
wmemcmp 2.27 ns 3.28 ns 5.32 ns 9.31 ns 18.10 ns 34.30 ns
__builtin_memcmp 3.78 ns 4.28 ns 4.55 ns 4.05 ns 4.79 ns 6.32 ns

However I also gave the benchmark to a colleague with a Intel i7 9700k and it produced the exact same weird __builtin_wmemcmp behavior:

bytes compared -> 4 8 16 32 64 128
std::string_view::operator== 7.95 ns 7.50 ns 7.95 ns 8.16 ns 9.21 ns 11.00 ns
std::wstring_view::operator== 7.25 ns 8.02 ns 9.52 ns 13.80 ns 26.10 ns 33.00 ns
__builtin_wmemcmp 2.89 ns 4.30 ns 6.98 ns 12.30 ns 27.90 ns 43.90 ns
wmemcmp 2.61 ns 3.53 ns 5.31 ns 8.72 ns 15.70 ns 29.80 ns
__builtin_memcmp 4.14 ns 4.24 ns 4.76 ns 4.30 ns 5.31 ns 6.98 ns

But again this was also with Hyper-Threading, Turbo Boost, etc. enabled as I did initially.
(I don't want to inconvenience them to change all these settings in their BIOS first.)
One could theorize that __builtin_wmemcmp simply produces assembly that might be just as fast, but simply consumes more power and thus leads to reduced boost clocks. Personally I'm not sure what to make out of that, but I do know that at least __builtin_memcmp looks very promising.

I've published my benchmark here: https://github.com/lhecker/stl-issue-2289


P.S:
The performance of std::[w]string_view::operator== is worse than the others, because these functions (and the functions they call) take their arguments by-value, which has a large impact on performance on the Win64 ABI compared to passing them by reference (or compared to SysV). This is tracked as VSO-1332678, and VSO-685462.

ghost pushed a commit to microsoft/terminal that referenced this issue Nov 17, 2021
…1725)

til::equals:
At the time of writing wmemcmp() is not an intrinsic for MSVC,
but the STL uses it to implement wide string comparisons.
This produces 3x the assembly _per_ comparison and increases
runtime by 2-3x for strings of medium length (16 characters)
and 5x or more for long strings (128 characters or more).
See: microsoft/STL#2289

Additionally a number of case insensitive, locale unaware
helpers for prefix/suffix comparisons are introduced.
@AlexGuteniev
Copy link
Contributor

  1. If __builtin_wmemcmp is slower than wmemcmp at runtime, that should be reported as a compiler bug.

It is indeed: https://godbolt.org/z/Mdr843vxM

It was reported as DevCom-1616711, but was closed as the duplicate of this issue.

The implementation of wmemcmp is also naïve, it compares by wchar_t.
The memcmp compares by 4 bytes.

Both can be vectorized with SSE2/AVX2 to compare by 16/32 bytes.

@AlexGuteniev
Copy link
Contributor

DevCom-1616711 is Closed - Fixed, do we still have this issue?

@lhecker
Copy link
Member

lhecker commented Feb 20, 2023

Given that it doesn't fix the issue for strings >= 16 wide chars (2x slower; 5x slower for 64 wide chars), I feel it should be kept open. Basically, I believe the 1. point in the issue has been addressed, but not 2. and 3.

@lhecker
Copy link
Member

lhecker commented Feb 20, 2023

Now that I'm testing this again, I believe there's additionally still some funky business going on in regards to inlinability of <xstring> functions as I mentioned in passing above: https://godbolt.org/z/ocsEdrdPx

I suppose this is due to VSO-1332678 / VSO-685462 not being properly resolved yet? The extra copy the compiler makes, when passing 128-bit types on the x64 ABI, is probably messing things up here and the peephole optimizations can't fix this in hindsight if I had to take a guess... In any case, something, somewhere prevents some IMO valuable string related optimizations as the example shows.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
None yet
Development

No branches or pull requests

3 participants