`<xstring>`: `__builtin_wmemcmp` is slow #2289

StephanTLavavej · 2021-10-21T01:57:11Z

Original report:

Reported by @lhecker to an internal mailing list, quoted with his permission, edited for Markdown:

Related to this it should also be noted that the default implementation for STL wide string comparisons uses __builtin_wmemcmp, which is about 4x (+/- something) slower than good old memcmp.

Here are some benchmarks with 128-byte long strings (meaning 64 chars for wide strings):

Run on (32 X 3400 MHz CPU s)

CPU Caches:

L1 Data 32 KiB (x16)

L1 Instruction 32 KiB (x16)

L2 Unified 512 KiB (x16)

L3 Unified 32768 KiB (x2)

Benchmark Time CPU Iterations

std_string_view 7.84 ns 7.85 ns 89600000

std_wstring_view 32.0 ns 32.1 ns 22400000

standard_wmemcmp 17.8 ns 18.0 ns 37333333

standard_memcmp 6.88 ns 6.80 ns 89600000

Most curiously, as you can see here, is that __builtin_wmemcmp seems to be significantly slower than regular wmemcmp (the former produces rather abstruse assembly). In any case, if you want performance for string equality tests, I can only suggest using memcmp instead of wmemcmp. Our STL should likely replace any use of wmemcmp for such pure equality tests.

More analysis from me:

Here's where the STL calls __builtin_wmemcmp:

STL/stl/inc/xstring

Lines 240 to 245 in d8f03cf

    
               _NODISCARD static _CONSTEXPR17 int compare(_In_reads_(_Count) const _Elem* const _First1,
 
                   _In_reads_(_Count) const _Elem* const _First2, const size_t _Count) noexcept /* strengthened */ {
 
                   // compare [_First1, _First1 + _Count) with [_First2, ...)
 
           #if _HAS_CXX17
 
                   if constexpr (is_same_v<_Elem, wchar_t>) {
 
                       return __builtin_wmemcmp(_First1, _First2, _Count);

This is called by:

STL/stl/inc/xstring

Lines 564 to 569 in d8f03cf

    
           template <class _Traits>
 
           constexpr bool _Traits_equal(_In_reads_(_Left_size) const _Traits_ptr_t<_Traits> _Left, const size_t _Left_size,
 
               _In_reads_(_Right_size) const _Traits_ptr_t<_Traits> _Right, const size_t _Right_size) noexcept {
 
               // compare [_Left, _Left + _Left_size) to [_Right, _Right + _Right_size) for equality using _Traits
 
               return _Left_size == _Right_size && _Traits::compare(_Left, _Right, _Left_size) == 0;
 
           }

STL/stl/inc/xstring

Lines 1441 to 1443 in d8f03cf

    
           constexpr bool _Equal(const basic_string_view _Right) const noexcept {
 
               return _Traits_equal<_Traits>(_Mydata, _Mysize, _Right._Mydata, _Right._Mysize);
 
           }

STL/stl/inc/xstring

Lines 1715 to 1719 in d8f03cf

    
           template <class _Elem, class _Traits>
 
           _NODISCARD constexpr bool operator==(
 
               const basic_string_view<_Elem, _Traits> _Lhs, const basic_string_view<_Elem, _Traits> _Rhs) noexcept {
 
               return _Lhs._Equal(_Rhs);
 
           }

There are a few issues here:

If __builtin_wmemcmp is slower than wmemcmp at runtime, that should be reported as a compiler bug.
For wstring/wstring_view relational comparison (</<=/>/>=/<=>), we can work around that compiler bug by checking is_constant_evaluated and calling wmemcmp at runtime. This is less convenient than calling the builtin form unconditionally, but it's worth paying that code complexity for runtime performance (fixing a regression). As usual, compiler bug workarounds should be commented as TRANSITION.
For wstring/wstring_view equality comparison (==/!=), we need to retain a constexpr-compatible codepath, but at runtime, we can take advantage of the knowledge that we only need an "equal / non-equal" answer, for which memcmp is inherently faster than wmemcmp as Leonard measured.
- I believe that _Traits_equal is the right place to make this change. We still need to handle user-defined traits, but it should be possible to use if constexpr to detect when the traits are char_traits<wchar_t> or char_traits<char16_t>.

The text was updated successfully, but these errors were encountered:

lhecker · 2021-10-21T15:29:35Z

@StephanTLavavej tl;dr: We probably won't have to do anything about __builtin_wmemcmp for now. 🙂 This largely depends on whether others can reproduce the issue with that intrinsic.

Since yesterday I got a bit scared that I made a mistake during my benchmarks. I did my due diligence today and re-ran the benchmark on my 5950X with:

All but one CCD disabled
Hyper-Threading disabled
Core clock fixed at 4 GHz
MSVC v143
Windows 11 SDK (22000)

Repeated runs of the benchmark produced results that were within <1% of each other.
As such they might be the ones we can rely on for now:

bytes compared ->	4	8	16	32	64	128
`std::string_view::operator==`	10.70 ns	9.85 ns	10.20 ns	10.50 ns	11.40 ns	12.70 ns
`std::wstring_view::operator==`	11.00 ns	11.60 ns	12.50 ns	15.50 ns	21.00 ns	39.40 ns
`__builtin_wmemcmp`	2.27 ns	3.28 ns	5.30 ns	10.10 ns	18.10 ns	37.90 ns
`wmemcmp`	2.27 ns	3.28 ns	5.32 ns	9.31 ns	18.10 ns	34.30 ns
`__builtin_memcmp`	3.78 ns	4.28 ns	4.55 ns	4.05 ns	4.79 ns	6.32 ns

However I also gave the benchmark to a colleague with a Intel i7 9700k and it produced the exact same weird __builtin_wmemcmp behavior:

bytes compared ->	4	8	16	32	64	128
`std::string_view::operator==`	7.95 ns	7.50 ns	7.95 ns	8.16 ns	9.21 ns	11.00 ns
`std::wstring_view::operator==`	7.25 ns	8.02 ns	9.52 ns	13.80 ns	26.10 ns	33.00 ns
`__builtin_wmemcmp`	2.89 ns	4.30 ns	6.98 ns	12.30 ns	27.90 ns	43.90 ns
`wmemcmp`	2.61 ns	3.53 ns	5.31 ns	8.72 ns	15.70 ns	29.80 ns
`__builtin_memcmp`	4.14 ns	4.24 ns	4.76 ns	4.30 ns	5.31 ns	6.98 ns

But again this was also with Hyper-Threading, Turbo Boost, etc. enabled as I did initially.
(I don't want to inconvenience them to change all these settings in their BIOS first.)
One could theorize that __builtin_wmemcmp simply produces assembly that might be just as fast, but simply consumes more power and thus leads to reduced boost clocks. Personally I'm not sure what to make out of that, but I do know that at least __builtin_memcmp looks very promising.

I've published my benchmark here: https://github.com/lhecker/stl-issue-2289

P.S:
The performance of std::[w]string_view::operator== is worse than the others, because these functions (and the functions they call) take their arguments by-value, which has a large impact on performance on the Win64 ABI compared to passing them by reference (or compared to SysV). This is tracked as VSO-1332678, and VSO-685462.

…1725) til::equals: At the time of writing wmemcmp() is not an intrinsic for MSVC, but the STL uses it to implement wide string comparisons. This produces 3x the assembly _per_ comparison and increases runtime by 2-3x for strings of medium length (16 characters) and 5x or more for long strings (128 characters or more). See: microsoft/STL#2289 Additionally a number of case insensitive, locale unaware helpers for prefix/suffix comparisons are introduced.

AlexGuteniev · 2021-12-21T22:11:17Z

If __builtin_wmemcmp is slower than wmemcmp at runtime, that should be reported as a compiler bug.

It is indeed: https://godbolt.org/z/Mdr843vxM

It was reported as DevCom-1616711, but was closed as the duplicate of this issue.

The implementation of wmemcmp is also naïve, it compares by wchar_t.
The memcmp compares by 4 bytes.

Both can be vectorized with SSE2/AVX2 to compare by 16/32 bytes.

AlexGuteniev · 2023-02-20T07:30:09Z

DevCom-1616711 is Closed - Fixed, do we still have this issue?

lhecker · 2023-02-20T10:33:02Z

Given that it doesn't fix the issue for strings >= 16 wide chars (2x slower; 5x slower for 64 wide chars), I feel it should be kept open. Basically, I believe the 1. point in the issue has been addressed, but not 2. and 3.

lhecker · 2023-02-20T10:46:06Z

Now that I'm testing this again, I believe there's additionally still some funky business going on in regards to inlinability of <xstring> functions as I mentioned in passing above: https://godbolt.org/z/ocsEdrdPx

I suppose this is due to VSO-1332678 / VSO-685462 not being properly resolved yet? The extra copy the compiler makes, when passing 128-bit types on the x64 ABI, is probably messing things up here and the peephole optimizations can't fix this in hindsight if I had to take a guess... In any case, something, somewhere prevents some IMO valuable string related optimizations as the example shows.

StephanTLavavej added the performance Must go faster label Oct 21, 2021

lhecker mentioned this issue Nov 10, 2021

Introduce new TIL string helpers, teach older ones to use wmemcmp microsoft/terminal#11725

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`<xstring>`: `__builtin_wmemcmp` is slow #2289

`<xstring>`: `__builtin_wmemcmp` is slow #2289

StephanTLavavej commented Oct 21, 2021

lhecker commented Oct 21, 2021 •

edited

AlexGuteniev commented Dec 21, 2021

AlexGuteniev commented Feb 20, 2023

lhecker commented Feb 20, 2023

lhecker commented Feb 20, 2023

<xstring>: __builtin_wmemcmp is slow #2289

<xstring>: __builtin_wmemcmp is slow #2289

Comments

StephanTLavavej commented Oct 21, 2021

Original report:

More analysis from me:

lhecker commented Oct 21, 2021 • edited

AlexGuteniev commented Dec 21, 2021

AlexGuteniev commented Feb 20, 2023

lhecker commented Feb 20, 2023

lhecker commented Feb 20, 2023

`<xstring>`: `__builtin_wmemcmp` is slow #2289

`<xstring>`: `__builtin_wmemcmp` is slow #2289

lhecker commented Oct 21, 2021 •

edited