`<locale>`: Double-checked locking for locale::classic #3048

MattStephanson · 2022-08-21T02:54:27Z

Closes #3030.

The original issue suggests double-checked locking of global_locale in locale::_Init, but I don't think that would work without significantly more work, if it's possible at all. The reason is that the global locale can be modified, while at the same time a default constructed locale could be trying to load the global locale and increment it's refcount. This is the kind of lock-free "juggling with knives" that I don't want to even try. The classic locale, on the other hand, is a typical singleton and more amenable to DCL.

ABI alert: InterlockedMeowPointer requires their arguments be 32/64 bit aligned, thus the added alignas. Is that an ABI issue?

AlexGuteniev · 2022-08-21T21:31:47Z

That's not a good idea to have DCL via CAS. On x86 and x64 it is an expensive operation. Still a win over a mutex, as a mutex would yeld two such ops, but still need to have plain read, if want real DCL gain.

stl/src/locale0.cpp

AlexGuteniev · 2022-08-21T21:43:39Z

I don't see a justification to use WinAPI / Intrinsics.

If you can change the locking tech, you may go with std::call_once as out-of-box DCL. Or std::atomic if `std::call_once does not suit. You are in a .cpp and don't need to worry about includes and a downgrade standard mode.

If you need to preserve the variable, you can still use std::atomic_ref, but then I doubt if this change is safe at all.

MattStephanson · 2022-08-21T23:15:59Z

I don't see a justification to use WinAPI / Intrinsics.

If you can change the locking tech, you may go with std::call_once as out-of-box DCL. Or std::atomic if `std::call_once does not suit. You are in a .cpp and don't need to worry about includes and a downgrade standard mode.

If you need to preserve the variable, you can still use std::atomic_ref, but then I doubt if this change is safe at all.

On the original issue, @CaseyCarter noted (via @StephanTLavavej):

This needs to be compatible with /clr:pure (which is incompatible with atomics),

This is injected into the STL's import library, where we recently had trouble with atomic appearing (and fixed it by not dragging in atomic)

MattStephanson · 2022-08-21T23:33:34Z

That's not a good idea to have DCL via CAS. On x86 and x64 it is an expensive operation. Still a win over a mutex, as a mutex would yeld two such ops, but still need to have plain read, if want real DCL gain.

Yes, I'm not a fan of the CAS. I think <atomic> uses either __iso_volatile_meow or __ldrexd and _Compiler_or_memory_barrier, which is either _ReadWriteBarrier or __dmb(0xB). Do you think that would be better?

STL/stl/inc/atomic

Lines 155 to 167 in fef8191

    
           #define _ATOMIC_STORE_PREFIX(_Width, _Ptr, _Desired)      \
 
               case _Atomic_memory_order_relaxed:                    \
 
                   __iso_volatile_store##_Width((_Ptr), (_Desired)); \
 
                   return;                                           \
 
               case _Atomic_memory_order_release:                    \
 
                   _Compiler_or_memory_barrier();                    \
 
                   __iso_volatile_store##_Width((_Ptr), (_Desired)); \
 
                   return;                                           \
 
               default:                                              \
 
               case _Atomic_memory_order_consume:                    \
 
               case _Atomic_memory_order_acquire:                    \
 
               case _Atomic_memory_order_acq_rel:                    \
 
                   _INVALID_MEMORY_ORDER;                            \

STL/stl/inc/atomic

Lines 1036 to 1045 in fef8191

    
               _NODISCARD _TVal load(const memory_order _Order) const noexcept { // load with given memory order
 
                   const auto _Mem = _Atomic_address_as<long long>(_Storage);
 
           #ifdef _M_ARM
 
                   long long _As_bytes = __ldrexd(_Mem);
 
           #else
 
                   long long _As_bytes = __iso_volatile_load64(_Mem);
 
           #endif
 
                   _ATOMIC_LOAD_VERIFY_MEMORY_ORDER(static_cast<unsigned int>(_Order))
 
                   return reinterpret_cast<_TVal&>(_As_bytes);
 
               }

STL/stl/inc/atomic

Lines 97 to 107 in fef8191

    
           #define _Compiler_barrier() _STL_DISABLE_DEPRECATED_WARNING _ReadWriteBarrier() _STL_RESTORE_DEPRECATED_WARNING
 
           #if defined(_M_ARM) || defined(_M_ARM64) || defined(_M_ARM64EC)
 
           #define _Memory_barrier()             __dmb(0xB) // inner shared data memory barrier
 
           #define _Compiler_or_memory_barrier() _Memory_barrier()
 
           #elif defined(_M_IX86) || defined(_M_X64)
 
           // x86/x64 hardware only emits memory barriers inside _Interlocked intrinsics
 
           #define _Compiler_or_memory_barrier() _Compiler_barrier()
 
           #else // ^^^ x86/x64 / unsupported hardware vvv
 
           #error Unsupported hardware
 
           #endif // hardware

AlexGuteniev · 2022-08-22T07:46:02Z

Yes, I'm not a fan of the CAS. I think <atomic> uses either __iso_volatile_meow or __ldrexd and _Compiler_or_memory_barrier, which is either _ReadWriteBarrier or __dmb(0xB). Do you think that would be better?

For x86/x64 -- yes, absolutely.
For ARM situation is as follows:

Ideally we want LDAR here, but we can't now, see <atomic>: ARM64 should use LDAR and STLR for weaker than full SC #83 (Current ARM atomics are suboptimal due to that and other issues)
InterlockedCompareExchangePointerAcquire does not compile to true ARM64 CAS, it is LL/SC, so it consists of load, store, a barrier, and a retry loop.
I don't have a sense on ARM perf, and cannot tell whether __dmb(0xB) + plain load is better or worse, but my guess it is that fake CAS described above is worse than __dmb(0xB) + plain load, but the real CAS could have been better

My understanding is that we should primarily optimize for x86 / x64, and don't have much divergence between these and ARM, so should go with a barrier plus a load.

AlexGuteniev · 2022-08-22T07:53:38Z

On the original issue, @CaseyCarter noted (via @StephanTLavavej):

This needs to be compatible with /clr:pure (which is incompatible with atomics),

This is injected into the STL's import library, where we recently had trouble with atomic appearing (and fixed it by not dragging in atomic)

I see, but is there a problem with std::call_once, or its WinAPI equivalent?

StephanTLavavej · 2022-08-24T21:15:32Z

Thanks! We're very interested in reviewing this performance enhancement, but because locales are by far the scariest area of the STL, we'd like to target merging this at the beginning of the VS 2022 17.5 Preview 1 cycle (so we have time to react to any bug reports), instead of the end of the 17.4 Preview 3 cycle. Fortunately, this won't require waiting long - the date when changes start flowing into 17.5 Preview 1 is 2022-09-02 (i.e. next week Friday).

stl/src/locale0.cpp

MattStephanson · 2022-08-30T03:38:02Z

I see, but is there a problem with std::call_once, or its WinAPI equivalent?

Not necessarily, but (a) some rough tests show it's about 2x slower on my machine (3.85 ns vs 1.70 ns for copy-pasting the atomic load/store code; always-locking code is 16.7 ns) and (b) there's some business in xonce2.cpp with /ALTERNATENAME that I don't understand.

AlexGuteniev · 2022-08-30T05:04:42Z

it's about 2x slower on my machine (3.85 ns vs 1.70 ns for copy-pasting the atomic load/store code; always-locking code

Well, that's expected. I assumed it to be uniform for x86/ARM, and faster than InterlockedCompareExchangePointer, but not faster than plain load due to indirection.

stl/src/locale0.cpp

barcharcraz

This appears correct to me after looking at it for about an hour.

Yes, Ideally, we would use LL/SC on ARM but that can wait until we actually have LL/SC in msvc and implement it for atomics.

stl/src/locale0.cpp

barcharcraz

given the issue with xatomic not being core I'm OK with taking this, and doing any possible unification with atomic as a follow-up.

This is the third copy of this code we have though (there's another copy in vcruntime), which is a little annoying.

stl/src/locale0.cpp

StephanTLavavej · 2022-10-26T18:53:44Z

FYI @barcharcraz, we've made changes since you approved.

------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
main:
BM_locale_classic       30.1 ns         29.2 ns     23578947

This PR:
BM_locale_classic       2.89 ns         2.92 ns    235789474

Thanks @MattStephanson for this major improvement in a notoriously problematic area! 🎉

StephanTLavavej · 2022-10-26T22:58:41Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2022-10-27T02:24:16Z

🎉 😻 ⏱️ 🚀

Double-checked locking for locale::classic

1446049

MattStephanson requested a review from a team as a code owner August 21, 2022 02:54

AlexGuteniev reviewed Aug 21, 2022

View reviewed changes

stl/src/locale0.cpp Outdated Show resolved Hide resolved

CaseyCarter added the performance Must go faster label Aug 22, 2022

CaseyCarter added this to Initial Review in Code Reviews via automation Aug 22, 2022

StephanTLavavej reviewed Aug 24, 2022

View reviewed changes

stl/src/locale0.cpp Outdated Show resolved Hide resolved

Replace Interlocked calls with intrinsic load/store and barriers.

e9222c6

StephanTLavavej assigned barcharcraz Aug 31, 2022

Code Reviews automation moved this from Initial Review to Work In Progress Sep 2, 2022

StephanTLavavej requested changes Sep 2, 2022

View reviewed changes

stl/src/locale0.cpp Outdated Show resolved Hide resolved

always use locking for /clr:pure

a4dd979

StephanTLavavej moved this from Work In Progress to Initial Review in Code Reviews Sep 5, 2022

barcharcraz requested changes Sep 7, 2022

View reviewed changes

stl/src/locale0.cpp Outdated Show resolved Hide resolved

Code Reviews automation moved this from Initial Review to Work In Progress Sep 7, 2022

StephanTLavavej unassigned barcharcraz Sep 7, 2022

StephanTLavavej moved this from Work In Progress to Initial Review in Code Reviews Sep 8, 2022

barcharcraz approved these changes Sep 14, 2022

View reviewed changes

StephanTLavavej mentioned this pull request Sep 14, 2022

<xatomic.h>: Investigate making this a core header #3103

Closed

StephanTLavavej moved this from Initial Review to Final Review in Code Reviews Sep 14, 2022

StephanTLavavej self-assigned this Sep 14, 2022

frederick-vs-ja mentioned this pull request Sep 23, 2022

Move barrier macros from <atomic> to <xatomic.h> #3124

Merged

StephanTLavavej requested changes Oct 25, 2022

View reviewed changes

stl/src/locale0.cpp Outdated Show resolved Hide resolved

stl/src/locale0.cpp Outdated Show resolved Hide resolved

stl/src/locale0.cpp Outdated Show resolved Hide resolved

stl/src/locale0.cpp Outdated Show resolved Hide resolved

Code Reviews automation moved this from Final Review to Work In Progress Oct 25, 2022

StephanTLavavej removed their assignment Oct 25, 2022

MattStephanson added 3 commits October 25, 2022 19:13

Merge branch 'main' into classic-locale-locking

8890971

review feedback

2907701

D'oh

5b944d5

StephanTLavavej moved this from Work In Progress to Final Review in Code Reviews Oct 26, 2022

StephanTLavavej self-assigned this Oct 26, 2022

StephanTLavavej added 2 commits October 26, 2022 11:06

Merge branch 'main' into classic-locale-locking

6519e24

Add a benchmark. Also sort the list of benchmarks.

6160021

StephanTLavavej reviewed Oct 26, 2022

View reviewed changes

stl/src/locale0.cpp Show resolved Hide resolved

StephanTLavavej approved these changes Oct 26, 2022

View reviewed changes

StephanTLavavej removed their assignment Oct 26, 2022

StephanTLavavej moved this from Final Review to Ready To Merge in Code Reviews Oct 26, 2022

StephanTLavavej self-assigned this Oct 26, 2022

StephanTLavavej merged commit e9c3b55 into microsoft:main Oct 27, 2022

Code Reviews automation moved this from Ready To Merge to Done Oct 27, 2022

MattStephanson deleted the classic-locale-locking branch May 10, 2023 01:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`<locale>`: Double-checked locking for locale::classic #3048

`<locale>`: Double-checked locking for locale::classic #3048

MattStephanson commented Aug 21, 2022

AlexGuteniev commented Aug 21, 2022

AlexGuteniev commented Aug 21, 2022

MattStephanson commented Aug 21, 2022

MattStephanson commented Aug 21, 2022

AlexGuteniev commented Aug 22, 2022

AlexGuteniev commented Aug 22, 2022

StephanTLavavej commented Aug 24, 2022

MattStephanson commented Aug 30, 2022

AlexGuteniev commented Aug 30, 2022

barcharcraz left a comment

barcharcraz left a comment

StephanTLavavej commented Oct 26, 2022

StephanTLavavej commented Oct 26, 2022

StephanTLavavej commented Oct 27, 2022

<locale>: Double-checked locking for locale::classic #3048

<locale>: Double-checked locking for locale::classic #3048

Conversation

MattStephanson commented Aug 21, 2022

AlexGuteniev commented Aug 21, 2022

AlexGuteniev commented Aug 21, 2022

MattStephanson commented Aug 21, 2022

MattStephanson commented Aug 21, 2022

AlexGuteniev commented Aug 22, 2022

AlexGuteniev commented Aug 22, 2022

StephanTLavavej commented Aug 24, 2022

MattStephanson commented Aug 30, 2022

AlexGuteniev commented Aug 30, 2022

barcharcraz left a comment

Choose a reason for hiding this comment

barcharcraz left a comment

Choose a reason for hiding this comment

StephanTLavavej commented Oct 26, 2022

StephanTLavavej commented Oct 26, 2022

StephanTLavavej commented Oct 27, 2022

🎉 😻 ⏱️ 🚀

`<locale>`: Double-checked locking for locale::classic #3048

`<locale>`: Double-checked locking for locale::classic #3048