Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

<locale>: Double-checked locking for locale::classic #3048

Merged
merged 8 commits into from
Oct 27, 2022

Conversation

MattStephanson
Copy link
Contributor

Closes #3030.

The original issue suggests double-checked locking of global_locale in locale::_Init, but I don't think that would work without significantly more work, if it's possible at all. The reason is that the global locale can be modified, while at the same time a default constructed locale could be trying to load the global locale and increment it's refcount. This is the kind of lock-free "juggling with knives" that I don't want to even try. The classic locale, on the other hand, is a typical singleton and more amenable to DCL.

ABI alert: InterlockedMeowPointer requires their arguments be 32/64 bit aligned, thus the added alignas. Is that an ABI issue?

@MattStephanson MattStephanson requested a review from a team as a code owner August 21, 2022 02:54
@AlexGuteniev
Copy link
Contributor

That's not a good idea to have DCL via CAS. On x86 and x64 it is an expensive operation. Still a win over a mutex, as a mutex would yeld two such ops, but still need to have plain read, if want real DCL gain.

stl/src/locale0.cpp Outdated Show resolved Hide resolved
@AlexGuteniev
Copy link
Contributor

I don't see a justification to use WinAPI / Intrinsics.

If you can change the locking tech, you may go with std::call_once as out-of-box DCL. Or std::atomic if `std::call_once does not suit. You are in a .cpp and don't need to worry about includes and a downgrade standard mode.

If you need to preserve the variable, you can still use std::atomic_ref, but then I doubt if this change is safe at all.

@MattStephanson
Copy link
Contributor Author

I don't see a justification to use WinAPI / Intrinsics.

If you can change the locking tech, you may go with std::call_once as out-of-box DCL. Or std::atomic if `std::call_once does not suit. You are in a .cpp and don't need to worry about includes and a downgrade standard mode.

If you need to preserve the variable, you can still use std::atomic_ref, but then I doubt if this change is safe at all.

On the original issue, @CaseyCarter noted (via @StephanTLavavej):

  • This needs to be compatible with /clr:pure (which is incompatible with atomics),
  • This is injected into the STL's import library, where we recently had trouble with atomic appearing (and fixed it by not dragging in atomic)

@MattStephanson
Copy link
Contributor Author

That's not a good idea to have DCL via CAS. On x86 and x64 it is an expensive operation. Still a win over a mutex, as a mutex would yeld two such ops, but still need to have plain read, if want real DCL gain.

Yes, I'm not a fan of the CAS. I think <atomic> uses either __iso_volatile_meow or __ldrexd and _Compiler_or_memory_barrier, which is either _ReadWriteBarrier or __dmb(0xB). Do you think that would be better?

STL/stl/inc/atomic

Lines 155 to 167 in fef8191

#define _ATOMIC_STORE_PREFIX(_Width, _Ptr, _Desired) \
case _Atomic_memory_order_relaxed: \
__iso_volatile_store##_Width((_Ptr), (_Desired)); \
return; \
case _Atomic_memory_order_release: \
_Compiler_or_memory_barrier(); \
__iso_volatile_store##_Width((_Ptr), (_Desired)); \
return; \
default: \
case _Atomic_memory_order_consume: \
case _Atomic_memory_order_acquire: \
case _Atomic_memory_order_acq_rel: \
_INVALID_MEMORY_ORDER; \

STL/stl/inc/atomic

Lines 1036 to 1045 in fef8191

_NODISCARD _TVal load(const memory_order _Order) const noexcept { // load with given memory order
const auto _Mem = _Atomic_address_as<long long>(_Storage);
#ifdef _M_ARM
long long _As_bytes = __ldrexd(_Mem);
#else
long long _As_bytes = __iso_volatile_load64(_Mem);
#endif
_ATOMIC_LOAD_VERIFY_MEMORY_ORDER(static_cast<unsigned int>(_Order))
return reinterpret_cast<_TVal&>(_As_bytes);
}

STL/stl/inc/atomic

Lines 97 to 107 in fef8191

#define _Compiler_barrier() _STL_DISABLE_DEPRECATED_WARNING _ReadWriteBarrier() _STL_RESTORE_DEPRECATED_WARNING
#if defined(_M_ARM) || defined(_M_ARM64) || defined(_M_ARM64EC)
#define _Memory_barrier() __dmb(0xB) // inner shared data memory barrier
#define _Compiler_or_memory_barrier() _Memory_barrier()
#elif defined(_M_IX86) || defined(_M_X64)
// x86/x64 hardware only emits memory barriers inside _Interlocked intrinsics
#define _Compiler_or_memory_barrier() _Compiler_barrier()
#else // ^^^ x86/x64 / unsupported hardware vvv
#error Unsupported hardware
#endif // hardware

@AlexGuteniev
Copy link
Contributor

Yes, I'm not a fan of the CAS. I think <atomic> uses either __iso_volatile_meow or __ldrexd and _Compiler_or_memory_barrier, which is either _ReadWriteBarrier or __dmb(0xB). Do you think that would be better?

For x86/x64 -- yes, absolutely.
For ARM situation is as follows:

  • Ideally we want LDAR here, but we can't now, see <atomic>: ARM64 should use LDAR and STLR for weaker than full SC #83 (Current ARM atomics are suboptimal due to that and other issues)
  • InterlockedCompareExchangePointerAcquire does not compile to true ARM64 CAS, it is LL/SC, so it consists of load, store, a barrier, and a retry loop.
  • I don't have a sense on ARM perf, and cannot tell whether __dmb(0xB) + plain load is better or worse, but my guess it is that fake CAS described above is worse than __dmb(0xB) + plain load, but the real CAS could have been better

My understanding is that we should primarily optimize for x86 / x64, and don't have much divergence between these and ARM, so should go with a barrier plus a load.

@AlexGuteniev
Copy link
Contributor

On the original issue, @CaseyCarter noted (via @StephanTLavavej):

  • This needs to be compatible with /clr:pure (which is incompatible with atomics),
  • This is injected into the STL's import library, where we recently had trouble with atomic appearing (and fixed it by not dragging in atomic)

I see, but is there a problem with std::call_once, or its WinAPI equivalent?

@CaseyCarter CaseyCarter added the performance Must go faster label Aug 22, 2022
@CaseyCarter CaseyCarter added this to Initial Review in Code Reviews via automation Aug 22, 2022
@StephanTLavavej
Copy link
Member

Thanks! We're very interested in reviewing this performance enhancement, but because locales are by far the scariest area of the STL, we'd like to target merging this at the beginning of the VS 2022 17.5 Preview 1 cycle (so we have time to react to any bug reports), instead of the end of the 17.4 Preview 3 cycle. Fortunately, this won't require waiting long - the date when changes start flowing into 17.5 Preview 1 is 2022-09-02 (i.e. next week Friday).

@MattStephanson
Copy link
Contributor Author

I see, but is there a problem with std::call_once, or its WinAPI equivalent?

Not necessarily, but (a) some rough tests show it's about 2x slower on my machine (3.85 ns vs 1.70 ns for copy-pasting the atomic load/store code; always-locking code is 16.7 ns) and (b) there's some business in xonce2.cpp with /ALTERNATENAME that I don't understand.

@AlexGuteniev
Copy link
Contributor

it's about 2x slower on my machine (3.85 ns vs 1.70 ns for copy-pasting the atomic load/store code; always-locking code

Well, that's expected. I assumed it to be uniform for x86/ARM, and faster than InterlockedCompareExchangePointer, but not faster than plain load due to indirection.

Code Reviews automation moved this from Initial Review to Work In Progress Sep 2, 2022
stl/src/locale0.cpp Outdated Show resolved Hide resolved
@StephanTLavavej StephanTLavavej moved this from Work In Progress to Initial Review in Code Reviews Sep 5, 2022
Copy link
Member

@barcharcraz barcharcraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears correct to me after looking at it for about an hour.

Yes, Ideally, we would use LL/SC on ARM but that can wait until we actually have LL/SC in msvc and implement it for atomics.

stl/src/locale0.cpp Outdated Show resolved Hide resolved
Code Reviews automation moved this from Initial Review to Work In Progress Sep 7, 2022
@StephanTLavavej StephanTLavavej moved this from Work In Progress to Initial Review in Code Reviews Sep 8, 2022
Copy link
Member

@barcharcraz barcharcraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given the issue with xatomic not being core I'm OK with taking this, and doing any possible unification with atomic as a follow-up.

This is the third copy of this code we have though (there's another copy in vcruntime), which is a little annoying.

stl/src/locale0.cpp Outdated Show resolved Hide resolved
stl/src/locale0.cpp Outdated Show resolved Hide resolved
stl/src/locale0.cpp Outdated Show resolved Hide resolved
stl/src/locale0.cpp Outdated Show resolved Hide resolved
Code Reviews automation moved this from Final Review to Work In Progress Oct 25, 2022
@StephanTLavavej StephanTLavavej removed their assignment Oct 25, 2022
@StephanTLavavej StephanTLavavej moved this from Work In Progress to Final Review in Code Reviews Oct 26, 2022
@StephanTLavavej StephanTLavavej self-assigned this Oct 26, 2022
@StephanTLavavej
Copy link
Member

FYI @barcharcraz, we've made changes since you approved.

------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
main:
BM_locale_classic       30.1 ns         29.2 ns     23578947

This PR:
BM_locale_classic       2.89 ns         2.92 ns    235789474

Thanks @MattStephanson for this major improvement in a notoriously problematic area! 🎉

@StephanTLavavej StephanTLavavej removed their assignment Oct 26, 2022
@StephanTLavavej StephanTLavavej moved this from Final Review to Ready To Merge in Code Reviews Oct 26, 2022
@StephanTLavavej StephanTLavavej self-assigned this Oct 26, 2022
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit e9c3b55 into microsoft:main Oct 27, 2022
Code Reviews automation moved this from Ready To Merge to Done Oct 27, 2022
@StephanTLavavej
Copy link
Member

🎉 😻 ⏱️ 🚀

@MattStephanson MattStephanson deleted the classic-locale-locking branch May 10, 2023 01:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

<locale>: Unnecessary locking around classic locale
5 participants