
<atomic>: Improve ARM64 performance #3399

Merged
merged 2 commits into microsoft:main from arm64_atomics on Feb 10, 2023

Conversation

@StephanTLavavej (Member) commented Feb 10, 2023

This mirrors Ben Niu's internal MSVC-PR-449792 "Re-implement std::atomic acquire/release/seqcst load/store using __load_acquire/__stlr" as of iteration 15. Note that this PR is targeted at the internal branch prod/be, thus there will be temporary divergence between GitHub and our usual branch prod/fe.

This relies on new compiler intrinsics, so it won't be immediately active on GitHub until the necessary compiler and VCRuntime changes ship in a public Preview. Ben's benchmarking indicates massive performance improvements for load-acquire and store-release (around 14.1x to 23.8x speedups for officially supported chips - yes, times not percent) and a significant performance improvement for sequentially consistent stores (a 1.58x speedup).
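
For readers unfamiliar with the intrinsics involved, here is a rough sketch (not the PR's actual code) of the before/after pattern for a 32-bit acquire load and release store. The width-suffixed intrinsic names and pointer types are assumptions inferred from the PR title and from the _ATOMIC_STORE_SEQ_CST_ARM64 macro quoted later in this thread.

    #include <intrin.h>

    // Old pattern: plain (relaxed) load, then a full DMB ISH barrier.
    inline int load_acquire_old(const volatile int* _Ptr) noexcept {
        const __int32 _Bits = __iso_volatile_load32(reinterpret_cast<const volatile __int32*>(_Ptr));
        __dmb(_ARM64_BARRIER_ISH); // full two-way barrier; stronger (and slower) than acquire requires
        return static_cast<int>(_Bits);
    }

    // New pattern: a single LDAR via the __load_acquire32 intrinsic (name/signature assumed).
    inline int load_acquire_new(const volatile int* _Ptr) noexcept {
        return static_cast<int>(static_cast<__int32>(
            __load_acquire32(reinterpret_cast<const volatile unsigned __int32*>(_Ptr))));
    }

    // New pattern: a single STLR via the __stlr32 intrinsic (as in the seq_cst store macro quoted below).
    inline void store_release_new(volatile int* _Ptr, const int _Desired) noexcept {
        __stlr32(reinterpret_cast<volatile unsigned __int32*>(_Ptr), static_cast<unsigned __int32>(_Desired));
    }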

Ben has sworn a solemn oath on a basket of fluffy kittens that this does not break bincompat. 🧺 😻

Fixes #83.

☢️ 🦾

"Re-implement std::atomic acquire/release/seqcst load/store using __load_acquire/__stlr"
@StephanTLavavej added the performance (Must go faster) and ARM64 (Related to the ARM64 architecture) labels on Feb 10, 2023
@StephanTLavavej requested a review from a team as a code owner on February 10, 2023 04:17
@StephanTLavavej added this to Final Review in Code Reviews on Feb 10, 2023
@CaseyCarter (Member) commented:

Ben has sworn a solemn oath on a basket of fluffy kittens that this does not break bincompat. 🧺 😻

It's not like we've recently had to service a slew of regressions and ABI breaks. I'm not at all nervous. 😅

(Several comments from @AlexGuteniev, @BillyONeal, @barcharcraz, and @StephanTLavavej were marked as resolved or outdated.)

@StephanTLavavej moved this from Final Review to Ready To Merge in Code Reviews on Feb 10, 2023
@StephanTLavavej merged commit 1abaa14 into microsoft:main on Feb 10, 2023
Code Reviews automation moved this from Ready To Merge to Done on Feb 10, 2023
@StephanTLavavej deleted the arm64_atomics branch on February 10, 2023 22:55
@mcfi commented Feb 11, 2023

Recent Arm revisions also added a new load instruction with weaker (RCpc) acquire semantics: LDAPR, introduced in Armv8.3.

If you later use the official 17.6 release and pass /arch:armv8.3, you will get LDAPR instead of LDAR emitted for acquire loads.
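
In other words (a minimal illustration, not taken from the PR), an acquire load like the following compiles to LDAR on baseline ARM64 and can compile to LDAPR when /arch:armv8.3 is passed to a toolset that has these changes:

    #include <atomic>

    int read_flag(const std::atomic<int>& flag) {
        // LDAR on baseline ARM64; LDAPR with /arch:armv8.3 on a toolset with these changes.
        return flag.load(std::memory_order_acquire);
    }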

@jpark37 commented Feb 18, 2023

Am I reading the code wrong, or did load() overloads get overlooked? Here's _Atomic_storage<_Ty, 4> for example:

    _NODISCARD _TVal load() const noexcept { // load with sequential consistency
        const auto _Mem = _Atomic_address_as<int>(_Storage);
        int _As_bytes   = __iso_volatile_load32(_Mem);
        _Compiler_or_memory_barrier();
        return reinterpret_cast<_TVal&>(_As_bytes);
    }

    _NODISCARD _TVal load(const memory_order _Order) const noexcept { // load with given memory order
        const auto _Mem = _Atomic_address_as<int>(_Storage);
        int _As_bytes;
#if _STD_ATOMIC_USE_ARM64_LDAR_STLR == 1
        _ATOMIC_LOAD_ARM64(_As_bytes, 32, _Mem, static_cast<unsigned int>(_Order))
#else
        _As_bytes = __iso_volatile_load32(_Mem);
        _ATOMIC_LOAD_VERIFY_MEMORY_ORDER(static_cast<unsigned int>(_Order))
#endif
        return reinterpret_cast<_TVal&>(_As_bytes);
    }
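
For comparison, here is a sketch of what the plain load() could look like if it were routed through the same macro (this is the change being suggested, not shipped code; it assumes _ATOMIC_LOAD_ARM64 accepts a seq_cst order constant the same way the memory_order overload passes its order):

    _NODISCARD _TVal load() const noexcept { // load with sequential consistency
        const auto _Mem = _Atomic_address_as<int>(_Storage);
        int _As_bytes;
#if _STD_ATOMIC_USE_ARM64_LDAR_STLR == 1
        _ATOMIC_LOAD_ARM64(_As_bytes, 32, _Mem, static_cast<unsigned int>(memory_order_seq_cst))
#else
        _As_bytes = __iso_volatile_load32(_Mem);
        _Compiler_or_memory_barrier();
#endif
        return reinterpret_cast<_TVal&>(_As_bytes);
    }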

@jpark37 commented Feb 18, 2023

Can you explain why this needs _Memory_barrier()? I thought STLR was supposed to support seq_cst on its own:

#define _ATOMIC_STORE_SEQ_CST_ARM64(_Width, _Ptr, _Desired)                               \
    _Compiler_barrier();                                                                  \
    __stlr##_Width(reinterpret_cast<volatile unsigned __int##_Width*>(_Ptr), (_Desired)); \
    _Memory_barrier();

@BillyONeal (Member) commented:

I thought STLR was supposed to support seq_cst on its own:

STLR provides seq_cst when the load side uses LDAR, but we have already shipped callers that do a plain load followed by DMB ISH. The extra barrier can be removed when we take an ABI break.
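
To make that concrete, here is an illustrative store-buffering litmus test (not code from this PR); the codegen notes in the comments follow the explanation above. Under seq_cst, both threads reading 0 is forbidden, and the trailing DMB after the STLR is what preserves that guarantee against already-shipped load-side code.

    // Illustrative store-buffering litmus test. Under sequential consistency,
    // r1 == 0 && r2 == 0 is forbidden.
    #include <atomic>
    #include <cassert>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    void thread1() {
        x.store(1, std::memory_order_seq_cst); // new codegen: STLR (+ trailing DMB ISH for compatibility)
        r1 = y.load(std::memory_order_seq_cst); // already-shipped codegen: plain load + DMB ISH
    }

    void thread2() {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);
    }

    int main() {
        std::thread t1(thread1), t2(thread2);
        t1.join();
        t2.join();
        // A plain load may be reordered ahead of an earlier STLR. Without the trailing
        // DMB after the STLR, an already-shipped plain-load-plus-DMB caller could
        // observe r1 == 0 && r2 == 0, breaking seq_cst.
        assert(!(r1 == 0 && r2 == 0));
    }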

@jpark37 commented Aug 10, 2023

STLR provides seq_cst when the load side does LDAR, but we have already shipped callers doing LD+DMB ISH. The extra barrier can be removed when we ABI break.

I'm reading that __load_acquire can be LDAPR (presumably if the compiler is asked to compile for a sufficient ARM version), and I'm reading somewhere else that that's too weak for seq_cst. Is the barrier guarding against that, maybe?

Also, the plain load() overloads don't seem to leverage _STD_ATOMIC_USE_ARM64_LDAR_STLR at HEAD to match the behavior of load(seq_cst).

Linked issue (closed by this PR): <atomic>: ARM64 should use LDAR and STLR for weaker than full SC (#83)