
<atomic>: Improve ARM64 performance #3399

Merged
merged 2 commits into microsoft:main from arm64_atomics on Feb 10, 2023

Conversation

@StephanTLavavej (Member) commented Feb 10, 2023

This mirrors Ben Niu's internal MSVC-PR-449792 "Re-implement std::atomic acquire/release/seqcst load/store using __load_acquire/__stlr" as of iteration 15. Note that this PR is targeted at the internal branch prod/be, thus there will be temporary divergence between GitHub and our usual branch prod/fe.

This relies on new compiler intrinsics, so it won't be immediately active on GitHub until the necessary compiler and VCRuntime changes ship in a public Preview. Ben's benchmarking indicates massive performance improvements for load-acquire and store-release (around 14.1x to 23.8x speedups for officially supported chips - yes, times not percent) and a significant performance improvement for sequentially consistent stores (a 1.58x speedup).
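
For readers unfamiliar with the intrinsics involved, here is a rough sketch (not the PR's actual code) of the before/after pattern for a 32-bit acquire load and release store. The width-suffixed intrinsic names and pointer types are assumptions inferred from the PR title and from the _ATOMIC_STORE_SEQ_CST_ARM64 macro quoted later in this thread.

    #include <intrin.h>

    // Old pattern: plain (relaxed) load, then a full DMB ISH barrier.
    inline int load_acquire_old(const volatile int* _Ptr) noexcept {
        const __int32 _Bits = __iso_volatile_load32(reinterpret_cast<const volatile __int32*>(_Ptr));
        __dmb(_ARM64_BARRIER_ISH); // full two-way barrier; stronger (and slower) than acquire requires
        return static_cast<int>(_Bits);
    }

    // New pattern: a single LDAR via the __load_acquire32 intrinsic (name/signature assumed).
    inline int load_acquire_new(const volatile int* _Ptr) noexcept {
        return static_cast<int>(static_cast<__int32>(
            __load_acquire32(reinterpret_cast<const volatile unsigned __int32*>(_Ptr))));
    }

    // New pattern: a single STLR via the __stlr32 intrinsic (as in the seq_cst store macro quoted below).
    inline void store_release_new(volatile int* _Ptr, const int _Desired) noexcept {
        __stlr32(reinterpret_cast<volatile unsigned __int32*>(_Ptr), static_cast<unsigned __int32>(_Desired));
    }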

Ben has sworn a solemn oath on a basket of fluffy kittens that this does not break bincompat. 🧺 😻

Fixes #83.

☢️ 🦾

"Re-implement std::atomic acquire/release/seqcst load/store using __load_acquire/__stlr"
@StephanTLavavej added the performance (Must go faster) and ARM64 (Related to the ARM64 architecture) labels on Feb 10, 2023
@StephanTLavavej requested a review from a team as a code owner on February 10, 2023 04:17
@StephanTLavavej added this to Final Review in Code Reviews on Feb 10, 2023
@CaseyCarter (Member) commented:

Ben has sworn a solemn oath on a basket of fluffy kittens that this does not break bincompat. 🧺 😻

It's not like we've recently had to service a slew of regressions and ABI breaks. I'm not at all nervous. 😅

(Several comments from @AlexGuteniev, @BillyONeal, @barcharcraz, and @StephanTLavavej were marked as resolved or outdated.)

@StephanTLavavej moved this from Final Review to Ready To Merge in Code Reviews on Feb 10, 2023
@StephanTLavavej merged commit 1abaa14 into microsoft:main on Feb 10, 2023
Code Reviews automation moved this from Ready To Merge to Done on Feb 10, 2023
@StephanTLavavej deleted the arm64_atomics branch on February 10, 2023 22:55
@mcfi commented Feb 11, 2023

Recent Arm revisions also added a new load instruction with weaker (RCpc) acquire semantics: LDAPR, introduced in Armv8.3.

If you later use the official 17.6 release and pass /arch:armv8.3, you will get LDAPR instead of LDAR emitted for acquire loads.
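
In other words (a minimal illustration, not taken from the PR), an acquire load like the following compiles to LDAR on baseline ARM64 and can compile to LDAPR when /arch:armv8.3 is passed to a toolset that has these changes:

    #include <atomic>

    int read_flag(const std::atomic<int>& flag) {
        // LDAR on baseline ARM64; LDAPR with /arch:armv8.3 on a toolset with these changes.
        return flag.load(std::memory_order_acquire);
    }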

@jpark37 commented Feb 18, 2023

Am I reading the code wrong, or did load() overloads get overlooked? Here's _Atomic_storage<_Ty, 4> for example:

    _NODISCARD _TVal load() const noexcept { // load with sequential consistency
        const auto _Mem = _Atomic_address_as<int>(_Storage);
        int _As_bytes   = __iso_volatile_load32(_Mem);
        _Compiler_or_memory_barrier();
        return reinterpret_cast<_TVal&>(_As_bytes);
    }

    _NODISCARD _TVal load(const memory_order _Order) const noexcept { // load with given memory order
        const auto _Mem = _Atomic_address_as<int>(_Storage);
        int _As_bytes;
#if _STD_ATOMIC_USE_ARM64_LDAR_STLR == 1
        _ATOMIC_LOAD_ARM64(_As_bytes, 32, _Mem, static_cast<unsigned int>(_Order))
#else
        _As_bytes = __iso_volatile_load32(_Mem);
        _ATOMIC_LOAD_VERIFY_MEMORY_ORDER(static_cast<unsigned int>(_Order))
#endif
        return reinterpret_cast<_TVal&>(_As_bytes);
    }
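
For comparison, here is a sketch of what the plain load() could look like if it were routed through the same macro (this is the change being suggested, not shipped code; it assumes _ATOMIC_LOAD_ARM64 accepts a seq_cst order constant the same way the memory_order overload passes its order):

    _NODISCARD _TVal load() const noexcept { // load with sequential consistency
        const auto _Mem = _Atomic_address_as<int>(_Storage);
        int _As_bytes;
#if _STD_ATOMIC_USE_ARM64_LDAR_STLR == 1
        _ATOMIC_LOAD_ARM64(_As_bytes, 32, _Mem, static_cast<unsigned int>(memory_order_seq_cst))
#else
        _As_bytes = __iso_volatile_load32(_Mem);
        _Compiler_or_memory_barrier();
#endif
        return reinterpret_cast<_TVal&>(_As_bytes);
    }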

@jpark37 commented Feb 18, 2023

Can you explain why this needs _Memory_barrier()? I thought STLR was supposed to support seq_cst on its own:

#define _ATOMIC_STORE_SEQ_CST_ARM64(_Width, _Ptr, _Desired)                               \
    _Compiler_barrier();                                                                  \
    __stlr##_Width(reinterpret_cast<volatile unsigned __int##_Width*>(_Ptr), (_Desired)); \
    _Memory_barrier();

@BillyONeal (Member) commented:

I thought STLR was supposed to support seq_cst on its own:

STLR provides seq_cst when the load side uses LDAR, but we have already shipped callers that do a plain load followed by DMB ISH. The extra barrier can be removed when we take an ABI break.
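
To make that concrete, here is an illustrative store-buffering litmus test (not code from this PR); the codegen notes in the comments follow the explanation above. Under seq_cst, both threads reading 0 is forbidden, and the trailing DMB after the STLR is what preserves that guarantee against already-shipped load-side code.

    // Illustrative store-buffering litmus test. Under sequential consistency,
    // r1 == 0 && r2 == 0 is forbidden.
    #include <atomic>
    #include <cassert>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    void thread1() {
        x.store(1, std::memory_order_seq_cst); // new codegen: STLR (+ trailing DMB ISH for compatibility)
        r1 = y.load(std::memory_order_seq_cst); // already-shipped codegen: plain load + DMB ISH
    }

    void thread2() {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);
    }

    int main() {
        std::thread t1(thread1), t2(thread2);
        t1.join();
        t2.join();
        // A plain load may be reordered ahead of an earlier STLR. Without the trailing
        // DMB after the STLR, an already-shipped plain-load-plus-DMB caller could
        // observe r1 == 0 && r2 == 0, breaking seq_cst.
        assert(!(r1 == 0 && r2 == 0));
    }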

@jpark37 commented Aug 10, 2023

STLR provides seq_cst when the load side does LDAR, but we have already shipped callers doing LD+DMB ISH. The extra barrier can be removed when we ABI break.

I'm reading that __load_acquire can be LDAPR (presumably if the compiler is asked to compile for a sufficient ARM version), and I'm reading somewhere else that that's too weak for seq_cst. Is the barrier guarding against that, maybe?

Also, the plain load() overloads don't seem to leverage _STD_ATOMIC_USE_ARM64_LDAR_STLR at HEAD to match the behavior of load(seq_cst).

Linked issue (closed by this PR): <atomic>: ARM64 should use LDAR and STLR for weaker than full SC (#83)