Nowadays we can do this for atomic_ref&lt;T&gt; without breaking ABI; in vNext it would be shared with atomic&lt;T&gt;.
Not long ago, CPU vendors retrospectively confirmed that AVX support means atomicity for aligned 16-byte loads and stores:
https://discord.com/channels/737189251069771789/737734473751330856/1181320524178149427
It is possible to make use of this, instead of performing lock cmpxchg16b for plain loads and stores:
- Any atomic loads with `movdqa`, `movaps`, or `movapd` with a source memory operand (followed by a store to the result variable)
- Any atomic stores, except `memory_order_seq_cst`, with `movdqa`, `movaps`, or `movapd` with a destination memory operand (preceded by a load from the first parameter)
This can be done with runtime CPU feature detection or by relying on compile-time defines; the feature to check is AVX in either case.
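For illustration, the runtime check could look like the sketch below. It uses GCC/Clang's `<cpuid.h>`; the function name is hypothetical, and on MSVC one would use `__cpuid` from `<intrin.h>` instead:

```cpp
#include <cassert>
#include <cpuid.h> // GCC/Clang; MSVC exposes __cpuid in <intrin.h>

// Hypothetical helper: true when the CPU sets CPUID.01H:ECX.AVX[bit 28],
// which per the AMD and Intel quotes is the bit that guarantees atomicity
// of cacheable, naturally aligned 16-byte loads and stores.
bool avx_guarantees_16_byte_atomicity() {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return false; // leaf 1 not supported: no AVX guarantee
    }
    return (ecx & (1u << 28)) != 0;
}
```

The compile-time path would simply check a define such as `__AVX__`, which both GCC/Clang (`-mavx`) and MSVC (`/arch:AVX` and higher) provide.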
The AMD guarantee can be found here:
https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
7.3.2 Access Atomicity
> Processors that report CPUID Fn0000_0001_ECX[AVX] (bit 28) = 1 extend the atomicity for cacheable, naturally-aligned single loads or stores from a quadword to a double quadword.
The Intel guarantee can be found here:
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html
Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3A: System Programming Guide, Part 1
9.1.1 Guaranteed Atomic Operations
> Processors that enumerate support for Intel® AVX (by setting the feature flag CPUID.01H:ECX.AVX[bit 28]) guarantee that the 16-byte memory operations performed by the following instructions will always be carried out atomically:
> - MOVAPD, MOVAPS, and MOVDQA.
> - VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128.
> - VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded with EVEX.128 and k0 (masking disabled).
>
> (Note that these instructions require the linear addresses of their memory operands to be 16-byte aligned.)
This can't be done right away, though; it needs intrinsic exposure. The available _mm_store_si128 and friends can be optimized away, as @Alcaro pointed out. We need something like __iso_volatile_load128/__iso_volatile_store128, similar to the existing __iso_volatile_load64/__iso_volatile_store64.
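As a sketch of what such intrinsics would have to pin down, here are GCC/Clang inline-asm stand-ins (the names mirror the proposed __iso_volatile_* functions but are hypothetical; real ones would be compiler-provided, and the atomicity only holds on AVX-enumerating CPUs with 16-byte-aligned addresses):

```cpp
#include <cassert>
#include <emmintrin.h> // SSE2: __m128i, movdqa

// Hypothetical stand-ins for __iso_volatile_load128/__iso_volatile_store128.
// The inline asm fixes the exact instruction (movdqa) and keeps the compiler
// from eliding or splitting the access, unlike plain _mm_load_si128.
// No fences are implied; ordering would be layered on top, as with
// __iso_volatile_load64/__iso_volatile_store64.
static inline __m128i iso_volatile_load128(const volatile void* src) {
    __m128i result;
    asm volatile("movdqa %1, %0"
                 : "=x"(result)
                 : "m"(*static_cast<const __m128i*>(const_cast<const void*>(src))));
    return result;
}

static inline void iso_volatile_store128(volatile void* dst, __m128i value) {
    asm volatile("movdqa %1, %0"
                 : "=m"(*static_cast<__m128i*>(const_cast<void*>(dst)))
                 : "x"(value));
}
```

A seq_cst store would additionally need an xchg or a trailing fence, which is why the bullet list above carves out memory_order_seq_cst stores.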