Nowadays we can do this for atomic_ref&lt;T&gt; without breaking ABI; in vNext it would be shared with atomic&lt;T&gt;.
Not long ago, CPU vendors retrospectively confirmed that AVX support means atomicity for aligned 16-byte loads and stores:
https://discord.com/channels/737189251069771789/737734473751330856/1181320524178149427
It is possible to make use of this, instead of performing lock cmpxchg16b for plain loads and stores:
- Any atomic loads with `movdqa`, `movaps`, or `movapd` with a source memory operand (followed by a store to the result variable)
- Any atomic stores, except `memory_order_seq_cst`, with `movdqa`, `movaps`, or `movapd` with a destination memory operand (preceded by a load from the first parameter)
This can be done with runtime CPU feature detection or by relying on compile-time defines; the feature to check is AVX in either case.
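For illustration, the runtime check could look like the sketch below. It uses GCC/Clang's `<cpuid.h>`; the function name is hypothetical, and on MSVC one would use `__cpuid` from `<intrin.h>` instead:

```cpp
#include <cassert>
#include <cpuid.h> // GCC/Clang; MSVC exposes __cpuid in <intrin.h>

// Hypothetical helper: true when the CPU sets CPUID.01H:ECX.AVX[bit 28],
// which per the AMD and Intel quotes is the bit that guarantees atomicity
// of cacheable, naturally aligned 16-byte loads and stores.
bool avx_guarantees_16_byte_atomicity() {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return false; // leaf 1 not supported: no AVX guarantee
    }
    return (ecx & (1u << 28)) != 0;
}
```

The compile-time path would simply check a define such as `__AVX__`, which both GCC/Clang (`-mavx`) and MSVC (`/arch:AVX` and higher) provide.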
The AMD guarantee can be found here:
https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
7.3.2 Access Atomicity
> Processors that report CPUID Fn0000_0001_ECX[AVX] (bit 28) = 1 extend the atomicity for cacheable, naturally-aligned single loads or stores from a quadword to a double quadword.
The Intel guarantee can be found here:
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html
Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3A: System Programming Guide, Part 1
9.1.1 Guaranteed Atomic Operations
> Processors that enumerate support for Intel® AVX (by setting the feature flag CPUID.01H:ECX.AVX[bit 28]) guarantee that the 16-byte memory operations performed by the following instructions will always be carried out atomically:
> - MOVAPD, MOVAPS, and MOVDQA.
> - VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128.
> - VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded with EVEX.128 and k0 (masking disabled).
>
> (Note that these instructions require the linear addresses of their memory operands to be 16-byte aligned.)
This can't be done right away, though; it needs intrinsic exposure. The available _mm_store_si128 and friends can be optimized away, as @Alcaro pointed out. We need something like __iso_volatile_load128/__iso_volatile_store128, similar to the existing __iso_volatile_load64/__iso_volatile_store64.
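As a sketch of what such intrinsics would have to pin down, here are GCC/Clang inline-asm stand-ins (the names mirror the proposed __iso_volatile_* functions but are hypothetical; real ones would be compiler-provided, and the atomicity only holds on AVX-enumerating CPUs with 16-byte-aligned addresses):

```cpp
#include <cassert>
#include <emmintrin.h> // SSE2: __m128i, movdqa

// Hypothetical stand-ins for __iso_volatile_load128/__iso_volatile_store128.
// The inline asm fixes the exact instruction (movdqa) and keeps the compiler
// from eliding or splitting the access, unlike plain _mm_load_si128.
// No fences are implied; ordering would be layered on top, as with
// __iso_volatile_load64/__iso_volatile_store64.
static inline __m128i iso_volatile_load128(const volatile void* src) {
    __m128i result;
    asm volatile("movdqa %1, %0"
                 : "=x"(result)
                 : "m"(*static_cast<const __m128i*>(const_cast<const void*>(src))));
    return result;
}

static inline void iso_volatile_store128(volatile void* dst, __m128i value) {
    asm volatile("movdqa %1, %0"
                 : "=m"(*static_cast<__m128i*>(const_cast<void*>(dst)))
                 : "x"(value));
}
```

A seq_cst store would additionally need an xchg or a trailing fence, which is why the bullet list above carves out memory_order_seq_cst stores.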