
<atomic>: Make 128-bit atomic_ref<T> using plain loads and stores #4480

@AlexGuteniev

Description

Nowadays we can do this for atomic_ref<T> without breaking ABI; in vNext it can be shared with atomic<T>.

Not long ago, CPU vendors retrospectively confirmed that AVX support implies atomicity for aligned 16-byte loads and stores:
https://discord.com/channels/737189251069771789/737734473751330856/1181320524178149427

It is possible to make use of this, instead of performing lock cmpxchg16b for plain loads and stores:

  • Atomic loads of any memory order, using movdqa, movaps, or movapd with a memory source operand (followed by a store to the result variable)
  • Atomic stores of any memory order except memory_order_seq_cst, using movdqa, movaps, or movapd with a memory destination operand (preceded by a load from the first parameter); a seq_cst store still needs a locked instruction or a trailing fence (see the sketch after this list)

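For illustration, here is a minimal sketch of what such a load and store could compile to. The struct `Big` and the function names are mine, and, as discussed at the end of this issue, the plain `_mm_*` intrinsics are not actually sufficient in practice because the compiler is allowed to merge or elide them:

```cpp
#include <emmintrin.h> // _mm_load_si128 / _mm_store_si128

// A 16-byte type; alignas(16) matches the instructions' alignment requirement.
struct alignas(16) Big {
    long long lo;
    long long hi;
};

Big load_16b(const Big* src) {
    // Intended codegen: movdqa xmm0, [src] -- atomic on AVX-capable CPUs
    const __m128i tmp = _mm_load_si128(reinterpret_cast<const __m128i*>(src));
    Big result;
    // ...followed by a store to the result variable
    _mm_store_si128(reinterpret_cast<__m128i*>(&result), tmp);
    return result;
}

void store_16b(Big* dst, const Big& desired) {
    // Preceded by a load from the parameter...
    const __m128i tmp = _mm_load_si128(reinterpret_cast<const __m128i*>(&desired));
    // ...then the intended codegen: movdqa [dst], xmm0
    _mm_store_si128(reinterpret_cast<__m128i*>(dst), tmp);
    // A memory_order_seq_cst store would additionally need a locked
    // instruction or a fence, which is why seq_cst is excluded above.
}
```
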
This can be done with runtime CPU feature detection, or by relying on compile-time defines; the feature to check is AVX in either case (a detection sketch follows).

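A sketch of both detection strategies; the function name is mine:

```cpp
#include <intrin.h> // __cpuid (MSVC)

bool has_avx_16_byte_atomicity() {
#ifdef __AVX__
    // Compile-time: built with /arch:AVX or higher, so AVX is assumed everywhere.
    return true;
#else
    // Runtime: CPUID leaf 1, ECX bit 28 is the AVX feature flag that both
    // vendors' guarantees (quoted below) are keyed on.
    int regs[4]; // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    return (regs[2] & (1 << 28)) != 0;
#endif
}
```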

The AMD guarantee can be found here:
https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf

7.3.2 Access Atomicity

Processors that report CPUID Fn0000_0001_ECX[AVX] (bit 28) = 1 extend the atomicity for cacheable, naturally-aligned single loads or stores from a quadword to a double quadword.


The Intel guarantee can be found here:
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html
Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3A: System Programming Guide, Part 1

9.1.1 Guaranteed Atomic Operations

Processors that enumerate support for Intel® AVX (by setting the feature flag CPUID.01H:ECX.AVX[bit 28]) guarantee that the 16-byte memory operations performed by the following instructions will always be carried out atomically:
• MOVAPD, MOVAPS, and MOVDQA.
• VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128.
• VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded with EVEX.128 and k0 (masking disabled).
(Note that these instructions require the linear addresses of their memory operands to be 16-byte aligned.)

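The 16-byte alignment demanded by both vendors is already covered on the library side: atomic_ref imposes an alignment requirement on the referenced object via required_alignment. A small illustration; the `Pair` type is mine, and the asserted value of 16 assumes x64 MSVC, where a 16-byte trivially copyable type is lock-free:

```cpp
#include <atomic>
#include <cstdint>

struct Pair {
    std::int64_t a, b; // 16 bytes total
};

// For a lock-free 16-byte type, required_alignment is expected to be 16 on
// x64 MSVC, which matches the alignment these instructions demand.
static_assert(std::atomic_ref<Pair>::required_alignment == 16);

alignas(std::atomic_ref<Pair>::required_alignment) Pair shared{};
```
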

This can't be done right away, though; it needs intrinsic exposure. The available _mm_store_si128 and friends can be optimized away, as @Alcaro pointed out. We need something like __iso_volatile_load128/__iso_volatile_store128, similar to the existing __iso_volatile_load64/__iso_volatile_store64.
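
Sketched here only to make the request concrete; these 128-bit declarations do not exist in any compiler today, the real signatures would be up to the compiler team, and the `_Atomic_store_16b_relaxed` helper is a hypothetical name of mine:

```cpp
#include <emmintrin.h>

// Hypothetical 128-bit counterparts of __iso_volatile_load64/__iso_volatile_store64.
// These do NOT exist yet; this issue is requesting them. Like the 64-bit
// versions, they would have to compile to exactly one load/store instruction
// that the optimizer may not elide or combine.
extern "C" __m128i __iso_volatile_load128(const volatile __m128i* location);
extern "C" void __iso_volatile_store128(volatile __m128i* location, __m128i value);

// How a relaxed 16-byte atomic store could then be written in the STL,
// assuming AVX has been detected (see the detection sketch above):
inline void _Atomic_store_16b_relaxed(volatile void* dst, const void* desired) {
    const __m128i value = _mm_loadu_si128(static_cast<const __m128i*>(desired));
    __iso_volatile_store128(static_cast<volatile __m128i*>(dst), value);
}
```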

Labels: blocked (Something is preventing work on this), compiler (Compiler work involved), performance (Must go faster)
