Correct memory ordering for rcu atomic assign/deref on Apple M1 cpus #23974
Conversation
@tom-cosgrove-arm would greatly appreciate your input on this issue please. Specifically, if you could comment on the memory synchronization properties of arm cores in general (and the M1 specifically if possible) that would be a big help. Trying to concretely understand if an __atomic_[load|store] operation with __ATOMIC_RELEASE/__ATOMIC_ACQUIRE ordering is expected to imply a memory fence on code targeting those processors. The assembly I've observed seems to say "no, it doesn't", but I'm not sure if that's expected or not.
@nhorman Yes, I would generally expect a memory barrier to be emitted here. I will check with our compiler folks to get their opinions, however - and with the bank holidays, it will probably take a few days to get an answer.
@tom-cosgrove-arm thank you, no rush. We have a workaround for the 3.3 release on this, just trying to get it fixed properly during 3.4.
@nhorman A bit of experimenting with multiple compilers and versions (on Linux) shows that neither gcc nor clang adds the memory barriers. On that basis, clearly I was wrong to expect this to happen, and it's not a compiler bug. I'll ask our compiler folks to understand why this happens.
@tom-cosgrove-arm that matches my observations in the initial problem statement here. My assumption is that the memory order parameter to these operations only specifies restrictions on code hoisting/sinking within the given compilation unit. It just so happens that on x86 the compiler, in its use of the lock prefix on the load and store instructions, implies memory synchronization between cpu caches that isn't guaranteed to exist on other arches. I'll be interested to see if your compiler team agrees with that assessment.
@tom-cosgrove-arm apologies, don't mean to rush you here, but there's been some shared interest in determining what the expected behavior on arm is, and whether this patch is on the right track for fixing it.
@tom-cosgrove-arm yeah, I can be on board with that. Question: how do we want to detect apple silicon? The easy method would be to use #ifdef __aarch64__, but that nets in all arm silicon, not just apple. I could detect MACOSX, which would be a bit closer to the mark, but would net both physical and virtual cpus, which may be the best we can do without a runtime check (which I would like to avoid for performance reasons).
OK, so since the other PR was just for testing, and all our conversations have been here, I'm going to update this PR with the change to implement __atomic_load using ldar rather than ldapr here.
I would suggest __APPLE__ && __clang__ && __aarch64__. (And please, a lovely succinct comment explaining why we're special-casing Darwin/aarch64, for when we come back to it in N months time and wonder "huh?"!)
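A minimal sketch of what that guard and comment could look like; the workaround macro name here is hypothetical, not taken from the merged patch:

/*
 * Apple clang on aarch64 emits ldapr for __ATOMIC_ACQUIRE loads, which
 * we have observed misordering reads on virtualized M1 cpus. Special-case
 * Darwin/aarch64 with clang so we can substitute an ldar-based load.
 */
#if defined(__APPLE__) && defined(__clang__) && defined(__aarch64__)
# define APPLE_LDAPR_WORKAROUND
#endif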
Do both gcc and clang output the errant instruction?
@pauli unsure, the gcc command on our m1 is aliased to clang.
From memory, the cross-compile builds include aarch64 with gcc.
@nhorman - you can install an actual gcc on macOS alongside Clang with Homebrew.
@nSircombe interestingly, I installed gcc-13 via homebrew on our development M1 and built with:
dumping out the assembly for the ossl_rcu_uptr_deref function:
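0000000100120488 <_ossl_rcu_uptr_deref>:
100120488: f8bfc000 ldar x0, [x0]
10012048c: d65f03c0 ret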
so there is clearly a discrepancy here between gcc and apple's clang, which produces:
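0000000100120488 <_ossl_rcu_uptr_deref>:
100120488: f8bfc000 ldapr x0, [x0]
10012048c: d65f03c0 ret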
The CI job we run on this doesn't install any additional software from homebrew, so I'm relatively certain that it's using the apple clang 15 compiler.
Since the addition of macos14 M1 runners in our CI jobs we've been seeing periodic random failures in the test_threads CI job. Specifically, we've seen instances in which the shared pointer in the test (which points to a monotonically incrementing uint64_t) went backwards.

From taking a look at the disassembled code in the failing case, we see that __atomic_load_n, when emitted by clang 15, looks like this:

0000000100120488 <_ossl_rcu_uptr_deref>:
100120488: f8bfc000 ldapr x0, [x0]
10012048c: d65f03c0 ret

Notably, when compiling with gcc on the same system we get this output instead:

0000000100120488 <_ossl_rcu_uptr_deref>:
100120488: f8bfc000 ldar x0, [x0]
10012048c: d65f03c0 ret

Checking the arm docs for the difference between ldar and ldapr:
https://developer.arm.com/documentation/ddi0602/2023-09/Base-Instructions/LDAPR--Load-Acquire-RCpc-Register-
https://developer.arm.com/documentation/dui0802/b/A64-Data-Transfer-Instructions/LDAR

It seems that the ldar instruction provides a global cpu fence, not completing until all writes in a given cpu's writeback queue have completed. Conversely, the ldapr instruction attempts to achieve performance improvements by honoring the Local Ordering register available in the system coprocessor, only flushing writes in the same address region as other cpus on the system.

I believe that on M1 virtualized cpus the ldapr is not properly ordering writes, leading to an out-of-order read, despite the needed fencing. I've opened an issue with apple on this here:
https://developer.apple.com/forums/thread/749530

I believe that it is not safe to issue an ldapr instruction unless the programmer knows that the Local Ordering registers are properly configured for use on the system.

So to fix it, I'm proposing with this patch that, in the event that:
1) __APPLE__ is defined AND
2) __clang__ is defined AND
3) __aarch64__ is defined
during the build, we override the ATOMIC_LOAD_N macro in the rcu code such that it uses a custom function with inline assembly to emit the ldar instruction rather than the ldapr instruction. The above conditions should get us to where this is only used on more recent Mac cpus, and only in the case where the affected clang compiler emits the offending instruction.

I've run this patch 10 times in our CI and failed to reproduce the issue, whereas previously I could routinely trigger it within 5 runs.
OK, I've updated this patch to not use the relaxed ordering/fencing approach, and instead implemented an ldar-based atomic_load function, to be used only when __APPLE__, __clang__, and __aarch64__ are all defined. I've run it several times, and encountered no issues on macosx. Please review and (re)approve when you all can.
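For illustration, the helper described would look something like the following sketch; the function name is assumed rather than taken from the patch:

static inline void *apple_ldar_load_pvoid(void **p)
{
    void *ret;

    /* ldar provides the acquire-with-global-ordering semantics that
     * clang's ldapr does not appear to honor on virtualized M1 cpus */
    __asm volatile("ldar %0, [%1]"
                   : "=r" (ret)
                   : "r" (p)
                   : "memory");
    return ret;
}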
LGTM - nice. Now let's hope we never get one of these test failures again!
I've updated the labels to indicate my committer approval and that an OTC (re-)approval is required.
This pull request is ready to merge |
Merged to the master and 3.3 branches. Thank you.
Reviewed-by: Tom Cosgrove <tom.cosgrove@arm.com>
Reviewed-by: Paul Dale <ppzgs1@gmail.com>
Reviewed-by: Tomas Mraz <tomas@openssl.org>
(Merged from #23974)
(cherry picked from commit f5b5a35)
Since the addition of macos14 M1 runners in our CI jobs we've been
seeing periodic random failures in the test_threads CI job.
Specifically we've seen instances in which the shared pointer in the
test (which points to a monotonically incrementing uint64_t) went
backwards.
I believe the following is what's happening:
ossl_rcu_uptr_deref and ossl_rcu_assign_uptr are implemented using
__atomic_load_n and __atomic_store_n. The store operation is tagged
as __ATOMIC_RELEASE and the load is tagged as __ATOMIC_ACQUIRE.
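A simplified sketch of that pattern (the real functions live in
OpenSSL's rcu code; the exact signatures here are illustrative):

void ossl_rcu_assign_uptr(void **p, void **v)
{
    /* release store: publishes prior writes along with the pointer */
    __atomic_store_n(p, *v, __ATOMIC_RELEASE);
}

void *ossl_rcu_uptr_deref(void **p)
{
    /* acquire load: pairs with the release store above */
    return __atomic_load_n(p, __ATOMIC_ACQUIRE);
}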
On strongly ordered memory access cpus, like x86, this is sufficient,
as instructions are emitted for these builtins to ensure proper
fencing of any prior reads/writes (i.e. a lock prefix may be prepended
to ensure cache synchronization). See:
https://en.cppreference.com/w/c/atomic/memory_order
Specifically the section on release-acquire ordering
On weakly ordered memory access cpus, like arm, additional fencing
instructions must be issued to ensure prior reads/writes become
visible to other cpus. Again see:
https://en.cppreference.com/w/c/atomic/memory_order
Ostensibly, I would have assumed that an
__atomic_load_n/__atomic_store_n with the tagged semantics would
have resulted in the implicit emission of a fence instruction to ensure
ordering. However, disassembling the output of a build on a macosx 14
system with the clang compiler shows that not to be the case:
0000000100120488 <_ossl_rcu_uptr_deref>:
100120488: f8bfc000 ldapr x0, [x0]
10012048c: d65f03c0 ret
0000000100120490 <_ossl_rcu_assign_uptr>:
100120490: f9400028 ldr x8, [x1]
100120494: c89ffc08 stlr x8, [x0]
100120498: d65f03c0 ret
I'm not well versed enough in the standard to know if this translation
would be a compiler bug or not, but it does seem that, in a weakly
ordered system, this series of instructions, when run on multiple
threads may lead to writes getting published to other threads late,
resulting in a counter that appeared to go backwards.
So what to do about it? From my interpretation of memory fencing:
https://en.cppreference.com/w/c/atomic/atomic_thread_fence
It seems we can issue these atomic loads/stores with a relaxed ordering
and independent fencing.
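In code, that approach would look something like this sketch (relaxed
atomics paired with explicit thread fences; not the exact patch text):

void *ossl_rcu_uptr_deref(void **p)
{
    void *ret = __atomic_load_n(p, __ATOMIC_RELAXED);

    /* acquire fence: no later read/write may be reordered before the load */
    __atomic_thread_fence(__ATOMIC_ACQUIRE);
    return ret;
}

void ossl_rcu_assign_uptr(void **p, void **v)
{
    /* release fence: all prior writes become visible before the store */
    __atomic_thread_fence(__ATOMIC_RELEASE);
    __atomic_store_n(p, *v, __ATOMIC_RELAXED);
}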
By doing so, the M1 assembly turns into this:
0000000100120488 <_ossl_rcu_uptr_deref>:
100120488: f9400000 ldr x0, [x0]
10012048c: d50339bf dmb ishld
100120490: d65f03c0 ret
0000000100120494 <_ossl_rcu_assign_uptr>:
100120494: d5033bbf dmb ish
100120498: f9400028 ldr x8, [x1]
10012049c: f9000008 str x8, [x0]
1001204a0: d65f03c0 ret
In which dmb (memory barrier) instructions are inserted as expected,
enforcing publishing of all prior issued memory instructions.
I've run these changes 10 times through CI (which is the only place
we've found that reproduces this issue), and encountered no subsequent
failures.
Note: To avoid tsan errors that arise from the inability of the
gcc/clang thread sanitizer to properly interpret thread fences, TSAN
annotations must be added to properly inform the sanitizer that this
code is correctly ordered.
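For illustration, annotations along these lines could pair with the
fences above; the wrapper macro names are hypothetical, though
__tsan_acquire/__tsan_release are part of the public TSAN interface:

#if defined(__has_feature)
# if __has_feature(thread_sanitizer)
#  include <sanitizer/tsan_interface.h>
#  define TSAN_ACQUIRE(p) __tsan_acquire(p)  /* tell tsan an acquire happened at p */
#  define TSAN_RELEASE(p) __tsan_release(p)  /* tell tsan a release happened at p */
# endif
#endif
#ifndef TSAN_ACQUIRE
# define TSAN_ACQUIRE(p)
# define TSAN_RELEASE(p)
#endif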