Correct memory ordering for rcu atomic assign/deref on Apple M1 cpus #23974
Conversation
@tom-cosgrove-arm would greatly appreciate your input on this issue please. Specifically, if you could comment on the memory synchronization properties of arm cores in general (and the M1 specifically if possible) that would be a big help. Trying to concretely understand if an __atomic_[load|store] operation with __ATOMIC_RELEASE/__ATOMIC_ACQUIRE ordering is expected to imply a memory fence on code targeting those processors. The assembly I've observed seems to say "no, it doesn't", but I'm not sure if that's expected or not.
@nhorman Yes, I would generally expect a memory barrier to be emitted here. I will check with our compiler folks to get their opinions, however - and with the bank holidays, it will probably take a few days to get an answer.
@tom-cosgrove-arm thank you, no rush. We have a workaround for the 3.3 release on this, just trying to get it fixed properly during 3.4.
@nhorman A bit of experimenting with multiple compilers and versions (on Linux) shows that neither gcc nor clang adds the memory barriers. On that basis, clearly I was wrong to expect this to happen, and it's not a compiler bug. I'll ask our compiler folks to understand why this happens.
@tom-cosgrove-arm that matches my observations in the initial problem statement here. My assumption is that the memory order parameter to these operations only specifies restrictions on code hoisting/sinking within the given compilation unit. It just so happens that on x86 the compiler, in its use of the lock prefix on the load and store instructions, implies memory synchronization between cpu caches that isn't guaranteed to exist on other arches. I'll be interested to see if your compiler team agrees with that assessment.
@tom-cosgrove-arm apologies, don't mean to rush you here, but there's been some shared interest in determining what the expected behavior on arm is, and whether this patch is on the right track for fixing it.
@tom-cosgrove-arm yeah, I can be on board with that. Question: how do we want to detect apple silicon? The easy method would be to use #ifdef __aarch64__, but that nets in all arm silicon, not just apple. I could detect MACOSX, which would be a bit closer to the mark, but would net both physical and virtual cpus, which may be the best we can do without a runtime check (which I would like to avoid for performance reasons).
OK, so since the other PR was just for testing, and all our conversations have been here, I'm going to update this PR with the change to implement __atomic_load using ldar rather than ldapr here.
I would suggest __APPLE__ && __clang__ && __aarch64__. (And please, a lovely succinct comment explaining why we're special-casing Darwin/aarch64, for when we come back to it in N months time and wonder "huh?"!)
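A minimal sketch of what that guard and comment could look like; the workaround macro name here is hypothetical, not taken from the merged patch:

/*
 * Apple clang on aarch64 emits ldapr for __ATOMIC_ACQUIRE loads, which
 * we have observed misordering reads on virtualized M1 cpus. Special-case
 * Darwin/aarch64 with clang so we can substitute an ldar-based load.
 */
#if defined(__APPLE__) && defined(__clang__) && defined(__aarch64__)
# define APPLE_LDAPR_WORKAROUND
#endif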
Do both gcc and clang output the errant instruction?
@pauli unsure, the gcc command on our m1 is aliased to clang.
From memory, the cross-compile builds include aarch64 with gcc.
@nhorman - you can install an actual gcc on macOS alongside Clang with Homebrew.
@nSircombe interestingly, I installed gcc-13 via homebrew on our development M1 and built with:
dumping out the assembly for the ossl_rcu_uptr_deref function:
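0000000100120488 <_ossl_rcu_uptr_deref>:
100120488: f8bfc000 ldar x0, [x0]
10012048c: d65f03c0 ret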
so there is clearly a discrepancy here between gcc and apple's clang, which produces:
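0000000100120488 <_ossl_rcu_uptr_deref>:
100120488: f8bfc000 ldapr x0, [x0]
10012048c: d65f03c0 ret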
The CI job we run on this doesn't install any additional software from homebrew, so I'm relatively certain that it's using the apple clang 15 compiler.
Since the addition of macos14 M1 runners in our CI jobs we've been seeing periodic random failures in the test_threads CI job. Specifically, we've seen instances in which the shared pointer in the test (which points to a monotonically incrementing uint64_t) went backwards.

From taking a look at the disassembled code in the failing case, we see that __atomic_load_n, when emitted by clang 15, looks like this:

0000000100120488 <_ossl_rcu_uptr_deref>:
100120488: f8bfc000 ldapr x0, [x0]
10012048c: d65f03c0 ret

Notably, when compiling with gcc on the same system we get this output instead:

0000000100120488 <_ossl_rcu_uptr_deref>:
100120488: f8bfc000 ldar x0, [x0]
10012048c: d65f03c0 ret

Checking the arm docs for the difference between ldar and ldapr:
https://developer.arm.com/documentation/ddi0602/2023-09/Base-Instructions/LDAPR--Load-Acquire-RCpc-Register-
https://developer.arm.com/documentation/dui0802/b/A64-Data-Transfer-Instructions/LDAR

It seems that the ldar instruction provides a global cpu fence, not completing until all writes in a given cpu's writeback queue have completed. Conversely, the ldapr instruction attempts to achieve performance improvements by honoring the Local Ordering register available in the system coprocessor, only flushing writes in the same address region as other cpus on the system.

I believe that on M1 virtualized cpus the ldapr is not properly ordering writes, leading to an out-of-order read, despite the needed fencing. I've opened an issue with apple on this here:
https://developer.apple.com/forums/thread/749530

I believe that it is not safe to issue an ldapr instruction unless the programmer knows that the Local Ordering registers are properly configured for use on the system.

So to fix it, I'm proposing with this patch that, in the event that:
1) __APPLE__ is defined AND
2) __clang__ is defined AND
3) __aarch64__ is defined
during the build, we override the ATOMIC_LOAD_N macro in the rcu code such that it uses a custom function with inline assembly to emit the ldar instruction rather than the ldapr instruction. The above conditions should get us to where this is only used on more recent Mac cpus, and only in the case where the affected clang compiler emits the offending instruction.

I've run this patch 10 times in our CI and failed to reproduce the issue, whereas previously I could routinely trigger it within 5 runs.
OK, I've updated this patch to not use the relaxed ordering/fencing approach, and instead implemented an ldar-based atomic_load function, to be used only when __APPLE__, __clang__, and __aarch64__ are all defined. I've run it several times, and encountered no issues on macosx. Please review and (re)approve when you all can.
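For illustration, the helper described would look something like the following sketch; the function name is assumed rather than taken from the patch:

static inline void *apple_ldar_load_pvoid(void **p)
{
    void *ret;

    /* ldar provides the acquire-with-global-ordering semantics that
     * clang's ldapr does not appear to honor on virtualized M1 cpus */
    __asm volatile("ldar %0, [%1]"
                   : "=r" (ret)
                   : "r" (p)
                   : "memory");
    return ret;
}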
LGTM - nice. Now let's hope we never get one of these test failures again!
I've updated the labels to indicate my committer approval and that an OTC (re-)approval is required.
This pull request is ready to merge |
Merged to the master and 3.3 branches. Thank you.
Reviewed-by: Tom Cosgrove <tom.cosgrove@arm.com>
Reviewed-by: Paul Dale <ppzgs1@gmail.com>
Reviewed-by: Tomas Mraz <tomas@openssl.org>
(Merged from #23974)
(cherry picked from commit f5b5a35)
Since the addition of macos14 M1 runners in our CI jobs we've been
seeing periodic random failures in the test_threads CI job.
Specifically we've seen instances in which the shared pointer in the
test (which points to a monotonically incrementing uint64_t) went
backwards.
I believe the following is what's happening:
ossl_rcu_uptr_deref and ossl_rcu_assign_uptr are implemented using
__atomic_load_n and __atomic_store_n. The store operation is tagged
as __ATOMIC_RELEASE and the load is tagged as __ATOMIC_ACQUIRE.
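A simplified sketch of that pattern (the real functions live in
OpenSSL's rcu code; the exact signatures here are illustrative):

void ossl_rcu_assign_uptr(void **p, void **v)
{
    /* release store: publishes prior writes along with the pointer */
    __atomic_store_n(p, *v, __ATOMIC_RELEASE);
}

void *ossl_rcu_uptr_deref(void **p)
{
    /* acquire load: pairs with the release store above */
    return __atomic_load_n(p, __ATOMIC_ACQUIRE);
}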
On strongly ordered memory access cpus, like x86, this is sufficient,
as instructions are emitted for these builtins to ensure proper
fencing of any prior reads/writes (i.e. a lock prefix may be prepended
to ensure cache synchronization). See:
https://en.cppreference.com/w/c/atomic/memory_order
Specifically the section on release-acquire ordering
On weakly ordered memory access cpus, like arm, additional fencing
instructions must be issued to ensure prior reads/writes become
visible to other cpus. Again see:
https://en.cppreference.com/w/c/atomic/memory_order
Ostensibly, I would have assumed that an
__atomic_load_n/__atomic_store_n with the tagged semantics would
have resulted in the implicit emission of a fence instruction to ensure
ordering. However, disassembling the output of a build on a macosx 14
system with the clang compiler shows that not to be the case:
0000000100120488 <_ossl_rcu_uptr_deref>:
100120488: f8bfc000 ldapr x0, [x0]
10012048c: d65f03c0 ret
0000000100120490 <_ossl_rcu_assign_uptr>:
100120490: f9400028 ldr x8, [x1]
100120494: c89ffc08 stlr x8, [x0]
100120498: d65f03c0 ret
I'm not well versed enough in the standard to know if this translation
would be a compiler bug or not, but it does seem that, in a weakly
ordered system, this series of instructions, when run on multiple
threads may lead to writes getting published to other threads late,
resulting in a counter that appeared to go backwards.
So what to do about it? From my interpretation of memory fencing:
https://en.cppreference.com/w/c/atomic/atomic_thread_fence
It seems we can issue these atomic loads/stores with a relaxed ordering
and independent fencing.
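In code, that approach would look something like this sketch (relaxed
atomics paired with explicit thread fences; not the exact patch text):

void *ossl_rcu_uptr_deref(void **p)
{
    void *ret = __atomic_load_n(p, __ATOMIC_RELAXED);

    /* acquire fence: no later read/write may be reordered before the load */
    __atomic_thread_fence(__ATOMIC_ACQUIRE);
    return ret;
}

void ossl_rcu_assign_uptr(void **p, void **v)
{
    /* release fence: all prior writes become visible before the store */
    __atomic_thread_fence(__ATOMIC_RELEASE);
    __atomic_store_n(p, *v, __ATOMIC_RELAXED);
}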
By doing so, the M1 assembly turns into this:
0000000100120488 <_ossl_rcu_uptr_deref>:
100120488: f9400000 ldr x0, [x0]
10012048c: d50339bf dmb ishld
100120490: d65f03c0 ret
0000000100120494 <_ossl_rcu_assign_uptr>:
100120494: d5033bbf dmb ish
100120498: f9400028 ldr x8, [x1]
10012049c: f9000008 str x8, [x0]
1001204a0: d65f03c0 ret
In which dmb (memory barrier) instructions are inserted as expected,
enforcing publishing of all prior issued memory instructions.
I've run these changes 10 times through CI (which is the only place
we've found that reproduces this issue), and encountered no subsequent
failures.
Note: To avoid tsan errors that arise from the inability of the
gcc/clang thread sanitizer to properly interpret thread fences, TSAN
annotations must be added to properly inform the sanitizer that this
code is correctly ordered.
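For illustration, annotations along these lines could pair with the
fences above; the wrapper macro names are hypothetical, though
__tsan_acquire/__tsan_release are part of the public TSAN interface:

#if defined(__has_feature)
# if __has_feature(thread_sanitizer)
#  include <sanitizer/tsan_interface.h>
#  define TSAN_ACQUIRE(p) __tsan_acquire(p)  /* tell tsan an acquire happened at p */
#  define TSAN_RELEASE(p) __tsan_release(p)  /* tell tsan a release happened at p */
# endif
#endif
#ifndef TSAN_ACQUIRE
# define TSAN_ACQUIRE(p)
# define TSAN_RELEASE(p)
#endif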