
Correct memory ordering for rcu atomic assign/deref on Apple M1 cpus #23974

Closed
wants to merge 1 commit

Conversation

nhorman
Contributor

@nhorman nhorman commented Mar 26, 2024

Since the addition of macos14 M1 runners in our CI jobs we've been
seeing periodic random failures in the test_threads CI job.
Specifically, we've seen instances in which the shared pointer in the
test (which points to a monotonically incrementing uint64_t) went
backwards.

I believe the following is what's happening:

  1. ossl_rcu_uptr_deref and ossl_rcu_assign_uptr are implemented using
    __atomic_load_n and __atomic_store_n. The store operation is tagged
    as __ATOMIC_RELEASE and the load is tagged as __ATOMIC_ACQUIRE.

  2. On strongly ordered memory access cpus, like x86, this is sufficient,
    as instructions are emitted for these builtins to ensure proper
    fencing of any prior reads/writes (i.e. a lock prefix may be prepended
    to ensure cache synchronization). See:
    https://en.cppreference.com/w/c/atomic/memory_order
    specifically the section on release-acquire ordering.

  3. On weakly ordered memory access cpus, like arm, additional fencing
    instructions must be issued to ensure prior reads/writes become
    visible to other cpus. Again see:
    https://en.cppreference.com/w/c/atomic/memory_order

  4. Naively, I would have assumed that an
    __atomic_load_n/__atomic_store_n with the tagged semantics would
    result in the implicit emission of a fence instruction to ensure
    ordering. However, disassembling the output of a build on a macOS 14
    system with the clang compiler shows that not to be the case:
    0000000100120488 <_ossl_rcu_uptr_deref>:
    100120488: f8bfc000 ldapr x0, [x0]
    10012048c: d65f03c0 ret

    0000000100120490 <_ossl_rcu_assign_uptr>:
    100120490: f9400028 ldr x8, [x1]
    100120494: c89ffc08 stlr x8, [x0]
    100120498: d65f03c0 ret

    I'm not well versed enough in the standard to know whether this
    translation is a compiler bug, but it does seem that, on a weakly
    ordered system, this series of instructions, when run on multiple
    threads, may lead to writes being published to other threads late,
    resulting in a counter that appears to go backwards.

    So what to do about it? From my interpretation of memory fencing:
    https://en.cppreference.com/w/c/atomic/atomic_thread_fence
    it seems we can issue these atomic loads/stores with relaxed ordering
    and independent fencing (a sketch of that approach follows this
    description).

    By doing so, the M1 assembly turns into this:
    0000000100120488 <_ossl_rcu_uptr_deref>:
    100120488: f9400000 ldr x0, [x0]
    10012048c: d50339bf dmb ishld
    100120490: d65f03c0 ret

    0000000100120494 <_ossl_rcu_assign_uptr>:
    100120494: d5033bbf dmb ish
    100120498: f9400028 ldr x8, [x1]
    10012049c: f9000008 str x8, [x0]
    1001204a0: d65f03c0 ret

    Here dmb (data memory barrier) instructions are inserted as expected,
    enforcing publication of all previously issued memory accesses.

    I've run these changes 10 times through CI (which is the only place
    we've found that reproduces this issue), and encountered no subsequent
    failures.

    Note: To avoid tsan errors that arise from the inability of the
    gcc/clang thread sanitizer to properly interpret thread fences,
    TSAN annotations are required to inform the sanitizer that this
    code is correctly ordered.
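
    For reference, a minimal sketch of that relaxed-ordering-plus-fence
    approach (the function names here are illustrative, and the TSAN
    annotations mentioned above are omitted):

    #include <stdint.h>

    /* acquire side: relaxed load followed by an acquire fence (ldr; dmb ishld) */
    static void *rcu_uptr_deref_sketch(void **p)
    {
        void *ret = __atomic_load_n(p, __ATOMIC_RELAXED);

        __atomic_thread_fence(__ATOMIC_ACQUIRE);
        return ret;
    }

    /* release side: release fence before a relaxed store (dmb ish; ldr; str) */
    static void rcu_assign_uptr_sketch(void **p, void **v)
    {
        __atomic_thread_fence(__ATOMIC_RELEASE);
        __atomic_store_n(p, *v, __ATOMIC_RELAXED);
    }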

@nhorman nhorman self-assigned this Mar 26, 2024
@nhorman nhorman marked this pull request as draft March 26, 2024 14:00
@github-actions github-actions bot added the severity: fips change (The pull request changes FIPS provider sources) label Mar 26, 2024
@nhorman nhorman force-pushed the fencing branch 6 times, most recently from 2aed0d2 to 69e9e98 Compare March 27, 2024 15:02
@nhorman nhorman marked this pull request as ready for review March 27, 2024 20:07
@nhorman nhorman changed the title experimenting with atomic_seq_cst Correct memory ordering for rcu atomc assign/deref on Apple M1 cpus Mar 28, 2024
@nhorman nhorman added the issue: bug report (The issue was opened to report a bug), triaged: bug (The issue/pr is/fixes a bug) and severity: important (Important bugs affecting a released version) labels Mar 28, 2024
@nhorman nhorman changed the title Correct memory ordering for rcu atomc assign/deref on Apple M1 cpus Correct memory ordering for rcu atomic assign/deref on Apple M1 cpus Mar 28, 2024
@nhorman nhorman added the branch: master (Merge to master branch) and branch: 3.3 (Merge to openssl-3.3) labels Mar 28, 2024
@nhorman
Contributor Author

nhorman commented Mar 29, 2024

@tom-cosgrove-arm would greatly appreciate your input on this issue please. Specifically, if you could comment on the memory synchronization properties of Arm cores in general (and the M1 specifically, if possible), that would be a big help. I'm trying to concretely understand whether an __atomic_load_n/__atomic_store_n operation with __ATOMIC_ACQUIRE/__ATOMIC_RELEASE ordering is expected to imply a memory fence in code targeting those processors. The assembly I've observed seems to say "no, it doesn't", but I'm not sure whether that's expected.
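
For reference, a minimal standalone test case along these lines (a sketch, not code from this PR, assuming plain uint64_t values) is enough to compare what different compilers emit for these builtins when targeting aarch64:

#include <stdint.h>

/* acquire load: clang emits ldapr here, gcc emits ldar (see later in the thread) */
uint64_t load_acquire(uint64_t *p)
{
    return __atomic_load_n(p, __ATOMIC_ACQUIRE);
}

/* release store: clang emits stlr for this (see the disassembly above) */
void store_release(uint64_t *p, uint64_t v)
{
    __atomic_store_n(p, v, __ATOMIC_RELEASE);
}

On the systems discussed in this thread, neither compiler emits a separate dmb for the acquire load.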

@nhorman
Contributor Author

nhorman commented Mar 29, 2024

To clarify what I think is happening here, as a sequence diagram:
[sequence diagram omitted]

In the above sequence diagram, because (as I believe) Arm CPUs don't imply fencing on atomic loads/stores, multiple writer threads issue store instructions for data at addresses shared between threads, but, because of the weak memory ordering semantics, those writes can become visible in a different order than the one in which the two independent CPUs issued them on the timeline. As a result, two subsequent reads of that data might see out-of-order results (i.e. a reading thread might see the value posted by writer thread B first, and only later see the value posted by writer thread A, leading to the error we see in which the reader thinks the shared counter went "backwards").
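
A minimal sketch of the reader/writer pattern described above (illustrative only; the helper signatures are inferred from the disassembly earlier in the thread, and the real test logic is more involved):

#include <stdint.h>

void ossl_rcu_assign_uptr(void **p, void **v);  /* release store (stlr)      */
void *ossl_rcu_uptr_deref(void **p);            /* acquire load (ldapr/ldar) */

static uint64_t *shared_counter;   /* pointer shared between threads */

/* writer thread: publish a pointer to a strictly larger counter value */
static void writer_publish(uint64_t *newval)
{
    ossl_rcu_assign_uptr((void **)&shared_counter, (void **)&newval);
}

/* reader thread: successive dereferences should never observe a smaller value */
static uint64_t reader_observe(void)
{
    uint64_t *p = ossl_rcu_uptr_deref((void **)&shared_counter);

    return *p;
}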

@tom-cosgrove-arm
Contributor

tom-cosgrove-arm commented Mar 29, 2024

@nhorman Yes, I would generally expect an __atomic_load_n(foo, __ATOMIC_ACQUIRE) or __atomic_store(foo, bar, __ATOMIC_RELEASE) to issue the required memory barrier instructions, and I would assume lack of these to be a compiler bug.

I will check with our compiler folks to get their opinions, however - and with the bank holidays, it will probably take a few days to get an answer

@nhorman
Contributor Author

nhorman commented Mar 29, 2024

@tom-cosgrove-arm thank you, no rush. We have a workaround for the 3.3 release on this, just trying to get it fixed properly during 3.4

@tom-cosgrove-arm
Contributor

@nhorman A bit of experimenting with multiple compilers and versions (on Linux) shows that neither gcc nor clang adds the memory barriers. On that basis I was clearly wrong to expect this to happen, and it's not a compiler bug. I'll ask our compiler folks to understand why this happens

@nhorman
Contributor Author

nhorman commented Mar 29, 2024

@tom-cosgrove-arm that matches my observations in the initial problem statement here. My assumption is that the memory order parameter to these operations only specifies restrictions on code hoisting/sinking within the given compilation unit. It just so happens that on x86 the compiler's use of the lock prefix on the load and store instructions implies memory synchronization between cpu caches that isn't guaranteed to exist on other arches. I'll be interested to see whether your compiler team agrees with that assessment.

@nhorman
Contributor Author

nhorman commented Apr 2, 2024

@tom-cosgrove-arm apologies, I don't mean to rush you here, but there's been some shared interest in determining what the expected behavior on Arm is, and whether this patch is on the right track for fixing it.

@nhorman
Contributor Author

nhorman commented Apr 4, 2024

@tom-cosgrove-arm yeah, I can be on board with that.

Question: How do we want to detect Apple silicon? The easy method would be to use #ifdef __aarch64__, but that nets all Arm silicon, not just Apple's. I could detect macOS, which would be a bit closer to the mark, but would net both physical and virtual CPUs, which may be the best we can do without a runtime check (which I would like to avoid for performance reasons)

@nhorman
Contributor Author

nhorman commented Apr 4, 2024

OK, so since the other PR was just for testing, and all our conversations have been here, I'm going to update this PR with the change to implement __atomic_load using ldar rather than ldapr here

@tom-cosgrove-arm
Contributor

Question: How do we want to detect apple silicon?

I would suggest #if defined(__APPLE__) && defined(__aarch64__). Yes, this will catch physical CPUs as well as virtual ones, but I don't think we can help that.

(And please, a lovely succinct comment explaining why we're special-casing Darwin/aarch64, for when we come back to it in N months time and wonder "huh?"!)
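
A sketch of what that gate could look like with the requested comment folded in (the macro name is illustrative, and the compiler check raised just below is included as well):

#if defined(__APPLE__) && defined(__aarch64__) && defined(__clang__)
/*
 * Apple clang implements __atomic_load_n(p, __ATOMIC_ACQUIRE) with the RCpc
 * ldapr instruction, which we have observed letting the rcu test counter go
 * backwards on virtualized M1 CPUs.  Special-case Darwin/aarch64 so the rcu
 * code can use an ldar-based load instead.
 */
# define APPLE_ARM64_ATOMIC_WORKAROUND  /* illustrative name, not the actual macro */
#endif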

@paulidale
Contributor

Do both gcc and clang output the errant instruction?
If not, we should condition on the compiler too.

@nhorman
Contributor Author

nhorman commented Apr 5, 2024

@pauli unsure; the gcc command on our M1 is aliased to clang

@paulidale
Contributor

From memory, the cross-compile builds include aarch64 with gcc.

@nSircombe

@pauli, unsure, gcc command on our m1 is aliased to clang

@nhorman - you can install an actual gcc on macOS alongside Clang with Homebrew.

> gcc --version
Apple clang version 15.0.0 (clang-1500.3.9.4)
Target: arm64-apple-darwin23.1.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

> gcc-12 --version
gcc-12 (Homebrew GCC 12.3.0) 12.3.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

@nhorman
Contributor Author

nhorman commented Apr 5, 2024

@nSircombe Interestingly, I installed gcc-13 via Homebrew on our development M1 and built with:

CC=gcc-13 ./Configure

dumping out the assembly for the ossl_rcu_uptr_deref function:

0000000100143a40 <_ossl_rcu_uptr_deref>:
100143a40: c8dffc00     ldar    x0, [x0]
100143a44: d65f03c0     ret     
100143a48: d503201f     nop     
100143a4c: d503201f     nop   

so there is clearly a discrepancy here between gcc and Apple's clang, which produces:

0000000100120488 <_ossl_rcu_uptr_deref>:
100120488: f8bfc000 ldapr x0, [x0]
10012048c: d65f03c0 ret

The CI job we run on this doesn't install any additional software from Homebrew, so I'm relatively certain that it's using the Apple clang 15 compiler.

@nhorman nhorman force-pushed the fencing branch 2 times, most recently from e791fb8 to 8773494 Compare April 5, 2024 18:31
Since the addition of macos14 M1 runners in our CI jobs we've been
seeing periodic random failures in the test_threads CI job.
Specifically, we've seen instances in which the shared pointer in the
test (which points to a monotonically incrementing uint64_t) went
backwards.

From taking a look at the disassembled code in the failing case, we see
that __atomic_load_n, as emitted by clang 15, looks like this:
0000000100120488 <_ossl_rcu_uptr_deref>:
100120488: f8bfc000     ldapr   x0, [x0]
10012048c: d65f03c0     ret

Notably, when compiling with gcc on the same system we get this output
instead:
0000000100120488 <_ossl_rcu_uptr_deref>:
100120488: c8dffc00     ldar   x0, [x0]
10012048c: d65f03c0     ret

Checking the arm docs for the difference between ldar and ldapr:
https://developer.arm.com/documentation/ddi0602/2023-09/Base-Instructions/LDAPR--Load-Acquire-RCpc-Register-
https://developer.arm.com/documentation/dui0802/b/A64-Data-Transfer-Instructions/LDAR

It seems that the ldar instruction provides a global cpu fence, not
completing until all writes in a given cpu's writeback queue have
completed.

Conversely, the ldapr instruction attempts to achieve performance
improvements by honoring the Local Ordering register available in the
system coprocessor, only flushing writes in the same address region as
other cpus on the system.

I believe that on M1 virtualized cpus the ldapr is not properly ordering
writes, leading to an out of order read, despite the needed fencing.
I've opened an issue with apple on this here:
https://developer.apple.com/forums/thread/749530

I believe that it is not safe to issue an ldapr instruction unless the
programmer knows that the Local order registers are properly configured
for use on the system.

So to fix it, this patch proposes that, in the event that:
1) __APPLE__ is defined
AND
2) __clang__ is defined
AND
3) __aarch64__ is defined

during the build, we override the ATOMIC_LOAD_N macro in the rcu
code with a custom function that uses inline assembly to emit the
ldar instruction rather than the ldapr instruction.  The above
conditions should limit this to the more recent Mac CPUs, and only
to the case where the affected clang compiler emits the offending
instruction.

I've run this patch 10 times in our CI and failed to reproduce the
issue, whereas previously I could trigger it within 5 runs routinely.
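
A sketch of the kind of override described above (ATOMIC_LOAD_N is the macro named in the message, but its exact parameters and the helper below are illustrative rather than the final patch):

#if defined(__APPLE__) && defined(__clang__) && defined(__aarch64__)
/*
 * Illustrative helper: force an ldar (acquire) load rather than the RCpc
 * ldapr that Apple clang emits for __atomic_load_n(..., __ATOMIC_ACQUIRE).
 */
static inline void *apple_ldar_load(void **p)
{
    void *ret;

    __asm__ volatile("ldar %0, [%1]"
                     : "=r" (ret)
                     : "r" (p)
                     : "memory");
    return ret;
}
#  define ATOMIC_LOAD_N(p, o) apple_ldar_load((void **)(p))
#else
#  define ATOMIC_LOAD_N(p, o) __atomic_load_n((p), (o))
#endif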
@nhorman
Contributor Author

nhorman commented Apr 5, 2024

OK, I've updated this patch to drop the relaxed-ordering/fencing approach and instead implement an ldar-based atomic load function, used only when __APPLE__, __clang__ and __aarch64__ are defined. I've run it several times and encountered no issues on macOS. Please review and (re)approve when you all can.

Contributor

@tom-cosgrove-arm tom-cosgrove-arm left a comment


LGTM - nice. Now let's hope we never get one of these test failures again!

@tom-cosgrove-arm tom-cosgrove-arm added the approval: otc review pending (This pull request needs review by an OTC member) label and removed the approval: review pending (This pull request needs review by a committer) label Apr 6, 2024
@tom-cosgrove-arm
Contributor

I've updated the labels to indicate my committer approval and that an OTC (re-)approval is required

@paulidale paulidale added the approval: done (This pull request has the required number of approvals) label and removed the approval: otc review pending (This pull request needs review by an OTC member) label Apr 7, 2024
@openssl-machine openssl-machine removed the approval: done (This pull request has the required number of approvals) label Apr 8, 2024
@openssl-machine
Collaborator

This pull request is ready to merge

@openssl-machine openssl-machine added the approval: ready to merge (The 24 hour grace period has passed, ready to merge) label Apr 8, 2024
@t8m t8m added the tests: exempted (The PR is exempt from requirements for testing) label Apr 8, 2024
@t8m
Member

t8m commented Apr 10, 2024

Merged to the master and 3.3 branches. Thank you.

@t8m t8m closed this Apr 10, 2024
openssl-machine pushed a commit that referenced this pull request Apr 10, 2024

Reviewed-by: Tom Cosgrove <tom.cosgrove@arm.com>
Reviewed-by: Paul Dale <ppzgs1@gmail.com>
Reviewed-by: Tomas Mraz <tomas@openssl.org>
(Merged from #23974)
openssl-machine pushed a commit that referenced this pull request Apr 10, 2024

(cherry picked from commit f5b5a35)