[Armv7-a]: Control-dependency between atomic accesses removed by -O1. #65106
Your "C test" has undefined behavior, I think? There's a race on "x". |
Apologies - this test was automatically generated. If we make x atomic, and make its accesses relaxed, the behaviour is the same and it is well-defined, according to the C/C++ model:
Further the assembly is still predicated: https://godbolt.org/z/jzGcfT97P |
If I'm following correctly, the model is assuming the CPU will speculatively execute predicated ops. Are there any actual ARM CPUs which work that way, though? My impression based on optimization guides for ARM CPUs is that CPUs treat predicated ops as having a data dependency on the flags register. Given a CPU which works that way, I think the bad case would be impossible: the computation of the flags register depends on the load. Is disabling predication of atomic accesses sufficient to solve this issue? That's cheap enough that I wouldn't mind doing it in any case, but I'm not sure that's sufficient. |
The problem is two parts:
If removing predication enforces the use of an explicit branch instruction in the backend, then this should fix both issues. I advise removing predication and adding an explicit branch. The historic Arm model allows the re-ordering, although I could not find any hardware that exhibits it - we cannot rule it out, however. What we do know is that the newer AArch32 model forbids the reordering, so the Armv8 backend should be OK. We should conservatively guard against re-ordering on old hardware we do not have access to. More generally, we should restore the control dependency to enforce the intended program constraints. |
Without predication, we branch in this particular case. But not in general, I think... for example, consider the following:
This lowers to:
According to your model, I think this is a miscompile? I have no idea how we could change LLVM to avoid this. |
I had a chat with Wilco. I think in general we cannot prevent all conditional execution, but in this specific case, if we turn it off for atomics, we should be OK (and in your above case use a …). The aim of the game is to prevent outcomes of concurrent programs that are forbidden by the source model. For sequential programs conditional execution is fine, but we should endeavour to be careful when atomics are used, as this is when quirky reordering can be observed. |
The problem for my example is that there isn't any way to tell there's a control dependency we need to preserve. The load, the conditional, and the store could all be in different functions. The only solution I can think of that reliably works is to insert a dummy dependency after every relaxed load. (Unless there's some additional constraint I'm not seeing...?) |
If atomics are marked as non-hoistable, non-sinkable, non-mergeable, non-predicable then it should block all of these examples. |
I don't see how this helps. If you rewrite my example to
Not sure where you're going with this... the C++ standard just requires there exists some sequence that corresponds to a relaxed load. If that sequence is |
That said, I'm reluctant to change the default lowering of relaxed atomic loads without any evidence this is actually a realistic issue. |
That makes the store data-dependent, not control-dependent, and that is like conditional execution. A relaxed load is simply a load - that's not the issue here. However, certain optimizations should be blocked to avoid creating incorrect orderings. It appears the issue is more widespread; I think this will fail on any target:
So if the C memory model requires the control dependency preserved for ordering, we can't optimize the branch and merge the stores (but both GCC and LLVM do). |
Most CPUs naturally preserve the dependency because of causality: either the input values to the store aren't available until after the load is complete, or speculative stores aren't actually committed until the CPU is sure it won't need to roll back its branch prediction. According to this bug report, ARMv7 (but not ARMv8) cores can speculate a store in a way that's visible to other cores before the core is actually sure the store will execute. Which... you could theoretically design a CPU that works this way, but I'm skeptical any ARM CPUs exist that actually do this. |
I support doing this.
I've observed load buffering under models of AArch64 (official), ARM, RISC-V (official), and PPC. The specific conditionalised flavour is allowed by the older ARM model, but not by the AArch32 model, which rightly forbids it.
This is exactly why we need to ensure the control dependency is preserved. The ARM model allows this reordering, and there are so many machines, programs, and compilers out there that finding a chip + compiled program that exhibits this behaviour is prohibitively hard. It is harder to prove it doesn't happen than it is to guard against it. Besides, we need to ensure this program is ABI-compatible with current programs and with future hardware that doesn't exist yet. The work I have done - using models to test compiler implementations - is to avoid the problems of interpreting the standards (C/C++ and Arch) above, and to find tricky bugs like these where the programmer's intent (control dependencies) is violated. I advise marking atomics as Wilco suggests to guard against the current bug, and if the performance hit is too much, we can re-introduce the old behaviour behind a flag (the opposite of what we did with the LDAPR/LDAR swap). |
Say you have the following:
This is, as far as I can tell, essentially equivalent to the other testcases we've been discussing. So our choices for generic lowering are:
Any suggestion for which one we want? (All the discussion about "non-hoistable, non-sinkable, non-mergeable, non-predicable" is irrelevant, as far as I can tell; my example doesn't require hoisting, sinking, merging, or predicating an atomic op.) |
I’ll prepare some tests to evaluate those options for you on Monday, thanks for this good discussion! |
In the meantime, could you confirm what you mean by dependency-preserving branch? |
Something like the "ldr r0, [r0]; cbz r0, #0" pattern I mentioned, which to my understanding should preserve the dependency even though the destination doesn't actually vary. |
That makes sense - I've done some thinking about the cases and I'll send something tomorrow. Still testing some other architectures. |
Righto, I've done some analysis on these cases by generating several hundred tests. This is the simplest case and is telling. Right away we should reject "Introduce a memory barrier for relaxed atomic ops". This is too expensive - and for the cases where reordering matters a branch after a load works best.
Under the AArch64 model we observe the outcomes:
So this would prevent the outcome { P0:r0=1; P1:r0=1 }. The same holds for the RISC-V, ARM, and PPC tests; it makes no difference for MIPS or x86_64.

As far as a branch before the store is concerned, the argument is symmetric. I think we should use a branch after relaxed atomic loads, however, as the cases where a store depends on the value of a load are more frequent (e.g. in the Linux kernel); the converse doesn't make sense without an additional read of the newly stored value.

In the case of the original bug report, if we put branches after loads, whether a control dependency is emitted or masked is irrelevant, as the reordering cannot occur. In the spirit of the original issue we should also mark atomics as non-predicable and preserve the source control dependency. We should make this behaviour the default, and put the old behaviour behind a command-line flag if people want to use it. I doubt there are enough people using relaxed atomics for this to make a perf difference - but the flag will give them the option of the old behaviour if they need it. |
It's probably worth posting a summary of this to Discourse to get more eyes on this. |
Also, do you have some reference for your definition of a control dependency? The ARM reference manual says:
|
There was an issue with C++ consume semantics which looks very similar to this (how to track and avoid optimizing away the dependency). So perhaps this should be reported as a C++ defect. Compilers have been emitting relaxed accesses as normal loads/stores for decades, so changing this now would be quite some work. Blocking a few compiler optimizations still looks like a simpler approach (all examples shown so far can be avoided that way). |
So I think we are discussing two things here. I think we should go ahead and fix the above bug here, as is being done in GCC. I will raise a separate proposal on Discourse regarding putting a dependency after relaxed loads.
It is buried in the standard unfortunately, but I picked out: §5.1.2.3, 14 of ISO C11 (with similar wording in newer versions):
We use a simulator that encodes such dependencies (based on Fig. 22 of this paper). In short, if a dependency is removed by the compiler, we may observe bugs under the architecture model. |
You mean, we should disable predication on atomic load/store? I mean, it's not a big deal if we do that, I guess, but I thought we concluded it doesn't solve the general issue? And on the other hand, if we solve the general issue, it's fine to continue predicating stores. |
Apologies for my delay - so the general issue relates to the question of what relaxed atomics should do, whereas this issue focuses on the loss of a control dependency. I am preparing a proposal to put on Discourse - I will share it with you when it's ready. Apologies again for the delay. |
Hi @efriedma-quic, I would appreciate your thoughts on this proposal, so as to account for questions I may have missed. I still believe this issue should be fixed by preventing certain optimisations, as Wilco suggests, but in the above proposal I address the general question of the semantics of relaxed atomics in practice. |
So a general issue exists on targets with a relaxed memory model where a load can be delayed past a store. If a relaxed load is followed by a relaxed store, there can be a "thin-air" value: the load can see the result of code executing in a different thread which depends on the stored value. (And this isn't just a formal-model issue; you've shown such hardware actually exists.)

If there's a branch that depends on the loaded value between the load and the store, this can't happen according to the ARM hardware models: the branch-prediction machinery buffers the store until the load resolves. The proposal is to take advantage of this, and ensure there's always such a branch between an atomic load and any subsequent atomic store. (Another possible way to handle this, which your paper doesn't mention, is to ensure the address of any subsequent atomic store is somehow based on the value of the load. For example, every relaxed load could modify the stack pointer, and the address of every relaxed store could be based on the stack pointer. But that's probably slower.)

The issue hits more cases on armv7 formal models than on armv8 formal models: predication doesn't force buffering in the same way a branch does. If we disable predication of atomic ops on armv7, that's resolved, and we can treat armv7 exactly the same way as other similar targets... which may be worth doing just to make it easier to reason about armv7.

Thinking about it a bit more, for a dependency-preserving branch, maybe better to use …

Are there any cases where we can elide the branch? Obviously we can if there's an existing branch, but can we take advantage of other memory barriers? ldar? stlr? Can this show up with atomic ops that aren't just loads, like cas? |
Hi Eli,
Indeed
Yes there are data, address, or control dependencies that can handle this, in addition to the normal fences/acquire-release instructions. I wanted to keep the proposal simple for non-concurrency engineers 🙂
Ideally this would fix this current issue, leaving only the general issue in the link above yes.
Yes whatever seems better - I can test any variant of the 'branch-after-load' proposal.
A valid case, as long as that branch isn't then later removed.
This works, but has worse perf than a control branch
One of the other proposals is 'Strong-release-acquire' that either makes all stores release, or all loads acquire. It has the same effect as the branch-after-load proposal, but performs slightly worse on PPC machines. Also we can't use load-acquire on machines that don't have those instructions (v6-m), whereas almost every machine supported by llvm has branch instructions. See the questions section of the paper.
Yes - a relaxed atomic-exchange for armv8.2-a is implemented using a |
Right, we'd have to do this pretty late in compilation to avoid issues with branch optimizations.
I meant, if the compiler sees there's an existing acquire load/release store/fence between the relaxed load and the relaxed store, can the compiler elide the branch?
On a related note, if we lower an atomic operation to a loop (cas loop, or ldxr/stxr), does that naturally protect us from issues like this? |
One more thing I just thought of; this probably has some implications for target-independent compiler optimizations; I don't think we currently have a rule that forbids sinking a relaxed load past a relaxed store, or hoist a relaxed store past a relaxed load. I don't expect this to have a significant performance impact because we rarely do that sort of transform anyway, but probably we need to add some IR modeling for the "global atomic state". (Currently, IR-level analysis models relaxed loads as just a read+write to the address.) |
In theory yes - we would need to be careful with funky instructions like … One way is to view the branch as a barrier internally in LLVM; then the current optimisations wouldn't touch it, and it would be compatible with current practice without too much additional work.
Not by default - consider the following litmus test, which has a CAS loop on
When run under the AArch64 model, we get the outcomes:
To enforce order, we need the branch to depend on the value loaded as you mentioned prior, and not the exclusivity check:
which has outcomes:
Yes I suspect this will be a valuable addition to help |
Also, I will add some answers to these questions in the questions section of the proposal as they are good! |
A one-way acquire/release doesn't work for this case. If there was a 2-way barrier then it would block reordering and load-buffering.
This example looks incorrect - the right-hand side must block reordering too. And then I think it will work. |
Indeed - we need to be careful not to mix in LDAPR and relaxed semantics here - perhaps the compiler could carry this metadata to prevent it.
We still observe the bad outcome even if we put a
when run under the AArch64 model we get the outcomes:
Although Wilco and I are discussing if we can observe the |
Your model for stxr seems weird; naively, I'd expect that the cbnz depends on the stxr, and the stxr depends on the ldxr, therefore the cbnz depends on the ldxr. And it's hard for me to imagine a design where that doesn't hold; the core would have to somehow decide the stxr is going to succeed before it actually loads the value for the ldxr. Maybe the people writing the model wanted to leave the possibility open, though... |
Agreed, there is a direct dependency between LDXR and STXR, but even without it the STXR can only succeed if the LDXR finishes. So I don't see any potential for reordering either. There is a related case where the model allows a later LDXR to reorder and execute in the middle of a LDAXR/STXR loop - however this is impossible since this must clear the exclusive monitor (and it does not happen on any hardware we tried).
Actually LDAR/STLR are not sufficient barriers since they will allow relaxed loads/stores to pass in one direction, and then reorder with other relaxed loads/stores. So anything but a full barrier would not work. |
Agreed
I agree - I have raised this with the model experts and hopefully they should fix it or explain why this is the case. |
I have updated the blog post with your questions. You may need to refresh your cache to see them.
Also, Wilco has now fixed the original issue in GCC (the control dependency being masked at higher optimisation levels), which I have validated with tests. I propose we fix armv7 in clang in the same way for conformance. |
So, to summarize:
I'm not against disabling predication of atomic ops... the runtime cost is probably unmeasurable in practice. But probably better to hold off on it if you're planning to start a wider discussion of general codegen for relaxed loads, just to avoid any potential confusion. |
Wilco has already fixed predication in GCC; it is separate, but happens to mask an effect here that has a more general cause. We should fix predication regardless - it's just unfortunate that the general issue is surfacing at the same time: here |
Can you explain why the example (as modified with x as a relaxed atomic) is disallowed by the standard? I would have thought it allowed, but I'm sure I could be mistaken. |
Hi James, Sure I can do that - please bear with me. |
Hi James. So, the test (where Consider the stronger test - where we put branches on both threads, makes
When run under C/C++ models (C23), we get the following outcomes:
The outcome … Notice the outcome … The outcome …

So the case where both loads are re-ordered is allowed, but the case where just one re-orders is forbidden. This doesn't make much sense, does it? I have tested this on 3 separate encodings of the C11 memory model (from peer-reviewed papers - happy to link as needed), and all give the same results. My comment below finds where in the standard this is explicitly stated.

Unfortunately, when atomics are compiled to predicated instructions for Armv7-a, we do observe the outcome
which has outcomes under the
So:
In any case for this bug report, using
EDIT: under arm.cat, aarch32.cat, ppc, RISC-V, and AArch64 models we only observe: Do you oppose this change? Separately, there is the proposal to outlaw load-store re-ordering which is trying to strengthen our implementation under a flag regardless of modelling issues. I researched this issue following this bug report, and chose to write it up separately, it is unfortunate that they are linked. |
As for the standards doc, the above 'double-branch' example comes up in Recommended practice, section 7.17.3 of C23:
|
In my opinion, as long as we're within-spec to do what we're doing here, we should not forbid conversion to predicated instructions as a "stand-alone" change, since doing so doesn't fully address the issues raised (per additional examples by @efriedma-quic). Additionally, since the particular issue with predicated instructions is only observed on the model, and not on actual hardware, there doesn't seem to be a forcing-function to make this particular change, ahead of the discussion on whether and how to strengthen relaxed atomics in general (e.g. as your paper proposed). |
Almost all of the examples shown are generic, i.e. they do not rely on v7 conditional execution and can equally happen on other targets using conditional moves. The only v7-specific examples are conditionally executed atomics. Avoiding conditional execution of atomics on v7 makes it equivalent to the v8 model. That by itself removes a layer of complexity and unnecessary differences. The cost is negligible, and in GCC it was a good code cleanup, with the resulting code being smaller and easier to maintain. |
The original bug report above is incorrect. Relaxed atomics do not impose any ordering semantics between threads. It is therefore possible to get the scenario where different threads disagree on the order of operations on distinct variables. This can lead to "out of thin air" consequences, where a value is stored on a branch, and the branch is only taken if that store is seen by another thread. Though the C++ standard discourages implementations from actually creating such scenarios, it is permitted by the memory model. Note that in this case, the acquire load has no effect on the consequences, because there is not a release store to the same memory location. I do not see a need to change this aspect of relaxed atomics. The point of relaxed atomics is to maximize the opportunity for performance by minimizing constraints. If you want ordering constraints, then you should put them in your source code: use acquire and release operations, or even SC operations. |
Thank you to everyone who has provided feedback on this. As a PhD student I appreciate every opportunity to learn! In response to the feedback I have received, I am closing this bug (and all other open bugs of this kind) as invalid. I am checking other bug reports to ensure there are no other invalid ones. |
Consider the following litmus test that captures bad behaviour:
where 'P0:r0 = 0' means that thread P0's local variable r0 has value 0.
When simulating this test under the C/C++ model from its initial state, the outcome of execution in the exists clause is forbidden by the source model. The allowed outcomes are:
When compiling to target armv7-a using either GCC or LLVM trunk (https://godbolt.org/z/nxWTbEfK1), the compiled program has the following outcomes when simulated under the armv7 model:
The `CMP; MOVEQ r1, #1; STREQ` sequence is predicated on the result of the CMP, but the STREQ can be reordered before the LDR of x on thread P0, allowing the outcome {P0:r0=1; P1:r0=1;}. The issue is that the control dependency between the load and store on P0 has been removed by the compiler. Both GCC and LLVM produce the following (replace symbolic registers, e.g. %P0_x, with concrete registers containing x).
The forbidden {P0:r0=1; P1:r0=1;} behaviour disappears in GCC when optimising with -O2 or above, since R2 is reused in the STREQ when the EQ condition holds - a data dependency masks the loss of the control dependency. The buggy behaviour remains in clang at -O2 and above, however (https://godbolt.org/z/hGv7P3vGW).
To fix this, mark relaxed atomic loads/stores as non-predicable and, ideally, use an explicit branch to enforce the control dependency so that it is not lost:
This prevents the buggy behaviour.
We will follow up by reporting the bug for GCC.