Repeated Atomic.get optimized away incorrectly #12713
Comments
I'm not sure that this optimization is forbidden by the model. After all, all traces of execution for the optimized program are valid for the non-optimized program.
let rec will_this_ever_return x =
  let before = Atomic.get x in
  let () = Sys.opaque_identity () in
  if Atomic.get x == before then
    will_this_ever_return x
(It introduces a small instruction to store the constant 1 into a register that will never be used, so it's not completely free, but close enough.) |
I'm probably just missing something, but doesn't that description just explain that those atomic reads might return different values? That doesn't mean that there is any situation where they must return different values. Both reads returning the same value seems clearly allowed, regardless of what other threads are doing, so the optimisation is correct. What am I missing here? |
... after giving this more thought, I agree. I thought initially that CSE could result in reading the same value forever, which is incorrect in general (if someone writes to |
In particular, note that even with CSE disabled, this program can loop indefinitely, and it is in fact fairly likely to do so if there are only a few writes to |
What you write is correct, but the compiler is not required to preserve all the traces, it can produce an output program that has more determinism and only has a subset of the possible traces. For example it does this when it picks an evaluation order for certain program constructs. |
In this case I agree that CSE should not be allowed, with the following example in mind:
In this example, doing the CSE allows observing the result (x0 = 0, y0 = 1, x0 = 0), while this is not a valid result for the program above according to the operational semantics: once you observe y0=1, all further reads of x should also see 1. |
My conclusion is that an atomic load, in the OCaml 5 operational semantics, acts as both a load and a store: it reads memory but it also updates the frontier, which is part of the state. This frontier update can force us to observe a different value for further atomic loads -- here, reading The example you initially gave is the only sort of example where CSE is valid, where we read the same The CSE machinery has no support for this kind of operation today, and it seems far simpler (and innocuous from a performance perspective) to disable CSE on atomic loads rather than to do something smart. |
@gasche's example at #12713 (comment) is the classic litmus test showing that CSE of loads is incorrect (= produces behaviors that cannot be observed on the original program) in a sequentially-consistent model. I'm sure @maranget has a name for that test. However, CSE of two loads from the same address without intervening loads from other addresses, as in @polytypic's original example, is correct for SC (= does not introduce new behaviors), as far as I know. Just making sure: OCaml's atomics are supposed to guarantee SC, right? I don't think we're going to make a special case for the "two atomic loads from the same address without intervening loads", so something like #12175 should be applied. |
I agree turning it off is the right decision. I guess if we wanted to support it we would basically treat it as an operation that could be CSE'd (like a load) but that killed all preexisting known loads (like a store). I think that gives the right behaviour because a second read of the same address gets optimised away, but any intervening read of a different address would prevent further reads from being optimised. |
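The rule described in this comment can be sketched on a toy instruction list. This is only an illustration, not the compiler's CSE pass: the `instr` type, the register and address names, and the `cse` function are all hypothetical, and registers are assumed to be assigned exactly once.

```ocaml
(* Toy IR, invented for this sketch; it is not the compiler's representation. *)
type instr =
  | Atomic_load of string * string  (* destination register, address *)
  | Move of string * string         (* destination register, source register *)

(* [known] maps an address to the register holding its last loaded value.
   Assumption: each register is written once, so entries never go stale
   because of a later [Move]. *)
let cse instrs =
  let step (known, acc) instr =
    match instr with
    | Atomic_load (dst, addr) ->
      (match List.assoc_opt addr known with
       | Some src ->
         (* Same address, no intervening atomic load: reuse the value. *)
         (known, Move (dst, src) :: acc)
       | None ->
         (* Behaves like a store: forget every other known load,
            then remember this one. *)
         ([ (addr, dst) ], instr :: acc))
    | Move _ -> (known, instr :: acc)
  in
  let _, rev = List.fold_left step ([], []) instrs in
  List.rev rev
```

With this rule, `load a; load a` has its second load replaced by a move, while `load a; load b; load a` is left unchanged because the load of `b` discards what was known about `a`.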
I reopened #12715 and anyone is welcome to review :-) |
In the operational model of the paper you mention, if you just read This being said, I don't think that this case is important to optimize, and it requires some extra complexity to the CSE machinery, so there is no proposal to do it for now. |
Atomics have SC behaviour. And it is well-known that CSE breaks SC in the general case. Currently, the compiler optimises the following function:
let foo a b =
  let r1 = 2 * Atomic.get a in
  let r2 = Atomic.get b + 1 in
  let r3 = 2 * Atomic.get a in
  (r1, r2, r3)
to
let foo' a b = (* renamed for clarity *)
  let r1 = 2 * Atomic.get a in
  let r2 = Atomic.get b + 1 in
  (r1, r2, r1)
Assume |
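To make the difference between `foo` and `foo'` concrete, here is a minimal single-threaded sketch that replays one particular interleaving by hand. The `~step` parameter, the `writer` function, and `run` are hypothetical additions for illustration: `step` stands in for "another domain runs here", and the writer is assumed to set `a` and then `b` to 1, starting from 0. It does not exercise real parallelism.

```ocaml
(* [step] is a hypothetical hook standing in for a concurrent domain
   that gets to run between the first and second atomic loads. *)
let foo ~(step : unit -> unit) a b =
  let r1 = 2 * Atomic.get a in
  step ();
  let r2 = Atomic.get b + 1 in
  let r3 = 2 * Atomic.get a in
  (r1, r2, r3)

(* Hand-written CSE'd variant: the third load is replaced by [r1]. *)
let foo' ~(step : unit -> unit) a b =
  let r1 = 2 * Atomic.get a in
  step ();
  let r2 = Atomic.get b + 1 in
  (r1, r2, r1)

(* Assumed scenario: both atomics start at 0, the writer sets a, then b. *)
let run f =
  let a = Atomic.make 0 and b = Atomic.make 0 in
  let writer () = Atomic.set a 1; Atomic.set b 1 in
  f ~step:writer a b

let () =
  let (r1, r2, r3) = run foo in
  Printf.printf "foo : (%d, %d, %d)\n" r1 r2 r3;   (* prints (0, 2, 2) *)
  let (r1, r2, r3) = run foo' in
  Printf.printf "foo': (%d, %d, %d)\n" r1 r2 r3    (* prints (0, 2, 0) *)
```

In this replay, `foo` observes the write to `b` and therefore also the earlier write to `a` on its last load, whereas `foo'` reports a stale `a` after a fresh `b`, an outcome that no sequentially consistent interleaving of `foo` with this writer can produce.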
Again: compilers are definitely allowed to rule out possible traces. You find this surprising, but this is the common standard for compiler specifications. When we say that a compiler is correct, we allow it to rule out possible traces. "Behaving like a simple interpreter would", on the other hand, is not a correctness criterion. If you want to write code that appears to perform no-ops and want to ensure that the compiler does not optimize something away, you can use Taking a step back, one difference between CSE for non-atomic loads and the tiny fragment of CSE on atomic loads that is sound (sharing reads of the same location) is as follows: in the non-atomic case, CSE is almost always an optimization that makes the program faster. In the atomic case, CSE can be a pessimization in your examples. This is a reasonable argument against this program transformation, even if it is correct -- it goes against the programmers' mental model, it is more likely to pessimize programs, etc. Note that we were already convinced not to do this program transformation (CSE only on same-location atomic loads) because it is complex. |
Below is a detailed proof. (It may not be "clear" as those proofs are Let us say that a machine M "can terminate" if it has a finite Let us call a "thread context" M[□] a term of the form
where a thread (F, e) is replaced by a hole □; we can write M[(F, e)] Let us write (e1 refines e2) if, for any thread context
This is an instance of a standard notion of contextual refinement, Now let us prove that the expression
is refined by the expression
(that is, that CSE for two consecutive atomic loads is "correct" in We have to show that if M[(F,e1)] can terminate, then M[(F,e2)] can terminate. Suppose that we have a terminating reduction sequence starting from
where Fx is the frontier of the atomic x in M, and V is an irreducible
We can now conclude by building a terminating reduction sequence for
(Note: the proof here is restricted to the case where the two atomic |
@polytypic https://web.cs.ucla.edu/~todd/research/pldi11.pdf is a relevant paper for your questions. Section 2.1 "SC-Preserving Transformations" is very relevant.
Regarding the elimination of redundant loads,
|
I disagree with this. The model shouldn't explicitly say which traces are allowed to be ruled out. The model doesn't know anything about the cost model. So whether an optimisation is useful or not is not a concern that we can address in the definition of the relaxed memory model. All the relaxed memory model says is that an optimisation is correct if the set of observable behaviours in the optimised program is a subset of those observed in the unoptimised program. The OCaml memory model paper does not say anything about what optimisations are allowed on atomics. However, that does not mean that atomics cannot be optimised. C++ compilers are known to optimise atomics http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html (though the initial example is optimised neither by GCC nor by clang). While optimising atomics is tricky, given the results in https://fzn.fr/readings/c11comp.pdf (Section 7.1), it is very likely that eliminating redundant loads and other similar peephole optimisation will be correct in the OCaml memory model. |
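The "subset of observable behaviours" criterion can be checked mechanically on a toy model. The sketch below enumerates every sequentially consistent interleaving of a writer and a reader over two cells that start at 0 and compares the sets of reader outcomes. Everything here (`loc`, `op`, `Reuse`, `outcomes`, `refines`) is a hypothetical model written for illustration; it is neither the OCaml memory model nor a compiler component.

```ocaml
(* Toy model: two shared cells, both initially 0, under SC. *)
type loc = A | B

type op =
  | Set of loc * int  (* writer: store a value *)
  | Get of loc        (* reader: atomic load, recorded in the outcome *)
  | Reuse of int      (* reader: reuse the i-th value already read (CSE) *)

(* All interleavings of two sequences that preserve each program order. *)
let rec interleavings xs ys =
  match xs, ys with
  | [], l | l, [] -> [ l ]
  | x :: xs', y :: ys' ->
    List.map (fun l -> x :: l) (interleavings xs' ys)
    @ List.map (fun l -> y :: l) (interleavings xs ys')

(* Run one interleaving; return the values the reader observed, in order. *)
let run trace =
  let step (a, b, seen) = function
    | Set (A, v) -> (v, b, seen)
    | Set (B, v) -> (a, v, seen)
    | Get A -> (a, b, seen @ [ a ])
    | Get B -> (a, b, seen @ [ b ])
    | Reuse i -> (a, b, seen @ [ List.nth seen i ])
  in
  let _, _, seen = List.fold_left step (0, 0, []) trace in
  seen

let outcomes writer reader =
  List.sort_uniq compare (List.map run (interleavings writer reader))

(* The optimised reader is acceptable if it adds no new outcome. *)
let refines ~optimised ~original writer =
  List.for_all
    (fun o -> List.mem o (outcomes writer original))
    (outcomes writer optimised)
```

With the writer `[Set (A, 1); Set (B, 1)]`, replacing `[Get A; Get A]` by `[Get A; Reuse 0]` passes the check, while replacing `[Get A; Get B; Get A]` by `[Get A; Get B; Reuse 0]` fails it because the outcome `[0; 1; 0]` becomes observable.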
Consider the following function/loop:
In OCaml 5, a call of will_this_ever_return x will never return even if the atomic location x would be mutated in parallel. The reason is that the compiler optimizes the second Atomic.get away:
The Bounding Data Races in Space and Time paper has an example of a "Redundant load (RL)" optimization (page 8), but the location a in that example is a nonatomic location. No "Redundant load" optimization should be applied in the case of an atomic location.