Repeated Atomic.get optimized away incorrectly #12713

Closed
polytypic opened this issue Nov 3, 2023 · 17 comments · Fixed by #12715

Comments

@polytypic

polytypic commented Nov 3, 2023

Consider the following function/loop:

let rec will_this_ever_return x =
  let before = Atomic.get x in
  if Atomic.get x == before then
    will_this_ever_return x

In OCaml 5, a call of will_this_ever_return x never returns even if the atomic location x is mutated in parallel. The reason is that the compiler optimizes the second Atomic.get away:

camlExample__will_this_ever_return_268:
        subq    $8, %rsp
.L101:
        cmpq    (%r14), %r15
        jbe     .L102
.L103:
        movq    (%rax), %rbx  ;; <-- let before = Atomic.get x
        cmpq    %rbx, %rbx    ;; <-- Atomic.get x == before
        jne     .L100
        jmp     .L101
.L100:
        movl    $1, %eax
        addq    $8, %rsp
        ret
.L102:
        call    caml_call_gc@PLT
.L104:
        jmp     .L103
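
For completeness, here is a sketch of a driver one might use to try to observe the hang; the writer domain and the iteration count are hypothetical, not part of the original report:

let () =
  let x = Atomic.make 0 in
  let writer = Domain.spawn (fun () ->
    for i = 1 to 1_000_000 do Atomic.set x i done)
  in
  (* With the miscompiled loop above, this call is expected never to
     return.  Even with correct code generation it is only likely, not
     guaranteed, to return: it stops only when a write lands between
     the two reads of a single iteration. *)
  will_this_ever_return x;
  Domain.join writer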

The Bounding Data Races in Space and Time paper has an example of a "Redundant load (RL)" optimization (page 8), but the location a in that example is a nonatomic location. No "Redundant load" optimization should be applied in the case of an atomic location.
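
For contrast, here is a sketch (not the paper's exact example) of the non-atomic case, where eliminating the second read merely removes nondeterminism and is unproblematic:

let redundant_load (r : int ref) =
  let a = !r in
  let b = !r in  (* a compiler may legitimately replace this with [let b = a] *)
  a + b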

@lthls
Contributor

lthls commented Nov 3, 2023

I'm not sure that this optimization is forbidden by the model. After all, all traces of execution for the optimized program are valid for the non-optimized program.
Removing this optimization for atomic loads is not very hard (make class_of_operation return Op_other for atomic loads in CSEgen.ml), so if someone manages to convince me that we should do so I can take care of submitting a PR.
Alternatively, simply tricking the compiler into thinking a write might occur between the reads should work too:

let rec will_this_ever_return x =
  let before = Atomic.get x in
  let () = Sys.opaque_identity () in
  if Atomic.get x == before then
    will_this_ever_return x

(It introduces a small instruction to store the constant 1 into a register that will never be used, so it's not completely free, but close enough.)
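
For illustration only, here is a sketch of the class_of_operation change mentioned above, over a simplified, hypothetical operation type rather than the compiler's real CSEgen definitions:

(* Hypothetical, simplified model of CSE operation classes. *)
type op_class =
  | Op_pure    (* same arguments always give the same result *)
  | Op_load    (* reusable until a potentially conflicting store *)
  | Op_store   (* invalidates previously known loads *)
  | Op_other   (* never considered for CSE *)

type operation =
  | Load of { atomic : bool }
  | Store
  | Int_arith

(* The proposed change: classify atomic loads as Op_other so the CSE
   pass never merges them. *)
let class_of_operation = function
  | Load { atomic = true } -> Op_other
  | Load { atomic = false } -> Op_load
  | Store -> Op_store
  | Int_arith -> Op_pure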

gasche added a commit to gasche/ocaml that referenced this issue Nov 3, 2023
@gasche
Member

gasche commented Nov 3, 2023

I agree that the optimi[sz]ation is incorrect.

I propose a quick fix in #12715, which happens to correspond exactly to @lthls' recommendation.

@lpw25
Contributor

lpw25 commented Nov 3, 2023

I'm probably just missing something, but doesn't that description just explain that those atomic reads might return different values? That doesn't mean that there is any situation where they must return different values. Both reads returning the same value seems clearly allowed, regardless of what other threads are doing, so the optimisation is correct. What am I missing here?

@gasche
Member

gasche commented Nov 3, 2023

... after giving this more thought, I agree. I thought initially that CSE could result in reading the same value forever, which is incorrect in general (if someone writes to x then to y, and you observe the write to y, then you have to observe the write to x as well), but here each loop iteration can read the new/updated value, and merging the two reads inside each iteration seems correct.

@gasche
Member

gasche commented Nov 3, 2023

In particular, note that even with CSE disabled, this program can loop indefinitely, and it is in fact fairly likely to do so if there are only a few writes to x rather than another domain constantly writing new values until one races with the loop body.

@gasche
Member

gasche commented Nov 3, 2023

What you write is correct, but the compiler is not required to preserve all the traces; it can produce an output program that has more determinism and exhibits only a subset of the possible traces. For example, it does this when it picks an evaluation order for certain program constructs.
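
An everyday instance of this narrowing (a sketch, not from the thread): the evaluation order of the two tuple components below is unspecified, so both "leftright" and "rightleft" are valid outputs, yet any given compiler deterministically picks one of them.

let pair () =
  ((print_string "left"; 1), (print_string "right"; 2))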

@gasche
Member

gasche commented Nov 3, 2023

In this case I agree that CSE should not be allowed, with the following example in mind:

let x, y = Atomic.make 0, Atomic.make 0

in-parallel:
  (let x0 = Atomic.get x
   let y0 = Atomic.get y
   let x1 = Atomic.get x)
  (Atomic.set x 1;
   Atomic.set y 1)

In this example, doing the CSE allows observing the result (x0 = 0, y0 = 1, x1 = 0), while this is not a valid result for the program above according to the operational semantics: once you observe y0 = 1, all further reads of x should also see 1.
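
For reference, here is a runnable rendering of this litmus test as a sketch (the driver around it is hypothetical):

let () =
  let x = Atomic.make 0 and y = Atomic.make 0 in
  let reader = Domain.spawn (fun () ->
    let x0 = Atomic.get x in
    let y0 = Atomic.get y in
    let x1 = Atomic.get x in
    (x0, y0, x1))
  in
  Atomic.set x 1;
  Atomic.set y 1;
  let (x0, y0, x1) = Domain.join reader in
  Printf.printf "x0=%d y0=%d x1=%d\n" x0 y0 x1;
  (* Forbidden under SC: observing the write to y while the later read
     of x still misses the earlier write to x.  CSE of the two reads of
     x would allow exactly this outcome. *)
  assert (not (y0 = 1 && x1 = 0))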

@gasche
Member

gasche commented Nov 3, 2023

My conclusion is that an atomic load, on the OCaml 5 operational semantics, acts as both a load and a store: it reads memory but it also updates the frontier, which is part of the state. This frontier update can force us to observe a different value for further atomic loads -- here, reading y changes the correct value for the next read of x.

The example you initially gave is the only sort of example where CSE is valid: we read the same x again without another atomic load (or store) in between. Reading x updates the frontier, but the value we receive the first time is the same as the value we would receive again if nothing else has changed.

The CSE machinery has no support for this kind of operation today, and it seems far simpler (and innocuous from a performance perspective) to disable CSE on atomic loads rather than to do something smart.

@xavierleroy
Contributor

@gasche's example at #12713 (comment) is the classic litmus test showing that CSE of loads is incorrect (= produces behaviors that cannot be observed on the original program) in a sequentially-consistent model. I'm sure @maranget has a name for that test.

However, CSE of two loads from the same address without intervening loads from other addresses, as in @polytypic's original example, is correct for SC (= does not introduce new behaviors), as far as I know.

Just making sure: OCaml's atomics are supposed to guarantee SC, right?

I don't think we're going to make a special case for the "two atomic loads from the same address without intervening loads", so something like #12715 should be applied.

@lpw25
Contributor

lpw25 commented Nov 3, 2023

I agree turning it off is the right decision. I guess if we wanted to support it, you would basically treat it as an operation that could be CSE'd (like a load) but that killed all preexisting known loads (like a store). I think that gives the right behaviour, because a second read of the same address gets optimised away, but any intervening read of a different address would prevent further reads from being optimised.
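
For illustration, here is one hypothetical way that treatment could be expressed, as a refinement of a simplified classification (not the real CSEgen code): give atomic loads their own class, which the pass would handle by first discarding all previously available loads (as for a store) and then recording the atomic load itself as available (as for a load).

type op_class =
  | Op_pure
  | Op_load
  | Op_store
  | Op_atomic_load  (* reusable, but kills every earlier known load *)
  | Op_other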

@gasche
Member

gasche commented Nov 3, 2023

I reopened #12715 and anyone is welcome to review :-)

@gasche
Member

gasche commented Nov 4, 2023

In the operational model of the paper you mention, if you just read x, you can perform another reduction step immediately to read x again, and at this point you know that you are going to observe the same value, as the value depends only on the frontier of the domain and the frontier of the location, which are unchanged after the first read. It is valid for the compiler to force this scheduling strategy.
This optimization remains valid as long as nothing between the reads lets us observe that other threads have made progress.

This being said, I don't think that this case is important to optimize, and it requires some extra complexity to the CSE machinery, so there is no proposal to do it for now.

@kayceesrk
Contributor

Atomics have SC behaviour, and it is well known that CSE breaks SC in the general case. Currently, the compiler optimises the following function:

let foo a b =
  let r1 = 2 * Atomic.get a in
  let r2 = Atomic.get b + 1 in
  let r3 = 2 * Atomic.get a in
  (r1, r2, r3)

to

let foo' a b = (* renamed for clarity *)
  let r1 = 2 * Atomic.get a in
  let r2 = Atomic.get b + 1 in
  (r1, r2, r1)

Assume a == b (they are aliases) and Atomic.get a = 1 before the call to foo. Suppose a concurrent thread does Atomic.set b 0 after the first atomic get. foo will return (2,1,0) whereas foo' will return (2,1,2). The latter result cannot be witnessed under SC and is a bug.
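
A hypothetical driver for this scenario (not from the comment), with a and b aliasing the same atomic and a second domain performing the write:

let () =
  let a = Atomic.make 1 in
  let writer = Domain.spawn (fun () -> Atomic.set a 0) in
  let (r1, r2, r3) = foo a a in
  Domain.join writer;
  Printf.printf "r1=%d r2=%d r3=%d\n" r1 r2 r3;
  (* r2 = 1 means the read of b saw the write of 0; under SC the later
     read of a must then also see 0, so r3 must be 0 and never 2. *)
  assert (not (r2 = 1 && r3 = 2))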

gasche added a commit to gasche/ocaml that referenced this issue Nov 4, 2023
@gasche
Member

gasche commented Nov 5, 2023

Again: compilers are definitely allowed to rule out possible traces. You find this surprising, but this is the common standard for compiler specifications. When we say that a compiler is correct, we allow it to rule out possible traces. "Behaving like a simple interpreter would", on the other hand, is not a correctness criterion.

If you want to write code that appears to perform no-ops and want to ensure that the compiler does not optimize something away, you can use Sys.opaque_identity. For example using Atomic.get (Sys.opaque_identity x) == before definitely rules out CSE even in cases where it is technically valid.
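
Concretely, a sketch of the loop from the report with that suggestion applied:

let rec will_this_ever_return x =
  let before = Atomic.get x in
  if Atomic.get (Sys.opaque_identity x) == before then
    will_this_ever_return x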

Taking a step back, one difference between CSE for non-atomic loads and the tiny fragment of CSE on atomic loads that is sound (sharing reads of the same location) is as follows: in the non-atomic case, CSE is almost always an optimization that makes the program faster. In the atomic case, CSE can be a pessimization, as in your examples. This is a reasonable argument against this program transformation even if it is correct: it goes against the programmers' mental model, it is more likely to pessimize programs, etc. Note that we were already convinced not to do this program transformation (CSE only on same-location atomic loads) because it is complex.

@gasche
Member

gasche commented Nov 5, 2023

What I'd love to see is a clear proof, based on the paper, that a
program transformation where one of two consecutive atomic loads of
the same location is eliminated is correct.

Below is a detailed proof. (It may not be "clear" as those proofs are
always fairly technical.) It is based on the operational semantics,
not on the axiomatic semantics which I am unfamiliar with.

Let us say that a machine M "can terminate" if it has a finite
reduction sequence to a machine that cannot reduce further. (Reduction
is non-deterministic so M may also admit infinite reduction
sequences at the same time.)

Let us call a "thread context" M[□] a term of the form

<S, P[i ↦ □]>

where a thread (F, e) is replaced by a hole □; we can write M[(F, e)]
for the machine where □ is replaced by the thread (F,e).

Let us write (e1 refines e2) if, for any thread context
M[□] and any frontier F, we have:

if M[(F,e1)] can terminate,  then M[(F,e2)] can terminate

This is an instance of a standard notion of contextual refinement,
which is a commonly accepted criterion for the correctness of a
program transformation -- it is correct to transform e2 into e1 if
(e1 refines e2).

Now let us prove that the expression

e2 :=
  let x0 = Atomic.get x in
  let x1 = Atomic.get x in
  e

is refined by the expression

e1 :=
  let x0 = Atomic.get x in
  let x1 = x0 in
  e

(that is, that CSE for two consecutive atomic loads is "correct" in
the sense of being a refinement)

We have to show that if M[(F,e1)] can terminate, then M[(F,e2)] can terminate.

Suppose that we have a terminating reduction sequence starting from
M[(F,e1)]. A terminating sequence reduces all threads to an irreducible
expression, so it necessarily contains a reduction step of the form

  M'[(F,e1)]
=
  M'[(F,
     let x0 = Atomic.get x in
     let x1 = x0 in
     e)]
~> (atomic read of v at x)
 M'[(F⊔Fx,
    let x0 = v in
    let x1 = x0 in
    e)]
~>*
  V

where Fx is the frontier of the atomic x in M, and V is an irreducible
machine. The final reductions to V necessarily involve reducing (let
x0 = v in let x1 = x0 in e), whose first two reductions are
silent. Silent reductions cannot influence any other part of the
machine M', so they can be "permuted" to happen first in the reduction
sequence without changing the final result: there is a (possibly
different) sequence that looks like

  M'[(F,e1)]
=
  M'[(F,
     let x0 = Atomic.get x in
     let x1 = x0 in
     e)]
~> (atomic read of v at x)
 M'[(F⊔Fx,
    let x0 = v in
    let x1 = x0 in
    e)]
~> (silent)
  M'[(F⊔Fx,
    let x1 = v in
    e[x0:=v])]
~> (silent)
  M'[(F⊔Fx, e[x0:=v, x1:=v])]
~>*
  V

We can now conclude by building a terminating reduction sequence for
M[(F,e2)] from this:

  M'[(F,e2)]
=
  M'[(F,
     let x0 = Atomic.get x in
     let x1 = Atomic.get x in
     e)]
~> (atomic read of v at x)
 M'[(F⊔Fx,
    let x0 = v in
    let x1 = Atomic.get x in
    e)]
~> (silent)
  M'[(F⊔Fx,
    let x1 = Atomic.get x in
    e[x0:=v])]
~> (atomic read of v at x)
  M'[(F⊔Fx,
    let x1 = v in
    e[x0:=v])]
~> (silent)
  M'[(F⊔Fx, e[x0:=v, x1:=v])]
~>*
  V

(Note: the proof here is restricted to the case where the two atomic
reads appear at the toplevel of the expression, not under an arbitrary
context. Generalizing under an arbitrary expression context would be
possible but even more cumbersome. A more pleasant way to prove such
results would be to define a relational program logic, for example a
relational version of the Cosmo program logic
(https://dl.acm.org/doi/pdf/10.1145/3408978) -- to my knowledge none
has been defined so far.)

@kayceesrk
Contributor

@polytypic https://web.cs.ucla.edu/~todd/research/pldi11.pdf is a relevant paper for your questions. Section 2.1, "SC-Preserving Transformations", is especially relevant.

A compiler transformation is SC-preserving if every behavior of the transformed program is a behavior of the original program. Note that it is acceptable for a compiler transformation to reduce the set of behaviors.

Regarding the elimination of redundant loads, the paper says:

Furthermore, some transformations involving a single shared variable are SC preserving [47]. For example, if a program performs two consecutive loads of the same variable, as in Figure 3(a), the compiler can remove the second load.

@kayceesrk
Contributor

In other words, given an ideal machine that could, at any step, perform a step from any thread, there is a set of possible traces. The model should explicitly specify which possible traces are allowed to be ruled out by the implementation.

I disagree with this. The model shouldn't explicitly say which traces are allowed to be ruled out. The model doesn't know anything about the cost model. So whether an optimisation is useful or not is not a concern that we can address in the definition of the relaxed memory model. All the relaxed memory model says is that an optimisation is correct if the set of observable behaviours in the optimised program is a subset of those observed in the unoptimised program.

The OCaml memory model paper does not say anything about what optimisations are allowed on atomics. However, that does not mean that atomics cannot be optimised. C++ compilers are known to optimise atomics (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html), though the initial example there is optimised neither by GCC nor by Clang. While optimising atomics is tricky, given the results in https://fzn.fr/readings/c11comp.pdf (Section 7.1), it is very likely that eliminating redundant loads and other similar peephole optimisations will be correct in the OCaml memory model.
