Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

disable CSE for atomic loads #12715

Merged
merged 1 commit into from
Nov 4, 2023
Merged

disable CSE for atomic loads #12715

merged 1 commit into from
Nov 4, 2023

Conversation

gasche
Copy link
Member

@gasche gasche commented Nov 3, 2023

(fixes #12713)

@lthls made the exact same recommendation in the discussion of #12713. I would be interested in @xavierleroy's confirmation that this is a reasonable thing to do.

(I wonder why this was not done back when Multicore was merged. Was it a blind spot, we didn't think of which compiler optimisations should be disabled? Did we say we would disable CSE and forget to do it? Are there other bugs of this category looking around?)

@gasche
Copy link
Member Author

gasche commented Nov 3, 2023

There is a debate at #12713 about whether CSE for atomic loads is correct. I think that this debate should be resolved before we consider the present PR. I am closing for now, and we can reopen any time we want.

@gasche gasche closed this Nov 3, 2023
@gasche gasche reopened this Nov 3, 2023
@gasche
Copy link
Member Author

gasche commented Nov 3, 2023

Further examples from @polytypic over at #12713 suggest the following example where CSE is clearly (?) invalid:

  let x0 = Atomic.get x in
  let y0 = Atomic.get y in
  let x1 = Atomic.get x in

Merging the two reads of x is wrong, because the read of y0 could have observed a write on x that happens after the read of x0. In operational terms, the read of y0 updated our frontier, so the read of x1 happens in a different state. Atomic loads mutate the frontier state.

Copy link
Contributor

@lthls lthls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

I've read the arguments in #12713 and agree with the conclusion (CSE of atomic loads is only allowed if no other memory operation occurs in-between, and it's not worth adding a special case for that).

Copy link
Contributor

@kayceesrk kayceesrk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks good to me.

@kayceesrk
Copy link
Contributor

kayceesrk commented Nov 4, 2023

I wonder why this was not done back when Multicore was merged. Was it a blind spot, we didn't think of which compiler optimisations should be disabled? Did we say we would disable CSE and forget to do it? Are there other bugs of this category looking around?

Earlier, atomic loads were translated to an external call to caml_atomic_load. The external calls are not optimised by the compiler and hence, the SC semantics for atomics was preserved. Later ocaml-multicore/ocaml-multicore#251 lowered it through the compiler. In this PR, we didn't take care of the interaction with the compiler optimisations. This was an oversight.

Copy link
Contributor

@xavierleroy xavierleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. For those who wonder, the Op_other classification mean that the result of an atomic load is treated as unpredictable, and so cannot be replaced by a move from the result of an earlier load at the same address. It's better than treating atomic loads as external calls, since regular loads occurring before and after the atomic load can still be factored out, while an external call is assumed to write all over memory.

@kayceesrk
Copy link
Contributor

kayceesrk commented Nov 4, 2023

Atomic.set is compiled to an atomic exchange, which is translated to an external call

ocaml/asmcomp/cmmgen.ml

Lines 1118 to 1120 in d77bc97

| Patomic_exchange ->
Cop (Cextcall ("caml_atomic_exchange", typ_val, [], false),
[transl env arg1; transl env arg2], dbg)

Hence, the compiler should not optimise atomic stores.

It would be useful to do an audit of all the optimisations that we perform on loads of mutable locations. The easiest would be to disable any optimisations of atomic loads. For example, we have disabled the reordering of atomic loads in the scheduling pass in #12248.

Changes Outdated
@@ -540,6 +540,9 @@ Working version
- #12684: fix locations filename in AST produced by the `-pp` option
(Gabriel Scherer, review by Florian Angeletti)

- ???: disable common subexpression elimination for atomic loads
(Gabriel Scherer, review by ???, report by Vesa Karvonen)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was Carine Morel (@lyrm) who wrote the test that triggered the bug. Initially we assumed it was my mistake, but later I could not understand why the bug would be triggered (Atomic.get is supposed to have an acquire fence and that should prevent the reordering from happening), so I examined the compiler output and realized that the compiler had eliminated the Atomic.get instruction completely.

@gasche gasche added the merge-me label Nov 4, 2023
@gasche gasche merged commit 0c963ce into ocaml:trunk Nov 4, 2023
9 of 10 checks passed
@kayceesrk
Copy link
Contributor

It would be useful to do an audit of all the optimisations that we perform on loads of mutable locations.

@lthls Is there a list of optimisations that we perform on loads of mutable locations? If such a list doesn't exist (very likely), how would we go about gathering it? I assume this would also be needed for flambda.

@lthls
Copy link
Contributor

lthls commented Nov 6, 2023

No, there isn't such a list. For Flambda we decided to give some sensible semantics to the internal IR, and as a result we can check our optimisations against that (although we never published the semantics properly, hopefully we will correct that with Flambda 2).
One thing that makes these discussions particularly tricky is that what is considered as a mutable location can change during compilation. For example, CSE of mutable loads across a mutable write are forbidden in the backend, but if the right conditions are met the middle-end might be able to do the CSE because at that point the loads are not mutable yet. (Example: let x0 = fst x in y := 1; let x1 = fst x).

My intuition though is that for multicore-related issues we should only look at what happens from Mach and forward. Everything before that is too high-level (for instance, evaluation order is not finalised until the translation to Mach).
That still leaves a decent number of files to look at, but I believe they're relatively easy to sort into optional optimisation passes and required, non-optimising passes. From memory I would say that the optimisation passes are Deadcode, Split, and CSE. Although it's likely that a few optimisations are hidden in the translation passes (Selection, Linearize, Emit`).

@Octachron
Copy link
Member

Since this change is disabling a problematic optimisation, and we are moving to 5.1.1 as the reference version for 5.1, I think it is better to cherry-pick it to 5.1.

dra27 pushed a commit to dra27/ocaml that referenced this pull request Dec 6, 2023
disable CSE for atomic loads

(cherry picked from commit 0c963ce)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Repeated Atomic.get optimized away incorrectly
6 participants