Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid field reload in ephe_get_field #11749

Merged
merged 2 commits into from Nov 24, 2022
Merged

Conversation

jmid
Copy link
Contributor

@jmid jmid commented Nov 23, 2022

Consider the following program, performing unsynchronized Weak.get and Weak.set:

let test () =
  let weak_size = 8 in
  let t = Weak.create weak_size in
  let () = Weak.set t 3 (Some 2L) in
  let wait = Atomic.make true in
  let d1 = Domain.spawn (fun () -> while Atomic.get wait do Domain.cpu_relax() done; Weak.set t 3 None) in
  let d2 = Domain.spawn (fun () -> Atomic.set wait false; let res = Weak.get t 3 in
                          match res with
                          | None
                          | Some 2L -> ()
                          | Some i -> failwith ("received: " ^ (Int64.to_string i))
                        ) in
  let () = Domain.join d1 in
  let () = Domain.join d2 in
  ()

let () =
  for i = 0 to 1000 do
    test ()
  done

This can result in weird values being observed:

$ ocamlopt weaktest.ml 
$ ./a.out 
Fatal error: exception Failure("received: 94032997501288")

The problem is in ephe_get_field where we should avoid reloading the field, as it may have been altered in parallel:

ocaml/runtime/weak.c

Lines 244 to 256 in 408bba1

static value ephe_get_field (value e, mlsize_t offset)
{
CAMLparam1(e);
CAMLlocal2 (res, elt);
clean_field(e, offset);
elt = Field(e, offset);
if (elt == caml_ephe_none) {
res = Val_none;
} else {
elt = Field(e, offset);
caml_darken (Caml_state, elt, 0);

I'm unsure

  • if this requires a Changes entry
  • if I should include the test case in the PR

The PR was a tag-team effort with @kayceesrk who cooked up a quick fix after I had found the above test case.

@dra27 dra27 added this to the 5.0 milestone Nov 23, 2022
@kayceesrk
Copy link
Contributor

Thanks @jmid. The fix is correct.

A shortened test case is useful. I was able to reliably repeat the crashes with the for loop going only until 100 instead of 1000. With 100 iterations, the test runs in 50ms and does not slowdown the CI much.

Since the program only has 3 domains running in parallel (one of which is quickly expected to wait on Domain.join), it is not necessary to disable the test on CI machines with no more than 2 cores as in 5a9db7f.

@kayceesrk kayceesrk added the bug label Nov 24, 2022
@kayceesrk kayceesrk changed the title Avoid racy field load in ephe_get_field Avoid field reload in ephe_get_field Nov 24, 2022
@jmid
Copy link
Contributor Author

jmid commented Nov 24, 2022

OK, test case added with 100 iterations in a8f5da6

@kayceesrk
Copy link
Contributor

AppVeyor failure is on an unrelated test https://ci.appveyor.com/project/avsm/ocaml/builds/45478262#L4785.

@Octachron Octachron merged commit 82c6a8d into ocaml:trunk Nov 24, 2022
@jmid jmid deleted the fix-ephe-get-field branch November 24, 2022 13:00
@gadmm
Copy link
Contributor

gadmm commented Nov 24, 2022

Ouch, we knew this sort of race could appear in legacy code in theory but I did not expect that it would appear so quickly in practice. Looking at weak.c, there are other places where the field seems to be reloaded (not clearly in error unlike here, but also not obviously correct for someone who is not familiar with the implementation like me). Have you tried to see if other bugs of this kind might be present in weak.c, or is this something to look into?

@jmid
Copy link
Contributor Author

jmid commented Nov 24, 2022

I wrote tests of the low-level Weak-API for a start as part of the multicoretests-effort: ocaml-multicore/multicoretests#214
None of these caused a crash, but the produced counterexamples puzzled me.
As such, it would probably be more fair to say this was discovered by torture than by practical use... 😉
I think it should be straightforward to stress test the HashSet-part by similar means, so I'll try that.

@kayceesrk
Copy link
Contributor

Ouch, we knew this sort of race could appear in legacy code in theory but I did not expect that it would appear so quickly in practice

to be clear, weak module is an almost complete reimplementation on Multicore. The buggy code is not legacy code. It is some code that kind of got left in as the weak.c file was refactored from the concurrent minor GC to the parallel minor GC. (Walking the Git Blame history on the deleted line should tell you the entire picture).

@gadmm
Copy link
Contributor

gadmm commented Nov 24, 2022

Ok, I now understand this is a simple bug unrelated to backwards-compatibility issues. This also means that there is no reason to look for other issues of this kind in the same file.

@kayceesrk
Copy link
Contributor

Indeed. If we are looking for bugs introduced in legacy code due to the addition of parallelism, we will need to look elsewhere.

Octachron added a commit that referenced this pull request Nov 24, 2022
Avoid field reload in ephe_get_field

(cherry picked from commit 82c6a8d)
@Octachron
Copy link
Member

Cherry-picked on 5.0 as 3061e71 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants