publish.ml memory model failure in bytecode on Windows #11853
Comments
It's failing too frequently - see separate tracking issue in ocaml#11853
Why is the test failing? Is this a bug with the Windows implementation, or more generally an OCaml bug? The test code is cryptic, but it is here: ocaml/testsuite/tests/memory-model/publish.ml Lines 125 to 137 in 23bf905
My understanding is that, in this test, one domain repeatedly updates a reference pointing to a string, while another domain accesses the reference and checks that the observed string is "valid" (one of the values that the first domain writes). This failure does look like something bad to me (violating my expectations about concurrent programming in OCaml). I have no clue, but my guess as to what is happening would be that some string-construction functions are internally using in-place update on a |
(cc @maranget I guess, who wrote the memory-model tests in the first place.) |
I'm curious to see whether MSVC can trigger this as well - the purpose in opening the tracking issue is to let all the other Windows core developers loose on the problem, while stopping Windows CI being completely ignored on other PRs 🙂 |
Hi @dra27, it is not easy for me to run the test on windows. Would you mind attempting to reproduce the problem, running the test directly with verbose output enabled? More precisely, you can compile the test with
If your machine has N cores, it may be interesting to add option Having a verbose output may help. Thanks. |
No problem at all, @maranget! Here's with the default (
and here's with
Interestingly, at |
For the record:
My repro code was as follows.

(* non-atomic references *)
module Ref = struct
  let make v = ref v
  let get r = !r
  let set r v = r := v
end

(* (* to get atomic references, comment above and uncomment below *)
module Ref = Atomic
*)

let digits = List.init 10 (fun digit ->
  Char.chr (digit + Char.code '0'))

let msg_len = 32

let values = List.map (fun c -> String.make msg_len c) digits

let valid =
  let module SSet = Set.Make(String) in
  let values = SSet.of_list values in
  fun s -> SSet.mem s values

let shared = Ref.make (List.hd values)
let producing = Ref.make true

let runs = 100_000

let producer () =
  for _ = 1 to runs do
    digits |> List.iter @@ fun digit ->
      let msg = String.make msg_len digit in
      let copy = String.sub msg 0 msg_len in
      Ref.set shared copy
  done;
  Ref.set producing false

let consumer () =
  while Ref.get producing do
    let msg = Ref.get shared in
    assert (valid msg);
    Domain.cpu_relax ();
  done

let () =
  let doms =
    List.init (Domain.recommended_domain_count () - 1)
      (fun _ -> Domain.spawn consumer) in
  producer ();
  List.iter Domain.join doms;

I sort of hoped to observe an assert failure, but did not. (The test uses a very short |
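For readers following along, here is a smaller sketch of the same publish pattern that makes the two steps explicit (fill a fresh buffer, then publish the pointer) and counts bad observations instead of asserting. It is not the test from the testsuite: `make_msg`, `is_uniform` and `run` are hypothetical names chosen for illustration, it uses a single consumer, and it assumes OCaml 5 (for `Domain`) with atomic references throughout.

```ocaml
(* Sketch only, not the testsuite code. All helper names are hypothetical. *)
let msg_len = 32

(* Step 1: initialise a fresh buffer in place with plain writes.
   Bytes.to_string returns a copy, much as String.sub does in the repro. *)
let make_msg c =
  let b = Bytes.create msg_len in
  Bytes.fill b 0 msg_len c;
  Bytes.to_string b

(* A message is valid when it has the expected length and consists of
   one digit repeated. *)
let is_uniform s =
  String.length s = msg_len
  && s.[0] >= '0' && s.[0] <= '9'
  && String.for_all (fun c -> c = s.[0]) s

let shared = Atomic.make (make_msg '0')
let producing = Atomic.make true

let producer runs =
  for i = 1 to runs do
    (* Step 2: publish the pointer only once the contents are written. *)
    Atomic.set shared (make_msg (Char.chr (Char.code '0' + i mod 10)))
  done;
  Atomic.set producing false

(* Returns the number of invalid messages this consumer observed. *)
let consumer () =
  let bad = ref 0 in
  while Atomic.get producing do
    if not (is_uniform (Atomic.get shared)) then incr bad;
    Domain.cpu_relax ()
  done;
  !bad

(* Runs one producer (this domain) against one consumer domain. *)
let run runs =
  Atomic.set producing true;
  let d = Domain.spawn consumer in
  producer runs;
  Domain.join d
```

Calling `run 100_000` should return 0 if publication behaves as expected; a non-zero count would correspond to the assert failure in the repro. Replacing `shared` with a plain `ref` gives the non-atomic variant.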
It looks like you're suspecting weak memory model issues to be the cause for this issue, but from a discussion yesterday with @maranget :
|
@maranget is currently jumping up and down in my temporary office, excited at the idea that |
The plot thickens... I'm afraid we're about to conclude that the C standard library is unusable in the (Multicore) OCaml runtime system and we need to rewrite everything ourselves... |
But let's not forget the possibility that this is a "plain" concurrency bug somewhere else in the runtime system, of the kind that can happen in an SC execution. |
Primitive
Said otherwise, a non-temporal write (possibly issued by memmove) and an ordinary write (issued by |
I've managed to trigger the segfault on the same machine, but using the msvc64 port instead. I was rather hoping the Just-in-Time debugger would attach properly, but it didn't so it's running again. Anyhow, what's interesting from that is mingw-w64 uses a different, much older CRT from the msvc64 port, which is using UCRT. There are two possible things going on - with Were I a gambler, I think I'd be putting my chips on a concurrency bug, too 🤔 |
Can you find the implementation of |
Yes for the msvc64 one (turns out that's actually in vcruntime rather than UCRT 🤷) - it's in |
Small update. I've altered the test to run itself infinitely many times, which finally allowed me to attach a debugger to the segfaulting version. The crash is at |
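In case it helps anyone else reproduce this, the "run itself repeatedly" idea can be sketched roughly as below. This is not the change actually made to the test: `repeat_until_failure` and `run_once` are hypothetical names, and `run_once` stands in for one execution of the test body that returns `true` on success.

```ocaml
(* Sketch only. [run_once] is a hypothetical stand-in for one execution
   of the test body; it returns true when the run succeeded. *)
let repeat_until_failure ?(max_runs = max_int) run_once =
  let rec loop n =
    if n > max_runs then None          (* gave up without seeing a failure *)
    else if run_once () then loop (n + 1)
    else Some n                        (* index of the first failing run *)
  in
  loop 1
```

With the default `max_runs` this loops effectively forever, which is what you want when waiting for a debugger to catch a crash: a segfault kills the process inside `run_once`, so only ordinary test failures come back as `Some n`.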
This issue has been open one year with no activity. Consequently, it is being marked with the "stale" label. What this means is that the issue will be automatically closed in 30 days unless more comments are added or the "stale" label is removed. Comments that provide new information on the issue are especially welcome: is it still reproducible? did it appear in other contexts? how critical is it? etc. |
@dra27, what is the current status of this bug? (I am liberally applying my policy that the stale bot should not close bugs without human intervention). |
My understanding is that the test is still failing randomly (see #12425 for the list of known CI failures, the last one was three weeks ago), but we have no clue why or how to approach the issue. |
This issue was about Offering pointers, as I don't have the time myself at the moment to keep digging with this:
@maranget, @gasche and I had a brief private thread looking at the content of |
tests/memory-model/publish.ml is failing too frequently in AppVeyor. I think we should disable the test, so this issue is to track that fact (if necessary, the stale bot can act as a trigger to run more tests manually).
I've done some limited bulk runs of the test on two of my machines. On my laptop (Intel i7-8650U) I can relatively easily get this failure in bytecode (the most recent took 152 runs):
I ran the test for several hours with just the native code version (> 500 runs) and did not get a failure.
On my desktop (AMD Threadripper 3990X), I left it running the test overnight (> 700 runs) and haven't had a crash in either bytecode or native code.
AppVeyor seems to be getting a segfault rather than the test failure I was seeing. In all these instances, OCAML_TEST_SIZE should be unset, so these should all be executing the 2-core version of the test.