Add Runtime_events.EV_EMPTY_MINOR#13407
Conversation
|
The "MSVC 32 bits" test |
|
Thanks @talex5 , these look good to have. The only thing I'm worried about is the FUTEX_WAIT event being inside the loop - if for some reason it spins frequently it will cause lots of other events in the ring to be dropped. Does it make sense to move it outside of the while loop? |
I don't mind either way. My thinking was that if it keeps being woken up unnecessarily then that's interesting to see. |
|
Also, I guess putting it outside the loop would need some extra logic to avoid tracing the fast path where we never suspend. |
|
Actually, having a look again it's probably fine where it is. @talex5 would you be able to rebase on trunk? That should fix MSVC. I can then merge. Probably needs Changes update too. |
70cc3fc to
ac479bf
Compare
|
I rebased, but the test is now failing on "Build/normal": Which commit was supposed to fix it? |
|
Oh sorry, I mistakenly assumed that PR was fixing the failing test. Hrm. This is failing with |
|
I can get this test to failure if I run the testsuite in parallel but annoyingly not if I only run the lib-runtime-events tests. |
|
Perhaps it would be better to open a new issue for the test failure, as it seems unrelated to this PR. |
|
Is there anything stopping this from being merged? |
|
I've created #13446 for this. Let me merge this PR. |
|
I kicked off a CI run again, and there is a different testcase failure now. The new failure is https://github.com/ocaml/ocaml/actions/runs/10771211633/job/30191687315?pr=13407#step:7:1901 |
That also seems unrelated. The test was added by @NickBarnes in #13419. The CI also failed there, but only on i386. Something to do with running as root? |
|
Please could you rebase the PR on to trunk and force push? |
|
What I think is happening (without going and double-checking the docs) is that you are getting the workflow for GitHub actions from this branch, but then actions/checkout is of course testing the merge commit - so you have the test from #13419 without the fix to .github/workflows/build.yml which went with it. |
ac479bf to
1378d7c
Compare
|
OK, I rebased it again and now it's back to the expected |
|
From a distance my impression is not that the test is flaky (that it fails non-deterministically), rather that it is fragile / not robust, and that it breaks deterministically with this PR (that it gets worse). No strong opinion here, but I think a little investigation could help. |
|
I started down a rabbit hole with that test in #12397 |
|
As an update, I finally managed to get the dropped events test to segfault locally and have a core dump. Am debugging. |
|
Got it: So it looks like it's not safe to write events in I need to think more deeply about whether there's a way around this. |
What exactly are the locking rules here? If we're checking for interrupts then we have the domain lock I presume? But I guess (and since this test also fails without this PR, something else must be writing events at the wrong time too then) |
1378d7c to
41789ba
Compare
|
So, I think the issue might actually be that we don't stop other domains writing to the ring buffers until after we've unmapped them: https://github.com/ocaml/ocaml/blob/trunk/runtime/runtime_events.c#L186 I think because teardown is done in a stop-the-world it's not been an issue until now. If we fix this by moving writing the flag to before the ring is unsafe to write then I think we're good to go. |
|
Can we risk reverting #13476 on the understanding that if the failures recur then we re-commit it? |
|
Yes, please feel free. |
OK, I see. So this can happen:
So the flag needs to be written before entering the barrier. Then we know every domain has seen the stopped flag before we unmap. But I think we also need to ensure that |
41789ba to
3f889c4
Compare
|
The data race in The fix isn't complete as reading the The crash in the debug mode still remains. |
|
I would like to better understand who is allowed to access the runtime-event ring, and when. If we assume (as I think your quick fix does) that only STW participants can access it, then it should suffice to have an STW event to shut down the ring, and no other cross-domain coordination should be necessary. But I think that maybe we want more, at least if the intent is to allow threads that do not currently own their domain lock to access the runtime ring (to log their own attempts to regain control back, for example). (Is |
|
Agreed with you that only STW participants should access the ring. I have a patch here that does that: kayceesrk@446b1d1. But it doesn't fix the data race, which is surprising. I'll need to investigate this. Is it necessary for the domain lock to be held to access the ring? @sadiqj. |
|
Here, https://github.com/kayceesrk/ocaml/actions/runs/11268926573/job/31336768641#step:5:735, you will observe there that only the write to ring in |
|
I still see failures after (1) replacing relaxed loads with acquire loads and (2) allowing only stw participants to write to the ring. I'm a bit confused what the expectations are for ring write. Perhaps @sadiqj can say more. |
EV_EMPTY_MINOR shows when a domain is trying to empty its minor heap. It may be a long time between starting this process and actually performing a minor GC if, for example, another domain is holding the platform lock. Without this event, profiling tools tend to under-report the amount of time spent on GC.
|
I've removed the The only CI failure now is |
|
Thanks. Removing |
|
The PR looks good, and the testsuite passes except for the unrelated Windows failure. I'll merge the PR. |
EV_OS_WAITis useful to see in traces when OCaml asked the OS to suspend a domain. If the domain is waiting for another domain on the same CPU, this is the point where it is likely to be scheduled. Update: I have removed this for now.EV_EMPTY_MINORshows when a domain is trying to empty its minor heap. It may be a long time between starting this process and actually performing a minor GC if, for example, another domain is holding the platform lock. Without this event, profiling tools tend to under-report the amount of time spent on GC./cc @sadiqj @kayceesrk (https://discuss.ocaml.org/t/ocaml-5-performance/15014/19?u=talex5)
Example trace showing the new events: