
Add Runtime_events.EV_EMPTY_MINOR #13407

Merged
kayceesrk merged 1 commit into ocaml:trunk from talex5:gc-events-5.3
Oct 18, 2024

Conversation

@talex5 (Contributor) commented Aug 28, 2024

EV_OS_WAIT is useful to see in traces when OCaml asked the OS to suspend a domain. If the domain is waiting for another domain on the same CPU, this is the point where it is likely to be scheduled. Update: I have removed this for now.

EV_EMPTY_MINOR shows when a domain is trying to empty its minor heap. It may be a long time between starting this process and actually performing a minor GC if, for example, another domain is holding the platform lock. Without this event, profiling tools tend to under-report the amount of time spent on GC.

/cc @sadiqj @kayceesrk (https://discuss.ocaml.org/t/ocaml-5-performance/15014/19?u=talex5)

Example trace showing the new events:

[image: trace showing the new events]

@talex5 talex5 changed the title from "Add Runtime_events.EV_OS_WAIT and EV_EMPTY_MINOR" to "Add Runtime_events.EV_FUTEX_WAIT and EV_EMPTY_MINOR" on Aug 30, 2024
@kayceesrk (Contributor) left a comment

LGTM.

@talex5 (Contributor, Author) commented Aug 30, 2024

The "MSVC 32 bits" test tests/lib-runtime-events/test_dropped_events.ml is failing in CI. However, I see that it is also failing in the unrelated PR #13408 (MSVC 64 bits), so I assume this test is known to be flaky?

@sadiqj (Contributor) commented Aug 30, 2024

Thanks @talex5 , these look good to have.

The only thing I'm worried about is the FUTEX_WAIT event being inside the loop - if for some reason it spins frequently it will cause lots of other events in the ring to be dropped. Does it make sense to move it outside of the while loop?

@talex5 (Contributor, Author) commented Aug 30, 2024

> The only thing I'm worried about is the FUTEX_WAIT event being inside the loop - if for some reason it spins frequently it will cause lots of other events in the ring to be dropped. Does it make sense to move it outside of the while loop?

I don't mind either way. My thinking was that if it keeps being woken up unnecessarily then that's interesting to see.

@talex5 (Contributor, Author) commented Aug 30, 2024

Also, I guess putting it outside the loop would need some extra logic to avoid tracing the fast path where we never suspend.

@sadiqj (Contributor) commented Sep 9, 2024

Actually, having a look again it's probably fine where it is.

@talex5 would you be able to rebase on trunk? That should fix MSVC. I can then merge. It probably needs a Changes update too.

@talex5 (Contributor, Author) commented Sep 9, 2024

I rebased, but the test is now failing on "Build/normal":

 ... testing 'test_dropped_events.ml' with line 9 (bytecode) => failed (Running program /home/runner/work/ocaml/ocaml/testsuite/tests/lib-runtime-events/_ocamltest/tests/lib-runtime-events/test_dropped_events/ocamlc.byte/test_dropped_events.byte without any argument: command
/home/runner/work/ocaml/ocaml/testsuite/tests/lib-runtime-events/_ocamltest/tests/lib-runtime-events/test_dropped_events/ocamlc.byte/test_dropped_events.byte 
failed with exit code -11)

Which commit was supposed to fix it?

@sadiqj (Contributor) commented Sep 9, 2024

Oh sorry, I mistakenly assumed that PR was fixing the failing test. Hrm.

This is failing with: Process 19223 got signal 11 (Segmentation fault), core dumped

@sadiqj (Contributor) commented Sep 9, 2024

I can get this test to fail if I run the testsuite in parallel but, annoyingly, not if I only run the lib-runtime-events tests.

@talex5 (Contributor, Author) commented Sep 10, 2024

Perhaps it would be better to open a new issue for the test failure, as it seems unrelated to this PR.

@talex5 (Contributor, Author) commented Sep 16, 2024

Is there anything stopping this from being merged?

@kayceesrk (Contributor)

I've created #13446 for this. Let me merge this PR.

@kayceesrk (Contributor)

I kicked off a CI run again, and there is a different testcase failure now. The new failure is tests/lib-runtime-events/test_create_cursor_failures.ml.

https://github.com/ocaml/ocaml/actions/runs/10771211633/job/30191687315?pr=13407#step:7:1901

@talex5 (Contributor, Author) commented Sep 16, 2024

> The new failure is tests/lib-runtime-events/test_create_cursor_failures.ml.

That also seems unrelated. The test was added by @NickBarnes in #13419. The CI also failed there, but only on i386. Something to do with running as root?

@dra27 (Member) commented Sep 16, 2024

Please could you rebase the PR on to trunk and force push?

@dra27 (Member) commented Sep 16, 2024

What I think is happening (without going and double-checking the docs) is that you get the GitHub Actions workflow from this branch, but actions/checkout is of course testing the merge commit - so you have the test from #13419 without the fix to .github/workflows/build.yml that went with it.

@talex5 (Contributor, Author) commented Sep 16, 2024

OK, I rebased it again and now it's back to the expected test_dropped_events error.

@gasche (Member) commented Sep 16, 2024

From a distance, my impression is not that the test is flaky (failing non-deterministically), but rather that it is fragile / not robust, and that this PR makes it break deterministically (it gets worse). No strong opinion here, but I think a little investigation could help.

@dra27 (Member) commented Sep 16, 2024

I started down a rabbit hole with that test in #12397

@NickBarnes (Contributor)

> I started down a rabbit hole with that test in #12397

I suggest @talex5 tries testing this on top of #12397 to see if that fixes the test. Ideally we'd get #12397 fixed up and merged and avoid further problems with it.

@gasche gasche removed the merge-me label Sep 18, 2024
@sadiqj (Contributor) commented Sep 23, 2024

As an update, I finally managed to get the dropped events test to segfault locally and have a core dump. Am debugging.

@sadiqj (Contributor) commented Sep 23, 2024

Got it:

#0  __unlink (name=name@entry=0x55f29942e8c0 "336531.events") at ../sysdeps/unix/sysv/linux/unlink.c:28
#1  0x000055f297ab6798 in runtime_events_teardown_from_stw_single (remove_file=1) at runtime/runtime_events.c:179
#2  stw_teardown_runtime_events (domain_state=<optimized out>, remove_file_data=<optimized out>, num_participating=<optimized out>, participating_domains=<optimized out>) at runtime/runtime_events.c:435
#3  0x000055f297a976bd in caml_try_run_on_all_domains_with_spin_work (sync=sync@entry=1, handler=handler@entry=0x55f297ab66c0 <stw_teardown_runtime_events>, data=data@entry=0x7ffcfc7c2074, leader_setup=leader_setup@entry=0x0, enter_spin_callback=enter_spin_callback@entry=0x0, enter_spin_data=enter_spin_data@entry=0x0) at runtime/domain.c:1691
#4  0x000055f297a9784d in caml_try_run_on_all_domains (handler=handler@entry=0x55f297ab66c0 <stw_teardown_runtime_events>, data=data@entry=0x7ffcfc7c2074, leader_setup=leader_setup@entry=0x0) at runtime/domain.c:1713
#5  0x000055f297ab6ba5 in caml_runtime_events_destroy () at runtime/runtime_events.c:232
#6  0x000055f297abdd88 in caml_do_exit (retcode=retcode@entry=0) at runtime/sys.c:188
#7  0x000055f297a30009 in main (argc=<optimized out>, argv=<optimized out>) at runtime/main.c:38

Thread 1 (Thread 0x78ad52a006c0 (LWP 336532)):
#0  get_ring_buffer_by_domain_id (domain_id=<optimized out>) at runtime/runtime_events.c:513
#1  write_to_ring (category=category@entry=EV_RUNTIME, type=..., type@entry=..., event_id=event_id@entry=48, event_length=event_length@entry=0, content=content@entry=0x0, word_offset=0) at runtime/runtime_events.c:529
#2  0x000055f297ab6dd7 in caml_ev_begin (phase=phase@entry=EV_FUTEX_WAIT) at runtime/runtime_events.c:618
#3  0x000055f297ab6062 in caml_plat_futex_wait (ftx=ftx@entry=0x55f297b362c0 <stw_request+64>, undesired=1048577) at runtime/platform.c:291
#4  0x000055f297ab61b2 in latchlike_wait (contested=<optimized out>, unreleased=<optimized out>, ftx=0x55f297b362c0 <stw_request+64>) at runtime/platform.c:333
#5  0x000055f297a95737 in global_barrier_wait (num_participating=2, sense=<optimized out>) at runtime/domain.c:1370
#6  global_barrier_wait (num_participating=2, sense=<optimized out>) at runtime/domain.c:1359
#7  caml_global_barrier_and_check_final (num_participating=2) at runtime/domain.c:1393
#8  0x000055f297ab66d7 in stw_teardown_runtime_events (domain_state=<optimized out>, remove_file_data=0x7ffcfc7c2074, num_participating=<optimized out>, participating_domains=<optimized out>) at runtime/runtime_events.c:433
#9  0x000055f297a971c5 in stw_handler (domain=0x78ad4c002b80) at runtime/domain.c:1482
#10 handle_incoming (s=<optimized out>) at runtime/domain.c:351
#11 0x000055f297a97cec in caml_handle_incoming_interrupts () at runtime/domain.c:364
#12 caml_handle_gc_interrupt () at runtime/domain.c:1893
#13 0x000055f297abb723 in caml_do_pending_actions_res () at runtime/signals.c:338
#14 0x000055f297abb985 in caml_process_pending_actions_res () at runtime/signals.c:397
#15 caml_process_pending_actions () at runtime/signals.c:405
#16 <signal handler called>
#17 0x000055f297a3290a in camlTest_dropped_events.producer_303 ()
#18 0x000055f297a5c9f6 in camlStdlib__Domain.body_742 () at domain.ml:266
#19 <signal handler called>
#20 0x000055f297a92d2a in caml_callback_exn (closure=<optimized out>, closure@entry=132685815080880, arg=<optimized out>, arg@entry=1) at runtime/callback.c:205
#21 0x000055f297a9365d in caml_callback_res (closure=closure@entry=132685815080880, arg=arg@entry=1) at runtime/callback.c:320
#22 0x000055f297a96662 in domain_thread_func (v=<optimized out>) at runtime/domain.c:1243
#23 0x000078ad62c9ca94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#24 0x000078ad62d29c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

So it looks like it's not safe to write events in caml_plat_futex_wait because another domain might be in a stw trying to teardown the ring at the same time.

I need to think more deeply about whether there's a way around this.

@talex5 (Contributor, Author) commented Sep 23, 2024

> So it looks like it's not safe to write events in caml_plat_futex_wait because another domain might be in a stw trying to teardown the ring at the same time.

What exactly are the locking rules here? If we're checking for interrupts then we have the domain lock I presume? But I guess caml_plat_futex_wait doesn't require callers to have the lock?

(and since this test also fails without this PR, something else must be writing events at the wrong time too then)

@sadiqj (Contributor) commented Sep 29, 2024

So, I think the issue might actually be that we don't stop other domains writing to the ring buffers until after we've unmapped them:

https://github.com/ocaml/ocaml/blob/trunk/runtime/runtime_events.c#L186

I think because teardown is done in a stop-the-world, it's not been an issue until now. If we fix this by writing the flag before the ring becomes unsafe to write, then I think we're good to go.

@dra27 (Member) commented Sep 29, 2024

Can we risk reverting #13476 on the understanding that if the failures recur then we re-commit it?

@gasche (Member) commented Sep 29, 2024

Yes, please feel free.

@talex5 (Contributor, Author) commented Sep 30, 2024

> I think because teardown is done in a stop-the-world it's not been an issue until now. If we fix this by moving writing the flag to before the ring is unsafe to write then I think we're good to go.

OK, I see. So this can happen:

  1. stw_teardown_runtime_events waits for all domains to stop.
  2. It then unmaps the ring.
  3. Another domain wants to wait on a futex so it knows when to resume. It tries to record an event for this, which may crash.
  4. The unmapping domain marks events as being stopped.

So the flag needs to be written before entering the barrier. Then we know every domain has seen the stopped flag before we unmap.

But I think we also need to ensure that caml_plat_futex_wait can now only be called with the domain lock held. Otherwise two threads might try to write events at the same time. Is that a reasonable requirement, or does the event need to be moved elsewhere?

@kayceesrk kayceesrk added the run-thread-sanitizer label (Makes the CI run the testsuite with TSAN enabled) Oct 4, 2024
@kayceesrk (Contributor)

The data race in test_dropped_events.ml is due to insufficient synchronization on reading runtime_events_enabled variable. See https://github.com/kayceesrk/ocaml/pull/18/files#diff-d0d4cad71f36ac3bdd3a51cc0e9edbcec198a0a62b46080d6701efb109ce80adR465 for a quick fix. By changing the relaxed load to an acquire load, the race goes away.

The fix isn't complete as reading the runtime_events_paused variable should also be an acquire load. The read is free on x86 but not on more relaxed architectures such as ARM, Power and RISC-V. The read will have to be performed on all programs irrespective of whether runtime events are enabled.

The crash in the debug mode still remains.

@gasche (Member) commented Oct 10, 2024

I would like to better understand who is allowed to access the runtime-event ring, and when. If we assume (as I think your quick fix does) that only STW participants can access it, then it should suffice to have an STW event to shut down the ring, and no other cross-domain coordination should be necessary.

But I think that maybe we want more, at least if the intent is to allow threads that do not currently own their domain lock to access the runtime ring (to log their own attempts to regain control back, for example). (Is futex_wait sometimes used for this?) In that case it would make sense to consider "remote" access to the ring (access from non-registered threads), and require a special discipline for those (eg. have a reference count of remote accessors, and wait for it to reach 0 before collecting the ring).

@kayceesrk (Contributor)

Agreed with you that only STW participants should access the ring. I have a patch here that does that: kayceesrk@446b1d1. But it doesn't fix the data race, which is surprising. I'll need to investigate this.

Is it necessary for the domain lock to be held to access the ring? @sadiqj.

@kayceesrk (Contributor) commented Oct 10, 2024

Here, https://github.com/kayceesrk/ocaml/actions/runs/11268926573/job/31336768641#step:5:735, you will observe that only a write to the ring in caml_ev_end races, and never one in caml_ev_begin. That is, the domain waits for the ring to be created in the stw section, so caml_ev_begin does not attempt to write to the ring. When the waiting domains are let go, they write to the ring. IIUC, I'm surprised that the futex wake-up does not enforce a happens-before relationship.

@kayceesrk (Contributor) commented Oct 10, 2024

kayceesrk#18

I still see failures after (1) replacing relaxed loads with acquire loads and (2) allowing only stw participants to write to the ring. I'm a bit confused about what the expectations are for ring writes. Perhaps @sadiqj can say more.

@talex5 talex5 changed the title from "Add Runtime_events.EV_FUTEX_WAIT and EV_EMPTY_MINOR" to "Add Runtime_events.EV_EMPTY_MINOR" on Oct 17, 2024
@talex5 (Contributor, Author) commented Oct 17, 2024

I've removed the EV_FUTEX_WAIT event from this PR, as it seems complicated to get it right. EV_EMPTY_MINOR is still useful on its own.

The only CI failure now is Program 'setup.exe' failed to run on Windows, and I see that's also failing on trunk.

@kayceesrk (Contributor)

Thanks. Removing EV_FUTEX_WAIT seems a good way forward.

@kayceesrk (Contributor)

The PR looks good, and the testsuite passes except for the unrelated Windows failure. I'll merge the PR.

@kayceesrk kayceesrk merged commit fb5e30c into ocaml:trunk Oct 18, 2024
shindere added a commit that referenced this pull request Mar 24, 2025
shindere added a commit that referenced this pull request Mar 24, 2025
@shindere (Contributor)

Just pushed f134dcc to trunk to add the reviewers to the Changes entry for this PR. Backported to 5.3 as 41e79cc.

Labels: run-thread-sanitizer (Makes the CI run the testsuite with TSAN enabled)

7 participants