Improve backtrace abstractions inside runtime #12383

NickBarnes · 2023-07-17T13:46:28Z

In the trunk runtime, the backtrace API allows backtraces to be obtained either as (a) a single per-domain buffer, reserved for the "backtrace at the last exception raise", or (b) an object on the Caml heap, reflecting the current backtrace. This is insufficient for statmemprof, which needs the current backtrace at arbitrary allocation points, when Caml heap allocation might not be possible (or might change the backtrace, by garbage collection). So this branch changes the backtrace.h abstraction by adding caml_get_callstack().

I took the opportunity to also improve the code which actually iterates over the stack gathering backtrace entries: in each runtime (native-code and bytecode) there's now a single piece of code which does this, and it only iterates over the stack once. This key function (get_callstack() in both backtrace_nat.c and backtrace_byt.c) has a rather ugly ("candy machine") interface, because the semantics turn out to be pretty twisty. For instance, whether the callstack should trace up into parent stack fibers or stop at the current fiber depends (in the native-code runtime) on exactly why we want a backtrace. I'm not sure the current semantics are all that great, but I believe I've preserved them.

I was tempted to improve the abstraction further, by making a new type (callstack_t) of C-heap-allocated backtrace objects which have slots for some metadata such as frame count (number of live entries) and size (for re-sizing), and support a small number of operations. That would allow this code to be further improved, removing some duplicated code in each runtime and also moving some source code from backtrace_nat.c and backtrace_byt.c into backtrace.c. However, I've deferred that for now.
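To make the deferred idea concrete, here is a minimal sketch of what such a C-heap-allocated backtrace object could look like. It is an illustration only: the type layout, the `slot_t` stand-in, and the `callstack_*` function names are all hypothetical, and plain `realloc`/`free` are used where the runtime would presumably use its own C-heap allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the runtime's backtrace slot type. */
typedef void *slot_t;

/* Hypothetical callstack object: metadata plus a growable slot array. */
typedef struct {
  size_t count;   /* number of live entries */
  size_t size;    /* number of allocated entries (for re-sizing) */
  slot_t *slots;  /* C-heap array, NULL while size == 0 */
} callstack_t;

static void callstack_init(callstack_t *cs)
{
  cs->count = 0;
  cs->size = 0;
  cs->slots = NULL;
}

/* Forget the entries but keep the storage, so the object can be reused. */
static void callstack_clear(callstack_t *cs)
{
  cs->count = 0;
}

/* Append one slot, doubling the capacity when full.
   Returns 0 on success, -1 if memory could not be obtained
   (in which case the existing contents are left untouched). */
static int callstack_push(callstack_t *cs, slot_t s)
{
  assert(cs->count <= cs->size);
  if (cs->count == cs->size) {
    size_t new_size = cs->size ? cs->size * 2 : 16;
    slot_t *p = realloc(cs->slots, new_size * sizeof *p);
    if (p == NULL) return -1;
    cs->slots = p;
    cs->size = new_size;
  }
  cs->slots[cs->count++] = s;
  return 0;
}

static void callstack_free(callstack_t *cs)
{
  free(cs->slots);
  callstack_init(cs);
}
```

With a type like this, the per-runtime stack walkers would only need to call the push operation, and the growth logic could live in one place.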

This is one of a number of small PRs working towards restoring statmemprof on trunk. The large (unmergeable) PR #12379 shows where these are heading.

gasche · 2023-08-21T22:36:36Z

I tried to review this PR quickly and gave up / timed out because there are many changes together: there are various code style changes/improvements all over the place (which make for a relatively large diff), plus some actual API changes.

Here is what I understood by looking at the current PR and also #12379:

* The 4.14 statmemprof implementation solves the problem of "sometimes you cannot call caml_alloc" by exposing a variant that calls caml_alloc_shr (allocates on the major heap, does not run a GC). #12379 (Statmemprof for multicore [do not merge]) uses a different strategy where the variant calls caml_stat_alloc, and then we copy the data to a caml_alloc block at the point where we decide to call the callback that was postponed. I am not sure why you are using a different strategy (is the previous strategy not valid anymore? is it related to some performance concerns? an issue with the domain calling the callback not necessarily being the same as the one recording the callstack? something else?)
* This PR contains minor refactorings but also cherry-picking of changes of 4.14 that were not applied in the Multicore branch -- for example various Slot_* macros and the alloc_idx business. Sometimes the changes are directly copied from the 4.14 branch, sometimes they are reimplemented differently.

My gut feeling for how this PR could be made easier to review:

* no minor style changes, or minor style changes in separate commits that are clearly marked as preserving the behavior exactly (we can review them without thinking much)
* make it easy to review the new non-minor-change code by comparing to 4.14 -- you can point out the places where changes were needed and briefly mention why

NickBarnes · 2023-08-22T14:59:25Z

I will refactor this branch into a few commits as you suggest.

Although it is true that in #12379 I change the strategy of storing the backtrace for an allocation callback (from a shared-heap allocation to a C heap allocation), that's largely independent of this PR, which is about adding two pieces of functionality not present on trunk but needed for statmemprof:

1. obtaining a backtrace from the current callstack which is neither (a) stored in the single (per-domain) most-recent-exception backtrace buffer (as caml_stash_backtrace does) nor (b) allocated on the Caml minor heap (as caml_get_current_callstack and caml_get_continuation_callstack do); and
2. allowing a backtrace to include the debuginfo for an allocation point, as the slot at the hot end of the backtrace, without changing the Printexc Caml interface. This allows statmemprof to accurately profile "Comballoc" allocations.

In adding functionality (1) I chose to also refactor the code at the heart of backtracing, so that it is shared between the new function and the two existing use cases (a) and (b), and so that it avoids the ugly and potentially slow double stack iteration (in the traditional way, using a dynamic size-doubling array). This is similar to the way it was done for statmemprof in 4.12 by @stedolan, but some additional complications in multicore (such as fibers) mean that code doesn't port directly across (and result in the complex "candy-machine" interface in this PR).
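The single-pass approach with a size-doubling array can be sketched as follows. This is not the PR's code: the linked list of `struct frame` is a stand-in for walking real frame descriptors (and ignores fibers entirely), and the function name and signature are assumptions made for illustration. The point it shows is that the buffer and its capacity are passed in and out, so one buffer can be grown once and then reused for later backtraces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for a stack frame: a return address and a link to the caller. */
struct frame {
  uintptr_t pc;
  const struct frame *caller;
};

/* Walk the chain starting at `top` exactly once, storing at most
   `max_frames` entries into *buf, whose capacity (in entries) is *cap.
   The buffer is grown by doubling when needed; *buf may be NULL if
   *cap is 0. Returns the number of entries stored. If memory runs
   out, the trace collected so far is returned truncated. */
static size_t collect_callstack(const struct frame *top, size_t max_frames,
                                uintptr_t **buf, size_t *cap)
{
  size_t n = 0;
  assert(buf != NULL && cap != NULL);
  assert(*buf != NULL || *cap == 0);

  for (const struct frame *f = top;
       f != NULL && n < max_frames;
       f = f->caller) {
    if (n == *cap) {
      size_t new_cap = *cap ? *cap * 2 : 8;
      uintptr_t *p = realloc(*buf, new_cap * sizeof *p);
      if (p == NULL) break;
      *buf = p;
      *cap = new_cap;
    }
    (*buf)[n++] = f->pc;
  }
  return n;
}
```

A caller that keeps `buf` and `cap` around between calls pays for reallocation only when a deeper stack than any seen before is encountered, which avoids both the counting pass and a large statically sized buffer.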

The implementation of (2) is a direct steal from 4.14 (although I couldn't do it with a cherry-pick as too much else has changed).
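For readers unfamiliar with the idea behind (2), a backtrace slot that can hold either of two pointer kinds is commonly implemented by tagging a spare low bit of the pointer. The sketch below illustrates that general technique only; the type names, helper names, and the choice of tag bit are hypothetical and are not claimed to match the Slot_* macros in 4.14 or in this PR.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the runtime's frame descriptor and debuginfo types. */
typedef struct { const char *function; } frame_info;
typedef struct { const char *location; } alloc_info;

/* A slot is a pointer-sized word; the low bit says which kind it holds.
   The tag value is an assumption made for this sketch. */
typedef uintptr_t slot;
#define SLOT_ALLOC_TAG ((uintptr_t)1)

static slot slot_of_frame(const frame_info *f)
{
  /* Relies on the pointee being at least 2-byte aligned. */
  assert(((uintptr_t)f & SLOT_ALLOC_TAG) == 0);
  return (uintptr_t)f;
}

static slot slot_of_alloc(const alloc_info *a)
{
  assert(((uintptr_t)a & SLOT_ALLOC_TAG) == 0);
  return (uintptr_t)a | SLOT_ALLOC_TAG;
}

static int slot_is_alloc(slot s)
{
  return (s & SLOT_ALLOC_TAG) != 0;
}

static const frame_info *slot_frame(slot s)
{
  assert(!slot_is_alloc(s));
  return (const frame_info *)s;
}

static const alloc_info *slot_alloc(slot s)
{
  assert(slot_is_alloc(s));
  return (const alloc_info *)(s & ~SLOT_ALLOC_TAG);
}

/* Consumers dispatch on the tag, so the array type itself is unchanged. */
static const char *slot_describe(slot s)
{
  return slot_is_alloc(s) ? slot_alloc(s)->location
                          : slot_frame(s)->function;
}
```

Because the distinction lives inside each slot, a backtrace remains a flat array of words, which is consistent with the stated goal of leaving the Printexc interface unchanged.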

4.14 does (1) using a single large backtrace buffer in memprof.c (called callstack_buffer). Obviously this isn't going to work for multicore, so rather than have a large per-domain static buffer I have a per-domain dynamically-grown buffer. But that's for a later PR.

As I say, I'll restructure this PR into separate commits for these distinct items (and maybe lose some of the minor stylistic changes on the way).

gasche

(I wrote some comments while looking at the code, but forgot to commit/publish them, sorry.)

runtime/backtrace_nat.c

gasche · 2023-08-22T19:04:18Z

That sounds good. I'm not asking for the code to be exactly similar to 4.14 (it's not like it was perfect in the first place), but I was planning to review actual behavior changes by doing a three-way compare between 4.14, trunk and your PR: it helps if they are clearly separated and, when you did things differently from 4.14, you can briefly mention why (in the PR discussion; this need not be a code comment).

gasche · 2023-08-22T19:06:57Z

(I'm generally supportive of changes that add comments and clarify the codebase in a way that makes it easier for future people to understand. It's best if they are clearly marked as such and separated from the rest, because it is a very different review mindset from behavior changes. Then some things such as whether we use !d or d != NULL are mostly noise from my perspective: I would support keeping the codebase consistent in this respect and I can tell that you have the C experience to make informed changes there, but I'm not terribly excited when they show up in a review diff.)

fabbing

I have previously reviewed the "trial PR" proposing these changes (NickBarnes#1), and I think this refactoring looks good.

I have since been informed that it breaks compilation with --enable-tsan, so I propose these minor changes to make it compatible:

diff --git a/runtime/backtrace_nat.c b/runtime/backtrace_nat.c
index 6a376ffcab..62b0c91c43 100644
--- a/runtime/backtrace_nat.c
+++ b/runtime/backtrace_nat.c
@@ -34,7 +34,7 @@
 
 /* Returns the next frame descriptor (or NULL if none is available),
    and updates *pc and *sp to point to the following one.  */
-static frame_descr *next_frame_descriptor
+frame_descr *next_frame_descriptor
     (caml_frame_descrs fds, uintnat *pc, char **sp, struct stack_info *stack)
 {
   frame_descr *d;
diff --git a/runtime/caml/frame_descriptors.h b/runtime/caml/frame_descriptors.h
index 71142a5550..4d10b54263 100644
--- a/runtime/caml/frame_descriptors.h
+++ b/runtime/caml/frame_descriptors.h
@@ -149,7 +149,7 @@ caml_frame_descrs caml_get_frame_descrs(void);
 frame_descr* caml_find_frame_descr(caml_frame_descrs fds, uintnat pc);
 
 
-frame_descr * caml_next_frame_descriptor
+frame_descr * next_frame_descriptor
     (caml_frame_descrs fds, uintnat * pc, char ** sp, struct stack_info* stack);
 
 #endif /* CAML_INTERNALS */
diff --git a/runtime/tsan.c b/runtime/tsan.c
index 348d761b99..0e10d63ed3 100644
--- a/runtime/tsan.c
+++ b/runtime/tsan.c
@@ -130,7 +130,7 @@ void caml_tsan_exit_on_raise(uintnat pc, char* sp, char* trapsp)
 
   /* iterate on each frame  */
   while (1) {
-    frame_descr* descr = caml_next_frame_descriptor(fds, &next_pc, &sp,
+    frame_descr* descr = next_frame_descriptor(fds, &next_pc, &sp,
         domain_state->current_stack);
 
     if (descr == NULL) {
@@ -214,7 +214,7 @@ void caml_tsan_exit_on_perform(uintnat pc, char* sp)
 
   /* iterate on each frame  */
   while (1) {
-    frame_descr* descr = caml_next_frame_descriptor(fds, &next_pc, &sp, stack);
+    frame_descr* descr = next_frame_descriptor(fds, &next_pc, &sp, stack);
 
     caml_tsan_debug_log_pc("forced__tsan_func_exit for", pc);
     __tsan_func_exit(NULL);
@@ -243,7 +243,7 @@ CAMLreally_no_tsan void caml_tsan_entry_on_resume(uintnat pc, char* sp,
   caml_frame_descrs fds = caml_get_frame_descrs();
   uintnat next_pc = pc;
 
-  caml_next_frame_descriptor(fds, &next_pc, &sp, (struct stack_info*)stack);
+  next_frame_descriptor(fds, &next_pc, &sp, (struct stack_info*)stack);
   if (next_pc == 0) {
     stack = stack->handler->parent;
     if (!stack) {

NickBarnes · 2023-08-24T14:29:15Z

I've redone this whole PR, breaking it into a few separate commits with distinct effects. I've dropped the candy-machine version of get_callstack(), which was only really present to support caml_stash_backtrace(): that case (the callstack for the most recent exception) has several unique requirements (it doesn't necessarily start at the hot end of the stack, it doesn't necessarily start at the beginning of the returned buffer, and it doesn't trace up into parent fibers).

Building the compiler with ThreadSanitizer and running the testsuite caused too many reports in OCaml 5 and was disabled (see ocaml#11040). Since then, the work on TSan support for OCaml programs has fixed a number of those data races and temporarily silenced the ones that are waiting to be investigated (see ocaml#11040 again). As a result, running the testsuite with `--enable-tsan` is now a cheap and effective way of detecting new data races that may be introduced in the runtime. A second good reason to restore the TSan CI is that it will detect early if a recent change has accidentally broken TSan instrumentation (as has happened before as an accidental consequence of removing a symbol, ocaml#12383 (review)), or other issues (e.g. a new test revealed a TSan limitation with signals, ocaml#12561 (comment)). Adding this test to the GitHub Actions CI arguably lengthens the runs too much (a GHA run on amd64 Linux takes about 50 minutes). A good compromise seems to be the Inria CI, which is run on every merge.

fabbing

I don't get all the details, but the new changes look good to me... and also, TSan still compiles!

NickBarnes · 2023-10-17T12:28:28Z

I've rebased this, and it's approved by @fabbing. Can we merge it?

* Building the compiler with ThreadSanitizer and running the testsuite caused too many reports in OCaml 5 and was disabled (see #11040). Since then, the work on TSan support for OCaml programs has fixed a number of those data races and temporarily silenced the ones that are waiting to be investigated (see #11040 again). As a result, running the testsuite with `--enable-tsan` is now a cheap and effective way of detecting new data races that may be introduced in the runtime. A second good reason to restore the TSan CI is that it will detect early if a recent change has accidentally broken TSan instrumentation (as has happened before as an accidental consequence of removing a symbol, #12383 (review)), or other issues (e.g. a new test revealed a TSan limitation with signals, #12561 (comment)). Adding this test to the GitHub Actions CI arguably lengthens the runs (a GHA run on amd64 Linux with TSan takes about 50 minutes). This PR therefore suggests the compromise of enabling it on the Inria CI, which is run on every merge.
* Disable tests parallel/catch_break with tsan
* CI sanitizers: Use clang 14

  clang 13 thread sanitizer produces different, less precise traces. Also, clang 14 is the default version in Ubuntu 22.04 LTS.

Co-authored-by: Xavier Leroy <xavier.leroy@college-de-france.fr>

NickBarnes · 2023-11-07T12:30:35Z

rebased.

iterate over the stack once instead of twice, dynamically reallocating the backtrace as required. Allow it to take an existing backtrace buffer and allocated size in arguments. Together, this will allow statmemprof to collect backtraces efficiently and reuse the same buffer for multiple backtraces without statically allocating a large buffer per domain. Also tidied up some variable names for consistency ("backtrace" and "slots", not "frames" - the distinction is important for the native backend; changed the bytecode backend for consistency).

sadiqj

Looks good to me, thanks @NickBarnes

kayceesrk added the statmemprof label Jul 18, 2023

NickBarnes force-pushed the nick-statmemprof-backtrace branch from 86a961a to 10f435b Compare August 21, 2023 16:33

gasche reviewed Aug 22, 2023

fabbing suggested changes Aug 24, 2023

NickBarnes force-pushed the nick-statmemprof-backtrace branch from 10f435b to a5c4de9 Compare August 24, 2023 14:17

OlivierNicole mentioned this pull request Oct 9, 2023

Re-enable ThreadSanitizer in the Inria CI #12644

Merged

fabbing approved these changes Oct 11, 2023

NickBarnes force-pushed the nick-statmemprof-backtrace branch from 9686aa7 to 9a019bb Compare October 17, 2023 12:23

NickBarnes force-pushed the nick-statmemprof-backtrace branch from c14cd70 to d4290e9 Compare November 7, 2023 12:17

NickBarnes added a commit to NickBarnes/ocaml that referenced this pull request Nov 7, 2023

Improve backtrace abstractions in the runtime (ocaml#12383)

ed91b66

NickBarnes added 6 commits November 24, 2023 16:21

Comment the header declaration of caml_next_frame_descriptor (used in TSAN).

997f587

Allow backtrace slots to be either frame descriptors or debuginfos for comballoc allocations.

798125d

Add caml_get_callstack() function to both native code and bytecode runtimes. This may be used by statmemprof to efficiently capture stacks at allocation points.

14ba111

Changes.

84e9326

Add Fabrice to reviewers.

1ab959d

NickBarnes force-pushed the nick-statmemprof-backtrace branch from b753991 to 1ab959d Compare November 24, 2023 16:23

NickBarnes added a commit to NickBarnes/ocaml that referenced this pull request Nov 24, 2023

Improve backtrace abstractions in the runtime (ocaml#12383)

a1c4fd1

sadiqj approved these changes Nov 24, 2023

sadiqj merged commit 70d7f4e into ocaml:trunk Nov 24, 2023
9 checks passed

NickBarnes added a commit to NickBarnes/ocaml that referenced this pull request Nov 25, 2023

Improve backtrace abstractions in the runtime (ocaml#12383)

c45f619

NickBarnes added a commit to NickBarnes/ocaml that referenced this pull request Nov 27, 2023

Improve backtrace abstractions in the runtime (ocaml#12383)

bab837f

NickBarnes added a commit to NickBarnes/ocaml that referenced this pull request Nov 29, 2023

Improve backtrace abstractions in the runtime (ocaml#12383)

222cb4e

NickBarnes mentioned this pull request Jan 19, 2024

Statmemprof resurrected #12923

Merged
