
Restore frame-pointers support for amd64 #11144

Merged
merged 12 commits into ocaml:trunk on Jun 30, 2022

Conversation

fabbing
Contributor

@fabbing fabbing commented Mar 29, 2022

Introduction

Frame-pointer support was partially lost with the multicore merge; this PR aims to restore it for the amd64 architecture.
A quick introduction to frame pointers is available at #11031.

Changes

Most of the changes are located inside amd64.S and are just adaptations of what was already done before. Nevertheless, some of the changes are new because of the way the runtime currently works.

Fiber reallocation

The runtime may need to reallocate a fiber to accommodate the stack space required for it to perform its work.
When a stack is reallocated, addresses pointing into the old stack must be updated to refer to the new stack space. Just as exception handler addresses are rewritten, the linked list of frame pointers must be updated.
The rewrite_frame_pointers function does this by following the linked list of frame pointers and adjusting the address values when necessary; it also updates such addresses located on other stacks (other fibers, or the C stack).

Reserving rbp for frame-pointers at C calls

The rbp register was used to save information for DWARF unwinding at C calls; the rbx register is now used instead, allowing rbp to serve as the base pointer.

Added tests to the testsuite

Those should hopefully validate that frame-pointers work as expected.

  • c_call.ml: checks the correctness of the frame-pointers backtrace during the execution of different C calls.
  • effects.ml: checks the correctness of the frame-pointers backtrace in several key locations when using effects.
  • stack_reallocations.ml: checks rewrite_frame_pointers works as expected when the computation fiber needs to be reallocated.
  • stack_reallocations2.ml: checks rewrite_frame_pointers and caml_resume work as expected when the “effect handling” fiber needs to be reallocated.

The last two tests detect more problems when run under the debug runtime, since the reallocated fiber's stack space is memset to a canary value, quickly revealing dangling base pointers.

Acknowledgement

Thanks to @ctk21, @kayceesrk and @Engil for their advice.
And to @shindere for his help in improving the tests.

delta = (char*)Stack_high(new_stack) - (char*)Stack_high(old_stack);

/* Walk the frame-pointers linked list */
for (frame = __builtin_frame_address(0); frame; frame = next) {
Contributor Author

Is it okay to use this GCC builtin (also supported by clang), which returns the current rbp register value?
An alternative would be to implement our own caml_frame_address for amd64 (and other architectures that require it).

Member

At this stage, this should be fine, yes: --enable-frame-pointers-mode has always been for gcc/clang on Linux x86_64.

@kayceesrk
Contributor

Thanks for this work @fabbing. For me, the main use of frame pointers is running perf with the call graph in fp (frame pointer) mode rather than dwarf mode. Since DWARF unwinding is slow, perf saves a copy of the stack memory at every sample. This can be quite inefficient (and imprecise, since our stacks are not contiguous).

It would be useful to take a very small program that uses effect handlers and does some long-running computation. With fp mode you should be able to see the full call graph (all the way down to the main function) in the profile as opposed to a truncated one in dwarf mode. Hopefully, you should also see that the size of the profile is much smaller under fp mode.

Would you be able to confirm this @fabbing?

@gasche
Member

gasche commented Mar 31, 2022

Since DWARF unwinding is slow, perf saves a copy of the stack memory at every sample. This can be quite inefficient (and imprecise since our stacks are not contiguous).

Could we discuss this somewhere? (is it discussed somewhere?) I use perf record --call-graph=dwarf for my OCaml profiling (I don't really want to have a new switch just for this), and I naively assumed that someone had magically worked out how to describe the new weird stack layout to the debugging dwarves. Should we assume that --call-graph=dwarf will be broken with the Multicore runtime forever, or does someone have better news?

@kayceesrk
Contributor

Could we discuss this somewhere? (is it discussed somewhere?)

Not discussed. But mentioned in the PLDI 2021 paper on retrofitting effect handlers -- Section 5.5 in https://kcsrk.info/papers/retro-concurrency_pldi_21.pdf. Unfortunately, given that perf works by dumping (a segment of) the stack at every sample, there is no easy fix for this. The traces with OCaml 5 will be truncated.

perf record --call-graph=fp will be accurate and fast compared to --call-graph=dwarf. @lpw25 mentioned in the past that within Jane Street they use +fp switch for perf profiling since DWARF based call graph is too slow for them.

@sadiqj
Contributor

sadiqj commented Apr 3, 2022

Should we assume that --call-graph=dwarf will be broken with the Multicore runtime forever, or does someone have better news?

On call-graph dwarf, it's bad news yes. perf's method for dwarf call-graphs isn't going to work well when there are non-contiguous stacks. To deal with that I think there'd have to be dwarf unwinding in the kernel. The good news is that if you have a reasonably modern Intel CPU then Last Branch Record (lbr) call-graph sampling works well.

This is a slightly modified sched.ml that runs for many iterations and allocates an array instead of printing something to the console. Here is using perf call-graph DWARF:

[screenshot: perf report with --call-graph=dwarf]

The screenshot shows caml_make_vec and caml_c_call as the top entries, and you lose the tie-up between the OCaml and C stacks.

With call-graph LBR (on my i7-12700F):

[screenshot: perf report with --call-graph=lbr]

Here the screenshot shows the transition from OCaml to C captured correctly (specifically Sched.say to caml_make_vec).

It seems like AMD Zen 3 also has some kind of branch sampling that is going into perf: https://lwn.net/Articles/875869/

@kayceesrk
Contributor

Last Branch Record (lbr) call-graph sampling works well.

Thanks @sadiqj. Does this come for free on OCaml 5, i.e., no work required on the OCaml side?

@sadiqj
Contributor

sadiqj commented Apr 4, 2022

Thanks @sadiqj. Does this come for free on OCaml 5 i.e, no work required on the OCaml side?

Yes, that test was done on trunk.

@fabbing
Contributor Author

fabbing commented Apr 5, 2022

With a program very similar to the one used by @sadiqj, we get under perf, as expected, a complete backtrace with frame-pointers:

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 13K of event 'cycles'
# Event count (approx.): 15072055413
#
# Children      Self       Samples  Command  Shared Object     Symbol                                                  
# ........  ........  ............  .......  ................  ........................................................
#
    85.87%     0.00%             0  Domain0  libc.so.6         [.] __libc_start_call_main
    85.87%     0.00%             0  Domain0  sched_fp.exe      [.] main
    85.87%     0.00%             0  Domain0  sched_fp.exe      [.] caml_start_program
    85.87%     0.00%             0  Domain0  sched_fp.exe      [.] caml_startup__code_begin
    85.38%     0.08%            11  Domain0  sched_fp.exe      [.] Sched.entry
            |          
             --85.30%--Sched.entry
                       |          
                       |--31.49%--caml_runstack
                       |          |          
                       |          |--15.47%--Sched.test_479
                       |          |          |          
                       |          |           --15.15%--Sched.say_289
                       |          |                     |          
                       |          |                      --15.10%--caml_c_call
                       |          |                                |          
                       |          |                                 --14.94%--caml_make_vect
                       |          |                                           |          
                       |          |                                            --0.70%--caml_process_pending_actions
                       |          |                                                      |          
                       |          |                                                       --0.68%--caml_poll_gc_work
                       |          |                                                                 |          
                       |          |                                                                  --0.67%--caml_major_collection_slice
                       |          |                                                                            major_collection_slice
                       |          |          
                       |          |--11.23%--Sched.fun_610
                       |          |          |          
                       |          |           --10.29%--Sched.say_289
                       |          |                     |          
                       |          |                      --10.25%--caml_c_call
                       |          |                                |          
                       |          |                                 --10.19%--caml_make_vect
                       |          |                                           |          
                       |          |                                            --1.11%--caml_process_pending_actions
                       |          |                                                      caml_poll_gc_work
                       |          |                                                      |          
                       |          |                                                       --1.08%--caml_major_collection_slice
                       |          |                                                                 |          
                       |          |                                                                  --1.07%--major_collection_slice
                       |          |                                                                            |          
                       |          |                                                                             --1.01%--caml_sweep
                       |          |                                                                                       |          
                       |          |                                                                                        --0.56%--pool_sweep
                       |          |          
                       |           --4.32%--Sched.fun_622
                       |                     Sched.say_289
                       |                     |          
                       |                      --4.31%--caml_c_call
                       |                                |          
                       |                                 --4.29%--caml_make_vect
                       |          
                       |--15.41%--Sched.fun_589
                       |          |          
                       |           --15.12%--Sched.say_289
                       |                     |          
                       |                      --15.06%--caml_c_call
                       |                                |          
                       |                                 --15.00%--caml_make_vect
                       |                                           |          
                       |                                            --1.95%--caml_process_pending_actions
                       |                                                      |          
                       |                                                       --1.95%--caml_poll_gc_work
                       |                                                                 |          
                       |                                                                  --1.93%--caml_major_collection_slice
                       |                                                                            |          
                       |                                                                             --1.92%--major_collection_slice
                       |                                                                                       |          
                       |                                                                                        --1.38%--caml_sweep
                       |                                                                                                  |          
                       |                                                                                                   --0.66%--pool_sweep
                       |          
                       |--11.71%--Sched.fun_596
                       |          |          
                       |           --11.47%--Sched.say_289
                       |                     |          
                       |                      --11.44%--caml_c_call
                       |                                |          
                       |                                 --11.35%--caml_make_vect
                       |                                           |          
                       |                                            --2.12%--caml_process_pending_actions
                       |                                                      |          
                       |                                                       --2.08%--caml_poll_gc_work
                       |                                                                 caml_major_collection_slice
                       |                                                                 |          
                       |                                                                  --2.05%--major_collection_slice
                       |                                                                            |          
                       |                                                                             --1.56%--caml_sweep
                       |                                                                                       |          
                       |                                                                                        --0.62%--pool_sweep
                       |          
                       |--10.03%--Sched.fun_558
                       |          |          
                       |           --9.98%--Sched.say_289
                       |                     |          
                       |                      --9.96%--caml_c_call
                       |                                |          
                       |                                 --9.88%--caml_make_vect
                       |                                           |          
                       |                                            --0.99%--caml_process_pending_actions
                       |                                                      |          
                       |                                                       --0.98%--caml_poll_gc_work
                       |                                                                 |          
                       |                                                                  --0.95%--caml_major_collection_slice
                       |                                                                            |          
                       |                                                                             --0.93%--major_collection_slice
                       |          
                       |--5.84%--Sched.say_289
                       |          |          
                       |           --5.77%--caml_c_call
                       |                     |          
                       |                      --5.72%--caml_make_vect
                       |                                |          
                       |                                 --1.20%--caml_process_pending_actions
                       |                                           caml_poll_gc_work
                       |                                           |          
                       |                                            --1.17%--caml_major_collection_slice
                       |                                                      |          
                       |                                                       --1.15%--major_collection_slice
                       |                                                                 |          
                       |                                                                  --1.11%--caml_sweep
                       |                                                                            |          
                       |                                                                             --0.52%--pool_sweep
                       |          
                       |--4.50%--Sched.fun_563
                       |          |          
                       |           --4.49%--Sched.say_289
                       |                     |          
                       |                      --4.48%--caml_c_call
                       |                                |          
                       |                                 --4.44%--caml_make_vect
                       |          
                       |--4.35%--Sched.fun_605
                       |          |          
                       |           --4.34%--Sched.say_289
                       |                     |          
                       |                      --4.32%--caml_c_call
                       |                                |          
                       |                                 --4.27%--caml_make_vect
                       |          
                        --0.63%--Stdlib.Effect.match_with_363
                                  |          
                                   --0.51%--caml_c_call

    81.77%     0.45%            60  Domain0  sched_fp.exe      [.] caml_c_call
    81.24%     0.30%            40  Domain0  sched_fp.exe      [.] Sched.say_289
    81.13%    67.64%          8938  Domain0  sched_fp.exe      [.] caml_make_vect
    31.56%     0.45%            59  Domain0  sched_fp.exe      [.] caml_runstack
    15.53%     0.29%            39  Domain0  sched_fp.exe      [.] Sched.test_479
    15.48%     0.09%            12  Domain0  sched_fp.exe      [.] Sched.fun_589
    11.73%     0.04%             5  Domain0  sched_fp.exe      [.] Sched.fun_596
    11.26%     0.14%            19  Domain0  sched_fp.exe      [.] Sched.fun_610
    10.53%     0.00%             0  Domain0  [unknown]         [k] 0000000000000000
    10.05%     0.05%             7  Domain0  sched_fp.exe      [.] Sched.fun_558
     9.20%     5.72%           756  Domain0  libc.so.6         [.] _int_malloc
[...]

As explained by @kayceesrk, recording the execution with the frame-pointer unwind method (perf record -g --call-graph fp ./sched_fp.exe) also results in a much smaller profile file: 1.6 MB rather than the 106 MB produced by recording with DWARF (perf record -g --call-graph dwarf ./sched_fp.exe).
In this very specific example there was not much difference in perf execution time between fp and dwarf unwinding.

@sadiqj
Contributor

sadiqj commented Apr 5, 2022

In this very specific example there was not much difference in perf execution time between fp and dwarf unwiding.

perf record may have taken similar amounts of time but perf report should have taken much longer with dwarf since that's when the unwinding actually happens.

The other advantage frame pointers has is that you can walk the stack trace from within eBPF probes, something that you can't easily do with dwarf.

@gasche
Member

gasche commented Apr 5, 2022

But the problem with frame pointers remain that they are not enabled by default (for reasonable reasons), so they require changing to a different environment to profile, which is a pain, and I don't do it in practice. Of course I agree that it's nice to make it work for people that use it, not trying to criticize the work here, but for my actual usage I'm more interested in @sadiqj's LBR stuff. (I tried to get an AMD CPU on my last laptop for the fun of it, but failed, so I should be good; but with the rapid increase of Mac M1 machines maybe we should be nervous about getting comfy with Intel-only technology.)

I wish these battle-won profiling tricks were documented in a central place. (I think there is some profiling documentation on ocaml.org?) Especially now that the tricks are changing in 5.x, having them spread over many blog posts and internal documents is not very good. (Which reminded me of the distantly-related #11058.)

@gasche
Member

gasche commented Apr 5, 2022

(Yes, generating perf reports with call-graph=dwarf is slow, but I never was in a situation where it was slower and more cumbersome than maintaining two switches in parallel, or inconvenient enough that I would move to +fp as my main switch. I tend to be lucky to profile programs that run in at most a handful of seconds, and the report generation also only takes seconds.)

@xavierleroy
Contributor

@gasche: frame pointers are like 8-track tapes: obsolete 1970's technologies that some Americans really like and refuse to abandon. Not much we can do here. At least they are happy with an OPAM switch and don't demand frame pointer support by default. Count your blessings.

@kayceesrk
Contributor

@fabbing Thanks for the perf work. The PR is marked as draft. Is there still work left to do on this before we can start reviewing?

@fabbing
Contributor Author

fabbing commented Apr 6, 2022

Thanks KC!
I think the PR can now drop its draft status and start being reviewed, as I don't expect many changes to be necessary. But there are still a few points I would like to discuss:

  1. Stack alignment
  • Stack alignment should still be correct: the testsuite also passes with the debug runtime enabling the CHECK_STACK_ALIGNMENT macro which checks for correct stack alignment at every C_CALL.
    Moreover, the only place in amd64.S where the frame base pointer is actually pushed onto the C stack, and not onto an OCaml one, is for caml_raise_exception, and the stack pointer is discarded soon after.

  • But do function calls on the OCaml stacks (fibers) also have to be aligned?
    I don't think anything is currently being done to check that they are correctly aligned, and adding one word at each function prologue (in amd64.S) will have shifted everything!

  2. Frame pointers at caml_start_program
    The caml_start_program function does not currently push its base pointer during the prologue.
    It's something that wasn't done before, for example in 4.14

    FUNCTION(G(caml_start_program))

    Adding a frame pointer here would actually push rbp onto the C stack and would require another word to enforce alignment; it would also impact the caml_callback*_asm functions.
    What is the good reason this wasn't done before? And, as a corollary, should caml_start_program do this in 5.00?

  3. Saved rbp at main is not NULL
    The fp_backtrace function used for the frame-pointers tests is quite convoluted, because the saved rbp at main is not NULL as I naively expected (it's often a small value around 0x1~0xf).
    So a lot of precautions are taken to avoid crashing the test by dereferencing an invalid pointer.
    The rewrite_frame_pointers function also carefully avoids this, but more elegantly, by testing whether it's the top of the current fiber.

  4. Tests
    How many more tests checking the consistency of frame pointers do we want?
    I think at least one involving effects would be welcome.

@fabbing fabbing marked this pull request as ready for review April 6, 2022 10:28
fabbing added a commit to fabbing/ocaml that referenced this pull request Apr 22, 2022
@fabbing
Contributor Author

fabbing commented Apr 22, 2022

The added tests proved useful, since they revealed an oversight in the case of fiber reallocation with effects!
When a computation fiber is spun up, caml_runstack saves a base pointer pointing into the "effect handling" fiber. If the computation performs an effect, the work done by the "effect handling" fiber may require the reallocation of its stack. The rewrite_frame_pointers function kicks in but is unable to update this saved rbp, because it walks the stack using the rbp linked list, and at that point the computation fiber is no longer part of it.
The solution found, thanks to @ctk21, is to make caml_resume overwrite the saved rbp in caml_runstack when a computation fiber is continue-d, in case the "effect handling" fiber has been reallocated.

@xavierleroy
Contributor

Stack alignment [...]
But does function call on the OCaml stacks (fibers) also have to be aligned?
I don't think anything is currently being done to check they are correctly aligned and adding one word at each function prologue (in amd64.S) will have shifted everything!

Interesting observation, thanks! The only reason to keep the stack 16-aligned on x86-64 is to support the use of SSE instructions that demand 16-byte alignment. C compilers like to produce these instructions even for standard library functions like printf, so the C stack must be kept 16-aligned. In OCaml 4 this required OCaml stack frames to be 16-aligned. But in OCaml 5, the OCaml stacks should be fine with 8-alignment, and we can save stack space by aligning OCaml frames to 8 bytes, not 16!

@xavierleroy
Contributor

Apart from that: this PR needs to be thoroughly reviewed, and I'm not volunteering because I don't care about frame pointer support. I won't object to having them supported again in OCaml 5, but core developers who are actually interested in the feature should do the review.

@kayceesrk kayceesrk self-assigned this May 24, 2022
@let-def let-def self-assigned this May 30, 2022
@vsiles

vsiles commented Jun 10, 2022

For what it is worth, we have a couple of tools at Meta (perf analysis, flame graphs, ...) that still rely on the frame pointer. Having it around for a few releases would be of great help.

@let-def
Contributor

let-def commented Jun 27, 2022

I have reviewed most of the code and I am satisfied with it.

That being said, the representation of stack headers (defined in fiber.h) is still a bit mysterious to me. What I read made sense to me, though I cannot guarantee that it is correct. Here are some notes on the cases I found tricky:

  • with effects, we can have a linked-list of stack fragments/fibers
  • the first frame (the oldest) of each fragment has a pointer to the last frame of the parent fragment
  • a fragment can be reallocated, which causes its address to change: in this situation we need to update all frames inside this fragment (but never step outside, the parent pointer can be stale)
  • a fragment can be suspended and its parent resumed: in this situation, the frames in the parent can change, and the parent fragment itself can be reallocated or even deallocated

Because of this last case, the first frame pointer in a fragment can only be trusted if this fragment is "active". When suspended, we cannot make any assumption on the first frame pointer. That's why this pointer is updated every time we switch stack (https://github.com/ocaml/ocaml/pull/11144/files#diff-532fb04aef00f67a2b487d213655f8846c69ed0bcc92d5f0c8fa37e246aeee3fR281): this reattaches the fragment to the active stack. All tools that make use of frame pointers only scan the current stack; they should never look at a suspended fragment, so overwriting on stack switch is enough.

@fabbing
Contributor Author

fabbing commented Jun 28, 2022

Thanks a lot for having a look @let-def !

That being said, the representation of stack headers (defined in fiber.h) is still a bit mysterious to me. What I read made sense to me, though I cannot guarantee that it is correct. Here are some notes on the cases I found tricky:

  • with effects, we can have a linked-list of stack fragments/fibers
  • the first frame (the oldest) of each fragment has a pointer to the last frame of parent fragment
  • a fragment can be reallocated, which causes its address to change: in this situation we need to update all frames inside this fragment (but never step outside, the parent pointer can be stale)
  • a fragment can be suspended and its parent resumed: in this situation, the frames in the parent can change, and the parent fragment itself can be reallocated or even deallocated

Because of this last case, the first frame pointer in a fragment can only be trusted if this fragment is "active". When suspended, we cannot make any assumption on the first frame pointer. That's why this pointer is updated every time we switch stack (https://github.com/ocaml/ocaml/pull/11144/files#diff-532fb04aef00f67a2b487d213655f8846c69ed0bcc92d5f0c8fa37e246aeee3fR281): this reattaches the fragment to the active stack. All tools that make use of frame pointers only scan the current stack; they should never look at a suspended fragment, so overwriting on stack switch is enough.

That's right indeed; this was the last issue discovered, and hopefully it's now handled correctly, as validated by the tests stack_reallocations.ml and stack_reallocations2.ml.

@let-def
Contributor

let-def commented Jun 28, 2022

Yes. If that was not clear in my first message: I think this PR is correct and ready for merge, but I don't feel 100% confident on these tricky parts.

@gasche
Member

gasche commented Jun 29, 2022

@let-def are you planning to Approve this PR?

Do you trust that there is no regression risk for people who do not use the feature? My impression from skimming the assembly changes is "yes", but I haven't reviewed the PR myself.

If there is opposition to the change, or if some people would like to have the time to review before this might be merged, now is the time to say so.

@kayceesrk
Contributor

@gasche

I discovered a bug in frame pointer handling with effect handlers specifically in the place where saved rbp is overwritten in amd64.S. I've had a chat with @fabbing about this, and we have worked out a fix. Please hold off merging until the fix is in.

@fabbing
Contributor Author

fabbing commented Jun 30, 2022

@kayceesrk actually found two cases that weren't handled correctly:

  • At caml_resume, the oldest saved rbp (the highest one in the stack) was updated in the first fiber of the continuation, when it should instead be updated in the last fiber
  • caml_reperform must also update the oldest saved rbp, in case the fiber is reallocated while searching for the matching effect handler

We think we have solved both issues, and we also added to the testsuite another test, designed by @kayceesrk, that triggers both issues.

@gasche
Member

gasche commented Jun 30, 2022

Question: is the documentation produced by @let-def in #11144 (comment) integrated in the code as relevant, as a comment? If not, then that could be a useful addition to the PR, right?

runtime/amd64.S Outdated
@@ -849,6 +907,9 @@ CFI_STARTPROC
movq (Stack_handler-1)(%rdi), %r10
movq %rsi, Handler_parent(%r10) /* Append to last_fiber */
leaq 1(%rsi), %rdi /* %rdi (last_fiber) := Val_ptr(old stack) */
/* Need to update the olderst savde frame pointer here as the execution of
Contributor

Suggested change
/* Need to update the olderst savde frame pointer here as the execution of
/* Need to update the oldest saved frame pointer here as the execution of

@kayceesrk
Contributor

There are code comments in amd64.S that mention this [1]. Perhaps they can be expanded with information from #11144 (comment) @fabbing?

[1] https://github.com/ocaml/ocaml/pull/11144/files#diff-532fb04aef00f67a2b487d213655f8846c69ed0bcc92d5f0c8fa37e246aeee3fR910-R936

@gasche gasche merged commit 35c15e0 into ocaml:trunk Jun 30, 2022
@shindere
Contributor

shindere commented Oct 11, 2022 via email
