-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Restore frame-pointers support for amd64 #11144
Conversation
delta = (char*)Stack_high(new_stack) - (char*)Stack_high(old_stack); | ||
|
||
/* Walk the frame-pointers linked list */ | ||
for (frame = __builtin_frame_address(0); frame; frame = next) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it okay to use this GCC builtin (also supported by clang) which returns the current rbp register value?
An alternative would be to implement our own caml_frame_address in amd64 (and other architectures that requires it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this stage, this should be fine, yes - --enable-frame-pointers
-mode has always been for gcc/clang on Linux x86_64.
Thanks for this work @fabbing. For me, the main use of frame pointers is using It would be useful to take a very small program that uses effect handlers and does some long-running computation. With Would you be able to confirm this @fabbing? |
Could we discuss this somewhere? (is it discussed somewhere?) I use |
Not discussed. But mentioned in the PLDI 2021 paper on retrofitting effect handlers -- Section 5.5 in https://kcsrk.info/papers/retro-concurrency_pldi_21.pdf. Unfortunately, given that
|
On call-graph dwarf, it's bad news yes. perf's method for dwarf call-graphs isn't going to work well when there are non-contiguous stacks. To deal with that I think there'd have to be dwarf unwinding in the kernel. The good news is that if you have a reasonably modern Intel CPU then Last Branch Record (lbr) call-graph sampling works well. This is a slightly modified sched.ml that runs for many iterations and allocates an array instead of printing something to the console. Here is using perf call-graph DWARF: The screenshot is showing With call-graph LBR (on my i7-12700F): Here the screenshot shows going from OCaml to C correctly (specifically It seems like AMD Zen 3 also has some kind of branch sampling that's going in to perf: https://lwn.net/Articles/875869/ |
Thanks @sadiqj. Does this come for free on OCaml 5 i.e, no work required on the OCaml side? |
Yes, that test was done on trunk. |
With a program very similar to the one used by @sadiqj, we get under perf, as expected, a complete backtrace with frame-pointers:
As explained by @kayceesrk, recording the execution with frame-pointers unwind method ( |
perf record may have taken similar amounts of time but perf report should have taken much longer with dwarf since that's when the unwinding actually happens. The other advantage frame pointers has is that you can walk the stack trace from within eBPF probes, something that you can't easily do with dwarf. |
But the problem with frame pointers remain that they are not enabled by default (for reasonable reasons), so they require changing to a different environment to profile, which is a pain, and I don't do it in practice. Of course I agree that it's nice to make it work for people that use it, not trying to criticize the work here, but for my actual usage I'm more interested in @sadiqj's LBR stuff. (I tried to get an AMD CPU on my last laptop for the fun of it, but failed, so I should be good; but with the rapid increase of Mac M1 machines maybe we should be nervous about getting comfy with Intel-only technology.) I wish these battle-won profiling tricks were documented in a central place. (I think there is some profiling documentation on ocaml.org?) Especially now that the tricks are changing in 5.x, having them spread over many blog posts and internal documents is not very good. (Which reminded me of the distantly-related #11058.) |
(Yes, generating perf reports with call-graph=dwarf is slow, but I never was in a situation where it was slower and more cumbersome than maintaining two switches in parallel, or inconvenient enough that I would move to +fp as my main switch. I tend to be lucky to profile programs that run in at most a handful of seconds, and the report generation also only takes seconds.) |
@gasche: frame pointers are like 8-track tapes: obsolete 1970's technologies that some Americans really like and refuse to abandon. Not much we can do here. At least they are happy with an OPAM switch and don't demand frame pointer support by default. Count your blessings. |
@fabbing Thanks for the |
Thanks KC!
|
The added tests proved useful since they showed an oversight in case of fiber reallocation with effects! |
Interesting observation, thanks! The only reason to keep the stack 16-aligned in x86-64 is to support the use of SSE instructions that demand 16-byte alignment. C compilers like to produce these instructions even for standard library functions like |
Apart from that: this PR needs to be thoroughly reviewed, and I'm not volunteering because I don't care about frame pointer support. I won't object to having them supported again in OCaml 5, but core developers who are actually interested in the feature should do the review. |
For what it is worth, we have a couple tools at Meta (perf analysis, flame graphs, ...) that still relies on the frame pointer. Having it around for a few releases would be of great help. |
I have reviewed most of the code and I am satisfied with it. That being said, the representation of stack headers (defined in
Because of this last case, the first frame pointer in a fragment can only be trusted if this fragment is "active". When suspended, we cannot make any assumption on the first frame pointer. That's why this pointer is updated every time we switch stack (https://github.com/ocaml/ocaml/pull/11144/files#diff-532fb04aef00f67a2b487d213655f8846c69ed0bcc92d5f0c8fa37e246aeee3fR281): this reattaches the fragment to the active stack. All tools that make use of frame pointers only scan the current stack; they should never look at a suspended fragment, so overwriting on stack switch is enough. |
Thanks a lot for having a look @let-def !
That's right indeed, this was the last issue discovered and hopefully it's now handled correctly and validated by the tests |
Yes. If that was not clear in my first message: I think this PR is correct and ready for merge, but I don't feel 100% confident on these tricky parts. |
@let-def are you planning to Approve this PR? Do you trust that there is no regression risk for people who do not use the feature? My impression from skimming the assembly changes is "yes", but I haven't reviewed the PR myself. If there is opposition to the change, or if some people would like to have the time to review before this might be merged, now is the time to say so. |
@kayceesrk actually found 2 cases that weren't handled correctly:
We think we have solved both issues and also added another test, designed by @kayceesrk, to the testsuite that triggered both issues. |
Question: is the documentation produced by @let-def in #11144 (comment) integrated in the code as relevant, as a comment? If not, then that could be a useful addition to the PR, right? |
runtime/amd64.S
Outdated
@@ -849,6 +907,9 @@ CFI_STARTPROC | |||
movq (Stack_handler-1)(%rdi), %r10 | |||
movq %rsi, Handler_parent(%r10) /* Append to last_fiber */ | |||
leaq 1(%rsi), %rdi /* %rdi (last_fiber) := Val_ptr(old stack) */ | |||
/* Need to update the olderst savde frame pointer here as the execution of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/* Need to update the olderst savde frame pointer here as the execution of | |
/* Need to update the oldest saved frame pointer here as the execution of |
There are code comments in |
Thanks, @fabbing!
Am I correct that it's your point 3 which explains why you need ot
install a signal handler in your test?
In my opinion, this should be understood and fixed as much as posisble
so that there is no need for a signal handler in the test because it
feels like it's hiding a real problem under the hood rather than
properly solving the problem once and for all.
|
Introduction
Frame-pointers support was partially lost with the multicore merge, this PR aims to restore it for amd64 architecture.
A quick introduction to frame-pointers is available at 11031.
Changes
Most of the changes are located inside
amd64.S
and are just adaptations of what was already done before. Nevertheless, some of the changes are new because of the way the runtime currently works.Fiber reallocation
The runtime may need to reallocate a fiber to accomadate for the stack space required for it to perform its work.
When a stack is reallocated, addresses pointing into the reallocated fiber must be updated to reflect the new stack space on the reallocated fiber. Like exception handler addresses are rewritten, the linked list of frame-pointers must be updated.
The
rewrite_frame_pointers
function does this by following the linked list of frame-pointers and adjusting the address values when necessary; it allows to also update addresses located on other stacks (other fibers, or the C stack).Reserving rbp for frame-pointers at C calls
The
rbp
register was used to save information for DWARF unwinding at C calls, therbx
register is now used instead allowingrbp
to be used as base pointer.Added tests to the testsuite
Those should hopefully validate that frame-pointers work as expected.
c_call.ml
: checks the correctness of the frame-pointers backtrace during the execution of different C calls.effects.ml
: checks the correctness of the frame-pointers backtrace in several key locations when using effect.stack_reallocations.ml
: checksrewrite_frame_pointers
works as expected when the computation fiber needs to be reallocated.stack_reallocations2.ml
: checksrewrite_frame_pointers
andcaml_resume
work as expected when the “effect handling” fiber needs to be reallocated.The last two tests detect more problems when run under the debug runtime, since the reallocated fiber stack space is
memset
to a cannary value quickly showing dangling base pointer.Acknowledgement
Thanks to @ctk21, @kayceesrk and @Engil for their advice.
And to @shindere for his help in improving the tests.