Skip to content

Avoid traversing C parts of frame pointer chain when reallocating stack#13635

Merged
dra27 merged 1 commit into
ocaml:trunkfrom
stedolan:update-frame-pointers
Dec 6, 2024
Merged

Avoid traversing C parts of frame pointer chain when reallocating stack#13635
dra27 merged 1 commit into
ocaml:trunkfrom
stedolan:update-frame-pointers

Conversation

@stedolan

Copy link
Copy Markdown
Contributor

When the OCaml stack grows we need to rewrite frame pointers (if enabled) to point to the new stack.

However, when using a C library that was not compiled with frame pointers enabled, we cannot assume that there is an unbroken chain of frame pointers through both the OCaml and C parts of the stack. Doing so leads to segfaults (#13575).

Instead, note that the only frame pointers that can point to OCaml stacks (the ones that need updating) are those already on OCaml stacks, plus the first ones pushed after any OCaml->C calls. These can be found by traversing the struct c_stack_link chain, without needing to traverse any intervening C frames.

This imposes a new constraint on the runtime assembly stubs: after switching to C they must not push anything to the stack before calling a C function. This was already true for all but caml_c_call_stack_args. Enforcing this invariant for caml_c_call_stack_args is straightforward enough, and simplifies the DWARF backtrace logic. (A side effect of this change is that DWARF backtraces now work through caml_c_call_stack_args on arm64 - this was broken before)

I've added a new hairy test for this logic, checking that the frame pointer chain continues to make sense after the stack is reallocated inside a call to a C function compiled without frame pointers (qsort from libc, tested on my non-FP-enabled Debian machine), inside a C callback using stack args, inside a C callback not using stack args, inside a finalizer.

Comment thread runtime/arm64.S Outdated
Comment thread runtime/arm64.S
To guarantee this when stack arguments are used, the actual pushing
of arguments is done by this separate function */
FUNCTION(caml_c_call_copy_stack_args)
CFI_STARTPROC

@tmcgilchrist tmcgilchrist Nov 27, 2024

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not labelling this with CFI_SIGNAL_FRAME will mean gdb shows this as a stack frame rather than <signal handler called>. On ARM64 Linux running the c_call.ml test:

(gdb) bt
#0  0x0000aaaaaaae8078 in caml_c_call_copy_stack_args ()
#1  <signal handler called>
#2  0x0000aaaaaaabc618 in camlC_call.f_274 () at c_call.ml:16
#3  0x0000aaaaaaabc6e8 in camlC_call.entry () at c_call.ml:24
#4  0x0000aaaaaaabafdc in caml_program ()
#5  <signal handler called>
#6  0x0000aaaaaaae7b88 in caml_startup_common (pooling=-1430997904, argv=0xaaaaaab4b460) at runtime/startup_nat.c:127

macOS LLDB doesn't care and will show the full backtrace, using the same test case.

* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
  * frame #0: 0x0000000100008a00 c_call.opt`fp_backtrace_many_args(argv0=4302438272, a=3, b=5, c=7, d=9, e=11, f=13, g=15, h=17, i=19, j=21, k=23) at c_call_.c:24:3 [opt]
    frame #1: 0x00000001000391d0 c_call.opt`caml_c_call_copy_stack_args + 36
    frame #2: 0x0000000100039180 c_call.opt`caml_c_call_stack_args + 52
    frame #3: 0x0000000100003a98 c_call.opt`camlC_call.f_274 + 128
    frame #4: 0x0000000100003b68 c_call.opt`camlC_call.entry + 72
    frame #5: 0x000000010000344c c_call.opt`caml_program + 116
    frame #6: 0x0000000100039260 c_call.opt`caml_start_program + 132

I noticed the stack trace gets broken when you enter fp_backtrace_many_args on ARM64 GDB:

(gdb) bt
#0  fp_backtrace_many_args (argv0=281474571960208, a=<optimized out>, b=5, c=7, d=9, e=11, f=13, g=15, h=17, i=19, 
    j=21, k=23) at c_call_.c:25
#1  0x0000aaaaaaae8090 in caml_c_call_copy_stack_args ()
#2  0x0000aaaaaab4b450 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) info breakpoints 
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x0000aaaaaaac1600 in fp_backtrace_many_args at c_call_.c:24
	breakpoint already hit 1 time
....

There's probably something slightly wrong with the CFI information being emitted.

For me GDB is getting lost on line 647 stp TMP, TMP2, [sp, -16]!; CFI_ADJUST(16)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for spotting this!

That CFI_ADJUST is wrong. I removed it in amd64.S and forgot to when mirroring the changes in arm64.S (I don't know how lldb managed to successfully get a backtrace with the broken CFI entry, though - I tested the qsort.ml program, stopping in check_frames).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not labelling this with CFI_SIGNAL_FRAME will mean gdb shows this as a stack frame rather than <signal handler called>.

This is true but I'm happy with that. The CFI_SIGNAL_FRAME thing is a trick to convince gdb to continue unwinding past a break in stacks, which isn't needed for caml_c_call_copy_stack_args since it's called directly on the C stack. I'm happy for it to appear in backtraces, because that means in particular that it shows up in stack-based profilers: copying stack args is rare and somewhat expensive, and so I think we shouldn't hide that it's happening.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it's useful to see properly labelled frames.

Comment thread runtime/amd64.S Outdated

@tmcgilchrist tmcgilchrist left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now.

I get this backtrace with Linux ARM64 as expected.

(gdb) bt
#0  fp_backtrace_many_args (argv0=281474571960208, a=3, b=5, c=7, d=9, e=11, f=13, g=15, h=17, i=19, j=21, k=23)
    at c_call_.c:24
#1  0x0000aaaaaaae8090 in caml_c_call_copy_stack_args ()
#2  <signal handler called>
#3  0x0000aaaaaaabc618 in camlC_call.f_274 () at c_call.ml:16
#4  0x0000aaaaaaabc6e8 in camlC_call.entry () at c_call.ml:24
#5  0x0000aaaaaaabafdc in caml_program ()
#6  <signal handler called>
#7  0x0000aaaaaaae7b88 in caml_startup_common (pooling=-1430997936, argv=0xaaaaaab4b460) at runtime/startup_nat.c:127
#8  caml_startup_common (argv=0xaaaaaab4b460, pooling=-1430997936) at runtime/startup_nat.c:86
#9  0x0000aaaaaaae7c00 in caml_startup_exn (argv=<optimized out>) at runtime/startup_nat.c:134
#10 caml_startup (argv=<optimized out>) at runtime/startup_nat.c:139
#11 caml_main (argv=<optimized out>) at runtime/startup_nat.c:146
#12 0x0000aaaaaaabadd0 in main (argc=<optimized out>, argv=<optimized out>) at runtime/main.c:37
(gdb) q

@dra27

dra27 commented Nov 28, 2024

Copy link
Copy Markdown
Member

The MSVC failures are just because amd64nt.asm needs updating (which I've done locally), buuuut... I'm getting a repeatable segfault with mingw-w64 (contradicting AppVeyor) for:

  • tests/basic-manyargs/manyargs.ml
  • tests/callback/test5.ml
  • tests/unboxed-primitive-args/test.ml
    (more tests fail on msvc64)

tmcgilchrist added a commit to tmcgilchrist/ocaml that referenced this pull request Dec 2, 2024
Comment thread runtime/amd64.S
CFI_STARTPROC
/* Set up a frame pointer even without WITH_FRAME_POINTERS,
which we use to pop an unknown number of arguments later */
pushq %rbp; CFI_ADJUST(8)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why you don't use ENTER_FUNCTION and LEAVE_FUNCTION, like arm64, in this function?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's to achieve Set up a frame pointer even without WITH_FRAME_POINTERS: the ENTER_FUNCTION and LEAVE_FUNCTION macros are noops without frame pointers, but this function actually uses %rbp to restore the stack and needs the push/pop logic even when WITH_FRAME_POINTERS is off.

(This doesn't apply to arm64, because on arm64 ENTER_FUNCTION and LEAVE_FUNCTION unconditionally set up a frame pointer, presumably because arm64 is less register-starved so optimising this away is less important)

When the OCaml stack grows we need to rewrite frame pointers (if enabled)
to point to the new stack.

However, when using a C library that was not compiled with frame pointers
enabled, we cannot assume that there is an unbroken chain of frame pointers
through both the OCaml and C parts of the stack. Doing so leads to segfaults.

Instead, we note that the only frame pointers that can point to OCaml stacks
(the ones that need updating) are those already on OCaml stacks, plus the first
ones pushed after any OCaml->C calls. These can be found by traversing the
struct c_stack_link chain, without needing to traverse any intervening C frames.

This imposes a new constraint on the runtime assembly stubs: after switching to
C they must not push anything to the stack before calling a C function. This
was already true for all but caml_c_call_stack_args. Enforcing this invariant
for caml_c_call_stack_args is straightforward enough, and simplifies the DWARF
backtrace logic.

For arm64, a side-effect of this change is that DWARF backtraces now work on
stacks containing calls to caml_c_call_stack_args, which were broken before.
(Tested with macos lldb)
@stedolan stedolan force-pushed the update-frame-pointers branch from 54f8005 to d32da79 Compare December 5, 2024 16:56

@dra27 dra27 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has gone through precheck#999 as well.

Adding a "green tick" to @tmcgilchrist and @fabbing's reviewing... @stedolan and I went through this synchronously yesterday (and thoroughly convinced ourselves that the Windows change is correct!)

@dra27 dra27 merged commit d90601c into ocaml:trunk Dec 6, 2024
@dra27

dra27 commented Dec 6, 2024

Copy link
Copy Markdown
Member

My intuition is that this is not fixing a regression in 5.3.0 itself, and therefore should wait for 5.4.0, but could be cherry-picked to 5.3.1 to get it released a little sooner?

@tmcgilchrist

tmcgilchrist commented Dec 6, 2024 via email

Copy link
Copy Markdown
Contributor

@Octachron

Copy link
Copy Markdown
Member

I admit that I don't have a very clear view of the safety of the fix.

Nevertheless, taking in account that this fix changes the c call assembly even outside of the frame pointer mode, and that we will not have the time for a wide scale test before the upcoming release of OCaml 5.3.0, I am inclined to not include the fix in 5.3.0 and wait for 5.3.1 .

However, if @fabbing and @tmcgilchrist are certain that the fix cannot have introduced any bug, I could be convinced to backport it to 5.3.0 .

@stedolan

Copy link
Copy Markdown
Contributor Author

I'd like to see it in 5.3.1, but I don't think it's worth sneaking this patch into 5.3.0 at the last moment.

5.2.0 has the issue in #13575 and 5.3.0 is no worse in this regard, so the bug is not new and letting the fix wait until 5.3.1 seems fine to me.

@tmcgilchrist

Copy link
Copy Markdown
Contributor

I agree including it in 5.3.1 would be fine.

OlivierNicole pushed a commit to OlivierNicole/ocaml that referenced this pull request Sep 16, 2025
Avoid traversing C parts of frame pointer chain when reallocating stack
tmcgilchrist added a commit to tmcgilchrist/ocaml that referenced this pull request Jan 29, 2026
tmcgilchrist added a commit to tmcgilchrist/ocaml that referenced this pull request Jan 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants