WTF: read interrupted SP from ucontext in signalHandlerSuspendResume#235
WTF: read interrupted SP from ucontext in signalHandlerSuspendResume#235robobun wants to merge 1 commit into
Conversation
signalHandlerSuspendResume validated the snapshot by calling currentStackPointer() — the SP of the *handler frame itself*. That assumes the handler always runs on the thread's normal stack. It does, unless someone sets SA_ONSTACK on our sigaction + the thread has a sigaltstack installed, in which case the handler runs on the alt stack and the stack-range check fails on every invocation. The assumption doesn't hold on Linux in practice. Go's cgo runtime walks every signal in runtime.initsig and slaps SA_ONSTACK onto any handler it didn't install itself (including ours). Combined with sanitizer runtimes / libbacktrace / Bun's crash handler installing alternate signal stacks on the main thread, the very next Thread::suspend() delivers SIGPWR onto the alt stack, the check fails on every retry, and the loop spins forever — the Bun event loop hangs at the first WASM compile after the dlopen. See oven-sh/bun#31158 and oven-sh/bun#29843 (Prisma MariaDB adapter, same mechanism). The fix is to read the SP the thread was running at when the signal arrived straight out of the kernel-saved register state. That SP lives in the ucontext and is stable regardless of whether the handler itself runs on the normal or the alt stack. The existing retry-on-alt-stack semantics still work: if the thread genuinely was on an alt stack when the signal arrived (e.g. a nested handler), the ucontext SP reflects that and the check backs off as before. Keep currentStackPointer() as the fallback for platforms without a machine context.
|
Small suggestion: now that the handler is robust to SA_ONSTACK via ucontext SP, the comment on the SIGPWR override (from Suggested replacement: #if OS(LINUX) && USE(BUN_JSC_ADDITIONS)
// Thread suspension on Linux uses a signal (macOS uses Mach ports instead —
// this entire signal-based mechanism is Linux/FreeBSD-only). We override the
// default SIGUSR1 to SIGPWR because npm packages commonly register
// process.on('SIGUSR1') handlers. SIGPWR ("power failure") is effectively
// unused on modern Linux.
//
// Note: dlopen'd runtimes (Go cgo, Mono, JNI) may add SA_ONSTACK to this
// handler's sigaction flags via their init routines. The handler reads the
// interrupted SP from ucontext rather than its own frame pointer, making it
// robust to alt-stack delivery regardless of flag mutation.
// See oven-sh/bun#31158.
g_wtfConfig.sigThreadSuspendResume = SIGPWR;
#endifThis ties together the "why SIGPWR", "why Linux-only", and "why it's safe" in one place. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
WalkthroughThe signal-handler suspend/resume mechanism is improved to detect the interrupted thread's stack pointer from machine-context ChangesSignal handler stack pointer reliability
🚥 Pre-merge checks | ✅ 3 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Warning Review ran into problems🔥 ProblemsGit: Failed to clone repository. Please run the Comment |
Preview Builds
|
Point WEBKIT_VERSION at the preview autobuild of oven-sh/WebKit#235, which teaches signalHandlerSuspendResume to read the interrupted thread's SP from the ucontext instead of the handler's own currentStackPointer(). That's the root-cause fix — regardless of what any dlopen'd library (Go cgo, Mono, JNI, Rust cdylib, …) does to SA_ONSTACK, the stack-range check always sees the interrupted stack rather than wherever the handler landed. Delete the per-dlopen sigaction scan on the Bun side. With the WTF change in place it's pure overhead, and the reporter rightly pointed out that chaining workarounds for workarounds is the wrong direction. Regression test (test/regression/issue/31158.test.ts) is unchanged and now exercises the upstream fix directly. Swap the preview-pr-235 tag for the merged autobuild hash once oven-sh/WebKit#235 lands.
Fixes the SIGPWR suspend-loop deadlock reported in oven-sh/bun#31158 (and oven-sh/bun#29843 — same mechanism via Prisma's WASM-backed adapter).
The bug
Thread::suspend()sends the GC suspend-resume signal to the target thread and then spins:The handler decides whether to publish a register snapshot by checking
currentStackPointer()— the SP of the handler's own frame — againstthread->m_stack:That only works so long as the handler runs on the normal stack. If
SA_ONSTACKis set on the sigaction and the thread has asigaltstackinstalled, the handler runs on the alt stack, the check fails on every retry, and the loop spins forever.Who trips this
Go's cgo runtime.
runtime.initsigwalks every signal and, for any handler it didn't install itself, callssetsigstackto force-addSA_ONSTACKso Go's own threads' synchronous faults stay on a managed alt stack. That includes our SIGPWR handler. Any c-shared library that follows the same recipe (Mono, some JNI layouts, Rust cdylibs) hits it too.Combined with sanitizer runtimes / libbacktrace / the host's crash handler installing an alternate signal stack on the main thread (ASAN does this unconditionally), the next GC thread-suspend delivers SIGPWR onto the alt stack → stack check fails every retry → the suspender wedges at 100% CPU.
Reproducible in a plain C program to confirm
currentStackPointervs the ucontext-saved SP:The fix
Read the SP the thread was running at when the signal arrived straight out of the ucontext (kernel-saved register state). That SP is stable regardless of whether the handler itself runs on the normal or the alt stack, and the existing retry-on-alt-stack semantics still work: if the thread genuinely was on an alt stack when the signal arrived (e.g. a nested handler), the ucontext SP reflects that and the check backs off as before.
currentStackPointer()stays as the fallback for non-HAVE(MACHINE_CONTEXT) platforms.Test
Regression test coming from the Bun side (oven-sh/bun#31161) — spawns a child that sets
SA_ONSTACKon the suspend-resume signal the same way Go'sinitsigdoes, then drives the WASM install path that triggersresetInstructionCacheOnAllThreads. Passes with this fix applied; hangs to the test-runner timeout without it.