Defer page teardown while worker scripts are evaluating#2398

Merged
karlseguin merged 1 commit into
lightpanda-io:mainfrom
staylor:fix/worker-importscripts-segfault
May 10, 2026

@staylor
Contributor

@staylor staylor commented May 8, 2026

Summary

Fixes a use-after-free segfault in worker script evaluation when a CDP message arrives mid-fetch during importScripts().

Reproduction

Drive lightpanda serve with puppeteer-core's puppeteer.connect({ browserWSEndpoint }) against any URL that loads dedicated workers calling importScripts() during initial eval. The Allbirds product page (https://www.allbirds.com/products/mens-wool-runners) loads ~8 web-pixel workers each calling importScripts(), and reliably triggered the crash within 1–10 sequential connections to the same server.

Stack signature (truncated):

Segmentation fault at address 0x...
std/hash_map.zig:798:33 in capacity            ← self.metadata.? - 1 dereferences freed page
std/hash_map.zig:1148:39 in getOrPutAssumeCapacityAdapted
src/browser/js/Context.zig:276 in addIdentity   ← identity_map.getOrPut on a freed Identity
src/browser/js/Local.zig:229    in mapZigInstanceToJs
src/browser/js/Caller.zig:382   in handleError  ← mapping a Zig error to a JS exception
src/browser/js/bridge.zig:161   in wrap         ← V8 callback into Zig
... v8 frames ...
src/browser/js/Local.zig:194    in compileAndRun
src/browser/webapi/Worker.zig:190 in loadInitialScript
src/browser/webapi/Worker.zig:178 in httpDoneCallback

Root cause

WorkerGlobalScope.importScripts performs a synchronous HTTP request via HttpClient.syncRequest. To stay responsive during a long fetch, syncRequest pumps the CDP socket via cdp.blocking_read while waiting for the HTTP response. If a CDP message such as Target.closeTarget arrived on that socket mid-fetch, the dispatcher tore down the page synchronously:

Worker JS → importScripts → syncRequest → blocking_read
  → CDP dispatch → Target.closeTarget
  → Session.removePage → Page.deinit → Frame.deinit
  → Worker.deinit (frees worker arena + identity_map)

When control unwound back into the worker's eval, the next addIdentity call dereferenced the freed identity_map metadata pointer and segfaulted (sometimes immediately on the same connection, sometimes a few connections later as the arena pool recycled the freed memory and a different worker's identity got positioned over the old one).

Session.removePage already had a guard for this exact reentrancy pattern via frame._script_manager.base.is_evaluating, but it never tripped in the worker case because worker scripts don't go through the frame's ScriptManager — they have their own _script_manager on WorkerGlobalScope.
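
To illustrate why the existing guard never tripped, here is a minimal Zig sketch. The types are simplified stand-ins rather than Lightpanda's real definitions: only the `is_evaluating` flag and the idea of separate frame and worker script managers come from the description above, and `frameOnlyGuard` is a hypothetical name for the pre-fix check.

```zig
const std = @import("std");

// Simplified stand-ins; the real Lightpanda types carry far more state.
const ScriptManager = struct {
    is_evaluating: bool = false,
};

const WorkerScope = struct {
    // Workers own a ScriptManager that is separate from the frame's.
    script_manager: ScriptManager = .{},
};

const Frame = struct {
    script_manager: ScriptManager = .{},
    worker: WorkerScope = .{},
};

// Hypothetical model of the pre-fix check: it consults only the
// frame's own ScriptManager.
fn frameOnlyGuard(frame: *const Frame) bool {
    return frame.script_manager.is_evaluating;
}

pub fn main() void {
    var frame = Frame{};
    // A worker script is mid-eval, but the frame itself is idle.
    frame.worker.script_manager.is_evaluating = true;

    // The frame-only guard cannot see the worker's flag, so teardown
    // would proceed while the worker is still on the call stack.
    std.debug.print("frame-only guard sees evaluating = {}\n", .{frameOnlyGuard(&frame)});
}
```

Even if the worker's flag had been set, a check that reads only the frame's manager would still report that nothing is evaluating.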

Fix

Two small changes:

  1. Worker.loadInitialScript now flips _worker_scope._script_manager.is_evaluating around the eval, with was_evaluating save/restore so nested worker evals (e.g. one worker's importScripts synchronously triggering another worker's done-callback via the curl pump) compose correctly.

  2. New helper Session.anyScriptEvaluating(frame) recursively walks the frame tree (the frame's own ScriptManager + every owned worker's ScriptManager + child frames) and returns true if any is mid-eval. Session.removePage and CDP.disposeBrowserContext use this in place of the frame-only check, so teardown is deferred whenever any script — frame, worker, or subframe — is on the call stack. Final cleanup happens at CDP.deinit on connection close, matching the existing deferred-teardown contract documented in Session.removePage.
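
The save/restore in item 1 can be sketched roughly as follows. This is an illustrative model, not the code from the PR: `evalGuarded`, `innerEval`, and `outerEval` are hypothetical names, and the sketch assumes the same script manager is re-entered by a nested eval.

```zig
const std = @import("std");

// Simplified stand-in for the worker's ScriptManager.
const ScriptManager = struct {
    is_evaluating: bool = false,
};

// Hypothetical helper: run `body` with the flag set, then restore the
// previous value instead of unconditionally clearing it.
fn evalGuarded(
    sm: *ScriptManager,
    body: *const fn (*ScriptManager) void,
) void {
    const was_evaluating = sm.is_evaluating;
    sm.is_evaluating = true;
    defer sm.is_evaluating = was_evaluating;
    body(sm);
}

fn innerEval(sm: *ScriptManager) void {
    // The flag is set while the nested eval runs.
    std.debug.assert(sm.is_evaluating);
}

fn outerEval(sm: *ScriptManager) void {
    // A nested eval, e.g. one triggered from inside the outer script.
    evalGuarded(sm, innerEval);
    // Restoring (not clearing) keeps the outer eval marked as active.
    std.debug.assert(sm.is_evaluating);
}

pub fn main() void {
    var sm = ScriptManager{};
    evalGuarded(&sm, outerEval);
    // Once the outermost eval unwinds, the flag returns to false.
    std.debug.assert(!sm.is_evaluating);
}
```

If the inner guard simply wrote `false` on exit, the outer eval would lose its protection for the remainder of its run, which is the situation the save/restore avoids.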

Diff is +38 / -2 across three files: src/browser/Session.zig, src/browser/webapi/Worker.zig, src/cdp/CDP.zig.

Verification

  • Repro fixed: 25 consecutive puppeteer-core connect() runs against the Allbirds URL on the same lightpanda serve process. All returned status=200 with the expected <title> and ~922 KB body, server alive throughout. Pre-fix this crashed within 1–10 runs.
  • Mixed clients: interleaved Puppeteer and Playwright connectOverCDP runs against the same server, no crashes (Playwright still times out on page.goto due to a separate, unrelated bug in the synthetic STARTUP session — out of scope here).
  • Unit tests: 521/521 pass (make test).

Notes / out of scope

While reproducing this I noticed that Playwright's chromium.connectOverCDP cannot navigate against lightpanda serve at all: it auto-attaches to the synthetic STARTUP target Lightpanda advertises and sends Page.navigate on that session; Lightpanda's dispatchStartupCommand blindly replies {} and drops the message, so Playwright waits forever for Page.frameNavigated. Puppeteer's flow (createBrowserContext → createTarget → real session) is unaffected. That's a separate fix; happy to follow up with another PR if useful.

@github-actions

github-actions Bot commented May 8, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@staylor
Contributor Author

staylor commented May 8, 2026

I have read the CLA Document and I hereby sign the CLA

Collaborator

@karlseguin karlseguin left a comment

Thanks for this. We've landed a number of features lately that have, if not introduced this issue, then certainly exacerbated it. It's something we'd like to fix more holistically, but these fixes are good in the meantime and they buy us time to wrap up some other stuff currently in the pipeline and then put thought into the right design.

Comment thread src/browser/Session.zig Outdated
// have been drained while a Zig->JS->Zig stack (e.g. Worker importScripts
// -> syncRequest -> blocking_read) is mid-flight. Recursive over child
// frames so that an evaluating subframe also defers parent teardown.
pub fn anyScriptEvaluating(frame: *const Frame) bool {
Collaborator

This might be slightly nicer ergonomically as a method on Frame. In CDP.zig, the call would change from:

Session.anyScriptEvaluating(&page.frame)

to:

page.frame.anyScriptEvaluating();

// arena and identity_map underneath us. Session.removePage walks
// every frame's workers and bails out when any is_evaluating, so the
// teardown is deferred until the eval unwinds.
const sm = &self._worker_scope._script_manager;
Collaborator

I think we should do the same thing in WorkerGlobalScope.importScript.

Worker scripts can call importScripts(), which performs a synchronous
HTTP request via HttpClient.syncRequest. To stay responsive during a
long fetch, syncRequest pumps the CDP socket (cdp.blocking_read) while
waiting. If a CDP message such as Target.closeTarget arrives on that
socket mid-fetch, the previous code path tore down the page
immediately:

    Worker JS -> importScripts -> syncRequest -> blocking_read
      -> CDP dispatch -> Target.closeTarget
      -> Session.removePage -> Page.deinit -> Frame.deinit
      -> Worker.deinit (frees worker arena + identity_map)

When control unwound back into the worker's eval, the next operation
that hit ctx.identity.identity_map.getOrPut dereferenced the freed
metadata pointer and segfaulted (sometimes immediately, sometimes a
few connections later as the arena got recycled).

Reproducer: any URL that loads dedicated workers calling importScripts
during initial eval, driven via puppeteer-core's puppeteer.connect. The
allbirds.com product page (which loads ~8 web-pixel workers each
calling importScripts) reliably triggered it within ~10 connections.

Session.removePage already deferred when the frame's own
ScriptManager.is_evaluating was set; that guard never tripped because
worker scripts don't go through the frame's ScriptManager. Fix:

  * Worker.loadInitialScript now sets the worker's own
    _worker_scope._script_manager.is_evaluating around the eval, with
    save/restore so nested worker evals compose correctly.

  * WorkerGlobalScope.importScript also sets its own
    _script_manager.is_evaluating around the syncRequest +
    runMacrotasks. The typical caller (Worker.loadInitialScript)
    already sets this around its outer eval, so the outer guard
    usually covers us; the inner mark is defense-in-depth for callers
    that reach importScripts() from a setTimeout / microtask outside
    the loadInitialScript scope.

  * New Frame.anyScriptEvaluating method walks the frame tree (frame
    ScriptManager + every worker's ScriptManager + child frames) and
    returns true if any is mid-eval. Session.removePage and
    CDP.disposeBrowserContext use this in place of the frame-only
    check, deferring teardown until all evals unwind. Final cleanup
    happens at CDP.deinit on connection close, matching the existing
    deferred-teardown contract.
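
The deferred-teardown contract described in the last bullet can be modelled roughly as below. This is a sketch under stated assumptions: the `Page` and `Session` types are toy stand-ins, the `evaluating` field stands in for the result of the frame-tree check, and the sketch's `Session.deinit` plays the role that the text assigns to cleanup on connection close.

```zig
const std = @import("std");

// Toy stand-ins; field and function bodies are illustrative only.
const Page = struct {
    // Stands in for "some frame, worker, or subframe script is mid-eval".
    evaluating: bool = false,
    torn_down: bool = false,

    fn deinit(self: *Page) void {
        self.torn_down = true;
    }
};

const Session = struct {
    page: ?Page = null,

    // Skips teardown while a script is still on the call stack.
    fn removePage(self: *Session) void {
        if (self.page) |*page| {
            if (page.evaluating) return; // deferred, page stays alive
            page.deinit();
            self.page = null;
        }
    }

    // Final cleanup; assumed to run only after all evals have unwound.
    fn deinit(self: *Session) void {
        if (self.page) |*page| page.deinit();
        self.page = null;
    }
};

pub fn main() void {
    var session = Session{ .page = Page{ .evaluating = true } };

    session.removePage();
    // The page survives because an eval is in flight.
    std.debug.assert(session.page != null);

    session.deinit();
    // Connection close performs the teardown that was deferred.
    std.debug.assert(session.page == null);
}
```

The trade-off in this model is that a deferred page lives until final cleanup rather than being freed as soon as the eval unwinds.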

Verified by running the puppeteer-core repro back-to-back against a
single Lightpanda serve; all returned 200 with the right title, no
UAF crashes (was previously crashing within 1-10 runs). All 521 unit
tests still pass.

Note: a separate, pre-existing latent V8 issue surfaces under stress
on this same code path. After many iterations a Runtime.evaluate
promise tracked by V8's inspector PromiseHandlerTracker is discarded
during garbage collection's first-pass weak callbacks; the discard
sends a failure response which triggers v8::String::NewFromOneByte,
hitting the debug-only assertion AllowHeapAllocation::IsAllowed() in
heap-allocator-inl.h:79 (no allocations allowed during weak callbacks).
This reproduces on a baseline build of this PR commit and on a
baseline build of just the original two-line is_evaluating fix,
i.e. it is not introduced by the deferral logic. The deferral makes
it more visible because inspector callbacks now live longer before
teardown, so they are more likely to be alive during a GC. Tracking
this as a follow-up; the fix here still resolves the UAF that was
crashing the server immediately.
@staylor staylor force-pushed the fix/worker-importscripts-segfault branch from 1f761af to 92607ad Compare May 9, 2026 21:26
@staylor
Contributor Author

staylor commented May 9, 2026

Both review suggestions applied (force-pushed 92607ad7):

  • anyScriptEvaluating is now a method on Frame instead of a free function on Session, with the call sites in Session.removePage and CDP.disposeBrowserContext updated to page.frame.anyScriptEvaluating().
  • WorkerGlobalScope.importScript now also sets sm.is_evaluating = true (with was_evaluating save/restore) around the syncRequest + runMacrotasks. The typical caller (Worker.loadInitialScript) already covers this via the outer guard, so the inner mark is defense-in-depth for callers that reach importScripts() from a setTimeout / microtask outside the loadInitialScript scope.
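
For readers who want a picture of the method form, here is a hedged sketch of a recursive tree walk written as a method on a `Frame` type. The struct layouts are assumptions made for illustration (plain slices instead of whatever containers the real `Frame` uses); only the method name and the three things it inspects come from this thread.

```zig
const std = @import("std");

const ScriptManager = struct {
    is_evaluating: bool = false,
};

const Worker = struct {
    script_manager: ScriptManager = .{},
};

const Frame = struct {
    script_manager: ScriptManager,
    // Slices keep the sketch allocation-free; the real containers differ.
    workers: []const Worker,
    children: []const Frame,

    // True if this frame, any of its workers, or any descendant frame
    // is currently evaluating a script.
    pub fn anyScriptEvaluating(self: *const Frame) bool {
        if (self.script_manager.is_evaluating) return true;
        for (self.workers) |worker| {
            if (worker.script_manager.is_evaluating) return true;
        }
        for (self.children) |*child| {
            if (child.anyScriptEvaluating()) return true;
        }
        return false;
    }
};

pub fn main() void {
    // A subframe owns one idle worker and one worker that is mid-eval.
    const workers = [_]Worker{
        .{},
        .{ .script_manager = .{ .is_evaluating = true } },
    };
    const children = [_]Frame{.{
        .script_manager = .{},
        .workers = &workers,
        .children = &.{},
    }};
    const root = Frame{
        .script_manager = .{},
        .workers = &.{},
        .children = &children,
    };

    // The root is idle itself, but the walk finds the busy worker.
    std.debug.assert(root.anyScriptEvaluating());
}
```

With the method form, a call site reads `page.frame.anyScriptEvaluating()` rather than passing the frame to a free function.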

Diff is now +59 / -2 across Frame.zig, Session.zig, webapi/Worker.zig, webapi/WorkerGlobalScope.zig, cdp/CDP.zig. 521/521 unit tests still pass.


While re-running the back-to-back puppeteer-core stress test against the Allbirds repro (25 sequential connections to one server), I uncovered a separate, pre-existing latent V8 inspector lifetime bug that this PR makes more visible. Filed as #2407. tl;dr: V8's inspector tries to allocate a JS string for a Runtime.evaluate failure response from inside a GC weak-callback phase, which V8 forbids in debug builds, and the server aborts with Fatal error in heap-allocator-inl.h, line 79 - AllowHeapAllocation::IsAllowed(). It reproduces both on this PR's commit and on a baseline build of just the original two-line is_evaluating fix from this PR, so it is not introduced by the deferral logic. The deferral does make it more reachable because pending inspector callbacks now live longer (they would previously have been torn down with the page during the syncRequest reentrancy this PR fixes), but the underlying V8 inspector misuse exists independently. Full stack trace, reproduction recipe, and suggested fix directions are in #2407.

The original reentrancy UAF that this PR fixes is straightforwardly resolved (no SEGV, page state stays valid for the duration of the worker eval); the V8 inspector issue can be tracked and fixed separately.

@karlseguin karlseguin merged commit 520d968 into lightpanda-io:main May 10, 2026
34 of 35 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators May 10, 2026
2 participants