Defer page teardown while worker scripts are evaluating#2398
Conversation
|
All contributors have signed the CLA ✍️ ✅ |
|
I have read the CLA Document and I hereby sign the CLA |
karlseguin
left a comment
There was a problem hiding this comment.
Thanks for this. We've landed a number of features lately that have, if not introduced this issue, then certainly exasperated it. It's something we'd like to fix more holistically, but these fixes are good in the meantime and they buy us time to wrap up some other stuff currently in the pipeline and then put thought into the right design.
| // have been drained while a Zig->JS->Zig stack (e.g. Worker importScripts | ||
| // -> syncRequest -> blocking_read) is mid-flight. Recursive over child | ||
| // frames so that an evaluating subframe also defers parent teardown. | ||
| pub fn anyScriptEvaluating(frame: *const Frame) bool { |
There was a problem hiding this comment.
this might be a bit nicer ergonomics as a method on Frame. In CDP.zig, it would change from:
Session.anyScriptEvaluating(&page.frame)to:
page.frame.anyScriptEvaluating();| // arena and identity_map underneath us. Session.removePage walks | ||
| // every frame's workers and bails out when any is_evaluating, so the | ||
| // teardown is deferred until the eval unwinds. | ||
| const sm = &self._worker_scope._script_manager; |
There was a problem hiding this comment.
I think we should do the same thing in WorkerGlobalScope.importScript.
Worker scripts can call importScripts(), which performs a synchronous
HTTP request via HttpClient.syncRequest. To stay responsive during a
long fetch, syncRequest pumps the CDP socket (cdp.blocking_read) while
waiting. If a CDP message such as Target.closeTarget arrives on that
socket mid-fetch, the previous code path tore down the page
immediately:
Worker JS -> importScripts -> syncRequest -> blocking_read
-> CDP dispatch -> Target.closeTarget
-> Session.removePage -> Page.deinit -> Frame.deinit
-> Worker.deinit (frees worker arena + identity_map)
When control unwound back into the worker's eval, the next operation
that hit ctx.identity.identity_map.getOrPut dereferenced the freed
metadata pointer and segfaulted (sometimes immediately, sometimes a
few connections later as the arena got recycled).
Reproducer: any URL that loads dedicated workers calling importScripts
during initial eval, driven via puppeteer-core's connectOverCDP. The
allbirds.com product page (which loads ~8 web-pixel workers each
calling importScripts) reliably triggered it within ~10 connections.
Session.removePage already deferred when the frame's own
ScriptManager.is_evaluating was set; that guard never tripped because
worker scripts don't go through the frame's ScriptManager. Fix:
* Worker.loadInitialScript now sets the worker's own
_worker_scope._script_manager.is_evaluating around the eval, with
save/restore so nested worker evals compose correctly.
* WorkerGlobalScope.importScript also sets its own
_script_manager.is_evaluating around the syncRequest +
runMacrotasks. The typical caller (Worker.loadInitialScript)
already sets this around its outer eval, so the outer guard
usually covers us; the inner mark is defense-in-depth for callers
that reach importScripts() from a setTimeout / microtask outside
the loadInitialScript scope.
* New Frame.anyScriptEvaluating method walks the frame tree (frame
ScriptManager + every worker's ScriptManager + child frames) and
returns true if any is mid-eval. Session.removePage and
CDP.disposeBrowserContext use this in place of the frame-only
check, deferring teardown until all evals unwind. Final cleanup
happens at CDP.deinit on connection close, matching the existing
deferred-teardown contract.
Verified by running the puppeteer-core repro back-to-back against a
single Lightpanda serve; all returned 200 with the right title, no
UAF crashes (was previously crashing within 1-10 runs). All 521 unit
tests still pass.
Note: a separate, pre-existing latent V8 issue surfaces under stress
on this same code path. After many iterations a Runtime.evaluate
promise tracked by V8's inspector PromiseHandlerTracker is discarded
during garbage collection's first-pass weak callbacks; the discard
sends a failure response which triggers v8::String::NewFromOneByte,
hitting the debug-only assertion AllowHeapAllocation::IsAllowed() in
heap-allocator-inl.h:79 (no allocations allowed during weak callbacks).
This reproduces on a baseline build of this PR commit and on a
baseline build of just the original two-line is_evaluating fix \u2014
i.e. it is not introduced by the deferral logic. The deferral makes
it more visible because inspector callbacks now live longer before
teardown, so they are more likely to be alive during a GC. Tracking
this as a follow-up; the fix here still resolves the UAF that was
crashing the server immediately.
1f761af to
92607ad
Compare
|
Both review suggestions applied (force-pushed
Diff is now +59 / -2 across While re-running the back-to-back puppeteer-core stress test against the Allbirds repro (25 sequential connections to one server), I uncovered a separate, pre-existing latent V8 inspector lifetime bug that this PR makes more visible. Filed as #2407. tl;dr: V8's inspector tries to allocate a JS string for a The original reentrancy UAF that this PR fixes is straightforwardly resolved (no SEGV, page state stays valid for the duration of the worker eval); the V8 inspector issue can be tracked and fixed separately. |
Summary
Fixes a use-after-free segfault in worker script evaluation when a CDP message arrives mid-fetch during
importScripts().Reproduction
Drive
lightpanda servewithpuppeteer-core'spuppeteer.connect({ browserWSEndpoint })against any URL that loads dedicated workers callingimportScripts()during initial eval. The Allbirds product page (https://www.allbirds.com/products/mens-wool-runners) loads ~8 web-pixel workers each callingimportScripts(), and reliably triggered the crash within 1–10 sequential connections to the same server.Stack signature (truncated):
Root cause
WorkerGlobalScope.importScriptsperforms a synchronous HTTP request viaHttpClient.syncRequest. To stay responsive during a long fetch,syncRequestpumps the CDP socket viacdp.blocking_readwhile waiting for the HTTP response. If a CDP message such asTarget.closeTargetarrives on that socket mid-fetch, the dispatcher synchronously tore down the page:When control unwound back into the worker's eval, the next
addIdentitycall dereferenced the freed identity_map metadata pointer and segfaulted (sometimes immediately on the same connection, sometimes a few connections later as the arena pool recycled the freed memory and a different worker's identity got positioned over the old one).Session.removePagealready had a guard for this exact reentrancy pattern viaframe._script_manager.base.is_evaluating, but it never tripped in the worker case because worker scripts don't go through the frame'sScriptManager— they have their own_script_manageronWorkerGlobalScope.Fix
Two small changes:
Worker.loadInitialScriptnow flips_worker_scope._script_manager.is_evaluatingaround theeval, withwas_evaluatingsave/restore so nested worker evals (e.g. one worker'simportScriptssynchronously triggering another worker's done-callback via the curl pump) compose correctly.New helper
Session.anyScriptEvaluating(frame)recursively walks the frame tree (the frame's ownScriptManager+ every owned worker'sScriptManager+ child frames) and returnstrueif any is mid-eval.Session.removePageandCDP.disposeBrowserContextuse this in place of the frame-only check, so teardown is deferred whenever any script — frame, worker, or subframe — is on the call stack. Final cleanup happens atCDP.deiniton connection close, matching the existing deferred-teardown contract documented inSession.removePage.Diff is +38 / -2 across three files:
src/browser/Session.zig,src/browser/webapi/Worker.zig,src/cdp/CDP.zig.Verification
puppeteer-coreconnect()runs against the Allbirds URL on the samelightpanda serveprocess. All returnedstatus=200with the expected<title>and ~922 KB body, server alive throughout. Pre-fix this crashed within 1–10 runs.connectOverCDPruns against the same server, no crashes (Playwright still times out onpage.gotodue to a separate, unrelated bug in the syntheticSTARTUPsession — out of scope here).make test).Notes / out of scope
While reproducing this I noticed that Playwright's
chromium.connectOverCDPcannot navigate againstlightpanda serveat all: it auto-attaches to the syntheticSTARTUPtarget Lightpanda advertises, sendsPage.navigateon that session, and Lightpanda'sdispatchStartupCommandblindly replies{}and drops the message — Playwright then waits forever forPage.frameNavigated. Puppeteer's flow (createBrowserContext→createTarget→ real session) is unaffected. That's a separate fix; happy to follow up with another PR if useful.