fix(engine): disable browser pool for parallel capture workers#1087
Conversation
|
@miguel-heygen why don't our regression tests capture this? |
…mode BeginFrame's compositor is process-global — when multiple pages in the same Chrome instance drive HeadlessExperimental.beginFrame concurrently, they race the compositor and crash with "Protocol error: Target closed". Only disable the pool when BeginFrame mode would actually be active (Linux + headless-shell + not forceScreenshot). Screenshot mode (macOS/Windows) is unaffected and keeps the pool for memory efficiency. Also extracts the frame capture loop into captureFrameRange to reduce function complexity in executeWorkerTask.
8a1d074 to
dc87192
Compare
because regression tests run with 1 worker in docker.. We should fix this |
vanceingalls
left a comment
There was a problem hiding this comment.
Suggested status: Request changes.
Findings:
- Blocking: the pool-disable predicate does not match the actual BeginFrame-mode predicate.
executeWorkerTask()disables the pool for parallel Linux + headless-shell +!forceScreenshot(packages/engine/src/services/parallelCoordinator.ts:245), butcreateCaptureSession()also falls back to screenshot mode whencaptureOptions.deviceScaleFactor > 1(packages/engine/src/services/frameCapture.ts:197). That is the user-facing--output-resolution/ Lambda supersampling path. Those renders are screenshot-mode and should keep the shared browser pool, but this PR now launches one Chrome per worker, which is exactly the memory-efficiency regression the PR says it avoids for screenshot mode. Please reuse/extract the same capture-mode predicate (including supersampling) and add regression coverage for Linux headless-shell with DPR=1 vs DPR>1.
CI note: several current checks were still in progress when I reviewed; I only did static review plus git diff --check locally.
jrusso1020
left a comment
There was a problem hiding this comment.
Traced the full acquire→capture→teardown chain across parallelCoordinator.ts → frameCapture.ts → browserManager.ts. The fix is correct and the teardown is leak-safe. Clean fix for a real-user P0 — lean toward landing.
Per-worker isolation confirmed (not parallelism disabled)
parallel = tasks.length > 1is computed inexecuteParallelCaptureand threaded into each worker.Promise.all(tasks.map(executeWorkerTask))is unchanged — all workers still run concurrently.- The only change per worker is
workerConfig = { ...config, enableBrowserPool: false }whenneedsSeparateBrowsers. Each worker then callscreateCaptureSession(..., workerConfig)→acquireBrowser(chromeArgs, workerConfig), and with the pool offacquireBrowserbypasses all pool branches and returns a fresh standalone browser vialaunchBrowser.
So N parallel workers → N separate Chrome processes → no shared process-global BeginFrame compositor → no race. Parallelism is fully preserved (the PR's 4w/6w bench timings confirm: 1m22s / 1m18s, clearly concurrent).
No resource leak (the hf#1039 tie-in)
The acquire/release pair is symmetric on the same flag:
session.config = workerConfig(stored increateCaptureSession).finally { if (session) await closeCaptureSession(session).catch(() => {}) }runs for every worker, success or error.closeCaptureSession→releaseBrowser(session.browser, session.config); withenableBrowserPool: false,releaseBrowsershort-circuits toawait browser.close()directly — no refcount, no pool retention.- Belt-and-suspenders:
closeCaptureSessionwraps the close inwaitForCloseWithTimeout(5s) and callsforceReleaseBrowser(SIGKILL on the Chrome process) if it hangs. That's the orphan-Chrome guard from hf#1039 doing its job on the per-worker path too.
So each worker's dedicated browser is closed on worker exit. No accumulation across the chunkWorkerCount × chunks matrix.
The guard errs on the safe side (minor perf note, non-blocking)
needsSeparateBrowsers = parallel && linux && !forceScreenshot && resolveHeadlessShellPath(config) !== undefined.
The actual BeginFrame-active condition in createCaptureSession is slightly narrower — it also requires !supersampling (deviceScaleFactor ≤ 1), and launchBrowser can downgrade beginframe→screenshot at launch if probeBeginFrameSupport fails on newer Chromium builds. In those two cases the worker runs in screenshot mode (pool-safe) but needsSeparateBrowsers still disabled the pool → separate browsers spawned unnecessarily. That's a memory pessimization (N browsers when 1 shared would've been fine), never a correctness issue. The guard being a superset of the danger condition is the right call — the dangerous direction (pool ON while BeginFrame active) is impossible here. Fine as-is; just noting if anyone later wonders why a supersampled Linux parallel render uses more RAM.
Highest-leverage follow-up: multi-worker regression lane
This is the prevention item, and it's Miguel's own call from the thread — worth making it a tracked issue, not just a comment. The bug is invisible at workerCount=1 (the PR table: 1 worker works before and after). The regression suite runs single-worker, which is exactly why CI was green through the entire window this shipped a crash to a real user. A regression lane that runs at least one composition at --workers 4 on Linux + headless-shell would have caught the Target closed crash on the first CI run. Strongly recommend filing that as a follow-up so this class can't silently regress.
On the unchecked macOS test-plan box
"macOS parallel render retains pool behavior" is verifiable by inspection: process.platform === "linux" is false on macOS → needsSeparateBrowsers false → workerConfig === config → pool path unchanged. The logic guarantees it even though the box is unticked.
Verdict
Correct fix, leak-safe teardown, conservative guard. Real-user P0 with a clean, single-file change. CI green on the fast lane (Build / Test / Typecheck / Lint / Fallow / CLI smoke / CodeQL); regression-shards + Windows still in flight. Assuming those land green, this should go in — with the multi-worker regression lane as the tracked follow-up.
— Rames Jusso (hyperframes)
Match the full capture-mode predicate from createCaptureSession: DPR > 1 (supersampling) forces screenshot mode, which is pool-safe. Without this check, supersampled parallel renders on Linux would unnecessarily launch separate browsers.
|
Thanks for the review Vance — good catch on the supersampling gap. Pushed 4fc53b5 which adds the const supersampling = (captureOptions.deviceScaleFactor ?? 1) > 1;
const needsSeparateBrowsers =
parallel &&
process.platform === "linux" &&
!config?.forceScreenshot &&
!supersampling &&
resolveHeadlessShellPath(config) !== undefined;This mirrors Multi-worker regression lane is tracked as a follow-up PR. |
jrusso1020
left a comment
There was a problem hiding this comment.
Re-reviewed 4fc53b5 — the supersampling tightening I flagged. Correct and precise.
const supersampling = (captureOptions.deviceScaleFactor ?? 1) > 1;
const needsSeparateBrowsers =
parallel &&
process.platform === "linux" &&
!config?.forceScreenshot &&
!supersampling &&
resolveHeadlessShellPath(config) !== undefined;This now mirrors createCaptureSession's beginframe predicate exactly — same (deviceScaleFactor ?? 1) > 1 check, read from the same captureOptions source. So a supersampled (DPR > 1) parallel Linux render now correctly keeps the shared pool (it runs screenshot mode, which is pool-safe) instead of needlessly launching N browsers.
The one residual I noted earlier — launchBrowser's runtime probeBeginFrameSupport downgrade (beginframe→screenshot on newer Chromium builds) — is still not reflected in this static predicate, and that's correct to leave alone: it's unknowable pre-launch, and the failure direction is safe (pool disabled when screenshot-mode would've been fine = a little extra RAM, never a crash). The predicate being a superset of the true danger condition is the right invariant.
Nothing else changed in this commit (the capture-loop extraction + per-worker isolation + leak-safe teardown from dc87192 are unchanged). My prior verification stands: per-worker isolation correct, finally-block teardown closes each worker's browser with the SIGKILL fallback.
CI is re-running on the new SHA (was green on the fast lane for the parent commit). Assuming it lands green, this is good to merge — with the multi-worker regression lane as the tracked follow-up Miguel committed to.
— Rames Jusso (hyperframes)
* test(producer): add parallel capture regression test Add a regression fixture that forces workers: 2, ensuring the parallel capture code path (browser-per-worker in BeginFrame mode) is exercised in CI. All existing fixtures pin workers: 1, so this is the first test that would catch a regression in the multi-worker pool isolation fix from PR #1087. The composition is 5s @ 30fps (150 frames), which exceeds both MIN_FRAMES_PER_WORKER * 2 (60) and minParallelFrames (120), so the parallel coordinator will always split work across workers. Baseline output/output.mp4 must be generated inside Dockerfile.test before the fixture can run in CI. * test(producer): bump parallel capture test to 4 workers Matches realistic auto-mode worker counts (4-6 on typical machines), not just the minimum (2) that triggers the bug. * test(producer): add golden baseline for parallel capture regression Generated inside Dockerfile.test on amd64 Linux (Docker image hyperframes-producer:test) to match the CI rendering environment. * test(producer): address review feedback on parallel capture test - Add fixture to shard-5 in regression.yml so CI actually runs it - Reframe description: multi-worker path coverage (frame distribution, reorder buffer, per-worker browser lifecycle), not GPU-specific crash guard — SwiftShader CI can't reproduce the hardware compositor race - Remove dead @Keyframes count-up (content doesn't apply to div) - Remove unused CSS animation reference on .counter - Regenerate golden baseline with cleaned-up HTML * fix(producer): replace rAF + CSS keyframes with GSAP in parallel-capture test The composition used requestAnimationFrame for a frame counter and CSS @Keyframes for animations, which triggered screenshot capture mode (non-deterministic across workers) and caused 29 PSNR failures in CI. All animations now use the GSAP timeline, keeping the render in deterministic BeginFrame mode. Baseline regenerated in Docker.
Summary
HeadlessExperimental.beginFrameconcurrently, they race the compositor and Chrome crashes withProtocol error (Runtime.callFunctionOn): Target closed.workerCount > 1. On hosts with hardware GPU + EGL (e.g., AMD Radeon iGPU), the failure manifests as an indefinite hang rather than a clean crash — the render never completes.forceScreenshot). Screenshot mode (macOS/Windows) is unaffected and keeps the shared browser for memory efficiency. Also extracts the frame capture loop intocaptureFrameRangeto reduce function complexity.Reproduction
Tested on AWS Ubuntu (8 vCPU, 30 GB RAM) with a 121.2s 1080×1920 composition containing 13 video clips and 8 sub-compositions:
macOS users are unaffected — screenshot mode (the default on macOS) supports the shared browser pool.
Test plan