fix(engine): disable browser pool for parallel capture workers by miguel-heygen · Pull Request #1087 · heygen-com/hyperframes

miguel-heygen · 2026-05-26T23:50:24Z

Summary

Root cause: BeginFrame's compositor is process-global in Chrome. The browser pool shares one Chrome instance across all parallel workers via refcount. When multiple pages in the same browser drive HeadlessExperimental.beginFrame concurrently, they race the compositor and Chrome crashes with Protocol error (Runtime.callFunctionOn): Target closed.
Impact: Parallel renders fail on Linux (where BeginFrame mode is the default capture path) for any composition with workerCount > 1. On hosts with hardware GPU + EGL (e.g., AMD Radeon iGPU), the failure manifests as an indefinite hang rather than a clean crash — the render never completes.
Fix: Only disable the browser pool when BeginFrame mode would be active (Linux + chrome-headless-shell + not forceScreenshot). Screenshot mode (macOS/Windows) is unaffected and keeps the shared browser for memory efficiency. Also extracts the frame capture loop into captureFrameRange to reduce function complexity.

Reproduction

Tested on AWS Ubuntu (8 vCPU, 30 GB RAM) with a 121.2s 1080×1920 composition containing 13 video clips and 8 sub-compositions:

Workers	Before	After
1	2m 11s (ok)	2m 11s (ok)
2	Crash at frame 60	—
4	Crash immediately	1m 22s (ok)
6	Crash immediately	1m 18s (ok)

macOS users are unaffected — screenshot mode (the default on macOS) supports the shared browser pool.

Test plan

632 engine tests pass
Full monorepo build succeeds
Lint/format/fallow clean
4-worker render completes on Linux (previously crashed)
6-worker render completes on Linux (previously crashed)
1-worker render still works (unaffected path)
macOS parallel render retains pool behavior (screenshot mode doesn't trigger the guard)

jrusso1020 · 2026-05-26T23:52:07Z

@miguel-heygen why don't our regression tests capture this?

…mode BeginFrame's compositor is process-global — when multiple pages in the same Chrome instance drive HeadlessExperimental.beginFrame concurrently, they race the compositor and crash with "Protocol error: Target closed". Only disable the pool when BeginFrame mode would actually be active (Linux + headless-shell + not forceScreenshot). Screenshot mode (macOS/Windows) is unaffected and keeps the pool for memory efficiency. Also extracts the frame capture loop into captureFrameRange to reduce function complexity in executeWorkerTask.

miguel-heygen · 2026-05-26T23:58:06Z

@miguel-heygen why don't our regression tests capture this?

because regression tests run with 1 worker in docker.. We should fix this

vanceingalls

Suggested status: Request changes.

Findings:

Blocking: the pool-disable predicate does not match the actual BeginFrame-mode predicate. executeWorkerTask() disables the pool for parallel Linux + headless-shell + !forceScreenshot (packages/engine/src/services/parallelCoordinator.ts:245), but createCaptureSession() also falls back to screenshot mode when captureOptions.deviceScaleFactor > 1 (packages/engine/src/services/frameCapture.ts:197). That is the user-facing --output-resolution / Lambda supersampling path. Those renders are screenshot-mode and should keep the shared browser pool, but this PR now launches one Chrome per worker, which is exactly the memory-efficiency regression the PR says it avoids for screenshot mode. Please reuse/extract the same capture-mode predicate (including supersampling) and add regression coverage for Linux headless-shell with DPR=1 vs DPR>1.

CI note: several current checks were still in progress when I reviewed; I only did static review plus git diff --check locally.

jrusso1020

Traced the full acquire→capture→teardown chain across parallelCoordinator.ts → frameCapture.ts → browserManager.ts. The fix is correct and the teardown is leak-safe. Clean fix for a real-user P0 — lean toward landing.

Per-worker isolation confirmed (not parallelism disabled)

parallel = tasks.length > 1 is computed in executeParallelCapture and threaded into each worker.
Promise.all(tasks.map(executeWorkerTask)) is unchanged — all workers still run concurrently.
The only change per worker is workerConfig = { ...config, enableBrowserPool: false } when needsSeparateBrowsers. Each worker then calls createCaptureSession(..., workerConfig) → acquireBrowser(chromeArgs, workerConfig), and with the pool off acquireBrowser bypasses all pool branches and returns a fresh standalone browser via launchBrowser.

So N parallel workers → N separate Chrome processes → no shared process-global BeginFrame compositor → no race. Parallelism is fully preserved (the PR's 4w/6w bench timings confirm: 1m22s / 1m18s, clearly concurrent).

No resource leak (the hf#1039 tie-in)

The acquire/release pair is symmetric on the same flag:

session.config = workerConfig (stored in createCaptureSession).
finally { if (session) await closeCaptureSession(session).catch(() => {}) } runs for every worker, success or error.
closeCaptureSession → releaseBrowser(session.browser, session.config); with enableBrowserPool: false, releaseBrowser short-circuits to await browser.close() directly — no refcount, no pool retention.
Belt-and-suspenders: closeCaptureSession wraps the close in waitForCloseWithTimeout (5s) and calls forceReleaseBrowser (SIGKILL on the Chrome process) if it hangs. That's the orphan-Chrome guard from hf#1039 doing its job on the per-worker path too.

So each worker's dedicated browser is closed on worker exit. No accumulation across the chunkWorkerCount × chunks matrix.

The guard errs on the safe side (minor perf note, non-blocking)

needsSeparateBrowsers = parallel && linux && !forceScreenshot && resolveHeadlessShellPath(config) !== undefined.

The actual BeginFrame-active condition in createCaptureSession is slightly narrower — it also requires !supersampling (deviceScaleFactor ≤ 1), and launchBrowser can downgrade beginframe→screenshot at launch if probeBeginFrameSupport fails on newer Chromium builds. In those two cases the worker runs in screenshot mode (pool-safe) but needsSeparateBrowsers still disabled the pool → separate browsers spawned unnecessarily. That's a memory pessimization (N browsers when 1 shared would've been fine), never a correctness issue. The guard being a superset of the danger condition is the right call — the dangerous direction (pool ON while BeginFrame active) is impossible here. Fine as-is; just noting if anyone later wonders why a supersampled Linux parallel render uses more RAM.

Highest-leverage follow-up: multi-worker regression lane

This is the prevention item, and it's Miguel's own call from the thread — worth making it a tracked issue, not just a comment. The bug is invisible at workerCount=1 (the PR table: 1 worker works before and after). The regression suite runs single-worker, which is exactly why CI was green through the entire window this shipped a crash to a real user. A regression lane that runs at least one composition at --workers 4 on Linux + headless-shell would have caught the Target closed crash on the first CI run. Strongly recommend filing that as a follow-up so this class can't silently regress.

On the unchecked macOS test-plan box

"macOS parallel render retains pool behavior" is verifiable by inspection: process.platform === "linux" is false on macOS → needsSeparateBrowsers false → workerConfig === config → pool path unchanged. The logic guarantees it even though the box is unticked.

Verdict

Correct fix, leak-safe teardown, conservative guard. Real-user P0 with a clean, single-file change. CI green on the fast lane (Build / Test / Typecheck / Lint / Fallow / CLI smoke / CodeQL); regression-shards + Windows still in flight. Assuming those land green, this should go in — with the multi-worker regression lane as the tracked follow-up.

— Rames Jusso (hyperframes)

Match the full capture-mode predicate from createCaptureSession: DPR > 1 (supersampling) forces screenshot mode, which is pool-safe. Without this check, supersampled parallel renders on Linux would unnecessarily launch separate browsers.

miguel-heygen · 2026-05-27T00:07:13Z

Thanks for the review Vance — good catch on the supersampling gap. Pushed 4fc53b5 which adds the !supersampling (deviceScaleFactor > 1) check to match the full BeginFrame-mode predicate from createCaptureSession. The guard now reads:

const supersampling = (captureOptions.deviceScaleFactor ?? 1) > 1;
const needsSeparateBrowsers =
    parallel &&
    process.platform === "linux" &&
    !config?.forceScreenshot &&
    !supersampling &&
    resolveHeadlessShellPath(config) !== undefined;

This mirrors frameCapture.ts:197-199 exactly. Supersampled renders (DPR > 1) force screenshot mode, which is pool-safe — they'll keep the shared browser as expected.

Multi-worker regression lane is tracked as a follow-up PR.

jrusso1020

Re-reviewed 4fc53b5 — the supersampling tightening I flagged. Correct and precise.

const supersampling = (captureOptions.deviceScaleFactor ?? 1) > 1;
const needsSeparateBrowsers =
  parallel &&
  process.platform === "linux" &&
  !config?.forceScreenshot &&
  !supersampling &&
  resolveHeadlessShellPath(config) !== undefined;

This now mirrors createCaptureSession's beginframe predicate exactly — same (deviceScaleFactor ?? 1) > 1 check, read from the same captureOptions source. So a supersampled (DPR > 1) parallel Linux render now correctly keeps the shared pool (it runs screenshot mode, which is pool-safe) instead of needlessly launching N browsers.

The one residual I noted earlier — launchBrowser's runtime probeBeginFrameSupport downgrade (beginframe→screenshot on newer Chromium builds) — is still not reflected in this static predicate, and that's correct to leave alone: it's unknowable pre-launch, and the failure direction is safe (pool disabled when screenshot-mode would've been fine = a little extra RAM, never a crash). The predicate being a superset of the true danger condition is the right invariant.

Nothing else changed in this commit (the capture-loop extraction + per-worker isolation + leak-safe teardown from dc87192 are unchanged). My prior verification stands: per-worker isolation correct, finally-block teardown closes each worker's browser with the SIGKILL fallback.

CI is re-running on the new SHA (was green on the fast lane for the parent commit). Assuming it lands green, this is good to merge — with the multi-worker regression lane as the tracked follow-up Miguel committed to.

— Rames Jusso (hyperframes)

* test(producer): add parallel capture regression test Add a regression fixture that forces workers: 2, ensuring the parallel capture code path (browser-per-worker in BeginFrame mode) is exercised in CI. All existing fixtures pin workers: 1, so this is the first test that would catch a regression in the multi-worker pool isolation fix from PR #1087. The composition is 5s @ 30fps (150 frames), which exceeds both MIN_FRAMES_PER_WORKER * 2 (60) and minParallelFrames (120), so the parallel coordinator will always split work across workers. Baseline output/output.mp4 must be generated inside Dockerfile.test before the fixture can run in CI. * test(producer): bump parallel capture test to 4 workers Matches realistic auto-mode worker counts (4-6 on typical machines), not just the minimum (2) that triggers the bug. * test(producer): add golden baseline for parallel capture regression Generated inside Dockerfile.test on amd64 Linux (Docker image hyperframes-producer:test) to match the CI rendering environment. * test(producer): address review feedback on parallel capture test - Add fixture to shard-5 in regression.yml so CI actually runs it - Reframe description: multi-worker path coverage (frame distribution, reorder buffer, per-worker browser lifecycle), not GPU-specific crash guard — SwiftShader CI can't reproduce the hardware compositor race - Remove dead @Keyframes count-up (content doesn't apply to div) - Remove unused CSS animation reference on .counter - Regenerate golden baseline with cleaned-up HTML * fix(producer): replace rAF + CSS keyframes with GSAP in parallel-capture test The composition used requestAnimationFrame for a frame counter and CSS @Keyframes for animations, which triggered screenshot capture mode (non-deterministic across workers) and caused 29 PSNR failures in CI. All animations now use the GSAP timeline, keeping the render in deterministic BeginFrame mode. Baseline regenerated in Docker.

miguel-heygen force-pushed the worktree-fix+parallel-capture-browser-pool branch from 8a1d074 to dc87192 Compare May 26, 2026 23:57

vanceingalls requested changes May 27, 2026

View reviewed changes

jrusso1020 reviewed May 27, 2026

View reviewed changes

miguel-heygen mentioned this pull request May 27, 2026

test(producer): add parallel capture regression test #1088

Merged

3 tasks

miguel-heygen merged commit a9482ed into main May 27, 2026
56 of 58 checks passed

miguel-heygen deleted the worktree-fix+parallel-capture-browser-pool branch May 27, 2026 00:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(engine): disable browser pool for parallel capture workers#1087

fix(engine): disable browser pool for parallel capture workers#1087
miguel-heygen merged 2 commits into
mainfrom
worktree-fix+parallel-capture-browser-pool

miguel-heygen commented May 26, 2026 •

edited

Loading

Uh oh!

jrusso1020 commented May 26, 2026

Uh oh!

miguel-heygen commented May 26, 2026

Uh oh!

vanceingalls left a comment

Uh oh!

jrusso1020 left a comment

Uh oh!

miguel-heygen commented May 27, 2026

Uh oh!

jrusso1020 left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

miguel-heygen commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Reproduction

Test plan

Uh oh!

jrusso1020 commented May 26, 2026

Uh oh!

miguel-heygen commented May 26, 2026

Uh oh!

vanceingalls left a comment

Choose a reason for hiding this comment

Uh oh!

jrusso1020 left a comment

Choose a reason for hiding this comment

Uh oh!

miguel-heygen commented May 27, 2026

Uh oh!

jrusso1020 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

miguel-heygen commented May 26, 2026 •

edited

Loading