fix(producer): harden capture against timeouts, transient tab deaths, and OOM (P2-5)#1842
Conversation
… and OOM Addresses the P2-5 capture-timeout/stability failure bucket (~15K errors / ~7K users / 30d): "Target closed", "Runtime.callFunctionOn timed out", "Streaming encoder exited before frame 0", "Set maximum size exceeded". - Scale the CDP protocolTimeout up for large canvases by *output* pixel area, applied BEFORE the probe launches its browser (protocolTimeout is baked in at ppt.launch() and the probe browser is reused for capture, so a post-probe mutation was a no-op on the common path). Only ever raises; clamped to 30min. - Single bounded retry on a transient browser death (Target closed / Page crashed / Session closed) in the parallel disk-capture loop, at the same worker count and independent of forward progress, so a tab that dies before frame 0 recovers. Rethrows immediately when the render was aborted so cancellation never burns a retry. - Turn cryptic OOM failures (Set maximum size exceeded, heap-limit aborts) into actionable guidance naming the output dimensions and the levers that reduce memory pressure. Memory-exhaustion classification is deliberately narrow and disjoint from transient classification. - Surface FFmpeg's own failure reason in "Streaming encoder exited before frame N" via a new StreamingEncoder.getExitError(), instead of a bare frame-indexed message. Scope: transient capture retry covers the parallel disk path; sequential and streaming paths run a single stateful session/encoder and are out of scope here (probeStage already retries the shared session-init phase). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
miga-heygen
left a comment
There was a problem hiding this comment.
Review — fix(producer): harden capture against timeouts, transient tab deaths, and OOM (P2-5)
Verdict: LGTM ✅
Four independent hardening changes, each well-scoped and well-tested.
1. protocolTimeout auto-scaling
scaleProtocolTimeoutForComposition scales by output pixel area relative to a 1080p reference. Pure function, correct properties:
- Never scales down (small composition → keeps base). ✅
- Ceiling at 30 min (pathological canvas → bounded). ✅
- An already-high base is never lowered (
ceiling = max(base, MAX)). ✅ - Degenerate dims (0, NaN) → base. ✅
- Applied BEFORE probe browser launch — correct, since
protocolTimeoutis a Puppeteer connection-level setting baked in atppt.launch()and immutable afterwards. The probe browser is reused for capture on the common single-worker path, so a post-probe mutation would be a no-op. ✅
2. Single bounded transient retry
MAX_TRANSIENT_CAPTURE_RETRIES = 1. On isTransientBrowserError (Target closed / Page crashed / Session closed):
- Retries once at the SAME worker count (parallelism isn't the problem). ✅
- Does NOT require forward progress (a tab that dies before frame 0 is the exact case to recover). ✅
- Rethrows immediately when the render was aborted (cancellation never burns a retry). ✅
- Bounded at 1 so a deterministically-dying tab still fails. ✅
- Runs BEFORE the worker-halving retry in the catch block — correct ordering. ✅
- Abort check runs FIRST in the catch block — correct priority. ✅
Scope decision documented: covers the parallel disk-capture path only. Sequential and streaming paths have their own session-init retry via probeStage. Noted, not blocking.
3. Actionable OOM handling
isMemoryExhaustionError — deliberately narrow patterns (Set maximum size exceeded, Map maximum size, Invalid array/string length, Array buffer allocation failed, Reached heap limit, JavaScript heap out of memory). Intentionally does NOT match bare "out of memory" (WebGL/GPU console noise). Test explicitly asserts disjointness from transient classification (a retry re-hits the same ceiling). ✅
describeMemoryExhaustion — formats actionable guidance naming the output dimensions (device-scaled, not CSS) that exhausted memory, plus the levers that actually reduce pressure (lower resolution, fewer workers, low-memory mode). Correct: memory is allocated at output resolution. ✅
captureCompositionWidth/captureCompositionHeight/captureTotalFrames are declared at function scope and assigned inside the try — structurally required to cross the try/catch boundary.
4. Encoder exit error
StreamingEncoder.getExitError() exposes exit code + stderr tail synchronously. ensureFrameWritten now includes it — so a frame-0 encoder death (bad args, unsupported codec) is actionable instead of a bare "encoder exited before frame 0". Tests verify non-zero exit, clean exit (undefined), and the frame-write integration. ✅
Ponytail lens
Lean already. Ship.
— Miga
vanceingalls
left a comment
There was a problem hiding this comment.
🟢 Well-scoped hardening; classifier disjointness and abort-first ordering both look right
Audited under the widened-retry-policy lens: MAX_TRANSIENT_CAPTURE_RETRIES = 1 bounds the blast radius to 2× wall-clock at the same worker count, transient/OOM classifiers are provably disjoint (test at frameCapture-transientErrors.test.ts explicitly asserts this), abort short-circuits BEFORE the transient branch (renderOrchestrator.ts:751-753), and executeParallelCapture spins fresh worker browsers per call so the "fresh session" claim in the log is factual, not aspirational. Nothing blocking.
Two heads-ups and one nit for R2/follow-up consideration.
🟠 Telemetry gap on the new retry branch
File: packages/producer/src/services/renderOrchestrator.ts:780-791
Transient retry fires only a log.warn. Neither chaos.transient_retry.fired nor chaos.transient_retry.exhausted are emitted, and neither is the OOM classification (via describeMemoryExhaustion). This is the exact shape of the telemetry-gap-on-parallel-branch failure mode: when this ships, we won't be able to answer "are we burning the retry budget?" or "is OOM now the dominant tail?" from a metric — only from log grep. Given the PR is explicitly tied to PostHog dashboard 1783183, plumbing a counter alongside the log (even a single render.transient_retry counter with {result: fired|exhausted} tags, and render.oom_classified for the OOM branch) would close the loop. Not blocking for the fix, but worth stapling on before we lose the observability chance.
🟠 cfg.protocolTimeout mutated in place
File: packages/producer/src/services/renderOrchestrator.ts:848
cfg.protocolTimeout = scaledProtocolTimeout mutates the caller's config object. If cfg is reused across renders in the same worker process (I couldn't verify one way or the other from the diff), the scaled ceiling persists into the next job — a big 4K job then a small 720p job would keep the 4K's scaled timeout. Prefer a local const effectiveProtocolTimeout = scaleProtocolTimeoutForComposition(...) threaded to launchProbeStage (or wherever ppt.launch is called), leaving cfg immutable. If the audit shows cfg is per-job in practice, downgrade to nit.
nit — transient list includes deterministic launch-failure pattern
File: packages/engine/src/services/frameCapture.ts:1993
/Failed to launch the browser process/i is in the transient set. On a host with a permanent problem (no shm, missing headless-shell binary quirk on that release, wrong sandboxing flag), this classification will burn the one retry before falling through to the halving path. It's still bounded at 1 retry, so this doesn't warrant a block — but worth documenting that this specific pattern is deliberately included despite occasionally being deterministic, or splitting it out with a shorter budget. Sequential/streaming paths being out of scope is called out cleanly.
🟠 pr-envelope
Commit trailer Co-Authored-By: Claude Opus 4.8 (1M context) + body footer 🤖 Generated with [Claude Code] are present. Miguel's bar on HF treats both as request-changes; flagging for the author to strip before merge per team convention.
Review by Via
miguel-heygen
left a comment
There was a problem hiding this comment.
Clean P2-5 hardening pass; I agree with the ship shape.
What I verified:
packages/producer/src/services/renderOrchestrator.ts:792-810adds exactly one transient browser retry, checks abort first at:760-767, and retries with the same worker count rather than mixing it with the worker-halving path.packages/engine/src/services/frameCapture.ts:1984-2007andpackages/engine/src/services/frameCapture-transientErrors.test.ts:49-90keep transient and memory-exhaustion classifiers disjoint, so OOM doesn't burn the transient retry budget.packages/producer/src/services/renderOrchestrator.ts:1996-2001surfaces memory-exhaustion guidance with output dimensions/frame count when available.packages/engine/src/services/streamingEncoder.ts:143-150pluspackages/producer/src/services/render/stages/captureHdrFrameShared.tslet frame-write failures include the FFmpeg exit reason instead of the old bare frame-0 message.- On Via's config-mutation note:
packages/producer/src/services/renderOrchestrator.ts:991clonesjob.config.producerConfiginto a localcfg, so thecfg.protocolTimeoutscaling at:1194-1207does not leak back into the caller's config across jobs.
Non-blocking follow-up: packages/producer/src/services/renderOrchestrator.ts:792-807 logs transient retries but does not emit a dedicated metric/counter, so ops will need logs/perf summaries to quantify retry hit-rate until that is added. CI is green.
Verdict: APPROVE
Reasoning: The retry, OOM, timeout, and encoder-error paths are bounded, disjoint where they need to be, and covered by focused tests; the remaining telemetry gap is a follow-up, not a merge blocker.
— Magi
james-russo-rames-d-jusso
left a comment
There was a problem hiding this comment.
Reviewed at 2674fc4fdd.
Peer scan: miga-heygen COMMENT (LGTM, full audit), vanceingalls COMMENT (3 flags — telemetry gap, cfg mutation, launch-failure pattern nit), miguel-heygen APPROVED (verified cfg is cloned at renderOrchestrator.ts:991, so Vance's cfg-mutation flag is a non-issue). Layered on top of theirs.
Part of the render-reliability P1/P2 stack: #1843 / #1841 / #1842.
🟠 AI trailer + Claude co-author
- PR body ends with
🤖 Generated with [Claude Code]. - Commit
2674fc4fddcarriesCo-Authored-By: Claude.
Per team convention (see miguel-heygen bar), strip both before merge. Already flagged by vanceingalls, re-noting so it doesn't slip.
🟠 Observability: retry + OOM + encoder-exit paths all log-only
Extending vanceingalls's telemetry-gap finding — three metric emissions are all missing, not just one:
-
Transient retry branch (
renderOrchestrator.ts:780-791) —log.warnonly. Norender.transient_retrycounter with{result: fired|exhausted}. This is the highest-value one because the PR is explicitly wired to PostHog dashboard 1783183 (~15K errors / ~7K users) — post-ship we need to answer "is the retry burning its budget or absorbing the failures?" from a metric, not log grep. -
OOM classification (
renderOrchestrator.ts:1996-2001,describeMemoryExhaustion) — no counter. When OOM becomes the dominant tail after transient-retry absorbs the tab-death cohort, we won't see the shift in a metric. This is the classic observability PR that skips its own success path shape. -
Encoder exit-before-frame-N (
ensureFrameWrittenw/getExitError) — the error message is now actionable, but no counter emits when a frame-0 encoder death happens. Codec/args failures at frame 0 are exactly the class where a metric surface tells youlibx264isn't available on the fleet vs. a bad-args config regression, without ssh'ing into hosts.
All three are cheap to bolt on (one counter per branch, shared tag set: {stage, cause}). Non-blocking for the fix, but the observability window closes when this lands and the failure buckets shrink — measure before the population changes.
🟡 OOM guidance may mis-lead on producer-side heap OOM
packages/producer/src/services/renderOrchestrator.ts:606-619 — describeMemoryExhaustion suggests --workers 1 as a memory-pressure lever. This is correct for browser-side allocation failures (Set maximum size exceeded firing inside the CDP page context) — but for producer-process V8 heap OOM (Reached heap limit, JavaScript heap out of memory from Node itself), reducing workers reduces the browser footprint, not the producer heap where the frame-cache / reorder-buffer live. On the producer path a user could drop to --workers 1 and re-hit the same wall.
Not a blocker — the other levers (lower resolution / duration / low-memory-mode) do cover the producer-heap case. But the --workers 1 line could be gated on "heap limit" / "JavaScript heap out of memory" NOT matching, or reworded to "reduce parallel browser memory footprint" to be precise about what it does. Worth a follow-up.
🟡 Abort-during-retry: single-throw-time coverage
The abort short-circuit at renderOrchestrator.ts:751-753 is correctly ordered — it fires FIRST in the catch, BEFORE the transient classification. The test at renderOrchestrator.test.ts:527 covers the "abort at throw time" case (controller.abort() inside the mock, then executeParallelCapture throws). But the test doesn't cover "abort fires DURING the between-attempt gap" (after the first executeParallelCapture throws transient, before the retry's next call). The abort flag is persistent so this looks correct — the check at :751 catches it on the next iteration's catch — but there's no test pinning that. Worth an "abort between attempts" case for the retry-branch coverage. Not blocking.
⚡ Rollout suggestion
The bundle is four independent hardenings with different blast radii:
- protocolTimeout scale-up: raise-only, per-render, low risk.
- Transient retry: bounded at 1, could theoretically 2× wall-clock for a truly-broken deterministic-death case (still bounded).
- OOM classification: catch-side reformatter, zero behavior change on the success path.
- Encoder exit-reason: strictly additive to error text.
If canary/dashboard 1783183 is being watched, worth calling out in the deploy note that transient-retry can 2× the render duration for the specific "tab dies pre-frame-0 twice" cohort — makes the p99 tail move by design, and you want to distinguish that from a regression.
↩️ Questions
-
getExitError()returns the ffmpeg exit reason synchronously afterexitStatus === "error". BetweenwriteFramereturningfalse(encoder is gone) and the child-processcloseevent firing, is there a window whereexitStatusis still"running"andgetExitError()returnsundefined, causingensureFrameWrittento fall back to the bare message? If yes → the frame-0 actionability regresses under a race. If no (the write returns false only afterclosefires) → confirm and forget. -
The transient-retry cohort re-runs
executeParallelCapturewith the sameinitialWorkerCount— but does it also re-run the probe (probeStage) with the fresh browser, or does it reuse the probe browser that just died? The scope-decision doc mentions probeStage has its own transient retry for session-init; want to confirm the transient-retry here doesn't sit above a probe browser that is itself already crashed.
✅ Well-done
- Abort-first ordering in the catch block is correct and cheap to re-verify (line 751 before line 778) — the single line makes the invariant obvious.
isMemoryExhaustionErrordeliberately narrow-patterned + the disjointness test atframeCapture-transientErrors.test.ts:88-90explicitly asserts OOM doesn't leak into transient (which would burn the retry on a resource ceiling). Exactly the shape observability-PR-with-a-classifier should have.scaleProtocolTimeoutForCompositionis a pure function with degenerate-input coverage + never-scales-down invariant tested. Theceiling = max(base, MAX)line atconfig.ts:299is subtle but load-bearing (preserves the "only ever raise" contract for an above-ceiling base) and is covered by the "never lowers a base already above the ceiling" test.- Device-scaled output dims (not CSS composition size) plumbed through to both the protocolTimeout scaling AND the OOM guidance — memory is allocated at output resolution, so this is the right number to key both on.
- Scope-decision doc in the PR body is a model of "what I did NOT touch and why" — the sequential/streaming paths + auto-recovery deferral are called out explicitly, not silently deferred.
— Rames D Jusso
|
Thanks @vanceingalls — quick resolution on the two 🟠 items:
|
…pture hardening Follow-up to #1842 / #1843 review feedback (Via + Magi): the P2-5 capture hardening was log-only, so its effect is invisible on PostHog dashboard 1783183 until an incident. Thread two counters through the existing observability → CLI-telemetry pipeline (no new PostHog wiring): - **transient-retry burn** — `CaptureAttemptSummary.reason` gains a `"transient-retry"` variant (distinct from the worker-halving `"retry"`), so `executeDiskCaptureWithAdaptiveRetry` tags same-worker retries after a transient tab death. `executeRenderJob` counts them into `RenderCaptureObservability.transientRetries` on successful renders — answers "are we burning the retry budget but recovering?". - **OOM classification** — the top-level catch sets `RenderCaptureObservability.memoryExhaustionDetected` whenever `describeMemoryExhaustion` classifies the failure — answers "is OOM the dominant failure tail?". Both surface via `renderObservabilityTelemetryPayload` → `capture_transient_retries` and `capture_memory_exhaustion_detected` PostHog props on the render events. Tests: producer asserts the transient retry attempt is tagged `transient-retry`; new CLI test locks the payload mapping (present when set, undefined otherwise). Scope: the encoder-frame-0-exit signal (also log-only) and the P1-3 pre-flight- rejection / P1-4 cli_env_check counters remain a further follow-up — they live in different subsystems (streaming stage, CLI render/doctor commands). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… hardening (#1850) Follow-up to the render-reliability batch (#1841/#1842/#1843). Threads two capture-reliability counters through the existing observability → CLI-telemetry pipeline (no new PostHog wiring) so #1842's hardening is measurable on dashboard 1783183: - transient-retry burn (CaptureAttemptSummary.reason gains "transient-retry"; counted into RenderCaptureObservability.transientRetries on BOTH the recovered and the still-failed paths via a shared helper). - OOM classification (memoryExhaustionDetected set when describeMemoryExhaustion classifies the failure). Surfaced as capture_transient_retries + capture_memory_exhaustion_detected render-event props. Tests cover the attempt tagging and the payload mapping. Further follow-up (different subsystems): encoder-frame-0-exit signal, and P1-3 pre-flight-rejection / P1-4 cli_env_check counters. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Render-reliability workstream P2-5 (Capture timeouts / stability). Targets the capture-timeout/stability failure bucket (~15K errors / ~7K users over 30 days) tracked on PostHog dashboard 1783183:
Protocol error (Page.captureScreenshot): Target closedRuntime.callFunctionOn timed outStreaming encoder exited before frame 0Set maximum size exceeded(OOM / oversized composition)What landed
1. protocolTimeout auto-scaling (fixes
Runtime.callFunctionOn timed out)A single CDP seek+capture call scales with output pixel area (device-scaled), so a fixed 300s ceiling intermittently kills legitimate slow-but-valid large renders.
scaleProtocolTimeoutForCompositionraises the timeout proportionally to output area, clamped to 30 min, never lowering the configured base.Applied before the probe launches its browser —
protocolTimeoutis baked into the Puppeteer connection atppt.launch()and the probe browser is reused for capture on the common single-worker path, so a post-probe mutation would be a no-op there. Scales onoutputWidth/outputHeight(width × deviceScaleFactor), so high-DPI--resolutionrenders get the bump too.2. Single bounded transient retry (fixes transient
Target closed)A
Target closed/Page crashed/Session closedis usually a dead tab, not a broken composition. The parallel disk-capture loop now retries once at the same worker count (parallelism isn't the problem) and independent of forward progress (a tab that dies before frame 0 is exactly the case to recover). Bounded byMAX_TRANSIENT_CAPTURE_RETRIES = 1so a deterministically-dying tab still fails instead of looping. Rethrows immediately when the render was aborted so cancellation never burns a retry.3. Actionable OOM handling (fixes cryptic
Set maximum size exceeded)isMemoryExhaustionErrorclassifies V8/Node allocation-failure signatures (deliberately narrow — no bareout of memory, which appears in benign WebGL/GPU console noise, and disjoint from transient classification). The top-level render catch turns these into guidance naming the output dimensions that exhausted memory and the levers that reduce pressure (lower resolution/duration/fps,--workers 1,--low-memory-mode).4. Encoder-exit-before-frame-0 (better error surfaced)
Streaming encoder exited before frame Ndiscarded FFmpeg's own failure reason. NewStreamingEncoder.getExitError()exposes the exit code + stderr tail synchronously, andensureFrameWrittenincludes it — so a frame-0 encoder death (bad args, unsupported codec) is now actionable instead of cryptic.Scope decisions
probeStagealready retries the shared session-init phase all paths go through.Testing
scaleProtocolTimeoutForComposition,isMemoryExhaustionError,getExitError,describeMemoryExhaustion,ensureFrameWritten, and the transient-retry + abort behaviors all have unit/integration tests following existing patterns (renderOrchestrator.test.ts,frameCapture-transientErrors.test.ts,streamingEncoder.test.ts,config.test.ts).bun run build,tsc --noEmit(engine + producer), engine suite (834 tests) + producer orchestrator suite (78 tests),oxlint,oxfmt --check,fallow audit --base origin/main --fail-on-issues(exit 0).Self-reviewed via
/code-review(high effort); findings on protocolTimeout altitude (post-probe no-op), device-scaling, abort-during-retry, and the over-broad OOM regex were fixed before this PR was opened.🤖 Generated with Claude Code