Skip to content

fix(producer): harden capture against timeouts, transient tab deaths, and OOM (P2-5)#1842

Merged
jrusso1020 merged 1 commit into
mainfrom
fix/capture-timeout-stability
Jul 2, 2026
Merged

fix(producer): harden capture against timeouts, transient tab deaths, and OOM (P2-5)#1842
jrusso1020 merged 1 commit into
mainfrom
fix/capture-timeout-stability

Conversation

@jrusso1020

Copy link
Copy Markdown
Collaborator

Summary

Render-reliability workstream P2-5 (Capture timeouts / stability). Targets the capture-timeout/stability failure bucket (~15K errors / ~7K users over 30 days) tracked on PostHog dashboard 1783183:

  • Protocol error (Page.captureScreenshot): Target closed
  • Runtime.callFunctionOn timed out
  • Streaming encoder exited before frame 0
  • Set maximum size exceeded (OOM / oversized composition)

What landed

1. protocolTimeout auto-scaling (fixes Runtime.callFunctionOn timed out)
A single CDP seek+capture call scales with output pixel area (device-scaled), so a fixed 300s ceiling intermittently kills legitimate slow-but-valid large renders. scaleProtocolTimeoutForComposition raises the timeout proportionally to output area, clamped to 30 min, never lowering the configured base.

Applied before the probe launches its browserprotocolTimeout is baked into the Puppeteer connection at ppt.launch() and the probe browser is reused for capture on the common single-worker path, so a post-probe mutation would be a no-op there. Scales on outputWidth/outputHeight (width × deviceScaleFactor), so high-DPI --resolution renders get the bump too.

2. Single bounded transient retry (fixes transient Target closed)
A Target closed / Page crashed / Session closed is usually a dead tab, not a broken composition. The parallel disk-capture loop now retries once at the same worker count (parallelism isn't the problem) and independent of forward progress (a tab that dies before frame 0 is exactly the case to recover). Bounded by MAX_TRANSIENT_CAPTURE_RETRIES = 1 so a deterministically-dying tab still fails instead of looping. Rethrows immediately when the render was aborted so cancellation never burns a retry.

3. Actionable OOM handling (fixes cryptic Set maximum size exceeded)
isMemoryExhaustionError classifies V8/Node allocation-failure signatures (deliberately narrow — no bare out of memory, which appears in benign WebGL/GPU console noise, and disjoint from transient classification). The top-level render catch turns these into guidance naming the output dimensions that exhausted memory and the levers that reduce pressure (lower resolution/duration/fps, --workers 1, --low-memory-mode).

4. Encoder-exit-before-frame-0 (better error surfaced)
Streaming encoder exited before frame N discarded FFmpeg's own failure reason. New StreamingEncoder.getExitError() exposes the exit code + stderr tail synchronously, and ensureFrameWritten includes it — so a frame-0 encoder death (bad args, unsupported codec) is now actionable instead of cryptic.

Scope decisions

  • Transient capture retry covers the parallel disk-capture path. The sequential and streaming capture paths run a single stateful session/encoder — retrying them cleanly means re-spawning the encoder mid-stream, which is a larger, riskier change out of scope here. probeStage already retries the shared session-init phase all paths go through.
  • OOM is surfaced as actionable guidance, not auto-recovered. Auto-dropping to low-memory mode / fewer workers on mid-capture OOM is a possible follow-up; this PR makes the failure legible.
  • protocolTimeout scales on area only, not frame count — frame count governs total wall-clock (already capped by the ffmpeg streaming inactivity timeout), not a single CDP call.

Testing

  • scaleProtocolTimeoutForComposition, isMemoryExhaustionError, getExitError, describeMemoryExhaustion, ensureFrameWritten, and the transient-retry + abort behaviors all have unit/integration tests following existing patterns (renderOrchestrator.test.ts, frameCapture-transientErrors.test.ts, streamingEncoder.test.ts, config.test.ts).
  • Full local gate green: bun run build, tsc --noEmit (engine + producer), engine suite (834 tests) + producer orchestrator suite (78 tests), oxlint, oxfmt --check, fallow audit --base origin/main --fail-on-issues (exit 0).

Self-reviewed via /code-review (high effort); findings on protocolTimeout altitude (post-probe no-op), device-scaling, abort-during-retry, and the over-broad OOM regex were fixed before this PR was opened.

🤖 Generated with Claude Code

… and OOM

Addresses the P2-5 capture-timeout/stability failure bucket (~15K errors /
~7K users / 30d): "Target closed", "Runtime.callFunctionOn timed out",
"Streaming encoder exited before frame 0", "Set maximum size exceeded".

- Scale the CDP protocolTimeout up for large canvases by *output* pixel area,
  applied BEFORE the probe launches its browser (protocolTimeout is baked in
  at ppt.launch() and the probe browser is reused for capture, so a post-probe
  mutation was a no-op on the common path). Only ever raises; clamped to 30min.
- Single bounded retry on a transient browser death (Target closed / Page
  crashed / Session closed) in the parallel disk-capture loop, at the same
  worker count and independent of forward progress, so a tab that dies before
  frame 0 recovers. Rethrows immediately when the render was aborted so
  cancellation never burns a retry.
- Turn cryptic OOM failures (Set maximum size exceeded, heap-limit aborts)
  into actionable guidance naming the output dimensions and the levers that
  reduce memory pressure. Memory-exhaustion classification is deliberately
  narrow and disjoint from transient classification.
- Surface FFmpeg's own failure reason in "Streaming encoder exited before
  frame N" via a new StreamingEncoder.getExitError(), instead of a bare
  frame-indexed message.

Scope: transient capture retry covers the parallel disk path; sequential and
streaming paths run a single stateful session/encoder and are out of scope
here (probeStage already retries the shared session-init phase).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@miga-heygen miga-heygen left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — fix(producer): harden capture against timeouts, transient tab deaths, and OOM (P2-5)

Verdict: LGTM

Four independent hardening changes, each well-scoped and well-tested.

1. protocolTimeout auto-scaling

scaleProtocolTimeoutForComposition scales by output pixel area relative to a 1080p reference. Pure function, correct properties:

  • Never scales down (small composition → keeps base). ✅
  • Ceiling at 30 min (pathological canvas → bounded). ✅
  • An already-high base is never lowered (ceiling = max(base, MAX)). ✅
  • Degenerate dims (0, NaN) → base. ✅
  • Applied BEFORE probe browser launch — correct, since protocolTimeout is a Puppeteer connection-level setting baked in at ppt.launch() and immutable afterwards. The probe browser is reused for capture on the common single-worker path, so a post-probe mutation would be a no-op. ✅

2. Single bounded transient retry

MAX_TRANSIENT_CAPTURE_RETRIES = 1. On isTransientBrowserError (Target closed / Page crashed / Session closed):

  • Retries once at the SAME worker count (parallelism isn't the problem). ✅
  • Does NOT require forward progress (a tab that dies before frame 0 is the exact case to recover). ✅
  • Rethrows immediately when the render was aborted (cancellation never burns a retry). ✅
  • Bounded at 1 so a deterministically-dying tab still fails. ✅
  • Runs BEFORE the worker-halving retry in the catch block — correct ordering. ✅
  • Abort check runs FIRST in the catch block — correct priority. ✅

Scope decision documented: covers the parallel disk-capture path only. Sequential and streaming paths have their own session-init retry via probeStage. Noted, not blocking.

3. Actionable OOM handling

isMemoryExhaustionError — deliberately narrow patterns (Set maximum size exceeded, Map maximum size, Invalid array/string length, Array buffer allocation failed, Reached heap limit, JavaScript heap out of memory). Intentionally does NOT match bare "out of memory" (WebGL/GPU console noise). Test explicitly asserts disjointness from transient classification (a retry re-hits the same ceiling). ✅

describeMemoryExhaustion — formats actionable guidance naming the output dimensions (device-scaled, not CSS) that exhausted memory, plus the levers that actually reduce pressure (lower resolution, fewer workers, low-memory mode). Correct: memory is allocated at output resolution. ✅

captureCompositionWidth/captureCompositionHeight/captureTotalFrames are declared at function scope and assigned inside the try — structurally required to cross the try/catch boundary.

4. Encoder exit error

StreamingEncoder.getExitError() exposes exit code + stderr tail synchronously. ensureFrameWritten now includes it — so a frame-0 encoder death (bad args, unsupported codec) is actionable instead of a bare "encoder exited before frame 0". Tests verify non-zero exit, clean exit (undefined), and the frame-write integration. ✅

Ponytail lens

Lean already. Ship.

— Miga

@vanceingalls vanceingalls left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Well-scoped hardening; classifier disjointness and abort-first ordering both look right

Audited under the widened-retry-policy lens: MAX_TRANSIENT_CAPTURE_RETRIES = 1 bounds the blast radius to 2× wall-clock at the same worker count, transient/OOM classifiers are provably disjoint (test at frameCapture-transientErrors.test.ts explicitly asserts this), abort short-circuits BEFORE the transient branch (renderOrchestrator.ts:751-753), and executeParallelCapture spins fresh worker browsers per call so the "fresh session" claim in the log is factual, not aspirational. Nothing blocking.

Two heads-ups and one nit for R2/follow-up consideration.

🟠 Telemetry gap on the new retry branch

File: packages/producer/src/services/renderOrchestrator.ts:780-791

Transient retry fires only a log.warn. Neither chaos.transient_retry.fired nor chaos.transient_retry.exhausted are emitted, and neither is the OOM classification (via describeMemoryExhaustion). This is the exact shape of the telemetry-gap-on-parallel-branch failure mode: when this ships, we won't be able to answer "are we burning the retry budget?" or "is OOM now the dominant tail?" from a metric — only from log grep. Given the PR is explicitly tied to PostHog dashboard 1783183, plumbing a counter alongside the log (even a single render.transient_retry counter with {result: fired|exhausted} tags, and render.oom_classified for the OOM branch) would close the loop. Not blocking for the fix, but worth stapling on before we lose the observability chance.

🟠 cfg.protocolTimeout mutated in place

File: packages/producer/src/services/renderOrchestrator.ts:848

cfg.protocolTimeout = scaledProtocolTimeout mutates the caller's config object. If cfg is reused across renders in the same worker process (I couldn't verify one way or the other from the diff), the scaled ceiling persists into the next job — a big 4K job then a small 720p job would keep the 4K's scaled timeout. Prefer a local const effectiveProtocolTimeout = scaleProtocolTimeoutForComposition(...) threaded to launchProbeStage (or wherever ppt.launch is called), leaving cfg immutable. If the audit shows cfg is per-job in practice, downgrade to nit.

nit — transient list includes deterministic launch-failure pattern

File: packages/engine/src/services/frameCapture.ts:1993

/Failed to launch the browser process/i is in the transient set. On a host with a permanent problem (no shm, missing headless-shell binary quirk on that release, wrong sandboxing flag), this classification will burn the one retry before falling through to the halving path. It's still bounded at 1 retry, so this doesn't warrant a block — but worth documenting that this specific pattern is deliberately included despite occasionally being deterministic, or splitting it out with a shorter budget. Sequential/streaming paths being out of scope is called out cleanly.

🟠 pr-envelope

Commit trailer Co-Authored-By: Claude Opus 4.8 (1M context) + body footer 🤖 Generated with [Claude Code] are present. Miguel's bar on HF treats both as request-changes; flagging for the author to strip before merge per team convention.


Review by Via

@miguel-heygen miguel-heygen left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean P2-5 hardening pass; I agree with the ship shape.

What I verified:

  • packages/producer/src/services/renderOrchestrator.ts:792-810 adds exactly one transient browser retry, checks abort first at :760-767, and retries with the same worker count rather than mixing it with the worker-halving path.
  • packages/engine/src/services/frameCapture.ts:1984-2007 and packages/engine/src/services/frameCapture-transientErrors.test.ts:49-90 keep transient and memory-exhaustion classifiers disjoint, so OOM doesn't burn the transient retry budget.
  • packages/producer/src/services/renderOrchestrator.ts:1996-2001 surfaces memory-exhaustion guidance with output dimensions/frame count when available.
  • packages/engine/src/services/streamingEncoder.ts:143-150 plus packages/producer/src/services/render/stages/captureHdrFrameShared.ts let frame-write failures include the FFmpeg exit reason instead of the old bare frame-0 message.
  • On Via's config-mutation note: packages/producer/src/services/renderOrchestrator.ts:991 clones job.config.producerConfig into a local cfg, so the cfg.protocolTimeout scaling at :1194-1207 does not leak back into the caller's config across jobs.

Non-blocking follow-up: packages/producer/src/services/renderOrchestrator.ts:792-807 logs transient retries but does not emit a dedicated metric/counter, so ops will need logs/perf summaries to quantify retry hit-rate until that is added. CI is green.

Verdict: APPROVE
Reasoning: The retry, OOM, timeout, and encoder-error paths are bounded, disjoint where they need to be, and covered by focused tests; the remaining telemetry gap is a follow-up, not a merge blocker.

— Magi

@james-russo-rames-d-jusso james-russo-rames-d-jusso left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed at 2674fc4fdd.
Peer scan: miga-heygen COMMENT (LGTM, full audit), vanceingalls COMMENT (3 flags — telemetry gap, cfg mutation, launch-failure pattern nit), miguel-heygen APPROVED (verified cfg is cloned at renderOrchestrator.ts:991, so Vance's cfg-mutation flag is a non-issue). Layered on top of theirs.
Part of the render-reliability P1/P2 stack: #1843 / #1841 / #1842.

🟠 AI trailer + Claude co-author

  • PR body ends with 🤖 Generated with [Claude Code].
  • Commit 2674fc4fdd carries Co-Authored-By: Claude.

Per team convention (see miguel-heygen bar), strip both before merge. Already flagged by vanceingalls, re-noting so it doesn't slip.

🟠 Observability: retry + OOM + encoder-exit paths all log-only

Extending vanceingalls's telemetry-gap finding — three metric emissions are all missing, not just one:

  1. Transient retry branch (renderOrchestrator.ts:780-791) — log.warn only. No render.transient_retry counter with {result: fired|exhausted}. This is the highest-value one because the PR is explicitly wired to PostHog dashboard 1783183 (~15K errors / ~7K users) — post-ship we need to answer "is the retry burning its budget or absorbing the failures?" from a metric, not log grep.

  2. OOM classification (renderOrchestrator.ts:1996-2001, describeMemoryExhaustion) — no counter. When OOM becomes the dominant tail after transient-retry absorbs the tab-death cohort, we won't see the shift in a metric. This is the classic observability PR that skips its own success path shape.

  3. Encoder exit-before-frame-N (ensureFrameWritten w/ getExitError) — the error message is now actionable, but no counter emits when a frame-0 encoder death happens. Codec/args failures at frame 0 are exactly the class where a metric surface tells you libx264 isn't available on the fleet vs. a bad-args config regression, without ssh'ing into hosts.

All three are cheap to bolt on (one counter per branch, shared tag set: {stage, cause}). Non-blocking for the fix, but the observability window closes when this lands and the failure buckets shrink — measure before the population changes.

🟡 OOM guidance may mis-lead on producer-side heap OOM

packages/producer/src/services/renderOrchestrator.ts:606-619describeMemoryExhaustion suggests --workers 1 as a memory-pressure lever. This is correct for browser-side allocation failures (Set maximum size exceeded firing inside the CDP page context) — but for producer-process V8 heap OOM (Reached heap limit, JavaScript heap out of memory from Node itself), reducing workers reduces the browser footprint, not the producer heap where the frame-cache / reorder-buffer live. On the producer path a user could drop to --workers 1 and re-hit the same wall.

Not a blocker — the other levers (lower resolution / duration / low-memory-mode) do cover the producer-heap case. But the --workers 1 line could be gated on "heap limit" / "JavaScript heap out of memory" NOT matching, or reworded to "reduce parallel browser memory footprint" to be precise about what it does. Worth a follow-up.

🟡 Abort-during-retry: single-throw-time coverage

The abort short-circuit at renderOrchestrator.ts:751-753 is correctly ordered — it fires FIRST in the catch, BEFORE the transient classification. The test at renderOrchestrator.test.ts:527 covers the "abort at throw time" case (controller.abort() inside the mock, then executeParallelCapture throws). But the test doesn't cover "abort fires DURING the between-attempt gap" (after the first executeParallelCapture throws transient, before the retry's next call). The abort flag is persistent so this looks correct — the check at :751 catches it on the next iteration's catch — but there's no test pinning that. Worth an "abort between attempts" case for the retry-branch coverage. Not blocking.

⚡ Rollout suggestion

The bundle is four independent hardenings with different blast radii:

  • protocolTimeout scale-up: raise-only, per-render, low risk.
  • Transient retry: bounded at 1, could theoretically 2× wall-clock for a truly-broken deterministic-death case (still bounded).
  • OOM classification: catch-side reformatter, zero behavior change on the success path.
  • Encoder exit-reason: strictly additive to error text.

If canary/dashboard 1783183 is being watched, worth calling out in the deploy note that transient-retry can 2× the render duration for the specific "tab dies pre-frame-0 twice" cohort — makes the p99 tail move by design, and you want to distinguish that from a regression.

↩️ Questions

  1. getExitError() returns the ffmpeg exit reason synchronously after exitStatus === "error". Between writeFrame returning false (encoder is gone) and the child-process close event firing, is there a window where exitStatus is still "running" and getExitError() returns undefined, causing ensureFrameWritten to fall back to the bare message? If yes → the frame-0 actionability regresses under a race. If no (the write returns false only after close fires) → confirm and forget.

  2. The transient-retry cohort re-runs executeParallelCapture with the same initialWorkerCount — but does it also re-run the probe (probeStage) with the fresh browser, or does it reuse the probe browser that just died? The scope-decision doc mentions probeStage has its own transient retry for session-init; want to confirm the transient-retry here doesn't sit above a probe browser that is itself already crashed.

✅ Well-done

  • Abort-first ordering in the catch block is correct and cheap to re-verify (line 751 before line 778) — the single line makes the invariant obvious.
  • isMemoryExhaustionError deliberately narrow-patterned + the disjointness test at frameCapture-transientErrors.test.ts:88-90 explicitly asserts OOM doesn't leak into transient (which would burn the retry on a resource ceiling). Exactly the shape observability-PR-with-a-classifier should have.
  • scaleProtocolTimeoutForComposition is a pure function with degenerate-input coverage + never-scales-down invariant tested. The ceiling = max(base, MAX) line at config.ts:299 is subtle but load-bearing (preserves the "only ever raise" contract for an above-ceiling base) and is covered by the "never lowers a base already above the ceiling" test.
  • Device-scaled output dims (not CSS composition size) plumbed through to both the protocolTimeout scaling AND the OOM guidance — memory is allocated at output resolution, so this is the right number to key both on.
  • Scope-decision doc in the PR body is a model of "what I did NOT touch and why" — the sequential/streaming paths + auto-recovery deferral are called out explicitly, not silently deferred.

— Rames D Jusso

@jrusso1020

Copy link
Copy Markdown
Collaborator Author

Thanks @vanceingalls — quick resolution on the two 🟠 items:

  • cfg.protocolTimeout in-place mutation → confirmed nit. Audited: executeRenderJob builds cfg as a fresh shallow copy per call (const cfg = { ...(job.config.producerConfig ?? resolveConfig()) }, renderOrchestrator.ts:991), and protocolTimeout is a primitive — so the scaled value can't leak into the caller's config or persist across renders in a reused worker. Per your "downgrade to nit if per-job in practice", it's per-job. Leaving as-is to keep the diff surgical.
  • Telemetry gap (transient-retry / OOM counters): agreed this is the right call given the dashboard tie-in, and it's not in the current (approved) diff. I'd rather not re-open an approved PR for it — planning it as a fast follow-up (render.transient_retry{result} + render.oom_classified) so the observability lands without churning this one. Shout if you'd prefer it inline here instead.
  • Failed to launch the browser process in the transient set (nit): deliberately kept — bounded at 1 retry, and the cost of one wasted retry on a genuinely-deterministic launch failure is small vs. recovering the transient cases. Worth a follow-up to split it to a shorter budget if we see retry-burn in the counter above.

@jrusso1020 jrusso1020 merged commit c0c3abf into main Jul 2, 2026
45 checks passed
@jrusso1020 jrusso1020 deleted the fix/capture-timeout-stability branch July 2, 2026 01:50
jrusso1020 added a commit that referenced this pull request Jul 2, 2026
…pture hardening

Follow-up to #1842 / #1843 review feedback (Via + Magi): the P2-5 capture
hardening was log-only, so its effect is invisible on PostHog dashboard 1783183
until an incident. Thread two counters through the existing observability →
CLI-telemetry pipeline (no new PostHog wiring):

- **transient-retry burn** — `CaptureAttemptSummary.reason` gains a
  `"transient-retry"` variant (distinct from the worker-halving `"retry"`), so
  `executeDiskCaptureWithAdaptiveRetry` tags same-worker retries after a
  transient tab death. `executeRenderJob` counts them into
  `RenderCaptureObservability.transientRetries` on successful renders — answers
  "are we burning the retry budget but recovering?".
- **OOM classification** — the top-level catch sets
  `RenderCaptureObservability.memoryExhaustionDetected` whenever
  `describeMemoryExhaustion` classifies the failure — answers "is OOM the
  dominant failure tail?".

Both surface via `renderObservabilityTelemetryPayload` → `capture_transient_retries`
and `capture_memory_exhaustion_detected` PostHog props on the render events.

Tests: producer asserts the transient retry attempt is tagged `transient-retry`;
new CLI test locks the payload mapping (present when set, undefined otherwise).

Scope: the encoder-frame-0-exit signal (also log-only) and the P1-3 pre-flight-
rejection / P1-4 cli_env_check counters remain a further follow-up — they live in
different subsystems (streaming stage, CLI render/doctor commands).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jrusso1020 added a commit that referenced this pull request Jul 2, 2026
… hardening (#1850)

Follow-up to the render-reliability batch (#1841/#1842/#1843). Threads two capture-reliability counters through the existing observability → CLI-telemetry pipeline (no new PostHog wiring) so #1842's hardening is measurable on dashboard 1783183:

- transient-retry burn (CaptureAttemptSummary.reason gains "transient-retry"; counted into RenderCaptureObservability.transientRetries on BOTH the recovered and the still-failed paths via a shared helper).
- OOM classification (memoryExhaustionDetected set when describeMemoryExhaustion classifies the failure).

Surfaced as capture_transient_retries + capture_memory_exhaustion_detected render-event props. Tests cover the attempt tagging and the payload mapping.

Further follow-up (different subsystems): encoder-frame-0-exit signal, and P1-3 pre-flight-rejection / P1-4 cli_env_check counters.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants