Skip to content

fix(engine): fast-fail on zero duration instead of 45s timeout#1186

Merged
miguel-heygen merged 2 commits into
mainfrom
worktree-fix+hf-ready-timeout-fast-fail
Jun 4, 2026
Merged

fix(engine): fast-fail on zero duration instead of 45s timeout#1186
miguel-heygen merged 2 commits into
mainfrom
worktree-fix+hf-ready-timeout-fast-fail

Conversation

@miguel-heygen
Copy link
Copy Markdown
Collaborator

@miguel-heygen miguel-heygen commented Jun 4, 2026

Summary

  • Replace the dumb 45-second readiness poll with a smart pollHfReady that fast-fails after 10 seconds when it detects the runtime has finished but duration is permanently zero
  • Add rich diagnostic error messages that tell users exactly what went wrong and how to fix it
  • Unify readiness polling between screenshot and beginFrame capture modes

Problem

When a composition has no GSAP timeline and no data-duration attribute, window.__hf.duration stays at 0. The engine polled for duration > 0 for the full 45-second timeout before giving up with a generic error message. Every occurrence wasted 45 seconds of wall-clock time with no actionable feedback.

Reproduction

# Create a minimal zero-duration composition (no GSAP, no data-duration):
mkdir -p /tmp/hf-zero-duration-test
cat > /tmp/hf-zero-duration-test/index.html << 'HTML'
<!DOCTYPE html>
<html>
<head>
  <style>
    body { margin: 0; background: #1a1a2e; display: flex;
           align-items: center; justify-content: center; height: 100vh; }
    h1 { color: white; font-size: 48px; font-family: sans-serif; }
  </style>
</head>
<body>
  <div data-composition-id="zero-duration-test">
    <h1>No GSAP, No data-duration</h1>
  </div>
</body>
</html>
HTML

# ON MAIN — 46 seconds, generic error:
time hyperframes render /tmp/hf-zero-duration-test -o /tmp/out.mp4
# [FrameCapture] window.__hf not ready after 45000ms.
#   Page must expose window.__hf = { duration, seek }.
# real 0m46s

# ON FIX BRANCH — 11 seconds, diagnostic with actionable fix:
time hyperframes render /tmp/hf-zero-duration-test -o /tmp/out.mp4
# [FrameCapture] Composition has zero duration.
#   Runtime ready: true, __player: true, __hf.seek: true,
#   GSAP timeline: false, data-duration: not set
#   → No GSAP timeline registered (window.__timelines is empty).
#     If using CSS/WAAPI/Lottie/Three.js animations, add data-duration
#     to the root element.
#   → Fix: add data-duration="<seconds>" to your root
#     <div data-composition-id="..."> element.
# real 0m11s

Root cause

The readiness condition window.__hf.duration > 0 has five failure modes:

  1. No GSAP timeline + no data-duration (CSS-only, WAAPI, Lottie compositions)
  2. External sub-composition fetch fails (404 → no timeline → duration=0)
  3. GSAP CDN script fails to load (timeout → no gsap → no timeline)
  4. DOMContentLoaded never fires in BeginFrame mode (blocking scripts)
  5. User script overwrites window.__player

All five produce the same symptom: duration stays at 0 indefinitely.

Fix

After 10 seconds of polling, the engine evaluates a diagnostic expression that checks window.__renderReady, window.__player, window.__timelines, and data-duration. If the runtime has finished initializing (renderReady=true) but duration is still zero, it fails immediately with a diagnostic message.

Test plan

  • Build passes (bun run build)
  • All pre-commit hooks pass (lint, format, typecheck, fallow)
  • Zero-duration composition: 46s → 11s, with diagnostic error message showing exact state
  • Error message includes actionable fix (add data-duration="<seconds>")
  • Normal compositions with GSAP timelines render correctly (no regression)
  • Compositions with data-duration render correctly
  • Verify timeout errors drop after deploy

When a composition's runtime finishes initializing but reports zero
duration (no GSAP timeline and no data-duration attribute), the engine
previously polled for the full 45-second timeout before failing.

Now, after 10 seconds of polling, the engine checks whether the runtime
has finished (window.__renderReady === true) with a working seek
function but zero duration. If so, it fails immediately with a
diagnostic message explaining what's wrong and how to fix it.

This also improves the generic timeout error message to include runtime
state (whether __player exists, __hf.seek, GSAP timelines, declared
duration) so users can self-diagnose.

PostHog data: 555-1,234 occurrences/day, each wasting 45s of user time.
Copy link
Copy Markdown
Collaborator

@jrusso1020 jrusso1020 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — Fast-fail on permanent zero-duration with diagnostic

Good shape. Replaces two near-identical inline poll loops with a single pollHfReady helper that fast-fails after 10s when the runtime has announced it's ready with a permanently zero duration. The diagnostic message is the strongest part — it tells the user exactly what attribute is missing and how to fix it.

What I verified

  1. Fast-fail guard is correctly narrow. The 10s early-exit fires only when diag.renderReady && diag.hasSeek && diag.duration === 0. That triplet is the "the runtime finished initializing and is reporting zero" signal — not a "still loading, give it more time" signal. Legitimate slow-init compositions (large asset preload, deferred font loading, Three.js bootstrap) still get the full playerReadyTimeout budget because they don't have renderReady=true yet at 10s. ✓
  2. No regression on legitimate low-duration compositions. duration > 0 always passes the ready expression (window.__hf.duration > 0). Only duration === 0 triggers the diagnostic path. A 1s composition with data-duration="1" still resolves cleanly. ✓
  3. Diagnostic message is actionable. The hint chain ("window.__player was never set" → "No GSAP timeline registered" → "Add data-duration to root") walks the user from generic failure to specific fix. The PostHog data (555-1234 occurrences/day, #1 render error) suggests this is hitting the no-data-duration mistake; the diagnostic literally tells them which attribute to add. ✓
  4. Helper consolidation eliminates duplicate poll code. The previous BeginFrame branch (lines 766-779 in the old file) had a hand-rolled poll loop that subtly differed from the non-BeginFrame path. Both now route through pollHfReady, so future contract changes touch one place. ✓
  5. Error wrapping in BeginFrame path is correct. The try { await pollHfReady(...) } catch (err) { warmupState.running = false; throw err; } block (lines 766-770) preserves the warmupState.running = false cleanup that the old inline code did before throwing. ✓

Tiny ask, optional

  • The diagnostic JS at HF_READY_DIAGNOSTIC_EXPR reaches into window.__player, window.__renderReady, window.__timelines, window.__hf, and document.querySelector("[data-composition-id]"). Worth a comment block above that constant documenting which of those are contract (load-bearing for the runtime) vs diagnostic (read for error-message detail but not actually required to render). The next person debugging an unrelated startup issue will read this expression and wonder which fields are intentional handshake points.
  • The FAST_FAIL_AFTER_MS = 10_000 could live alongside DEFAULT_CONFIG as a named config entry rather than a function-local const, so an integrator could lower it for tests or raise it for slow-CI environments. Non-blocking; the hardcoded 10s is reasonable for now.

No blockers. Defensive, low-risk, observability-data-driven. Ship.

Review by Jerrai (hyperframes specialist)

Copy link
Copy Markdown
Collaborator

@vanceingalls vanceingalls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Good fix for a real, high-volume problem — the PostHog numbers and the actionable error message both land well. A few things need attention before merge.


Important: __renderReady is not a strong enough invariant to guarantee zero-duration is permanent

The fast-fail condition (renderReady && hasSeek && duration === 0) assumes that once the runtime finishes initializing, if __hf.duration is still 0 it will never become non-zero. That isn't quite right in BeginFrame mode.

__renderReady = true is set in init.ts (line 1544) before state.transportRafId = window.requestAnimationFrame(transportTick) (line 2026). The transport tick — which calls bindRootTimelineIfAvailable() and then clock.setDuration(dur) — runs on rAF. In BeginFrame mode, rAF only fires when the warmup loop issues frames. At the moment __renderReady is first seen as true inside the 10s polling window, it's possible the transport tick hasn't yet had a chance to bind the GSAP timeline and update the clock duration.

In practice this is unlikely to bite — the warmup loop runs at ~30fps so by 10s the tick has had ~300 opportunities — but the invariant is stated more strongly in the diagnostic message ("The runtime finished initializing but reported zero duration — this is permanent") than the code can actually guarantee. The correct guard is "renderReady AND hasSeek AND hasTimeline=false AND declaredDuration <= 0", not the current form. That would fast-fail only when there is provably nothing to wire up, rather than relying on a temporal assumption about the transport tick.

Alternatively, add a short stabilization delay (e.g., 500ms) after first seeing renderReady=true before declaring the zero-duration permanent — this gives the transport tick time to run.


Important: evaluateHfDiagnostic runs on every 100ms tick after 10s

After the 10-second mark, the loop calls evaluateHfDiagnostic(page) on every iteration before sleeping. For a composition that genuinely takes 11–44 seconds to initialize (e.g., heavy sub-composition with slow CDN, high-complexity GSAP timeline), this adds ~350 extra page.evaluate calls (2 per 100ms tick × 35s). That's cheap individually, but at 555–1234 failures/day multiplied across renders this is non-trivial CDP overhead.

The simple fix: only re-check the diagnostic every ~1s after the 10s mark (track a lastDiagCheckAt timestamp), or short-circuit once diag.renderReady is confirmed false the first time (if the runtime still isn't ready after 10s, it won't be ready in another 100ms — wait longer before re-querying).


Nit: elapsed calculation is unnecessarily indirect

const elapsed = timeoutMs - (deadline - Date.now());

deadline was captured as Date.now() + timeoutMs, so this simplifies to Date.now() - startTime. Capture const startedAt = Date.now() before the loop and use that directly — the current form requires the reader to reverse-engineer the arithmetic.


Nit: hint ordering produces contradictory output

buildZeroDurationDiagnostic can emit both:

  • "window.__player was never set — the HyperFrames runtime did not initialize."
  • "The runtime finished initializing but reported zero duration — this is permanent."

...in the same diagnostic if hasPlayer=false && hasSeek=true && duration===0 && renderReady=true. These two hints directly contradict each other. Gate the "runtime finished initializing" hint behind diag.hasPlayer being true.


Nit: test coverage gap

buildZeroDurationDiagnostic is a pure function with six boolean combinations driving four distinct hint paths, and it's fixing the #1 render error by volume. There's no test coverage for it. The PR's test plan has two unchecked boxes. pollHfReady's fast-fail logic is also not unit-tested. The engine package has a pattern for this (frameCapture-namePolyfill.test.ts). These don't need to be full Puppeteer tests — the diagnostic builder is pure, and the fast-fail logic can be exercised with a mock page.


Minor: fallow suppression scope

The fallow-ignore-file complexity at the top suppresses complexity checks for the entire 800+ line file. The fallow-ignore-next-line complexity on buildZeroDurationDiagnostic would have been sufficient to cover the new code — the file-level suppression is broader than needed.


CI

All required checks pass (CI: Build, Lint, Typecheck, Test, CLI smoke, Format; CodeQL; Windows verification). Regression shards are in-progress at review time — worth confirming they complete clean.

— Vai

Copy link
Copy Markdown
Collaborator Author

@miguel-heygen miguel-heygen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Root cause confirmed: DEFAULT_CONFIG.playerReadyTimeout is 45s, both polling paths in frameCapture.ts wait for window.__hf.duration > 0, and with no GSAP timeline + no data-duration attribute both getDuration() and getDeclaredDuration() return 0 permanently. The __renderReady flag fires unconditionally after bindRootTimelineIfAvailable() regardless of timeline found — so renderReady=true, hasSeek=true, duration=0 is a stable terminal state.

On Vai's rAF timing concern: worth clarifying. The fast-fail only triggers after 10s in the poll loop, not synchronously. By 10s, any legitimate composition would have had its timeline registered and duration set via rAF ticks — even BeginFrame mode would have processed many frames by then. A composition still at duration=0 after 10s is almost certainly broken, not warming up.

One edge case: GSAP loaded from an external CDN (unusual in render envs but theoretically possible) could still be in-flight at 10s, triggering the fast-fail with a misleading 'add data-duration' message. Rare enough not to block.

The CDP round-trip frequency for evaluateHfDiagnostic is worth a follow-up throttle — 350 calls on a failed render is noisy. Not a merge blocker.

LGTM.

…-fail

Two nit fixes in pollHfReady:

1. Throttle evaluateHfDiagnostic calls to once per ~1000ms after the 10s
   mark. Previously called on every 100ms loop tick, generating ~350 CDP
   round-trips per failed render. One check per second is sufficient to
   detect a permanently-zero composition.

2. Change fast-fail condition from 'duration === 0' to
   '!hasTimeline && declaredDuration <= 0'. A composition with a GSAP
   timeline but no data-duration attribute should not be fast-failed —
   GSAP sets duration synchronously before __renderReady via __timelines,
   so a non-empty __timelines is a reliable signal that duration will
   eventually be non-zero. Only compositions with NEITHER a GSAP timeline
   NOR a declared duration are permanently zero.
Copy link
Copy Markdown
Collaborator

@vanceingalls vanceingalls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Collaborator

@jrusso1020 jrusso1020 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review — commit 0739a91df0

Both Vai's importants addressed. ✓

Verified:

  • Diagnostic throttled to ~1s cadence via DIAGNOSTIC_INTERVAL_MS = 1_000 + lastDiagnosticAt gate. For a 45s playerReadyTimeout, that drops the failed-render diagnostic load from ~350 CDP roundtrips to ~35. ✓
  • Fast-fail guard tightened from diag.duration === 0 to !diag.hasTimeline && diag.declaredDuration <= 0. This is the stronger "permanent zero" signal — a GSAP composition without data-duration is still valid (GSAP drives duration via __timelines), so the old guard would have false-positived on it. ✓
  • The comment block above the new guard explains the rationale ("GSAP sets duration synchronously before __renderReady") — matches what I traced in init.ts:1645-1654. ✓
  • Final-timeout path (after while loop exits) still uses the old diag.hasSeek && diag.duration === 0 condition for the rich error message — that's fine; by that point we're past the timeout and just selecting between rich vs. generic error text. ✓

Ready from my side, pending stamp authorization.

Review by Jerrai (hyperframes specialist)

@miguel-heygen miguel-heygen merged commit efb8d90 into main Jun 4, 2026
42 checks passed
@miguel-heygen miguel-heygen deleted the worktree-fix+hf-ready-timeout-fast-fail branch June 4, 2026 03:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants