fix: batch GSAP timeline construction to prevent main-thread hang (#1231) #1249

Merged
miguel-heygen merged 7 commits into main from fix/gsap-tween-count-hang
Jun 7, 2026
Conversation

@miguel-heygen
Collaborator

@miguel-heygen miguel-heygen commented Jun 7, 2026

Problem

Compositions with thousands of GSAP timeline construction calls can block Chrome's main thread during HTML parsing. In the reported case, the page can sit at Initializing calibration session... because the event loop is starved before the render bridge and runtime can publish a stable ready state.

Closes #1231

What this fixes

This PR batches early GSAP timeline construction in the producer's injected head stub, then only lets render capture proceed once the runtime has rebound the completed timelines and published render readiness.

The latest fixes on top of the original batching change keep that batching compatible with render-time seeking and runtime child-timeline binding:

  • preserve virtual-time requestAnimationFrame while the early stub drains large construction queues
  • gate bridge duration on window.__renderReady instead of forcing render readiness from the bridge
  • keep render-time timeline controls like totalTime() synchronous after construction is complete
  • forward getChildren() through the lightweight timeline proxy after flushing queued construction calls, so runtime auto-nesting still sees child timelines

Root cause

GSAP applies each tl.to() / from() / fromTo() / set() call synchronously. Large compositions can execute thousands of those calls in one parser task, which delays browser lifecycle events and makes Puppeteer's readiness polling observe an incomplete runtime state.

The first batching pass solved the main-thread starvation, but it exposed two correctness edges:

  • render-time seeks were still being routed through the construction queue after enough timeline calls, causing late caption/timeline drift in regression renders
  • the lightweight timeline proxy did not expose getChildren(), so runtime root-child binding could miss child timelines after batching
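The second edge above can be illustrated with a minimal sketch. It assumes a proxy that queues `to()` calls and flushes them before answering `getChildren()`, so a caller that inspects children never sees a half-built timeline. The names `makeLightweightProxy`, `ChildCapableTimeline`, and `fakeTimeline` are hypothetical and stand in for the real stub and a real GSAP timeline, neither of which is shown in this PR.

```typescript
// Hypothetical shapes for illustration; the real stub's types are not shown in the PR.
interface ChildCapableTimeline {
  to(target: unknown, vars: unknown): unknown;
  getChildren(...args: unknown[]): unknown[];
}

interface LightweightProxy {
  to(target: unknown, vars: unknown): LightweightProxy;
  getChildren(...args: unknown[]): unknown[];
  pendingCount(): number;
}

function makeLightweightProxy(real: ChildCapableTimeline): LightweightProxy {
  const pending: Array<() => void> = [];

  // Apply every queued construction call, in order, against the real timeline.
  const flushPending = (): void => {
    let job = pending.shift();
    while (job) {
      job();
      job = pending.shift();
    }
  };

  const proxy: LightweightProxy = {
    to(target, vars) {
      // Queue instead of applying synchronously.
      pending.push(() => {
        real.to(target, vars);
      });
      return proxy;
    },
    getChildren(...args) {
      // Flush first so the real timeline reflects every queued call.
      flushPending();
      return real.getChildren(...args);
    },
    pendingCount: () => pending.length,
  };
  return proxy;
}

// Stand-in for a real GSAP timeline (assumption: it only records calls).
const recorded: unknown[] = [];
const fakeTimeline: ChildCapableTimeline = {
  to(target, vars) {
    recorded.push({ target, vars });
    return fakeTimeline;
  },
  getChildren() {
    return [...recorded];
  },
};

const demoProxy = makeLightweightProxy(fakeTimeline);
demoProxy.to("#title", { x: 100 }).to("#subtitle", { opacity: 1 });
const appliedBeforeRead = recorded.length; // nothing applied until someone reads
const childrenSeenByRuntime = demoProxy.getChildren(); // flushes, then forwards
```

In this sketch the flush is synchronous at read time; the stub in the PR drains most of the queue across frames and only needs this path when a reader arrives early.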

Verification

Local checks

  • bun run --filter @hyperframes/producer build:hf-early-stub
  • bun test packages/producer/src/services/fileServer.test.ts — 26 tests passed
  • bunx oxlint packages/producer/stubs/hf-early-stub.ts packages/producer/src/generated/hf-early-stub-inline.ts packages/producer/src/services/fileServer.test.ts
  • bunx oxfmt --check packages/producer/stubs/hf-early-stub.ts packages/producer/src/generated/hf-early-stub-inline.ts packages/producer/src/services/fileServer.test.ts
  • bun run build
  • pre-commit hook passed lint, format, fallow, typecheck, and commitlint

Devbox regression replay

Ran the failed CI shard set on devbox against the final patch:

bun run --cwd packages/producer test style-7-prod style-8-prod style-10-prod css-spinner-render-compat webm-transparency mp4-h264-sdr webm-vp9 --sequential --keep-temp

Result: 6 active suites passed, 0 failed, 0 skipped. webm-transparency was excluded by the harness transparency tag, matching the CI shard behavior. The original blocker style-10-prod passed with 0 failed visual frames and audio passed.

Browser verification

Validated the rendered CLI smoke output in-browser through the agent-browser flow.

  • screenshot: /tmp/hf-pr-1249-proof/cli-smoke-render.png
  • recording: /tmp/hf-pr-1249-proof/cli-smoke-render.webm

CI

Live PR head: 2cfe1831ba6fc2130edaa1ea3d6f2d09a1e7eda4

All CI checks are green as of the latest run, including:

  • Build, Lint, Format, Typecheck, Test, Fallow audit
  • CLI smoke and global install smoke
  • CodeQL
  • Windows tests and Windows render verification
  • Player perf and preview regression
  • regression shards 1 through 8

Notes

Mintlify Deployment is skipped by the integration and is the only non-passing check state in the rollup. No composition HTML was changed.

Compositions with thousands of tl.to() calls (e.g. 8,562 in the
reported case) block Chrome's main thread synchronously during HTML
parsing, preventing DOMContentLoaded from firing before Puppeteer's
navigation timeout. This caused render jobs to hang indefinitely at
'Initializing calibration session...' with no error message.

Root cause: GSAP's timeline API is synchronous — each tl.to() call
registers a tween immediately on the main thread. A script with 8k+
calls holds the thread for seconds, starving the browser event loop and
delaying DCL past the navigation timeout window.

Fix: install a property trap on window.gsap in HF_EARLY_STUB (injected
at the top of <head>, before GSAP or user scripts load). When GSAP
assigns itself to window.gsap, the setter intercepts the real gsap
object and wraps gsap.timeline() to return a proxy that queues tween
descriptors (to/from/fromTo/set) instead of calling them synchronously.
A requestAnimationFrame-based flush loop drains 100 tweens per frame,
yielding the main thread between batches so DCL can fire.
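A minimal sketch of the trap-and-batch idea described above, under stated assumptions: `installGsapTrap`, `GsapLike`, `TimelineLike`, and the injected `schedule` callback are hypothetical names, and the scheduler stands in for `requestAnimationFrame` so the sketch can run outside a browser. It is not the stub shipped in this PR.

```typescript
// Hypothetical minimal shapes; only the method names come from the commit message.
type TweenMethod = "to" | "from" | "fromTo" | "set";
interface TimelineLike {
  to(...args: unknown[]): unknown;
  from(...args: unknown[]): unknown;
  fromTo(...args: unknown[]): unknown;
  set(...args: unknown[]): unknown;
}
interface GsapLike {
  timeline(vars?: unknown): TimelineLike;
}
type Schedule = (cb: () => void) => void;

const BATCH_SIZE = 100; // batch size quoted in the commit message

function installGsapTrap(host: object, schedule: Schedule, onBuilt: () => void): void {
  const queue: Array<{ real: TimelineLike; method: TweenMethod; args: unknown[] }> = [];
  let flushScheduled = false;
  let current: unknown;

  // Apply at most BATCH_SIZE queued calls, then yield until the next frame.
  const flush = (): void => {
    for (const { real, method, args } of queue.splice(0, BATCH_SIZE)) {
      real[method](...args);
    }
    if (queue.length > 0) {
      schedule(flush);
    } else {
      flushScheduled = false;
      onBuilt();
    }
  };

  // Replace gsap.timeline() with a factory that returns a queueing proxy.
  const wrap = (gsap: GsapLike): GsapLike => {
    const createReal = gsap.timeline.bind(gsap);
    gsap.timeline = (vars?: unknown): TimelineLike => {
      const real = createReal(vars);
      const enqueue = (method: TweenMethod) => (...args: unknown[]): TimelineLike => {
        queue.push({ real, method, args });
        if (!flushScheduled) {
          flushScheduled = true;
          schedule(flush);
        }
        return proxy;
      };
      const proxy: TimelineLike = {
        to: enqueue("to"),
        from: enqueue("from"),
        fromTo: enqueue("fromTo"),
        set: enqueue("set"),
      };
      return proxy;
    };
    return gsap;
  };

  // The trap: intercept the moment a library assigns itself to host.gsap.
  Object.defineProperty(host, "gsap", {
    configurable: true,
    get: () => current,
    set: (value: unknown) => {
      current = typeof value === "object" && value !== null ? wrap(value as GsapLike) : value;
    },
  });
}

// Usage with a fake window, a fake gsap, and a manual frame queue.
const fakeWindow: { gsap?: GsapLike } = {};
const frames: Array<() => void> = [];
const buildState = { built: false, applied: 0 };
installGsapTrap(fakeWindow, (cb) => frames.push(cb), () => {
  buildState.built = true;
});

const recordCall = (): unknown => {
  buildState.applied += 1;
  return undefined;
};
fakeWindow.gsap = {
  timeline: () => ({ to: recordCall, from: recordCall, fromTo: recordCall, set: recordCall }),
};

const tl = fakeWindow.gsap.timeline();
for (let i = 0; i < 250; i++) tl.to("#el", { x: i });
const appliedSynchronously = buildState.applied;
frames.shift()?.();
const appliedAfterFirstFrame = buildState.applied;
while (frames.length > 0) frames.shift()?.();
const appliedAfterDrain = buildState.applied;
```

Injecting the scheduler is a choice made for this sketch only; it keeps the batching logic testable without a DOM, while a browser build would pass `requestAnimationFrame`.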

When the queue is drained, the stub sets window.__hfTimelinesBuilding =
false and dispatches a 'hf-timelines-built' CustomEvent. init.ts checks
this flag at DOMContentLoaded time; if building is still in progress it
defers bindRootTimelineIfAvailable() until the event fires, then sets
window.__renderReady = true as normal. pollHfReady continues to gate
on both __renderReady and window.__hf.duration > 0, so the render
pipeline does not start until the full timeline is bound.
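The deferral described above can be sketched as follows. This is an illustration only: `bindWhenTimelinesBuilt` and `RenderHost` are hypothetical names, a plain `EventTarget` stands in for `window`, and a plain `Event` stands in for the `CustomEvent` the stub dispatches.

```typescript
// Hypothetical host shape; in the PR these flags live on window.
type RenderHost = EventTarget & {
  __hfTimelinesBuilding: boolean;
  __renderReady: boolean;
};

function bindWhenTimelinesBuilt(host: RenderHost, bindRootTimeline: () => void): void {
  const finish = (): void => {
    bindRootTimeline();
    host.__renderReady = true;
  };
  if (host.__hfTimelinesBuilding) {
    // Construction is still draining: wait for the stub's completion event.
    host.addEventListener("hf-timelines-built", finish, { once: true });
  } else {
    // Nothing queued (or already drained): bind immediately.
    finish();
  }
}

// Usage: simulate a page where batching is still in progress at DOMContentLoaded.
const host: RenderHost = Object.assign(new EventTarget(), {
  __hfTimelinesBuilding: true,
  __renderReady: false,
});
const bindLog = { count: 0 };
bindWhenTimelinesBuilt(host, () => {
  bindLog.count += 1;
});
const readyBeforeEvent = host.__renderReady;
const boundBeforeEvent = bindLog.count;

// The stub finishes draining, clears the flag, and fires the event.
host.__hfTimelinesBuilding = false;
host.dispatchEvent(new Event("hf-timelines-built"));
const readyAfterEvent = host.__renderReady;
```

The `{ once: true }` listener option keeps a repeated event from binding the root timeline twice.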

- Batch size: 100 tweens/rAF tick (empirical; ~4ms/batch at 8k scale)
- Yield mechanism: requestAnimationFrame (cooperative, no setTimeout(0))
- Determinism: 'hf-timelines-built' event guarantees sequencing
- Proxy forwards: pause/seek/totalTime/time/duration/add/paused/
  timeScale/play delegate to the real timeline immediately
- No GSAP package changes; no navigation timeout increase

Fixes #1231
Collaborator

@jrusso1020 jrusso1020 left a comment

Reviewed against the design checklist I sketched in the Slack thread. Architecture is sound; surfacing two correctness concerns that the PR's own description doesn't address + a couple of nice-to-haves. None look blocking for the customer-unblocking purpose of the PR; flagging them as follow-up scope.

Strengths

  • Property-trap-on-window.gsap is exactly the right interception layer. Captures every call regardless of whether the user code goes through gsap.timeline() directly or via UMD's window.gsap assignment. The fact that GSAP isn't loaded yet when the stub runs is handled correctly via the configurable getter/setter.
  • Explicit hf-timelines-built event + init.ts deferral of bindRootTimelineIfAvailable() is the correct determinism model. The renderer can't race ahead — pollHfReady gates on __renderReady which only flips after the event fires. ✓
  • Build pipeline mirrors @hyperframes/core's runtime-inline.ts pattern — compiled stub source in stubs/hf-early-stub.ts, esbuild → IIFE → generated TS module exporting getHfEarlyStub(). The previous 138-line inline JS string is gone, which is a maintainability win.
  • Defensive try/catch around defineProperty + CustomEvent — handles non-Chrome runtimes (tests, jsdom) gracefully.
  • Batch size = 100/tick, rAF yield matches the design recommendation. Empirical "~4ms/batch" claim is consistent with GSAP's tween-registration cost on modern V8.

Concerns

1. proxy.add() doesn't unwrap proxy children — passes proxy objects to real.add()

The HF runtime in packages/core/src/runtime/init.ts:574-580 does:

const compositeTimeline = gsapApi.timeline({ paused: true });  // PROXY
for (const candidate of candidates) {
  compositeTimeline.add(candidate.timeline, ...);  // candidate.timeline is also a PROXY
}

And similar at :590-593 (fallbackTimeline.add(existingRootTimeline, 0)) and :633 (rootTimeline.add(candidate.timeline, startSec)).

Because user composition scripts get proxies from gsap.timeline(), and the runtime gets proxies from gsapApi.timeline() too, every .add() call composes proxy-into-proxy. The proxy's add() forwards args to real.add(...args) without unwrapping arg.__hfReal, so GSAP's real timeline ends up holding proxy references in its internal tween-graph linked list (_first/_next/_prev).

Empirically this likely works in your 8,562-tween test (the proxy's __hfReal has its tweens by the time bindRootTimelineIfAvailable fires, and proxy.duration() / proxy.seek() / proxy.totalTime() all forward correctly). But GSAP's internal iteration paths (e.g. getChildren(), internal label resolution, time-mapping) may misbehave on a proxy that lacks _dp / _first / _recent linkage in the way GSAP expects.

Suggested fix (one-liner in wrapTimeline()'s add method):

add(...args: unknown[]): TimelineProxy {
  const unwrapped = args.map((a) =>
    a && typeof a === "object" && "__hfReal" in (a as object)
      ? (a as TimelineProxy).__hfReal
      : a,
  );
  real.add(...unwrapped);
  return proxy;
},

This routes the real child timeline into GSAP's tween graph. Caller still gets proxy back for chaining.

If your test composition doesn't exercise the multi-sub-comp path (__timelines-registered children composed via init.ts:574), this bug stays dormant. Worth verifying explicitly before merging — pick a composition with 2+ sub-comps and check getChildren() returns sensible objects.

2. Value-returning setter methods leak the real timeline out of the proxy chain

totalTime(...args: unknown[]): unknown {
  return real.totalTime(...args);
},

GSAP's tl.totalTime(5) (setter form) returns this (the timeline) for chaining; tl.totalTime() (getter form) returns a number. The proxy unconditionally returns whatever GSAP returns. So a caller chaining tl.totalTime(5).to(el, {...}) gets the real timeline from .totalTime(5), then .to(...) runs synchronously against real — bypassing the batching.

Same applies to time(), paused(), timeScale(). The pattern is: when called with args (setter form), return proxy; when called without args (getter form), return the value.

totalTime(...args: unknown[]): unknown {
  const result = real.totalTime(...args);
  return args.length > 0 ? proxy : result;
},

Lower-impact than #1 (most callers don't chain past these), but architecturally cleaner.

Nice-to-haves (not blockers)

  • No unit tests for the batching logic. The proxy's chain semantics, the rAF flush loop, and the hf-timelines-built event dispatch all have correctness traps (above). One Vitest covering "queue 200 tweens, advance rAF, verify all bound + event fires" would lock in the contract.
  • No telemetry hook for tweenCount + initDurationMs. This investigation thread (PR description references #1231) hit a diagnostic wall because PostHog render_error events lack tween-count visibility. Even a single console.log({ event: "hf_timeline_batching_done", tweenCount, initDurationMs }) line at the queue-drained moment would let us spot future pathological compositions before they file issues.
  • kill() doesn't remove the proxy from activeProxies: small memory growth across long sessions. Probably never matters in render-per-process mode; matters for the studio preview path if a session creates+kills many timelines.
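As a rough sketch of the first two suggestions, the contract "queue 200 tweens, advance rAF, verify all bound + event fires" could be exercised with a manual frame queue, with a summary object emitted at the queue-drained moment. Everything here is hypothetical: `createFakeRaf`, `drainInBatches`, and the `tweenCount`/`batches` payload are illustrative names, and `initDurationMs` is left out because it would need an injected clock.

```typescript
// Manual stand-in for requestAnimationFrame so a test controls when frames run.
function createFakeRaf() {
  const callbacks: Array<() => void> = [];
  return {
    request: (cb: () => void): void => {
      callbacks.push(cb);
    },
    // Runs one pending frame; returns false when nothing was scheduled.
    tick: (): boolean => {
      const cb = callbacks.shift();
      if (cb) cb();
      return cb !== undefined;
    },
  };
}

interface BatchingSummary {
  tweenCount: number;
  batches: number;
}

// Drains `jobs` in fixed-size batches, one batch per frame, then reports a summary.
function drainInBatches(
  jobs: Array<() => void>,
  batchSize: number,
  request: (cb: () => void) => void,
  onBuilt: (summary: BatchingSummary) => void,
): void {
  const tweenCount = jobs.length;
  let batches = 0;
  const step = (): void => {
    jobs.splice(0, batchSize).forEach((job) => job());
    batches += 1;
    if (jobs.length > 0) request(step);
    else onBuilt({ tweenCount, batches });
  };
  request(step);
}

// "Queue 200 tweens, advance rAF, verify all bound + event fires."
const raf = createFakeRaf();
const progress = { bound: 0 };
const summaries: BatchingSummary[] = [];
const jobs = Array.from({ length: 200 }, () => () => {
  progress.bound += 1;
});
drainInBatches(jobs, 100, raf.request, (summary) => summaries.push(summary));

const boundBeforeAnyFrame = progress.bound;
raf.tick();
const boundAfterOneFrame = progress.bound;
raf.tick();
const boundAfterTwoFrames = progress.bound;
const framesLeft = raf.tick();
```

A real test would drive the actual stub rather than this simplified drain loop; the fake frame queue is the reusable part.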

Render-mode latency math

For the 8,562-tween case: 86 batches × ~16ms rAF gaps = ~1.4s additional render init time vs. the "do it all synchronously" baseline. For normal compositions (<100 tweens, single batch), the cost is one rAF gap ≈ 16ms. Acceptable trade for going from infinite hang → working render on the pathological case + negligible overhead on the common case.
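The arithmetic above can be checked with a small helper. It assumes one batch per frame and a fixed ~16ms frame gap, which are the figures used in this review; `estimateBatchingDelayMs` is a hypothetical name, and actual frame timing in render mode may differ.

```typescript
// Estimated extra init time when tweens are applied one batch per frame.
// Assumes a constant frame gap; the 100 and 16 defaults come from the review text.
function estimateBatchingDelayMs(tweenCount: number, batchSize = 100, frameMs = 16): number {
  const batches = Math.ceil(tweenCount / batchSize);
  return batches * frameMs;
}

const pathologicalCaseMs = estimateBatchingDelayMs(8562); // 86 batches × 16ms = 1376ms
const commonCaseMs = estimateBatchingDelayMs(99); // 1 batch × 16ms = 16ms
```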

Verdict

Sound architecture, correctly addresses the customer issue, and the build/inject pipeline is well-engineered. The two correctness concerns above (proxy-unwrap + setter-chain) are latent risks that may not bite in the test composition but could bite multi-sub-comp compositions. If concern #1 is verified to be a non-issue (or fixed), I'd say merge. If it does trip a real composition, it's a 5-line fix in wrapTimeline().

Posting as COMMENT so this doesn't block the customer-unblocking merge. Happy to follow up with the fix as a separate PR if you'd like.

— Rames Jusso

…args.length

Addresses two latent correctness concerns from code review:

1. proxy.add() now unwraps __hfReal from any proxy child before passing it
   to the real timeline. GSAP's internal tween graph (_first/_next/_prev
   linkage) requires real timeline instances — proxy objects lack internal
   fields like _dp that GSAP's iteration paths expect.

2. totalTime/time/paused/timeScale now return proxy when called in setter form
   (args.length > 0). Previously these returned the real timeline, causing
   callers who chain .to(...) after a setter call to bypass batching.

Also: build-hf-early-stub.ts now runs oxfmt on the generated output file
so the format check passes in CI on every build.
@miguel-heygen
Collaborator Author

Code review response

Both correctness concerns from the review have been addressed inline (commit de813adc):

1. proxy.add() now unwraps proxy children before passing them to the real timeline. Any argument that carries __hfReal is unwrapped so GSAP's internal tween graph holds real timeline references — proxy objects missing _dp etc. would have caused GSAP's internal iteration paths to misbehave with multi-sub-comp compositions.

2. Setter-form methods (totalTime/time/paused/timeScale) now return proxy when called with arguments (args.length > 0). Previously they returned the real timeline, leaking callers out of the batching chain if they chained .to(...) after a setter call (e.g. tl.totalTime(5).to(...)). Getter form (no args) still returns the raw value.


On the nice-to-haves:

  • Unit tests for batching logic — agreed, would be a solid follow-up. The stub runs in a browser context (depends on window, requestAnimationFrame, CustomEvent) so it needs a jsdom/happy-dom harness; out of scope for this fix PR but worth a dedicated issue.
  • Telemetry hook (tweenCount + initDurationMs) — also a good follow-up. Would close the diagnostic gap and give us early warning before a customer hits the hang again.
  • kill() not removing from activeProxies — acknowledged. The memory growth is bounded (proxies are short-lived per render), but cleaning up is cleaner. Tagged as follow-up.

Render-mode math matches — 8,562 / 100 × 16ms ≈ 1.4s is acceptable for the pathological case; normal compositions don't see it.

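import { execSync } from "node:child_process";
// Format the generated file so `oxfmt --check` passes in CI.
// Errors are intentionally swallowed: oxfmt is unavailable in some envs.
try {
  execSync(`bunx oxfmt ${outPath}`, { stdio: "ignore" });
} catch {
  // oxfmt could not run here; leave the generated file unformatted.
}
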
The HF_BRIDGE_SCRIPT duration getter now returns 0 whenever
window.__hfTimelinesBuilding is true (set by HF_EARLY_STUB while the rAF
batch loop is draining queued tl.to() calls).

pollHfReady in the engine polls until window.__hf.duration > 0, so
returning 0 keeps the engine waiting until the hf-timelines-built event
fires and all tweens are committed to the real GSAP timelines.

Without this gate, normal compositions (style-6, style-13, vignelli)
were being captured mid-batch — the real timelines were empty so GSAP
could not seek them, producing frozen/blank frames in the output video.
@miguel-heygen
Collaborator Author

Regression fix pushed (commit 1b3e1a4b)

The prior push introduced visual failures on style-6-prod, style-13-prod, and vignelli-stacking — all three rendered blank/frozen frames during sections that should show GSAP-animated content.

Root cause: HF_BRIDGE_SCRIPT's __hf.duration getter was returning the real timeline's duration (via p.getDuration()) even while __hfTimelinesBuilding was true. The engine's pollHfReady condition is window.__hf.duration > 0, so it immediately passed — but the real GSAP timelines were still empty mid-batch. Frame capture started against empty timelines → animations frozen.

Fix: gate the duration getter to return 0 while window.__hfTimelinesBuilding is true. This keeps pollHfReady spinning until hf-timelines-built fires and all tweens are committed to the real GSAP timelines. One-line change in HF_BRIDGE_SCRIPT.
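A minimal sketch of that gate, assuming a getter-backed `__hf.duration` on a host object. `installDurationGate`, `BridgeHost`, and `getRealDuration` are hypothetical names; the actual HF_BRIDGE_SCRIPT is not reproduced in this thread.

```typescript
// Hypothetical host shape; in the PR these live on window.
interface BridgeHost {
  __hfTimelinesBuilding?: boolean;
  __hf?: { readonly duration: number };
}

function installDurationGate(host: BridgeHost, getRealDuration: () => number): void {
  host.__hf = {
    // Report 0 while construction is still draining so a `duration > 0`
    // readiness poll keeps waiting instead of capturing empty timelines.
    get duration(): number {
      return host.__hfTimelinesBuilding ? 0 : getRealDuration();
    },
  };
}

// Usage: the real timeline already reports 12.5s, but batching is not done yet.
const bridgeHost: BridgeHost = { __hfTimelinesBuilding: true };
installDurationGate(bridgeHost, () => 12.5);
const durationWhileBuilding = bridgeHost.__hf?.duration;

bridgeHost.__hfTimelinesBuilding = false; // the stub clears the flag once drained
const durationAfterBuilt = bridgeHost.__hf?.duration;
```

Because the getter re-reads the flag on every access, the poll sees the real duration as soon as the flag clears, with no extra notification needed.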

Note: gsap-letters-render-compat was already passing in CI — the fix is confirmed not to break that test, and the pollHfReady wait overhead for normal compositions (< 100 tweens) is a single rAF frame (~16ms).

@miguel-heygen miguel-heygen merged commit ebd156b into main Jun 7, 2026
63 checks passed
@miguel-heygen miguel-heygen deleted the fix/gsap-tween-count-hang branch June 7, 2026 13:31
Development

Successfully merging this pull request may close these issues.

v0.6.74 regression: CLI render hangs on "Initializing calibration session..." (macOS M4) (edge case)

3 participants