
perf(player): p0-1b perf tests for fps, scrub latency, and media sync drift#400

Open
vanceingalls wants to merge 2 commits into perf/p0-1a-perf-test-infra from perf/p0-1b-perf-tests-for-fps-scrub-drift

Conversation

@vanceingalls
Collaborator

@vanceingalls vanceingalls commented Apr 21, 2026

Summary

Second slice of P0-1 from the player perf proposal: plugs the three steady-state scenarios — sustained playback FPS, scrub latency, and media-sync drift — into the perf gate that landed in #399. Adds the multi-video fixture they all share, wires three new shards into CI, and seeds one new baseline (droppedFramesMax).

Why

#399 stood up the harness and proved it with a single load-time scenario. By itself that's enough to catch regressions in initial composition setup, but it can't catch the things players actually fail at in production: dropped frames during sustained multi-video playback, slow scrub response, and media drifting out of sync with the composition clock.

Each of these is a target metric in the proposal with a concrete budget. This PR turns those budgets into gated CI signals and produces continuous data for them on every player/core/runtime change.

What changed

Fixture — packages/player/tests/perf/fixtures/10-video-grid/

  • index.html: 10-second composition, 1920×1080, 30 fps, with 10 simultaneously-decoding video tiles in a 5×2 grid plus a subtle GSAP scale "breath" on each tile (so the rAF/RVFC loops have real work to do without GSAP dominating the budget the decoder needs).
  • sample.mp4: small (~190 KB) clip checked in so the fixture is hermetic — no external CDN dependency, identical bytes on every run.
  • Same data-composition-id="main" host pattern as gsap-heavy, so the existing harness loader works without changes.

02-fps.ts — sustained playback frame rate

  • Loads 10-video-grid, calls player.play(), samples requestAnimationFrame callbacks inside the iframe for 5 s.
  • Crucial sequencing: install the rAF sampler before play(), wait for __player.isPlaying() === true, then reset the sample buffer — otherwise the postMessage round-trip ramp-up window drags the average down by 5–10 fps.
  • FPS = (samples − 1) / (lastTs − firstTs in s); uses rAF timestamps (the same ones the compositor saw) rather than wall-clock setTimeout, so we're measuring real frame production.
  • Dropped-frame definition matches Chrome DevTools: gap > 1.5× (1000/60 ms) ≈ 25 ms = "missed at least one vsync."
  • Aggregation across runs: min(fps) and max(droppedFrames) — worst case wins, since the proposal asserts a floor on fps and a ceiling on drops.
  • Emits playback_fps_min (higher-is-better, baseline fpsMin = 55) and playback_dropped_frames_max (lower-is-better, baseline droppedFramesMax = 3).
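The FPS and dropped-frame definitions above reduce to a small pure function. This is an illustrative sketch, not the actual 02-fps.ts (which is not shown here); only the (samples − 1)/elapsed formula and the 1.5× vsync threshold come from the description:

```typescript
// Hypothetical summarizer over rAF timestamps (ms). Mirrors the metric
// definitions described above; the real 02-fps.ts may differ in structure.
const TARGET_FRAME_MS = 1000 / 60;
const DROP_THRESHOLD_MS = TARGET_FRAME_MS * 1.5; // ≈ 25 ms, DevTools-style "missed at least one vsync"

function summarizeRafSamples(timestamps: number[]): { fps: number; droppedFrames: number } {
  if (timestamps.length < 2) return { fps: 0, droppedFrames: 0 };
  // FPS = (samples − 1) / elapsed seconds, using the rAF timestamps themselves.
  const elapsedS = (timestamps[timestamps.length - 1] - timestamps[0]) / 1000;
  const fps = (timestamps.length - 1) / elapsedS;
  let droppedFrames = 0;
  for (let i = 1; i < timestamps.length; i++) {
    // A gap wider than 1.5× the 60fps frame budget counts as a drop.
    if (timestamps[i] - timestamps[i - 1] > DROP_THRESHOLD_MS) droppedFrames += 1;
  }
  return { fps, droppedFrames };
}
```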

04-scrub.ts — scrub latency, inline + isolated

  • Loads 10-video-grid, pauses, then issues 10 seek calls in two batches: first the synchronous inline path (<hyperframes-player>'s default same-origin _trySyncSeek), then the isolated path (forced by replacing _trySyncSeek with () => false, which makes the player fall back to the postMessage _sendControl("seek") bridge that cross-origin embeds and builds from before #397 (feat(player): synchronous seek() API with same-origin detection) use).
  • Inline runs first so the isolated mode's monkey-patch can't bleed back into the inline samples.
  • Detection: a rAF watcher inside the iframe polls __player.getTime() until it's within MATCH_TOLERANCE_S = 0.05 s of the requested target. Tolerance exists because the postMessage bridge converts seconds → frame number → seconds, and that round-trip can introduce sub-frame quantization drift even for targets on the canonical fps grid.
  • Timing: performance.timeOrigin + performance.now() in both contexts. timeOrigin is consistent across same-process frames, so t1 − t0 is a true wall-clock latency, not a host-only or iframe-only stopwatch.
  • Targets alternate forward/backward (1.0, 7.0, 2.0, 8.0, 3.0, 9.0, 4.0, 6.0, 5.0, 0.5) so no two consecutive seeks land near each other — protects the rAF watcher from matching against a stale getTime() value before the seek command is processed.
  • Aggregation: percentile(95) across the pooled per-seek latencies from every run. With 10 seeks × 2 modes × 3 runs we get 30 samples per mode per CI shard, enough for a stable p95.
  • Emits scrub_latency_p95_inline_ms (lower-is-better, baseline scrubLatencyP95InlineMs = 33) and scrub_latency_p95_isolated_ms (lower-is-better, baseline scrubLatencyP95IsolatedMs = 80).
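The pooled-p95 aggregation can be illustrated with a nearest-rank percentile helper. This is a sketch under one common percentile definition; the harness's actual percentile() and its interpolation rule are not shown in this PR:

```typescript
// Nearest-rank percentile over pooled per-seek latencies (ms).
// The harness's own percentile() may interpolate differently.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("percentile of empty sample set");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based nearest rank
  return sorted[Math.min(sorted.length, Math.max(1, rank)) - 1];
}
```

With 30 pooled samples, p95 lands on the 29th sorted value (rank ⌈0.95 × 30⌉ = 29), which is why a single high outlier near the top of the sort can swing the reported p95.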

05-drift.ts — media sync drift

  • Loads 10-video-grid, plays 6 s, instruments every video[data-start] element with requestVideoFrameCallback. Each callback records (compositionTime, actualMediaTime) plus a snapshot of the clip transform (clipStart, clipMediaStart, clipPlaybackRate).
  • Drift = |actualMediaTime − ((compTime − clipStart) × clipPlaybackRate + clipMediaStart)| — the same transform the runtime applies in packages/core/src/runtime/media.ts, snapshotted once at sampler install so the per-frame work is just subtract + multiply + abs.
  • Sustain window is 6 s (not the proposal's 10 s) because the fixture composition is exactly 10 s long and we want headroom before the end-of-timeline pause/clamp behavior. With 10 videos × ~25 fps × 6 s we still pool ~1500 samples per run — more than enough for a stable p95.
  • Same "reset buffer after play confirmed" gotcha as 02-fps.ts: frames captured during the postMessage round-trip would compare a non-zero mediaTime against getTime() === 0 and inflate drift by hundreds of ms.
  • Aggregation: max() and percentile(95) across the pooled per-frame drifts. The proposal's max-drift ceiling of 500 ms is intentional — the runtime hard-resyncs when |currentTime − relTime| > 0.5 s, so a regression past 500 ms means the corrective resync kicked in and the viewer saw a jump.
  • Emits media_drift_max_ms (lower-is-better, baseline driftMaxMs = 500) and media_drift_p95_ms (lower-is-better, baseline driftP95Ms = 100).
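The per-frame drift transform above comes down to a subtract, a multiply, and an abs over the snapshotted clip fields. An illustrative sketch (field names follow the clip-snapshot description; this is not the actual 05-drift.ts code):

```typescript
// Snapshot of the clip transform, captured once at sampler install.
interface ClipSnapshot {
  clipStart: number;        // composition time (s) at which the clip begins
  clipMediaStart: number;   // media time (s) the clip starts from
  clipPlaybackRate: number; // clip-local playback rate multiplier
}

// Drift = |actualMediaTime − ((compTime − clipStart) × clipPlaybackRate + clipMediaStart)|,
// per the formula above. Per-frame cost is just subtract + multiply + abs.
function driftSeconds(compTime: number, actualMediaTime: number, clip: ClipSnapshot): number {
  const expected = (compTime - clip.clipStart) * clip.clipPlaybackRate + clip.clipMediaStart;
  return Math.abs(actualMediaTime - expected);
}
```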

Wiring

  • packages/player/tests/perf/index.ts: add fps, scrub, drift to ScenarioId, DEFAULT_RUNS, the default scenario list (--scenarios defaults to all four), and three new dispatch branches.
  • packages/player/tests/perf/perf-gate.ts: add droppedFramesMax: number to PerfBaseline. Other baseline keys for these scenarios were already seeded in #399 (perf(player): p0-1a perf test infra + composition-load smoke test).
  • packages/player/tests/perf/baseline.json: add droppedFramesMax: 3.
  • .github/workflows/player-perf.yml: three new matrix shards (fps / scrub / drift) at runs: 3. Same paths-filter and same artifact-upload pattern as the load shard, so the summary job aggregates them automatically.

Methodology highlights

These three patterns recur in all three scenarios and are worth noting because they're load-bearing for the numbers we report:

  1. Reset buffer after play-confirmed. The play() API is async (postMessage), so any samples captured before __player.isPlaying() === true belong to ramp-up, not steady-state. Both 02-fps and 05-drift clear __perfRafSamples / __perfDriftSamples after the wait. Without this, fps drops 5–10 and drift inflates by hundreds of ms.
  2. Iframe-side timing. All three scenarios time inside the iframe (performance.timeOrigin + performance.now() for scrub, rAF/RVFC timestamps for fps/drift) rather than host-side. The iframe is what the user sees; host-side timing would conflate Puppeteer's IPC overhead with real player latency.
  3. Stop sampling before pause. Sampler is deactivated before pause() is issued, so the pause command's postMessage round-trip can't perturb the tail of the measurement window.
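Pattern 1's sequencing can be sketched as follows. The helper and its parameter names are hypothetical; only __player.isPlaying() and the sample-buffer reset come from the description above:

```typescript
// Illustrative "reset buffer after play-confirmed" sequencing.
// play() resolves via an async postMessage bridge, so poll until the player
// confirms it is actually playing, then discard the ramp-up samples.
async function startSteadyStateSampling(opts: {
  play: () => void;          // kicks off playback (async under the hood)
  isPlaying: () => boolean;  // e.g. the iframe's __player.isPlaying()
  clearSamples: () => void;  // e.g. resets __perfRafSamples / __perfDriftSamples
  pollMs?: number;
}): Promise<void> {
  const { play, isPlaying, clearSamples, pollMs = 10 } = opts;
  play();
  while (!isPlaying()) {
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
  // Everything captured before this point belongs to ramp-up, not steady-state.
  clearSamples();
}
```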

Test plan

  • Local: bun run player:perf runs all four scenarios end-to-end on the 10-video-grid fixture.
  • Each scenario produces metrics matching its declared baselineKey so perf-gate.ts can find them.
  • Typecheck, lint, format pass on the new files.
  • Existing player unit tests untouched (no production code changes in this PR).
  • First CI run will confirm the new shards complete inside the workflow timeout and that the summary job picks up their metrics.json artifacts.

Stack

Step P0-1b of the player perf proposal. Builds on #399 (P0-1a: perf test infra).

Followed by:

Collaborator

@jrusso1020 jrusso1020 left a comment


Scenario design is careful and the methodology notes in each file are a good sign — the "alternate forward/backward seek targets so the rAF watcher doesn't match a stale getTime() value" trick in 04-scrub.ts is exactly the kind of detail that makes or breaks a microbenchmark. Same for the "install the rAF watcher before play() and pause() in the same tick to freeze at the captured time" pattern, which I assume 05-drift.ts uses.

The scrub_latency_p95_inline vs scrub_latency_p95_isolated split directly pins the value of #397's sync path as a measurable metric — monkey-patching _trySyncSeek to () => false to force the postMessage path in the same page load is a clean way to separate the modes without needing two separate runs.

Three non-blocking observations:

  1. With runs: 3 and 10 seeks per mode per run, that's 30 samples per mode per shard for p95. That's on the edge of "stable enough" — a single outlier at index 28/29 can swing the p95 by tens of ms. Not a blocker (you're in measure mode), but if you see p95 flapping in the first few enforcement cycles, bumping to runs: 5 is the cheapest fix.

  2. MATCH_TOLERANCE_S = 0.05 is generous. On tight-latency scrubs, a 50ms tolerance window between seek command and confirmation paint could mask a legitimate regression where the measured latency is ~30ms but the tolerance swallows the last rAF. Worth revisiting once real baselines land.

  3. The drift scenario (which I haven't read line-by-line) is the one most likely to produce flaky signal, since it's inherently long-running. Keep an eye on its coefficient of variation over the first week — if it's >20%, that's the signal to tighten the driftMaxMs/driftP95Ms baselines and investigate whether there's a non-deterministic timing source in the runtime.

Approved.

Rames Jusso

@vanceingalls vanceingalls force-pushed the perf/p0-1b-perf-tests-for-fps-scrub-drift branch from acdf9af to 111e128 on April 22, 2026 00:43
@vanceingalls vanceingalls force-pushed the perf/p0-1a-perf-test-infra branch from 725bc89 to 0af9ce7 on April 22, 2026 00:43
Collaborator

@miguel-heygen miguel-heygen left a comment


I pulled this stack into a local worktree, ran the perf harness end-to-end, and compared the top branch against main by swapping the built player/runtime artifacts under the same harness. The scrub/drift gains are real but modest: inline scrub p95 improved from 7.2ms on main to 7.0ms here, isolated scrub p95 improved from 8.2ms to 7.1ms, drift max improved from 25.67ms to 24.67ms, and drift p95 improved from 25.33ms to 24.0ms.

The blocking issue is the FPS scenario. packages/player/tests/perf/scenarios/02-fps.ts is measuring raw requestAnimationFrame cadence and then comparing it against a 60fps target. On my machine both main and this branch reported ~120fps for the same fixture, which means the metric is saturating to browser/display cadence rather than proving “player sustained 60fps playback.” With the current implementation, a high-refresh runner can make fpsMin: 55 in baseline.json look comfortably green without actually telling us whether playback stayed near the intended 60fps budget.

I’d like to see this normalized to a refresh-rate-independent signal before we merge the scenario as a regression gate. Concretely: either derive the metric from missed target intervals against the 60fps composition clock, or capture an effective render cadence that is explicitly bounded to the fixture/runtime target instead of host rAF speed.

Once that part is fixed, I’m comfortable with the rest of the scenario design. The alternating seek targets, iframe-side timing, and drift sampling approach all looked sound in local runs.

@vanceingalls vanceingalls changed the base branch from perf/p0-1a-perf-test-infra to graphite-base/400 on April 22, 2026 00:56
@vanceingalls vanceingalls force-pushed the perf/p0-1b-perf-tests-for-fps-scrub-drift branch from 111e128 to 306c164 on April 22, 2026 00:57
@vanceingalls vanceingalls changed the base branch from graphite-base/400 to perf/p0-1a-perf-test-infra on April 22, 2026 00:57
@jrusso1020
Collaborator

Following up on @miguel-heygen's stress-test finding — agreed that the current FPS scenario saturates at the runner's display refresh. On a 120Hz runner, requestAnimationFrame hands back ticks at ~8.3ms intervals regardless of whether the player's composition loop is running at the intended 60fps or is silently stalling between frames, so fpsMin: 55 passes trivially. I missed this in my approval — Miguel is right that it needs to change before it gates merges.

A few approaches that remove the refresh-rate dependency:

1. Composition-time-advanced-per-wall-second (my first choice). In the iframe, sample __player.getTime() at regular wall-clock intervals (say every 100ms via setInterval) across the measurement window. The emitted metric is (finalGetTime - initialGetTime) / wallClockSeconds. When the player keeps up with the composition clock it reads 1.0 ± jitter; when it falls behind (slow decoder, blocked main thread) it drops below 1.0. Display rate drops out of the equation entirely because we're comparing two timestamps that both live in the composition's frame of reference against real wall time, not counting rAF ticks.

Bonus: this is the metric that actually answers "did the composition play at its intended speed," which is the user-observable thing. Display refresh only matters if it's lower than the composition fps — at 60Hz with a 60fps composition the metric would still read ~1.0; at 30Hz displaying a 60fps composition it'd read ~0.5 and legitimately flag the bad experience.

2. Missed-deadline rate. In the iframe's rAF loop, count ticks where the delta since the previous tick exceeded (1000 / target_fps) * 1.2 — i.e., late by more than 20% of the per-frame budget. Metric is missedDeadlines / totalFrames. Bounded and refresh-independent (a 120Hz runner just gets more samples per wall second, with the same passing rate if the player keeps up).

3. PerformanceObserver + frame-timing. new PerformanceObserver({ type: "frame" }) gives you actual frame-timing entries from Chrome with startTime and renderStart. More reliable than rAF but more complex to wire inside the iframe. Probably overkill for this first version.

Option 1 is the simplest and the most directly answers "is the player sustaining playback" — the metric has a physical interpretation rather than being a threshold crossing. Happy to re-review once this lands.
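Under the names in option 1 above, the emitted metric is a one-liner over the first and last samples. A sketch (the sampler shape and field names are illustrative):

```typescript
// One sample = (wall-clock ms, __player.getTime() in composition seconds),
// taken at a fixed setInterval cadence inside the iframe.
interface TimeSample { wallMs: number; compTimeS: number; }

// (finalGetTime − initialGetTime) / wallClockSeconds: reads ≈1.0 when the
// player keeps up with the composition clock, <1.0 when it falls behind.
// Display refresh rate never enters the computation.
function advancementRatio(samples: TimeSample[]): number {
  if (samples.length < 2) throw new Error("need at least two samples");
  const first = samples[0];
  const last = samples[samples.length - 1];
  const wallS = (last.wallMs - first.wallMs) / 1000;
  return (last.compTimeS - first.compTimeS) / wallS;
}
```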

Baseline-wise: fpsMin: 55 would become something like compositionTimeAdvancementRatioMin: 0.95 (or pick your tolerance) and the direction flips — still lower-is-worse, so higher-is-better in the perf-gate. No other changes needed to the harness.

Everything else in the scenario design — alternating seek targets, in-tick pause, drift sampling — held up under Miguel's local run and my static read, so I think just the fps metric needs rework. The rest of my non-blocking notes (samples/shard, tolerance windows) stand but are secondary.

Rames Jusso

…tio metric

Addresses blocking PR feedback on #400 from miguel-heygen and jrusso1020:
the previous FPS metric measured raw rAF cadence and was refresh-rate
dependent (a 30fps composition would always 'pass' on a 60Hz display,
because the metric reported rAF ticks, not composition frames).

- 02-fps.ts: re-implemented to sample __player.getTime() at 100ms wall
  intervals and emit (deltaCompTime / wallSeconds). New metric is
  refresh-rate independent and measures what we actually care about:
  whether the player keeps up with composition-time playback.
- perf-gate.ts: replaced fpsMin / droppedFramesMax with
  compositionTimeAdvancementRatioMin (higher-is-better, target 0.95).
- baseline.json: updated to match the new metric key + threshold.
- 04-scrub.ts: documented MATCH_TOLERANCE_S rationale (frame quantization
  on postMessage, sub-frame intra-clip advance, runner jitter) + TODO to
  tighten once we have CI baseline data.
- 05-drift.ts: log coefficient of variation as a soft monitoring signal
  (not gated) + TODO to decide whether to publish it as a tracked metric.
- index.ts: documented DEFAULT_RUNS rationale (load=5 because p95 over
  n=3 is just max; fps/scrub/drift=3 because they pool samples across
  runs) + TODO to revisit fps=3 after collecting CI baseline data.
- index.ts: removed dead reference to docs/internal/player-perf-baselines.md.