
perf(player): p0-1b perf tests for fps, scrub latency, and media sync drift#400

Open
vanceingalls wants to merge 2 commits into perf/p0-1a-perf-test-infra from perf/p0-1b-perf-tests-for-fps-scrub-drift

Conversation

@vanceingalls
Collaborator

@vanceingalls vanceingalls commented Apr 21, 2026

Summary

Second slice of P0-1 from the player perf proposal: plugs the three steady-state scenarios — sustained playback FPS, scrub latency, and media-sync drift — into the perf gate that landed in #399. Adds the multi-video fixture they all share, wires three new shards into CI, and seeds one new baseline (droppedFramesMax).

Why

#399 stood up the harness and proved it with a single load-time scenario. By itself that's enough to catch regressions in initial composition setup, but it can't catch the things players actually fail at in production: dropped frames during sustained multi-video playback, slow scrub response, and media drifting out of sync with the composition clock.

Each of these is a target metric in the proposal with a concrete budget. This PR turns those budgets into gated CI signals and produces continuous data for them on every player/core/runtime change.

What changed

Fixture — packages/player/tests/perf/fixtures/10-video-grid/

  • index.html: 10-second composition, 1920×1080, 30 fps, with 10 simultaneously-decoding video tiles in a 5×2 grid plus a subtle GSAP scale "breath" on each tile (so the rAF/RVFC loops have real work to do without GSAP dominating the budget the decoder needs).
  • sample.mp4: small (~190 KB) clip checked in so the fixture is hermetic — no external CDN dependency, identical bytes on every run.
  • Same data-composition-id="main" host pattern as gsap-heavy, so the existing harness loader works without changes.

02-fps.ts — sustained playback frame rate

  • Loads 10-video-grid, calls player.play(), samples requestAnimationFrame callbacks inside the iframe for 5 s.
  • Crucial sequencing: install the rAF sampler before play(), wait for __player.isPlaying() === true, then reset the sample buffer — otherwise the postMessage round-trip ramp-up window drags the average down by 5–10 fps.
  • FPS = (samples − 1) / (lastTs − firstTs in s); uses rAF timestamps (the same ones the compositor saw) rather than wall-clock setTimeout, so we're measuring real frame production.
  • Dropped-frame definition matches Chrome DevTools: gap > 1.5× (1000/60 ms) ≈ 25 ms = "missed at least one vsync."
  • Aggregation across runs: min(fps) and max(droppedFrames) — worst case wins, since the proposal asserts a floor on fps and a ceiling on drops.
  • Emits playback_fps_min (higher-is-better, baseline fpsMin = 55) and playback_dropped_frames_max (lower-is-better, baseline droppedFramesMax = 3).
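The FPS and dropped-frame definitions above reduce to a small pure function. This is an illustrative sketch, not the actual 02-fps.ts (which is not shown here); only the (samples − 1)/elapsed formula and the 1.5× vsync threshold come from the description:

```typescript
// Hypothetical summarizer over rAF timestamps (ms). Mirrors the metric
// definitions described above; the real 02-fps.ts may differ in structure.
const TARGET_FRAME_MS = 1000 / 60;
const DROP_THRESHOLD_MS = TARGET_FRAME_MS * 1.5; // ≈ 25 ms, DevTools-style "missed at least one vsync"

function summarizeRafSamples(timestamps: number[]): { fps: number; droppedFrames: number } {
  if (timestamps.length < 2) return { fps: 0, droppedFrames: 0 };
  // FPS = (samples − 1) / elapsed seconds, using the rAF timestamps themselves.
  const elapsedS = (timestamps[timestamps.length - 1] - timestamps[0]) / 1000;
  const fps = (timestamps.length - 1) / elapsedS;
  let droppedFrames = 0;
  for (let i = 1; i < timestamps.length; i++) {
    // A gap wider than 1.5× the 60fps frame budget counts as a drop.
    if (timestamps[i] - timestamps[i - 1] > DROP_THRESHOLD_MS) droppedFrames += 1;
  }
  return { fps, droppedFrames };
}
```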

04-scrub.ts — scrub latency, inline + isolated

  • Loads 10-video-grid, pauses, then issues 10 seek calls in two batches: first the synchronous inline path (<hyperframes-player>'s default same-origin _trySyncSeek), then the isolated path (forced by replacing _trySyncSeek with () => false, which makes the player fall back to the postMessage _sendControl("seek") bridge that cross-origin embeds and builds from before #397 (feat(player): synchronous seek() API with same-origin detection) use).
  • Inline runs first so the isolated mode's monkey-patch can't bleed back into the inline samples.
  • Detection: a rAF watcher inside the iframe polls __player.getTime() until it's within MATCH_TOLERANCE_S = 0.05 s of the requested target. Tolerance exists because the postMessage bridge converts seconds → frame number → seconds, and that round-trip can introduce sub-frame quantization drift even for targets on the canonical fps grid.
  • Timing: performance.timeOrigin + performance.now() in both contexts. timeOrigin is consistent across same-process frames, so t1 − t0 is a true wall-clock latency, not a host-only or iframe-only stopwatch.
  • Targets alternate forward/backward (1.0, 7.0, 2.0, 8.0, 3.0, 9.0, 4.0, 6.0, 5.0, 0.5) so no two consecutive seeks land near each other — protects the rAF watcher from matching against a stale getTime() value before the seek command is processed.
  • Aggregation: percentile(95) across the pooled per-seek latencies from every run. With 10 seeks × 2 modes × 3 runs we get 30 samples per mode per CI shard, enough for a stable p95.
  • Emits scrub_latency_p95_inline_ms (lower-is-better, baseline scrubLatencyP95InlineMs = 33) and scrub_latency_p95_isolated_ms (lower-is-better, baseline scrubLatencyP95IsolatedMs = 80).
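The pooled-p95 aggregation can be illustrated with a nearest-rank percentile helper. This is a sketch under one common percentile definition; the harness's actual percentile() and its interpolation rule are not shown in this PR:

```typescript
// Nearest-rank percentile over pooled per-seek latencies (ms).
// The harness's own percentile() may interpolate differently.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("percentile of empty sample set");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based nearest rank
  return sorted[Math.min(sorted.length, Math.max(1, rank)) - 1];
}
```

With 30 pooled samples, p95 lands on the 29th sorted value (rank ⌈0.95 × 30⌉ = 29), which is why a single high outlier near the top of the sort can swing the reported p95.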

05-drift.ts — media sync drift

  • Loads 10-video-grid, plays 6 s, instruments every video[data-start] element with requestVideoFrameCallback. Each callback records (compositionTime, actualMediaTime) plus a snapshot of the clip transform (clipStart, clipMediaStart, clipPlaybackRate).
  • Drift = |actualMediaTime − ((compTime − clipStart) × clipPlaybackRate + clipMediaStart)| — the same transform the runtime applies in packages/core/src/runtime/media.ts, snapshotted once at sampler install so the per-frame work is just subtract + multiply + abs.
  • Sustain window is 6 s (not the proposal's 10 s) because the fixture composition is exactly 10 s long and we want headroom before the end-of-timeline pause/clamp behavior. With 10 videos × ~25 fps × 6 s we still pool ~1500 samples per run — more than enough for a stable p95.
  • Same "reset buffer after play confirmed" gotcha as 02-fps.ts: frames captured during the postMessage round-trip would compare a non-zero mediaTime against getTime() === 0 and inflate drift by hundreds of ms.
  • Aggregation: max() and percentile(95) across the pooled per-frame drifts. The proposal's max-drift ceiling of 500 ms is intentional — the runtime hard-resyncs when |currentTime − relTime| > 0.5 s, so a regression past 500 ms means the corrective resync kicked in and the viewer saw a jump.
  • Emits media_drift_max_ms (lower-is-better, baseline driftMaxMs = 500) and media_drift_p95_ms (lower-is-better, baseline driftP95Ms = 100).
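The per-frame drift transform above comes down to a subtract, a multiply, and an abs over the snapshotted clip fields. An illustrative sketch (field names follow the clip-snapshot description; this is not the actual 05-drift.ts code):

```typescript
// Snapshot of the clip transform, captured once at sampler install.
interface ClipSnapshot {
  clipStart: number;        // composition time (s) at which the clip begins
  clipMediaStart: number;   // media time (s) the clip starts from
  clipPlaybackRate: number; // clip-local playback rate multiplier
}

// Drift = |actualMediaTime − ((compTime − clipStart) × clipPlaybackRate + clipMediaStart)|,
// per the formula above. Per-frame cost is just subtract + multiply + abs.
function driftSeconds(compTime: number, actualMediaTime: number, clip: ClipSnapshot): number {
  const expected = (compTime - clip.clipStart) * clip.clipPlaybackRate + clip.clipMediaStart;
  return Math.abs(actualMediaTime - expected);
}
```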

Wiring

  • packages/player/tests/perf/index.ts: add fps, scrub, drift to ScenarioId, DEFAULT_RUNS, the default scenario list (--scenarios defaults to all four), and three new dispatch branches.
  • packages/player/tests/perf/perf-gate.ts: add droppedFramesMax: number to PerfBaseline. Other baseline keys for these scenarios were already seeded in #399 (perf(player): p0-1a perf test infra + composition-load smoke test).
  • packages/player/tests/perf/baseline.json: add droppedFramesMax: 3.
  • .github/workflows/player-perf.yml: three new matrix shards (fps / scrub / drift) at runs: 3. Same paths-filter and same artifact-upload pattern as the load shard, so the summary job aggregates them automatically.

Methodology highlights

These three patterns recur in all three scenarios and are worth noting because they're load-bearing for the numbers we report:

  1. Reset buffer after play-confirmed. The play() API is async (postMessage), so any samples captured before __player.isPlaying() === true belong to ramp-up, not steady-state. Both 02-fps and 05-drift clear __perfRafSamples / __perfDriftSamples after the wait. Without this, fps drops 5–10 and drift inflates by hundreds of ms.
  2. Iframe-side timing. All three scenarios time inside the iframe (performance.timeOrigin + performance.now() for scrub, rAF/RVFC timestamps for fps/drift) rather than host-side. The iframe is what the user sees; host-side timing would conflate Puppeteer's IPC overhead with real player latency.
  3. Stop sampling before pause. Sampler is deactivated before pause() is issued, so the pause command's postMessage round-trip can't perturb the tail of the measurement window.
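Pattern 1's sequencing can be sketched as follows. The helper and its parameter names are hypothetical; only __player.isPlaying() and the sample-buffer reset come from the description above:

```typescript
// Illustrative "reset buffer after play-confirmed" sequencing.
// play() resolves via an async postMessage bridge, so poll until the player
// confirms it is actually playing, then discard the ramp-up samples.
async function startSteadyStateSampling(opts: {
  play: () => void;          // kicks off playback (async under the hood)
  isPlaying: () => boolean;  // e.g. the iframe's __player.isPlaying()
  clearSamples: () => void;  // e.g. resets __perfRafSamples / __perfDriftSamples
  pollMs?: number;
}): Promise<void> {
  const { play, isPlaying, clearSamples, pollMs = 10 } = opts;
  play();
  while (!isPlaying()) {
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
  // Everything captured before this point belongs to ramp-up, not steady-state.
  clearSamples();
}
```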

Test plan

  • Local: bun run player:perf runs all four scenarios end-to-end on the 10-video-grid fixture.
  • Each scenario produces metrics matching its declared baselineKey so perf-gate.ts can find them.
  • Typecheck, lint, format pass on the new files.
  • Existing player unit tests untouched (no production code changes in this PR).
  • First CI run will confirm the new shards complete inside the workflow timeout and that the summary job picks up their metrics.json artifacts.

Stack

Step P0-1b of the player perf proposal. Builds on #399 (P0-1a: perf test infra).

Followed by:

Collaborator

@jrusso1020 jrusso1020 left a comment


Scenario design is careful and the methodology notes in each file are a good sign — the "alternate forward/backward seek targets so the rAF watcher doesn't match a stale getTime() value" trick in 04-scrub.ts is exactly the kind of detail that makes or breaks a microbenchmark. Same for the "install the rAF watcher before play() and pause() in the same tick to freeze at the captured time" pattern, which I assume 05-drift.ts uses.

The scrub_latency_p95_inline vs scrub_latency_p95_isolated split directly pins the value of #397's sync path as a measurable metric — monkey-patching _trySyncSeek to () => false to force the postMessage path in the same page load is a clean way to separate the modes without needing two separate runs.

Three non-blocking observations:

  1. With runs: 3 and 10 seeks per mode per run, that's 30 samples per mode per shard for p95. That's on the edge of "stable enough" — a single outlier at index 28/29 can swing the p95 by tens of ms. Not a blocker (you're in measure mode), but if you see p95 flapping in the first few enforcement cycles, bumping to runs: 5 is the cheapest fix.

  2. MATCH_TOLERANCE_S = 0.05 is generous. On tight-latency scrubs, a 50ms tolerance window between seek command and confirmation paint could mask a legitimate regression where the measured latency is ~30ms but the tolerance swallows the last rAF. Worth revisiting once real baselines land.

  3. The drift scenario (which I haven't read line-by-line) is the one most likely to produce flaky signal, since it's inherently long-running. Keep an eye on its coefficient of variation over the first week — if it's >20%, that's the signal to tighten the driftMaxMs/driftP95Ms baselines and investigate whether there's a non-deterministic timing source in the runtime.

Approved.

Rames Jusso

@vanceingalls vanceingalls force-pushed the perf/p0-1b-perf-tests-for-fps-scrub-drift branch from acdf9af to 111e128 on April 22, 2026 00:43
@vanceingalls vanceingalls force-pushed the perf/p0-1a-perf-test-infra branch from 725bc89 to 0af9ce7 on April 22, 2026 00:43
Collaborator

@miguel-heygen miguel-heygen left a comment


I pulled this stack into a local worktree, ran the perf harness end-to-end, and compared the top branch against main by swapping the built player/runtime artifacts under the same harness. The scrub/drift gains are real but modest: inline scrub p95 improved from 7.2ms on main to 7.0ms here, isolated scrub p95 improved from 8.2ms to 7.1ms, drift max improved from 25.67ms to 24.67ms, and drift p95 improved from 25.33ms to 24.0ms.

The blocking issue is the FPS scenario. packages/player/tests/perf/scenarios/02-fps.ts is measuring raw requestAnimationFrame cadence and then comparing it against a 60fps target. On my machine both main and this branch reported ~120fps for the same fixture, which means the metric is saturating to browser/display cadence rather than proving “player sustained 60fps playback.” With the current implementation, a high-refresh runner can make fpsMin: 55 in baseline.json look comfortably green without actually telling us whether playback stayed near the intended 60fps budget.

I’d like to see this normalized to a refresh-rate-independent signal before we merge the scenario as a regression gate. Concretely: either derive the metric from missed target intervals against the 60fps composition clock, or capture an effective render cadence that is explicitly bounded to the fixture/runtime target instead of host rAF speed.

Once that part is fixed, I’m comfortable with the rest of the scenario design. The alternating seek targets, iframe-side timing, and drift sampling approach all looked sound in local runs.

@vanceingalls vanceingalls changed the base branch from perf/p0-1a-perf-test-infra to graphite-base/400 on April 22, 2026 00:56
@vanceingalls vanceingalls force-pushed the perf/p0-1b-perf-tests-for-fps-scrub-drift branch from 111e128 to 306c164 on April 22, 2026 00:57
@vanceingalls vanceingalls changed the base branch from graphite-base/400 to perf/p0-1a-perf-test-infra on April 22, 2026 00:57
@jrusso1020
Collaborator

Following up on @miguel-heygen's stress-test finding — agreed that the current FPS scenario saturates at the runner's display refresh. On a 120Hz runner, requestAnimationFrame hands back ticks at ~8.3ms intervals regardless of whether the player's composition loop is running at the intended 60fps or is silently stalling between frames, so fpsMin: 55 passes trivially. I missed this in my approval — Miguel is right that it needs to change before it gates merges.

A few approaches that remove the refresh-rate dependency:

1. Composition-time-advanced-per-wall-second (my first choice). In the iframe, sample __player.getTime() at regular wall-clock intervals (say every 100ms via setInterval) across the measurement window. The emitted metric is (finalGetTime - initialGetTime) / wallClockSeconds. When the player keeps up with the composition clock it reads 1.0 ± jitter; when it falls behind (slow decoder, blocked main thread) it drops below 1.0. Display rate drops out of the equation entirely because we're comparing two timestamps that both live in the composition's frame of reference against real wall time, not counting rAF ticks.

Bonus: this is the metric that actually answers "did the composition play at its intended speed," which is the user-observable thing. Display refresh only matters if it's lower than the composition fps — at 60Hz with a 60fps composition the metric would still read ~1.0; at 30Hz displaying a 60fps composition it'd read ~0.5 and legitimately flag the bad experience.

2. Missed-deadline rate. In the iframe's rAF loop, count ticks where the delta since the previous tick exceeded (1000 / target_fps) * 1.2 — i.e., late by more than 20% of the per-frame budget. Metric is missedDeadlines / totalFrames. Bounded and refresh-independent (a 120Hz runner just gets more samples per wall second, with the same passing rate if the player keeps up).

3. PerformanceObserver + frame-timing. new PerformanceObserver({ type: "frame" }) gives you actual frame-timing entries from Chrome with startTime and renderStart. More reliable than rAF but more complex to wire inside the iframe. Probably overkill for this first version.

Option 1 is the simplest and the most directly answers "is the player sustaining playback" — the metric has a physical interpretation rather than being a threshold crossing. Happy to re-review once this lands.
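Under the names in option 1 above, the emitted metric is a one-liner over the first and last samples. A sketch (the sampler shape and field names are illustrative):

```typescript
// One sample = (wall-clock ms, __player.getTime() in composition seconds),
// taken at a fixed setInterval cadence inside the iframe.
interface TimeSample { wallMs: number; compTimeS: number; }

// (finalGetTime − initialGetTime) / wallClockSeconds: reads ≈1.0 when the
// player keeps up with the composition clock, <1.0 when it falls behind.
// Display refresh rate never enters the computation.
function advancementRatio(samples: TimeSample[]): number {
  if (samples.length < 2) throw new Error("need at least two samples");
  const first = samples[0];
  const last = samples[samples.length - 1];
  const wallS = (last.wallMs - first.wallMs) / 1000;
  return (last.compTimeS - first.compTimeS) / wallS;
}
```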

Baseline-wise: fpsMin: 55 would become something like compositionTimeAdvancementRatioMin: 0.95 (or pick your tolerance) and the direction flips — still lower-is-worse, so higher-is-better in the perf-gate. No other changes needed to the harness.

Everything else in the scenario design — alternating seek targets, in-tick pause, drift sampling — held up under Miguel's local run and my static read, so I think just the fps metric needs rework. The rest of my non-blocking notes (samples/shard, tolerance windows) stand but are secondary.

Rames Jusso

…tio metric

Addresses blocking PR feedback on #400 from miguel-heygen and jrusso1020:
the previous FPS metric measured raw rAF cadence and was refresh-rate
dependent (a 30fps composition would always 'pass' on a 60Hz display,
because the metric reported rAF ticks, not composition frames).

- 02-fps.ts: re-implemented to sample __player.getTime() at 100ms wall
  intervals and emit (deltaCompTime / wallSeconds). New metric is
  refresh-rate independent and measures what we actually care about:
  whether the player keeps up with composition-time playback.
- perf-gate.ts: replaced fpsMin / droppedFramesMax with
  compositionTimeAdvancementRatioMin (higher-is-better, target 0.95).
- baseline.json: updated to match the new metric key + threshold.
- 04-scrub.ts: documented MATCH_TOLERANCE_S rationale (frame quantization
  on postMessage, sub-frame intra-clip advance, runner jitter) + TODO to
  tighten once we have CI baseline data.
- 05-drift.ts: log coefficient of variation as a soft monitoring signal
  (not gated) + TODO to decide whether to publish it as a tracked metric.
- index.ts: documented DEFAULT_RUNS rationale (load=5 because p95 over
  n=3 is just max; fps/scrub/drift=3 because they pool samples across
  runs) + TODO to revisit fps=3 after collecting CI baseline data.
- index.ts: removed dead reference to docs/internal/player-perf-baselines.md.