perf(player): p0-1b perf tests for fps, scrub latency, and media sync drift #400
vanceingalls wants to merge 2 commits into perf/p0-1a-perf-test-infra from
Conversation
Warning: This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite. Learn more about stacking.
jrusso1020
left a comment
Scenario design is careful and the methodology notes in each file are a good sign — the "alternate forward/backward seek targets so the rAF watcher doesn't match a stale getTime() value" trick in 04-scrub.ts is exactly the kind of detail that makes or breaks a microbenchmark. Same for the "install rAF watcher before play() and pause() in the same tick to freeze at the captured time" pattern in what I assume 05-drift.ts uses.
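The "install the watcher, then pause in the same tick" pattern can be sketched as below. The names (`captureFrozenTime`, the callback shapes) are illustrative, not the actual harness API; the rAF scheduler is injected so the logic is checkable outside a browser.

```typescript
type Raf = (cb: (ts: number) => void) => void;

// Register the rAF watcher BEFORE issuing pause(), in the same tick, so
// the pause command's async postMessage round-trip can't race the capture.
function captureFrozenTime(
  getTime: () => number,          // e.g. () => __player.getTime()
  pause: () => void,              // e.g. () => player.pause()
  raf: Raf,                       // e.g. requestAnimationFrame in the iframe
  onFrozen: (t: number) => void,  // receives the time the player froze at
): void {
  // 1. Install the watcher first, so the captured frame is observed…
  raf(() => onFrozen(getTime()));
  // 2. …then issue pause() in the same tick.
  pause();
}
```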
The scrub_latency_p95_inline vs scrub_latency_p95_isolated split directly pins the value of #397's sync path as a measurable metric — monkey-patching _trySyncSeek to () => false to force the postMessage path in the same page load is a clean way to separate the modes without needing two separate runs.
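A minimal sketch of that monkey-patch trick. `PlayerLike` is a hypothetical shape covering only the field the trick touches; the real element has more surface.

```typescript
interface PlayerLike {
  _trySyncSeek: (t: number) => boolean;
}

// Force the isolated (postMessage) seek path in the same page load by
// making the sync fast-path always report failure; returns a restore
// function so the inline mode can be measured again without a reload.
function forceIsolatedSeekPath(player: PlayerLike): () => void {
  const original = player._trySyncSeek;
  player._trySyncSeek = () => false;
  return () => {
    player._trySyncSeek = original;
  };
}
```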
Three non-blocking observations:

- With `runs: 3` and 10 seeks per mode per run, that's 30 samples per mode per shard for p95. That's on the edge of "stable enough" — a single outlier at index 28/29 can swing the p95 by tens of ms. Not a blocker (you're in measure mode), but if you see p95 flapping in the first few enforcement cycles, bumping to `runs: 5` is the cheapest fix.
- `MATCH_TOLERANCE_S = 0.05` is generous. On tight-latency scrubs, a 50ms tolerance window between seek command and confirmation paint could mask a legitimate regression where the measured latency is ~30ms but the tolerance swallows the last rAF. Worth revisiting once real baselines land.
- The drift scenario (which I haven't read line-by-line) is the one most likely to produce flaky signal, since it's inherently long-running. Keep an eye on its coefficient of variation over the first week — if it's >20%, that's the signal to tighten the `driftMaxMs`/`driftP95Ms` baselines and investigate whether there's a non-deterministic timing source in the runtime.
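The two statistics these observations lean on, as a sketch (function names are illustrative, not the harness's own). `percentile` uses nearest-rank selection: with 30 pooled samples, p95 is the value at sorted index 28, which is why a single outlier there can swing the reported number.

```typescript
// Nearest-rank percentile over pooled samples.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

// Coefficient of variation (stddev / mean); a value above ~0.2 over a
// week of drift runs is the "investigate timing sources" signal.
function coefficientOfVariation(samples: number[]): number {
  const mean = samples.reduce((a, b) => a + b, 0) / samples.length;
  const variance =
    samples.reduce((acc, x) => acc + (x - mean) ** 2, 0) / samples.length;
  return Math.sqrt(variance) / mean;
}
```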
Approved.
— Rames Jusso
miguel-heygen
left a comment
I pulled this stack into a local worktree, ran the perf harness end-to-end, and compared the top branch against main by swapping the built player/runtime artifacts under the same harness. The scrub/drift gains are real but modest: inline scrub p95 improved from 7.2ms on main to 7.0ms here, isolated scrub p95 improved from 8.2ms to 7.1ms, drift max improved from 25.67ms to 24.67ms, and drift p95 improved from 25.33ms to 24.0ms.
The blocking issue is the FPS scenario. packages/player/tests/perf/scenarios/02-fps.ts is measuring raw requestAnimationFrame cadence and then comparing it against a 60fps target. On my machine both main and this branch reported ~120fps for the same fixture, which means the metric is saturating to browser/display cadence rather than proving “player sustained 60fps playback.” With the current implementation, a high-refresh runner can make fpsMin: 55 in baseline.json look comfortably green without actually telling us whether playback stayed near the intended 60fps budget.
I’d like to see this normalized to a refresh-rate-independent signal before we merge the scenario as a regression gate. Concretely: either derive the metric from missed target intervals against the 60fps composition clock, or capture an effective render cadence that is explicitly bounded to the fixture/runtime target instead of host rAF speed.
Once that part is fixed, I’m comfortable with the rest of the scenario design. The alternating seek targets, iframe-side timing, and drift sampling approach all looked sound in local runs.
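One way to derive the requested refresh-rate-independent signal from missed target intervals, sketched under assumptions: the function name, the injected rAF timestamp array, and the 1.5× threshold are all illustrative choices to tune, not values from the harness.

```typescript
// Fraction of rAF ticks whose inter-frame delta blew the composition's
// frame budget. Unlike raw rAF cadence, this is bounded by the fixture's
// target fps rather than the host display's refresh rate.
function missedDeadlineRate(
  tickTimestampsMs: number[], // rAF timestamps collected in the iframe
  targetFps: number,          // the fixture/composition target, e.g. 60
): number {
  const budgetMs = 1000 / targetFps;
  let missed = 0;
  for (let i = 1; i < tickTimestampsMs.length; i++) {
    // Allow half a frame of jitter before counting a miss (assumption).
    if (tickTimestampsMs[i] - tickTimestampsMs[i - 1] > budgetMs * 1.5) {
      missed++;
    }
  }
  return missed / (tickTimestampsMs.length - 1);
}
```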
Following up on @miguel-heygen's stress-test finding — agreed that the current FPS scenario saturates at the runner's display refresh. On a 120Hz runner the metric reads ~120fps regardless of how well the player is actually keeping up.

A few approaches that remove the refresh-rate dependency:

1. Composition-time-advanced-per-wall-second (my first choice). In the iframe, sample `__player.getTime()` at fixed wall-clock intervals and report the ratio of composition time advanced to wall time elapsed. Bonus: this is the metric that actually answers "did the composition play at its intended speed," which is the user-observable thing. Display refresh only matters if it's lower than the composition fps — at 60Hz with a 60fps composition the metric would still read ~1.0; at 30Hz displaying a 60fps composition it'd read ~0.5 and legitimately flag the bad experience.
2. Missed-deadline rate. In the iframe's rAF loop, count ticks where the delta since the previous tick exceeded the 60fps target interval.
3. PerformanceObserver + frame-timing.

Option 1 is the simplest and most directly answers "is the player sustaining playback" — the metric has a physical interpretation rather than being a threshold crossing. Happy to re-review once this lands.

Baseline-wise:

Everything else in the scenario design — alternating seek targets, in-tick pause, drift sampling — held up under Miguel's local run and my static read, so I think just the fps metric needs rework. The rest of my non-blocking notes (samples/shard, tolerance windows) stand but are secondary.

— Rames Jusso
…tio metric

Addresses blocking PR feedback on #400 from miguel-heygen and jrusso1020: the previous FPS metric measured raw rAF cadence and was refresh-rate dependent (a 30fps composition would always 'pass' at a 60Hz refresh rate because the metric reported on rAF, not the composition).

- `02-fps.ts`: re-implemented to sample `__player.getTime()` at 100ms wall intervals and emit `deltaCompTime / wallSeconds`. The new metric is refresh-rate independent and measures what we actually care about: whether the player keeps up with composition-time playback.
- `perf-gate.ts`: replaced `fpsMin` / `droppedFramesMax` with `compositionTimeAdvancementRatioMin` (higher-is-better, target 0.95).
- `baseline.json`: updated to match the new metric key + threshold.
- `04-scrub.ts`: documented the `MATCH_TOLERANCE_S` rationale (frame quantization on postMessage, sub-frame intra-clip advance, runner jitter) + TODO to tighten once we have CI baseline data.
- `05-drift.ts`: log coefficient of variation as a soft monitoring signal (not gated) + TODO to decide whether to publish it as a tracked metric.
- `index.ts`: documented the `DEFAULT_RUNS` rationale (load=5 because p95 over n=3 is just max; fps/scrub/drift=3 because they pool samples across runs) + TODO to revisit fps=3 after collecting CI baseline data.
- `index.ts`: removed dead reference to `docs/internal/player-perf-baselines.md`.
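The ratio computation the commit describes can be sketched as a pure function over paired samples (the function name and argument shapes here are illustrative, not the scenario's actual code):

```typescript
// Composition time advanced per wall-clock second, from paired
// (__player.getTime(), performance.now()) samples. ~1.0 means real-time
// playback regardless of host display refresh; 0.5 means the player
// fell to half speed.
function advancementRatio(
  compTimesS: number[],  // __player.getTime() samples, in seconds
  wallTimesMs: number[], // performance.now() at each sample, in ms
): number {
  const deltaCompS = compTimesS[compTimesS.length - 1] - compTimesS[0];
  const wallS = (wallTimesMs[wallTimesMs.length - 1] - wallTimesMs[0]) / 1000;
  return deltaCompS / wallS;
}
```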

Summary
Second slice of P0-1 from the player perf proposal: plugs the three steady-state scenarios — sustained playback FPS, scrub latency, and media-sync drift — into the perf gate that landed in #399. Adds the multi-video fixture they all share, wires three new shards into CI, and seeds one new baseline (`droppedFramesMax`).

Why
#399 stood up the harness and proved it with a single load-time scenario. By itself that's enough to catch regressions in initial composition setup, but it can't catch the things players actually fail at in production: degraded sustained-playback FPS, slow scrubs, and media drifting out of sync with the composition clock.
Each of these is a target metric in the proposal with a concrete budget. This PR turns those budgets into gated CI signals and produces continuous data for them on every player/core/runtime change.
What changed
Fixture — `packages/player/tests/perf/fixtures/10-video-grid/`

- `index.html`: 10-second composition, 1920×1080, 30 fps, with 10 simultaneously-decoding video tiles in a 5×2 grid plus a subtle GSAP scale "breath" on each tile (so the rAF/RVFC loops have real work to do without GSAP dominating the budget the decoder needs).
- `sample.mp4`: small (~190 KB) clip checked in so the fixture is hermetic — no external CDN dependency, identical bytes on every run.
- Uses the same `data-composition-id="main"` host pattern as `gsap-heavy`, so the existing harness loader works without changes.

`02-fps.ts` — sustained playback frame rate

- Loads `10-video-grid`, calls `player.play()`, samples `requestAnimationFrame` callbacks inside the iframe for 5 s.
- After `play()`, waits for `__player.isPlaying() === true`, then resets the sample buffer — otherwise the postMessage round-trip ramp-up window drags the average down by 5–10 fps.
- fps = `(samples − 1) / (lastTs − firstTs in s)`; uses rAF timestamps (the same ones the compositor saw) rather than wall-clock `setTimeout`, so we're measuring real frame production.
- Aggregates `min(fps)` and `max(droppedFrames)` — worst case wins, since the proposal asserts a floor on fps and a ceiling on drops.
- Metrics: `playback_fps_min` (higher-is-better, baseline `fpsMin = 55`) and `playback_dropped_frames_max` (lower-is-better, baseline `droppedFramesMax = 3`).

`04-scrub.ts` — scrub latency, inline + isolated

- Loads `10-video-grid`, pauses, then issues 10 seek calls in two batches: first the synchronous inline path (`<hyperframes-player>`'s default same-origin `_trySyncSeek`), then the isolated path (forced by replacing `_trySyncSeek` with `() => false`, which makes the player fall back to the postMessage `_sendControl("seek")` bridge that cross-origin embeds and pre-#397 (feat(player): synchronous seek() API with same-origin detection) builds use).
- Each seek is confirmed by an rAF watcher that polls `__player.getTime()` until it's within `MATCH_TOLERANCE_S = 0.05 s` of the requested target. Tolerance exists because the postMessage bridge converts seconds → frame number → seconds, and that round-trip can introduce sub-frame quantization drift even for targets on the canonical fps grid.
- Timestamps are `performance.timeOrigin + performance.now()` in both contexts. `timeOrigin` is consistent across same-process frames, so `t1 − t0` is a true wall-clock latency, not a host-only or iframe-only stopwatch.
- Seek targets alternate forward/backward (`1.0, 7.0, 2.0, 8.0, 3.0, 9.0, 4.0, 6.0, 5.0, 0.5`) so no two consecutive seeks land near each other — protects the rAF watcher from matching against a stale `getTime()` value before the seek command is processed.
- Aggregates `percentile(95)` across the pooled per-seek latencies from every run. With 10 seeks × 2 modes × 3 runs we get 30 samples per mode per CI shard, enough for a stable p95.
- Metrics: `scrub_latency_p95_inline_ms` (lower-is-better, baseline `scrubLatencyP95InlineMs = 33`) and `scrub_latency_p95_isolated_ms` (lower-is-better, baseline `scrubLatencyP95IsolatedMs = 80`).

`05-drift.ts` — media sync drift

- Loads `10-video-grid`, plays 6 s, instruments every `video[data-start]` element with `requestVideoFrameCallback`. Each callback records `(compositionTime, actualMediaTime)` plus a snapshot of the clip transform (`clipStart`, `clipMediaStart`, `clipPlaybackRate`).
- Drift per frame = `|actualMediaTime − ((compTime − clipStart) × clipPlaybackRate + clipMediaStart)|` — the same transform the runtime applies in `packages/core/src/runtime/media.ts`, snapshotted once at sampler install so the per-frame work is just subtract + multiply + abs.
- Same ramp-up reset as `02-fps.ts`: frames captured during the postMessage round-trip would compare a non-zero `mediaTime` against `getTime() === 0` and inflate drift by hundreds of ms.
- Aggregates `max()` and `percentile(95)` across the pooled per-frame drifts.
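The per-frame drift computation can be sketched as a pure function over the snapshotted clip transform (field names follow the snapshot described above; the function names are illustrative):

```typescript
// Clip transform snapshot, taken once at sampler install.
interface ClipTransform {
  clipStart: number;        // composition time where the clip begins (s)
  clipMediaStart: number;   // media time mapped to clipStart (s)
  clipPlaybackRate: number; // media seconds per composition second
}

// Expected media time for a given composition time, per the same
// transform the runtime applies.
function expectedMediaTime(compTimeS: number, c: ClipTransform): number {
  return (compTimeS - c.clipStart) * c.clipPlaybackRate + c.clipMediaStart;
}

// Drift is the absolute deviation of the actual media time from the
// expected one, reported in milliseconds.
function driftMs(
  compTimeS: number,
  actualMediaTimeS: number,
  c: ClipTransform,
): number {
  return Math.abs(actualMediaTimeS - expectedMediaTime(compTimeS, c)) * 1000;
}
```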
- The proposal's max-drift ceiling of 500 ms is intentional — the runtime hard-resyncs when `|currentTime − relTime| > 0.5 s`, so a regression past 500 ms means the corrective resync kicked in and the viewer saw a jump.
- Metrics: `media_drift_max_ms` (lower-is-better, baseline `driftMaxMs = 500`) and `media_drift_p95_ms` (lower-is-better, baseline `driftP95Ms = 100`).

Wiring
- `packages/player/tests/perf/index.ts`: add `fps`, `scrub`, `drift` to `ScenarioId`, `DEFAULT_RUNS`, the default scenario list (`--scenarios` defaults to all four), and three new dispatch branches.
- `packages/player/tests/perf/perf-gate.ts`: add `droppedFramesMax: number` to `PerfBaseline`. Other baseline keys for these scenarios were already seeded in #399 (perf(player): p0-1a perf test infra + composition-load smoke test).
- `packages/player/tests/perf/baseline.json`: add `droppedFramesMax: 3`.
- `.github/workflows/player-perf.yml`: three new matrix shards (fps/scrub/drift) at `runs: 3`. Same `paths-filter` and same artifact-upload pattern as the `load` shard, so the summary job aggregates them automatically.

Methodology highlights
These three patterns recur in all three scenarios and are worth noting because they're load-bearing for the numbers we report:
- Ramp-up reset: the `play()` API is async (postMessage), so any samples captured before `__player.isPlaying() === true` belong to ramp-up, not steady-state. Both `02-fps` and `05-drift` clear `__perfRafSamples`/`__perfDriftSamples` after the wait. Without this, fps drops 5–10 and drift inflates by hundreds of ms.
- Iframe-side clocks (`performance.timeOrigin + performance.now()` for scrub, rAF/RVFC timestamps for fps/drift) rather than host-side. The iframe is what the user sees; host-side timing would conflate Puppeteer's IPC overhead with real player latency.
- In-tick pause: the rAF watcher is installed before `pause()` is issued, so the pause command's postMessage round-trip can't perturb the tail of the measurement window.

Test plan
- `bun run player:perf` runs all four scenarios end-to-end on the 10-video-grid fixture.
- Confirmed the new metric keys match `baselineKeys` so `perf-gate.ts` can find them.
- Checked the per-shard `metrics.json` artifacts.

Stack
Step P0-1b of the player perf proposal.

Builds on:

- P0-1a (perf(player): p0-1a perf test infra + composition-load smoke test #399): the harness, runner, gate, and CI workflow this PR plugs new scenarios into.

Followed by:

- P0-1c (perf(player): p0-1c live-playback parity test via SSIM #401): `06-parity` — live playback frame vs. synchronously-seeked reference frame, compared via SSIM, on the existing `gsap-heavy` fixture from #399.