
perf(engine): dedupe identical extractions within one render #1900

Merged: miguel-heygen merged 6 commits into main from 07-02-perf_engine_dedupe_identical_extractions on Jul 3, 2026

@miguel-heygen miguel-heygen commented Jul 3, 2026

Collaborator

Stack 3/6: #1898 → #1899 → this → #1901 → #1902 → #1885

What

N <video> elements that resolve to the same (path, mediaStart, duration, fps, format) now share ONE extraction via an in-flight promise map keyed on that tuple. Duplicate elements receive the shared frame set under their own videoId; only one frame directory exists on disk.
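The coalescing described above can be sketched roughly as follows. This is a minimal illustration, not the engine's code: the `ExtractionWork` shape, the `extractAll` name, and the injected `extract` callback are assumptions made for the example, and the real `ExtractedFrames` type carries more fields than shown.

```typescript
// Hypothetical shapes: the real types live in videoFrameExtractor.ts and may differ.
interface ExtractedFrames {
  videoId: string;
  outputDir: string;
  framePaths: string[];
}

interface ExtractionWork {
  videoId: string;
  videoPath: string;
  mediaStart: number;
  videoDuration: number;
  fps: number;
  format: string;
}

// NUL-separated tuple, following the key described in this PR.
function dedupeKey(w: ExtractionWork): string {
  return [w.videoPath, w.mediaStart, w.videoDuration, w.fps, w.format].join("\0");
}

async function extractAll(
  works: ExtractionWork[],
  extract: (w: ExtractionWork) => Promise<ExtractedFrames>,
): Promise<ExtractedFrames[]> {
  // Scoped to one call, so nothing is shared across renders.
  const inFlight = new Map<string, Promise<ExtractedFrames>>();
  return Promise.all(
    works.map(async (work) => {
      const key = dedupeKey(work);
      const existing = inFlight.get(key);
      if (existing) {
        // Follower: reuse the leader's frames under this element's own id.
        const shared = await existing;
        return { ...shared, videoId: work.videoId };
      }
      // Leader: register the promise synchronously, before any await,
      // so later elements with the same key find it.
      const promise = extract(work);
      inFlight.set(key, promise);
      return promise;
    }),
  );
}
```

Because the leader stores its promise before yielding, every later element with the same tuple takes the follower branch, which is why only one extraction (and one frame directory) is produced per unique key.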

Why

A composition reusing a clip (the same background loop across scenes, the same source placed twice) extracted it once per element. This also closes a latent race on the extraction cache: two identical clips missing the cache concurrently both extracted into the SAME cache entry directory and interleaved writes.

Measured / verified

  • 3x duplicated 60 s 1080p video: 4426 ms → 1521 ms (2.9x) in the A/B benchmark against the built main engine; frame counts 5400/5400.
  • The deduped run's on-disk content hash equals the single-video control's hash: one frame set, shared.
  • New test asserts both extracted results are present, share outputDir and framePaths, and only one frame directory exists.

Notes

cacheHits/cacheMisses count once per unique tuple. Per-element error attribution is preserved (a shared failure reports one error per element). Cleanup of the shared directory is idempotent (rmSync with force).

@james-russo-rames-d-jusso james-russo-rames-d-jusso left a comment


Reviewed at 1d437c8a9a09086ad1037cd750f0dab1d1668e9a.
Peer scan: no prior reviews or in-line comments on this PR; only Miguel's Graphite mergeability warning.
Stack context: Stack 3/6 (against 07-02-perf_engine_one_pass_vfr_extraction, i.e. on top of #1899). #1901 (cache-on-by-default) sits above this; #1902 later adds sdrToHdrTransfer to the dedupe key; #1885 (superset) rewrites this pipeline entirely.

Summary — Introduces a dedupeKey = "${videoPath}\0${mediaStart}\0${videoDuration}\0${fps}\0${format}" and coalesces identical extractions through a per-render Map<string, Promise<ExtractedFrames>>. First arrival races through tryCachedExtract/extractVideoFramesRange; followers await existing and get the shared ExtractedFrames under their own videoId. Also closes the "two identical clips both write the same cache partial dir concurrently" race that #1901 would otherwise widen. Mechanics look clean; primary questions are around error propagation and cross-PR key coherence.

Concerns

🟠 Follower error attribution masks the leader's identity. When the leader's extraction rejects, followers await existing and re-throw the leader's error, which the outer catch wraps into {videoId: work.video.id, error: err.message} (videoFrameExtractor.ts line ~194). Every follower reports the leader's error message under its own videoId. Operationally this is fine (each element sees an error), but the debug story degrades: three identical <video> elements dedupe → leader fails on ffmpeg: no such stream → three errors, each blaming a different videoId, all pointing at the same underlying failure. The PR body says "per-element error attribution is preserved" — technically true (one error per element) but the cause attribution is now indistinguishable from three independent failures. Consider prefixing the error text with [shared-extraction leader=<leaderVideoId>] or logging the dedupe fan-out once so the on-call trace tells you these three are one failure, not three.

🟠 Cascade failure on transient leader error. Related to the above: a flaky ffmpeg spawn on the leader kills all N deduped members with no per-member retry. In the pre-PR world, three identical clips got three independent extraction attempts — flake on one cost 1/3 of the render. Post-PR, flake on the leader costs the whole set. This is a legitimate tradeoff (avoids N-fold retry storm), but it's a real change in failure semantics that the PR body doesn't call out. Not a blocker; document the tradeoff, or consider a bounded retry inside the shared IIFE on transient extraction errors.

Nits

🟡 dedupeKey string composition uses a NUL separator, but format is a bounded enum; a [path, ms, dur, fps, format].join("\0") reads clearer than the template concatenation (videoFrameExtractor.ts line ~122). Cosmetic.

🟡 Test asserts frameDirs === ["dupe-a"], which depends on Map iteration order for the winning videoId. It's preparedExtractions.map order (input array order), which is deterministic here, but a comment "leader is first-arriving input" would age better than the array-order coincidence.

Cross-stack interactions

  • #1901: this PR closes a genuine race the pre-PR code had with the extraction cache — two identical clips missing the cache both write into the same partialCacheEntryDir(entry). Post-dedupe, only the leader writes the partial; #1901's publishCacheEntry sees a single, coherent partial dir. That's a load-bearing prerequisite for cache-on-by-default; without this PR, #1901 rolling out on-by-default would surface the interleaved-writes bug in production. Worth stating explicitly in the stack ordering rationale.
  • #1902: sdrToHdrTransfer is per-video; two elements with the same (path, mediaStart, duration, fps, format) but different transfers would collide in this PR's dedupeKey and share a frame set that's wrong for one of them. #1902 fixes this by extending dedupeKey with \0${sdrToHdrTransfer ?? ""}. Between this PR's landing and #1902's, an SDR clip and an HDR-tagged clip with identical (path, ms, dur, fps, format) tuples would silently share extraction. In practice resolveFrameFormat almost certainly picks a different format for HDR (png with a colorspace) — so the tuples don't actually collide — but that safety is implicit, not asserted. Worth a code comment pointing at the format-derived HDR discriminant, or land #1902 as part of the same merge unit.
  • #1885: #1885 rewrites this section into uniqueWorks + planSupersetGroups, keeping the same dedupeKey semantics (now sourced from PreparedExtraction.dedupeKey). The two layers compose cleanly: dedupe first (unique work per tuple), then superset among the dedupe-surviving cache misses. Confirmed by reading #1885's diff.
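The #1902 collision described in the list above can be illustrated with a small sketch. It assumes, for the sake of the example, that both elements resolve to the same format (the review notes this is unlikely in practice); the `KeyInput` shape and the `"pq"` transfer value are illustrative and not taken from the PR.

```typescript
// Hypothetical field set; sdrToHdrTransfer is the discriminant #1902 adds.
interface KeyInput {
  videoPath: string;
  mediaStart: number;
  videoDuration: number;
  fps: number;
  format: string;
  sdrToHdrTransfer?: string;
}

// Five-field key as introduced in this PR.
const keyWithoutTransfer = (k: KeyInput): string =>
  `${k.videoPath}\0${k.mediaStart}\0${k.videoDuration}\0${k.fps}\0${k.format}`;

// Extended key: an absent transfer serializes as "", so plain SDR clips still coalesce.
const keyWithTransfer = (k: KeyInput): string =>
  `${keyWithoutTransfer(k)}\0${k.sdrToHdrTransfer ?? ""}`;

const sdr: KeyInput = { videoPath: "clip.mp4", mediaStart: 0, videoDuration: 10, fps: 30, format: "png" };
// "pq" is an illustrative transfer value only.
const hdr: KeyInput = { ...sdr, sdrToHdrTransfer: "pq" };

console.log(keyWithoutTransfer(sdr) === keyWithoutTransfer(hdr)); // true: the two would share one frame set
console.log(keyWithTransfer(sdr) === keyWithTransfer(hdr));       // false: the transfer keeps them apart
```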

Questions

  • Any signal in phaseBreakdown for dedupe hit count? cacheHits/cacheMisses count once per unique tuple after this PR, so the "saved N extractions to duplicate elements" number is invisible to observability. For a PR whose whole thesis is "N extractions → 1", a dedupeHits counter (or a log line at the shared-extraction fanout) would let production tell you when this optimization is actually paying off.
  • Cancellation semantics: if the leader is cancelled via signal.aborted, followers await existing and get the cancellation error. Fine. But if a follower's signal differs from the leader's (not the case in this codebase today — signal is per-extractAllVideoFrames call), followers can't cancel the shared work. Documenting that this dedupe is a per-render primitive (shared render context) would head off future refactors that pull signals per-element.
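For the first question, one possible shape for a dedupe hit counter is sketched below. This PR does not add such a field; `DedupeStats` and `countDedupe` are hypothetical names, and the function only shows how follower arrivals could be counted separately from unique extractions.

```typescript
// Hypothetical counter shape; not part of phaseBreakdown in this PR.
interface DedupeStats {
  uniqueExtractions: number;
  dedupeHits: number;
}

function countDedupe(keys: string[]): DedupeStats {
  const seen = new Set<string>();
  let dedupeHits = 0;
  for (const key of keys) {
    if (seen.has(key)) {
      dedupeHits += 1; // follower: an extraction that was avoided
    } else {
      seen.add(key); // leader: the one real extraction for this tuple
    }
  }
  return { uniqueExtractions: seen.size, dedupeHits };
}

console.log(countDedupe(["a", "a", "a", "b"])); // { uniqueExtractions: 2, dedupeHits: 2 }
```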

What I didn't verify

  • Did not run the benchmark; trusted the PR body's 4426ms → 1521ms number.
  • Did not check whether the Map iteration ordering + Promise.all scheduling could produce a leader ID that differs from the "obvious" leader across Node versions — reasoned it's deterministic under V8's current semantics but did not stress-test.
  • Did not exercise the cache-partial-race removal claim end-to-end; verified it's the shape the pre-PR code had (two identical tryCachedExtract calls both mkdirSync on the same partialDir) and that this PR makes it single-threaded per-tuple.

— Rames D Jusso

@vanceingalls vanceingalls left a comment

Collaborator

🟢 LGTM with two runtime-interop observations — dedup composes cleanly with #1885's superset lens

Intra-render dedup keyed on (videoPath, mediaStart, videoDuration, fps, format) composes correctly with the rest of the reuse family: the same key primitive is extended in #1885 (adds sdrToHdrTransfer) and its superset-grouping key is a proper subset (drops mediaStart + videoDuration to allow same-source-different-trim grouping), so dedup fires before superset planning and the two layers cannot double-optimize the same work. Two runtime-interop notes worth calling out; neither is a blocker.

Rames-bot posted 13s before me at the same SHA — convergences noted below. His follower-error-attribution + cascade-failure concerns and the #1902 sdrToHdrTransfer cross-stack concern all match my read of the runtime picture; his dedupeHits counter suggestion also complements the observability gap I'm not covering in this lens.

Findings (runtime-interop lens)

1. Shared-result mutation is videoId-only — outputDir / framePaths / framePattern still point at the winner's subdir — 🟠

File: packages/engine/src/services/videoFrameExtractor.ts:858-862

const existing = inFlightExtractions.get(work.dedupeKey);
if (existing) {
  const shared = await existing;
  return { result: { ...shared, videoId: work.video.id } };
}

The waiter path spreads the winner's ExtractedFrames and only rewrites videoId. outputDir remains join(options.outputDir, winner.videoId) and every entry in framePaths is an absolute path under that winner-subdir. This is exactly what the test asserts (expect(second.outputDir).toBe(first.outputDir)), so it's the intended contract — but it breaks a soft invariant the rest of the pipeline relies on: ext.videoId matches the leaf directory name of ext.outputDir. FrameLookupTable.cleanup() at line ~1119 sweeps every non-ownedByLookup result's outputDir with rmSync(..., { recursive: true, force: true }), so N dupe results all target the same path — the second-through-Nth calls are harmless thanks to force: true + existsSync guard, but a consumer that (a) inspects the path to derive the video's identity, or (b) enumerates output dirs expecting one-per-videoId, will now see one dir for N videoIds. #1885's superset lens sidesteps this by materializing each member into its own .../videoId/ subdir via sliceSupersetMember — the dedup path arguably wants a similar "hardlink into per-videoId subdir" story for consistency, but I don't think it's blocking today. Worth a comment on the shared path noting the winner-subdir contract so a future consumer doesn't get burned.
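A small sketch of the winner-subdir contract, under assumed paths and a reduced result shape (the directory layout and frame file name here are made up for illustration). It shows why the soft invariant fails for followers and how a consumer that enumerates or sweeps directories could collapse to unique paths first.

```typescript
// Hypothetical minimal result shape; the real ExtractedFrames has more fields.
interface SharedFrames {
  videoId: string;
  outputDir: string;
  framePaths: string[];
}

const leaderResult: SharedFrames = {
  videoId: "dupe-a",
  outputDir: "/render/frames/dupe-a",
  framePaths: ["/render/frames/dupe-a/frame_0001.png"],
};

// Follower result as built on the waiter path: only videoId is rewritten.
const followerResult: SharedFrames = { ...leaderResult, videoId: "dupe-b" };

// Leaf directory name of a POSIX-style path.
const leafName = (dir: string): string => dir.split("/").pop() ?? "";

console.log(leafName(leaderResult.outputDir) === leaderResult.videoId);     // true
console.log(leafName(followerResult.outputDir) === followerResult.videoId); // false: the soft invariant no longer holds

// One directory for N videoIds: dedupe paths before sweeping or enumerating.
const uniqueOutputDirs = Array.from(
  new Set([leaderResult, followerResult].map((r) => r.outputDir)),
);
console.log(uniqueOutputDirs); // [ '/render/frames/dupe-a' ]
```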

2. Error attribution on the shared path — 🟢

File: packages/engine/src/services/videoFrameExtractor.ts:857-890

Confirmed via read: if the shared extraction rejects, both waiters hit the outer catch (err) and each produces its own extractionError(work.video.id, err), so a shared failure reports one error entry per element with the correct videoId. Matches the PR body's claim. No branching needed for callers.

3. Lifetime of the in-flight map — 🟢

File: packages/engine/src/services/videoFrameExtractor.ts:851

inFlightExtractions is local to the call; no cross-render leak. It has no interaction with #1901's disk cache — dedup is a pure fan-in optimization above the cache lookup.

Cross-PR interop (#1900 ↔ #1885 / #1901)

  • dedupeKey is the shared primitive: #1885 extends it with sdrToHdrTransfer (6-tuple) and derives supersetGroupingKey as a proper subset (4-tuple, drops mediaStart+duration). Same key family, not parallel — no decorative-gate risk.
  • Order-of-operations at #1885 head: prepare → collapse to uniqueWorks (this PR) → cache lookup → superset planning over cache-missed uniques → materialize members with cache-publish. Correct nesting.
  • Disk cache (#1901) sees exactly one lookup per unique dedupeKey, so cacheHits/cacheMisses counter semantics match the PR body ("count once per unique tuple"). Verified.
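The key family and ordering in the list above might look like this in outline. It is a sketch under assumptions: the field names follow the tuples named in this thread, but `PlannedWork`, `planExtractions`, and the returned counts are invented for the example, and the cache lookup that sits between the two steps is omitted.

```typescript
// Hypothetical work shape covering the 6-tuple described for #1885.
interface PlannedWork {
  videoId: string;
  videoPath: string;
  mediaStart: number;
  videoDuration: number;
  fps: number;
  format: string;
  sdrToHdrTransfer?: string;
}

// 6-tuple: identical work within one render.
const fullDedupeKey = (w: PlannedWork): string =>
  [w.videoPath, w.mediaStart, w.videoDuration, w.fps, w.format, w.sdrToHdrTransfer ?? ""].join("\0");

// 4-tuple: same source with any trim, i.e. the dedupe key minus mediaStart and videoDuration.
const supersetGroupingKey = (w: PlannedWork): string =>
  [w.videoPath, w.fps, w.format, w.sdrToHdrTransfer ?? ""].join("\0");

function planExtractions(works: PlannedWork[]): { uniqueCount: number; groupCount: number } {
  // Step 1: collapse identical tuples to unique works.
  const unique = new Map<string, PlannedWork>();
  for (const w of works) {
    const key = fullDedupeKey(w);
    if (!unique.has(key)) unique.set(key, w);
  }
  // Step 2: group the surviving uniques by source, ignoring trim.
  const groups = new Map<string, PlannedWork[]>();
  unique.forEach((w) => {
    const key = supersetGroupingKey(w);
    const members = groups.get(key) ?? [];
    members.push(w);
    groups.set(key, members);
  });
  return { uniqueCount: unique.size, groupCount: groups.size };
}
```

Since the grouping key is derived from a subset of the dedupe key's fields, two works with equal dedupe keys always land in the same group, so running dedupe first never splits a group.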

Review by Via (runtime-interop lens)

Extracted video frames are render-scoped temp files read once during
capture, so zlib effort above level 1 buys nothing. Measured 3.3x
faster on 60s of 1080p H.264 to PNG (11.4s to 3.5s) and 5.4x on a 20s
vp9-alpha webm (4.3s to 0.79s), for ~14% larger temp files.

VFR sources (screen recordings, phone videos) were re-encoded to CFR
with libx264 and then extracted in a second ffmpeg pass. Extraction now
runs a single pass with -fps_mode cfr -r <fps>. Same frame counts on
the VFR regression fixtures (120/120 mid-seek, 297-303 full file), one
less x264 generation of quality loss, ~3.4x faster on VFR inputs.
convertVfrToCfr and the _vfr_normalized intermediate are deleted.
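
As a rough illustration of the single-pass form, the argument list might be assembled like this. Only `-fps_mode cfr -r <fps>` comes from the commit message; the helper name, the `ExtractRange` shape, the seek and duration flags, and the output pattern are assumptions, and the engine's real argument list is not shown in this thread.

```typescript
// Hypothetical helper; not the engine's actual argument builder.
interface ExtractRange {
  inputPath: string;
  startSeconds: number;
  durationSeconds: number;
  fps: number;
  outputPattern: string; // e.g. "frames/frame_%05d.png"
}

function onePassExtractArgs(r: ExtractRange): string[] {
  return [
    "-ss", String(r.startSeconds),   // seek in the input before decoding
    "-i", r.inputPath,
    "-t", String(r.durationSeconds), // limit the extracted range
    "-fps_mode", "cfr",              // duplicate or drop frames to hold a constant rate
    "-r", String(r.fps),             // target output frame rate
    r.outputPattern,
  ];
}

console.log(
  onePassExtractArgs({
    inputPath: "screen.mp4",
    startSeconds: 2,
    durationSeconds: 4,
    fps: 30,
    outputPattern: "frames/frame_%05d.png",
  }).join(" "),
);
// -ss 2 -i screen.mp4 -t 4 -fps_mode cfr -r 30 frames/frame_%05d.png
```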

The full-VFR test's byte-identical duplicate-frame cap is retired with
cause: the fixture has no source frames for 40% of its timeline, so
held frames are correct; the two-pass path only scored under it because
x264 encoder noise made frozen frames hash differently. The freeze
regression (missing frames) stays pinned by the frame-count windows.

vfrPreflightMs used to time a per-source VFR-to-CFR re-encode; it now
times only the cached classification probe and collapses to ~0. Call
that out on ExtractionPhaseBreakdown so dashboards keyed on the old
threshold semantics migrate to vfrPreflightCount / extractMs.

@miguel-heygen miguel-heygen force-pushed the 07-02-perf_engine_dedupe_identical_extractions branch from 1d437c8 to 14fad52 Compare July 3, 2026 20:01
@miguel-heygen miguel-heygen force-pushed the 07-02-perf_engine_one_pass_vfr_extraction branch from 7ae7667 to 479677d Compare July 3, 2026 20:01
@miguel-heygen

Collaborator Author

Review feedback addressed in the updated branch:

  • Fixed — leader-error attribution: follower errors from a shared extraction now carry a [shared extraction, leader <videoId>] prefix, so N deduped copies of one root failure are traceable to a single extraction in traces. (The same annotation is ported into the reworked fan-out at the top of the stack, where superset groups share outcomes the same way.)
  • Cascade-failure tradeoff: acknowledged and deliberate. A bounded per-follower retry was considered and rejected: retrying N-1 times on what is almost always a deterministic ffmpeg failure (bad source, bad args) turns one loud failure into a retry storm that delays the render without changing the outcome; transient spawn flake is better handled at a single point if it shows up in practice. The tradeoff is now visible in the error text via the leader prefix.
  • Invalid: shared outputDir/framePaths pointing at the winner's subdir — by design and load-bearing: the frame injector consumes framePaths (absolute), and the producer's materializeExtractedFramesForCompiledDir creates a per-videoId link for every extracted entry, so both element ids resolve in the compiled dir. The dedupe test pins framePaths equality and single-dir-on-disk as the contract.

@james-russo-rames-d-jusso james-russo-rames-d-jusso left a comment


R2 verification — reviewed at 14fad520dc995497e6962a0d917af419c1cf67d4 (R1 at 1d437c8a).
R1 findings verified: 1 resolved ✅, 1 resolved-differently (via docstring intent) ✅, 2 nits closed ✅.
Peer scan: Via (Vance's runtime-interop lens) posted a full review at the R1 SHA — layering below rather than parallel-posting.

Summary — Miguel took my F1 (follower error attribution) as-designed: the follower error branch now prefixes with [shared extraction, leader <id>], and the in-flight map value carries leaderVideoId explicitly (videoFrameExtractor.ts:856-861 + :867-885). F2 (cascade-failure semantics) is not documented in the PR body — but the R1 concern was about op-visibility more than mechanics, and the F1 prefix change alone makes the fan-out traceable in traces, which is 80% of what F2 was asking for. Nits closed.

R1 resolutions

  • 🟠 F1 — Follower error attribution masks leader identity — ✅ resolved. New shape at videoFrameExtractor.ts:870-880: follower catch now emits [shared extraction, leader ${existing.leaderVideoId}] ${message}, and the map value is { leaderVideoId, promise } so the leader ID is retained across the fan-out. On-call trace reads exactly as R1 requested: three deduped elements failing all cite the same leader videoId, so "three independent failures" is no longer the wrong take-away. Follow-up commit: fix(engine): attribute shared-extraction failures to the dedupe leader.

  • 🟠 F2 — Cascade-failure semantics on leader flake — ✅ resolved-differently. No PR-body docstring, but the F1 fix directly addresses the concern's operational surface: the [shared extraction, leader <id>] prefix makes the cascade shape self-documenting in error output, and the in-flight map's leaderVideoId field is a code-level anchor for anyone reading the pipeline for the first time. A follow-up retry inside the shared IIFE (my R1 suggestion) would be an over-fit — dedupe-then-retry is a different tradeoff and belongs in its own PR. Credit as resolved on the operational-goal-met rubric. Would be nice to have one sentence in the PR body's Notes section noting "a leader flake fails all deduped elements identically; the shared-extraction prefix makes this greppable" — cheap doc improvement, not blocking.

  • 🟡 nits — dedupeKey composition + Map-iteration test comment — ✅ closed as no-action. These were cosmetic and #1902 lands the cross-PR sdrToHdrTransfer key extension, so any refactor of the key composition is naturally rolled into that PR. Standing.

Layering with Via (Vance's runtime-interop review)

Via posted an independent runtime-interop pass at the R1 SHA with one 🟠 I didn't cover — worth flagging so it doesn't get lost in the R2:

  • Via F1 — Shared-result mutation is videoId-only; outputDir / framePaths / framePattern still point at the winner's subdir. At videoFrameExtractor.ts:872 the follower branch spreads ...shared and only rewrites videoId, so outputDir = join(options.outputDir, winnerVideoId) and every framePaths entry lives under the winner's subdir. This is the intended contract (the R1 test pins expect(second.outputDir).toBe(first.outputDir)), but it breaks the soft invariant ext.videoId === basename(ext.outputDir) that FrameLookupTable.cleanup() and any per-videoId path-inspecting consumer rely on. Not resolved in this R2. Fine to defer as a follow-up (Via marks it non-blocking), but worth a code comment on the shared branch noting the "winner-subdir contract" so a future consumer doesn't get burned. Confirming Via's read: the current change would benefit from a comment right at the { ...shared, videoId: work.video.id } line pointing at the outputDir/framePaths sharing.

Cross-stack recheck

Prereq status intact: #1900's intra-render dedupe is what makes #1901's cache-on-by-default safe (single writer per partialCacheEntryDir per render). #1885 composes above by extending dedupeKey with sdrToHdrTransfer (via #1902) and deriving a strict-subset supersetGroupingKey — that layering is unchanged by the R2 diff. Order-of-operations at the extractor still: prepare → collapse to uniqueWorks → cache lookup → superset planning over cache-missed uniques → materialize. Verified against videoFrameExtractor.ts at both HEADs.

Residuals

  • Via's outputDir-contract observation (above) — non-blocking, worth a code comment.
  • PR body still doesn't mention the cascade-failure semantics — one sentence in the Notes section would be a nice-to-have.
  • No AI-trailer squash-strip needed (checked commit messages).

Batch state

  • CI: preflight pass, preview-regression pass, player-perf pass; regression-shards pending.
  • Author: miguel-heygen (confirmed via pulls/1900.author).
  • No stamp — COMMENT-only per protocol.

— Rames D Jusso

@vanceingalls vanceingalls left a comment

Collaborator

🟢 R2 LGTM — follower-error attribution structurally addressed, cascade-failure semantics undocumented but accepted

R2 verification — reviewed at 14fad520dc995497e6962a0d917af419c1cf67d4 (R1 at 1d437c8a).
Peer scan: Rames-bot posted R2 dispositioning F1 ✅ and F2 ✅-resolved-differently ("operational goal met via the F1 prefix"). Convergences cited per finding below; Rames also flagged my Via-F1 outputDir-contract observation as worth a code comment — noting agreement.
Verification method: gh api /pulls/1900/files (2 files touched) + per-file gh api /contents?ref=<sha> diff, per gh-compare-rebased-branches-caveat.

R1 was 🟢 LGTM with two runtime-interop notes (mine) and 🟠×2 from Rames (follower-attribution + cascade-failure docs). The R2 fix commit 14fad520 turns the in-flight map value into { leaderVideoId, promise } and wraps the follower's await existing.promise in its own try/catch that returns a [shared extraction, leader <id>] <message> prefixed error. Structural fix, not a documentation patch — the leader identity is now carried alongside the promise, not reconstructed from the failing exception. The cascade-failure documentation ask stays open with an accepted mitigation shape (attribution now makes traces self-explanatory), so I'm dispositioning it ⚠️ mitigated-accepting per the r2-verdict-mitigation-vs-full-resolution rubric.

Finding-by-finding disposition

1. Via R1 — shared-result mutation is videoId-only (footgun, not bug) — ✅ still correct at R2

File: packages/engine/src/services/videoFrameExtractor.ts:855-901
No change to the { ...shared, videoId: work.video.id } shape on the happy path — leader's outputDir/framePaths/framePattern are still what followers receive. FrameLookupTable.cleanup() semantics (N-to-1 sweep via force: true + existsSync) are unchanged. The contract that Via's LGTM footgun-note covered is unchanged and still safe.

2. Rames R1 F1 — follower error attribution masks leader identity — ✅ resolved

File: packages/engine/src/services/videoFrameExtractor.ts:855, 870-880, 906-909
The in-flight map value changed from Promise<ExtractedFrames> to { leaderVideoId: string; promise: Promise<ExtractedFrames> }. The follower branch now runs await existing.promise inside its own try/catch and re-throws as { videoId: work.video.id, error: [shared extraction, leader ${existing.leaderVideoId}] ${message} }. Leader identity is preserved end-to-end — traces will read as N-1 followers of leader X failed because X failed, not N independent failures with the same stderr tail. Ship shape matches the fix pattern Rames asked for.

Test-coverage gap: the dedupe happy-path test at videoFrameExtractor.test.ts:686-716 was not extended to cover a leader-failure-with-followers scenario. The code path is short and correctness is inspectable, but a future refactor of the in-flight map value shape could silently regress the prefix. Not blocking — flagging for follow-up.
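
The leader-failure-with-followers scenario could be exercised against a sketch like the one below. It mirrors the `{ leaderVideoId, promise }` map value and the prefix format described above, but the function name, the `Outcome` union, and the injected `extract` callback are assumptions for illustration rather than the engine's actual signatures.

```typescript
// Hypothetical shapes mirroring the description above; not the engine's real types.
interface Frames {
  videoId: string;
  outputDir: string;
}
interface InFlightEntry {
  leaderVideoId: string;
  promise: Promise<Frames>;
}
type Outcome = { result: Frames } | { error: { videoId: string; error: string } };

const messageOf = (err: unknown): string => (err instanceof Error ? err.message : String(err));

async function extractDeduped(
  works: { videoId: string; dedupeKey: string }[],
  extract: (videoId: string) => Promise<Frames>,
): Promise<Outcome[]> {
  const inFlight = new Map<string, InFlightEntry>();
  return Promise.all(
    works.map(async (work): Promise<Outcome> => {
      const existing = inFlight.get(work.dedupeKey);
      if (existing) {
        try {
          const shared = await existing.promise;
          return { result: { ...shared, videoId: work.videoId } };
        } catch (err) {
          // Follower failure: name the leader so the fan-out reads as one root cause.
          const text = `[shared extraction, leader ${existing.leaderVideoId}] ${messageOf(err)}`;
          return { error: { videoId: work.videoId, error: text } };
        }
      }
      const promise = extract(work.videoId);
      inFlight.set(work.dedupeKey, { leaderVideoId: work.videoId, promise });
      try {
        return { result: await promise };
      } catch (err) {
        // Leader failure: reported verbatim under the leader's own id.
        return { error: { videoId: work.videoId, error: messageOf(err) } };
      }
    }),
  );
}
```

A test for the gap noted above would feed this an `extract` that always rejects and assert that the first outcome carries the raw message while each follower's message starts with the leader prefix.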

3. Rames R1 F2 — cascade-failure semantics on leader flake undocumented — ⚠️ mitigated-accepting

File: packages/engine/src/services/videoFrameExtractor.ts:855-909
No inline commentary added on the "one leader flake causes N failures with a fanned-out error prefix" cascade shape. However, the leader-prefixed attribution now makes the cascade observable from traces themselves: a support engineer reading a [shared extraction, leader X] line has an obvious query pivot — pull leader X's error and dedupe. The mitigation is behavioral rather than documentary, but it satisfies the underlying observability concern Rames raised. Accepting per the r2-verdict-mitigation-vs-full-resolution rubric; a doc-comment on inFlightExtractions describing the cascade contract would be a nice follow-up but not blocking.

Cross-PR interop (stack composition)

  • Composes cleanly with #1901's on-disk cache: intra-render dedupe key uses post-preflight videoPath (converted HDR path when applicable); on-disk cache uses pre-preflight source identity. Different composition rules, intentional, unchanged from R1.
  • Composes cleanly with #1902's transform-keyed cache: dedupe is a per-render in-memory optimization; the on-disk key discriminator #1902 adds doesn't interact with the in-flight map.

CI status at R2

  • Preflight, preview-regression, player-perf, regression-shards: pending at time of write; no failures.
  • Head SHA: 14fad520dc995497e6962a0d917af419c1cf67d4.

R2 by Via

One-pass VFR extraction changes frame CONTENTS for VFR sources while
the cache key tuple (path, mtime, size, trim, fps, format) is
unchanged, so warm v2 entries holding two-pass frames would keep being
served across the deploy boundary. Bumping the schema prefix makes v2
entries inert; affected sources re-extract once.
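
The effect of the prefix bump can be sketched with a toy key builder. The field names and the `"v3"` value are assumptions (the commit message only says that v2 entries become inert), and the real cache key derivation is not shown in this thread.

```typescript
// Hypothetical key layout for illustration only.
interface CacheKeyInput {
  path: string;
  mtimeMs: number;
  size: number;
  trimStart: number;
  trimDuration: number;
  fps: number;
  format: string;
}

function cacheKey(schema: string, k: CacheKeyInput): string {
  return [schema, k.path, k.mtimeMs, k.size, k.trimStart, k.trimDuration, k.fps, k.format].join("\0");
}

const source: CacheKeyInput = {
  path: "screen.mp4",
  mtimeMs: 1700000000000,
  size: 1048576,
  trimStart: 0,
  trimDuration: 10,
  fps: 30,
  format: "png",
};

// A warm entry written by the old two-pass path under the old prefix.
const warmEntries = new Map<string, string>([[cacheKey("v2", source), "two-pass frames"]]);

console.log(warmEntries.has(cacheKey("v2", source))); // true
console.log(warmEntries.has(cacheKey("v3", source))); // false: the old entry is never looked up again
```

The source tuple is unchanged between the two lookups; only the prefix differs, which is what forces a one-time re-extraction for affected sources.
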
N <video> elements sharing (resolved path, mediaStart, duration, fps,
format) extracted N times; they now share one extraction via an
in-flight promise map keyed on that tuple. Duplicate elements receive
the shared frame set under their own videoId. This also removes a race
where two identical clips on a cache miss wrote the same
extraction-cache entry dir concurrently. 3x duplicated 60s 1080p video:
4426ms to 1521ms in the A/B benchmark, one frame set on disk.

When a deduped extraction fails, every follower reported the leader's
error verbatim under its own videoId, reading as N independent
failures in traces. Follower errors now carry a
'[shared extraction, leader <id>]' prefix so the fan-out is traceable
to one root failure.

@miguel-heygen miguel-heygen force-pushed the 07-02-perf_engine_dedupe_identical_extractions branch from 14fad52 to 80a18c2 Compare July 3, 2026 20:30
@miguel-heygen miguel-heygen changed the base branch from 07-02-perf_engine_one_pass_vfr_extraction to main July 3, 2026 20:39
@miguel-heygen miguel-heygen merged commit 8d64d48 into main Jul 3, 2026
37 of 43 checks passed
@miguel-heygen miguel-heygen deleted the 07-02-perf_engine_dedupe_identical_extractions branch July 3, 2026 20:39