perf(engine): extraction cache on by default with atomic publish and LRU gc by miguel-heygen · Pull Request #1901 · heygen-com/hyperframes

miguel-heygen · 2026-07-03T18:53:28Z

Stack 4/6: #1898 → #1899 → #1900 → this → #1902 → #1885

What

The content-addressed extraction cache turns ON by default, made safe by two mechanisms it previously lacked:

Atomic entry publish. Frames extract into a unique <entry>.partial-<pid>-<uuid> dir, the .hf-complete sentinel is written there, and the dir is rename()d into the final key atomically. Concurrent renders sharing a cache can duplicate ffmpeg work but can never observe a torn entry (the cache was previously documented single-writer: a reader could see a truncated frame mid-write).
Size-capped LRU GC. A best-effort post-extraction sweep evicts oldest-used entries past a 2 GiB budget (HYPERFRAMES_EXTRACT_CACHE_MAX_MB) and clears crashed writers' partial dirs. Entries younger than 60 min are never evicted, so a live render keeps its frames; sentinel mtime (touched on every hit) is the LRU clock. The sweep only looks at cache-shaped children, so the sibling animated-gif subcache is untouched.
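A minimal sketch of the publish sequence described above, assuming Node's synchronous fs API. `publishAtomically` and the frame file name are hypothetical, not the PR's `publishCacheEntry`, and the loser handling is simplified to "discard our copy".

```typescript
import { existsSync, mkdirSync, mkdtempSync, renameSync, rmSync, writeFileSync } from "node:fs";
import { randomUUID } from "node:crypto";
import { tmpdir } from "node:os";
import { join } from "node:path";

const COMPLETE_SENTINEL = ".hf-complete";

// Hypothetical helper: extract into a private sibling dir, mark it complete,
// then rename it onto the final key. Returns true if this writer published.
function publishAtomically(finalDir: string, extract: (dir: string) => void): boolean {
  const partialDir = `${finalDir}.partial-${process.pid}-${randomUUID().slice(0, 8)}`;
  mkdirSync(partialDir, { recursive: true });
  extract(partialDir);
  // The sentinel is written before the rename, so the final name never
  // exists without it.
  writeFileSync(join(partialDir, COMPLETE_SENTINEL), "");
  try {
    renameSync(partialDir, finalDir);
    return true;
  } catch (err) {
    const code = (err as { code?: string }).code;
    if (code !== "ENOTEMPTY" && code !== "EEXIST") throw err;
    // A concurrent writer already published this key: discard our copy.
    rmSync(partialDir, { recursive: true, force: true });
    return false;
  }
}

// Usage: two writers target the same key; only the first one publishes.
const demoRoot = mkdtempSync(join(tmpdir(), "hf-publish-"));
const demoEntry = join(demoRoot, "entry");
const writeFrame = (dir: string): void => writeFileSync(join(dir, "frame_0001.jpg"), "x");
const firstPublished = publishAtomically(demoEntry, writeFrame);
const secondPublished = publishAtomically(demoEntry, writeFrame);
console.log(firstPublished, secondPublished, existsSync(join(demoEntry, COMPLETE_SENTINEL)));
```

A reader that only trusts directories containing the sentinel can never see a half-written entry, because the final name appears in a single rename.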

Defaults: cache dir <tmpdir>/hyperframes-extract-cache-<uid> (writable everywhere we render, including Lambda's /tmp; per-uid to avoid shared-/tmp collisions). Opt out with HYPERFRAMES_EXTRACT_CACHE_DIR=off|none|false|0. A non-writable cache dir degrades to uncached with a single stderr warning instead of failing the render.
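The resolution rules above could look roughly like this. `resolveCacheDirSketch` is a hypothetical name, and the case-insensitive matching and whitespace trimming are assumptions, not confirmed behavior of the PR's config code.

```typescript
import { tmpdir, userInfo } from "node:os";
import { join } from "node:path";

// Opt-out spellings listed in the PR description.
const OPT_OUT_VALUES = new Set(["off", "none", "false", "0"]);

// Returns undefined when caching is disabled, otherwise the cache root.
function resolveCacheDirSketch(env: Record<string, string | undefined>): string | undefined {
  const raw = (env["HYPERFRAMES_EXTRACT_CACHE_DIR"] ?? "").trim();
  // Assumption: opt-out values are matched case-insensitively.
  if (OPT_OUT_VALUES.has(raw.toLowerCase())) return undefined;
  if (raw !== "") return raw;
  // Default: per-uid directory under the OS temp dir.
  return join(tmpdir(), `hyperframes-extract-cache-${userInfo().uid}`);
}

console.log(resolveCacheDirSketch({}));
console.log(resolveCacheDirSketch({ HYPERFRAMES_EXTRACT_CACHE_DIR: "off" }));
```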

Why

Extraction is the most expensive render phase after capture, and the dominant dev loop is render → tweak → re-render of the same sources. With the cache on, warm re-renders skip extraction entirely.

Measured / verified

Cold vs warm CLI render of a 4-video composition: video_extract 400 ms → 13 ms; cold and warm outputs compare at PSNR inf (pixel-identical).
Tests: publish success / complete-winner collision / stale-dir replacement, GC eviction order + min-age guard + aged-partial cleanup + non-prefix dirs untouched + missing-root no-throw, sentinel touch on hit, unwritable-dir warning fallback, config default / opt-out / max-MB parsing.

Notes

Existing cache keys stay valid (no schema bump). extractCacheMaxBytes threads through the producer's extract stage alongside extractCacheDir.

vanceingalls

🟠 Request changes — atomic-publish + gc mechanics are sound; cross-PR key composition is the runtime-interop hazard

The atomic-publish primitive is well-shaped: partial dir → sentinel-inside → renameSync → same-fs atomic-swap, with a three-branch retry (target-complete → adopt winner; target-stale-unsentineled → force-rm and retry; target-partial-of-live-writer → leave and mark unpublished). The LRU sweep is scoped to SCHEMA_PREFIX/.partial- children (verified against isCacheLikeChild), so the sibling animated-gif subcache under the same root stays untouched. Single call site to lookupCacheEntry/publishCacheEntry/gcExtractionCache at the extractor — no fan-out surprises inside this PR.

The runtime-interop concern for flipping the cache on by default is the cache-key composition vs the HDR preflight, which PR #1902 addresses via a transform axis on CacheKeyInput. As shipped in this PR, the same SDR source file can produce two different cached outputs (raw-SDR frames vs HDR-normalized frames) under identical cache keys depending on the composition it was rendered into. That's a correctness hazard the moment cache-on becomes the default in a mixed workload. Cache-broken vs cache-miss also isn't fully distinguished at the caller boundary — a partial-dir mkdir failure surfaces as an extraction error rather than falling back to workDir extraction.

Findings (runtime-interop lens)

1. Cache-key omits the SDR→HDR transform axis — key aliases produce non-deterministic frames across renders — 🔴

File: packages/engine/src/services/videoFrameExtractor.ts:608-623, 653-707, 774-782
cacheKeyInputs[i] is snapshotted from resolvedVideos[i].videoPath before the HDR preflight (line 608), which is the correct choice for hitting the cache on the user-visible source. But the preflight then mutates entry.videoPath = convertedPath and entry.video.mediaStart = 0 (lines 695, 699), so the extractor at line 800 reads from the converted HDR file at mediaStart=0, while the cache stores those HDR-normalized frames under the original-SDR key (originalPath, originalMtime, originalMediaStart, ...).

Consequence when this PR ships cache-on-by-default:

Render A: mixed timeline dominated by HDR. Clip X (SDR) gets SDR→HDR converted, frames extracted from the converted file. Cache stores HDR-tagged frames under key K(X_SDR).
Render B: SDR-only composition using clip X. Lookup hits K(X_SDR), serves HDR-tagged frames back to an SDR pipeline. Silent color-space corruption on the output.

Symmetric case: Render A extracts raw-SDR frames under K(X_SDR), Render B (mixed HDR) hits the same key and serves raw-SDR frames when it needed HDR-normalized ones.

PR #1902's transform?: string on CacheKeyInput is precisely this axis. Two options: (a) hold this PR until #1902 lands so the default-on flip goes out with the transform axis in one step; or (b) merge order-locked with #1902 immediately behind. As-is, any window where prod has #1901 but not #1902 is a cache-poisoning window. Please block the default-on flip on the transform-axis being in the key composition.
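To illustrate why the transform axis matters, here is a sketch of a content-addressed key with an optional transform field. The field names, the JSON blob layout, and the use of SHA-256 are assumptions for illustration; only the optional `transform` on the key input comes from the #1902 description.

```typescript
import { createHash } from "node:crypto";

// Hypothetical key input; the real CacheKeyInput has its own field set.
interface KeyInputSketch {
  path: string;
  mtimeMs: number;
  size: number;
  mediaStart: number;
  fps: number;
  format: string;
  transform?: string;
}

function cacheKeySketch(input: KeyInputSketch): string {
  const blob: Record<string, string | number> = {
    p: input.path,
    m: input.mtimeMs,
    s: input.size,
    ms: input.mediaStart,
    f: input.fps,
    fmt: input.format,
  };
  // Only add the transform field when set, so untransformed keys hash
  // exactly as they did before the field existed.
  if (input.transform !== undefined) blob["t"] = input.transform;
  return createHash("sha256").update(JSON.stringify(blob)).digest("hex");
}

const clipX: KeyInputSketch = { path: "/src/x.mp4", mtimeMs: 1, size: 1024, mediaStart: 0, fps: 30, format: "jpg" };
const plainKey = cacheKeySketch(clipX);
const hdrKey = cacheKeySketch({ ...clipX, transform: "sdr-to-hdr" });
console.log(plainKey === hdrKey); // the two renders no longer share an entry
```

Without the discriminator, Render A and Render B above would both compute `plainKey` and read each other's frames.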

2. LRU eviction can race a concurrent reader — `minAgeMs` is documented as a liveness heuristic, not a lock — 🟠

File: packages/engine/src/services/extractionCache.ts:388-406, 356-365; packages/engine/src/services/videoFrameExtractor.ts:927-933
rehydrateCacheEntry captures framePaths: Map<number, string> — file paths, no held fds. If a second process's gcExtractionCache sweeps between one render's rehydrateCacheEntry and downstream consumption of framePaths, the consumer sees ENOENT.

The ponytail: comment on line 359 accepts this: "a render longer than minAge with a full cache could lose entries mid-read - acceptable, next render re-extracts." The subtle bit is that the current render fails, not the next one — the paths already handed back point at a now-deleted directory. The docstring should reflect that failure mode. If you want to keep the current design (no refcounts, no fd-holding), consider one of: touch-again after rehydrate (moves the LRU clock ahead of any concurrent sweep), or a passive existsSync check in the consumer with re-extract fallback. Not blocking on its own — 60 min is comfortably above typical render durations — but pair it with observability so we can see the failure mode if it ever fires in prod.
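The passive existence check suggested above might be sketched as follows. `framesOrReExtract` and its result shape are hypothetical, and the check only narrows the window (a sweep can still land after it), so it is a mitigation rather than a lock.

```typescript
import { existsSync, mkdtempSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

type FramePaths = Map<number, string>;

// If any cached frame has vanished (for example a concurrent GC sweep),
// fall back to re-extraction and report it so a counter can be bumped.
function framesOrReExtract(
  cached: FramePaths,
  reExtract: () => FramePaths,
): { frames: FramePaths; evictedMidRead: boolean } {
  const intact = Array.from(cached.values()).every((p) => existsSync(p));
  return intact
    ? { frames: cached, evictedMidRead: false }
    : { frames: reExtract(), evictedMidRead: true };
}

// Usage: one intact entry, one whose frame was swept.
const framesDir = mkdtempSync(join(tmpdir(), "hf-frames-"));
const presentFrame = join(framesDir, "frame_0001.jpg");
writeFileSync(presentFrame, "x");
const reExtracted: FramePaths = new Map([[1, presentFrame]]);
const hitResult = framesOrReExtract(new Map([[1, presentFrame]]), () => reExtracted);
const sweptResult = framesOrReExtract(new Map([[1, join(framesDir, "gone.jpg")]]), () => reExtracted);
console.log(hitResult.evictedMidRead, sweptResult.evictedMidRead);
```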

3. Cache-broken vs cache-miss not distinguished at the render boundary — 🟠

File: packages/engine/src/services/videoFrameExtractor.ts:744-755, 797-799, 883-916
mkdirSync(configuredCacheRootDir, { recursive: true }) at entry (line 748) has a nice graceful-degrade path with a stderr warning. But mkdirSync(partialDir, { recursive: true }) at line 799 has no equivalent guard. If the cache dir becomes unwritable mid-render (disk full, quota exceeded, EACCES race after another user's uid claims files under the shared /tmp path), the throw is caught at line 914 and surfaces as a {videoId, error} — the render fails for that clip instead of falling back to workDir extraction.

Cheapest fix: wrap the partialDir mkdir + publish in a try/catch that on failure logs, sets a cacheDegraded flag, and returns null from tryCachedExtract so the outer path falls through to the non-cache extractVideoFramesRange at line 901. This matches the spirit of the entry-time "caching disabled for this render" branch.
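A sketch of that fallback, under the assumption that only the partial-dir mkdir is guarded so genuine extraction errors still propagate. `tryCachedExtractSketch`, `CacheHealth`, and `extractFramesSketch` are hypothetical names, and the extractor callback is a stand-in for the real ffmpeg call.

```typescript
import { mkdirSync, mkdtempSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

interface CacheHealth {
  degraded: boolean;
}

// Returns null when the cache cannot be used, so the caller can fall
// through to uncached extraction instead of failing the clip.
function tryCachedExtractSketch(
  partialDir: string,
  health: CacheHealth,
  extractInto: (dir: string) => string,
): string | null {
  if (health.degraded) return null;
  try {
    mkdirSync(partialDir, { recursive: true });
  } catch (err) {
    health.degraded = true;
    console.warn(`extraction cache unavailable, caching disabled for this render: ${String(err)}`);
    return null;
  }
  return extractInto(partialDir);
}

function extractFramesSketch(
  partialDir: string,
  workDir: string,
  health: CacheHealth,
  extractInto: (dir: string) => string,
): string {
  return tryCachedExtractSketch(partialDir, health, extractInto) ?? extractInto(workDir);
}

// Usage: a regular file where a directory is expected makes mkdir throw.
const fallbackRoot = mkdtempSync(join(tmpdir(), "hf-fallback-"));
const blocker = join(fallbackRoot, "not-a-dir");
writeFileSync(blocker, "");
const cacheHealth: CacheHealth = { degraded: false };
const usedDir = extractFramesSketch(
  join(blocker, "key.partial-1"),
  join(fallbackRoot, "work"),
  cacheHealth,
  (dir) => dir, // stand-in extractor: just reports where it would write
);
console.log(usedDir, cacheHealth.degraded);
```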

4. Intra-render dedupe key vs on-disk cache key — different composition rules, verified consistent — 🟢

File: packages/engine/src/services/videoFrameExtractor.ts:858, 774-782
Base PR #1900's dedupeKey = ${videoPath}\0${mediaStart}\0${videoDuration}\0${fps}\0${format} uses the post-preflight videoPath (converted HDR path), while this PR's on-disk key uses mtimeMs/size of the pre-preflight SDR source. Different composition rules, but they're used in different scopes — dedupe within one render (same preflight decisions apply to every ref), cache across renders (needs source-identity stability). Reviewed for coherence: no decorative-gate-style miss here, the two layers are intentionally different. The transform-axis in finding #1 is what makes the cross-render axis correct too.

5. `partialCacheEntryDir` collision surface via truncated UUID — negligible but note — 🟢

File: packages/engine/src/services/extractionCache.ts:167-169
randomUUID().slice(0, 8) gives 32 bits of entropy per pid. With the pid included, birthday collision for a single writer would need ~65k concurrent partial dirs for the same key from the same pid — unreachable. Fine as-is.
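The ~65k figure is the usual birthday-bound order of magnitude (the square root of 2^32). A quick check using the standard approximation p ≈ 1 − exp(−n(n−1)/2N), which is an approximation and not an exact count:

```typescript
// Approximate probability that at least two of n random suffixes collide
// when each suffix carries `bits` bits of entropy.
function collisionProbability(n: number, bits: number): number {
  const space = 2 ** bits;
  // expm1 keeps precision when the exponent is tiny.
  return -Math.expm1(-(n * (n - 1)) / (2 * space));
}

const atSqrt = collisionProbability(65536, 32); // roughly 0.39
const atTwo = collisionProbability(2, 32); // on the order of 1e-10
console.log(atSqrt.toFixed(3), atTwo.toExponential(1));
```

So even two simultaneous partials for the same key from the same pid collide with negligible probability, which supports the "fine as-is" verdict.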

Convergence with Rames

Rames-bot had not posted at review time (verified via gh api /pulls/1901/reviews). No convergences to cite.

Review by Via (runtime-interop lens)

james-russo-rames-d-jusso

Reviewed at 299ec9d185570f8ac5aadcfa5e2f0220dd9793e5.
Peer scan: none yet — first review pass; Via has HF-runtime-interop lane so I'm coming in on canonical-rubric + cross-cutting concerns.
Stack: this PR against 07-02-perf_engine_dedupe_identical_extractions (#1900); cross-stack notes for #1902 (transform key) folded in below.

Summary — Turns the content-addressed extraction cache on by default under <tmpdir>/hyperframes-extract-cache-<uid>, and lands two mechanisms that make that safe: (1) publishCacheEntry writes the .hf-complete sentinel inside a .partial-<pid>-<uuid>/ dir then atomically renameSyncs over the final key, (2) gcExtractionCache sweeps a 2 GiB LRU budget after each extraction using sentinel mtime as the LRU clock, guarded by a 60 min minAgeMs liveness heuristic. Warm re-renders drop video_extract from 400 ms → 13 ms.

The core rename-based atomicity and the loser-rehydrates path both look correct; the concerns below are about rollout safety and cross-cutting hygiene, not the mechanism.
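For readers following along, the sweep summarized above can be sketched like this. It is a simplified illustration, not `gcExtractionCache`: it recognizes entries by the presence of the sentinel rather than by a name prefix, assumes entries are flat directories of files, and omits aged-partial cleanup.

```typescript
import { existsSync, mkdirSync, mkdtempSync, readdirSync, rmSync, statSync, utimesSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

const SENTINEL_NAME = ".hf-complete";

interface SweepStats {
  evictedEntries: number;
  evictedBytes: number;
}

// Sum of file sizes directly inside dir (entries are assumed flat).
function dirBytes(dir: string): number {
  return readdirSync(dir).reduce((sum, name) => sum + statSync(join(dir, name)).size, 0);
}

function sweepSketch(root: string, maxBytes: number, minAgeMs: number, nowMs: number): SweepStats {
  const stats: SweepStats = { evictedEntries: 0, evictedBytes: 0 };
  if (!existsSync(root)) return stats;
  const entries = readdirSync(root)
    .map((name) => join(root, name))
    .filter((dir) => existsSync(join(dir, SENTINEL_NAME)))
    .map((dir) => ({ dir, bytes: dirBytes(dir), lastUsedMs: statSync(join(dir, SENTINEL_NAME)).mtimeMs }))
    .sort((a, b) => a.lastUsedMs - b.lastUsedMs); // least recently used first
  let total = entries.reduce((sum, e) => sum + e.bytes, 0);
  for (const entry of entries) {
    if (total <= maxBytes) break;
    if (nowMs - entry.lastUsedMs < minAgeMs) continue; // may belong to a live render
    rmSync(entry.dir, { recursive: true, force: true });
    total -= entry.bytes;
    stats.evictedEntries += 1;
    stats.evictedBytes += entry.bytes;
  }
  return stats;
}

// Usage: three 1000-byte entries against a 1500-byte budget.
const gcRoot = mkdtempSync(join(tmpdir(), "hf-gc-"));
const gcNow = Date.now();
const HOUR_MS = 60 * 60 * 1000;
function makeEntry(name: string, ageMs: number): string {
  const dir = join(gcRoot, name);
  mkdirSync(dir);
  writeFileSync(join(dir, "frame_0001.jpg"), "x".repeat(1000));
  const sentinel = join(dir, SENTINEL_NAME);
  writeFileSync(sentinel, "");
  const used = new Date(gcNow - ageMs);
  utimesSync(sentinel, used, used); // sentinel mtime is the LRU clock
  return dir;
}
const oldestEntry = makeEntry("a", 3 * HOUR_MS);
const olderEntry = makeEntry("b", 2 * HOUR_MS);
const liveEntry = makeEntry("c", 0);
mkdirSync(join(gcRoot, "animated-gif")); // sibling subcache, no sentinel
const sweepStats = sweepSketch(gcRoot, 1500, HOUR_MS, gcNow);
console.log(sweepStats);
```

Note that the min-age guard takes priority over the budget: if every entry is younger than `minAgeMs`, the sweep evicts nothing and the cache stays over its cap.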

Concerns

🟠 Zero-soak introduce-and-flip. The same PR that adds the atomic-publish primitive also flips the cache to default-on, with blast radius = every render on every runtime path (Lambda, worker, CLI). The prior invariant was documented "single-writer; readers can see truncated frames" — every consumer that has ever run against a shared extractCacheDir was implicitly relying on external mutex or single-render assumption. There's no soak period, no cohort gate, no kill-switch beyond HYPERFRAMES_EXTRACT_CACHE_DIR=off. Suggest either (a) landing atomic-publish + LRU-gc in one PR with cache still opt-in, then a separate flip PR after ~a week of prod observation, or (b) adding a short-term "flip via env" mode (e.g. HYPERFRAMES_EXTRACT_CACHE_DEFAULT=on|off) with the switch flipped to on in a follow-up so rollback is a redeploy of one env, not a code revert of the primitive.
🟠 Observability gap for the failure paths. breakdown.cacheHits / cacheMisses are the only signals. For an on-by-default cache the ops-relevant events are all invisible: (1) publishCacheEntry returns { published: false } on sentinel-write / rename failure → the render still succeeds but future renders re-extract; (2) gcExtractionCache evictions per sweep (bytes freed, entries removed); (3) aged-partial cleanups (a proxy for crashed-writer count); (4) the "cache dir is not writable" degradation (currently one stderr line per render, no counter). Once this is default-on, the first regression signal on prod is going to be "why are warm renders slow again?" and there'll be nothing to look at. Suggest at minimum: add cachePublishFailures, cacheGcEvictions, cacheGcBytesFreed, cacheAgedPartialsCleared to phaseBreakdown, and log the unwritable-dir warning through the render log (structured) rather than raw process.stderr.write so it lands in the same place ops looks for the rest of the render's signals.
🟠 TOCTOU in publishCacheEntry's "target exists but unsentineled" recovery. extractionCache.ts:203-244: on ENOTEMPTY/EEXIST, if existsSync(entry.dir/COMPLETE_SENTINEL) is false the code does rmSync(entry.dir, {recursive:true, force:true}) and retries the rename. Between the existsSync check and the rmSync, another writer can complete their publish (their rename lands, their sentinel is visible). Our rmSync then destroys their fully-published entry, and our subsequent rename replaces it with our own (also complete) copy. Net effect isn't corruption — both writers extracted the same content-addressed key, both have valid sentinels — but it's wasted work and it means the "at worst two writers duplicate ffmpeg work and one rehydrates from the winner" claim in the PR body isn't quite right in this branch (the loser can be the winner, kill the winner's dir, and become the winner). Rare in practice, but if you want to defend against it: re-check existsSync(entry.dir/COMPLETE_SENTINEL) after the rmSync failure branch too, or scope the recovery to only unsentineled-and-aged directories (mtime older than a small threshold).

Nits

🟡 extractionCache.ts:328 — // ponytail: in the source. Presumably a personal marker. Squash-strip.
🟡 videoFrameExtractor.ts:823 — the unwritable-cache-dir stderr warning is one-shot per render but not deduped across renders in the same process. On a Lambda cold-start with a mispointed cache dir, every render in the container's lifetime prints the warning. Not urgent, and once (2) above ships this becomes a structured log with a counter attached, so probably moot.
🟡 gcExtractionCache never emits an error/warn line when a sweep can't reach its budget (all entries under minAgeMs). For a runaway concurrent-render burst, the cache can exceed the 2 GiB cap indefinitely and there's no signal. Low-severity because the disk pressure will surface elsewhere first, but worth folding into the metric surface from (2) above.
🟡 HYPERFRAMES_EXTRACT_CACHE_MAX_MB=0 degrades to "evict every entry older than minAgeMs" rather than "disable size cap". envNum returns 0 (finite), then * 1024**2 = 0, then the totalBytes <= 0 guard is only true when the cache is empty. Not a bug — arguably user asked for aggressive eviction — but a user reading _MAX_MB might reasonably expect 0 to mean "unlimited". Doc string could clarify.
🟡 isTargetExistsRenameError accepts EPERM. Linux filesystems rarely return EPERM for a rename onto an existing target, but APFS, SMB, and some overlay filesystems do. Fine for the paths this code runs on (Lambda /tmp, worker /tmp), but a one-line comment on the union of platform semantics would help future readers who see EPERM and wonder.
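The MAX_MB=0 nit above is easiest to see in a small parsing sketch. `maxBytesFromEnv` is a hypothetical stand-in for the envNum-based parsing, and falling back to the default on negative or non-numeric input is an assumption.

```typescript
const DEFAULT_MAX_MB = 2048; // 2 GiB, the default budget stated in the PR

// Converts the HYPERFRAMES_EXTRACT_CACHE_MAX_MB string into a byte budget.
function maxBytesFromEnv(raw: string | undefined): number {
  const parsed = raw === undefined || raw.trim() === "" ? Number.NaN : Number(raw);
  // Assumption: anything that is not a finite, non-negative number falls
  // back to the default budget.
  const mb = Number.isFinite(parsed) && parsed >= 0 ? parsed : DEFAULT_MAX_MB;
  return mb * 1024 ** 2;
}

console.log(maxBytesFromEnv(undefined)); // 2147483648
console.log(maxBytesFromEnv("0")); // 0
```

With a budget of 0 bytes, any non-empty cache is over budget, so every entry older than minAgeMs becomes eligible for eviction; 0 does not mean "unlimited".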

Cross-stack interactions

#1902 (SDR→HDR transform key) composes cleanly. It extends CacheKeyInput with transform?: string and rides the same partialCacheEntryDir / publishCacheEntry primitive from this PR. Key blob JSON keeps the t field out when undefined so existing entries stay valid (no schema bump). Nothing further to reconcile between #1901 and #1902.
#1900 (intra-render dedupe) is orthogonal. Its dedupeKey is in-memory-only; it doesn't touch the on-disk cache primitive, so the two dedup layers stack (per-render dedupe, per-machine cache) without interaction.
#1885 (superset extraction) not audited here — that's a bigger diff and its cache-key/superset semantics probably deserve their own second pass. Nothing in #1901 pre-commits to a superset-friendly key layout, so #1885 will have to pick its own key shape and I'd expect that review to flag any tension there.
animated-gif subcache safe. Confirmed via isCacheLikeChild filter (SCHEMA_PREFIX prefix or .partial- substring) and the "ignores non-cache-prefix directories" test at extractionCache.test.ts:340. compileStage.ts:127 puts the animated-gif dir inside extractCacheDir but its name is animated-gif — the GC sweep skips it.

Questions

The default extractCacheDir under <tmpdir>/hyperframes-extract-cache-<uid> — on Lambda /tmp is 10 GiB total and is shared with the render's own scratch (downloads, compiled artifacts, per-frame captures). With a 2 GiB LRU budget, warm-container behavior is fine, but on a burst of large-source renders (e.g. 4K source, long duration) a single render can fill several GiB of /tmp outside the cache, and the LRU sweep runs after extraction (not before). Any concern about /tmp ENOSPC during extraction on a warm container, or is the render's own workDir accounting already sized to leave headroom?
60 min minAgeMs — was that picked to bound the max realistic single-render duration, or is there a specific worst-case observed render time this is protecting? Feels right, just want the number in the record if there's a story behind it.

What I didn't verify

Actual rename atomicity semantics on the concrete filesystems the cache lands on in prod (Lambda /tmp = tmpfs; worker /tmp may vary). rename() is documented atomic on tmpfs/ext4, and onto a non-empty target directory it fails with ENOTEMPTY/EEXIST rather than replacing it; behavior on a network-mounted extractCacheDir (explicitly warned against in the docstring for a different reason) I didn't trace.
Crash-under-SIGKILL mid-extract behavior — the aged-partial cleanup path is tested but I didn't run the actual crash-and-recover scenario end-to-end.
Interaction with the compile stage's animatedGifCacheDir under a shared-/tmp concurrent-render burst — verified GC skips it, didn't verify the animated-gif subcache's own eviction / lifetime policy.

— Rames D Jusso

miguel-heygen · 2026-07-03T20:02:32Z

Review feedback addressed in the updated branch (commit: 'harden extraction cache publish and surface cache ops signals'):

Fixed (the 🔴): transform-axis poisoning is now closed within this PR. HDR-converted intermediates bypass the cache entirely (cacheKeyInputs[i] = null at the conversion site), so this PR alone cannot alias converted frames under a plain source key. #1902 (perf(engine): one-pass SDR-to-HDR extraction with cache-key transform) then re-enables caching for converted entries with a transform-discriminated key and byte-level cross-render regression tests. No point in the stack has the poisoning window.
Fixed — publish TOCTOU: publishCacheEntry now routes both recovery points through adoptPublishedWinner: a winner that completes between the sentinel check and the stale-dir rm is adopted, and a winner that lands between our rm and retry-rename is adopted instead of being reported as a failure. A concurrent writer's completed publish can no longer be destroyed.
Fixed — observability: cachePublishFailures, cacheGcEvictions, cacheGcBytesFreed, cacheAgedPartialsCleared added to ExtractionPhaseBreakdown; gcExtractionCache returns sweep stats. 'Warm renders slow again' now has counters to look at.
minAge eviction race: acknowledged, deliberately a heuristic not a lock — the ponytail comment at the eviction site names the ceiling (a render longer than 60min against a full cache can lose entries mid-read; next render re-extracts). A lockfile protocol is more machinery than the failure mode warrants.
Zero-soak flip: partially disagree, flagging the remainder for the author. The kill-switch already exists and is a one-env redeploy (HYPERFRAMES_EXTRACT_CACHE_DIR=off), not a code revert — that covers the rollback story the review asks for. Whether to additionally split the default-flip into its own PR after a soak window is a rollout-policy call for @miguel-heygen; the stack structure makes that split trivial (move the config hunk to a new top PR) if preferred.

james-russo-rames-d-jusso

R2 verification — reviewed at b75a3d9e0f44610b03375493543a9b17aeff691a (R1 at 299ec9d1).
R1 findings verified as: 🟢-alt / 🟢 / 🟢.
Peer scan: Vance posted a runtime-interop review between my R1 and this R2 (🔴 transform-axis, 🟠 LRU eviction race, 🟠 cache-broken-vs-miss). The 🔴 is now closed within this PR by the HDR-converted bypass (cacheKeyInputs[i] = null at videoFrameExtractor.ts:720); the two 🟠s are noted below as residuals separate from my R1 lane.

Summary — b75a3d9 addresses all three of my R1 findings. Observability + TOCTOU are structurally fixed; the zero-soak flag flip stays default-on-in-same-PR but the env-var kill-switch shape (HYPERFRAMES_EXTRACT_CACHE_DIR=off) is documented as the rollback path, which meets the goal by the alternative-resolution rubric.

R1 resolutions

F1 (zero-soak flip default-on): 🟢-alt — rollback path documented rather than soak-gate added. Cache remains default-on (config.ts:373-376: HYPERFRAMES_EXTRACT_CACHE_DIR undefined → auto-resolves to <tmpdir>/hyperframes-extract-cache-<uid>), and the flip lives in the same PR as the primitive. Author's response ack'd this and named the kill-switch (HYPERFRAMES_EXTRACT_CACHE_DIR=off → resolveExtractCacheDir() returns undefined → cache disabled process-wide, no code revert). Env-flip rollback is a legitimate alternative to soak — one-env redeploy is fast enough that the rollback story is complete. Not blocking; noted that a follow-up PR splitting the default-on flip after a soak window would be cleaner if the team wants belt-and-braces. Flagging @miguel-heygen for the rollout-policy call.
F2 (observability on failure paths): 🟢 — ExtractionPhaseBreakdown at videoFrameExtractor.ts:116-122 now surfaces cachePublishFailures, cacheGcEvictions, cacheGcBytesFreed, cacheAgedPartialsCleared. Wired: publish-failure counter incremented at videoFrameExtractor.ts:837 when publishCacheEntry returns published: false; gcExtractionCache returns GcStats from extractionCache.ts:370-402 (evictedEntries, evictedBytes, agedPartialsRemoved) and the extractor lifts those into the breakdown at videoFrameExtractor.ts:984-986. Unwritable-dir case still surfaces via process.stderr.write warning at videoFrameExtractor.ts:776-778 — no counter but a single-shot warning per render is the right shape for that path. All four failure classes I named in R1 now have signal.
F3 (TOCTOU in publishCacheEntry unsentineled-target recovery): 🟢 — extractionCache.ts:189-232. Two structural fixes: (a) new helper adoptPublishedWinner(entry, partialDir) at :194-198 checks for COMPLETE_SENTINEL and removes the losing partial when a winner exists; (b) both recovery points route through it — before rmSync(entry.dir) at :214 (a winner that completed between the initial rename failure and the rm is now adopted, so the winner's published dir is never destroyed), and after the failed retry rename at :230 (a winner that landed between our rm and retry-rename is now adopted instead of reported as a failure). The check-then-remove sequence I flagged is genuinely closed — a losing writer can no longer destroy a concurrent winner's freshly-published entry. Per feedback_race_fix_preempt_vs_narrow this is a proper PREEMPT (adopt-then-return), not a narrower-race band-aid.

Cross-stack recheck

#1902 composition unchanged: transform-axis on CacheKeyInput still rides on this PR's atomic publish primitive; the cacheKeyInputs[i] = null bypass at videoFrameExtractor.ts:720 for HDR-converted intermediates means #1901 alone can't poison the cache, and #1902's transform-keyed change then re-enables caching for those entries with a discriminated key. No cache-poisoning window at any point in the merged stack.
#1900 (in-memory dedupe) still orthogonal; animated-gif subcache still safe (isCacheLikeChild scoped to SCHEMA_PREFIX + .partial- in extractionCache.ts:262-264).

Residuals

Vance's 🟠 LRU eviction can race a concurrent reader — extractionCache.ts:389-398 + videoFrameExtractor.ts:812-819 rehydrateCacheEntry hands out file paths, no held fds; a concurrent process's gcExtractionCache between rehydrateCacheEntry and downstream consumption would produce ENOENT. Author accepted this as a documented liveness ceiling (60min minAge > typical render). Not in my R1 lane, agree with Vance that it's non-blocking given the observability counters now surface if it ever fires in prod.
Vance's 🟠 cache-broken vs cache-miss not distinguished at render boundary — videoFrameExtractor.ts:824 mkdirSync(partialDir, { recursive: true }) still uncaught; a mid-render disk-full / EACCES surfaces as an extraction error rather than falling back to uncached workDir extraction. Cheap fix would be wrapping the partialDir mkdir + publish so tryCachedExtract returns null on cache-degrade and the outer path falls through to extractVideoFramesRange at :945-953. Deferring to Vance's lane; not blocking my sign-off but worth doing.
F1 rollout note (repeat from above): the rollback path is an env flip, so undoing a bad flip takes a config redeploy rather than a code revert. Cheap enough that I'd land as-is, but @miguel-heygen, your call on whether to split the flip into a separate follow-up.

Batch state

CI: Preflight, preview-regression, player-perf green; 8 regression-shards still in_progress; Graphite / mergeability_check in_progress. Downstack (#1900) needs to land first per Graphite's warning banner. No red required checks.
Approvals: none yet on this PR. Vance's R1 is a request-changes on the 🔴 transform-axis coupling. That finding is now closed within #1901 by the HDR bypass, so I read his blocker as resolved, but he hasn't re-verified.
Zero-soak call still open for @miguel-heygen (rollout policy, not blocking).

— Rames D Jusso

vanceingalls

🟢 R2 LGTM — HDR-converted bypass closes my 🔴 within this PR, TOCTOU tightened structurally, observability wired end-to-end

R2 verification — reviewed at b75a3d9e0f44610b03375493543a9b17aeff691a (R1 at 299ec9d1).
Peer scan: Rames-bot posted R2 verifying all three of his R1 findings as 🟢 (F1 zero-soak as 🟢-alt with env-flip rollback). Convergences cited per finding below.
Verification method: gh api /pulls/1901/files (7 files touched) + per-file gh api /contents?ref=<sha> diff, per gh-compare-rebased-branches-caveat.

R1 was a 🟠 request-changes from me on the cross-PR transform-axis order-lock. R2 closes that finding within this PR via a targeted cacheKeyInputs[i] = null bypass at videoFrameExtractor.ts:715-720 for HDR-converted intermediates — the poisoning window that motivated the order-lock no longer exists, so #1901 is safe to land ahead of #1902 (which then re-enables caching for those entries under a discriminated key). Rames' R1 findings (zero-soak, observability, TOCTOU) all show structural fixes; the two residuals from my R1 (LRU eviction race, cache-broken vs cache-miss at mkdir) are accepted as documented liveness ceilings, not blocking.

Finding-by-finding disposition

1. Via R1 F1 — cache-key omits SDR→HDR transform axis; poisoning window if #1901 lands without #1902 — ✅ resolved (closed within this PR)

File: packages/engine/src/services/videoFrameExtractor.ts:713-720
The R2 fix intercepts converted intermediates before they can be published under the original-SDR key:

entry.videoPath = convertedPath;
// The converted intermediate carries BT.2020-mapped pixels but the
// cache key snapshot above still describes the ORIGINAL source.
// Publishing converted frames under that key would poison later
// plain-SDR renders of the same trim, so bypass the cache for
// converted entries. (The follow-up transform-keyed cache change
// re-enables caching for these with a discriminated key.)
cacheKeyInputs[i] = null;

This closes the poisoning window I flagged: HDR-converted extractions no longer produce a cached entry at all under this PR, so the SDR-only follow-up render can never hit a poisoned key. #1902 then re-enables the cache for those entries via the transform-axis discriminator. Cross-PR order-lock is no longer a merge gate — #1901 is safe standalone.

Convergence with Rames R2 on the same file+line ("HDR-converted bypass ... means #1901 alone can't poison the cache").

2. Via R1 F2 — LRU eviction can race a concurrent reader — ⚠️ mitigated-accepting (documented liveness ceiling)

File: packages/engine/src/services/extractionCache.ts:389-402; videoFrameExtractor.ts:812-819
R2 did not add reader-side existence guards or refcounts. But it did add the cacheGcEvictions / cacheGcBytesFreed / cacheAgedPartialsCleared observability signals (see item 5 below), so if the race ever fires in prod we'll see eviction pressure rising against render duration. The 60-min minAgeMs remains the liveness ceiling. Accepting the documented tradeoff per r2-verdict-mitigation-vs-full-resolution: the counters + doc comment on ExtractionPhaseBreakdown at :117 turn this from an invisible race into an observable one.

Convergence with Rames R2 residuals section.

3. Via R1 F3 — cache-broken vs cache-miss at partialDir mkdir — ❌ still open, not blocking

File: packages/engine/src/services/videoFrameExtractor.ts:824 (unchanged from R1)
mkdirSync(partialDir, { recursive: true }) still throws through to the extraction-error branch on EACCES/ENOSPC mid-render. The cheapest fix (wrap + return null from tryCachedExtract to fall through to non-cache path at :945-953) was not applied. Rames' R2 flagged the same thing and deferred to my lane — I'm ❌-open with a "not blocking, follow-up PR" disposition, since the entry-time cache-init check does have a graceful fallback and this only affects mid-render disk-state changes. Ship-blocking iff Miguel wants belt-and-braces before default-on; otherwise fold into next stack.

4. Rames R1 F1 — zero-soak introduce-and-flip — ⚠️ mitigated-accepting (env-flip kill-switch documented)

File: packages/engine/src/config.ts:373-376
Cache remains default-on in the same PR that introduces the atomic-publish primitive. Rollback path is HYPERFRAMES_EXTRACT_CACHE_DIR=off → resolveExtractCacheDir() returns undefined → cache disabled process-wide, no code revert. Rames dispositioned this as 🟢-alt in R2; I concur — env-flip is a legitimate alternative to a soak-gate when the flip is fast enough. Deferring to Miguel on whether a follow-up split-the-flip PR is worth the belt-and-braces.

5. Rames R1 F2 — observability gap on failure paths — ✅ resolved

File: packages/engine/src/services/videoFrameExtractor.ts:116-122, 557-561, 833-836, 977-986; extractionCache.ts:322-402
Four new fields on ExtractionPhaseBreakdown (cachePublishFailures, cacheGcEvictions, cacheGcBytesFreed, cacheAgedPartialsCleared) with clear doc-comments. Publish-failure counter wired in tryCachedExtract at :833-836. gcExtractionCache returns GcStats (new export) from extractionCache.ts:322-402 — evictedEntries, evictedBytes, agedPartialsRemoved — and the extractor lifts those into the breakdown at :977-986. All four failure classes Rames named in R1 now have signal. The unwritable-dir case at :776-778 stays as a single-shot process.stderr.write warning — right shape for a per-render one-time error.

Convergence with Rames R2 F2 verification (same file:line citations).

6. Rames R1 F3 — TOCTOU in `publishCacheEntry` unsentineled-target recovery — ✅ resolved (structural preempt)

File: packages/engine/src/services/extractionCache.ts:189-232
Two structural fixes:

New adoptPublishedWinner(entry, partialDir) helper at :194-198: checks for COMPLETE_SENTINEL, removes the losing partial, returns the winner's result.
Both recovery points route through it — before rmSync(entry.dir) at :214 (a winner that completed between initial rename-fail and the rm is now adopted, so the winner's published dir is never destroyed), AND in the post-rm retry catch at :226-231 (a winner that landed between our rm and retry-rename is now adopted rather than reported as a failure).

I'll add one nuance not in Rames' verification: there IS still a tighter race window between the adoptPublishedWinner check at :214 and the rmSync(entry.dir) at :217 — if a winner publishes in that microwindow, we still nuke it. But the post-rm retry catch at :226-231 catches the case where a winner publishes into the rm'd slot after our rm succeeded, so the observable damage would be a lost partial rather than a poisoned final. Rames dispositioned as clean preempt; I'd call it a substantially-tightened race with a defense-in-depth catch-block adopt — either framing lands the same disposition. ✅ resolved for the disposition rubric; the residual microwindow is not worth a separate finding.
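The two adoption points and the residual microwindow discussed above can be sketched as follows. This is a simplified model, not the PR's code: `adoptWinner` and `publishWithAdoption` are hypothetical names, and the live-writer branch of the real publish path is omitted.

```typescript
import { existsSync, mkdirSync, mkdtempSync, renameSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

const SENTINEL_FILE = ".hf-complete";
type PublishOutcome = "published" | "adopted" | "failed";

function isTargetExists(err: unknown): boolean {
  const code = (err as { code?: string }).code;
  return code === "ENOTEMPTY" || code === "EEXIST";
}

// If a complete entry already sits at entryDir, drop our partial and use it.
function adoptWinner(entryDir: string, partialDir: string): boolean {
  if (!existsSync(join(entryDir, SENTINEL_FILE))) return false;
  rmSync(partialDir, { recursive: true, force: true });
  return true;
}

function publishWithAdoption(entryDir: string, partialDir: string): PublishOutcome {
  try {
    renameSync(partialDir, entryDir);
    return "published";
  } catch (err) {
    if (!isTargetExists(err)) throw err;
  }
  // Adoption point 1: a winner completed before we touch the target.
  if (adoptWinner(entryDir, partialDir)) return "adopted";
  // Residual microwindow: a winner publishing exactly here is still removed.
  rmSync(entryDir, { recursive: true, force: true });
  try {
    renameSync(partialDir, entryDir);
    return "published";
  } catch {
    // Adoption point 2: a winner landed between our rm and the retry.
    return adoptWinner(entryDir, partialDir) ? "adopted" : "failed";
  }
}

// Usage: clean publish, publish against a complete winner, publish over a
// stale unsentineled directory.
const adoptRoot = mkdtempSync(join(tmpdir(), "hf-adopt-"));
function makeDir(name: string, complete: boolean): string {
  const dir = join(adoptRoot, name);
  mkdirSync(dir);
  writeFileSync(join(dir, "frame_0001.jpg"), "x");
  if (complete) writeFileSync(join(dir, SENTINEL_FILE), "");
  return dir;
}
const targetDir = join(adoptRoot, "entry");
const cleanOutcome = publishWithAdoption(targetDir, makeDir("p1", true));
const loserPartial = makeDir("p2", true);
const loserOutcome = publishWithAdoption(targetDir, loserPartial);
rmSync(targetDir, { recursive: true, force: true });
makeDir("entry", false); // stale leftover without a sentinel
const staleOutcome = publishWithAdoption(targetDir, makeDir("p3", true));
console.log(cleanOutcome, loserOutcome, staleOutcome);
```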

Convergence with Rames R2 F3 (same helper naming, same :194-198 citation).

Cross-PR interop (#1901 ↔ #1902 order-lock note)

Resolved by the HDR-converted bypass at videoFrameExtractor.ts:715-720. My R1 request-changes was gated on the transform-axis being either in-key or documented as order-locked with #1902. R2 chose a third option: don't cache converted intermediates at all in this PR, then #1902 adds the transform-axis and re-enables caching for those entries under a discriminated key. Result: #1901 can land ahead of #1902 without a cache-poisoning window. My blocker is closed within this PR; #1902 is now a pure feature-add rather than a merge-order dependency.
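To make the poisoning window concrete, here is a toy model of the bypass. It is an assumption-laden sketch: the cache is an in-memory `Map`, and `extractFrames`, `Frames`, and the `bypassConverted` flag are hypothetical names, not the code at videoFrameExtractor.ts:715-720.

```typescript
// Toy stand-in for extracted frames; only tracks whether they were HDR-converted.
interface Frames {
  hdrConverted: boolean;
}

// The key describes the original source only, so an SDR render and an
// HDR-converted render of the same trim share one key.
function extractFrames(
  cache: Map<string, Frames>,
  sourceKey: string,
  wantHdrConversion: boolean,
  bypassConverted: boolean,
): Frames {
  // With the bypass on, converted renders neither read nor publish.
  const useCache = !(bypassConverted && wantHdrConversion);
  if (useCache) {
    const hit = cache.get(sourceKey);
    if (hit) return hit;
  }
  const frames: Frames = { hdrConverted: wantHdrConversion };
  if (useCache) cache.set(sourceKey, frames);
  return frames;
}
```

Without the bypass, an HDR-converted render followed by a plain-SDR render of the same trim hands the SDR render converted frames; with it, the SDR render extracts and publishes its own.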

CI status at R2

Head SHA: b75a3d9e0f44610b03375493543a9b17aeff691a.
CI: pending (no red status contexts). Rames' R2 notes Preflight, preview-regression, and player-perf green, with 8 regression-shards in progress; the aggregate status endpoint hadn't converged when I ran mine.
Approvals: none yet. My R1 was 🔴 request-changes; this R2 closes it.

R2 by Via

…LRU gc

Warm re-renders now skip source-video frame extraction entirely (video_extract 400ms -> 13ms on a 4-video composition; outputs are pixel-identical, PSNR inf). What made default-on safe:

- Atomic entry publish: frames extract into a unique .partial-<pid>-<uuid> dir, the completion sentinel is written there, and the dir is renamed into the final key atomically. Concurrent renders sharing a cache can duplicate work but can never serve a torn entry (previously documented as single-writer only).
- Size-capped LRU gc: best-effort sweep after extraction evicts oldest-used entries past a 2 GiB default budget (HYPERFRAMES_EXTRACT_CACHE_MAX_MB) and clears crashed writers' partials. Entries younger than 60 min are never evicted so live renders keep their frames.
- Default cache dir: <tmpdir>/hyperframes-extract-cache-<uid>. Opt out with HYPERFRAMES_EXTRACT_CACHE_DIR=off (or none/false/0); a non-writable dir degrades to uncached with a single warning instead of failing the render.
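A sketch of how the defaults and opt-out values described in this commit might resolve. The environment variable names, opt-out tokens, default path shape, and 2 GiB budget come from the PR text; `resolveCacheConfig`, case-insensitive matching, and the fallback to the default on an unparsable size are assumptions.

```typescript
interface CacheConfig {
  dir: string | null; // null means caching is disabled
  maxBytes: number;
}

const OPT_OUT_VALUES = new Set(["off", "none", "false", "0"]);
const DEFAULT_MAX_MB = 2048; // 2 GiB default budget

// Hypothetical resolver; env, tmpdir and uid are passed in so the function
// stays pure and testable.
function resolveCacheConfig(
  env: Record<string, string | undefined>,
  tmpdir: string,
  uid: number | string,
): CacheConfig {
  const rawDir = env["HYPERFRAMES_EXTRACT_CACHE_DIR"]?.trim();
  let dir: string | null;
  if (rawDir && OPT_OUT_VALUES.has(rawDir.toLowerCase())) {
    dir = null; // explicit opt-out
  } else if (rawDir) {
    dir = rawDir; // explicit directory
  } else {
    dir = `${tmpdir}/hyperframes-extract-cache-${uid}`; // per-uid default
  }

  // Assumption: non-numeric or non-positive values fall back to the default.
  const parsedMb = Number(env["HYPERFRAMES_EXTRACT_CACHE_MAX_MB"]);
  const maxMb = Number.isFinite(parsedMb) && parsedMb > 0 ? parsedMb : DEFAULT_MAX_MB;
  return { dir, maxBytes: maxMb * 1024 * 1024 };
}
```

In a real process the arguments would likely come from `process.env`, `os.tmpdir()`, and the current uid.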

…gnals

Review hardening for the default-on extraction cache:

- Bypass the cache for HDR-converted intermediates: the key snapshot describes the original source, so publishing converted frames under it would poison later plain-SDR renders of the same trim. (The follow-up transform-keyed change re-enables caching for these.)
- publishCacheEntry TOCTOU: adopt a concurrent writer's completed entry both before removing an apparently-stale dir and after a failed retry rename, so a winner's publish is never destroyed or reported as a failure.
- Observability for the failure paths: cachePublishFailures, cacheGcEvictions, cacheGcBytesFreed, and cacheAgedPartialsCleared on ExtractionPhaseBreakdown; gcExtractionCache now returns sweep stats.

After a SCHEMA_PREFIX bump, old-generation entries (hfcache-v2-*) no longer matched the sweep's prefix filter and would orphan their disk forever. The gc now matches any hfcache-v* generation; superseded entries never receive sentinel touches, so the LRU evicts them first.
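The eviction order this implies can be sketched as a pure planning step. This is an illustration, not gcExtractionCache itself: the `hfcache-v<N>-` name pattern is inferred from the `hfcache-v2-*` example, `planEvictions` and `CacheEntryInfo` are hypothetical, and counting only cache-shaped entries toward the budget is an assumption.

```typescript
interface CacheEntryInfo {
  name: string;            // directory name under the cache root
  bytes: number;           // size on disk
  sentinelMtimeMs: number; // LRU clock: touched on every hit
}

// Matches any schema generation, so superseded entries stay sweepable.
const ANY_GENERATION = /^hfcache-v\d+-/;

// Returns the entry names to evict, oldest-used first, until the total
// fits the budget. Entries younger than minAgeMs are never evicted.
function planEvictions(
  entries: CacheEntryInfo[],
  maxBytes: number,
  minAgeMs: number,
  nowMs: number,
): string[] {
  const cacheShaped = entries.filter((e) => ANY_GENERATION.test(e.name));
  let total = cacheShaped.reduce((sum, e) => sum + e.bytes, 0);
  const oldestFirst = [...cacheShaped].sort(
    (a, b) => a.sentinelMtimeMs - b.sentinelMtimeMs,
  );
  const evicted: string[] = [];
  for (const entry of oldestFirst) {
    if (total <= maxBytes) break;
    if (nowMs - entry.sentinelMtimeMs < minAgeMs) continue; // min-age guard
    evicted.push(entry.name);
    total -= entry.bytes;
  }
  return evicted;
}
```

Because an old-generation entry never gets its sentinel touched again, its mtime only ages, and it sorts ahead of live entries without any special casing.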

This was referenced Jul 3, 2026

perf(engine): superset extraction for overlapping trims of one source #1885

Merged

perf(engine): write PNG frames at compression_level 1 #1898

Merged

perf(engine): one-pass VFR extraction with -fps_mode cfr #1899

Merged

This was referenced Jul 3, 2026

perf(engine): dedupe identical extractions within one render #1900

Merged

perf(engine): one-pass SDR-to-HDR extraction with cache-key transform #1902

Merged

miguel-heygen marked this pull request as ready for review July 3, 2026 18:54

vanceingalls reviewed Jul 3, 2026

james-russo-rames-d-jusso reviewed Jul 3, 2026

miguel-heygen force-pushed the 07-02-perf_engine_extraction_cache_default_on branch from 299ec9d to b75a3d9 Compare July 3, 2026 20:01

james-russo-rames-d-jusso reviewed Jul 3, 2026

vanceingalls reviewed Jul 3, 2026

miguel-heygen force-pushed the 07-02-perf_engine_dedupe_identical_extractions branch from 14fad52 to 80a18c2 Compare July 3, 2026 20:30

miguel-heygen force-pushed the 07-02-perf_engine_extraction_cache_default_on branch from b75a3d9 to 39ef203 Compare July 3, 2026 20:30

miguel-heygen changed the base branch from 07-02-perf_engine_dedupe_identical_extractions to main July 3, 2026 20:39

miguel-heygen added 3 commits July 3, 2026 13:40

miguel-heygen force-pushed the 07-02-perf_engine_extraction_cache_default_on branch from 39ef203 to 88cfc2e Compare July 3, 2026 20:41

miguel-heygen merged commit 3459064 into main Jul 3, 2026
33 of 34 checks passed

miguel-heygen deleted the 07-02-perf_engine_extraction_cache_default_on branch July 3, 2026 20:41


