perf(engine): extraction cache on by default with atomic publish and LRU gc #1901
vanceingalls
left a comment
🟠 Request changes — atomic-publish + gc mechanics are sound; cross-PR key composition is the runtime-interop hazard
The atomic-publish primitive is well-shaped: partial dir → sentinel-inside → renameSync → same-fs atomic-swap, with a three-branch retry (target-complete → adopt winner; target-stale-unsentineled → force-rm and retry; target-partial-of-live-writer → leave and mark unpublished). The LRU sweep is scoped to SCHEMA_PREFIX/.partial- children (verified against isCacheLikeChild), so the sibling animated-gif subcache under the same root stays untouched. Single call site to lookupCacheEntry/publishCacheEntry/gcExtractionCache at the extractor — no fan-out surprises inside this PR.
The runtime-interop concern for flipping the cache on by default is the cache-key composition vs the HDR preflight, which PR #1902 addresses via a transform axis on CacheKeyInput. As shipped in this PR, the same SDR source file can produce two different cached outputs (raw-SDR frames vs HDR-normalized frames) under identical cache keys depending on the composition it was rendered into. That's a correctness hazard the moment cache-on becomes the default in a mixed workload. Cache-broken vs cache-miss also isn't fully distinguished at the caller boundary — a partial-dir mkdir failure surfaces as an extraction error rather than falling back to workDir extraction.
Findings (runtime-interop lens)
1. Cache-key omits the SDR→HDR transform axis — key aliases produce non-deterministic frames across renders — 🔴
File: packages/engine/src/services/videoFrameExtractor.ts:608-623, 653-707, 774-782
cacheKeyInputs[i] is snapshotted from resolvedVideos[i].videoPath before the HDR preflight (line 608), which is the correct choice for hitting the cache on the user-visible source. But the preflight then mutates entry.videoPath = convertedPath and entry.video.mediaStart = 0 (lines 695, 699), so the extractor at line 800 reads from the converted HDR file at mediaStart=0, while the cache stores those HDR-normalized frames under the original-SDR key (originalPath, originalMtime, originalMediaStart, ...).
Consequence when this PR ships cache-on-by-default:
- Render A: mixed timeline dominated by HDR. Clip X (SDR) gets SDR→HDR converted, frames extracted from the converted file. Cache stores HDR-tagged frames under key K(X_SDR).
- Render B: SDR-only composition using clip X. Lookup hits K(X_SDR), serves HDR-tagged frames back to an SDR pipeline. Silent color-space corruption on the output.
Symmetric case: Render A extracts raw-SDR frames under K(X_SDR), Render B (mixed HDR) hits the same key and serves raw-SDR frames when it needed HDR-normalized ones.
PR #1902's transform?: string on CacheKeyInput is precisely this axis. Two options: (a) hold this PR until #1902 lands so the default-on flip goes out with the transform axis in one step; or (b) merge order-locked with #1902 immediately behind. As-is, any window where prod has #1901 but not #1902 is a cache-poisoning window. Please block the default-on flip on the transform-axis being in the key composition.
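A minimal sketch of how a transform axis could discriminate the two cases, assuming a SHA-256 hash over a JSON key blob. The field names, the blob layout, and the "sdr-to-hdr" label are hypothetical; only the optional transform on CacheKeyInput, and the idea of leaving the field out when undefined so existing keys stay valid, come from this thread.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape: the real CacheKeyInput in the repository may differ.
interface CacheKeyInput {
  videoPath: string;
  mtimeMs: number;
  size: number;
  mediaStart: number;
  duration: number;
  fps: number;
  format: string;
  transform?: string; // e.g. "sdr-to-hdr"; absent for untransformed sources
}

function cacheKey(input: CacheKeyInput): string {
  // Fixed insertion order keeps the serialized blob deterministic.
  const blob: Record<string, string | number> = {
    p: input.videoPath,
    m: input.mtimeMs,
    s: input.size,
    ms: input.mediaStart,
    d: input.duration,
    f: input.fps,
    fmt: input.format,
  };
  // Add the transform field only when set, so keys computed before the
  // axis existed hash to the same value as before.
  if (input.transform !== undefined) blob.t = input.transform;
  return createHash("sha256").update(JSON.stringify(blob)).digest("hex");
}

const clipX: CacheKeyInput = {
  videoPath: "/media/clip-x.mp4",
  mtimeMs: 1700000000000,
  size: 123456,
  mediaStart: 0,
  duration: 4,
  fps: 30,
  format: "png",
};

const sdrKey = cacheKey(clipX);
const hdrKey = cacheKey({ ...clipX, transform: "sdr-to-hdr" });
console.log(sdrKey === hdrKey); // the two renders no longer share an entry
```

With a key like this, Render A and Render B above would land in different entries instead of aliasing on K(X_SDR).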
2. LRU eviction can race a concurrent reader — minAgeMs is documented as a liveness heuristic, not a lock — 🟠
File: packages/engine/src/services/extractionCache.ts:388-406, 356-365; packages/engine/src/services/videoFrameExtractor.ts:927-933
rehydrateCacheEntry captures framePaths: Map<number, string> — file paths, no held fds. If a second process's gcExtractionCache sweeps between one render's rehydrateCacheEntry and downstream consumption of framePaths, the consumer sees ENOENT.
The ponytail: comment on line 359 accepts this: "a render longer than minAge with a full cache could lose entries mid-read - acceptable, next render re-extracts." The subtle bit is that the current render fails, not the next one — the paths already handed back point at a now-deleted directory. The docstring should reflect that failure mode. If you want to keep the current design (no refcounts, no fd-holding), consider one of: touch-again after rehydrate (moves the LRU clock ahead of any concurrent sweep), or a passive existsSync check in the consumer with re-extract fallback. Not blocking on its own — 60 min is comfortably above typical render durations — but pair it with observability so we can see the failure mode if it ever fires in prod.
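The passive existsSync option could look roughly like the sketch below. The helper names framesStillPresent and framesOrReExtract are hypothetical, and the re-extract callback stands in for whatever uncached path the extractor already has. The check narrows the window but does not close it: a sweep can still land between the check and the read.

```typescript
import { existsSync, mkdtempSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// True only if every frame path handed out by the cache still exists on disk.
function framesStillPresent(framePaths: Map<number, string>): boolean {
  return Array.from(framePaths.values()).every((path) => existsSync(path));
}

// Hypothetical consumer-side guard: fall back to re-extraction when a
// concurrent sweep has removed the entry after it was rehydrated.
function framesOrReExtract(
  cached: Map<number, string>,
  reExtract: () => Map<number, string>,
): Map<number, string> {
  return framesStillPresent(cached) ? cached : reExtract();
}

// Demo: rehydrate one frame, then simulate another process evicting the entry.
const entryDir = mkdtempSync(join(tmpdir(), "rehydrate-sketch-"));
const framePath = join(entryDir, "frame-0001.png");
writeFileSync(framePath, "fake frame");
const cachedFrames = new Map<number, string>([[1, framePath]]);

const presentBeforeSweep = framesStillPresent(cachedFrames);
rmSync(entryDir, { recursive: true, force: true }); // concurrent gc sweep

const calls = { reExtract: 0 };
const frames = framesOrReExtract(cachedFrames, () => {
  calls.reExtract += 1;
  return new Map<number, string>([[1, "/work/frame-0001.png"]]);
});
console.log(presentBeforeSweep, calls.reExtract, frames.get(1));
```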
3. Cache-broken vs cache-miss not distinguished at the render boundary — 🟠
File: packages/engine/src/services/videoFrameExtractor.ts:744-755, 797-799, 883-916
mkdirSync(configuredCacheRootDir, { recursive: true }) at entry (line 748) has a nice graceful-degrade path with a stderr warning. But mkdirSync(partialDir, { recursive: true }) at line 799 has no equivalent guard. If the cache dir becomes unwritable mid-render (disk full, quota exceeded, EACCES race after another user's uid claims files under the shared /tmp path), the throw is caught at line 914 and surfaces as a {videoId, error} — the render fails for that clip instead of falling back to workDir extraction.
Cheapest fix: wrap the partialDir mkdir + publish in a try/catch that on failure logs, sets a cacheDegraded flag, and returns null from tryCachedExtract so the outer path falls through to the non-cache extractVideoFramesRange at line 901. This matches the spirit of the entry-time "caching disabled for this render" branch.
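A rough sketch of that shape, assuming tryCachedExtract can signal degradation by returning null. The ExtractOutcome type and extractClip wrapper are illustrative only, and the actual extraction and publish steps are elided.

```typescript
import { mkdirSync, mkdtempSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

interface ExtractOutcome {
  source: "cache" | "workDir";
  cacheDegraded: boolean;
}

// Returns null when the cache cannot be used, so the caller falls through
// to uncached extraction instead of failing the clip.
function tryCachedExtract(partialDir: string): ExtractOutcome | null {
  try {
    mkdirSync(partialDir, { recursive: true });
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    console.warn(`extraction cache degraded, extracting uncached: ${reason}`);
    return null;
  }
  // ...extraction into partialDir and publish would happen here...
  return { source: "cache", cacheDegraded: false };
}

function extractClip(partialDir: string): ExtractOutcome {
  return tryCachedExtract(partialDir) ?? { source: "workDir", cacheDegraded: true };
}

// Demo: a regular file sits where a directory is expected, so mkdir fails.
const scratch = mkdtempSync(join(tmpdir(), "degrade-sketch-"));
const blocker = join(scratch, "not-a-dir");
writeFileSync(blocker, "");
const degraded = extractClip(join(blocker, "entry.partial-1-abcd1234"));
const healthy = extractClip(join(scratch, "cache", "entry.partial-1-abcd1234"));
rmSync(scratch, { recursive: true, force: true });
console.log(degraded, healthy);
```

Keeping only the cache-side filesystem calls inside the try block means a genuine ffmpeg failure would still surface as an extraction error rather than being masked by the fallback.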
4. Intra-render dedupe key vs on-disk cache key — different composition rules, verified consistent — 🟢
File: packages/engine/src/services/videoFrameExtractor.ts:858, 774-782
Base PR #1900's dedupeKey = ${videoPath}\0${mediaStart}\0${videoDuration}\0${fps}\0${format} uses the post-preflight videoPath (converted HDR path), while this PR's on-disk key uses mtimeMs/size of the pre-preflight SDR source. Different composition rules, but they're used in different scopes — dedupe within one render (same preflight decisions apply to every ref), cache across renders (needs source-identity stability). Reviewed for coherence: no decorative-gate-style miss here, the two layers are intentionally different. The transform-axis in finding #1 is what makes the cross-render axis correct too.
5. partialCacheEntryDir collision surface via truncated UUID — negligible but note — 🟢
File: packages/engine/src/services/extractionCache.ts:167-169
randomUUID().slice(0, 8) gives 32 bits of entropy per pid. With the pid included, birthday collision for a single writer would need ~65k concurrent partial dirs for the same key from the same pid — unreachable. Fine as-is.
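For the record, the ~65k figure is the square root of the 2^32 space. Under the standard birthday approximation, that many simultaneous suffixes gives a collision probability of roughly 0.39, and realistic counts are vanishingly small. A small sketch of the arithmetic (the function name is illustrative):

```typescript
// Birthday approximation: probability that at least two of n values collide
// when drawn uniformly from a space of 2^bits possibilities.
function collisionProbability(n: number, bits: number): number {
  const space = Math.pow(2, bits);
  return 1 - Math.exp(-(n * (n - 1)) / (2 * space));
}

const pAt65k = collisionProbability(65536, 32); // roughly 0.39
const pAtTen = collisionProbability(10, 32); // on the order of 1e-8
console.log(pAt65k.toFixed(3), pAtTen.toExponential(2));
```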
Convergence with Rames
Rames-bot had not posted at review time (verified via gh api /pulls/1901/reviews). No convergences to cite.
Review by Via (runtime-interop lens)
james-russo-rames-d-jusso
left a comment
Reviewed at 299ec9d185570f8ac5aadcfa5e2f0220dd9793e5.
Peer scan: none yet — first review pass; Via has HF-runtime-interop lane so I'm coming in on canonical-rubric + cross-cutting concerns.
Stack: this PR against 07-02-perf_engine_dedupe_identical_extractions (#1900); cross-stack notes for #1902 (transform key) folded in below.
Summary — Turns the content-addressed extraction cache on by default under <tmpdir>/hyperframes-extract-cache-<uid>, and lands two mechanisms that make that safe: (1) publishCacheEntry writes the .hf-complete sentinel inside a .partial-<pid>-<uuid>/ dir then atomically renameSyncs over the final key, (2) gcExtractionCache sweeps a 2 GiB LRU budget after each extraction using sentinel mtime as the LRU clock, guarded by a 60 min minAgeMs liveness heuristic. Warm re-renders drop video_extract from 400 ms → 13 ms.
The core rename-based atomicity and the loser-rehydrates path both look correct; the concerns below are about rollout safety and cross-cutting hygiene, not the mechanism.
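To make the sweep policy in the summary concrete, here is an in-memory sketch of the eviction decision only, under the assumption that each entry carries a byte size and a last-used timestamp (the sentinel mtime). planEvictions and CacheEntryInfo are hypothetical names; the real gcExtractionCache also walks the disk and clears aged partial dirs.

```typescript
interface CacheEntryInfo {
  name: string;
  bytes: number;
  lastUsedMs: number; // sentinel mtime, refreshed on every cache hit
}

// Evict least-recently-used entries until the total fits the budget,
// skipping anything used within minAgeMs (it may belong to a live render).
function planEvictions(
  entries: CacheEntryInfo[],
  maxBytes: number,
  minAgeMs: number,
  nowMs: number,
): string[] {
  let total = entries.reduce((sum, e) => sum + e.bytes, 0);
  const evicted: string[] = [];
  const oldestFirst = entries.slice().sort((a, b) => a.lastUsedMs - b.lastUsedMs);
  for (const entry of oldestFirst) {
    if (total <= maxBytes) break;
    if (nowMs - entry.lastUsedMs < minAgeMs) continue;
    evicted.push(entry.name);
    total -= entry.bytes;
  }
  return evicted;
}

const GiB = 1024 * 1024 * 1024;
const HOUR = 60 * 60 * 1000;
const now = 10 * HOUR;
const sample: CacheEntryInfo[] = [
  { name: "a", bytes: 1.5 * GiB, lastUsedMs: now - 3 * HOUR },
  { name: "b", bytes: 1 * GiB, lastUsedMs: now - 2 * HOUR },
  { name: "c", bytes: 1 * GiB, lastUsedMs: now - HOUR / 6 },
];

const underDefaultBudget = planEvictions(sample, 2 * GiB, HOUR, now);
const underZeroBudget = planEvictions(sample, 0, HOUR, now);
console.log(underDefaultBudget, underZeroBudget);
```

In this sample the 2 GiB budget is met after evicting the oldest entry, while a zero budget evicts everything except the entry younger than minAgeMs.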
Concerns
- 🟠 Zero-soak introduce-and-flip. The same PR that adds the atomic-publish primitive also flips the cache to default-on, with blast radius = every render on every runtime path (Lambda, worker, CLI). The prior invariant was documented as "single-writer; readers can see truncated frames", so every consumer that has ever run against a shared extractCacheDir was implicitly relying on an external mutex or a single-render assumption. There's no soak period, no cohort gate, no kill-switch beyond HYPERFRAMES_EXTRACT_CACHE_DIR=off. Suggest either (a) landing atomic-publish + LRU-gc in one PR with cache still opt-in, then a separate flip PR after ~a week of prod observation, or (b) adding a short-term "flip via env" mode (e.g. HYPERFRAMES_EXTRACT_CACHE_DEFAULT=on|off) with the switch flipped to on in a follow-up so rollback is a redeploy of one env, not a code revert of the primitive.
- 🟠 Observability gap for the failure paths. breakdown.cacheHits / cacheMisses are the only signals. For an on-by-default cache the ops-relevant events are all invisible: (1) publishCacheEntry returns { published: false } on sentinel-write / rename failure → the render still succeeds but future renders re-extract; (2) gcExtractionCache evictions per sweep (bytes freed, entries removed); (3) aged-partial cleanups (a proxy for crashed-writer count); (4) the "cache dir is not writable" degradation (currently one stderr line per render, no counter). Once this is default-on, the first regression signal on prod is going to be "why are warm renders slow again?" and there'll be nothing to look at. Suggest at minimum: add cachePublishFailures, cacheGcEvictions, cacheGcBytesFreed, cacheAgedPartialsCleared to phaseBreakdown, and log the unwritable-dir warning through the render log (structured) rather than raw process.stderr.write so it lands in the same place ops looks for the rest of the render's signals.
- 🟠 TOCTOU in publishCacheEntry's "target exists but unsentineled" recovery. extractionCache.ts:203-244: on ENOTEMPTY/EEXIST, if existsSync(entry.dir/COMPLETE_SENTINEL) is false the code does rmSync(entry.dir, {recursive:true, force:true}) and retries the rename. Between the existsSync check and the rmSync, another writer can complete their publish (their rename lands, their sentinel is visible). Our rmSync then destroys their fully-published entry, and our subsequent rename replaces it with our own (also complete) copy. Net effect isn't corruption (both writers extracted the same content-addressed key, both have valid sentinels), but it's wasted work and it means the "at worst two writers duplicate ffmpeg work and one rehydrates from the winner" claim in the PR body isn't quite right in this branch (the loser can delete the winner's dir and become the winner itself). Rare in practice, but if you want to defend against it: re-check existsSync(entry.dir/COMPLETE_SENTINEL) after the rmSync failure branch too, or scope the recovery to only unsentineled-and-aged directories (mtime older than a small threshold).
Nits
- 🟡 extractionCache.ts:328: // ponytail: in the source. Presumably a personal marker. Squash-strip.
- 🟡 videoFrameExtractor.ts:823: the unwritable-cache-dir stderr warning is one-shot per render but not deduped across renders in the same process. On a Lambda cold-start with a mispointed cache dir, every render in the container's lifetime prints the warning. Not urgent, and once (2) above ships this becomes a structured log with a counter attached, so probably moot.
- 🟡 gcExtractionCache never emits an error/warn line when a sweep can't reach its budget (all entries under minAgeMs). For a runaway concurrent-render burst, the cache can exceed the 2 GiB cap indefinitely and there's no signal. Low-severity because the disk pressure will surface elsewhere first, but worth folding into the metric surface from (2) above.
- 🟡 HYPERFRAMES_EXTRACT_CACHE_MAX_MB=0 degrades to "evict every entry older than minAgeMs" rather than "disable size cap". envNum returns 0 (finite), then * 1024**2 = 0, then the totalBytes <= 0 guard is only true when the cache is empty. Not a bug (arguably the user asked for aggressive eviction), but a user reading _MAX_MB might reasonably expect 0 to mean "unlimited". Doc string could clarify.
- 🟡 isTargetExistsRenameError accepts EPERM. EPERM from a rename is unusual on Linux but does happen on APFS, SMB, and some overlay filesystems. Fine for the paths this code runs on (Lambda /tmp, worker /tmp) but noting for future readers who see EPERM and wonder; a one-line comment on the union of platform semantics would help.
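To illustrate the MAX_MB=0 nit, here is a sketch of how a zero could flow through to a zero-byte budget. This envNum is a hypothetical stand-in written from the behavior described above, not the repository's helper, and resolveMaxBytes is an illustrative name.

```typescript
const DEFAULT_MAX_MB = 2048; // the 2 GiB default budget

// Stand-in parser: any finite number is accepted, including 0.
function envNum(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  return isFinite(parsed) ? parsed : fallback;
}

function resolveMaxBytes(rawMb: string | undefined): number {
  return envNum(rawMb, DEFAULT_MAX_MB) * 1024 * 1024;
}

const unsetBudget = resolveMaxBytes(undefined); // default budget in bytes
const zeroBudget = resolveMaxBytes("0"); // 0 bytes: evict everything aged
const garbageBudget = resolveMaxBytes("lots"); // non-numeric falls back
console.log(unsetBudget, zeroBudget, garbageBudget);
```

A budget of 0 bytes is always exceeded by a non-empty cache, which is why every entry older than minAgeMs becomes an eviction candidate rather than the cap being disabled.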
Cross-stack interactions
- #1902 (SDR→HDR transform key) composes cleanly. It extends CacheKeyInput with transform?: string and rides the same partialCacheEntryDir / publishCacheEntry primitive from this PR. Key blob JSON keeps the t field out when undefined so existing entries stay valid (no schema bump). Nothing further to reconcile between #1901 and #1902.
- #1900 (intra-render dedupe) is orthogonal. Its dedupeKey is in-memory-only; it doesn't touch the on-disk cache primitive, so the two dedup layers stack (per-render dedupe, per-machine cache) without interaction.
- #1885 (superset extraction) not audited here; that's a bigger diff and its cache-key/superset semantics probably deserve their own second pass. Nothing in #1901 pre-commits to a superset-friendly key layout, so #1885 will have to pick its own key shape and I'd expect that review to flag any tension there.
- animated-gif subcache safe. Confirmed via isCacheLikeChild filter (SCHEMA_PREFIX prefix or .partial- substring) and the "ignores non-cache-prefix directories" test at extractionCache.test.ts:340. compileStage.ts:127 puts the animated-gif dir inside extractCacheDir but its name is animated-gif, so the GC sweep skips it.
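A sketch of a filter with that shape, assuming entry directories are named hfcache-v<N>-<hash> (the hfcache-v2-* form appears in a later commit note in this thread). The regex and the body are assumptions, not the repository's isCacheLikeChild.

```typescript
// Assumed naming: published entries start with "hfcache-v<N>-" and
// in-flight writer dirs carry a ".partial-" marker somewhere in the name.
const CACHE_GENERATION = /^hfcache-v\d+-/;

function isCacheLikeChild(name: string): boolean {
  return CACHE_GENERATION.test(name) || name.indexOf(".partial-") !== -1;
}

const children = [
  "animated-gif",
  "hfcache-v2-0a1b2c",
  "hfcache-v3-0a1b2c",
  "hfcache-v3-0a1b2c.partial-4242-deadbeef",
];
const sweepable = children.filter(isCacheLikeChild);
console.log(sweepable);
```

Matching any generation number rather than only the current prefix lets superseded entries age out of the LRU instead of being orphaned.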
Questions
- The default extractCacheDir under <tmpdir>/hyperframes-extract-cache-<uid>: on Lambda, /tmp is 10 GiB total and is shared with the render's own scratch (downloads, compiled artifacts, per-frame captures). With a 2 GiB LRU budget, warm-container behavior is fine, but on a burst of large-source renders (e.g. 4K source, long duration) a single render can fill several GiB of /tmp outside the cache, and the LRU sweep runs after extraction (not before). Any concern about /tmp ENOSPC during extraction on a warm container, or is the render's own workDir accounting already sized to leave headroom?
- 60 min minAgeMs: was that picked to bound the max realistic single-render duration, or is there a specific worst-case observed render time this is protecting? Feels right, just want the number in the record if there's a story behind it.
What I didn't verify
- Actual rename atomicity semantics on the concrete filesystems the cache lands on in prod (Lambda /tmp = tmpfs; worker /tmp may vary). renameSync over a non-empty target is documented atomic on tmpfs/ext4; behavior on network-mounted extractCacheDir (explicitly warned against in the docstring for a different reason) I didn't trace.
- Crash-under-SIGKILL mid-extract behavior: the aged-partial cleanup path is tested but I didn't run the actual crash-and-recover scenario end-to-end.
- Interaction with the compile stage's animatedGifCacheDir under a shared-/tmp concurrent-render burst: verified GC skips it, didn't verify the animated-gif subcache's own eviction / lifetime policy.
— Rames D Jusso
299ec9d to b75a3d9
Review feedback addressed in the updated branch (commit: 'harden extraction cache publish and surface cache ops signals'):
james-russo-rames-d-jusso
left a comment
R2 verification — reviewed at b75a3d9e0f44610b03375493543a9b17aeff691a (R1 at 299ec9d1).
R1 findings verified as: 🟢-alt / 🟢 / 🟢.
Peer scan: Vance posted a runtime-interop review between my R1 and this R2 (🔴 transform-axis, 🟠 LRU eviction race, 🟠 cache-broken-vs-miss). The 🔴 is now closed within this PR by the HDR-converted bypass (cacheKeyInputs[i] = null at videoFrameExtractor.ts:720); the two 🟠s are noted below as residuals separate from my R1 lane.
Summary — b75a3d9 addresses all three of my R1 findings. Observability + TOCTOU are structurally fixed; the zero-soak flag flip stays default-on-in-same-PR but the env-var kill-switch shape (HYPERFRAMES_EXTRACT_CACHE_DIR=off) is documented as the rollback path, which meets the goal by the alternative-resolution rubric.
R1 resolutions
- F1 (zero-soak flip default-on): 🟢-alt. Rollback path documented rather than soak-gate added. Cache remains default-on (config.ts:373-376: HYPERFRAMES_EXTRACT_CACHE_DIR undefined → auto-resolves to <tmpdir>/hyperframes-extract-cache-<uid>), and the flip lives in the same PR as the primitive. Author's response ack'd this and named the kill-switch (HYPERFRAMES_EXTRACT_CACHE_DIR=off → resolveExtractCacheDir() returns undefined → cache disabled process-wide, no code revert). Env-flip rollback is a legitimate alternative to soak: a one-env redeploy is fast enough that the rollback story is complete. Not blocking; noted that a follow-up PR splitting the default-on flip after a soak window would be cleaner if the team wants belt-and-braces. Flagging @miguel-heygen for the rollout-policy call.
- F2 (observability on failure paths): 🟢. ExtractionPhaseBreakdown at videoFrameExtractor.ts:116-122 now surfaces cachePublishFailures, cacheGcEvictions, cacheGcBytesFreed, cacheAgedPartialsCleared. Wired: publish-failure counter incremented at videoFrameExtractor.ts:837 when publishCacheEntry returns published: false; gcExtractionCache returns GcStats from extractionCache.ts:370-402 (evictedEntries, evictedBytes, agedPartialsRemoved) and the extractor lifts those into the breakdown at videoFrameExtractor.ts:984-986. Unwritable-dir case still surfaces via process.stderr.write warning at videoFrameExtractor.ts:776-778; no counter, but a single-shot warning per render is the right shape for that path. All four failure classes I named in R1 now have signal.
- F3 (TOCTOU in publishCacheEntry unsentineled-target recovery): 🟢. extractionCache.ts:189-232. Two structural fixes: (a) new helper adoptPublishedWinner(entry, partialDir) at :194-198 checks for COMPLETE_SENTINEL and removes the losing partial when a winner exists; (b) both recovery points route through it: before rmSync(entry.dir) at :214 (a winner that completed between the initial rename failure and the rm is now adopted, so the winner's published dir is never destroyed), and after the failed retry rename at :230 (a winner that landed between our rm and retry-rename is now adopted instead of reported as a failure). The check-then-remove sequence I flagged is genuinely closed: a losing writer can no longer destroy a concurrent winner's freshly-published entry. Per feedback_race_fix_preempt_vs_narrow this is a proper PREEMPT (adopt-then-return), not a narrower-race band-aid.
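The publish-with-adopt flow described in F3 can be sketched as below. This is a simplified reading of the review text, not the code in extractionCache.ts: the signatures, the PublishResult shape, and the demo entry name are assumptions, and the branch that leaves a live writer's partial alone is omitted.

```typescript
import { randomUUID } from "node:crypto";
import { existsSync, mkdirSync, mkdtempSync, renameSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

const COMPLETE_SENTINEL = ".hf-complete";

interface PublishResult {
  published: boolean; // a complete entry now exists at finalDir
  adopted: boolean; // that entry was written by another writer
}

function isTargetExistsRenameError(err: unknown): boolean {
  const code = (err as { code?: string }).code;
  return code === "ENOTEMPTY" || code === "EEXIST" || code === "EPERM";
}

// If a complete entry is already published, drop our partial and use it.
function adoptPublishedWinner(finalDir: string, partialDir: string): PublishResult | null {
  if (!existsSync(join(finalDir, COMPLETE_SENTINEL))) return null;
  rmSync(partialDir, { recursive: true, force: true });
  return { published: true, adopted: true };
}

function publishCacheEntry(finalDir: string, partialDir: string): PublishResult {
  // The sentinel goes inside the partial dir so it appears with the frames.
  writeFileSync(join(partialDir, COMPLETE_SENTINEL), "");
  try {
    renameSync(partialDir, finalDir);
    return { published: true, adopted: false };
  } catch (err) {
    if (!isTargetExistsRenameError(err)) throw err;
  }
  // Target exists: adopt a completed winner before touching its directory.
  const winner = adoptPublishedWinner(finalDir, partialDir);
  if (winner) return winner;
  // No sentinel: treat the target as stale, remove it, and retry once.
  rmSync(finalDir, { recursive: true, force: true });
  try {
    renameSync(partialDir, finalDir);
    return { published: true, adopted: false };
  } catch {
    // A winner may have landed between our rm and the retry.
    return adoptPublishedWinner(finalDir, partialDir) ?? { published: false, adopted: false };
  }
}

function writePartial(finalDir: string): string {
  const partialDir = `${finalDir}.partial-${process.pid}-${randomUUID().slice(0, 8)}`;
  mkdirSync(partialDir, { recursive: true });
  writeFileSync(join(partialDir, "frame-0001.png"), "fake frame");
  return partialDir;
}

// Demo: two writers publish the same key one after the other.
const root = mkdtempSync(join(tmpdir(), "publish-sketch-"));
const finalDir = join(root, "hfcache-v3-example");
const first = publishCacheEntry(finalDir, writePartial(finalDir));
const secondPartial = writePartial(finalDir);
const second = publishCacheEntry(finalDir, secondPartial);
const loserPartialGone = !existsSync(secondPartial);
const winnerFrameKept = existsSync(join(finalDir, "frame-0001.png"));
rmSync(root, { recursive: true, force: true });
console.log(first, second, loserPartialGone, winnerFrameKept);
```

In the demo the second writer's rename fails because the target is a non-empty directory, the sentinel check finds the first writer's entry, and the second partial is discarded without touching the published directory.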
Cross-stack recheck
- #1902 composition unchanged: transform-axis on CacheKeyInput still rides on this PR's atomic publish primitive; the cacheKeyInputs[i] = null bypass at videoFrameExtractor.ts:720 for HDR-converted intermediates means #1901 alone can't poison the cache, and #1902's transform-keyed change then re-enables caching for those entries with a discriminated key. No cache-poisoning window at any point in the merged stack.
- #1900 (in-memory dedupe) still orthogonal; animated-gif subcache still safe (isCacheLikeChild scoped to SCHEMA_PREFIX + .partial- in extractionCache.ts:262-264).
Residuals
- Vance's 🟠 LRU eviction can race a concurrent reader (extractionCache.ts:389-398 + videoFrameExtractor.ts:812-819): rehydrateCacheEntry hands out file paths, no held fds; a concurrent process's gcExtractionCache between rehydrateCacheEntry and downstream consumption would produce ENOENT. Author accepted this as a documented liveness ceiling (60min minAge > typical render). Not in my R1 lane, agree with Vance that it's non-blocking given the observability counters now surface if it ever fires in prod.
- Vance's 🟠 cache-broken vs cache-miss not distinguished at render boundary (videoFrameExtractor.ts:824): mkdirSync(partialDir, { recursive: true }) still uncaught; a mid-render disk-full / EACCES surfaces as an extraction error rather than falling back to uncached workDir extraction. Cheap fix would be wrapping the partialDir mkdir + publish so tryCachedExtract returns null on cache-degrade and the outer path falls through to extractVideoFramesRange at :945-953. Deferring to Vance's lane; not blocking my sign-off but worth doing.
- F1 rollout note (repeat from above): rollback path is env-flip, so a bad flip requires a config redeploy vs a code revert. Cheap enough that I'd land as-is, but @miguel-heygen your call on whether to split the flip into a separate follow-up.
Batch state
- CI: Preflight, preview-regression, player-perf green; 8 regression-shards still in_progress; Graphite / mergeability_check in_progress. Downstack (#1900) needs to land first per Graphite's warning banner. No red required checks.
- Approvals: none yet on this PR. Vance's is 🔴 request-changes for transform-axis coupling; that's now closed within #1901 by the HDR bypass, so I read his blocker as resolved but he hasn't re-verified.
- Zero-soak call still open for @miguel-heygen (rollout policy, not blocking).
— Rames D Jusso
vanceingalls
left a comment
🟢 R2 LGTM — HDR-converted bypass closes my 🔴 within this PR, TOCTOU tightened structurally, observability wired end-to-end
R2 verification — reviewed at b75a3d9e0f44610b03375493543a9b17aeff691a (R1 at 299ec9d1).
Peer scan: Rames-bot posted R2 verifying all three of his R1 findings as 🟢 (F1 zero-soak as 🟢-alt with env-flip rollback). Convergences cited per finding below.
Verification method: gh api /pulls/1901/files (7 files touched) + per-file gh api /contents?ref=<sha> diff, per gh-compare-rebased-branches-caveat.
R1 was a 🟠 request-changes from me on the cross-PR transform-axis order-lock. R2 closes that finding within this PR via a targeted cacheKeyInputs[i] = null bypass at videoFrameExtractor.ts:715-720 for HDR-converted intermediates — the poisoning window that motivated the order-lock no longer exists, so #1901 is safe to land ahead of #1902 (which then re-enables caching for those entries under a discriminated key). Rames' R1 findings (zero-soak, observability, TOCTOU) all show structural fixes; the two residuals from my R1 (LRU eviction race, cache-broken vs cache-miss at mkdir) are accepted as documented liveness ceilings, not blocking.
Finding-by-finding disposition
1. Via R1 F1 — cache-key omits SDR→HDR transform axis; poisoning window if #1901 lands without #1902 — ✅ resolved (closed within this PR)
File: packages/engine/src/services/videoFrameExtractor.ts:713-720
The R2 fix intercepts converted intermediates before they can be published under the original-SDR key:
entry.videoPath = convertedPath;
// The converted intermediate carries BT.2020-mapped pixels but the
// cache key snapshot above still describes the ORIGINAL source.
// Publishing converted frames under that key would poison later
// plain-SDR renders of the same trim, so bypass the cache for
// converted entries. (The follow-up transform-keyed cache change
// re-enables caching for these with a discriminated key.)
cacheKeyInputs[i] = null;
This closes the poisoning window I flagged: HDR-converted extractions no longer produce a cached entry at all under this PR, so the SDR-only follow-up render can never hit a poisoned key. #1902 then re-enables the cache for those entries via the transform-axis discriminator. Cross-PR order-lock is no longer a merge gate; #1901 is safe standalone.
Convergence with Rames R2 on the same file+line ("HDR-converted bypass ... means #1901 alone can't poison the cache").
2. Via R1 F2 — LRU eviction can race a concurrent reader — ⚠️ mitigated-accepting (documented liveness ceiling)
File: packages/engine/src/services/extractionCache.ts:389-402; videoFrameExtractor.ts:812-819
R2 did not add reader-side existence guards or refcounts. But it did add the cacheGcEvictions / cacheGcBytesFreed / cacheAgedPartialsCleared observability signals (see finding 5 below), so if the race ever fires in prod we'll see eviction pressure rising against render duration. The 60-min minAgeMs remains the liveness ceiling. Accepting the documented tradeoff per r2-verdict-mitigation-vs-full-resolution: the counters + doc comment on ExtractionPhaseBreakdown at :117 turn this from an invisible race into an observable one.
Convergence with Rames R2 residuals section.
3. Via R1 F3 — cache-broken vs cache-miss at partialDir mkdir — ❌ still open, not blocking
File: packages/engine/src/services/videoFrameExtractor.ts:824 (unchanged from R1)
mkdirSync(partialDir, { recursive: true }) still throws through to the extraction-error branch on EACCES/ENOSPC mid-render. The cheapest fix (wrap + return null from tryCachedExtract to fall through to non-cache path at :945-953) was not applied. Rames' R2 flagged the same thing and deferred to my lane — I'm ❌-open with a "not blocking, follow-up PR" disposition, since the entry-time cache-init check does have a graceful fallback and this only affects mid-render disk-state changes. Ship-blocking iff Miguel wants belt-and-braces before default-on; otherwise fold into next stack.
4. Rames R1 F1 — zero-soak introduce-and-flip — ⚠️ mitigated-accepting (env-flip kill-switch documented)
File: packages/engine/src/config.ts:373-376
Cache remains default-on in the same PR that introduces the atomic-publish primitive. Rollback path is HYPERFRAMES_EXTRACT_CACHE_DIR=off → resolveExtractCacheDir() returns undefined → cache disabled process-wide, no code revert. Rames dispositioned this as 🟢-alt in R2; I concur — env-flip is a legitimate alternative to a soak-gate when the flip is fast enough. Deferring to Miguel on whether a follow-up split-the-flip PR is worth the belt-and-braces.
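A sketch of the resolution rules behind that kill-switch, written from the PR description (default under the temp dir per uid; off, none, false, or 0 disables). The real resolveExtractCacheDir() takes no arguments per this thread; the sketch takes the raw value as a parameter for testability, and its handling of empty strings and letter case is an assumption.

```typescript
import { tmpdir } from "node:os";
import { join } from "node:path";

const OFF_VALUES = ["off", "none", "false", "0"];

// raw is the value of HYPERFRAMES_EXTRACT_CACHE_DIR, or undefined when unset.
function resolveExtractCacheDir(raw: string | undefined): string | undefined {
  const value = raw === undefined ? "" : raw.trim();
  if (value === "") {
    // Per-uid default so users sharing /tmp do not collide.
    const uid = typeof process.getuid === "function" ? String(process.getuid()) : "nouid";
    return join(tmpdir(), `hyperframes-extract-cache-${uid}`);
  }
  if (OFF_VALUES.indexOf(value.toLowerCase()) !== -1) return undefined; // cache disabled
  return value;
}

const defaultDir = resolveExtractCacheDir(undefined);
const disabled = resolveExtractCacheDir("off");
const custom = resolveExtractCacheDir("/var/cache/frames");
console.log(defaultDir, disabled, custom);
```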
5. Rames R1 F2 — observability gap on failure paths — ✅ resolved
File: packages/engine/src/services/videoFrameExtractor.ts:116-122, 557-561, 833-836, 977-986; extractionCache.ts:322-402
Four new fields on ExtractionPhaseBreakdown (cachePublishFailures, cacheGcEvictions, cacheGcBytesFreed, cacheAgedPartialsCleared) with clear doc-comments. Publish-failure counter wired in tryCachedExtract at :833-836. gcExtractionCache returns GcStats (new export) from extractionCache.ts:322-402 — evictedEntries, evictedBytes, agedPartialsRemoved — and the extractor lifts those into the breakdown at :977-986. All four failure classes Rames named in R1 now have signal. The unwritable-dir case at :776-778 stays as a single-shot process.stderr.write warning — right shape for a per-render one-time error.
Convergence with Rames R2 F2 verification (same file:line citations).
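For readers mapping the names, the lift from sweep stats into the breakdown could look like the following. The field names come from this thread; the interface layout and the recordGcStats helper are assumptions.

```typescript
// Field names as cited in the review; the surrounding shape is assumed.
interface GcStats {
  evictedEntries: number;
  evictedBytes: number;
  agedPartialsRemoved: number;
}

interface CacheOpsBreakdown {
  cachePublishFailures: number;
  cacheGcEvictions: number;
  cacheGcBytesFreed: number;
  cacheAgedPartialsCleared: number;
}

// Hypothetical helper: fold one sweep's stats into the per-render counters.
function recordGcStats(breakdown: CacheOpsBreakdown, stats: GcStats): void {
  breakdown.cacheGcEvictions += stats.evictedEntries;
  breakdown.cacheGcBytesFreed += stats.evictedBytes;
  breakdown.cacheAgedPartialsCleared += stats.agedPartialsRemoved;
}

const breakdown: CacheOpsBreakdown = {
  cachePublishFailures: 0,
  cacheGcEvictions: 0,
  cacheGcBytesFreed: 0,
  cacheAgedPartialsCleared: 0,
};
recordGcStats(breakdown, { evictedEntries: 2, evictedBytes: 4096, agedPartialsRemoved: 1 });
console.log(breakdown);
```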
6. Rames R1 F3 — TOCTOU in publishCacheEntry unsentineled-target recovery — ✅ resolved (structural preempt)
File: packages/engine/src/services/extractionCache.ts:189-232
Two structural fixes:
- New adoptPublishedWinner(entry, partialDir) helper at :194-198: checks for COMPLETE_SENTINEL, removes the losing partial, returns the winner's result.
- Both recovery points route through it: before rmSync(entry.dir) at :214 (a winner that completed between initial rename-fail and the rm is now adopted, so the winner's published dir is never destroyed), AND in the post-rm retry catch at :226-231 (a winner that landed between our rm and retry-rename is now adopted rather than reported as a failure).
I'll add one nuance not in Rames' verification: there IS still a tighter race window between the adoptPublishedWinner check at :214 and the rmSync(entry.dir) at :217 — if a winner publishes in that microwindow, we still nuke it. But the post-rm retry catch at :226-231 catches the case where a winner publishes into the rm'd slot after our rm succeeded, so the observable damage would be a lost partial rather than a poisoned final. Rames dispositioned as clean preempt; I'd call it a substantially-tightened race with a defense-in-depth catch-block adopt — either framing lands the same disposition. ✅ resolved for the disposition rubric; the residual microwindow is not worth a separate finding.
Convergence with Rames R2 F3 (same helper naming, same :194-198 citation).
Cross-PR interop (#1901 ↔ #1902 order-lock note)
Resolved by the HDR-converted bypass at videoFrameExtractor.ts:715-720. My R1 request-changes was gated on the transform-axis being either in-key or documented as order-locked with #1902. R2 chose a third option: don't cache converted intermediates at all in this PR, then #1902 adds the transform-axis and re-enables caching for those entries under a discriminated key. Result: #1901 can land ahead of #1902 without a cache-poisoning window. My blocker is closed within this PR; #1902 is now a pure feature-add rather than a merge-order dependency.
CI status at R2
- Head SHA: b75a3d9e0f44610b03375493543a9b17aeff691a.
- CI: pending (no red status contexts). Rames' R2 notes Preflight, preview-regression, player-perf green + 8 regression-shards in progress; the aggregate status endpoint hadn't converged when I ran mine.
- Approvals: none yet. My R1 was 🔴 request-changes; this R2 closes it.
R2 by Via
14fad52 to 80a18c2
b75a3d9 to 39ef203
…LRU gc
Warm re-renders now skip source-video frame extraction entirely (video_extract 400ms -> 13ms on a 4-video composition; outputs are pixel-identical, PSNR inf).
What made default-on safe:
- Atomic entry publish: frames extract into a unique .partial-<pid>-<uuid> dir, the completion sentinel is written there, and the dir is renamed into the final key atomically. Concurrent renders sharing a cache can duplicate work but can never serve a torn entry (previously documented as single-writer only).
- Size-capped LRU gc: best-effort sweep after extraction evicts oldest-used entries past a 2 GiB default budget (HYPERFRAMES_EXTRACT_CACHE_MAX_MB) and clears crashed writers' partials. Entries younger than 60 min are never evicted so live renders keep their frames.
- Default cache dir: <tmpdir>/hyperframes-extract-cache-<uid>. Opt out with HYPERFRAMES_EXTRACT_CACHE_DIR=off (or none/false/0); a non-writable dir degrades to uncached with a single warning instead of failing the render.
…gnals
Review hardening for the default-on extraction cache:
- Bypass the cache for HDR-converted intermediates: the key snapshot describes the original source, so publishing converted frames under it would poison later plain-SDR renders of the same trim. (The follow-up transform-keyed change re-enables caching for these.)
- publishCacheEntry TOCTOU: adopt a concurrent writer's completed entry both before removing an apparently-stale dir and after a failed retry rename, so a winner's publish is never destroyed or reported as a failure.
- Observability for the failure paths: cachePublishFailures, cacheGcEvictions, cacheGcBytesFreed, and cacheAgedPartialsCleared on ExtractionPhaseBreakdown; gcExtractionCache now returns sweep stats.
After a SCHEMA_PREFIX bump, old-generation entries (hfcache-v2-*) no longer matched the sweep's prefix filter and would orphan their disk forever. The gc now matches any hfcache-v* generation; superseded entries never receive sentinel touches, so the LRU evicts them first.
39ef203 to 88cfc2e

Stack 4/6: #1898 → #1899 → #1900 → this → #1902 → #1885
What
The content-addressed extraction cache turns ON by default, made safe by two mechanisms it previously lacked:
- Atomic entry publish: frames extract into a unique <entry>.partial-<pid>-<uuid> dir, the .hf-complete sentinel is written there, and the dir is rename()d into the final key atomically. Concurrent renders sharing a cache can duplicate ffmpeg work but can never observe a torn entry (the cache was previously documented single-writer: a reader could see a truncated frame mid-write).
- Size-capped LRU gc: a best-effort sweep after extraction evicts oldest-used entries past a 2 GiB default budget (HYPERFRAMES_EXTRACT_CACHE_MAX_MB) and clears crashed writers' partial dirs. Entries younger than 60 min are never evicted, so a live render keeps its frames; sentinel mtime (touched on every hit) is the LRU clock. The sweep only looks at cache-shaped children, so the sibling animated-gif subcache is untouched.
Defaults: cache dir <tmpdir>/hyperframes-extract-cache-<uid> (writable everywhere we render, including Lambda's /tmp; per-uid to avoid shared-/tmp collisions). Opt out with HYPERFRAMES_EXTRACT_CACHE_DIR=off|none|false|0. A non-writable cache dir degrades to uncached with a single stderr warning instead of failing the render.
Why
Extraction is the most expensive render phase after capture, and the dominant dev loop is render → tweak → re-render of the same sources. With the cache on, warm re-renders skip extraction entirely.
Measured / verified
video_extract 400 ms → 13 ms; cold and warm outputs compare at PSNR inf (pixel-identical).
Notes
Existing cache keys stay valid (no schema bump).
extractCacheMaxBytes threads through the producer's extract stage alongside extractCacheDir.