ci: make build cache restore resilient to eviction#16275
Merged
Conversation
- Add workflow-level permissions (actions: read) for gh cache list
- Use exact key match in jq filter instead of prefix match
- Handle gh cache list failures gracefully (fail fast and fall back to a full build)
- Restore pnpm store cache when restore-build is true (warm fallback)
39e17fb to
e193a58
Compare
Contributor
📦 esbuild Bundle Analysis for payload

This analysis was generated by esbuild-bundle-analyzer. 🤖

Largest paths

These visualizations show the top 20 largest paths in the bundle.

- Meta file: packages/next/meta_index.json, Out file: esbuild/index.js
- Meta file: packages/payload/meta_index.json, Out file: esbuild/index.js
- Meta file: packages/payload/meta_shared.json, Out file: esbuild/exports/shared.js
- Meta file: packages/richtext-lexical/meta_client.json, Out file: esbuild/exports/client_optimized/index.js
- Meta file: packages/ui/meta_client.json, Out file: esbuild/exports/client_optimized/index.js
- Meta file: packages/ui/meta_shared.json, Out file: esbuild/exports/shared_optimized/index.js

Details

Next to each size is how much it has increased or decreased compared with the base branch of this PR.
Repo default_workflow_permissions is "read", which already grants actions:read needed by gh cache list. The explicit block was unnecessarily restricting other implicit read permissions.
AlessioGr
approved these changes
Apr 14, 2026
Contributor
🚀 This is included in version v3.83.0
milamer
pushed a commit
to milamer/payload
that referenced
this pull request
Apr 20, 2026
# Overview

Downstream CI jobs fail when re-running individual jobs after the build cache has been evicted from GitHub's 10 GB repo-wide LRU cache. For example, [this run](https://github.com/payloadcms/payload/actions/runs/24259697364/job/71108809903?pr=15268#step:5:22) had 4 jobs fail at "Restore build" because the cache was evicted ~43 hours after the original build.

## Time Savings

This replaces the fixed 120s `sleep` propagation delay and hard-failing `fail-on-cache-miss: true` with a polling + fallback approach that self-heals on cache miss. **This change will save 2 mins per run.**

## Key Changes

- **New `restore-build` input on the setup action** - When `true`, the action polls for the build cache using `gh cache list` (10s intervals, 120s timeout), restores it if found, or falls back to `pnpm install && pnpm run build:all` if not. This replaces `pnpm-run-install: false`, `pnpm-restore-cache: false`, `cache-propagation-delay: 120`, and the separate `Restore build` step that every downstream job previously had.
- **Simplified downstream jobs** - All 8 downstream jobs (`tests-unit`, `tests-types`, `tests-int`, `e2e-prep`, `tests-e2e`, `build-and-test-templates`, `tests-type-generation`, `analyze`) go from ~10 lines of setup + cache restore boilerplate to `restore-build: true`.
- **Removed `cache-propagation-delay` input** - The fixed 120s sleep is no longer needed. Polling finds the cache as soon as it's available (typically instantly), and fallback handles the miss case.

## Design Decisions

**Polling over fixed sleep:** The old 120s delay was wasted time on the happy path (propagation is effectively instant) and didn't help when the cache was evicted entirely. Polling with `gh cache list` finds the cache as soon as it propagates, and the timeout triggers a fallback build instead of a hard failure.

**`gh cache list` over `actions/cache/restore` with `lookup-only`:** `lookup-only` is a step-level action that can't be called in a bash loop. `gh cache list` is a CLI command available on all runners, works in a loop, and requires no extra extensions. Uses exact key match via jq filter to avoid prefix-match false positives.

**Fallback builds with warm pnpm store:** The pnpm store cache is restored regardless of `restore-build` mode, so if the fallback build triggers, `pnpm install` has a warm store rather than starting cold.

**Error handling:** If `gh cache list` fails (rate limit, auth, network), the loop breaks immediately and falls back to a full build rather than silently polling for 120s.

## Overall Flow

```mermaid
sequenceDiagram
    participant B as Build Job
    participant C as GitHub Cache
    participant D as Downstream Job
    participant S as Setup Action

    B->>C: Save build cache (key: SHA)
    D->>S: restore-build: true
    S->>S: Restore pnpm store cache

    loop Every 10s (up to 120s)
        S->>C: gh cache list (exact key match)
        C-->>S: Found / Not found
    end

    alt Cache found
        S->>C: actions/cache/restore
        C-->>S: Build artifacts
    else Cache not found (evicted or timeout)
        S->>S: pnpm install && pnpm run build:all
    end
```

## References / Links

- [actions/cache#1710](actions/cache#1710) - original cache propagation delay issue
- [gh cache list docs](https://cli.github.com/manual/gh_cache_list)
- [actions/cache/restore README](https://github.com/actions/cache/blob/main/restore/README.md)
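
For readers who want a concrete picture of the polling and fallback flow described in this commit message, here is a minimal shell sketch. It is not the code from the PR: the function names (`list_cache_keys`, `poll_for_cache`, `restore_or_build`) and the `POLL_TIMEOUT` / `POLL_INTERVAL` variables are hypothetical, and the exact key match is done with `grep -Fx` on the listed keys rather than with the jq filter the PR mentions. The sketch only prints which path would be taken; the actual action would then run `actions/cache/restore` or `pnpm install && pnpm run build:all`.

```shell
# Illustrative sketch only; all function and variable names are hypothetical.

# Print the keys of caches whose key starts with $1, one per line.
# `gh cache list --key` filters by prefix, so the caller still has to
# check for an exact match. gh needs GH_TOKEN set when run in a workflow.
list_cache_keys() {
  gh cache list --key "$1" --json key --jq '.[].key'
}

# poll_for_cache KEY TIMEOUT_SECONDS INTERVAL_SECONDS
# Prints "found", "timeout", or "error". The body runs in a subshell
# so its variables do not leak into the caller.
poll_for_cache() (
  key=$1 timeout=$2 interval=$3 elapsed=0
  while :; do
    # A failing listing (rate limit, auth, network) ends polling at once.
    if ! keys=$(list_cache_keys "$key"); then
      echo "error"
      break
    fi
    # -F: fixed string, -x: whole line, so "build-abc-extra" does not
    # count as a hit for "build-abc".
    if printf '%s\n' "$keys" | grep -Fxq -- "$key"; then
      echo "found"
      break
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timeout"
      break
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
)

# restore_or_build KEY
# Prints "restore" when the cache exists, otherwise "build"
# (the fallback covers both the timeout and the error case).
restore_or_build() {
  case "$(poll_for_cache "$1" "${POLL_TIMEOUT:-120}" "${POLL_INTERVAL:-10}")" in
    found) echo "restore" ;;
    *)     echo "build" ;;
  esac
}
```

Keeping the listing behind its own function makes the loop easy to exercise without network access, and treating a listing error the same as a miss mirrors the "break immediately and fall back" behaviour described under Error handling.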