Add codeload-retry resilience to governance-reusable and friends
Context
During the otpiser#11 estate-wide blocker sweep on 2026-05-26, a single
transient codeload outage caused 44 PRs across the estate to fail
their governance/* checks simultaneously. The failure mode is:
##[error]An action could not be found at the URI
'https://codeload.github.com/trufflesecurity/trufflehog/tar.gz/6c05c4a00b91aa542267d8e32a8254774799d68d'
(B440:3ED9CE:1CD787:252B63:6A1594AA)
##[error]Failed to download archive ... after 1 attempts.
The SHA was valid, the action.yml was present at that SHA, and a direct
curl https://codeload.github.com/.../tar.gz/<sha> returned 200 ten
minutes after the failure. Per the log above, the runner gave up after a
single download attempt, so one codeload glitch fails ALL of the
matrix jobs (Code quality + docs, Workflow security linter,
Well-Known (RFC 9116 + RSR), Validate Hypatia baseline,
Security policy checks, Guix primary / Nix fallback policy,
Language / package anti-pattern policy) for every consuming repo.
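The manual curl check described above can be scripted so a red governance run is quickly classified as a codeload flake or a real problem. This is a minimal sketch, not an existing tool: the function name codeload_status is hypothetical, and the URL shape is taken from the error message quoted earlier.

```shell
#!/usr/bin/env bash
# codeload_status (hypothetical helper): print the HTTP status code that
# codeload returns for an action tarball, discarding the body.
#   $1 = "owner/repo", $2 = full commit SHA
codeload_status() {
  local repo="$1" sha="$2"
  # -o /dev/null drops the tarball; -w prints only the status code.
  curl -sS -o /dev/null -w '%{http_code}' --max-time 30 \
    "https://codeload.github.com/${repo}/tar.gz/${sha}"
}
```

For example, `codeload_status trufflesecurity/trufflehog 6c05c4a00b91aa542267d8e32a8254774799d68d` printing 200 shortly after a failed run suggests the failure was transient.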
Today's mitigation was bulk gh run rerun --failed across 79 PRs; that
restored 61 of them. But this should not require manual intervention.
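For reference, the bulk rerun can be approximated per repo with the gh CLI. The sketch below is an assumption about how such a sweep might look, not the exact commands used today: rerun_failed_for_repo is a hypothetical name, it assumes an authenticated gh, and it re-runs every recent failed run in the repo rather than filtering for codeload errors.

```shell
#!/usr/bin/env bash
# rerun_failed_for_repo (hypothetical helper): re-run only the failed jobs
# of the most recent failed workflow runs in one repository.
#   $1 = "owner/repo", $2 = how many recent failed runs to consider (default 20)
rerun_failed_for_repo() {
  local repo="$1" limit="${2:-20}"
  local run_id
  gh run list --repo "$repo" --status failure --limit "$limit" \
      --json databaseId --jq '.[].databaseId' |
  while read -r run_id; do
    [ -n "$run_id" ] || continue
    echo "rerunning failed jobs of run ${run_id} in ${repo}"
    # </dev/null keeps gh from consuming the run-id list on stdin.
    gh run rerun "$run_id" --repo "$repo" --failed </dev/null \
      || echo "could not rerun ${run_id}" >&2
  done
}
```

Looping this over the list of consuming repos gives the estate-wide sweep; the list itself is not encoded here and would need to come from wherever the estate inventory lives.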
Actions known to be affected today
| Action SHA | Repo | Used in |
|---|---|---|
| 6c05c4a00b91aa542267d8e32a8254774799d68d | trufflesecurity/trufflehog | governance-reusable.yml:523 |
| fc68ffb90438ef2936bbb3251622353b3dcb2f93 | erlef/setup-beam | hypatia-scan.yml (called transitively) |
Recommendations
1. Add explicit continue-on-error: true + retry shim for non-blocking
   secret-scanner steps (trufflehog is informational; a one-off skip is
   safer than a one-off red CI).
2. Cache the action tarballs via actions/cache keyed on
   ${{ runner.os }}-action-${SHA}. Once cached, subsequent runs use
   the cached copy and bypass codeload entirely.
3. Prefer self-hosted alternatives for the heaviest deps: trufflehog
   can be installed via apt/curl directly rather than as an action.
4. Document the flake-clearing recipe in standards/CONTRIBUTING.md
   so future maintainers know to run gh run rerun --failed on a
   governance cluster red, not to chase per-PR fixes.
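As a starting point for the retry shim in the first recommendation, and for any direct curl download in the third, a small bash wrapper with exponential backoff could look like the following. This is a sketch under assumptions: the function name retry, its argument order, and the backoff policy are illustrative and not something the governance workflows define today.

```shell
#!/usr/bin/env bash
# retry (hypothetical shim): run a command up to MAX times, doubling the
# sleep between attempts.
#   $1 = max attempts, $2 = initial delay in seconds, rest = command to run
retry() {
  local max="$1" delay="$2"
  shift 2
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max" ]; then
      echo "retry: giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "retry: attempt ${attempt} failed; sleeping ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}
```

A run step could then call something like `retry 4 5 curl -fsSL -o trufflehog.tar.gz https://codeload.github.com/trufflesecurity/trufflehog/tar.gz/6c05c4a00b91aa542267d8e32a8254774799d68d`, with continue-on-error: true on the step so that exhausting the retries skips the informational scan instead of failing the job.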
Today's impact
- 79 PRs failed across 79 repos
- 44-PR cluster all on the same codeload glitch
- 61 PRs auto-merged after rerun; 18 still open with non-codeload issues
Next steps
If accepted: file follow-up PR for recommendation 1 (continue-on-error),
recommendation 2 (action caching), and recommendation 4 (CONTRIBUTING
note). Recommendation 3 (trufflehog→apt) is larger and can go in a
separate issue if pursued.