governance-reusable: add codeload-retry resilience (44-PR estate flake 2026-05-26) #208

@hyperpolymath

Add codeload-retry resilience to governance-reusable and friends

Context

During the otpiser#11 estate-wide blocker sweep on 2026-05-26, a single
transient codeload outage caused 44 PRs across the estate to fail
their governance/* checks simultaneously. The failure mode is:

##[error]An action could not be found at the URI
  'https://codeload.github.com/trufflesecurity/trufflehog/tar.gz/6c05c4a00b91aa542267d8e32a8254774799d68d'
  (B440:3ED9CE:1CD787:252B63:6A1594AA)
##[error]Failed to download archive ... after 1 attempts.

The SHA was valid, action.yml was present at that SHA, and a direct
curl https://codeload.github.com/.../tar.gz/<sha> returned 200 ten
minutes after the failure. Per the log above, the runner gave up after
a single download attempt, so one codeload glitch propagates to ALL of
the matrix jobs (Code quality + docs, Workflow security linter,
Well-Known (RFC 9116 + RSR), Validate Hypatia baseline,
Security policy checks, Guix primary / Nix fallback policy,
Language / package anti-pattern policy) for every consuming repo.
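
The manual curl check described above can be scripted so that a transient outage is easy to tell apart from a bad pin. This is a minimal sketch, assuming curl is available; codeload_status is a hypothetical helper name, not something that exists in the estate today.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the HTTP status code that codeload returns
# for a pinned action SHA, without saving the tarball.
codeload_status() {
  repo="$1"   # e.g. trufflesecurity/trufflehog
  sha="$2"    # full commit SHA the workflow pins
  curl -s -L -o /dev/null -w '%{http_code}' \
    "https://codeload.github.com/${repo}/tar.gz/${sha}"
}

# Example (needs network access):
#   codeload_status trufflesecurity/trufflehog 6c05c4a00b91aa542267d8e32a8254774799d68d
```

A 200 shortly after a red run points at a transient codeload failure, as seen here; a persistent non-200 would instead suggest a problem with the pinned reference itself.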

Today's mitigation was bulk gh run rerun --failed across 79 PRs; that
restored 61 of them. But this should not require manual intervention.
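
For reference, the bulk rerun can be approximated with a short loop over the gh CLI. This is a sketch rather than the exact commands run today: rerun_failed_for_repo and repos.txt are hypothetical names, and it assumes an authenticated gh with access to each repo.

```shell
#!/usr/bin/env bash
# Sketch: rerun only the failed jobs of failed workflow runs on the
# head branches of open PRs in one repo.
rerun_failed_for_repo() {
  repo="$1"   # owner/name
  # Head branch of every open PR in the repo.
  gh pr list --repo "$repo" --state open \
    --json headRefName --jq '.[].headRefName' |
  while IFS= read -r branch; do
    # IDs of failed runs on that branch.
    gh run list --repo "$repo" --branch "$branch" --status failure \
      --json databaseId --jq '.[].databaseId' |
    while IFS= read -r run_id; do
      # --failed reruns just the failed jobs, not the whole run.
      gh run rerun "$run_id" --failed --repo "$repo" </dev/null ||
        echo "could not rerun $repo run $run_id" >&2
    done
  done
}

# Example, with repos.txt (hypothetical) listing one owner/name per line:
#   while IFS= read -r r; do rerun_failed_for_repo "$r"; done < repos.txt
```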

Actions known to be affected today

Action SHA                                Repo                        Used in
6c05c4a00b91aa542267d8e32a8254774799d68d  trufflesecurity/trufflehog  governance-reusable.yml:523
fc68ffb90438ef2936bbb3251622353b3dcb2f93  erlef/setup-beam            hypatia-scan.yml (called transitively)

Recommendations

  1. Add explicit continue-on-error: true + retry shim for non-blocking
    secret-scanner steps (trufflehog is informational; a one-off skip is
    safer than a one-off red CI).

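  2. Cache the action tarballs via actions/cache keyed on
    ${{ runner.os }}-action-${SHA}. Once cached, subsequent runs use
    the cached copy and bypass codeload entirely. Caveat: the runner
    downloads every uses: reference during job setup, before any step
    (including actions/cache) runs, so this only helps where the tool
    is fetched and invoked from a run: step instead of via uses:.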

  3. Prefer self-hosted alternatives for the heaviest deps: trufflehog
    can be installed via apt/curl directly rather than as an action.

  4. Document the flake-clearing recipe in standards/CONTRIBUTING.md
    so future maintainers know to run gh run rerun --failed on a
    governance cluster red, not to chase per-PR fixes.
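
Recommendations 1 to 3 share one building block: fetching a pinned tarball from a script step, with retries and a local cache directory that a cache step could persist. The following is a minimal sketch, assuming curl on the runner; retry, fetch_tarball_cached, and the default cache path are hypothetical names, not existing parts of governance-reusable.yml.

```shell
#!/usr/bin/env bash
# retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached, doubling
# the delay between attempts.
retry() {
  max_attempts="$1"; delay="$2"; shift 2
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# fetch_tarball_cached OWNER/NAME SHA [CACHE_DIR]
# Prints the path of the tarball, downloading it only on a cache miss.
fetch_tarball_cached() {
  repo="$1"; sha="$2"
  cache_dir="${3:-$HOME/.cache/action-tarballs}"   # assumed location
  dest="$cache_dir/$(printf '%s' "$repo" | tr '/' '-')-$sha.tar.gz"
  mkdir -p "$cache_dir"
  if [ ! -s "$dest" ]; then
    # Download to a temp name so a partial file never counts as a hit.
    if ! retry 5 2 curl -fsSL -o "$dest.tmp" \
        "https://codeload.github.com/$repo/tar.gz/$sha"; then
      rm -f "$dest.tmp"
      return 1
    fi
    mv "$dest.tmp" "$dest"
  fi
  printf '%s\n' "$dest"
}
```

For an informational scanner, a failed fetch could then be downgraded to a warning (for example `fetch_tarball_cached ... || echo "skipping scan" >&2`) rather than failing the job, which matches the intent of recommendation 1.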

Today's impact

  • 79 PRs failed across 79 repos
  • 44-PR cluster all on the same codeload glitch
  • 61 PRs auto-merged after rerun; 18 still open with non-codeload issues

Next steps

If accepted: file a follow-up PR covering recommendation 1
(continue-on-error), recommendation 2 (action caching), and
recommendation 4 (CONTRIBUTING note). Recommendation 3
(trufflehog→apt) is larger and can go in a separate issue if pursued.

Labels: enhancement, github_actions
