[github-actions] Retry transient GitHub infrastructure failures in workflows #175

@coisa

Summary

Some repository workflows still fail on transient GitHub-side infrastructure errors such as git fetch HTTP 500 responses during checkout. In those cases the workflow logic is fine, but the run still requires a manual rerun to go green.

Current Behavior

Intermittent errors like the following can fail a job even though an immediate rerun of the same workflow succeeds:

/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules origin +refs/heads/*:refs/remotes/origin/* +refs/tags/*:refs/tags/*
Error: error: RPC failed; HTTP 500 curl 22 The requested URL returned error: 500
Error: fatal: expected flush after ref listing
The process '/usr/bin/git' failed with exit code 128
Waiting 10 seconds before trying again
/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules origin +refs/heads/*:refs/remotes/origin/* +refs/tags/*:refs/tags/*
Error: error: RPC failed; HTTP 500 curl 22 The requested URL returned error: 500
Error: fatal: expected 'packfile'
The process '/usr/bin/git' failed with exit code 128
Waiting 14 seconds before trying again
/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules origin +refs/heads/*:refs/remotes/origin/* +refs/tags/*:refs/tags/*
remote: Internal Server Error
Error: fatal: unable to access 'https://github.com/php-fast-forward/dev-tools/': The requested URL returned error: 500
Error: The process '/usr/bin/git' failed with exit code 128
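To make "failure signature" concrete, here is a minimal sketch of a classifier built only from the messages in the log excerpt above. The pattern list, the function names, and the widening from HTTP 500 to any 5xx status are assumptions for illustration, not an agreed definition.

```python
import re

# Hypothetical signature list, derived only from the log excerpt in this issue.
# Matching any 5xx status (not just 500) is an assumption.
TRANSIENT_PATTERNS = [
    re.compile(r"RPC failed; HTTP 5\d\d"),
    re.compile(r"The requested URL returned error: 5\d\d"),
    re.compile(r"remote: Internal Server Error"),
    re.compile(r"fatal: expected flush after ref listing"),
    re.compile(r"fatal: expected 'packfile'"),
]


def transient_matches(log_text: str) -> list[str]:
    """Return the source of every pattern found anywhere in the log text."""
    return [p.pattern for p in TRANSIENT_PATTERNS if p.search(log_text)]


def is_transient_failure(log_text: str) -> bool:
    """True when at least one known transient signature appears in the log."""
    return bool(transient_matches(log_text))
```

Returning the matched patterns, not just a boolean, lets the caller print which signature triggered a retry.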

Expected Behavior

When a workflow fails because GitHub checkout or another clearly transient GitHub-side operation hits an infrastructure error, the repository SHOULD retry or rerun automatically within a bounded policy. Logic bugs, validation failures, test failures, and deterministic workflow mistakes MUST still fail normally without automatic reruns.
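One possible shape for a bounded retry is sketched below. It retries a step only while a caller-supplied predicate calls the output transient, and stops at once on any other failure. The names `run_with_bounded_retry`, `step`, and `is_transient`, the default of three attempts, and the linear backoff are all hypothetical choices, not settled policy.

```python
import time
from typing import Callable


def run_with_bounded_retry(
    step: Callable[[], tuple[int, str]],
    is_transient: Callable[[str], bool],
    max_attempts: int = 3,
    base_delay_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, int]:
    """Run `step` until it succeeds, fails deterministically, or attempts run out.

    `step` returns (exit_code, combined_output).
    The result is (last_exit_code, attempts_used).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    exit_code = 0
    for attempt in range(1, max_attempts + 1):
        exit_code, output = step()
        if exit_code == 0:
            return exit_code, attempt
        if not is_transient(output):
            # Deterministic failures are surfaced immediately, never retried.
            print(f"attempt {attempt}: non-transient failure, not retrying")
            return exit_code, attempt
        if attempt < max_attempts:
            delay = base_delay_seconds * attempt  # linear backoff (assumption)
            print(f"attempt {attempt}/{max_attempts}: transient failure, "
                  f"retrying in {delay:.0f}s")
            sleep(delay)

    print(f"giving up after {max_attempts} transient failures")
    return exit_code, max_attempts
```

Each retry prints a line, so the job log records how often the infrastructure failed, and the `sleep` parameter can be replaced in tests to skip the waits.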

Scope

Investigate and implement a workflow-level resilience strategy for transient GitHub Actions failures, covering:

  • checkout and fetch failures with HTTP 500 or similar GitHub-side transport errors
  • short-lived internal GitHub service failures that disappear on immediate rerun
  • a bounded retry or rerun mechanism that does not hide genuine workflow regressions

Acceptance Criteria

  • We define which failure signatures count as transient GitHub or network infrastructure failures.
  • Repository workflows can retry or rerun automatically when those transient signatures are detected.
  • The retry policy is bounded and visible in logs so maintainers can still diagnose flaky infrastructure.
  • Deterministic failures from workflow logic, command failures, tests, or validation do not get retried automatically.
  • The implementation documents where the retry policy applies and any intentionally excluded workflows or steps.
  • README or docs are updated if maintainers need to understand or tune the behavior.
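The criteria above could be turned into a run-level decision along these lines. This is a sketch only: `RerunPolicy`, `should_rerun`, the limit of three attempts, and the example workflow names are hypothetical. It assumes `run_attempt` is taken from the `github.run_attempt` context value (which starts at 1) and that each failed job has already been classified as transient or not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RerunPolicy:
    """Hypothetical policy knobs; the defaults are placeholders, not agreed values."""
    max_run_attempts: int = 3
    excluded_workflows: frozenset = frozenset()


def should_rerun(workflow_name, run_attempt, failed_jobs_transient,
                 policy=RerunPolicy()):
    """Decide whether a failed run may be rerun automatically.

    failed_jobs_transient holds one boolean per failed job: True when that
    job's failure matched a transient signature. Returns (decision, reason)
    so the reason can be written to the log.
    """
    if workflow_name in policy.excluded_workflows:
        return False, "workflow is excluded from automatic reruns"
    if not failed_jobs_transient:
        return False, "no failed jobs to rerun"
    if not all(failed_jobs_transient):
        return False, "at least one failure is deterministic"
    if run_attempt >= policy.max_run_attempts:
        return False, f"rerun limit of {policy.max_run_attempts} attempts reached"
    return True, f"all failures transient, attempt {run_attempt} of {policy.max_run_attempts}"


# Usage with made-up workflow names:
policy = RerunPolicy(max_run_attempts=3, excluded_workflows=frozenset({"release"}))
print(should_rerun("tests", 1, [True, True], policy))
print(should_rerun("tests", 1, [True, False], policy))
```

A positive decision could then trigger a rerun of only the failed jobs, for example with `gh run rerun --failed` from a follow-up workflow; that wiring is left open here.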

Non-Goals

  • Retrying failing tests, lint, changelog validation, or other real quality-signal failures.
  • Hiding repeated infrastructure instability without surfacing that retries happened.
  • Introducing unbounded rerun loops.

Metadata

  • Assignees: No one assigned
  • Labels: No labels
  • Type: No type
  • Projects: Status: Released
  • Milestone: No milestone
  • Relationships: None yet
  • Development: No branches or pull requests