Summary
Some repository workflows still fail on transient GitHub-side infrastructure errors such as git fetch HTTP 500 responses during checkout. In those cases the workflow logic is fine, but the run still requires a manual rerun to go green.
Current Behavior
Intermittent errors like the following can fail a job even though rerunning the same workflow immediately afterwards succeeds:
/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules origin +refs/heads/*:refs/remotes/origin/* +refs/tags/*:refs/tags/*
Error: error: RPC failed; HTTP 500 curl 22 The requested URL returned error: 500
Error: fatal: expected flush after ref listing
The process '/usr/bin/git' failed with exit code 128
Waiting 10 seconds before trying again
/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules origin +refs/heads/*:refs/remotes/origin/* +refs/tags/*:refs/tags/*
Error: error: RPC failed; HTTP 500 curl 22 The requested URL returned error: 500
Error: fatal: expected 'packfile'
The process '/usr/bin/git' failed with exit code 128
Waiting 14 seconds before trying again
/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules origin +refs/heads/*:refs/remotes/origin/* +refs/tags/*:refs/tags/*
remote: Internal Server Error
Error: fatal: unable to access 'https://github.com/php-fast-forward/dev-tools/': The requested URL returned error: 500
Error: The process '/usr/bin/git' failed with exit code 128
Expected Behavior
When a workflow fails because GitHub checkout or another clearly transient GitHub-side operation hits an infrastructure error, the repository SHOULD retry or rerun automatically within a bounded policy. Logic bugs, validation failures, test failures, and deterministic workflow mistakes MUST still fail normally without automatic reruns.
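One way to draw the line between transient and deterministic failures is to match job output against a short allowlist of signatures. The following is a minimal sketch, not a proposed final implementation: the function name `is_transient_failure` and the constant `TRANSIENT_PATTERNS` are hypothetical, and the three patterns are simply lifted from the log excerpt above. The real list still has to be agreed on.

```python
import re

# Assumed signatures, copied from the log excerpt in this issue.
# Only 5xx responses are treated as transient; 4xx errors are not.
TRANSIENT_PATTERNS = [
    re.compile(r"RPC failed; HTTP 5\d\d"),
    re.compile(r"The requested URL returned error: 5\d\d"),
    re.compile(r"remote: Internal Server Error"),
]


def is_transient_failure(log_text: str) -> bool:
    """Return True if the log contains any known transient signature."""
    return any(pattern.search(log_text) for pattern in TRANSIENT_PATTERNS)
```

Anything that does not match the allowlist, such as a failing test or a validation error, is treated as deterministic and fails normally.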
Scope
Investigate and implement a workflow-level resilience strategy for transient GitHub Actions failures, covering:
- checkout and fetch failures with HTTP 500 or similar GitHub-side transport errors
- short-lived internal GitHub service failures that disappear on immediate rerun
- a bounded retry or rerun mechanism that does not hide genuine workflow regressions
Acceptance Criteria
- We define which failure signatures count as transient GitHub or network infrastructure failures.
- Repository workflows can retry or rerun automatically when those transient signatures are detected.
- The retry policy is bounded and visible in logs so maintainers can still diagnose flaky infrastructure.
- Deterministic failures from workflow logic, command failures, tests, or validation do not get retried automatically.
- The implementation documents where the retry policy applies and any intentionally excluded workflows or steps.
- README or docs are updated if maintainers need to understand or tune the behavior.
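The criteria above (bounded, visible in logs, no retries for deterministic failures) could be met by a small wrapper around a step. This is a sketch under stated assumptions: `run_with_bounded_retry`, its parameters, and the linear backoff are all hypothetical choices for illustration, and `step` is assumed to return an `(exit_code, output)` pair.

```python
import time
from typing import Callable, Tuple


def run_with_bounded_retry(
    step: Callable[[], Tuple[int, str]],
    is_transient: Callable[[str], bool],
    max_attempts: int = 3,
    base_delay_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    log: Callable[[str], None] = print,
) -> int:
    """Run `step` until it succeeds, fails deterministically, or the budget runs out.

    Returns the exit code of the last attempt.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        exit_code, output = step()
        if exit_code == 0:
            return 0
        prefix = f"attempt {attempt}/{max_attempts}"
        if not is_transient(output):
            # Deterministic failures are surfaced immediately.
            log(f"{prefix}: deterministic failure, not retrying")
            return exit_code
        if attempt == max_attempts:
            log(f"{prefix}: transient failure, retry budget exhausted")
            return exit_code
        # Linear backoff (assumed): 10s, 20s, ... with the default base delay.
        delay = base_delay_seconds * attempt
        log(f"{prefix}: transient failure, retrying in {delay:.0f}s")
        sleep(delay)
    raise AssertionError("unreachable")
```

Every retry decision is logged, so repeated infrastructure instability stays visible to maintainers even when the run eventually goes green.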
Non-Goals
- Retrying failing tests, lint, changelog validation, or other real quality-signal failures.
- Hiding repeated infrastructure instability without surfacing that retries happened.
- Introducing unbounded rerun loops.
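To rule out unbounded rerun loops at the workflow level, a rerun decision can be capped by the run attempt number, which GitHub Actions exposes as `github.run_attempt` (and the `GITHUB_RUN_ATTEMPT` environment variable). The guard below is a hypothetical sketch; the name `should_rerun` and the default cap of two attempts are assumptions, not settled policy.

```python
def should_rerun(
    run_attempt: int,
    failure_is_transient: bool,
    max_run_attempts: int = 2,
) -> bool:
    """Decide whether a failed run may be rerun automatically.

    `run_attempt` is 1 for the first run and increases with each rerun.
    """
    if not failure_is_transient:
        # Real quality-signal failures are never rerun automatically.
        return False
    return run_attempt < max_run_attempts
```

With the assumed default, a first run that fails on a transient error gets exactly one automatic rerun, and a second failure is left for a maintainer to inspect.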