Remote cache import network error is fatal but should be considered a cache miss #2836

Open
sipsma opened this issue Apr 28, 2022 · 7 comments

@sipsma
Collaborator

sipsma commented Apr 28, 2022

In Dagger CI we got an error when some GHA cache was being imported for an ExecOp:

failed to compute cache key: failed to copy: invalid status response 503 Egress is over the account limit.

The error is just a problem with GHA, but BuildKit failed the solve as a result of it. Ideally BuildKit should treat this as a cache miss rather than a fatal error.

AFAIK that's supposed to be the behavior, but here I suspect the problem is that the cache lookup itself succeeded and the error only happened later, once the remote blob was being unlazied. If so, this is a pretty tricky problem, as it would mean the solver needs to "go back", treat the vertex that was previously a cache hit as a miss, and re-execute it.
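
A minimal sketch of that "go back" idea, written in Python for illustration only. It is not BuildKit's code, and `LazyRef`, `TransferError`, and `solve_vertex` are hypothetical names. The cache lookup hands back a lazy reference, the download only happens in `unlazy()`, and a failed download demotes the hit to a miss so the vertex is re-executed:

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


class TransferError(Exception):
    """Hypothetical stand-in for a network failure while pulling a cache blob."""


@dataclass
class LazyRef:
    """A cache hit whose blob is not downloaded until unlazy() is called."""
    fetch: Callable[[], bytes]

    def unlazy(self) -> bytes:
        return self.fetch()


def solve_vertex(cached: Optional[LazyRef],
                 execute: Callable[[], bytes]) -> Tuple[bytes, str]:
    """Return (result, source), re-executing if the cached blob cannot be pulled."""
    if cached is None:
        return execute(), "executed (cache miss)"
    try:
        return cached.unlazy(), "cache"
    except TransferError:
        # The lookup reported a hit, but the pull failed: demote it to a miss.
        return execute(), "executed (fallback after failed pull)"


def failing_fetch() -> bytes:
    # Simulates the 503 from the issue description surfacing at unlazy time.
    raise TransferError("invalid status response 503")


result, source = solve_vertex(LazyRef(failing_fetch), lambda: b"rebuilt layer")
print(source)
```

The sketch handles a single vertex in isolation. In a real solver the hard part is that dependent vertices may already have been resolved against the cache hit by the time the pull fails, which is presumably why this is described as tricky.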

@tonistiigi
Member

tonistiigi commented Apr 28, 2022

Correct, the issue here is the lazy semantics. The cache load was allowed to fail, but when we added lazy semantics, many errors stopped happening in the actual cache load phase and instead happen later, when the ref gets unlazied. I tried to fix this for the errors that happened because GitHub removed a blob, with this hack: #2387 066a011. Transfer errors would need a different approach.

@tonistiigi
Member

docker/build-push-action#577 also has the same root cause, I believe.

@thewilkybarkid

GHA cache is currently broken (refs actions/cache#820); I'm seeing a build (using docker/build-push-action) fail.

It seems to ignore the cache failure when reading, but fails when trying to write to the cache:

#3 [internal] load metadata for docker.io/library/node:16.15.0-alpine3.15
#3 DONE 0.5s
#4 importing cache manifest from gha:18241222899969544227
#4 ERROR: GitHub Actions is temporarily unavailable. Please visit https://www.githubstatus.com/ for the status of our services.
#5 [node 1/2] FROM docker.io/library/node:16.15.0-alpine3.15@sha256:1a9a71ea86aad332aa7740316d4111ee1bd4e890df47d3b5eff3e5bded3b3d10
[...]
#18 exporting cache
#18 preparing build cache for export
#18 preparing build cache for export 51.4s done
#18 ERROR: GitHub Actions is temporarily unavailable. Please visit https://www.githubstatus.com/ for the status of our services.
------
 > importing cache manifest from gha:18241222899969544227:
------
------
 > exporting cache:
------
error: failed to solve: GitHub Actions is temporarily unavailable. Please visit https://www.githubstatus.com/ for the status of our services.
Error: buildx failed with: error: failed to solve: GitHub Actions is temporarily unavailable. Please visit https://www.githubstatus.com/ for the status of our services.

@ciaranmcnulty

I got something similar. I think I was flooding GHA when exporting cache and getting rate limited:

#49 [test] exporting cache
#49 preparing build cache for export
#49 preparing build cache for export 1.3s done
#49 ERROR: maximum timeout reached

This led to failed builds even though the images were built correctly.

ciaranmcnulty added a commit to ciaranmcnulty/php-docker-extensions that referenced this issue Aug 31, 2022
It's a shame, but when the GHA cache rate-limits that comes across as a build failure

see moby/buildkit#2836
@melnikalex

Is this not solved by #3430?

thewilkybarkid added a commit to PREreview/prereview.org that referenced this issue May 11, 2023
Builds regularly fail when multiple Dependabot PRs are being built simultaneously, as they seem to overload the service. This change should allow them to continue, even if they cannot use the cache.

Refs #905, moby/buildkit#2836
@thewilkybarkid

thewilkybarkid commented May 11, 2023

Seems to be, I've just seen

#25 exporting to GitHub cache
#25 preparing build cache for export
#25 preparing build cache for export 82.5s done
#25 ERROR: GitHub Actions is temporarily unavailable. Please visit https://www.githubstatus.com/ for the status of our services.
------
 > exporting to GitHub cache:
------

appear in a GitHub action and the job continued.

@sipsma
Collaborator Author

sipsma commented May 16, 2023

> appear in a GitHub action and the job continued.

That's different from the problem in the issue description. Network errors are now non-fatal both when exporting cache and when resolving the cache metadata during an import.

The remaining issue is that errors that happen during the actual pull of remote cache layers will be fatal. This is because pulling of layers is lazy and happens after the remote cache metadata has been resolved.
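
To make the timing concrete, here is a rough Python sketch of the two import phases described above. It is illustrative only and does not reflect BuildKit's actual implementation; `NetworkError`, `import_cache`, `build`, and `unavailable` are hypothetical names. The metadata lookup is guarded, so a failure there becomes a cache miss, while the layer pull runs later, outside that guard, so its failure propagates and fails the build:

```python
class NetworkError(Exception):
    """Hypothetical stand-in for any remote cache transport failure."""


def import_cache(resolve_metadata):
    """Phase 1: resolve remote cache metadata; a failure here is only a miss."""
    try:
        return resolve_metadata()
    except NetworkError:
        return None  # non-fatal: the build proceeds without the cache


def build(resolve_metadata, pull_layer):
    """Return a list of events; raises if the deferred layer pull fails."""
    events = []
    manifest = import_cache(resolve_metadata)
    if manifest is None:
        events.append("metadata error ignored, executing step")
        return events
    events.append("cache hit recorded")
    # Phase 2: the pull is deferred and runs outside the guard above,
    # so a NetworkError raised here is not converted into a cache miss.
    pull_layer(manifest)
    events.append("layer pulled")
    return events


def unavailable(*_args):
    raise NetworkError("GitHub Actions is temporarily unavailable")


# A failure while resolving metadata is swallowed.
print(build(unavailable, lambda manifest: None))

# A failure during the deferred pull escapes and fails the build.
try:
    build(lambda: "manifest", unavailable)
except NetworkError as err:
    print("fatal:", err)
```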
