Exported build cache randomly missing some platforms #2822
Comments
@Rongronggg9 #1044: It seems that the cache of the later push overwrites the cache of the previous platform's push, so only the cache of the last platform can be saved.
@Mytting I've already checked that issue before, but I believe that this issue differs from it:
In fact, I suspect there are some undocumented limitations on the registry cache that prevent all of the caches from being saved.
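For reference, the #1044 scenario quoted above looks roughly like the sketch below: two independent single-platform builds exporting a `type=registry` cache to the same ref, so whichever job finishes last replaces the other's cache. The image and cache refs are hypothetical; this is only an illustration of that setup, not anyone's actual workflow.

```yaml
# Sketch of the #1044-style setup: both jobs export their cache to the SAME
# registry ref, so the job that finishes last wins and the other platform's
# cache is lost. Refs are hypothetical.
on: push
jobs:
  build-amd64:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: docker/setup-buildx-action@v2
      - uses: docker/build-push-action@v3
        with:
          context: .
          platforms: linux/amd64
          cache-to: type=registry,ref=ghcr.io/example/app:buildcache,mode=max
  build-arm64:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: docker/setup-qemu-action@v2
      - uses: docker/setup-buildx-action@v2
      - uses: docker/build-push-action@v3
        with:
          context: .
          platforms: linux/arm64
          # If this job finishes later, its export replaces the amd64 cache.
          cache-to: type=registry,ref=ghcr.io/example/app:buildcache,mode=max
```

The issue reported here is different: a single multi-platform build, which should export one cache covering every platform, still randomly loses some of them.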
I've figured out a workflow that reproduces this issue easily. Please check https://github.com/Rongronggg9/RSSHub/tree/reproduce-buildkit-cache-issue
Run attempt 1: https://github.com/Rongronggg9/RSSHub/actions/runs/2248798559/attempts/1
Run attempt 2: https://github.com/Rongronggg9/RSSHub/actions/runs/2248798559/attempts/2
The cache issue for the registry cache seems solved, right? That's probably because I applied some tricks to shrink the cache size. What if we make it huge again? Please check https://github.com/Rongronggg9/RSSHub/tree/reproduce-buildkit-cache-issue-huge
Run attempt 1: https://github.com/Rongronggg9/RSSHub/actions/runs/2248914216/attempts/1
Run attempt 2: https://github.com/Rongronggg9/RSSHub/actions/runs/2248914216/attempts/2
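The linked branches contain the actual reproducer; its rough shape is a multi-platform build that exports a registry cache, followed immediately by a rebuild that should be fully cached. The sketch below only illustrates that shape; the refs are hypothetical.

```yaml
# Rough shape of the reproducer: one multi-platform build exporting a
# registry cache, then an immediate rebuild that should hit it entirely.
on: push
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: docker/setup-qemu-action@v2
      - uses: docker/setup-buildx-action@v2
      - uses: docker/build-push-action@v3
        with:
          context: .
          platforms: linux/amd64,linux/arm/v7,linux/arm64
          cache-from: type=registry,ref=ghcr.io/example/app:buildcache  # hypothetical ref
          cache-to: type=registry,ref=ghcr.io/example/app:buildcache,mode=max
  rebuild:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: docker/setup-qemu-action@v2
      - uses: docker/setup-buildx-action@v2
      # Every step should report CACHED here; in practice the cache for one
      # or two platforms is randomly missing and those platforms rebuild.
      - uses: docker/build-push-action@v3
        with:
          context: .
          platforms: linux/amd64,linux/arm/v7,linux/arm64
          cache-from: type=registry,ref=ghcr.io/example/app:buildcache
```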
Conclusion
What randomness! It seems that some indeterminacy randomly prevents BuildKit (or maybe it's the fault of buildx?) from exporting the build caches of some platforms.
Is the indeterminacy relevant to the cache type, or irrelevant? Probably irrelevant. The strangest thing is that it also occurs with the `gha` cache.
I believe that as long as the rebuild job runs immediately after the build job, the GitHub Actions cache has not been shrunk yet (besides, the build caches within each run attempt total less than 10 GB, so even if the cache had been shrunk, the older entries would be evicted, not these). It still has not been shrunk; last check 2022-04-30T18:40:55Z, 14 hours after the last run attempt:
{
  "full_name": "Rongronggg9/RSSHub",
  "active_caches_size_in_bytes": 15912122436,
  "active_caches_count": 709
}
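For reference, the JSON above has the shape of GitHub's Actions cache-usage endpoint (`GET /repos/{owner}/{repo}/actions/cache/usage`). A minimal step like the one below, assuming the default `GITHUB_TOKEN` is sufficient, can capture the same numbers during a run:

```yaml
# Print the repository's Actions cache usage, i.e. the same fields as the
# JSON above (full_name, active_caches_size_in_bytes, active_caches_count).
- name: Show Actions cache usage
  env:
    GH_TOKEN: ${{ github.token }}
  run: gh api repos/${{ github.repository }}/actions/cache/usage
```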
For more details, you need to know the cache hit/miss judging mechanism. Basically speaking, each step of each stage has its own metadata used by the mechanism; it is like a hash.
What the inline cache does is push this metadata (the "hash") along with the image to the registry. However, since the inline cache is incompatible with `mode=max`, only the metadata of the final stage is embedded. There are two use cases of the inline cache:
If you are interested in https://github.com/DIYgod/RSSHub/blob/eb79456f402b268d8aa5a4f25060b7bc8b6d10f6/.github/workflows/docker-release.yml#L84-L97, let me tell you why the non-last stages have their caches stored locally. In the previous build step of the workflow job, all stages have already been built and cached both locally and remotely (you may inspect the local caches by executing …).
Since the last stage copies files from the penultimate stage, a little metadata from the penultimate stage is somehow inline-cached. As a result, the penultimate stage is able to hit the cache. The caches of the remaining stages come from the local caches of the buildx builder that were written in the previous build step of the workflow job (no need to specify …).
If you are trying to work around the cache issue, the inline cache may not be a nice choice if you use GHA to build your image, since everything is cleared after the job finishes. Of course, unless your …
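To make the inline-cache setup above concrete, here is a minimal sketch (the image ref is hypothetical). `type=inline` only supports min-mode metadata embedded in the pushed image, which is why the non-last stages have to fall back to the builder's local cache:

```yaml
# Inline cache: the cache metadata is embedded in the pushed image itself,
# so there is no separate cache ref to push.
- uses: docker/build-push-action@v3
  with:
    context: .
    push: true
    tags: ghcr.io/example/app:latest                        # hypothetical ref
    cache-from: type=registry,ref=ghcr.io/example/app:latest
    cache-to: type=inline                                   # min mode only; mode=max is not supported

# The builder's local cache (what the non-last stages hit) can be inspected with:
- run: docker buildx du
```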
Investigating a related issue with buildx, I have found that the manifest content of a multi-platform image (amd64, arm64) randomly changes order. This comment is just intended as a possible pointer; it could be completely unrelated. Attached is a diff of the two manifests:
--- /tmp/meta-538b4.json 2022-06-20 22:39:33.302897680 -0600
+++ /tmp/meta-80e8a.json 2022-06-20 22:39:57.467873367 -0600
@@ -3,24 +3,24 @@
"manifest": {
"schemaVersion": 2,
"mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
- "digest": "sha256:538b4667e072b437a5ea1e0cd97c2b35d264fd887ef686879b0a20c777940c02",
+ "digest": "sha256:80e8a68eb9363d64eabdeaceb1226ae8b1794e39dd5f06b700bae9d8b1f356d5",
"size": 743,
"manifests": [
{
"mediaType": "application/vnd.docker.distribution.manifest.v2+json",
- "digest": "sha256:cef1b67558700a59f4a0e616d314e05dc8c88074c4c1076fbbfd18cc52e6607b",
+ "digest": "sha256:2bc150cfc0d4b6522738b592205d16130f2f4cde8742cd5434f7c81d8d1b2908",
"size": 1367,
"platform": {
- "architecture": "arm64",
+ "architecture": "amd64",
"os": "linux"
}
},
{
"mediaType": "application/vnd.docker.distribution.manifest.v2+json",
- "digest": "sha256:2bc150cfc0d4b6522738b592205d16130f2f4cde8742cd5434f7c81d8d1b2908",
+ "digest": "sha256:cef1b67558700a59f4a0e616d314e05dc8c88074c4c1076fbbfd18cc52e6607b",
"size": 1367,
"platform": {
- "architecture": "amd64",
+ "architecture": "arm64",
"os": "linux"
}
            }
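For anyone who wants to produce a similar comparison, something along these lines should work; the refs and file names are hypothetical, and `docker buildx imagetools inspect --raw` prints the manifest list roughly as stored in the registry:

```yaml
# Dump the raw manifest lists of two multi-platform tags and diff them,
# similar to the comparison shown above.
- name: Diff manifest lists
  run: |
    docker buildx imagetools inspect --raw ghcr.io/example/app:tag-a > /tmp/meta-a.json
    docker buildx imagetools inspect --raw ghcr.io/example/app:tag-b > /tmp/meta-b.json
    diff -u /tmp/meta-a.json /tmp/meta-b.json || true
```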
Hi, I've just observed a somewhat similar behaviour here: I'm building different platforms in a matrix (because of docker/build-push-action#826, probably due to BuildKit too), and the build which finishes last overwrites the previous one.
jobs:
build-image:
strategy:
matrix:
arch:
- amd64
- arm64
runs-on: ubuntu-latest
steps:
[...]
-
name: Build image
uses: docker/build-push-action@v4
with:
cache-from: type=gha
cache-to: type=gha,mode=max
context: .
load: true
platforms: linux/${{ matrix.arch }}
          tags: ${{ steps.login-ecr.outputs.registry }}/main/frontend:${{ github.sha }}-${{ matrix.arch }}
What I can observe across 3 runs with identical repo contents:
It would be very useful if the restore key preserved the arch so that the caches would live separately (not that different arches should be mixed anyway, imho). Thank you
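One way to get the per-arch separation asked for above is the `scope` attribute of the `gha` cache backend, so that each matrix leg reads and writes its own cache entries instead of sharing the default scope. A sketch based on the workflow above:

```yaml
# Same build step as above, but with a per-arch cache scope so the matrix
# legs stop overwriting each other's gha cache.
- name: Build image
  uses: docker/build-push-action@v4
  with:
    context: .
    load: true
    platforms: linux/${{ matrix.arch }}
    cache-from: type=gha,scope=build-${{ matrix.arch }}
    cache-to: type=gha,mode=max,scope=build-${{ matrix.arch }}
    tags: ${{ steps.login-ecr.outputs.registry }}/main/frontend:${{ github.sha }}-${{ matrix.arch }}
```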
It seems that it sometimes works for the registry cache and sometimes it fails. Works: https://github.com/renovatebot/docker-renovate/actions/runs/5530222002/jobs/10089343401
It's the arm64 cache that goes missing most of the time (at least when I noticed it).
TL;DR
#2822 (comment)
In #2758 (comment), @tonistiigi said that
If someone has something similar with a single node then create a separate issue with a reproducer
so here it is. I am facing almost the same issue, but what I use is GitHub Actions, which should only have a single-node builder.
In my case, I need to build for 3 platforms (`linux/amd64`, `linux/arm/v7`, `linux/arm64`). Each time, 1 or 2 platforms are randomly left entirely unable to fetch their cache and must start over.
Build settings
(I added line breaks to make it readable.)
The node
How to reproduce
… `linux/amd64` got.
FYI
Workflow YAML
Dockerfile
This issue seems to exist when the cache type is `gha`, but I am not so sure whether this is because of the cache size limit of GitHub Actions.
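The collapsed Build settings / Workflow YAML / Dockerfile sections are not reproduced above; as a rough sketch only, the kind of configuration described amounts to a single-node GitHub-hosted builder doing one multi-platform build with a `gha` cache export:

```yaml
# Single multi-platform build on a single-node GitHub-hosted builder,
# exporting one gha cache that should cover all three platforms.
- uses: docker/setup-qemu-action@v2
- uses: docker/setup-buildx-action@v2
- uses: docker/build-push-action@v3
  with:
    context: .
    platforms: linux/amd64,linux/arm/v7,linux/arm64
    cache-from: type=gha
    cache-to: type=gha,mode=max
```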