
Exported build cache randomly missing some platforms #2822

Open

Rongronggg9 opened this issue Apr 24, 2022 · 8 comments

Rongronggg9 commented Apr 24, 2022

TL;DR

#2822 (comment)


In #2758 (comment), @tonistiigi said, "If someone has something similar with a single node then create a separate issue with a reproducer", so here it is.

I am facing almost the same issue, but I am using GitHub Actions, which should only have a single-node builder.

In my case, I need to build for 3 platforms (linux/amd64, linux/arm/v7, linux/arm64). Each time, 1 or 2 platforms, seemingly at random, are entirely unable to fetch their cache and must start over.

Build settings

      - name: Build and push Docker image (ordinary version)
        uses: docker/build-push-action@v2
        with:
          context: .
          push: true
          tags: ${{ steps.meta-ordinary.outputs.tags }}
          labels: ${{ steps.meta-ordinary.outputs.labels }}
          platforms: linux/amd64,linux/arm/v7,linux/arm64
          cache-from: type=registry,ref=${{ secrets.DOCKER_USERNAME }}/rsshub:buildcache
          cache-to: type=registry,ref=${{ secrets.DOCKER_USERNAME }}/rsshub:buildcache,mode=max
/usr/bin/docker buildx build 
--cache-from type=registry,ref=***/rsshub:buildcache 
--cache-to type=registry,ref=***/rsshub:buildcache,mode=max 
--iidfile /tmp/docker-build-push-AxluK5/iidfile 
--label org.opencontainers.image.title=RSSHub 
--label org.opencontainers.image.description=🍰 Everything is RSSible 
--label org.opencontainers.image.url=https://github.com/Rongronggg9/RSSHub 
--label org.opencontainers.image.source=https://github.com/Rongronggg9/RSSHub 
--label org.opencontainers.image.version=latest 
--label org.opencontainers.image.created=2022-04-24T03:45:12.333Z 
--label org.opencontainers.image.revision=16de07893dacaaf7382fc985ccd43f3f5e587646 
--label org.opencontainers.image.licenses=MIT 
--platform linux/amd64,linux/arm/v7,linux/arm64 
--tag ***/rsshub:latest 
--tag ***/rsshub:2022-04-24 
--metadata-file /tmp/docker-build-push-AxluK5/metadata-file 
--push .

(I added linebreaks to make it readable.)

The node

  Name:   builder-798d7bea-6cd7-454f-baa6-32276a2b4c26
  Driver: docker-container
  
  Nodes:
  Name:      builder-798d7bea-6cd7-454f-baa6-32276a2b4c260
  Endpoint:  unix:///var/run/docker.sock
  Status:    running
  Flags:     --allow-insecure-entitlement security.insecure --allow-insecure-entitlement network.host
  Platforms: linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/amd64/v4, linux/arm64, linux/riscv64, linux/ppc64le, linux/s390x, linux/386, linux/mips64le, linux/mips64, linux/arm/v7, linux/arm/v6
  {
    "name": "builder-798d7bea-6cd7-454f-baa6-32276a2b4c26",
    "driver": "docker-container",
    "node_name": "builder-798d7bea-6cd7-454f-baa6-32276a2b4c260",
    "node_endpoint": "unix:///var/run/docker.sock",
    "node_status": "running",
    "node_flags": "--allow-insecure-entitlement security.insecure --allow-insecure-entitlement network.host",
    "node_platforms": "linux/amd64,linux/amd64/v2,linux/amd64/v3,linux/amd64/v4,linux/arm64,linux/riscv64,linux/ppc64le,linux/s390x,linux/386,linux/mips64le,linux/mips64,linux/arm/v7,linux/arm/v6"
  }

How to reproduce

  1. Delete all relevant build caches on Docker Hub to avoid potential interference. (optional)
  2. Trigger action run attempt #1, which should push the build caches of all 3 platforms to the registry.
  3. Wait until it finishes, then re-run the job to trigger action run attempt #2, in which all 3 platforms should get cache hits, but only linux/amd64 did.

FYI

Workflow YAML
Dockerfile

This issue also seems to exist when the cache type is gha, but I am not sure whether that is because of the cache size limit of GitHub Actions.
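For context, the gha cache type for docker/build-push-action is configured roughly like this (a minimal sketch, not the exact workflow linked above):

      - name: Build and push Docker image (gha cache)
        uses: docker/build-push-action@v2
        with:
          context: .
          push: true
          platforms: linux/amd64,linux/arm/v7,linux/arm64
          # Store layer caches in the GitHub Actions cache service instead of a registry
          cache-from: type=gha
          cache-to: type=gha,mode=max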


Nick-0314 commented Apr 29, 2022

@Rongronggg9 #1044 It seems that the cache from the later push overwrites the cache from the earlier platform's push, so only the cache of the last platform can be saved.


Rongronggg9 commented Apr 29, 2022

@Mytting I've already checked that issue, but I believe this issue differs from it:

  1. Multi-node builder vs. single-node builder.
  2. Only the build cache from the latest-built platform being available vs. randomly one or two platforms losing their build cache.

In fact, I suspect that there are some undocumented limitations on the registry cache preventing all caches from being saved. If I do some tricks to shrink the cache size, the problem will probably be solved. Some of my repositories have similar workflows but never face this issue, while the one mentioned in the issue description does.


Rongronggg9 commented Apr 30, 2022

I've figured out a workflow to reproduce this issue easily. Please check https://github.com/Rongronggg9/RSSHub/tree/reproduce-buildkit-cache-issue

Run attempt 1: https://github.com/Rongronggg9/RSSHub/actions/runs/2248798559/attempts/1

| cache type | builder cache | exported cache [1] | build time | rebuild time | cache-miss platform(s) |
| --- | --- | --- | --- | --- | --- |
| gha | 3.563G | 208M | 12m 58s | 10m 8s | linux/arm64 |
| registry | 3.563G | 295M | 21m 44s | 6m 1s | / |
| registry (uncompressed) | 5.177G | 748M | 14m 34s | 12m 21s | linux/amd64, linux/arm64, linux/arm/v7 (ERROR: failed to authorize: failed to fetch oauth token: Post "https://auth.docker.io/token": EOF) |
| local | 3.563G | 475M | 12m 42s | 6m 1s | / |
| local (uncompressed) | 5.177G | 1.5G (350M pushed) | 12m 29s | 9m 41s | linux/arm64 |

Run attempt 2: https://github.com/Rongronggg9/RSSHub/actions/runs/2248798559/attempts/2

| cache type | builder cache | exported cache [1] | build time | rebuild time | cache-miss platform(s) |
| --- | --- | --- | --- | --- | --- |
| gha | 3.563G | 294M | 14m 41s | 5m 56s | / |
| registry | 3.563G | 294M | 14m 46s | 6m 5s | / |
| registry (uncompressed) | 5.177G | 1295M | 13m 37s | 10m 29s | linux/arm64 |
| local | 3.563G | 303M | 20m 52s | N/A [2] | N/A |
| local (uncompressed) | 5.177G | 1.5G | 12m 20s | N/A [2] | N/A |

The cache issue for the registry seems solved, right? That's probably because I've applied some tricks to shrink the cache size. What if the cache is made huge again? Please check: https://github.com/Rongronggg9/RSSHub/tree/reproduce-buildkit-cache-issue-huge

Run attempt 1: https://github.com/Rongronggg9/RSSHub/actions/runs/2248914216/attempts/1

| cache type | builder cache | exported cache [1] | build time | rebuild time | cache-miss platform(s) |
| --- | --- | --- | --- | --- | --- |
| gha | 8.605G | 921M | 18m 36s | 11m 22s | linux/arm64 |
| registry | 8.605G | 917M | 18m 14s | 11m 58s | linux/arm64 |
| registry (uncompressed) | 12.25G | 3303M | 17m 45s | 11m 14s | linux/arm64 |
| local | 8.605G | 1.1G | 18m 28s | 10m 31s | linux/arm/v7 |
| local (uncompressed) | 12.25G | 1.9G (615M pushed) | 16m 57s | 13m 16s | linux/arm64, linux/arm/v7 |

Run attempt 2: https://github.com/Rongronggg9/RSSHub/actions/runs/2248914216/attempts/2

| cache type | builder cache | exported cache [1] | build time | rebuild time | cache-miss platform(s) |
| --- | --- | --- | --- | --- | --- |
| gha | 8.605G | 483M | 21m 37s | 12m 39s | linux/arm64, linux/arm/v7 |
| registry | 8.605G | 917M | 19m 24s | 11m 31s | linux/arm64 |
| registry (uncompressed) | 12.25G | 3319M | 26m 13s | 9m 58s | linux/arm/v7 |
| local | 8.605G | 659M | 18m 31s | N/A [2] | N/A |
| local (uncompressed) | 12.25G | 3.4G | 16m 40s | N/A [2] | N/A |

Conclusion

What randomness!

It seems that there is some indeterminacy randomly preventing BuildKit (or maybe it is buildx's fault?) from exporting the build caches of some platforms. Is the indeterminacy related to the cache type or not? Probably not.

The strangest thing is that it also occurs with the local cache type. In theory, nothing should prevent BuildKit from exporting the build cache to a local path.
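(For reference, a local cache configuration for build-push-action looks roughly like this; a sketch with illustrative paths, not necessarily the exact settings of the linked workflow:)

          # local cache backend: export/import the build cache via a directory on the runner
          cache-from: type=local,src=/tmp/.buildx-cache
          cache-to: type=local,dest=/tmp/.buildx-cache,mode=max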

I believe that, as long as the rebuild job is run immediately after the build job, the GitHub Actions cache has not been shrunk yet (besides, the build caches within each run attempt total less than 10 GB, so even if the cache were shrunk, it would be the old entries that suffer, not these):

It still has not been shrunk; last checked 2022-04-30T18:40:55Z, 14 hours after the last run attempt.
https://api.github.com/repos/Rongronggg9/rsshub/actions/cache/usage

{
  "full_name": "Rongronggg9/RSSHub",
  "active_caches_size_in_bytes": 15912122436,
  "active_caches_count": 709
}

Footnotes

  1. For local, it is the exported cache size; for gha and registry, it is the total TX bytes on eth0.

  2. This does not have any meaning since the commit hash had not changed and the new cache was not pushed.

Rongronggg9 changed the title from "Build cache pushed to registry randomly missing some platforms" to "Exported build cache randomly missing some platforms" on Apr 30, 2022
Nick-0314 commented

@Rongronggg9
Cache to inline, but cache from registry. How is the cache pushed to the registry? How does this part of the logic work? I didn't understand it. I was trying to solve the multi-builder cache problem with this:

      - name: Build and push Docker image (Chromium-bundled version)
        uses: docker/build-push-action@v2
        with:
          context: .
          build-args: PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=0
          push: true
          tags: ${{ steps.meta-chromium-bundled.outputs.tags }}
          labels: ${{ steps.meta-chromium-bundled.outputs.labels }}
          platforms: linux/amd64,linux/arm/v7,linux/arm64
          cache-from: |
            type=registry,ref=${{ secrets.DOCKER_USERNAME }}/rsshub:chromium-bundled
          # type=gha,scope=docker-release # not needed, Docker automatically uses local cache from the builder
          # type=registry,ref=${{ secrets.DOCKER_USERNAME }}/rsshub:buildcache
          cache-to: type=inline,ref=${{ secrets.DOCKER_USERNAME }}/rsshub:chromium-bundled  # inline cache is enough


Rongronggg9 commented May 7, 2022

@Mytting

Inline cache embeds cache metadata into the image config. The layers in the image will be left untouched compared to the image with no cache information.

For more details, you need to understand the cache hit/miss mechanism. Basically, each step of each stage has its own metadata used by the mechanism; it is like a hash. For a RUN statement and the like, it is the statement itself that determines cache hit or miss. For a COPY statement and the like, it is the content of the copied files that determines cache hit or miss. If a step in a stage gets a cache miss, all following steps in that stage are forced to miss as well.

What inline cache does is push this metadata (the "hash") along with the image to the registry. However, since inline cache is incompatible with max cache mode, only the metadata of the last stage is pushed. Why? The built image only contains layers from the last stage; without the layers from previous stages, even if the cache hits, no layers can be reused since they simply do not exist. With max mode enabled, however, every layer (that is, the result of every step of every stage in the Dockerfile), along with its metadata, is exported and pushed.
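To make the contrast concrete, the two cache-to settings look roughly like this in a build-push-action step (a sketch; the ref is a placeholder):

          # Inline cache: metadata is embedded in the pushed image itself.
          # Effectively min mode, so only the final stage can be reused.
          cache-to: type=inline

          # Registry cache, max mode: every stage's layers and metadata are
          # exported to a separate cache reference.
          cache-to: type=registry,ref=<user>/rsshub:buildcache,mode=max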

There are two use cases for inline cache:

  1. Single-stage Dockerfile
  2. The caches (layers and cache metadata) of non-final stages are stored somewhere else

If you are interested in https://github.com/DIYgod/RSSHub/blob/eb79456f402b268d8aa5a4f25060b7bc8b6d10f6/.github/workflows/docker-release.yml#L84-L97, let me tell you why the non-final stages have their caches stored locally.

In the previous build step of the workflow job, all stages have been built and cached both locally and remotely (you may inspect local caches by executing docker buildx du --verbose). The build step you mentioned just changes the result of the last two stages in the Dockerfile.
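(As a side note, a debug step like the following can dump those local cache records in a workflow; just a sketch:)

      - name: Inspect local build cache
        run: docker buildx du --verbose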

Since the last stage copies files from the penultimate stage, a little metadata from the penultimate stage is somehow inline-cached; as a result, the penultimate stage is able to hit the cache. The caches of the remaining stages come from the local caches of the buildx builder that were written during the previous build step of the workflow job (no need to specify cache-from; they are reused automatically). Thus, the build step you mentioned can have everything hit the cache.


If you are trying to work around the cache issue, inline cache may not be a good choice if you use GHA to build your image, since everything is cleared after the job finishes, unless, of course, your Dockerfile is single-staged. Just a reminder: https://github.com/DIYgod/RSSHub/blob/eb79456f402b268d8aa5a4f25060b7bc8b6d10f6/.github/workflows/docker-release.yml has not had any workaround for the issue applied and still hits it from time to time. The workaround I've mentioned in docker/buildx#1044 (comment) deserves a try if you really need it.

jobcespedes commented

While investigating a related issue with buildx, I have found that the manifest content of a multi-platform image (amd64, arm64) randomly changes order. This comment is just intended as a possible pointer; it could be completely unrelated. Attached is a diff of the two manifests:

--- /tmp/meta-538b4.json      2022-06-20 22:39:33.302897680 -0600
+++ /tmp/meta-80e8a.json      2022-06-20 22:39:57.467873367 -0600
@@ -3,24 +3,24 @@
   "manifest": {
     "schemaVersion": 2,
     "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
-    "digest": "sha256:538b4667e072b437a5ea1e0cd97c2b35d264fd887ef686879b0a20c777940c02",
+    "digest": "sha256:80e8a68eb9363d64eabdeaceb1226ae8b1794e39dd5f06b700bae9d8b1f356d5",
     "size": 743,
     "manifests": [
       {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
-        "digest": "sha256:cef1b67558700a59f4a0e616d314e05dc8c88074c4c1076fbbfd18cc52e6607b",
+        "digest": "sha256:2bc150cfc0d4b6522738b592205d16130f2f4cde8742cd5434f7c81d8d1b2908",
         "size": 1367,
         "platform": {
-          "architecture": "arm64",
+          "architecture": "amd64",
           "os": "linux"
         }
       },
       {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
-        "digest": "sha256:2bc150cfc0d4b6522738b592205d16130f2f4cde8742cd5434f7c81d8d1b2908",
+        "digest": "sha256:cef1b67558700a59f4a0e616d314e05dc8c88074c4c1076fbbfd18cc52e6607b",
         "size": 1367,
         "platform": {
-          "architecture": "amd64",
+          "architecture": "arm64",
           "os": "linux"
         }
       }

Tbaile added a commit to NethServer/phonehome-server that referenced this issue Oct 18, 2022

aogier commented Apr 12, 2023

Hi, I've just observed a somewhat similar behaviour here: I'm building different platforms in a matrix (because of docker/build-push-action#826, probably due to BuildKit too), and the build which finishes last overwrites the previous ones.
My config:

jobs:

  build-image:
    strategy:
      matrix:
        arch:
          - amd64
          - arm64
    runs-on: ubuntu-latest

    steps:

      [...]

      -
        name: Build image
        uses: docker/build-push-action@v4
        with:
          cache-from: type=gha
          cache-to: type=gha,mode=max
          context: .
          load: true
          platforms: linux/${{ matrix.arch }}
          tags: ${{ steps.login-ecr.outputs.registry }}/main/frontend:${{ github.sha }}-${{ matrix.arch }}

What I can observe in 3 runs with identical repo contents:

| run | amd | arm | notes |
| --- | --- | --- | --- |
| 1 | ++ | +++++ | empty caches |
| 2 | ++ | cached | arm is now cached |
| 3 | cached | +++++ | amd is now cached |

It would be very useful if the restore key preserved the arch so the caches would live separately (not that different arches should be mixed anyway, imho).
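For what it's worth, scoping the gha cache per architecture might approximate this today (a sketch using the scope parameter of the gha cache backend; the scope name is illustrative and I have not tested it):

          cache-from: type=gha,scope=frontend-${{ matrix.arch }}
          cache-to: type=gha,mode=max,scope=frontend-${{ matrix.arch }}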

Thank you


viceice commented Jul 12, 2023

It seems that it sometimes works for the registry cache and sometimes fails.

works: https://github.com/renovatebot/docker-renovate/actions/runs/5530222002/jobs/10089343401
fails: https://github.com/renovatebot/docker-renovate/actions/runs/5530341846/jobs/10090968895

It's arm64 most of the time (at least when I noticed it).
