
Docker BuildKit caching w/ --cache-from fails every second time, except when using docker-container #2274

Closed
jli opened this issue Jul 22, 2021 · 18 comments · Fixed by #4796

@jli commented Jul 22, 2021

Similar to #1981, but it's still happening with 20.10.7, and I have a minimal reproduction case.

Version information

  • Macbook Air (M1, 2020)
  • macOS Big Sur 11.4
  • Docker Desktop 3.5.2 (66501)
% docker version
Client:
 Cloud integration: 1.0.17
 Version:           20.10.7
 API version:       1.41
 Go version:        go1.16.4
 Git commit:        f0df350
 Built:             Wed Jun  2 11:56:23 2021
 OS/Arch:           darwin/arm64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.7
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       b0f5bc3
  Built:            Wed Jun  2 11:55:36 2021
  OS/Arch:          linux/arm64
  Experimental:     false
 containerd:
  Version:          1.4.6
  GitCommit:        d71fcd7d8303cbf684402823e425e9dd2e99285d
 runc:
  Version:          1.0.0-rc95
  GitCommit:        b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Steps to reproduce

Have this Dockerfile:

# syntax=docker/dockerfile:1
FROM debian:buster-slim
RUN yes | head -20 | tee /yes.txt
COPY . /app

Run this script:

#!/bin/bash
set -euo pipefail
export DOCKER_BUILDKIT=1  # exported so the setting actually applies to the docker build below
docker system prune -a -f
docker build \
    -t circularly/docker-cache-issue-20210722:cachebug \
    --cache-from circularly/docker-cache-issue-20210722:cachebug \
    --build-arg BUILDKIT_INLINE_CACHE=1 \
    .
docker push circularly/docker-cache-issue-20210722:cachebug
# this causes a change in the local files to simulate a code-only change
date > date_log.txt

(also here: https://github.com/jli/docker-cache-issue-20210722 )

What I see: When I run the above script repeatedly, whether the RUN yes | head -20 | tee /yes.txt step is cached alternates on every run. The docker build output alternates between:

  • => [2/3] RUN yes | head -20 | tee /yes.txt
  • => CACHED [2/3] RUN yes | head -20 | tee /yes.txt

With docker-container driver

This comment by @tonistiigi suggested using the "container driver". This does seem to work! I tried replacing the docker build command above with this:

docker buildx create --driver docker-container --name cache-bug-workaround
docker buildx build --builder cache-bug-workaround --load \
    -t circularly/docker-cache-issue-20210722:cachebug-containerdriver \
    --cache-from circularly/docker-cache-issue-20210722:cachebug-containerdriver \
    --build-arg BUILDKIT_INLINE_CACHE=1 \
    .
docker buildx rm --builder cache-bug-workaround

This consistently results in the RUN yes ... step being cached!

The problem is that docker buildx doesn't appear to be a subcommand in the https://hub.docker.com/_/docker image, which is what we use in CI. Is there a way to use the container driver when using that image?
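
One possible workaround, sketched below, would be to install the buildx CLI plugin manually in the CI environment (the buildx version, architecture, and install path here are assumptions, not a verified recipe):

# Hypothetical: install the buildx CLI plugin into the docker image used in CI.
# Adjust BUILDX_VERSION and the architecture suffix to match your runner.
BUILDX_VERSION=v0.11.2
mkdir -p ~/.docker/cli-plugins
wget -O ~/.docker/cli-plugins/docker-buildx \
  "https://github.com/docker/buildx/releases/download/${BUILDX_VERSION}/buildx-${BUILDX_VERSION}.linux-amd64"
chmod +x ~/.docker/cli-plugins/docker-buildx
docker buildx version  # should now report the installed plugin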

Could you help me understand why this is needed?
Will this be fixed with a future release?

@jli (Author) commented Jul 23, 2021

Two issues I'm noticing with using the docker-container driver to work around the caching issue:

  1. It adds some export/import steps
  2. docker push seems to be pushing all layers?

With the default driver, rebuilds of code-only changes take ~1 minute (when I get proper caching of the expensive layers in my image).
With the docker-container driver, these 2 factors mean rebuilds after code-only changes take ~4-5 minutes.

export/import steps

#25 exporting to oci image format
#25 exporting layers done
#25 exporting manifest sha256:01230f6377dec5a6988c924373bb62afe2837d3afa7bb0e84e98a016481c1c81 done
#25 exporting config sha256:4f48d81bc559f074600e3088949591f885d4ef3c74b8d833408864b6bd013df4 done
#25 sending tarball
#25 ...

#26 importing to docker
#26 DONE 32.1s

#25 exporting to oci image format
#25 sending tarball 43.0s done
#25 DONE 43.0s

This seems to add an extra minute to the build. I'm working with large images (~3.5gb from various scientific Python libraries), which I'm guessing exacerbates this issue.

docker push issue

Pushing my 3.5gb image takes ~3 minutes.

It seems that with the docker-container driver, docker push isn't able to see that the expensive layers are shared, so it pushes all the layers instead of only the new ones. I'm guessing this based on the docker push output not saying "Layer already exists":

6474dc186dfd: Preparing
2d80b2e557e9: Preparing
59149f33a870: Preparing
ed04f21afbe5: Preparing
c9ec67fe6421: Preparing
e42dc4266416: Preparing
a55e5a0e7c4a: Preparing
aef13dfbb6f9: Preparing
1e602bec2da5: Preparing
b1c4e3f331ea: Preparing
3fdf9f44ae06: Preparing
78ce42cd87aa: Preparing
82e21ae59256: Preparing
02c055ef67f5: Preparing
e42dc4266416: Waiting
3fdf9f44ae06: Waiting
a55e5a0e7c4a: Waiting
78ce42cd87aa: Waiting
aef13dfbb6f9: Waiting
1e602bec2da5: Waiting
82e21ae59256: Waiting
b1c4e3f331ea: Waiting
02c055ef67f5: Waiting
ed04f21afbe5: Pushed
59149f33a870: Pushed
c9ec67fe6421: Pushed
2d80b2e557e9: Pushed
aef13dfbb6f9: Pushed
1e602bec2da5: Pushed
b1c4e3f331ea: Pushed
3fdf9f44ae06: Pushed
6474dc186dfd: Pushed
82e21ae59256: Pushed
78ce42cd87aa: Pushed
02c055ef67f5: Pushed
e42dc4266416: Pushed
a55e5a0e7c4a: Pushed

I push several tags. The first push takes 3 minutes, and the rest of the tags finish quickly as they all say "Layer already exists" for all the layers.

@Bi0max commented Aug 2, 2021

I opened #1981 and I can confirm that my reproducible example also still does not work.

@shootkin commented Sep 7, 2021

Same problem when building from inside docker:20.10.8-dind

@sherifabdlnaby

Same issue here.

@thomasfrederikhoeck

Same issue here on 20.10.11+azure-3

@marchaos

Any update on this? This seems like a major issue, and the alternative of using docker-container is untenable due to those issues noted above.

@tonistiigi (Member)

Can someone test with the master version of dockerd? 20.10 is a couple of BuildKit releases old, and it has been confirmed that this indeed works with BuildKit directly.

airenas added a commit to airenas/boost that referenced this issue Sep 12, 2022
It turned out that using `DOCKER_BUILDKIT=1` has a problem with caching:
moby/buildkit#2274.
Using `docker buildx` would fix it,
but it may not be installed on every machine.
For now, BuildKit is turned on only for the boost image.
nonsense pushed a commit to filecoin-project/boost that referenced this issue Sep 15, 2022
* Consolidate makefiles

- Move docker building stuff to the main makefile
- Drop internal makefiles
- Allow to build lotus from source
- Update readme

* Fix caching a docker build of lotus-test

It turned out that using `DOCKER_BUILDKIT=1` has a problem with caching:
moby/buildkit#2274.
Using `docker buildx` would fix it,
but it may not be installed on every machine.
For now, BuildKit is turned on only for the boost image.
@Raniz85 commented Oct 19, 2022

Same issue here (Debian Bullseye)

$ docker --version
Docker version 20.10.14, build a224086

@adityapatadia

We are facing the same issue in Bitbucket Pipelines.

@jli (Author) commented Nov 26, 2022

This ended up being enough of a drag on my team's productivity that we came up with a workaround. We've been using it for about a month, and it has been working really well for us so far.

We split out a "base" Docker image which installs all our dependencies, and then we have a "final" Docker image which just copies the code on top of the base image as a final layer.

The important part is that these are distinct images and not just separate layers, which is how we work around the inconsistent layer caching behavior.

Our "final" Dockerfile just looks like:

FROM container-host.com/your-project/your-base-image:latest-version
COPY . /app

Downside: This setup makes it harder to test changes to the base image. Instead of just updating a single Dockerfile and building+pushing, you need to (1) change the "base" Dockerfile/dependencies, (2) build and push the base image to your container host with a new tag for testing, and (3) edit the "final" Dockerfile to reference the new testing tag. I wrote a Python script to do steps 2 and 3 (a sketch follows), so testing changes to our base image is still fairly streamlined.
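
A shell sketch of roughly what that Python script automates (the Dockerfile names, tag scheme, and paths here are placeholders, not our actual setup):

#!/bin/bash
# Hypothetical sketch of steps 2 and 3; names and paths are placeholders.
set -euo pipefail
BASE_IMAGE=container-host.com/your-project/your-base-image
NEW_TAG="test-$(date +%Y%m%d-%H%M%S)"
# (2) Build and push the base image under a new testing tag.
docker build -f Dockerfile.base -t "$BASE_IMAGE:$NEW_TAG" .
docker push "$BASE_IMAGE:$NEW_TAG"
# (3) Point the "final" Dockerfile's FROM line at the new tag.
sed -i.bak "s|^FROM ${BASE_IMAGE}:.*|FROM ${BASE_IMAGE}:${NEW_TAG}|" Dockerfile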
Note: It would take some more work to make this fully integrated with CI such that the base image used in prod is also built in CI. Currently, we just use the base images built on local machines when people make changes. This is acceptable to us, but some people may have more stringent requirements.

Overall, this has definitely been worth it for us, especially since our base image is huge (3GB of Python ML dependencies) and takes a long time to build, so cache misses were extremely painful.

  • docker build for code-only changes are guaranteed to only copy the code layer.
  • docker push for the new code-only layers is also guaranteed to be fast (when the cache would break for base layers before, people would have to upload 3GB of data, sometimes over spotty WiFi or while tethering)
  • everyone is guaranteed to share the expensive central base image. new team members or people who've pruned their cache just need to download the base image instead of building their own local copy (docker pull never worked for this, in my experience)
  • building our Docker image in CI is guaranteed to be fast, and also much simpler since we no longer have a bunch of verbose --cache-from flags and extra docker push calls to get caching in CI builds. (Though see note above about fully integrating this process in CI)

@ShadowLNC

Based on my limited testing, using docker pull <version> for every image used in --cache-from arguments will suppress this bug.

This was noted as a workaround in #1981, but may not always work, based on the comment above. We're using Bitbucket Pipelines (regular runner, not the self-hosted ones), which means no access to buildx, limited Docker updates, and x86-only builds - any one of which might affect the viability of this workaround.

As a side note, docker pull <tag> || true can be used in pipeline steps where you're not sure if the image exists.
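
A minimal sketch of that pattern in a pipeline step (the image name and tag are placeholders):

# Hypothetical pipeline step: pull every cache source first, then build.
IMAGE=yourrepo/yourimage
docker pull "$IMAGE:latest" || true   # tolerate the image not existing yet
docker build \
    --build-arg BUILDKIT_INLINE_CACHE=1 \
    --cache-from "$IMAGE:latest" \
    -t "$IMAGE:latest" .
docker push "$IMAGE:latest"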

@rucciva commented Jul 27, 2023

Same issue with buildctl-daemonless.sh

@tomlau10

Our team has been facing this same issue recently in GitHub Actions, since the latest runner image updated the Docker version to v23+, which uses BuildKit as the default build engine.

Our original cache flow is:

  • pull same commit sha tag || pull latest tag
  • build with --cache-from <same commit sha> --cache-from <latest>
  • tag the new image as <commit sha> & latest
  • push both tags

And with this flow we have the exact same issue that --cache-from fails every second time.
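
In shell terms, that flow is roughly the following (the image name, CI variables, and the inline-cache build arg are assumptions):

# Rough sketch of the original flow; names and variables are placeholders.
IMAGE=yourrepo/yourimage
docker pull "$IMAGE:$COMMIT_SHA" || docker pull "$IMAGE:latest" || true
docker build \
    --build-arg BUILDKIT_INLINE_CACHE=1 \
    --cache-from "$IMAGE:$COMMIT_SHA" \
    --cache-from "$IMAGE:latest" \
    -t "$IMAGE:$COMMIT_SHA" -t "$IMAGE:latest" .
docker push "$IMAGE:$COMMIT_SHA"
docker push "$IMAGE:latest"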


Tried pulling all image tags beforehand, but it did not help. Based on my observations, it seems that:

  • If an image is built from scratch, then this image CAN be used as cache
  • Otherwise, if an image is built using a cache image, then it CANNOT be used as cache
  • This matches the strange caching behaviour, because we always push the newly built image as latest and then build from it the next time.

So our current workaround is to add a specific CI step (a sketch follows after this list):

  • if the trigger branch is deployment/build_cache,
    it builds the image from SCRATCH and pushes it with the tag <image>:build_cache
  • all other trigger branches build with --cache-from <image>:build_cache instead of the latest tag
  • if the Dockerfile is changed (e.g. the base image is updated),
    we push deployment/build_cache once to update the cache
  • so far the caching behaviour has been more consistent
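
A rough sketch of that CI logic (the branch and image names, CI variables, and the inline-cache build arg are placeholders/assumptions, not our exact config):

# Hypothetical CI step implementing the workaround above.
IMAGE=yourrepo/yourimage
if [ "$CI_BRANCH" = "deployment/build_cache" ]; then
    # Cache branch: build from scratch and publish the dedicated cache image.
    docker build --no-cache \
        --build-arg BUILDKIT_INLINE_CACHE=1 \
        -t "$IMAGE:build_cache" .
    docker push "$IMAGE:build_cache"
else
    # All other branches: reuse the scratch-built cache image.
    docker pull "$IMAGE:build_cache" || true
    docker build \
        --build-arg BUILDKIT_INLINE_CACHE=1 \
        --cache-from "$IMAGE:build_cache" \
        -t "$IMAGE:$COMMIT_SHA" .
    docker push "$IMAGE:$COMMIT_SHA"
fi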

@matti commented Oct 24, 2023

I can confirm that I have exactly the same setup as @tomlau10, and it started to fail every other time in GitHub Actions.

@tomlau10

I had a look at my team's deploy log and everything seems fine. Have you tried pushing deployment/build_cache to refresh the cache image again? @matti


I encountered this about two months ago. I first noticed it on 27/8, and upon investigation, I found that the Docker version in GitHub Actions' runner image had been updated. Later, on 25/9, I pushed deployment/build_cache to refresh the cache, and everything began working as expected again.

I suspect that if the Docker version used to build the cache image does not match the one in use when building with --cache-from, then this issue can occur. According to the runner image changelog, the following updates have been made:

  • 15/9 updated to 24.0.6
  • 24/8 updated to 24.0.5
  • 3/8 updated to 23.0.6+azure-2

This aligns with my hypothesis:

  • I set up the cache in mid-August when the Docker version was 23.0.6+azure-2.
  • The cache started failing on 27/8 because the Docker version had already been updated to 24.0.5 on 24/8.
  • When I finally had time to debug and push deployment/build_cache on 25/9, the Docker version was 24.0.6, which is the latest version provided by the GitHub Actions runner image. So everything works fine up to now.

Side note:
Docker version 24.0.7 was released recently, on 27/10. I think my cache setup will start to fail again soon when the action runner image is updated, and I will have to push deployment/build_cache to refresh the cache again. 🙈
https://docs.docker.com/engine/release-notes/24.0/#2407

@SDeans0 commented Feb 20, 2024

Did anybody else observe that after they gave up and split their requirements installation into a separate build, the new requirements build step always cached properly?

@cvn commented Feb 23, 2024

I am working around this by using the legacy builder. It's deprecated, but still works as of Docker v25.

# Enable legacy builder
export DOCKER_BUILDKIT=0

docker pull $MY_IMAGE:latest || true
docker build --cache-from $MY_IMAGE:latest --tag $MY_IMAGE:latest .
docker push $MY_IMAGE:latest

@DaniilAnichin

@tonistiigi Not to rush or anything, but what is the estimate for the next patch release?
Our ultimate goal is for the updated version to be available for use in Bitbucket Pipelines, and I'd be glad to know roughly when to expect that to happen.

Thanks in advance
