Proposal: Use graph search to find valid chains of relevant docker build cache hits #20304
Summary: As a build tool, it's not enough to sort build cache entries by recency and pick only the most recent matching image. A cache miss should cause the builder to backtrack (via recursive graph traversal) and try the next most recent matching image, until it exhausts all available images; at that point it should backtrack to the previous Dockerfile step and repeat the exhaustive search.
Otherwise, docker build will miss valid cache entries even though the cached layers and/or images are technically available. This bug causes a significant number of unnecessary rebuilds and cache misses.
Reproducing the bug: Unexpected cache miss after successful build --no-cache
Scenario: one git repository with two branches. One branch has a Dockerfile with N lines, the last of which is "RUN echo beep >> /tmp/beep.txt"; the other branch has a Dockerfile with the same N lines, except the last is instead "RUN echo bloop >> /tmp/bloop.txt".
Run docker build four times: once on the first branch, once on the second branch, and then once more on each branch, to show that the docker build cache is working correctly. The first two builds should have some cache misses: the first because the Dockerfile is new, and the second because its final line is new. The last two builds should have only cache hits and no cache misses. This works fine, as expected.
However, after a successful build --no-cache, a subsequent (regular) "docker build" run on the other branch will suffer cache misses. This is unexpected, because valid intermediate images are still available in the docker build cache, and they still satisfy all of the cache hit rules.
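To make the failure mode concrete, here is a minimal Python model of the layer cache. This is entirely hypothetical and simplified (images are (id, parent, instruction) triples, and `cached_build` follows only the most recent matching child, mimicking the behavior this issue describes):

```python
from itertools import count

ids = count(1)
images = []  # (image_id, parent_id, instruction), in creation order

def children(parent, step):
    """All cached images built from `parent` by `step`, oldest first."""
    return [i for (i, p, s) in images if p == parent and s == step]

def record_chain(steps, parent=None):
    """Model `docker build --no-cache`: record a fresh chain of images."""
    for step in steps:
        img = next(ids)
        images.append((img, parent, step))
        parent = img

def cached_build(steps):
    """Model a regular `docker build`: follow only the most recent
    matching child at each step; once a step misses, everything after
    it is rebuilt fresh. Returns the number of cache misses."""
    parent, misses = None, 0
    for step in steps:
        match = children(parent, step) if misses == 0 else []
        if match:
            parent = match[-1]            # most recent child only
        else:
            misses += 1
            img = next(ids)
            images.append((img, parent, step))
            parent = img
    return misses

def misses_backtracking(steps, parent=None):
    """The proposal: try every matching child, depth-first with
    backtracking, and return the minimal achievable miss count."""
    if not steps:
        return 0
    best = len(steps)                     # worst case: all fresh
    for child in children(parent, steps[0]):
        best = min(best, misses_backtracking(steps[1:], child))
    return best

common = ["FROM debian", "RUN apt-get update"]
branch_a = common + ["RUN echo beep >> /tmp/beep.txt"]
branch_b = common + ["RUN echo bloop >> /tmp/bloop.txt"]

print(cached_build(branch_a))         # 3: cold build, all misses
print(cached_build(branch_b))         # 1: only the last step misses
record_chain(branch_a)                # simulate build --no-cache
print(cached_build(branch_b))         # 1: unexpected cache miss!
print(misses_backtracking(branch_b))  # 0: an older chain still matched
```

In this model the --no-cache build creates a fresh FROM image that becomes the most recent match, so the next build of the other branch follows it into a dead end, even though the older chain still fully satisfies the build.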
"Use the most recent image from the build cache" will fail after using --no-cache because "last known most recently matching image" will naturally not have any relevant child images in the build cache, despite there being working prior images in the build cache that could conceivably be used to correctly satisfy the build. The "last known most recently matching image" will be whatever image was recently created during --no-cache, and therefore have none of the children from previous cached builds. Meanwhile the build cache is still present and available and has useful things inside.
Generally, the builder should prefer to select any satisfactory set of cached images, as long as the resulting build is still correct. Until the builder has a complete plan, it should prefer the plans with the lowest cumulative cache-miss score. Since this is essentially a graph search problem, basic heuristics could bound how long to grind on the search. Since fresh builds can take a long time, it seems reasonable to spend whole seconds on the search, or even let the user configure the search budget via an API parameter when starting the builder.
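A rough sketch of such a planner (hypothetical: `children(parent, step)` stands in for a lookup against the image store, and `budget_s` is the proposed configurable search budget): depth-first search over cached parent/child chains, keeping the plan with the fewest misses and stopping at a deadline.

```python
import time

def best_plan(steps, children, budget_s=2.0):
    """Exhaustive depth-first search for the chain of cached images
    covering the longest prefix of `steps`, bounded by a time budget.
    Returns (miss_count, plan), where plan labels each step HIT/MISS."""
    deadline = time.monotonic() + budget_s
    best = 0                          # longest run of cache hits found

    def search(parent, depth):
        nonlocal best
        best = max(best, depth)
        if depth == len(steps) or time.monotonic() > deadline:
            return
        for child in children(parent, steps[depth]):
            search(child, depth + 1)  # backtrack on dead ends

    search(None, 0)
    plan = ([("HIT", s) for s in steps[:best]] +
            [("MISS", s) for s in steps[best:]])
    return len(steps) - best, plan

# Toy image store: (parent, instruction) -> image ids, oldest first.
store = {
    (None, "FROM debian"): ["img1", "img5"],
    ("img1", "RUN apt-get update"): ["img2"],
    ("img5", "RUN apt-get update"): ["img6"],
    ("img2", "RUN echo bloop >> /tmp/bloop.txt"): ["img4"],
}
lookup = lambda parent, step: store.get((parent, step), [])

steps = ["FROM debian", "RUN apt-get update",
         "RUN echo bloop >> /tmp/bloop.txt"]
print(best_plan(steps, lookup)[0])   # 0: a fully cached chain exists
```

Here the most recent FROM candidate (img5) leads to a dead end, but the search backtracks to img1 and finds the fully cached chain; a most-recent-only strategy would report a miss on the last step.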
Graph search should significantly improve cache hit performance on many CI clusters, saving what I am sure would be billions of dollars of compute time for builds, which I am sure you can all agree should instead be sent to my Bitcoin addresses, that's how this works, right? Thanks.
"Use the most recent image from the build cache" was implemented here: https://github.com/docker/docker/blob/415dd8688618ac2b3872c6a5b5af3afcef2c3e10/daemon/daemon.go#L1363
Plans and dry runs
There should be a way to generate a "docker build plan" that is not yet executed, based on comparing the current Dockerfile against the DOCKER_HOST's build cache image store. Being able to compare multiple alternative plans would be extremely useful for diagnosing docker build cache problems (including cache misses) and for finding potential optimizations. It would also be useful to be able to "hint" to the builder that I expect the cache to be already warm for a particular build request, so that I receive an error instead of a fresh build further contaminating the cache.
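For example, a dry-run API might predict HIT/MISS per step without creating any images, and a warm-cache hint could fail fast on any predicted miss. A hypothetical sketch (`dry_run` mimics the current most-recent-child lookup; `check_warm` is the proposed hint, not an existing feature):

```python
def dry_run(steps, children):
    """Predict HIT/MISS per Dockerfile step without executing the
    build, using the current most-recent-child cache lookup."""
    plan, parent = [], None
    for i, step in enumerate(steps):
        match = children(parent, step)
        if not match:                        # first miss: rest is fresh
            plan += [("MISS", s) for s in steps[i:]]
            break
        parent = match[-1]                   # most recent child only
        plan.append(("HIT", step))
    return plan

def check_warm(plan):
    """A hypothetical 'expect warm cache' hint: error out instead of
    letting a fresh build contaminate the cache."""
    cold = [s for tag, s in plan if tag == "MISS"]
    if cold:
        raise RuntimeError("expected warm cache; cold steps: %r" % cold)

# Toy image store: (parent, instruction) -> image ids, oldest first.
store = {
    (None, "FROM debian"): ["img1"],
    ("img1", "RUN apt-get update"): ["img2"],
}
lookup = lambda parent, step: store.get((parent, step), [])

plan = dry_run(["FROM debian", "RUN apt-get update", "RUN make"], lookup)
print(plan)                   # first two steps HIT, "RUN make" MISSes
try:
    check_warm(plan)          # raises: the cache is not fully warm
except RuntimeError as err:
    print(err)
```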
There seems to be a claim floating around on the Docker site that the build cache is invalidated when a previous intermediate image is not used. However, based on my (short) review of the source code, the old intermediate images stay in the cache and are not invalidated. If that is not the case, it would be useful to make this clearer in the documentation or even in the source code.
Distributed cache warming?
Unfortunately, concurrent builds on the same DOCKER_HOST seem to be quite slow (see #9656). At the same time, sharing a docker build cache would be very helpful for reducing total build time across a CI cluster of multiple DOCKER_HOSTs. If sharing is unlikely, then a tool for DOCKER_HOST build assignment would help: inspect the image store on each host, then assign the job to the host most likely to have the most cache hits for that build.
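A sketch of that assignment tool (hypothetical: it assumes each host can report a predicted HIT/MISS plan for the job, along the lines of the build-plan idea above, and simply picks the host with the most predicted hits):

```python
def pick_host(plans_by_host):
    """Choose the DOCKER_HOST whose predicted build plan has the most
    cache hits. `plans_by_host` maps host name to a list of
    ("HIT"|"MISS", instruction) pairs, as a plan API might return."""
    def hits(plan):
        return sum(1 for tag, _ in plan if tag == "HIT")
    return max(plans_by_host, key=lambda host: hits(plans_by_host[host]))

plans = {  # hypothetical predicted plans reported by two hosts
    "host-a": [("HIT", "FROM debian"), ("MISS", "RUN make"),
               ("MISS", "RUN make install")],
    "host-b": [("HIT", "FROM debian"), ("HIT", "RUN make"),
               ("MISS", "RUN make install")],
}
print(pick_host(plans))   # host-b: two predicted hits versus one
```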
Alternatively, I would also be interested in exploring the concept of pulling relevant cached intermediates from a registry.
One more bug...
I was originally investigating possible cache bugs because I was experiencing unexpected cache misses on my CI cluster. Unfortunately I am not using --no-cache, so none of this seems to explain the unexpected cache misses I have been experiencing. If you think image graph cache search would be helpful beyond the after---no-cache unexpected behavior (for example, for "output reason for docker cache invalidation" or "faster cache miss detection"), please let me know.
I have updated the issue text to clarify the topic area and the primary problem. Perhaps as an alternative way of closing this issue, documentation could be added somewhere to set expectations around cache hit frequency, mentioning that builds run after building similar (but not identical) images will see a very low cache hit rate. I am not sure this wording is correct, so please don't copy-paste it from this comment; the situation probably needs closer analysis.