Proposal: Use graph search to find valid chains of relevant docker build cache hits #20304

Open
kanzure opened this Issue Feb 14, 2016 · 5 comments

@kanzure

kanzure commented Feb 14, 2016

Proposal: Use graph search to find valid chains of relevant docker build cache hits

Summary: As a build tool, it's not enough to sort build cache entries by recency and pick only the most recent image. A cache miss should cause the builder to backtrack (recursive graph traversal, etc.) and try the next most recent matching image, until it exhausts all available image options, at which point it should recursively backtrack to the previous Dockerfile step and repeat the exhaustive search.

Otherwise, the docker build tool will not find the right cache hits even though the cached layers and/or images are technically available. This bug causes a significant number of rebuilds and cache misses.

Reproducing the bug: unexpected cache misses after a successful build with --no-cache

Scenario: a git repository with two branches, each containing a Dockerfile. One branch's Dockerfile has N lines, the last of which is "RUN echo beep >> /tmp/beep.txt"; the other branch's Dockerfile has the same N lines, except the last one is instead "RUN echo bloop >> /tmp/bloop.txt".

Run docker build four times, once on the first branch, once on the second branch, and then once more on each branch, to show that the docker build cache is working correctly. The first and second builds should have some cache misses: the first because the Dockerfile is new, and the second because its final Dockerfile line is new. The last two of the four runs should have only cache hits and no cache misses. This works fine, as expected.

However, after a successful build with --no-cache on one branch, a subsequent (regular) "docker build" run on the other branch hits unexpected cache misses. This is unexpected because correct intermediate images are still available in the docker build cache, and they are still valid according to all of the cache hit rules. A rough sketch of the reproduction follows.
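For concreteness, here is a reproduction sketch (not a test from this repository). It assumes the docker CLI is on PATH and a daemon is reachable, and it stands in for the two git branches by writing the two Dockerfiles into separate directories:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// Shared Dockerfile prefix; the two "branches" differ only in their last line.
const common = "FROM busybox\nRUN echo shared > /tmp/shared.txt\n"

func writeDockerfile(dir, lastLine string) string {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		panic(err)
	}
	content := common + lastLine + "\n"
	if err := os.WriteFile(filepath.Join(dir, "Dockerfile"), []byte(content), 0o644); err != nil {
		panic(err)
	}
	return dir
}

func build(dir string, extraArgs ...string) {
	args := append([]string{"build"}, extraArgs...)
	args = append(args, dir)
	cmd := exec.Command("docker", args...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "build failed:", err)
	}
}

func main() {
	beep := writeDockerfile("branch-beep", "RUN echo beep >> /tmp/beep.txt")
	bloop := writeDockerfile("branch-bloop", "RUN echo bloop >> /tmp/bloop.txt")

	build(beep)  // cache misses everywhere: nothing is cached yet
	build(bloop) // only the final step should miss
	build(beep)  // expected: all cache hits
	build(bloop) // expected: all cache hits

	build(beep, "--no-cache") // rebuild everything, ignoring the cache
	build(bloop)              // unexpected: cache misses, even though valid intermediates still exist
}
```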

Speculative proposal

It's not enough to sort build cache entries by recency and pick only the most recent image. A cache miss should cause the builder to backtrack (recursive graph traversal, etc.) and try the next most recent matching image, until it exhausts all available image options, at which point it should recursively backtrack to the previous Dockerfile step and repeat the exhaustive search.

"Use the most recent image from the build cache" will fail after using --no-cache because "last known most recently matching image" will naturally not have any relevant child images in the build cache, despite there being working prior images in the build cache that could conceivably be used to correctly satisfy the build. The "last known most recently matching image" will be whatever image was recently created during --no-cache, and therefore have none of the children from previous cached builds. Meanwhile the build cache is still present and available and has useful things inside.

Generally, the builder should prefer to select any satisfactory set of cached images, as long as the build is still estimated to be the correct build. Until the builder has a complete plan, it should prefer the plans with the lowest cumulative cache miss score. Since this is essentially a graph search problem, basic heuristics could govern how long to grind on the search. Since fresh builds can take a lot of time, it seems reasonable to spend entire seconds on the graph search, or even to allow the user to configure how long to search via an API parameter when starting the builder. A sketch of such a search is below.
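A very rough sketch of what that search could look like follows. The Image type, the childrenByParent map, and the matches function are hypothetical stand-ins, not the daemon's actual types; the point is only the backtracking structure.

```go
package cachesearch

// Image is a hypothetical stand-in for a cached image/layer record.
type Image struct {
	ID        string
	ParentID  string
	CreatedBy string // the Dockerfile instruction that produced this layer
}

// matches reports whether a cached image could satisfy a Dockerfile step.
// A real implementation would also hash build-context content for ADD/COPY.
func matches(img Image, step string) bool {
	return img.CreatedBy == step
}

// findChain tries to satisfy steps[i:] starting from the image with ID
// parent. childrenByParent maps a parent image ID to its children, assumed
// sorted most recent first. Instead of committing to the single most recent
// matching child, every matching child is tried in order, and a dead end
// triggers backtracking to the previous step. This simplified version only
// succeeds when the whole remaining chain can be served from cache; a
// practical version would keep the longest (or cheapest) partial chain.
func findChain(childrenByParent map[string][]Image, parent string, steps []string, i int) ([]Image, bool) {
	if i == len(steps) {
		return nil, true // every remaining step satisfied from cache
	}
	for _, child := range childrenByParent[parent] {
		if !matches(child, steps[i]) {
			continue
		}
		if rest, ok := findChain(childrenByParent, child.ID, steps, i+1); ok {
			return append([]Image{child}, rest...), true
		}
		// dead end below this child: backtrack and try the next match
	}
	return nil, false // caller falls back to building step i (a cache miss)
}
```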

Graph search should significantly improve cache hit performance on many CI clusters, saving what I am sure would be billions of dollars of compute time for builds, which I am sure you can all agree should instead be sent to my Bitcoin addresses, that's how this works, right? Thanks.

"Use the most recent image from the build cache" was implemented here: https://github.com/docker/docker/blob/415dd8688618ac2b3872c6a5b5af3afcef2c3e10/daemon/daemon.go#L1363
#3375
#3417

Plans and dry runs

There should be a way to generate a "docker build plan", which is not executed but is based on a comparison of the current Dockerfile against the DOCKER_HOST's docker build cache image store. Being able to compare multiple alternative plans would be extremely useful for diagnosing docker build cache problems (including cache misses) or even potential optimizations. It would also be useful to be able to "hint" to the builder that I expect the cache to be already warm for a particular build request, so that I receive an error instead of experiencing a fresh build that further contaminates the cache.
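To make that concrete, a plan could be as simple as a per-step hit/miss report. Everything below is hypothetical (none of these types or functions exist in the daemon); it only sketches what a dry-run comparison might expose:

```go
package buildplan

import "fmt"

// PlanStep describes what the builder would do for one Dockerfile step.
type PlanStep struct {
	Instruction string // e.g. `RUN echo beep >> /tmp/beep.txt`
	CacheHit    bool
	ImageID     string // cached image that would be reused, if any
}

// BuildPlan is a dry-run result; nothing has been executed yet.
type BuildPlan struct {
	Steps       []PlanStep
	CacheMisses int
}

// RequireWarmCache is the "hint" described above: return an error instead
// of letting a cold build run and further contaminate the cache.
func RequireWarmCache(p BuildPlan) error {
	if p.CacheMisses > 0 {
		return fmt.Errorf("expected a warm cache, but the plan has %d cache misses", p.CacheMisses)
	}
	return nil
}
```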

"Cache invalidation"

For some reason there seems to be a claim floating around (on the docker site) that the docker build cache is invalidated when forgoing the use of a previous intermediate image. However, based on my (short) review of the source code, it seems that the old intermediate images stay around in the cache and are not invalidated. If this is not the case, perhaps it would be useful to make this clearer in the documentation or even within the source code.

Distributed cache warming?

Unfortunately, concurrent builds on the same DOCKER_HOST seem to be pretty slow (see #9656). At the same time, sharing a docker build cache would be very helpful for reducing total build time across a CI cluster of multiple DOCKER_HOSTs. If that is unlikely, then perhaps a tool for DOCKER_HOST build assignment would be helpful: it would look at the image store on each host and assign the job to whichever host is most likely to have the most cache hits for that build. A sketch of that idea follows.
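A sketch of that assignment idea, assuming some way to probe each host's image store (the hasCachedStep helper below is hypothetical; a real version might query each host's remote API or image list):

```go
package scheduler

// hasCachedStep is a hypothetical probe reporting whether a DOCKER_HOST
// already has a cached image for a given Dockerfile step.
func hasCachedStep(host, step string) bool {
	return false // placeholder: e.g. inspect the host's image store
}

// scoreHost counts how many consecutive leading steps look cache-warm on a
// host; the first miss invalidates every step after it.
func scoreHost(host string, steps []string) int {
	score := 0
	for _, step := range steps {
		if !hasCachedStep(host, step) {
			break
		}
		score++
	}
	return score
}

// PickHost returns the DOCKER_HOST expected to have the most cache hits
// for the given Dockerfile steps.
func PickHost(hosts, steps []string) string {
	best, bestScore := "", -1
	for _, h := range hosts {
		if s := scoreHost(h, steps); s > bestScore {
			best, bestScore = h, s
		}
	}
	return best
}
```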

Alternatively, I would also be interested in exploring the concept of pulling relevant cached intermediate images from a registry.

One more bug...

I was originally investigating possible cache bugs because I was experiencing unexpected cache misses on my CI cluster. Unfortunately, I am not using --no-cache, so none of this seems to explain the unexpected cache misses I have been experiencing. If you think image graph cache search would be helpful beyond the after-no-cache unexpected behavior (like maybe it would help with ADD commands or something?), I would be eager to hear your thoughts. Thanks!

Related

"output reason for docker cache invalidation"
#9294

"unexpected intermediate images are selected after using --no-cache" (I believe that this was reported before "select the most recent image" was implemented in #3375 and #3417.)
#3199

"faster cache miss detection"
#16821

@LK4D4

Contributor

LK4D4 commented Oct 18, 2016

This is quite old and I'm not sure what to do about it.
Maybe @tonistiigi has something to say.

@kanzure


kanzure commented Aug 10, 2017

I have updated the issue text to clarify the topic area and the primary problem. Perhaps as an alternative way of closing this issue, some documentation could be added somewhere to state expectations around cache hit frequency, and just mention that there will be a very low cache hit frequency for builds after building similar (but not exactly the same) images. I am not sure if this wording is correct, so don't copy-paste my wording from this comment; the situation probably needs a closer analysis.

@tonistiigi

Member

tonistiigi commented Aug 10, 2017

This is one of the things fixed in the BuildKit cache model. moby/buildkit#58 has some more info.

@AkihiroSuda

Member

AkihiroSuda commented Aug 10, 2017

Distributed cache warming?

moby/buildkit#62

@kanzure


kanzure commented Aug 10, 2017

Yes, I see that this recent issue seems to mention DAG builder things: #32550
