Feature: CACHE OFF support, take II #42799

Open
brandonmpetty opened this issue Aug 29, 2021 · 9 comments

@brandonmpetty

Background

Adding the ability to tell Docker to avoid caching in the Dockerfile: #1996

I am interested in bringing up the "NOCACHE" issue once again.
Why? Because many think it would be a great feature, as I will detail with my use case below, and because the repo moderators have not locked these issues or given a final word on why they were closed in the first place.

Feel free to close this on its merits, but please lock it or #1996 and state a clear reason why.

Why I want the feature

I am interested from a performance perspective.
I am not sure how COPY is actually implemented. The docs only give a hint:

For the ADD and COPY instructions, the contents of the file(s) in the image are examined and a checksum is calculated for each file. The last-modified and last-accessed times of the file(s) are not considered in these checksums. During the cache lookup, the checksum is compared against the checksum in the existing images. If anything has changed in the file(s), such as the contents and metadata, then the cache is invalidated.

With a CACHE OFF option, the checksum analysis would not have to be performed at all, and if Docker is pre-calculating and storing checksums for the layer, that work could be avoided entirely as well.

Example

FROM node:14-alpine3.10 as ts-compiler
WORKDIR /usr/app
COPY package*.json ./
RUN npm install
CACHE OFF
COPY . ./

If we know that the COPY layer will almost always contain different files, we could avoid caching altogether after that point. This could be a huge savings if a lot of files are being hashed. Also, since this would be part of a build pipeline, I would assume that my next FROM statement would automatically set up caching again.

A typical TypeScript pattern is to do a build in the first stage, which requires dev dependencies, and then in the second stage npm install only the production dependencies and copy over the built output from the earlier stage. The goal is to cache BOTH npm install calls to avoid them in the future, since those layers almost never change given a package-lock.json, while completely avoiding needless checksum calculations and storage after those points.

@brandonmpetty
Author

#10682

@thaJeztah
Member

I don't think caching can be skipped if the files are added to an image layer, because of the content-addressable store used for image layers.

I would assume that my next FROM statement would automatically set up caching again.

In order for the next FROM to know whether it can use the cache, it needs to know if things changed since the last time it was built (the results of the COPY action, if that was the previous step).

A typical TypeScript pattern is to do a build in the first stage, which requires dev dependencies, and then in the second stage npm install only the production dependencies and copy over the built output from the earlier stage. The goal is to cache BOTH npm install calls to avoid them in the future, since those layers almost never change given a package-lock.json, while completely avoiding needless checksum calculations and storage after those points.

Perhaps you're able to provide a minimal example to showcase this scenario (what does such a Dockerfile look like?). Perhaps a combination of RUN --mount could resolve that use-case.
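A minimal sketch of that suggestion, assuming BuildKit's cache-mount syntax and the node:14-alpine3.10 stage from the example above; /root/.npm is npm's default cache directory. This keeps npm's download cache out of the image layers and reuses it across builds, although the resulting node_modules layer is still checksummed as usual:

# syntax=docker/dockerfile:1
FROM node:14-alpine3.10 as ts-compiler
WORKDIR /usr/app
COPY package*.json ./
# Persist npm's download cache in a BuildKit cache mount so that
# re-running this step does not re-download every package.
RUN --mount=type=cache,target=/root/.npm npm install
COPY . ./
RUN npm run build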

@thaJeztah added the area/builder and kind/feature labels Aug 30, 2021
@brandonmpetty
Author

Here is an example:

# First Stage
FROM node:14-alpine3.10 as node-build
WORKDIR /usr/app
COPY package*.json ./
RUN npm install
# Every layer above this line should be cached
# Ideally this COPY will never cache nor spend any time generating and storing hashes to cache
# This is where a NOCACHE option could help
COPY . ./
RUN npm run build

# Second Stage
# If NOCACHE was issued in the above section, it should not affect this one
# Ideally this next round of 'npm install' should create a new cache layer
FROM node:14-alpine3.10
WORKDIR /usr/app
COPY --from=node-build /usr/app/package*.json ./
RUN npm install --only=production
# The following COPY should ideally never be cached either, as in the previous section.
COPY --from=node-build /usr/app/dist/index.js ./
CMD npm start

I am simply hoping to provide Docker with enough information so that it can intelligently cache layers that need to be cached, and avoid wasted caching overhead when it will never be leveraged in a real-world scenario. The only time it would be, as things sit today, is if someone runs two builds in a row without making any change at all... and that is not a scenario I would optimize around.

This also assumes Docker is already set up to use cached layers again when it encounters a FROM, regardless of what happened before, since the previous layers in other stages have nothing to do with the layers of a new stage until someone decides to COPY --from. If that is not already in place, that alone (regardless of the NOCACHE option) would provide a large performance boost, as npm install calls are very expensive. This same pattern emerges, I believe, in almost every type of build pipeline regardless of language or tech.

@jakerobb

jakerobb commented Nov 5, 2021

I am simply hoping to provide Docker with enough information so that it can intelligently cache layers that need to be cached, and avoid wasted caching overhead when it will never be leveraged in a real world scenario. The only time it would be, as things sit today, would be if someone runs two builds in a row without making any change at all... which I would not optimize around that scenario.

@brandonmpetty, a feature like this would only be telling the builder not to reuse the previously-built layer on subsequent builds. The layer still needs to be built, stored, and checksummed, because it still needs to be used as part of the resulting image.

That said, I still think this would be a useful feature. Here's the justification:

As I understand it, Docker decides whether it can cache a given step based on whether any of the following has changed:

  1. Build arguments or environment variables used in the step
  2. Files from context used in the step
  3. Hash of the parent layer
  4. Contents of the line in the Dockerfile

A common use case for Docker is to encapsulate CI builds. The Docker build performs a git clone and executes the build. However, the git clone command itself never changes, and so the builder happily uses a cached result of that step if available. People work around this by passing $(date +%s) as a build arg, then echoing that build arg just prior to cloning, but that moves knowledge of the correct build procedure outside of the Dockerfile, which is not desirable.
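For reference, a sketch of that workaround (CACHE_BUST is a hypothetical argument name, and the repository URL is elided as elsewhere in this thread):

docker build --build-arg CACHE_BUST=$(date +%s) -t myimage .

and in the Dockerfile:

ARG CACHE_BUST
# The value changes on every invocation, so echoing it invalidates the cache
# here and for every later step in this stage.
RUN echo "$CACHE_BUST" && git clone ...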

To address that, people use ADD {url of a public web API that returns a time}. That introduces an unnecessary external dependency. It's fragile -- public web APIs come and go, and the network is unreliable. It adds clutter to the image. And it necessitates a comment explaining why it's being done.

The addition of a simple directive, e.g. NOCACHE, which would instruct the builder that all steps from there to the end of the stage are to be executed regardless of their apparent cacheability, and which itself did not introduce a new layer, would resolve this issue.
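To make the proposed shape concrete, a hypothetical sketch (NOCACHE is not part of any existing Dockerfile syntax, and the base image and packages are arbitrary):

FROM alpine:3.14
RUN apk add --no-cache git
# Cached as usual up to this point.
NOCACHE
# Every step from here to the end of the stage would re-execute on each build,
# and NOCACHE itself would add no layer.
RUN git clone ...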

If there is a reason not to do this, I haven't heard it yet, despite people asking for it for years. I'll reassert Brandon's original request, emphasis mine:

Feel free to close this on its merits, but please lock it or #1996 and state a clear reason why.

@azul

azul commented Jan 6, 2022

@jakerobb Here's what I am using to rebuild in CI if changes happened in git:

export LAST_SERVER_COMMIT=`git ls-remote $REPO "refs/heads/$BRANCH" | grep -o "^\S\+"`
docker build --build-arg LAST_SERVER_COMMIT="$LAST_SERVER_COMMIT" .

And then in the Dockerfile:

ARG LAST_SERVER_COMMIT
RUN git clone ...

This will only rebuild the following layers if the git repo actually changed.

@MauriceArikoglu

Not having basic features like this really makes me question humanity sometimes. As if designing features that make sense actually causes pain to some people. It's unfeasible that my Dockerfile does not execute its RUN statements, even though I changed the list of dependencies to be added, because it's using the cache in production (NO, I CAN'T BUST THE CACHE ON THIS SPECIFIC PRODUCTION SERVER).

wtf is this

@cpuguy83
Member

@MauriceArikoglu It would help if you posted what problem you are having.
You say:

It's unfeasible that my Dockerfile does not execute its RUN statements, even though I changed the list of dependencies to be added

What does this mean?
For a build, if the list of inputs doesn't change then the output doesn't change, and therefore the output is cached.
If you change the input then the RUN must be executed. If the output of the RUN is exactly the same, some subsequent steps may still be cached.

Inputs may include the order of execution, the build context, build arguments, or the base image of the build stage.
If you want to make sure something always runs, you can use --no-cache when executing the build.

Do you want to make sure the base image is always up to date? Use --pull when building.
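For example (the image tag is just a placeholder):

# Re-execute every step and re-pull the base image:
docker build --no-cache --pull -t myapp .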

Beyond that, I think something like RUN --always could be added for ensuring a specific step always executes.

@C-h-e-r-r-y

because its using the cache in production (NO I CANT BUST CACHE ON THIS SPECIFIC PRODUCTION SERVER).

Production is important, but what about development, when you have to build the same Dockerfile a dozen times per day? How often do you forget to add a newline to skip the cache and have to rebuild after that? For me, personally, this feature would save a lot of development hours. So why not implement it?

@akerouanton
Member

when you have to build the same image Dockerfile a dozen times per day?

Unless you have to update your system libraries a dozen times a day, there's no reason to rebuild your dev image that often. I bet you're doing that every time you change some of your project dependencies or even something in your code base. You should probably use a bind mount for that, as sketched below. Then you can restart your container (if you don't have hot reload) instead of rebuilding the image.
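A minimal sketch of that approach, reusing the node:14-alpine3.10 image and /usr/app path from the examples above (the npm script name is only an assumption):

# Bind-mount the source tree instead of baking it into the image;
# code changes show up in the container without any rebuild.
docker run --rm -it -v "$(pwd)":/usr/app -w /usr/app node:14-alpine3.10 npm run dev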

If you're not sure how to optimize your developer experience when Docker is in the loop, you should probably seek help on our forum or on our Community Slack.
