Enhancement: Add ADD caching to the docker builder #880

Closed
ismell opened this Issue Jun 12, 2013 · 56 comments

ismell commented Jun 12, 2013

Currently the docker builder will not use the build cache if it sees an ADD command. The ADD command could calculate a hash of the file contents and use that to determine if a layer already exists.

A few things to be considered are:

  1. Should the hash include the uid/gid?
  2. Should the hash include permissions?

My personal opinion is that the uid/gid should not be included, but the file permissions should be. The uid/gid should be set by a RUN chown so it matches whatever user in the container is desired.
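To make the proposal concrete, here is a minimal Go sketch of the kind of cache key being suggested: hash the file contents together with the permission bits, while deliberately leaving uid/gid out. The function name and layout are purely illustrative and are not Docker's actual implementation.

// Illustrative sketch only: one way the proposed ADD cache key could be
// computed, hashing file contents and permission bits but deliberately
// leaving uid/gid out, as suggested above. This is not Docker's code.
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
)

// addCacheKey returns a hex digest over a file's contents and its
// permission bits (but not its ownership), suitable as a build-cache key.
func addCacheKey(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	info, err := f.Stat()
	if err != nil {
		return "", err
	}

	h := sha256.New()
	// Mix in the permission bits; uid/gid are intentionally omitted.
	fmt.Fprintf(h, "mode:%o\n", info.Mode().Perm())
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return fmt.Sprintf("%x", h.Sum(nil)), nil
}

func main() {
	key, err := addCacheKey("Gemfile")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(key)
}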

Contributor

jpetazzo commented Jun 19, 2013

+1.

Contributor

apatil commented Jul 23, 2013

+1. Also, it would be great to have some way to 'ADD' an entire github repo and trigger cache invalidation when the repo is updated.

zorkian commented Jul 25, 2013

Just got bit by this, so +1.

kkiningh commented Aug 6, 2013

Also got bit by this, +1

Has anyone started to dig into ADD caching? I think it would be a great enhancement.

rca commented Aug 7, 2013

I heard from @shykes about a week ago that the last release had a tarball processing feature that this builds from. Not sure what developments have happened since, but I'm eagerly awaiting this as well.

Collaborator

shykes commented Aug 7, 2013

Note: the new tarball checksum is not in 0.5.1, but it's in master. We're leaving it for testing a little longer, since it's a pretty critical piece of the code. I encourage you to get a build of master and try it out if you feel like helping with the testing effort :)


Contributor

crosbymichael commented Aug 8, 2013

Is anyone interested in working on this feature? If you are interested and need help just let me know.

bgarret commented Aug 9, 2013

@crosbymichael I'd be interested, but I have very little experience with Go. If this is something that could be tackled by a beginner, a few pointers would be helpful to know where to look in the docker sources.

Contributor

crosbymichael commented Aug 9, 2013

@bgarret This may be a hard issue with minimal Go experience.

If you are looking to contribute, I can find a few issues that are smaller in scope that you could complete to gain experience with Go and the docker code base. Sound good?

bgarret commented Aug 10, 2013

Sounds good to me, I completely understand there are easier issues to get some experience on before moving on to harder issues.

Contributor

mhennings commented Aug 11, 2013

If we build ADD caching we should consider the runtime needed to build checksums to verify the src. In the case of a remote src it is easy through ETags, but locally a tarsum has a similar cost to a copy.

Therefore I think we should consider a different approach:

Currently the history of a container is linear.
What if we could build the ADD as a separate empty container (with a cacheable id) and afterwards JOIN it to an existing one?

This would require a JOIN ability that combines two containers. Such an ability is inherently unsafe, as it cannot ensure that the result makes sense, but it could enable more flexible combinations of images.

So instead of doing an ADD, we would make a separate Dockerfile that does the ADD on top of an empty root, and in place of the ADD we could do a JOIN.

While this requires more input from the user, it is much easier to control, as it is far less complex.
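To make the JOIN idea a bit more concrete, here is a rough Go sketch of what the mechanics could look like: unpack the exported filesystem tarball of one image on top of another image's rootfs directory. The file names and handling shown are assumptions for illustration only; a real implementation would also have to deal with symlinks, devices, ownership, whiteouts and conflicts.

// Rough sketch of the mechanics a JOIN could use: overlay the exported
// filesystem tar of one image onto another rootfs directory. It only
// handles directories and regular files; symlinks, devices, ownership
// and whiteouts are ignored. Not Docker code.
package main

import (
	"archive/tar"
	"io"
	"log"
	"os"
	"path/filepath"
)

func applyLayer(rootfs string, layer io.Reader) error {
	tr := tar.NewReader(layer)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		target := filepath.Join(rootfs, hdr.Name)
		switch hdr.Typeflag {
		case tar.TypeDir:
			if err := os.MkdirAll(target, os.FileMode(hdr.Mode)); err != nil {
				return err
			}
		case tar.TypeReg:
			if err := os.MkdirAll(filepath.Dir(target), 0755); err != nil {
				return err
			}
			f, err := os.OpenFile(target, os.O_CREATE|os.O_TRUNC|os.O_WRONLY, os.FileMode(hdr.Mode))
			if err != nil {
				return err
			}
			if _, err := io.Copy(f, tr); err != nil {
				f.Close()
				return err
			}
			f.Close()
		}
	}
}

func main() {
	// Hypothetical inputs: a rootfs unpacked from example/ubuntu:raring and
	// the exported tar of example/java:latest.
	layer, err := os.Open("java-latest.tar")
	if err != nil {
		log.Fatal(err)
	}
	defer layer.Close()
	if err := applyLayer("rootfs", layer); err != nil {
		log.Fatal(err)
	}
}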

Contributor

mhennings commented Aug 11, 2013

@crosbymichael what do you think about a JOIN command as an alternative to caching ADD?

Collaborator

shykes commented Aug 24, 2013

What would the UI look like @mhennings?

Contributor

mhennings commented Aug 24, 2013

For example, I want to add Java not just to one container, but to several.

Let's say we had three Dockerfiles:

FROM scratch
ADD precise-server-cloudimg-amd64-root.tar.gz /
MAINTAINER [...]

FROM scratch
ADD raring-server-cloudimg-amd64-root.tar.gz /
MAINTAINER [...]

FROM scratch
MAINTAINER [...]
ADD jdk-7u25-linux-x64.tar.gz /opt/
ENV JAVA_HOME /opt/jdk1.7.0_25

Now I have built them and we get the image tags:

example/ubuntu:precise
example/ubuntu:raring
example/java:latest

Instead of building the third one based off both containers, I would like to have a command like

docker join src1 src2 target

where src1, src2 and target are image tags:

docker join example/ubuntu:raring example/java:latest  example/ubuntu-java7:raring
docker join example/ubuntu:precise example/java:latest example/ubuntu-java7:precise

Inside a Dockerfile this could look like:

FROM example/ubuntu:raring

MAINTAINER [...]

JOIN example/java:latest 

Of course, that will only work if the images are built without requiring anything more than ADD, since scratch does not provide anything to run a command with.

👍

Contributor

gesellix commented Sep 24, 2013

👍

Contributor

graydon commented Oct 8, 2013

I ran into this sort of thing from several different angles (non-caching ADD; producing cache hits for RUN commands with nondeterministic results, as in #2031; not hashing contents of the filesystem in general). The more I think about it the more I think the system design used in things like redo (https://github.com/apenwarr/redo) and fbuild (https://github.com/felix-lang/fbuild) is appropriate here.

Concretely, this could mean (say) adding a command to Dockerfiles of the form DEPEND <path> that will hash the contents of <path> inside the container (recursively if it's a directory) and use that hash as additional key material to the next RUN command's cache probe.

This would not only let you control cached and non-cached ADD commands at a fine grain; it would also permit cache control on subsets of the image filesystem, such as the state of the system configs (say DEPEND /etc or DEPEND /var), the output of specific commands, git commit IDs stored inside an image, etc. If you want non-file dependencies, just write them to a file first, for example:

RUN curl -I <upstream-url> >>/tmp/deps
DEPEND /tmp/deps
RUN expensive-command

Note: this is independent of whether docker hashes the entire image in order to calculate the final image ID, as RUN cache control needs to know what input to look at before it decides whether to run a given RUN step or use the cache.
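A small Go sketch of how the DEPEND hash might be computed, assuming it recursively hashes file names and contents under a path and mixes the digest into the next RUN step's cache probe; the function names are invented for the example and this is not Docker's code.

// Illustrative sketch of the DEPEND idea: recursively hash a path and use
// the digest as extra key material for the next RUN step's cache lookup.
// Not Docker's implementation; names are made up for the example.
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// dependDigest walks path (a file or directory) and hashes the relative
// names and contents of every regular file it finds.
func dependDigest(path string) (string, error) {
	h := sha256.New()
	err := filepath.Walk(path, func(p string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if !info.Mode().IsRegular() {
			return nil
		}
		rel, err := filepath.Rel(path, p)
		if err != nil {
			return err
		}
		fmt.Fprintf(h, "name:%s\n", rel)
		f, err := os.Open(p)
		if err != nil {
			return err
		}
		defer f.Close()
		_, err = io.Copy(h, f)
		return err
	})
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%x", h.Sum(nil)), nil
}

func main() {
	// The digest would be appended to the cache key of the following RUN.
	digest, err := dependDigest("/tmp/deps")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("cache key material:", digest)
}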

Contributor

graydon commented Oct 8, 2013

Er, of course, to be truly redo-like one might also want the ability to attach DEPEND commands to previous steps, and support reading a file list from within a container itself, since some dependencies are only discovered dynamically, and after performing an expensive command they depend on, such as:

RUN g++ -MD -MF /tmp/expensive.deps -c expensive.cpp
DEPEND --prev --from-file /tmp/expensive.deps

These are probably less-common needs in Dockerfiles than in other more serious "build systems", but have proven helpful enough to consider designing in from the get-go.

rca commented Oct 26, 2013

In case it helps anyone else, I put together a utility to externally cache ADD commands in Dockerfile at https://github.com/baremetal/docker-buildcache.

And now that I've published this, the official ADD cache is likely days away, so you're welcome. 😜

Contributor

crosbymichael commented Oct 26, 2013

@rca

Nice prototype. Now implement it in docker and be the hero. ;)

rca commented Oct 26, 2013

@crosbymichael I'd be happy to take a look, but I'd like some clarification about desired direction and some implementation details.

For instance, there are a number of approaches with respect to this feature in this issue; my prototype takes the easy route and builds a hash for the file being added and checks to see if an image for the state of that file exists in conjunction with the result of the previous command. While simple, it seems to work well in the usage scenarios I've tried thus far.

Additionally, the prototype cheats a bit and encodes the hash in the image's tag (repository name?). Does Docker have the ability to store metadata within an image that is searchable during the build process?

Finally, is this even the right approach? It doesn't really cover the other requests in this issue and is that okay?

Thanks!

+1 - use modified dates?

Contributor

gesellix commented Nov 3, 2013

👍
How does Git recognize changes? Wouldn't it be possible to delegate such checks to Git?

+1

Collaborator

shykes commented Nov 12, 2013

Absolutely, if the source is a git repo, docker build should detect that and use the current git hash + current git changes as a cache key.
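A minimal sketch, assuming the cache key is derived by shelling out to git for the current commit plus any uncommitted changes; this only illustrates the suggestion above and is not something Docker does.

// Sketch of the suggested git-based cache key: hash the current commit id
// together with any uncommitted changes in the source repo. Illustrative
// only; not part of Docker.
package main

import (
	"crypto/sha256"
	"fmt"
	"log"
	"os/exec"
)

func gitCacheKey(repo string) (string, error) {
	head, err := exec.Command("git", "-C", repo, "rev-parse", "HEAD").Output()
	if err != nil {
		return "", err
	}
	diff, err := exec.Command("git", "-C", repo, "diff", "HEAD").Output()
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(append(head, diff...))
	return fmt.Sprintf("%x", sum[:]), nil
}

func main() {
	key, err := gitCacheKey(".")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(key)
}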


grinich commented Nov 19, 2013

I'm new to Docker, and wondering if this can be used to cache dynamic commands like pip install -r requirements.txt. I have a handful of dependencies that compile from source during this step (super slow), but they very rarely change, so I'd like to cache them and speed up my builds.

rca commented Nov 19, 2013

@grinich Yes, it will help. Your Dockerfile likely contains the line RUN pip install -r requirements.txt, which is cached already. The problem is that ADDing your requirements.txt to the image busts the cache before that RUN command.

@kohsuke kohsuke added a commit to jenkinsci/selenium-tests that referenced this issue Nov 27, 2013

@kohsuke kohsuke ADD prevents caching, which makes it difficult for other fixtures to enhance this.

See moby/moby#880
fb589fa

@graydon graydon added a commit to graydon/docker that referenced this issue Dec 12, 2013

@graydon graydon Add testcases for ADD caching, closes #880. 6ac0e62

@Krijger Krijger referenced this issue in Krijger/docker-cookbooks Dec 13, 2013

Merged

Added supervisor conf.d directory configuration #3

@creack creack closed this in 15a6854 Dec 25, 2013

@creack creack added a commit that referenced this issue Dec 25, 2013

@creack creack Merge pull request #2809 from graydon/880-cache-ADD-commands-in-dockerfiles

Issue #880 - cache ADD commands in dockerfiles
efaf2ca

ismell commented Dec 25, 2013

Amazing!

Contributor

amuino commented Jan 19, 2014

I wonder why the file timestamps are used for the hash generation (though it seems only for local files). That causes a cache miss even when the file contents are identical.

For our use case, we would like to check out a "pristine copy" of a tool from git and build a docker image from it. There are a couple of steps that are very slow (downloading dependencies) and we would like them to use the cache… but because we check out a new copy every time, the file timestamps are different and the cache is skipped.

Contributor

graydon commented Jan 19, 2014

This was a somewhat arbitrary choice, since "file identity" is a bit muddy anyway; for example, if you tar up the filesystem you will get a different SHA256 of the tarball because the timestamps on the files inside the tarball change. Certain tools use timestamps (e.g. make(1)); they are "real" data, in some sense.

So: I opted to ignore timestamps only on remote ADDs (docker-initiated downloads), because the temp file was always going to be out-of-date for them (they are fetched to a local file every time in order to recalculate the SHA256), and to include them for local files (you can override this by touching the file to a fixed time before you run your build). I also thought this was a good balance because a user can force a particular cached addition to be invalid (while still accepting cache hits in the preceding steps) by touching the file to the present date.

Alternatively, I can imagine ignoring the timestamp in all cases. It wasn't something discussed in this bug, and it seems quite reasonable to me to change the behaviour.
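For anyone who wants to apply the "touch to a fixed time" workaround across a whole build context, here is a tiny Go sketch that normalizes every file's modification time to a fixed value before running docker build. The chosen timestamp is arbitrary, and the program is only an illustration of the workaround described above, not something Docker provides.

// Sketch of the touch workaround: normalize every file's mtime in the
// build context so the local-ADD hash stays stable across checkouts.
package main

import (
	"log"
	"os"
	"path/filepath"
	"time"
)

func main() {
	// Arbitrary fixed timestamp; any constant value works.
	fixed := time.Unix(0, 0)
	err := filepath.Walk(".", func(p string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if info.IsDir() {
			return nil
		}
		return os.Chtimes(p, fixed, fixed)
	})
	if err != nil {
		log.Fatal(err)
	}
}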

Member

tianon commented Jan 20, 2014

I think it's worth noting that you can get reasonably consistent timestamps out of git using something like git-set-file-times: https://git.wiki.kernel.org/index.php/ExampleScripts#Setting_the_timestamps_of_the_files_to_the_commit_timestamp_of_the_commit_which_last_touched_them

Contributor

amuino commented Jan 20, 2014

I had already hacked a workaround (using touch to set an arbitrary date). I can't think of a use case for using the dates, but I'm too new to Docker and biased by our own strategy.

Other than that, @graydon seems to imply that remote files are always downloaded and their checksums calculated? I would have expected the HTTP cache-related headers to be used, though it matches my observations (remote ADD is slow even when reporting a cache hit). We also have a workaround for that which involves using CMD with wget.

Would it be useful to start two different issues for these topics?

Member

tianon commented Jan 20, 2014

Two separate issues would definitely be worthwhile for discussing these topics, IMO.

Contributor

amuino commented Jan 20, 2014

The issue about access and modification times is already tracked as #3556 with a very similar use case.

Regarding the use of http headers for remote ADD, I've created #3672

niko commented Mar 26, 2014

I'm using Docker 0.9 and the ADD localfile dst command doesn't seem to be cached. My Dockerfile:

FROM ubuntu:quantal
RUN apt-get -y update
ENV DEBIAN_FRONTEND noninteractive

RUN apt-get -y install python-software-properties build-essential curl wget libxml2-dev libxslt-dev libcurl4-openssl-dev git libmysqlclient-dev libmysqlclient18 ruby-dev libshout3-dev libtag1-dev

# ruby-build
ENV RUBY_ROOT /usr/local
ENV RUBY_VERSION 2.1.1
ENV RUBYGEMS_VERSION rubygems-2.2.2

RUN mkdir /src && cd /src && git clone https://github.com/sstephenson/ruby-build.git && cd ruby-build && ./install.sh && rm -rf /src/ruby-build

RUN ruby-build $RUBY_VERSION $RUBY_ROOT

RUN cd /src && curl http://production.cf.rubygems.org/rubygems/$RUBYGEMS_VERSION.tgz | tar -xzf - && cd $RUBYGEMS_VERSION && ruby setup.rb && rm -rf /src/$RUBYGEMS_VERSION

RUN gem install bundler --pre

RUN mkdir -p /app
ADD Gemfile /app/Gemfile
ADD Gemfile.lock /app/Gemfile.lock
ADD vendor /app
RUN cd /app && bundle install --deployment
ADD . /app

And this is the output of the docker build command:

docker build -t radioadmin:$(date +%Y-%m-%d__%H_%M_%S) .
Uploading context 245524480 bytes
Step 1 : FROM ubuntu:quantal
 ---> 5ac751e8d623
Step 2 : RUN apt-get -y update
 ---> Using cache
 ---> 6bcf9b8f8819
Step 3 : ENV DEBIAN_FRONTEND noninteractive
 ---> Using cache
 ---> d96337059f26
Step 4 : RUN apt-get -y install python-software-properties build-essential curl wget libxml2-dev libxslt-dev libcurl4-openssl-dev git libmysqlclient-dev libmysqlclient18 ruby-dev libshout3-dev libtag1-dev
 ---> Using cache
 ---> b8486f273dd0
[…]
Step 11 : RUN gem install bundler --pre
 ---> Using cache
 ---> 29b0bf82230d
Step 12 : RUN mkdir -p /app
 ---> Using cache
 ---> 5f7ce6e3563f
Step 13 : ADD Gemfile /app/Gemfile
 ---> b2fc2d215108
Step 14 : ADD Gemfile.lock /app/Gemfile.lock
 ---> c826612758d0
Step 15 : ADD vendor /app
 ---> 5dbbc7a96d4f
[…]

So up to step 12 the commands are cached. Starting with ADD Gemfile the cache is busted even though the Gemfile isn't modified.

Any help would be highly appreciated.

Kind regards, Niko.

PS: I forgot:

 % docker info
Containers: 87
Images: 69
Devmapper disk use: Data: 13468.6/102400.0 Metadata: 14.0/2048.0
WARNING: No swap limit support
 % docker version
Go version (client): go1.1.2
Go version (server): go1.1.2
Last stable version: 0.9.1
 % uname -a
Linux tier 3.13.6-1-ARCH #1 SMP PREEMPT Fri Mar 7 22:47:48 CET 2014 x86_64 GNU/Linux

Contributor

jpetazzo commented Apr 2, 2014

Hi @niko,

Can you:

  • check if the problem still happens (I guess it does, since this was just 1 week ago, but let's make sure!),
  • if it's the case, test again with a shorter Dockerfile (i.e., if ADD Gemfile /app/Gemfile is never cached, just remove everything after that line),
  • test again with a smaller context (in other words, remove everything except the Dockerfile and the Gemfile),
  • if the problem still happens with that small context, open a new issue, and attach the Dockerfile and Gemfile so that other people can reproduce?

Thank you.

niko commented Apr 3, 2014

Still happens. Short Dockerfile:

 % cat Dockerfile
FROM ubuntu:quantal
ADD a_file /tmp/a_file
RUN cat /tmp/a_file

% cat a_file
some content

 % docker build .
Uploading context 65751040 bytes
Step 1 : FROM ubuntu:quantal
 ---> 5ac751e8d623
Step 2 : ADD a_file /tmp/a_file
 ---> 9e802055bb62
Step 3 : RUN cat /tmp/a_file
 ---> Running in a6bd52aeb52e
some content
 ---> e5f38a220918
Successfully built e5f38a220918

 % docker build .
Uploading context 65751040 bytes
Step 1 : FROM ubuntu:quantal
 ---> 5ac751e8d623
Step 2 : ADD a_file /tmp/a_file
 ---> 79c4f0a0bb91
Step 3 : RUN cat /tmp/a_file
 ---> Running in 153d60febfa9
some content
 ---> cd1be7c76c1d
Successfully built cd1be7c76c1d

I tried the same with the busybox base image and it doesn't cache an added file. Funnily enough, there's another issue:

 % docker build .
Uploading context 65751040 bytes
Step 1 : FROM busybox
 ---> 769b9341d937
Step 2 : ADD a_file /tmp/a_file
 ---> ec3b39a444c5
Step 3 : RUN cat /tmp/a_file
 ---> Running in 17c3b1b13018
lxc-start: No such file or directory - failed to exec /.dockerinit
lxc-start: invalid sequence number 1. expected 4
lxc-start: failed to spawn '17c3b1b13018bdff1384f1964a85f9635e95a01f36dea811323a7021c7c4cc28'
lxc-start: Device or resource busy - cgroup_rmdir: failed to delete /sys/fs/cgroup/blkio/lxc/17c3b1b13018bdff1384f1964a85f9635e95a01f36dea811323a7021c7c4cc28
lxc-start: Device or resource busy - cgroup_rmdir: failed to delete /sys/fs/cgroup/net_cls/lxc/17c3b1b13018bdff1384f1964a85f9635e95a01f36dea811323a7021c7c4cc28
lxc-start: Device or resource busy - cgroup_rmdir: failed to delete /sys/fs/cgroup/freezer/lxc/17c3b1b13018bdff1384f1964a85f9635e95a01f36dea811323a7021c7c4cc28
lxc-start: Device or resource busy - cgroup_rmdir: failed to delete /sys/fs/cgroup/devices/lxc/17c3b1b13018bdff1384f1964a85f9635e95a01f36dea811323a7021c7c4cc28
lxc-start: Device or resource busy - cgroup_rmdir: failed to delete /sys/fs/cgroup/memory/lxc/17c3b1b13018bdff1384f1964a85f9635e95a01f36dea811323a7021c7c4cc28
lxc-start: Device or resource busy - cgroup_rmdir: failed to delete /sys/fs/cgroup/cpu,cpuacct/lxc/17c3b1b13018bdff1384f1964a85f9635e95a01f36dea811323a7021c7c4cc28
lxc-start: Device or resource busy - cgroup_rmdir: failed to delete /sys/fs/cgroup/cpuset/lxc/17c3b1b13018bdff1384f1964a85f9635e95a01f36dea811323a7021c7c4cc28
Error build: The command [/bin/sh -c cat /tmp/a_file] returned a non-zero code: 255

I've no idea whether both issues are due to a faulty docker installation here. I installed docker via the standard pacman package:

% pacman -Si docker
Repository     : community
Name           : docker
Version        : 1:0.9-1
Description    : Pack, ship and run any application as a lightweight container
Architecture   : x86_64
URL            : http://www.docker.io/
Licenses       : Apache
Groups         : None
Provides       : None
Depends On     : bridge-utils  iproute2  device-mapper  sqlite  systemd
Optional Deps  : btrfs-progs: btrfs backend support
                 lxc: lxc backend support
Conflicts With : None
Replaces       : None
Download Size  : 3486.84 KiB
Installed Size : 21444.00 KiB
Packager       : Sébastien Luttringer <seblu@seblu.net>
Build Date     : Mon 10 Mar 2014 11:42:24 PM CET
Validated By   : MD5 Sum  SHA256 Sum  Signature

Niko.

Contributor

jpetazzo commented Apr 3, 2014

Interesting. What kind of filesystem are you using? Also, are you on a physical machine, in a VM...?

niko commented Apr 4, 2014

Arch Linux on an MBP with ext4. I guess the crucial function would be hashPath. I tried and failed to build a standalone version of the function. The other point where things could go wrong, I imagine, would be the cache store. Is there any way to inspect the cache?

Niko.

Contributor

amuino commented Apr 4, 2014

@niko Regarding the cache issue:
Judging by the size of the uploaded context, I guess there are several other files involved (even if they are not used by the Dockerfile).

Can you try an empty directory which only includes the Dockerfile and a_file?

niko commented Apr 4, 2014

@amuino doesn't make any difference:

 % ls -lah
total 16K
drwxr-xr-x  2 niko users 4.0K Apr  4 10:25 .
drwxr-xr-x 12 niko users 4.0K Apr  4 10:25 ..
-rw-r--r--  1 niko users   13 Apr  3 10:29 a_file
-rw-r--r--  1 niko users   36 Apr  3 10:43 Dockerfile

 % cat a_file
some content

 % cat Dockerfile
FROM busybox
ADD a_file /tmp/a_file

 % docker build .
Uploading context 10240 bytes
Step 1 : FROM busybox
 ---> 769b9341d937
Step 2 : ADD a_file /tmp/a_file
 ---> edbe1827db5d
Successfully built edbe1827db5d

 % docker build .
Uploading context 10240 bytes
Step 1 : FROM busybox
 ---> 769b9341d937
Step 2 : ADD a_file /tmp/a_file
 ---> 440d62f1dad8
Successfully built 440d62f1dad8

Niko.

Contributor

amuino commented Apr 4, 2014

Last idea… can you show what stat * reports before each docker build .

I'm wild guessing that some file metadata gets changed and that invalidates the cache.

niko commented Apr 4, 2014

 % stat *
  File: ‘a_file’
  Size: 13          Blocks: 8          IO Block: 4096   regular file
Device: 804h/2052d  Inode: 1456366     Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/    niko)   Gid: (  100/   users)
Access: 2014-04-04 10:26:14.712428891 +0200
Modify: 2014-04-03 10:29:44.141006816 +0200
Change: 2014-04-04 10:25:48.436934915 +0200
 Birth: -
  File: ‘Dockerfile’
  Size: 36          Blocks: 8          IO Block: 4096   regular file
Device: 804h/2052d  Inode: 1456400     Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/    niko)   Gid: (  100/   users)
Access: 2014-04-04 10:26:17.792291436 +0200
Modify: 2014-04-03 10:43:53.038065179 +0200
Change: 2014-04-04 10:25:55.033307181 +0200
 Birth: -

 % docker build .
Uploading context 10240 bytes
Step 1 : FROM busybox
 ---> 769b9341d937
Step 2 : ADD a_file /tmp/a_file
 ---> 2762c282a774
Successfully built 2762c282a774

 % stat *
  File: ‘a_file’
  Size: 13          Blocks: 8          IO Block: 4096   regular file
Device: 804h/2052d  Inode: 1456366     Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/    niko)   Gid: (  100/   users)
Access: 2014-04-04 10:26:14.712428891 +0200
Modify: 2014-04-03 10:29:44.141006816 +0200
Change: 2014-04-04 10:25:48.436934915 +0200
 Birth: -
  File: ‘Dockerfile’
  Size: 36          Blocks: 8          IO Block: 4096   regular file
Device: 804h/2052d  Inode: 1456400     Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/    niko)   Gid: (  100/   users)
Access: 2014-04-04 10:26:17.792291436 +0200
Modify: 2014-04-03 10:43:53.038065179 +0200
Change: 2014-04-04 10:25:55.033307181 +0200
 Birth: -

 % docker build .
Uploading context 10240 bytes
Step 1 : FROM busybox
 ---> 769b9341d937
Step 2 : ADD a_file /tmp/a_file
 ---> 1929b093c7aa
Successfully built 1929b093c7aa

Looks the same to me. :s

Niko.

Contributor

amuino commented Apr 4, 2014

Yeah, can't spot anything.

This might be some kind of regression. You might want to open a new issue, since this one is about the initial feature and is already closed. A new issue might have better chances of being noticed by the maintainers. Agree @jpetazzo?

niko commented Apr 4, 2014

Funnily enough: on our Ubuntu server running Docker 0.9.1 the files do get cached. It must be something wrong with my development machine. One difference I can see is that my dev machine mounts the ext4 volume with relatime. Could that have an effect?

I'll happily open a new issue if @jpetazzo gives the go-ahead.

Niko.

Contributor

jpetazzo commented Apr 4, 2014

If you can test with/without relatime, that would be awesome. Then, yes, I think it warrants a new issue.

If you could be so kind as to re-copy all the information that you have put here (to avoid having to go back and forth between both issues), that would be awesome.

Thanks a lot!

niko commented Apr 7, 2014

I'll file a new issue.

m-barthelemy commented

Hi,

If I understand correctly, Docker is now supposed to checksum any file added through ADD and then decide whether it should use the cache or not.
If so, this is a nice improvement.

My situation is a bit like this issue (dotcloud#2031):

  • Suppose I have a requirements file like in the previous issue link. It could be a Python requirements file, a NodeJS packages file, a Rails Gemfile...
    All these examples have something in common: a potentially big list of modules to install, and the install commands can take a lot of time to complete.
  • Now, if I modify this dependencies/requirements file, Docker should detect that the file has changed and won't use the cache. If the file hasn't changed, Docker will use the cache. That is okay.
  • However, if this requirements file is modified, in many situations it would save a lot of time to be able to do something like reusing the previously cached image with all the modules/packages, but still run the command (pip install, bundle install, npm install) and have it only install what's relevant according to the requirements file changes.

Even if you don't have a big Python/Node/Rails project, installing all the modules can often take minutes (10, 15, ...).

Would it make sense to be able to reuse a previously built layer, and, on top of it, re-run a command?

If so, from my newbie point of view, it doesn't look too hard (theoretically) to have Docker mount the previously cached layer and run the command on top of it if we instruct it to do so, maybe with a new directive (RUN OVERWRITE or something like that).

If it's totally stupid and irrelevant, is there any trick that could be used to lead to the same result?

ismell commented Jun 25, 2014

@m-barthelemy This feature should support your use case. You just need to use one trick.

FROM my-ruby-container
ADD Gemfile /app/Gemfile
RUN cd /app && bundle install
ADD . /app

By adding the Gemfile in its own ADD command, it should only bust the cache if the file has changed. If the cache isn't busted, the RUN command will also use a cached copy. You now have cached builds!

m-barthelemy commented

@ismell: thanks! That confirms the first part of my question.

However, I was trying to ask if it would make sense to add another feature/option:

What if your Gemfile has changed because you added one gem and maybe updated versions for a couple of other gems? Considering gem dependencies, a simple Gemfile often generates a Gemfile.lock with dozens of entries.

If I change a small portion of my Gemfile, the cache wouldn't be used at all during the docker build, thus bundle install would fetch, compile and install the full list of gems again. This takes a lot of time.
Gemfile is just an example; it would be the same for Node.js, Python,... projects.

Contributor

jpetazzo commented Jun 25, 2014

If you change a single line in the Gemfile, yes, it will invalidate the whole thing.

If you have tons of dependencies and don't want to reinstall them all when just adding/changing a single one, you could look at the technique described here: http://jpetazzo.github.io/2013/12/01/docker-python-pip-requirements/ (paragraph "One-by-one pip install"). It's Python but it's easily mappable to Ruby.

A slightly better approach if you really have hundreds of dependencies would be to break them down into multiple files, maybe.

HTH,


Contributor

erikh commented Jun 25, 2014


The problem with Bundler and RubyGems in general is that gem activation (which version of a library to use, etc) happens at run-time and install time. Therefore, just installing the dependencies individually makes no guarantees you’ll get the right version of a library.

http://erik.hollensbe.org/2013/05/11/gem-activation-and-you/ is a set of articles I wrote on the subject of packaging and how rubygems and bundler work as a unit, if you have any interest.

I have no opinion on the feature itself — just figured this context would be valuable to have.

ianbytchek commented

I'm new to the whole thing, apologies if the question is stupid. I'm dealing with a couple of ADD commands that load large files from S3. This takes a long, long time each time I make a build. Once the file is downloaded the cache kicks in and everything is perfect:

Step 12 : ADD http://s3.amazonaws.com/downloads.basho.com/riak-cs/${RIAK_CS_SHORT_VERSION}/${RIAK_CS_VERSION}/ubuntu/precise/riak-cs_${RIAK_CS_VERSION}-1_amd64.deb /
 ---> Using cache
 ---> 8b36a87dda5f

Does this discussion include caching of files transferred over the network, or is it purely about docker build caching? If it doesn't, does anyone know how to make Docker cache file downloads and use a local cache if one is available?

Member

thaJeztah commented Jan 13, 2015

@ianbytchek the issue tracker is not meant for support questions, please use docker-user or the #docker IRC channel.

Perhaps something like RUN curl ...... might help in your case to get caching of the download. Docker itself will always download remote files to check whether the cache should be used.
