Subsequent COPY instructions re-adds all files in every layer instead of only the files that have changed #21950

Open
motin opened this Issue Apr 12, 2016 · 44 comments

motin commented Apr 12, 2016

Under certain circumstances, subsequent COPY instructions re-add all files in every layer instead of only the files that have changed.

ISSUE RENAMED: The original suggestion was to add a SYNC instruction, but the COPY instruction should already cover the use cases for which the SYNC instruction was intended. The original post was as follows:

Suggestion:
The SYNC instruction compares <src> and <dest> and performs the necessary changes for <dest> to become identical to <src>.

This would alleviate a lot of pain present when building images that contain application source code that changes incrementally and often, since the whole source code needs to be COPIED (COPY . /app) to the image again and again, leading to longer deploy cycles and tons of unnecessary data pushed to and stored in the registries.

In extreme cases, the source code directory can reach gigabytes in size, and in order to publish some smaller changes in a few hundred files, all of those gigabytes need to be pushed and subsequently pulled by those who wish to get access to the changes, or by the servers that are to serve the new source code.

A SYNC command would instead allow for new image layers to include only the files that actually changed in comparison to the existing image contents, which probably won't amount to more than a few hundred kilobytes in most cases.

(Optionally, an ability to access the build context from the RUN instruction would make this particular instruction unnecessary by allowing for instance rsync to compare build context against image contents)

The only feasible workaround today seems to be to use rsync to analyze the differences between two images and then use the changelog output to craft a tar file containing the relevant changes.
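As a rough illustration of that workaround, here is a minimal Python sketch that builds a tar file containing only the new or modified files. It assumes both image filesystems have already been extracted into two local directories; the function names `changed_files` and `write_delta_tar` are hypothetical, and deleted files are not handled at all.

```python
import filecmp
import os
import tarfile


def changed_files(old_dir, new_dir):
    """Return relative paths in new_dir that are new or differ from old_dir."""
    changed = []
    for root, _dirs, files in os.walk(new_dir):
        for name in files:
            new_path = os.path.join(root, name)
            rel = os.path.relpath(new_path, new_dir)
            old_path = os.path.join(old_dir, rel)
            # shallow=False forces a byte-by-byte comparison of the contents.
            if not os.path.isfile(old_path) or not filecmp.cmp(
                old_path, new_path, shallow=False
            ):
                changed.append(rel)
    return sorted(changed)


def write_delta_tar(old_dir, new_dir, tar_path):
    """Write a tar archive holding only the new or modified files."""
    paths = changed_files(old_dir, new_dir)
    with tarfile.open(tar_path, "w") as tar:
        for rel in paths:
            tar.add(os.path.join(new_dir, rel), arcname=rel)
    return paths
```

The resulting archive is only as large as the files that changed, which is the property the SYNC suggestion is after.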

Contributor

phemmer commented Apr 12, 2016

Docker already does this. When docker generates a new layer, it calculates the difference. It doesn't store the entire filesystem.

This can be demonstrated in the following example:

# ll
total 97688
-rw-r--r-- 1 phemmer adm 10000000 2016/04/12-08:29:36 0.data
-rw-r--r-- 1 phemmer adm 10000000 2016/04/12-08:29:36 1.data
-rw-r--r-- 1 phemmer adm 10000000 2016/04/12-08:29:36 2.data
-rw-r--r-- 1 phemmer adm 10000000 2016/04/12-08:29:36 3.data
-rw-r--r-- 1 phemmer adm 10000000 2016/04/12-08:29:36 4.data
-rw-r--r-- 1 phemmer adm 10000000 2016/04/12-08:29:36 5.data
-rw-r--r-- 1 phemmer adm 10000000 2016/04/12-08:29:36 6.data
-rw-r--r-- 1 phemmer adm 10000000 2016/04/12-08:29:36 7.data
-rw-r--r-- 1 phemmer adm 10000000 2016/04/12-08:29:37 8.data
-rw-r--r-- 1 phemmer adm 10000000 2016/04/12-08:29:37 9.data
-rw-r--r-- 1 phemmer adm       22 2016/04/12-08:21:35 Dockerfile
-rw-r--r-- 1 phemmer adm       19 2016/04/12-08:27:47 Dockerfile.update

# cat Dockerfile
FROM busybox
COPY . /

# cat Dockerfile.update 
FROM test
COPY . /

# docker build -t test .
Sending build context to Docker daemon   100 MB
Step 0 : FROM busybox
 ---> 8c2e06607696
Step 1 : COPY . /
 ---> 8708bc7d48c9
Removing intermediate container c61f255a6bd1
Successfully built 8708bc7d48c9

# echo foo >> 0.data

# docker build -t test -f Dockerfile.update .                                                         
Sending build context to Docker daemon   100 MB
Step 0 : FROM test
 ---> 8708bc7d48c9
Step 1 : COPY . /
 ---> da36496146d8
Removing intermediate container 8cf413dc4621
Successfully built da36496146d8

# docker history test
IMAGE               CREATED              CREATED BY                                      SIZE                COMMENT
da36496146d8        21 seconds ago       /bin/sh -c #(nop) COPY dir:ec5ab01cc732f96db7   10 MB               
8708bc7d48c9        About a minute ago   /bin/sh -c #(nop) COPY dir:5647d3b25c72c1acab   100 MB              
8c2e06607696        12 months ago        /bin/sh -c #(nop) CMD ["/bin/sh"]               0 B                 
6ce2e90b0bc7        12 months ago        /bin/sh -c #(nop) ADD file:8cf517d90fe79547c4   2.43 MB             
cf2616975b4a        12 months ago        /bin/sh -c #(nop) MAINTAINER Jérôme Petazzo     0 B                 

Notice how the size of the last layer (first in the list, since docker outputs the history in reverse) is 10 MB, which is the size of the file we changed. The other 90 MB worth of files were unchanged, and thus not added to the layer.

motin commented Apr 12, 2016

@phemmer Whoa, great news! But in what Docker version was this fixed? I am using 1.10.3 on OSX (latest available from https://www.docker.com/products/docker-toolbox) and getting:

IMAGE               CREATED             CREATED BY                                      SIZE                COMMENT
6512838f1036        4 seconds ago       /bin/sh -c #(nop) COPY dir:7d72c14da84a2496de   100 MB
b2d4c3b36a37        20 seconds ago      /bin/sh -c #(nop) COPY dir:25e5b0c3d4bd31c7c6   100 MB
47bcc53f74dc        3 weeks ago         /bin/sh -c #(nop) CMD ["sh"]                    0 B
<missing>           3 weeks ago         /bin/sh -c #(nop) ADD file:47ca6e777c36a4cfff   1.113 MB

Notice how the size of the last layer (first in the list, since docker outputs the history in reverse) is 100 MB, meaning that all files were re-added in the last layer.

Member

thaJeztah commented Apr 12, 2016

@motin the difference is that in the example @phemmer gave, he split the image up into a "base" image (using Dockerfile) and an image that extends the image (Dockerfile.update). If you only rebuild the second one, docker compares the change with the layers above it (which are in the base-image), and only adds files that are changed.

The building is still done on the daemon, so docker will upload the whole build-context to the daemon on each build (#9553 is a proposal to make this smarter)

motin commented Apr 12, 2016

@thaJeztah I too split up the image in the exact same way. Actually, I re-ran @phemmer's example from top to bottom and got the 100MB-sized layer outcome. Maybe there is some bug causing these differences in outcome? On what system and with what Docker version does it work? It sure does not work on 1.10.3 on OSX.

Member

thaJeztah commented Apr 12, 2016

@motin oh! I think there's a typo in @phemmer's example; he's using the same name twice (both in the first and second build). Change -t test to something else in the second build, and it should work.

Member

thaJeztah commented Apr 12, 2016

hm, although, that shouldn't matter the first time you run it (it should just untag the image)

motin commented Apr 12, 2016

@thaJeztah Thanks, it did not change the outcome however.

I set up a gist and the following one-liner to replicate the issue locally:

git clone https://gist.github.com/cca880c647263eb5e98d9f1e0d60a3c5.git replicate-docker-issue-21950 && bash replicate-docker-issue-21950/replicate-docker-issue-21950.sh

Here is my output from the one-liner

Member

thaJeztah commented Apr 12, 2016

hm, looking into this now; it seems that for some reason it's indeed including all files again in the last layer, not just the changed file

motin changed the title from "Add Dockerfile SYNC instruction" to "Subsequent COPY instructions re-adds all files in every layer instead of only the files that have changed" on Apr 12, 2016

motin commented Apr 12, 2016

@thaJeztah Thanks, I renamed the issue to better reflect the underlying issue.

Member

thaJeztah commented Apr 12, 2016

Yes, these were my steps to reproduce;

Dockerfile.base;

FROM scratch
COPY . /data

Dockerfile

FROM test
COPY . /data

mkfile 10m data.0
docker build -t test -f Dockerfile.base .

mkfile 10m data.1
docker build -t test .

Then, after saving the image (docker save -o test.tar test) and extracting the image and its layers, I get:

# snip #
├── 81650965dc14fed041e07f15b40af4e662d422f2914365aa4a2d2f704ee729a8
│   ├── VERSION
│   ├── data
│   │   ├── Dockerfile
│   │   ├── Dockerfile.base
│   │   ├── data.0
│   │   └── data.1
│   ├── json
├── 82963b397b3e53fc9cecbb81f45c81591fa1fe48c8c71fc5b3013470636c5122
│   ├── VERSION
│   ├── data
│   │   ├── Dockerfile
│   │   ├── Dockerfile.base
│   │   └── data.0
│   ├── json

So data.0 seems to be embedded twice in the image, in each layer that added it.

(testing on 1.11.0-rc4)
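A check like the one above can also be scripted. The following is a minimal Python sketch, assuming each layer is already available as its own tar archive on disk; the function name `files_in_multiple_layers` and the layer labels are hypothetical, and how the per-layer archives are obtained from the saved image is left open.

```python
import tarfile
from collections import defaultdict


def files_in_multiple_layers(layer_tars):
    """Map each regular-file path to the layers containing it, keeping
    only paths that occur in more than one layer.

    layer_tars: dict of layer label -> path to that layer's tar archive.
    """
    seen = defaultdict(list)
    for layer_name, tar_path in layer_tars.items():
        with tarfile.open(tar_path) as tar:
            for member in tar.getmembers():
                if member.isfile():
                    seen[member.name].append(layer_name)
    return {path: layers for path, layers in seen.items() if len(layers) > 1}
```

A file such as data/data.0 that is stored in both layers would show up in the result with both layer labels. Note that a file legitimately modified in a later layer also appears twice, so the output is a starting point for inspection rather than proof of duplication.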

Contributor

phemmer commented Apr 12, 2016

> @motin oh! I think there's a typo in @phemmer's example; he's using the same name twice (both in the first and second build). change -t test to something else in the second build, and it should work

No typo. That was deliberate (and works fine).

motin commented Apr 12, 2016

@thaJeztah What is your output from the one-liner above, and on what system are you running Docker?

motin commented Apr 12, 2016

> No typo. That was deliberate.

Thanks for the clarification. The gist has been updated to use the same tags as in your example: https://gist.github.com/motin/cca880c647263eb5e98d9f1e0d60a3c5

Still the same 100MB-sized layer outcome regardless of the tags used.

Contributor

phemmer commented Apr 12, 2016

To answer the question about what version I'm using: 1.8.3. (The underlying OS is always Linux, even on Mac. My client is Linux too, just in case that makes a difference for some unlikely reason.)

motin commented Apr 12, 2016

> To answer the question about what version I'm using, 1.8.3

@phemmer Thanks. I now ran the one-liner on Docker 1.8.3 running on Debian, however the issue remains. Apparently, this has nothing to do with Docker version or if it is running in a native Linux environment or in OSX.

What do you get if you run the one-liner on your box?

Member

thaJeztah commented Apr 12, 2016

@phemmer having the same on 1.9.1, now trying docker 1.8.3

Contributor

schmunk42 commented Apr 12, 2016

I also ran the script on 1.7 (Debian), 1.10 (Ubuntu), 1.11-rc4 (Mac) on Mac, via SSH and a swarm, all with the 100 MB / 100 MB outcome.

[update]
All tests were executed on AUFS.

Contributor

justincormack commented Apr 12, 2016

Can replicate on aufs, but on overlay I get 10MB/100MB.

Contributor

phemmer commented Apr 12, 2016

I was beginning to wonder if this was specific to the storage driver. I'm using btrfs. (and I get the proper behavior)

Member

thaJeztah commented Apr 12, 2016

looks to be aufs yes

Contributor

schmunk42 commented Apr 12, 2016

Slightly OT: how can I (easily) test other storage drivers? I looked through the docs and also tried to create machines with docker-machine, but none of those with storage drivers other than aufs or devicemapper worked for me.

And (bonus question) what's the current state or plan regarding docker and its preferred storage driver?

Contributor

justincormack commented Apr 12, 2016

@schmunk42 Docker for Mac can switch between aufs and overlay, although it is not yet documented I see. Generally those are easy to switch as they do not involve any reformatting, you can just change the /etc/docker/daemon.json config file or the startup arguments.

In terms of advice, there is @jfrazelle's view here: https://blog.jessfraz.com/post/the-brutally-honest-guide-to-docker-graphdrivers/. The choice you make depends heavily on which base distro you are using.

Contributor

schmunk42 commented Apr 12, 2016

Confirmed to be working as mentioned (10 MB / 100 MB) on 1.11.0-rc5 with overlay.

Member

AkihiroSuda commented Apr 12, 2016

@schmunk42 This page is also great: https://github.com/docker/docker/blob/master/docs/userguide/storagedriver/selectadriver.md#future-proofing

It says AUFS and Devicemapper (direct-lvm) have "production-ready" status.

Contributor

jessfraz commented Apr 12, 2016

I would agree with AUFS but you've been warned about devicemapper


Contributor

jessfraz commented Apr 12, 2016

[Image: tumblr_inline_nr6ulhjpk61t2b0m7_500]

Member

tonistiigi commented Apr 12, 2016

This is happening because storage drivers use different methods to compute layer differences. Everything but aufs uses NaiveDiffDriver, which basically means that on every commit the new and parent directories are compared based on file sizes and modified times. Aufs uses the native filesystem features, and the new layer is the copy-up folder. Afaik the builder itself doesn't optimize this case atm, so all files are always copied; only with the naive driver are some files ignored later. Changing the diff method for aufs isn't a solution, as the current method has many benefits (both performance and accuracy). There are also some plans to change overlay to a similar method.
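To make the size-and-mtime comparison concrete, here is a minimal Python sketch of that kind of diff between a parent directory and a new directory. It is not Docker's actual NaiveDiffDriver code; the names `snapshot` and `naive_diff` are hypothetical, and details such as permissions, symlinks and directories are ignored.

```python
import os


def snapshot(root):
    """Map relative path -> (size, mtime_ns) for every regular file under root."""
    meta = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            meta[os.path.relpath(path, root)] = (st.st_size, st.st_mtime_ns)
    return meta


def naive_diff(parent_root, new_root):
    """Return (added_or_modified, deleted), judged only by size and mtime."""
    parent, new = snapshot(parent_root), snapshot(new_root)
    changed = sorted(p for p, m in new.items() if parent.get(p) != m)
    deleted = sorted(p for p in parent if p not in new)
    return changed, deleted
```

A file that was copied again but kept the same size and modification time is treated as unchanged here, which matches the 10 MB result seen with the non-aufs drivers above. The cost is a walk over both trees on every commit.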

motin commented Apr 12, 2016

@tonistiigi Would you say that COPY should be expected to copy only changed files in the near future, regardless of storage driver? Or would it make sense to implement a SYNC instruction that targets this specific behavior?

Member

tonistiigi commented Apr 12, 2016

@motin Not sure how I feel about a new instruction but I think the sync behavior shouldn't be the default for COPY if it comes with a performance regression.

motin commented Apr 12, 2016

@tonistiigi Don't forget about the potentially huge performance enhancement achieved when thousands of users no longer are cramming in petabytes of duplicate data into the registries. Would it be worth it to have a slightly slower COPY instruction compared to this larger benefit to the Docker ecosystem?

motin commented Apr 12, 2016

Does anyone know which storage drivers are used in Docker Hub's automated build infrastructure?

Contributor

phemmer commented Apr 13, 2016

@tonistiigi

> Aufs uses the native filesystem features and the new layer is the copy-up folder. Afaik builder itself doesn't optimize this case atm so all files are always copied, only in case of the naive driver some files are ignored later. Changing the diff method for aufs isn't a solution as it has many benefits (both performance and accuracy).
> ...
> I think the sync behavior shouldn't be the default for COPY if it comes with a performance regression.

What about using the native filesystem diff capabilities to get the initial list of changes, and then doing the file size/time checks on the results? That should be fast, no?
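The two-stage idea could look roughly like the Python sketch below. It assumes the filesystem has already produced a list of candidate paths (hypothetically, whatever ended up in the copy-up directory); `filter_candidates` and its parameters are made-up names, not anything in Docker.

```python
import os


def same_size_and_mtime(path_a, path_b):
    """Cheap equality check based on file metadata only."""
    a, b = os.stat(path_a), os.stat(path_b)
    return a.st_size == b.st_size and a.st_mtime_ns == b.st_mtime_ns


def filter_candidates(candidates, parent_root, upper_root):
    """Keep only candidates that are new or differ from the parent's copy.

    candidates: relative paths reported as touched by the native diff.
    """
    kept = []
    for rel in candidates:
        parent_path = os.path.join(parent_root, rel)
        upper_path = os.path.join(upper_root, rel)
        if not os.path.exists(parent_path) or not same_size_and_mtime(
            parent_path, upper_path
        ):
            kept.append(rel)
    return kept
```

The work here is proportional to the number of candidates rather than to the size of the whole tree, which is why it should stay fast.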

Member

AkihiroSuda commented Apr 13, 2016

@phemmer I fear just relying on size/time is not robust: #21555
Perhaps we also need to rely on checksum?
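For comparison, a checksum-based check might be sketched as follows in Python. This only illustrates the general idea and is not a proposal for how Docker should implement it; `sha256_of` and `content_changed` are hypothetical names.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are never loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def content_changed(parent_path, new_path):
    """Compare by size first (cheap), then by SHA-256 of the contents."""
    if os.path.getsize(parent_path) != os.path.getsize(new_path):
        return True
    return sha256_of(parent_path) != sha256_of(new_path)
```

Unlike a size/time check, this detects a file whose content changed while its size and modification time stayed the same, at the price of reading every same-sized file in full.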

Contributor

LK4D4 commented Nov 28, 2016

@tonistiigi is this an overlay-only issue?

Contributor

schmunk42 commented Nov 28, 2016

@LK4D4 For me it was working correctly on overlay but I had issues with AUFS.

Member

tonistiigi commented Nov 28, 2016

@LK4D4 It should be aufs and overlay2

Contributor

LK4D4 commented Nov 28, 2016

@schmunk42 @tonistiigi Thanks. Added both labels. Not sure if this is WONTFIX. WDYT @tonistiigi

Member

tonistiigi commented Nov 28, 2016

This can be improved, but adding a new command or rescanning everything after each command shouldn't be the solution. Part of it could be fixed with better caching, and ultimately we would like to store things by content instead of tar streams, which should remove this issue.
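To illustrate what storing things by content could mean, here is a toy Python sketch of a content-addressed store in which a layer is just a manifest of path to digest. It is purely an illustration under that assumption and says nothing about how Docker stores or would store layers; `ContentStore` and `add_layer` are hypothetical names.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: blobs are keyed by their SHA-256 digest."""

    def __init__(self):
        self.blobs = {}       # digest -> bytes
        self.bytes_added = 0  # bytes actually stored, after deduplication

    def put(self, data):
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.blobs:
            self.blobs[digest] = data
            self.bytes_added += len(data)
        return digest


def add_layer(store, files):
    """Record a layer as a manifest of path -> digest; file bodies are deduplicated."""
    return {path: store.put(data) for path, data in files.items()}
```

With this model, re-adding an unchanged file in a second layer costs one manifest entry and no extra blob, so only the changed file would need to be pushed or pulled.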

mt-sergio commented Jan 19, 2017

My solution: (idea from https://github.com/neam/docker-diff-based-layers !)

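# Remove any leftover container from a previous run
docker rm -f uniquename 2> /dev/null
# Sync the source tree into /app on top of the base image, using .dockerignore as the rsync exclude list
docker run --name uniquename -v ~/repo/mycode:/src "${REPO}/${IMAGE}:${BASE}" rsync -a --exclude-from '/src/.dockerignore' --delete /src/ /app/
# Commit the resulting container as the new image
docker commit uniquename "${REPO}/${IMAGE}:${NEW_TAG}"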

hai-ld commented Mar 10, 2017

Is it okay if I use the overlay driver on the build server (e.g. Jenkins) so that image size is minimal, but aufs/devicemapper on production servers?

Member

thaJeztah commented Mar 10, 2017

@hai-ld yes, that should not be a problem; the images you've built should be able to run on any docker host, regardless of the storage driver it is using.

soundsgoodsofar commented May 12, 2017

Ran into this issue myself and assumed it was expected behavior. Wasted a bunch of time researching arcane ways of creating differential tars and syncing them or something before finding this thread.

Would definitely vote for this to be fixed, even at the expense of COPY command speed. Transmitting multiple gig+ containers to all of our instances to change one file for deploy would probably outweigh any benefits we're getting from docker at this point.

amenk commented Aug 27, 2017

Can someone summarize this issue? It seems to work as expected with the overlay driver, but not with overlay2 and aufs, so is the overlay label wrong? (Docker version 17.06.1-ce, build 874a737 on Ubuntu 16.04)

galakt commented Mar 26, 2018

2 years soon

thaJeztah added this to backlog in maintainers-session on Mar 26, 2018

thaJeztah removed this from backlog in maintainers-session on May 31, 2018
