Dockerfile ADD remote URL does not use any HTTP caching headers, so it always re-downloads #15717

Open

dkirrane opened this Issue Aug 20, 2015 · 12 comments

@dkirrane
  • To Reproduce:
    • Build the Dockerfile below a couple of times with docker build . Each time, the jar file is re-downloaded even though it has not changed on Nexus.
FROM ubuntu

ADD ["https://oss.sonatype.org/service/local/artifact/maven/content?r=public&g=org.eclipse.xtext&a=org.eclipse.xtext.builder&v=2.9.0-SNAPSHOT", "/opt/xtext-builder.jar"]

CMD ["echo", "hello!"]
  • However, Nexus returns HTTP headers such as Last-Modified and ETag that could be used to skip the re-download and reuse what's in the Docker cache.
# curl -I "https://oss.sonatype.org/service/local/artifact/maven/content?r=public&g=org.eclipse.xtext&a=org.eclipse.xtext.builder&v=2.9.0-SNAPSHOT"

HTTP/1.1 200 OK
Content-Disposition: attachment; filename="org.eclipse.xtext.builder-2.9.0-20150820.042448-99.jar"
Content-Length: 341045
Content-Type: application/java-archive
Date: Thu, 20 Aug 2015 10:14:34 GMT
ETag: "{SHA1{44253ea5406c02ead80789cbc763c8c27ba87124}}"
Last-Modified: Thu, 20 Aug 2015 04:24:49 GMT
Server: nginx
Vary: Accept-Charset, Accept-Encoding, Accept-Language, Accept
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
Connection: keep-alive
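
The ETag and Last-Modified values above are exactly the validators HTTP caching is built on. As a rough sketch of the check being asked for, assuming the builder stored the headers from the original download alongside the cached file (the function and variable names here are illustrative, not Docker's actual code):

```python
def needs_redownload(cached, fresh):
    """Given the response headers stored at download time (`cached`) and
    the headers from a fresh HEAD request (`fresh`), decide whether the
    cached file must be fetched again."""
    # Prefer ETag: for Nexus it embeds a SHA1 of the content, so a match
    # is strong evidence the body is unchanged.
    if cached.get("ETag") and cached.get("ETag") == fresh.get("ETag"):
        return False
    # Fall back to Last-Modified: weaker, but avoids most re-downloads.
    if cached.get("Last-Modified") and cached.get("Last-Modified") == fresh.get("Last-Modified"):
        return False
    # No usable validator, or it changed: re-download to be safe.
    return True

stored = {
    "ETag": '"{SHA1{44253ea5406c02ead80789cbc763c8c27ba87124}}"',
    "Last-Modified": "Thu, 20 Aug 2015 04:24:49 GMT",
}
print(needs_redownload(stored, dict(stored)))         # False: cache hit
print(needs_redownload(stored, {"ETag": '"other"'}))  # True: re-download
```

When no validator is available at all, the function returns True, which preserves today's behaviour of always re-downloading.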

docker version: Docker version 1.8.1, build d12ea79

docker info:
Containers: 2
Images: 365
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 369
 Dirperm1 Supported: false
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.13.0-49-generic
Operating System: Ubuntu 14.04.2 LTS
CPUs: 4
Total Memory: 15.67 GiB
Name: UBUNTU4116V
ID: OTB6:5W7M:D7PU:M6Q2:KKCI:PKI3:2TS4:SPZ5:6LQC:CYIW:N4AJ:XREU
WARNING: No swap limit support

uname -a: Linux UBUNTU4116V 3.13.0-49-generic #83-Ubuntu SMP Fri Apr 10 20:11:33 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Ubuntu image running on VMWare

@bfirsh (Contributor) commented Aug 20, 2015

Previously reported in #3672 but closed until a better solution could be found. Perhaps we could leave this open to discuss possible solutions to this.

Seems like we track Last-Modified, but it isn't being used to check cache hits: #8716

@aidanhs (Contributor) commented Aug 20, 2015

#12361 (comment)

The decision was made to, basically, not trust things like the URL or timestamps, and instead actually check the data itself to make sure nothing has changed.

@bfirsh (Contributor) commented Aug 20, 2015

Didn't spot that, thanks. ETag seems like a safer way to do it, though. That's typically a hash of the file's contents.

There's probably a certain degree of "good enough" for busting the cache too. E.g. RUN wget https://... will never get downloaded again if it changes, and people seem to be alright with that. If you absolutely want a URL to be downloaded again, you can do a docker build --no-cache.

@dkirrane commented Aug 20, 2015

Yip, I already use RUN wget, but I'm downloading a bunch of jars and have to manually go and figure out whether any jar changed, then build with --no-cache, which is cumbersome.

Since the Last-Modified header will be in HTTP-date format (http://tools.ietf.org/html/rfc2616), couldn't it be parsed and used for validating the cache? If it cannot be parsed for any reason, revert to the existing approach and re-download the file. But yes, ETag would be another option too.
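
The parse-or-fallback behaviour suggested here is straightforward in most languages. A minimal illustration in Python (not builder code, just to show the idea), using the Last-Modified value from the Nexus response above:

```python
from email.utils import parsedate_to_datetime

last_modified = "Thu, 20 Aug 2015 04:24:49 GMT"

# parsedate_to_datetime understands the RFC 2616 HTTP-date format and
# raises on garbage, which maps directly onto the suggested fallback:
# if the header can't be parsed, just re-download as before.
try:
    stamp = parsedate_to_datetime(last_modified)
except (TypeError, ValueError):
    stamp = None  # unparseable: fall back to an unconditional re-download

print(stamp.isoformat())  # 2015-08-20T04:24:49+00:00
```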

@thaJeztah (Member) commented Aug 20, 2015

Since Last-Modified header will be in a HTTP-date format (http://tools.ietf.org/html/rfc2616) couldn't this be parsed & used for validating the cache. If you cannot parse it for any reason revert to the existing approach and re-download the file again. But yes ETag would also be another option also.

Here's an old discussion on that topic; see these comments: #8716 (comment) and #8716 (comment)

@aidanhs (Contributor) commented Aug 21, 2015

I am in favour of comparing last-modified/etag as an initial check and falling through to a data-check, was just noting a previous position.

Why bother parsing it? Just hash the last-modified/etag, then compare it to the existing hash. If it matches then great, no need to download. I guess you'd store this in addition to the file hash.
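
This no-parsing approach can be made concrete in a few lines: treat the validators as opaque strings and hash them together with the URL. A sketch (function name and example URL are hypothetical):

```python
import hashlib

def validator_key(url, headers):
    """Build a cache key from the URL plus whatever validators the server
    sent, treating them as opaque strings. If the server repeats the same
    ETag/Last-Modified the key matches and the download can be skipped;
    any change, or a missing header, changes the key and forces a fetch."""
    h = hashlib.sha256()
    for part in (url, headers.get("ETag", ""), headers.get("Last-Modified", "")):
        h.update(part.encode("utf-8"))
        h.update(b"\x00")  # field separator so boundaries can't collide
    return h.hexdigest()

url = "https://example.com/xtext-builder.jar"  # hypothetical
old = validator_key(url, {"ETag": '"abc"'})
new = validator_key(url, {"ETag": '"abc"'})
print(old == new)                                    # True: skip download
print(old == validator_key(url, {"ETag": '"xyz"'}))  # False: re-download
```

Stored alongside the existing content hash, a validator-key miss would fall through to the data check rather than busting the cache outright.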

@desliner commented Apr 4, 2016

Do we have any follow-up on this issue?
Are there any downsides or concerns with using Last-Modified/ETag to cache remote files and avoid re-downloading?

@thaJeztah (Member) commented Apr 4, 2016

I don't think anyone is working on it at the moment; if someone implements @aidanhs's proposal it could possibly be considered, but I don't know the other maintainers' opinions on this.

@LK4D4 (Contributor) commented Nov 28, 2016

It's still a problem. However, I'm not sure I want to see this implemented in Docker; it might be easier to do with a small bash script. So I'm -1 on this.
@thaJeztah I wonder if we have a proper way to close or push forward proposals, e.g. -3 from maintainers = close, +3 = push forward.
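
The small-script workaround mentioned here is easy to sketch: do the conditional download outside the builder, then COPY the file in, so the build cache keys on the file's checksum. Shown in Python rather than bash for brevity; the URL and filenames are made up:

```python
import os
import urllib.error
import urllib.request

def fetch_if_changed(url, dest):
    """Download url to dest only when the server's ETag differs from the
    one recorded on the previous run (stored next to the file)."""
    etag_file = dest + ".etag"
    req = urllib.request.Request(url)
    if os.path.exists(dest) and os.path.exists(etag_file):
        with open(etag_file) as f:
            req.add_header("If-None-Match", f.read().strip())
    try:
        with urllib.request.urlopen(req) as resp:
            with open(dest, "wb") as out:
                out.write(resp.read())
            if resp.headers.get("ETag"):
                with open(etag_file, "w") as f:
                    f.write(resp.headers["ETag"])
    except urllib.error.HTTPError as e:
        if e.code != 304:  # 304 Not Modified: keep the cached copy
            raise

# fetch_if_changed("https://example.com/xtext-builder.jar", "xtext-builder.jar")
# The Dockerfile then uses:  COPY xtext-builder.jar /opt/xtext-builder.jar
```

With COPY, the builder already compares the file's checksum, so an unchanged download keeps the cache warm without any changes to Docker itself.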

@rollcat referenced this issue in unit9/docklabs Dec 7, 2016

Merged: Support for /etc/rc.local; cleanup Makefile #6

@efficks commented Jan 20, 2017

It would be great for my build workflow to have this option. Currently, I'm downloading a 200 MB file every time!
I don't want to put this file in my git repo, and the file could eventually change.

@alexellis (Contributor) commented Jan 24, 2017

In lieu of a perfect solution being found, can we not review the proposed ideas? @thaJeztah

There has been a lot of discussion around this, and several related issues have gone nowhere. Compared to the caching we're used to with curl/wget commands, this feels asymmetric.

Perhaps the earliest report was over 2 years ago 👎 (in #3672)

@thaJeztah (Member) commented Feb 2, 2017

After a lengthy discussion in our maintainers meeting:

  • We're open to adding this in future
  • However, the current caching mechanism makes it very complicated to check either the ETag or the checksum of the downloaded files, and doing so would require "hacky" changes to the cache store (which we want to avoid)
  • There are plans to refactor the caching store to be more flexible
  • After that refactor is done, we can revisit and see if we can implement this

@thaJeztah removed this from backlog in maintainers-session Feb 2, 2017

samherrmann added a commit to samherrmann/docker-noms that referenced this issue Mar 28, 2017

Use wget to download noms build
ADD currently re-downloads remote content on every build. RUN wget/curl only causes a re-download if the URL changes.

see moby/moby#15717