Issue related to #1888 that wasn't resolved.
As Ross said here: #1888 (comment)
We're seeing this.
Steps to reproduce:
Run Docker 1.1.1 (or any version) on another machine on the local network. "Local network" in our case is a 10GbE connection with ample available bandwidth.
Push and pull any docker image from the Docker host to the docker-registry. Observe the speeds.
For reference, here's a concrete example (names changed, everything else as per output).
OK, so 2 mins 6 seconds to pull this container. Now, I took the access log from the registry server (to see every file Docker downloaded from the registry) and created a Bash script from it to curl each of the files downloaded:
Now, how long does this script take?
I realise that pull is doing other stuff besides downloading the layers (tar?), but should it really take over 12 times as long to 'docker pull' as to download the files directly? It seems to me there's something slowing down the process considerably.
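For anyone wanting to reproduce the raw-download half of this comparison without curl, here's a minimal Go sketch that fetches each layer URL sequentially and reports total bytes and wall-clock time. The registry URL is a placeholder, not taken from the log above:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// fetchAll downloads each URL sequentially, discarding the bodies,
// and returns the total bytes read plus elapsed wall-clock time.
func fetchAll(urls []string) (int64, time.Duration, error) {
	start := time.Now()
	var total int64
	for _, u := range urls {
		resp, err := http.Get(u)
		if err != nil {
			return total, time.Since(start), err
		}
		n, err := io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
		total += n
		if err != nil {
			return total, time.Since(start), err
		}
	}
	return total, time.Since(start), nil
}

func main() {
	// Hypothetical layer URLs, as scraped from a registry access log.
	urls := []string{
		"http://registry.local:5000/v1/images/<id>/layer",
	}
	bytes, elapsed, err := fetchAll(urls)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("downloaded %d bytes in %s\n", bytes, elapsed)
}
```

Comparing this number against `time docker pull` gives the same 12x-style ratio described above.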
The problem is especially noticeable on large containers: pulling them from a docker-registry to hosts takes a considerable length of time despite ample free bandwidth over a local network.
Edit: Replaced semicolons with newlines in bash script to enhance readability
It's possible that TCP congestion control kicks in as we are stalling the TCP connection when performing on the fly untar. Increasing buffer sizes might help, and @unclejack is testing that.
@sammcj you seem to have a very relevant test setup that demonstrates that problem. Would you mind helping test a custom build with a tentative fix for that on your side? If so, ping me on irc.
Not a problem, I can pop on IRC tomorrow (it's 6PM here in Aus) - if you
Another thought here is that
@stevenschlansker I looked into this briefly a while back, and thought: why can't the client and server just sync a list of layers they each have or need?
It's infuriating seeing this:
take well over a minute.
Why can't the client just say "my push is going to be these layers", and the server say "I have a,b,c, only send me x,y,z"
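That negotiation boils down to a set difference over layer IDs. A toy sketch of the idea, not the actual registry protocol or its wire format:

```go
package main

import "fmt"

// missingLayers returns the layers the client would actually need to
// upload: those in its push set that the server does not already have.
// This models the "I have a,b,c, only send me x,y,z" exchange.
func missingLayers(pushing []string, serverHas map[string]bool) []string {
	var need []string
	for _, id := range pushing {
		if !serverHas[id] {
			need = append(need, id)
		}
	}
	return need
}

func main() {
	pushing := []string{"a", "b", "x", "y", "z"}
	serverHas := map[string]bool{"a": true, "b": true, "c": true}
	// Only x, y, z would go over the wire.
	fmt.Println(missingLayers(pushing, serverHas))
}
```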
So, as a summary, so far we have a list of things to investigate:
This is really impacting our ability to use Docker across our environments and it's currently giving Docker a bit of a bad name around the business due to the increase in time to release applications.
I looked into the push side of things by running a registry on the same host and benchmarking pushes to the same host to eliminate any remote network issues. I'm using docker 1.1.2 and latest master for benchmarking.
There appears to be a bottleneck when pushing a layer and calculating a tarsum.
The problem line is: https://github.com/docker/docker/blob/master/registry/registry.go#L643 and the two main issues are:
Running a patched version of docker that disables compression and uses md5 for tarsum improved my docker push times from 41s to 7s (~6x improvement).
The docker image I'm pushing is about 391MB.
#5956 would allow md5 to be used for registry checksums, and the diff below disables compression.
@jwilder thank you!
Couple of questions:
Here are times for SHA256, SHA1, and MD5, with and without Gzip.
Looks like Gzip (while tarsumming) is the main bottleneck. Interestingly, SHA1 was slightly faster than the other two.
SHA256 w/ Gzip (baseline)
MD5 w/ Gzip
SHA1 w/ Gzip
SHA256 w/o Gzip
SHA1 w/o Gzip
MD5 w/o Gzip
Rather than coming up with a solution which actually fixes this problem in 20-50% of the cases and makes other problems worse (or introduces new ones), I'd like to have a proper fix. This takes a bit more time and a bit more work, but that will avoid regressions.
Also, while the SHA256 hash will still be used for the deterministic IDs of layers, allowing faster hashes for transactions like push/pull is one of the use cases of #5956, which permits generic hashes for the TarSum.
For the sake of your benchmarking, you ought to check the BLAKE2b hash as well.