
Puller Slowness #23

Closed
promiseofcake opened this issue Oct 11, 2017 · 8 comments

Comments

promiseofcake commented Oct 11, 2017

It took my local machine 24 minutes to pull a 1.3GB image from our private docker registry. Is this expected behavior?

@mattmoor (Contributor)

There are two pullers; which one were you using?

The version that spits out a docker save tarball is going to be slow because that form isn't great.

The version that spits out the more proprietary form we consume here should be fast, but one exception to this is if it needs to convert a v2 image to v2.2 here.

Can you share a little more about how you are invoking this, so I can help diagnose?
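
One quick way to check whether the v2 -> v2.2 conversion would kick in is to ask the registry for the v2.2 manifest directly and look at the media type it returns. A minimal sketch with requests (placeholder registry/repository names, auth omitted for brevity):

import requests

# Placeholders: substitute your own registry, repository, and tag.
REGISTRY = "registry.example.com"
REPO = "library/openjdk"
TAG = "8"

# Request the v2.2 (schema 2) manifest explicitly.
resp = requests.get(
    "https://{}/v2/{}/manifests/{}".format(REGISTRY, REPO, TAG),
    headers={"Accept": "application/vnd.docker.distribution.manifest.v2+json"})
resp.raise_for_status()

# A registry that only speaks schema 1 answers with
# application/vnd.docker.distribution.manifest.v1+json (or +prettyjws),
# which is what forces the slow conversion path in the puller.
print(resp.headers.get("Content-Type"))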

promiseofcake (Author) commented Oct 11, 2017

Hey @mattmoor, thanks for the quick reply. I have been hoping to use this as part of bazelbuild/rules_docker, so the proprietary form would be fine, but I was getting timeouts after 600s, so I started debugging with the underlying :puller target from this repo.

I think we might be falling into the "if it needs to convert a v2 image to v2.2" case, so I did some benchmarking locally:

$ time bazel-bin/puller --directory openjdk8 --name index.docker.io/library/openjdk:8

real	0m33.088s
user	0m3.014s
sys	0m1.985s

But for our private registry (same image pushed up to it):

$ time bazel-bin/puller --directory private-openjdk8 --name tld.for.private.registry/library/openjdk:8

real	6m17.294s
user	0m11.760s
sys	0m6.030s

We are running image: distribution/registry:2.1.1, so I think we may just need to upgrade the registry to take advantage of the faster path?

@mattmoor (Contributor)

If you are able to bazel run :puller, can you try instrumenting that line and see if it hits?

The 2.1.1 tag reports as being built 2 years ago, so I'd definitely update. You may need to republish the image for the change to take effect.

promiseofcake (Author) commented Oct 12, 2017

Logging patch:

diff --git a/tools/fast_puller_.py b/tools/fast_puller_.py
index 64c61a5..1ff9d2c 100755
--- a/tools/fast_puller_.py
+++ b/tools/fast_puller_.py
@@ -19,6 +19,7 @@ Unlike docker_puller the format this uses is proprietary.
 
 
 import argparse
+import logging
 
 from containerregistry.client import docker_creds
 from containerregistry.client import docker_name
@@ -73,12 +74,16 @@ def main():
   creds = docker_creds.DefaultKeychain.Resolve(name)
 
   with v2_2_image.FromRegistry(name, creds, transport, accept) as v2_2_img:
+    logging.warning("with v2_2_image.FromRegistry")
     if v2_2_img.exists():
+      logging.warning("saving fast for v2_2 image")
       save.fast(v2_2_img, args.directory, threads=_THREADS)
       return
 
   with v2_image.FromRegistry(name, creds, transport) as v2_img:
+    logging.warning("with v2_image.FromRegistry")
     with v2_compat.V22FromV2(v2_img) as v2_2_img:
+      logging.warning("with v2_compat.V22FromV2")
       save.fast(v2_2_img, args.directory, threads=_THREADS)
       return

$ time bazel-bin/puller --directory openjdk8 --name index.docker.io/library/openjdk:8
WARNING:root:with v2_2_image.FromRegistry
WARNING:root:saving fast for v2_2 image

real	0m28.850s
user	0m3.495s
sys	0m2.366s
$ time bazel-bin/puller --directory private-openjdk8 --name tld.for.private.registry/library/openjdk:8
WARNING:root:with v2_2_image.FromRegistry
WARNING:root:with v2_image.FromRegistry
WARNING:root:with v2_compat.V22FromV2

real	5m13.048s
user	0m14.126s
sys	0m7.164s

Looks like that could be it, thanks @mattmoor.

@mattmoor (Contributor)

Yeah, this is an unfortunate consequence of some strange choices Docker made about the v2.2 manifest format and extremely poor Python gzip performance**. :(

Our new intermediate format can be so fast on the happy path because it's basically designed to avoid Python g[un]zip and random accesses within a tarball.

There are other nuances that enable Bazel to prune other optional work depending on the result the user is looking for.

** - With fresh eyes, we could probably gunzip layers in a threadpool here, but (I'm guessing) this will still be several times slower than switching to v2.2.
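
For illustration only, the threadpool idea might look roughly like this (a sketch, not code from this repo; gzip decompression is still CPU-bound Python, so the win comes only from overlapping per-layer work):

import gzip
from concurrent.futures import ThreadPoolExecutor

def gunzip_layers(gzipped_blobs, threads=8):
    # Decompress a list of gzipped layer blobs (bytes) concurrently instead of
    # one at a time; each blob still pays the full Python gzip cost.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(gzip.decompress, gzipped_blobs))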

@promiseofcake (Author)

@mattmoor, we have upgraded our registry to 2.5.2 in order to take advantage of the new manifest version. I have also bumped my version of this repo to v0.0.22, rebuilt the puller, and am getting the following results:

$ time bazel-bin/puller --stderrthreshold DEBUG --directory openjdk8 --name index.docker.io/library/openjdk:8
I1212 14:56:00.225013   75771 docker_creds_.py:228] Loading Docker credentials for repository 'index.docker.io/library/openjdk:8'
I1212 14:56:00.225629   75771 fast_puller_.py:79] Pulling v2.2 image from <containerregistry.client.docker_name_.Tag object at 0x10f1980d0> ...

real	0m26.948s
user	0m2.628s
sys	0m2.335s
$ time bazel-bin/puller --stderrthreshold DEBUG --directory private-openjdk8 --name tld.for.private.registry/library/openjdk:8
I1212 14:57:01.843225   75821 docker_creds_.py:228] Loading Docker credentials for repository 'tld.for.private.registry/library/openjdk:8'
I1212 14:57:01.843544   75821 fast_puller_.py:79] Pulling v2.2 image from <containerregistry.client.docker_name_.Tag object at 0x1108d90d0> ...

real	3m32.554s
user	0m3.144s
sys	0m2.464s

When observing the network pane of Activity Monitor in OS X, pulls from our private registry are much slower than pulls from the public one. We are not running anything fancy for our registry; any tips on speeding up the pulls? Is there any additional logging I can add to try to isolate where the slow-downs are?

promiseofcake reopened this Dec 12, 2017
@mattmoor (Contributor)

That's interesting. These should be following nearly identical code paths, so I'm wondering if this is just registry slowness? Hard to say without further instrumentation.

e.g. the major registries I've played with (or built) serve blobs by redirecting to the underlying storage (e.g. GCS, S3) because those stacks are optimized for delivering large data very fast. For the registry to serve the data directly, it would have to read the data itself and then proxy it to the client. Knowing nothing about your private stack, I'm left to wonder about this :)

I'd probably instrument this code to find out how much time is spent in accessor(arg), which is effectively the blob or config fetch.
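
For example (a rough sketch, not the repo's actual code), wrapping that call and logging wall-clock time per fetch would show where the minutes go:

import logging
import time

def timed_fetch(accessor, arg):
    # Wrap the existing accessor(arg) call and log how long each blob/config
    # fetch takes.
    start = time.time()
    result = accessor(arg)
    logging.warning("accessor(%r) took %.2fs", arg, time.time() - start)
    return result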

lshamis commented Mar 8, 2020

Has this been resolved? I think I'm having similar issues.
