Idea: maintain metadata cache about layers #264

@jonjohnsonjr

cc @mattmoor @imjasonh wdyt

Go maintains a cache of build outputs under ~/.cache/go-build (go env GOCACHE). What if we did something similar for our layers under ~/.cache/ko-build? I don't think we should cache the actual layer (it's a waste of space), but we could cache a small amount of metadata about the layers which would let us avoid re-tarring and re-gzipping them if the inputs haven't changed.

Basically, I want the benefits of --watch but without the --watch.

Currently, ko has to re-gzip layers for every build, even if we don't end up pushing them to a registry, because we need to calculate the digest to HEAD the blob (existence check) in the registry. If we already knew what the digest and diffid of each layer would be without having to gzip them, we could avoid a lot of work.

time saved

tl;dr about 800ms per build per cache hit on my underpowered chromebook:

For our first build, time is going to be dominated by go build:

$ time go build ./cmd/ko/

real	0m48.950s
user	1m18.696s
sys	0m10.057s

Subsequent builds are very fast:

$ time go build ./cmd/ko/

real	0m0.600s
user	0m0.744s
sys	0m0.327s

How much time do we spend gzipping?

$ time gzip $(which ko) --fast -c >/dev/null

real	0m1.067s
user	0m0.983s
sys	0m0.054s

How long does it take to sha256 the binary instead?

$ time sha256sum $(which ko)
67b1a51fe41816c164eec9f294bd31bcbcc93c5c96cae2a7506247ad313318a7  /home/jonjohnson/go/bin/ko

real	0m0.293s
user	0m0.258s
sys	0m0.021s

What about sha1?

$ time sha1sum ~/go/bin/ko
728cfd98610f4f39648eb9741ea0161c531f3b20  /home/jonjohnson/go/bin/ko

real	0m0.149s
user	0m0.134s
sys	0m0.015s

step          time (ms)
go build      600
gzip --fast   1067
sha256        293
sha1          149

We can drop at least 1s here by eliding the gzip for cache hits. We could also elide the go build if we got really into the internal guts of the go build cache, but let's punt on that for now. Hashing the binary will add ~150-300ms back, but we've still saved 700-800ms (of expensive CPU work).

what to store

For a push using go-containerregistry, we need three (~4 if you include the bytes) pieces of information:

  1. The DiffID, which is the sha256 digest of the layer tarball.
  2. The Digest, which is the sha256 digest of the gzipped layer tarball.
  3. The Size of the gzipped layer tarball.
  4. The gzipped layer tarball bytes.

We only actually need the bytes if we're going to upload the layer. If we can skip uploading it (because it's already in the registry), all we need to know are the first three pieces of information. We don't need to cache the actual bytes, because it's easy enough to generate them, and we're in the "miss" path if we need them anyway.
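To make the three cached values concrete, here is a minimal sketch that derives them in a single pass over an uncompressed tarball using only the standard library. The names describeLayer, countingWriter, and sampleLayer are made up for illustration and are not part of ko or go-containerregistry; the string form "sha256:<hex>" stands in for v1.Hash.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
)

// layerInfo holds the three values needed to skip an upload on a cache hit.
type layerInfo struct {
	DiffID string // sha256 of the uncompressed tarball
	Digest string // sha256 of the gzipped tarball
	Size   int64  // length in bytes of the gzipped tarball
}

// countingWriter counts the bytes written through it and discards them.
type countingWriter struct{ n int64 }

func (c *countingWriter) Write(p []byte) (int, error) {
	c.n += int64(len(p))
	return len(p), nil
}

// describeLayer reads the tarball once, hashing the raw bytes for the DiffID
// while gzipping them and hashing the compressed stream for the Digest.
func describeLayer(tarball io.Reader) (layerInfo, error) {
	diffID, digest, size := sha256.New(), sha256.New(), &countingWriter{}
	zw, err := gzip.NewWriterLevel(io.MultiWriter(digest, size), gzip.BestSpeed)
	if err != nil {
		return layerInfo{}, err
	}
	if _, err := io.Copy(io.MultiWriter(diffID, zw), tarball); err != nil {
		return layerInfo{}, err
	}
	if err := zw.Close(); err != nil { // flushes the gzip trailer
		return layerInfo{}, err
	}
	return layerInfo{
		DiffID: "sha256:" + hex.EncodeToString(diffID.Sum(nil)),
		Digest: "sha256:" + hex.EncodeToString(digest.Sum(nil)),
		Size:   size.n,
	}, nil
}

// sampleLayer is a hypothetical stand-in for tarring up the built binary.
func sampleLayer() (io.Reader, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	body := []byte("pretend this is the ko binary")
	hdr := &tar.Header{Typeflag: tar.TypeReg, Name: "ko-app/ko", Mode: 0555, Size: int64(len(body))}
	if err := tw.WriteHeader(hdr); err != nil {
		return nil, err
	}
	if _, err := tw.Write(body); err != nil {
		return nil, err
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	return &buf, nil
}

func main() {
	layer, err := sampleLayer()
	if err != nil {
		panic(err)
	}
	info, err := describeLayer(layer)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", info)
}
```

This is exactly the work a cache hit would let us skip: on a hit the three fields come from disk and the gzip pass never runs.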

how long to store it

It's not a lot of data... we could just store everything, but I don't want this disk info to grow without bound. It seems unlikely that anything other than the most recent build would be useful. Maybe the last two builds. You usually are either pushing whatever the target is as $COMMIT or you've changed the target slightly and need to rebuild it. As a heuristic, keeping the last two should cover most cases we care about.

We could also do some time-based garbage collection, but time isn't real, so I don't like that solution much.

I'm imagining something like:

// package -> cache keys that we want to retain
// capacity per package can be configurable, but I'd propose at least 2
// simple LRU with fixed capacity
type retainedInfo map[string][]string

At the end of a ko invocation, we can just GC any info that isn't reachable from the above structure (or just make this an explicit thing, à la go clean -cache).
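A rough sketch of how the retention list and the GC pass could fit together, assuming the keys for each package are kept most-recently-used first. The touch and gc method names and the stand-in layerInfo type are hypothetical, and a capacity below 1 is treated as 1.

```go
package main

import "fmt"

// layerInfo is a stand-in for the cached metadata; only its presence matters here.
type layerInfo struct{ size int64 }

// retainedInfo maps a package to the cache keys we want to keep for it,
// ordered most recently used first.
type retainedInfo map[string][]string

// touch marks key as the most recent build of pkg and trims the list to
// capacity, dropping the least recently used keys.
func (r retainedInfo) touch(pkg, key string, capacity int) {
	keys := []string{key}
	for _, k := range r[pkg] {
		if k != key && len(keys) < capacity {
			keys = append(keys, k)
		}
	}
	r[pkg] = keys
}

// gc deletes every cache entry that no package retains and reports how many
// entries were removed.
func (r retainedInfo) gc(cache map[string]layerInfo) int {
	live := map[string]bool{}
	for _, keys := range r {
		for _, k := range keys {
			live[k] = true
		}
	}
	removed := 0
	for k := range cache {
		if !live[k] {
			delete(cache, k)
			removed++
		}
	}
	return removed
}

func main() {
	cache := map[string]layerInfo{"a": {1}, "b": {2}, "c": {3}}
	retained := retainedInfo{}
	for _, key := range []string{"a", "b", "c"} {
		retained.touch("github.com/google/ko/cmd/ko", key, 2)
	}
	fmt.Println(retained["github.com/google/ko/cmd/ko"], retained.gc(cache), len(cache))
}
```

With a capacity of 2, the third build pushes the oldest key out of the list, and the GC pass then drops its entry from the cache.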

cache keys

The simplest thing I can imagine is to just sha256 the output binary and use that as our key. In fact, it doesn't really matter what we use to hash the binary; if sha1 or md5 is faster, we can use that instead. We aren't relying on this key for security.

We can do the same thing for the kodata layers, but we need a different key (no binary to hash). We could hash the tarball and just use the diffid as a key for kodata layers. Using the diffid would work for both the binary layers and the kodata layers, and would be one less string to store, so maybe that's a good idea. I'd want to measure how long tarring stuff up takes, but it seems not that bad:

$ time sha256sum <(tar -c $(which ko) -f -)
tar: Removing leading `/' from member names
544e7ac2d2090a1f533c417904f32bff54bc5a00268cc609ca56a86ec2070837  /dev/fd/63

real	0m0.312s
user	0m0.276s
sys	0m0.082s

^ Probably worth the ~150ms to simplify things...

Another idea would be to use the go build cache keys, but I'm not super clear on how they work:

# Maybe something interesting we can do here?
GODEBUG=gocachehash=1 go build ./cmd/ko/ 2>&1 | tail -n 1
HASH[link github.com/google/ko/cmd/ko]: dd7b3838bcf55328330b009deed92bd3412cfb21c2d6efcf9277d5aa98e611e5

Anyway, the info we need to store would look something like:

type layerInfo struct {
  diffid v1.Hash
  digest v1.Hash
  size int64
}

// koCache maps a cache key to the layer metadata recorded for it.
type koCache struct {
  layers map[string]layerInfo
}
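As a sketch of how the hit and miss paths could use this structure, the following keeps the same shape but with exported fields and plain strings in place of v1.Hash, so that encoding/json can persist it. The JSON layout and the lookupOrCompute, encode, and decodeCache names are assumptions for illustration only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// layerInfo mirrors the struct above, with exported fields for JSON.
type layerInfo struct {
	DiffID string `json:"diffid"`
	Digest string `json:"digest"`
	Size   int64  `json:"size"`
}

// koCache maps a cache key to the layer metadata recorded for it.
type koCache struct {
	Layers map[string]layerInfo `json:"layers"`
}

// lookupOrCompute returns the cached info for key, or runs compute (the
// expensive tar and gzip path) on a miss and records the result.
// The bool reports whether the lookup was a hit.
func (c *koCache) lookupOrCompute(key string, compute func() (layerInfo, error)) (layerInfo, bool, error) {
	if info, ok := c.Layers[key]; ok {
		return info, true, nil
	}
	info, err := compute()
	if err != nil {
		return layerInfo{}, false, err
	}
	if c.Layers == nil {
		c.Layers = map[string]layerInfo{}
	}
	c.Layers[key] = info
	return info, false, nil
}

// encode serializes the cache so it can be written under the cache directory.
func (c *koCache) encode() ([]byte, error) { return json.Marshal(c) }

// decodeCache parses a cache previously produced by encode.
func decodeCache(b []byte) (*koCache, error) {
	c := &koCache{}
	if err := json.Unmarshal(b, c); err != nil {
		return nil, err
	}
	return c, nil
}

func main() {
	cache := &koCache{}
	gzipRuns := 0
	compute := func() (layerInfo, error) {
		gzipRuns++ // stands in for re-tarring and re-gzipping the layer
		return layerInfo{DiffID: "sha256:aaa", Digest: "sha256:bbb", Size: 42}, nil
	}
	for i := 0; i < 3; i++ {
		if _, _, err := cache.lookupOrCompute("some-key", compute); err != nil {
			panic(err)
		}
	}
	b, err := cache.encode()
	if err != nil {
		panic(err)
	}
	fmt.Println("gzip runs:", gzipRuns)
	fmt.Println(string(b))
}
```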

lazyLayer

Our implementation for a layer in our cache would look something like:

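// lazyLayer serves metadata from the cache and only builds the tarball on demand.
type lazyLayer struct {
  info layerInfo
}

// We have all this info from the cache.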

func (l *lazyLayer) Digest() (v1.Hash, error) {
  return l.info.digest, nil
}

func (l *lazyLayer) DiffID() (v1.Hash, error) {
  return l.info.diffid, nil
}

func (l *lazyLayer) Size() (int64, error) {
  return l.info.size, nil
}

func (l *lazyLayer) MediaType() (types.MediaType, error) {
  return types.DockerLayer, nil
}

// We only need to do these if the registry doesn't have it.

func (l *lazyLayer) Compressed() (io.ReadCloser, error) {
  return v1util.GzipReadCloserLevel(l.tarBinary(), gzip.BestSpeed), nil
}

func (l *lazyLayer) Uncompressed() (io.ReadCloser, error) {
  return l.tarBinary(), nil
}

Alternatively, maybe we can give tarball.LayerFromWhatever some options that allow callers to supply hints like this?

activation

Maybe we don't want to do this by default, so users can just do:

export KOCACHE=$HOME/.cache/ko-build

And this gets enabled?
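An opt-in check along these lines could be as small as the following sketch. The cacheDir function is hypothetical, and taking the lookup function as a parameter is only there to make it easy to test; in practice it would be os.Getenv.

```go
package main

import (
	"fmt"
	"os"
)

// cacheDir reports the directory named by KOCACHE and whether caching is
// enabled. An unset or empty variable leaves the cache off.
func cacheDir(getenv func(string) string) (dir string, enabled bool) {
	dir = getenv("KOCACHE")
	return dir, dir != ""
}

func main() {
	if dir, ok := cacheDir(os.Getenv); ok {
		fmt.Println("layer metadata cache enabled at", dir)
	} else {
		fmt.Println("layer metadata cache disabled")
	}
}
```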
