cc @mattmoor @imjasonh wdyt
Go maintains a cache of build outputs under `~/.cache/go-build` (`go env GOCACHE`). What if we did something similar for our layers under `~/.cache/ko-build`? I don't think we should cache the actual layer (it's a waste of space), but we could cache a small amount of metadata about the layers which would let us avoid re-tarring and re-gzipping them if the inputs haven't changed.
Basically, I want the benefits of `--watch` but without the `--watch`.
Currently, `ko` has to re-gzip layers for every build, even if we don't end up pushing them to a registry, because we need to calculate the digest to HEAD the blob (existence check) in the registry. If we already knew what the digest and diffid of each layer would be without having to gzip them, we could avoid a lot of work.
## time saved

tl;dr: about 800ms saved per build per cache hit on my underpowered chromebook.
For our first build, time is going to be dominated by go build:
```shell
$ time go build ./cmd/ko/

real    0m48.950s
user    1m18.696s
sys     0m10.057s
```
Subsequent builds are very fast:
```shell
$ time go build ./cmd/ko/

real    0m0.600s
user    0m0.744s
sys     0m0.327s
```
How much time do we spend gzipping?
```shell
$ time gzip $(which ko) --fast -c >/dev/null

real    0m1.067s
user    0m0.983s
sys     0m0.054s
```
How long does it take to sha256 the binary instead?
```shell
$ time sha256sum $(which ko)
67b1a51fe41816c164eec9f294bd31bcbcc93c5c96cae2a7506247ad313318a7  /home/jonjohnson/go/bin/ko

real    0m0.293s
user    0m0.258s
sys     0m0.021s
```
What about sha1?
```shell
$ time sha1sum ~/go/bin/ko
728cfd98610f4f39648eb9741ea0161c531f3b20  /home/jonjohnson/go/bin/ko

real    0m0.149s
user    0m0.134s
sys     0m0.015s
```
| step | time (ms) |
|------|-----------|
| `go build` | 600 |
| `gzip --fast` | 1067 |
| sha256 | 293 |
| sha1 | 149 |
We can drop at least 1s here by eliding the gzip for cache hits. We could also elide the `go build` if we got really into the internal guts of the go build cache, but let's punt on that for now. Hashing the binary will add ~150-300ms back, but we've still saved 700-800ms (of expensive CPU work).
## what to store
For a push using go-containerregistry, we need three (~4 if you include the bytes) pieces of information:
- The `DiffID`, which is the sha256 digest of the layer tarball.
- The `Digest`, which is the sha256 digest of the gzipped layer tarball.
- The `Size` of the gzipped layer tarball.
- The gzipped layer tarball bytes.
We only actually need the bytes if we're going to upload the layer. If we can skip uploading it (because it's already in the registry), all we need to know are the first three pieces of information. We don't need to cache the actual bytes, because it's easy enough to generate them, and we're in the "miss" path if we need them anyway.
## how long to store it
It's not a lot of data... we could just store everything, but I don't want this on-disk info to grow without bound. It seems unlikely that anything other than the most recent build would be useful; maybe the last two builds. You're usually either pushing whatever the target is as `$COMMIT`, or you've changed the target slightly and need to rebuild it. As a heuristic, keeping the last two should cover most cases we care about.
We could also do some time-based garbage collection, but time isn't real, so I don't like that solution much.
I'm imagining something like:
```go
// package -> cache keys that we want to retain
// capacity per package can be configurable, but I'd propose at least 2
// simple LRU with fixed capacity
type retainedInfo map[string][]string
```
At the end of a `ko` invocation, we can just GC any info that isn't reachable from the above structure (or just make this an explicit thing à la `go clean -cache`).
## cache keys
The simplest thing I can imagine is to just sha256 the output binary and use that as our key. In fact, it doesn't really matter what we use to hash the binary; if sha1 or md5 is faster, we could use that instead. We aren't relying on this being secure.
We can do the same thing for the kodata layers, but we need a different key (no binary to hash). We could hash the tarball and just use the diffid as a key for kodata layers. Using the diffid would work for both the binary layers and the kodata layers, and would be one less string to store, so maybe that's a good idea. I'd want to measure how long tarring stuff up takes, but it seems not that bad:
```shell
$ time sha256sum <(tar -c $(which ko) -f -)
tar: Removing leading `/' from member names
544e7ac2d2090a1f533c417904f32bff54bc5a00268cc609ca56a86ec2070837  /dev/fd/63

real    0m0.312s
user    0m0.276s
sys     0m0.082s
```
^ Probably worth the ~150ms to simplify things...
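If we go that route, we can compute the diffid without ever materializing the tarball by pointing `tar.Writer` straight at the hasher. A rough single-file sketch (`tarDiffID` is made up for illustration; real binary layers would carry more header metadata):

```go
package main

import (
	"archive/tar"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// tarDiffID tars one file's contents directly into a sha256 hasher, so we
// get the diffid (sha256 of the uncompressed tar stream) with no buffering.
func tarDiffID(name string, contents []byte) (string, error) {
	h := sha256.New()
	tw := tar.NewWriter(h)
	hdr := &tar.Header{
		Name: name,
		Mode: 0o755,
		Size: int64(len(contents)),
	}
	if err := tw.WriteHeader(hdr); err != nil {
		return "", err
	}
	if _, err := tw.Write(contents); err != nil {
		return "", err
	}
	if err := tw.Close(); err != nil {
		return "", err
	}
	return "sha256:" + hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	id, err := tarDiffID("ko-app/ko", []byte("fake binary"))
	fmt.Println(id, err)
}
```

Determinism matters here: identical inputs must yield identical tar bytes (fixed header fields, zero mod times) or the key is useless.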
Another idea would be to use the go build cache keys, but I'm not super clear on how they work:
```shell
# Maybe something interesting we can do here?
GODEBUG=gocachehash=1 go build ./cmd/ko/ 2>&1 | tail -n 1
HASH[link github.com/google/ko/cmd/ko]: dd7b3838bcf55328330b009deed92bd3412cfb21c2d6efcf9277d5aa98e611e5
```
Anyway, the info we need to store would look something like:
```go
type layerInfo struct {
	diffid v1.Hash
	digest v1.Hash
	size   int64
}

type koCache struct {
	layers map[string]layerInfo
}
```
## lazyLayer
Our implementation for a layer in our cache would look something like:
```go
// We have all this info from the cache.
func (l *lazyLayer) Digest() (v1.Hash, error) {
	return l.info.digest, nil
}

func (l *lazyLayer) DiffID() (v1.Hash, error) {
	return l.info.diffid, nil
}

func (l *lazyLayer) Size() (int64, error) {
	return l.info.size, nil
}

func (l *lazyLayer) MediaType() (types.MediaType, error) {
	return types.DockerLayer, nil
}

// We only need to do these if the registry doesn't have it.
func (l *lazyLayer) Compressed() (io.ReadCloser, error) {
	return v1util.GzipReadCloserLevel(l.tarBinary(), gzip.BestSpeed), nil
}

func (l *lazyLayer) Uncompressed() (io.ReadCloser, error) {
	return l.tarBinary(), nil
}
```
Alternatively, maybe we can give tarball.LayerFromWhatever some options that allow callers to supply hints like this?
## activation
Maybe we don't want to do this by default, so users can just do:
```shell
export KOCACHE=$HOME/.cache/ko-build
```
And this gets enabled?