cc @mattmoor @imjasonh wdyt
Go maintains a cache of build outputs under `~/.cache/go-build` (`go env GOCACHE`). What if we did something similar for our layers under `~/.cache/ko-build`? I don't think we should cache the actual layer (it's a waste of space), but we could cache a small amount of metadata about the layers which would let us avoid re-tarring and re-gzipping them if the inputs haven't changed.
Basically, I want the benefits of `--watch` but without the `--watch`.
Currently, `ko` has to re-gzip layers for every build, even if we don't end up pushing them to a registry, because we need to calculate the digest to HEAD the blob (existence check) in the registry. If we already knew what the digest and diffid of each layer would be without having to gzip them, we could avoid a lot of work.
## time saved

tl;dr: about 800ms saved per build per cache hit on my underpowered chromebook.
For our first build, time is going to be dominated by go build:
```shell
$ time go build ./cmd/ko/

real    0m48.950s
user    1m18.696s
sys     0m10.057s
```
Subsequent builds are very fast:
```shell
$ time go build ./cmd/ko/

real    0m0.600s
user    0m0.744s
sys     0m0.327s
```
How much time do we spend gzipping?
```shell
$ time gzip $(which ko) --fast -c >/dev/null

real    0m1.067s
user    0m0.983s
sys     0m0.054s
```
How long does it take to sha256 the binary instead?
```shell
$ time sha256sum $(which ko)
67b1a51fe41816c164eec9f294bd31bcbcc93c5c96cae2a7506247ad313318a7  /home/jonjohnson/go/bin/ko

real    0m0.293s
user    0m0.258s
sys     0m0.021s
```
What about sha1?
```shell
$ time sha1sum ~/go/bin/ko
728cfd98610f4f39648eb9741ea0161c531f3b20  /home/jonjohnson/go/bin/ko

real    0m0.149s
user    0m0.134s
sys     0m0.015s
```
| step | time (ms) |
|------|-----------|
| `go build` | 600 |
| `gzip --fast` | 1067 |
| sha256 | 293 |
| sha1 | 149 |
We can drop at least 1s here by eliding the gzip for cache hits. We could also elide the `go build` if we got really into the internal guts of the go build cache, but let's punt on that for now. Hashing the binary will add ~150-300ms back, but we've still saved 700-800ms (of expensive CPU work).
## what to store
For a push using go-containerregistry, we need three (~4 if you include the bytes) pieces of information:
- The `DiffID`, which is the sha256 digest of the layer tarball.
- The `Digest`, which is the sha256 digest of the gzipped layer tarball.
- The `Size` of the gzipped layer tarball.
- The gzipped layer tarball bytes.
We only actually need the bytes if we're going to upload the layer. If we can skip uploading it (because it's already in the registry), all we need to know are the first three pieces of information. We don't need to cache the actual bytes, because it's easy enough to generate them, and we're in the "miss" path if we need them anyway.
## how long to store it
It's not a lot of data... we could just store everything, but I don't want this on-disk info to grow without bound. It seems unlikely that anything other than the most recent build would be useful; maybe the last two builds. You're usually either pushing whatever the target is as `$COMMIT`, or you've changed the target slightly and need to rebuild it. As a heuristic, keeping the last two should cover most cases we care about.
We could also do some time-based garbage collection, but time isn't real, so I don't like that solution much.
I'm imagining something like:
```go
// package -> cache keys that we want to retain
// capacity per package can be configurable, but I'd propose at least 2
// simple LRU with fixed capacity
type retainedInfo map[string][]string
```
At the end of a `ko` invocation, we can just GC any info that isn't reachable from the above structure (or just make this an explicit thing à la `go clean -cache`).
## cache keys
The simplest thing I can imagine is to just sha256 the output binary and use that as our key. In fact, it doesn't really matter what we use to hash the binary; if sha1 or md5 is faster, we could use that instead. We aren't relying on this being secure.
We can do the same thing for the kodata layers, but we need a different key (no binary to hash). We could hash the tarball and just use the diffid as a key for kodata layers. Using the diffid would work for both the binary layers and the kodata layers, and would be one less string to store, so maybe that's a good idea. I'd want to measure how long tarring stuff up takes, but it seems not that bad:
```shell
$ time sha256sum <(tar -c $(which ko) -f -)
tar: Removing leading `/' from member names
544e7ac2d2090a1f533c417904f32bff54bc5a00268cc609ca56a86ec2070837  /dev/fd/63

real    0m0.312s
user    0m0.276s
sys     0m0.082s
```
^ Probably worth the ~150ms to simplify things...
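If we go that route, we can compute the diffid without ever materializing the tarball by pointing `tar.Writer` straight at the hasher. A rough single-file sketch (`tarDiffID` is made up for illustration; real binary layers would carry more header metadata):

```go
package main

import (
	"archive/tar"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// tarDiffID tars one file's contents directly into a sha256 hasher, so we
// get the diffid (sha256 of the uncompressed tar stream) with no buffering.
func tarDiffID(name string, contents []byte) (string, error) {
	h := sha256.New()
	tw := tar.NewWriter(h)
	hdr := &tar.Header{
		Name: name,
		Mode: 0o755,
		Size: int64(len(contents)),
	}
	if err := tw.WriteHeader(hdr); err != nil {
		return "", err
	}
	if _, err := tw.Write(contents); err != nil {
		return "", err
	}
	if err := tw.Close(); err != nil {
		return "", err
	}
	return "sha256:" + hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	id, err := tarDiffID("ko-app/ko", []byte("fake binary"))
	fmt.Println(id, err)
}
```

Determinism matters here: identical inputs must yield identical tar bytes (fixed header fields, zero mod times) or the key is useless.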
Another idea would be to use the go build cache keys, but I'm not super clear on how they work:
```shell
# Maybe something interesting we can do here?
GODEBUG=gocachehash=1 go build ./cmd/ko/ 2>&1 | tail -n 1
HASH[link github.com/google/ko/cmd/ko]: dd7b3838bcf55328330b009deed92bd3412cfb21c2d6efcf9277d5aa98e611e5
```
Anyway, the info we need to store would look something like:
```go
type layerInfo struct {
	diffid v1.Hash
	digest v1.Hash
	size   int64
}

type koCache struct {
	layers map[string]layerInfo
}
```
## lazyLayer
Our implementation for a layer in our cache would look something like:
```go
// We have all this info from the cache.
func (l *lazyLayer) Digest() (v1.Hash, error) {
	return l.info.digest, nil
}

func (l *lazyLayer) DiffID() (v1.Hash, error) {
	return l.info.diffid, nil
}

func (l *lazyLayer) Size() (int64, error) {
	return l.info.size, nil
}

func (l *lazyLayer) MediaType() (types.MediaType, error) {
	return types.DockerLayer, nil
}

// We only need to do these if the registry doesn't have it.
func (l *lazyLayer) Compressed() (io.ReadCloser, error) {
	return v1util.GzipReadCloserLevel(l.tarBinary(), gzip.BestSpeed), nil
}

func (l *lazyLayer) Uncompressed() (io.ReadCloser, error) {
	return l.tarBinary(), nil
}
```
Alternatively, maybe we can give tarball.LayerFromWhatever some options that allow callers to supply hints like this?
## activation
Maybe we don't want to do this by default, so users can just do:
```shell
export KOCACHE=$HOME/.cache/ko-build
```
And this gets enabled?