Refactor to support caching compression. #867

Merged

6 commits merged into google:master from the refactor-tarball-layer branch on Dec 13, 2020

Conversation

mattmoor
Collaborator


Today `tarball.LayerFromOpener` must either compress or decompress its input, depending on which form it receives.

Virtually all clients of this interface have an uncompressed tarball they want to turn into a layer, but the API we have presents two fun problems.
1. If we simply present the uncompressed tarball (as most clients do) then we will often end up compressing the layer (expensive!) twice.  The first time is to compute the layer's digest, and the second is typically to publish the layer as part of `remote.Write`.
2. If we present a precompressed tarball (as ko does), then we will have an extra decompression (less expensive, but still redundant) to compute the diff id.

This refactors the layer to store an opener for each form of the layer. The constructor sniffs whether the input is compressed and, based on that, makes the "other" opener (the one that requires conversion) a lazy thunk that performs the conversion on each call (effectively what we were doing before based on the stored `compressed` bit). The advantage of this approach is that, because the openers are set up before option evaluation, options can make them memoize their result on first invocation. Given that most consumers of this library only pass the uncompressed form (or should), I have only provided an option for memoizing the compression.
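
For illustration, here is a minimal sketch of the two-opener shape described above. `Opener` mirrors `tarball.Opener`, while `gzipOpener` and `memoize` are hypothetical names for this sketch rather than the code in this PR:

```go
package layersketch

import (
	"bytes"
	"compress/gzip"
	"io"
	"sync"
)

// Opener returns a fresh reader over the layer contents each time it is called.
type Opener func() (io.ReadCloser, error)

// gzipOpener is the "lazy thunk": every call re-reads the uncompressed opener
// and compresses it on the fly.
func gzipOpener(uncompressed Opener) Opener {
	return func() (io.ReadCloser, error) {
		rc, err := uncompressed()
		if err != nil {
			return nil, err
		}
		pr, pw := io.Pipe()
		go func() {
			zw := gzip.NewWriter(pw)
			_, err := io.Copy(zw, rc)
			rc.Close()
			if cerr := zw.Close(); err == nil {
				err = cerr
			}
			pw.CloseWithError(err)
		}()
		return pr, nil
	}
}

// memoize is the kind of thing an option like WithCompressedCaching can do to
// the compressed opener: run the conversion once, then replay the buffered
// bytes on later calls.
func memoize(open Opener) Opener {
	var (
		once sync.Once
		buf  []byte
		err  error
	)
	return func() (io.ReadCloser, error) {
		once.Do(func() {
			rc, openErr := open()
			if openErr != nil {
				err = openErr
				return
			}
			defer rc.Close()
			buf, err = io.ReadAll(rc)
		})
		if err != nil {
			return nil, err
		}
		return io.NopCloser(bytes.NewReader(buf)), nil
	}
}
```

Under this shape, computing the digest and later publishing via `remote.Write` can share one memoized compressed opener, so the expensive gzip runs at most once.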
@imjasonh (Collaborator) left a comment

Why does ko pass a pre-compressed tarball? Should we recommend against it if we're going to treat the other path as special?

@@ -79,6 +70,31 @@ func WithCompressionLevel(level int) LayerOption {
}
}

// WithCompressedCaching ensures that we only compress the tarball once.
func WithCompressedCaching(l *layer) {
Collaborator

This is an exported method that takes a param of an unexported type, so I don't think any external package can actually successfully call it.

Should it just be unexported?

Collaborator Author

This implements `LayerOption`; it is meant to be passed to `LayerFromOpener`.

Collaborator

To pile on: `tarball.LayerFromOpener(opener, tarball.WithCompressedCaching)` -- it doesn't get called, just passed in.

We could thunk it or mess with the signatures to make it clearer that this is an option? Or just document it as "`WithCompressedCaching` is a functional option that...".
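
For reference, a usage sketch of the option being discussed, passed (not called) at the call site; the `layer.tar` path below is illustrative:

```go
package main

import (
	"fmt"
	"log"

	"github.com/google/go-containerregistry/pkg/v1/tarball"
)

func main() {
	// WithCompressedCaching is passed by name; LayerFromFile applies it while
	// constructing the layer.
	layer, err := tarball.LayerFromFile("layer.tar", tarball.WithCompressedCaching)
	if err != nil {
		log.Fatal(err)
	}

	// Digest needs the compressed form; with the caching option the gzip work
	// happens once and is reused by later consumers such as remote.Write.
	digest, err := layer.Digest()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("digest:", digest)
}
```

The option is applied during construction, which is why it reads like a bare function reference at the call site.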

@mattmoor
Collaborator Author

> Why does ko pass a pre-compressed tarball?

I think @jonjohnsonjr was trying to optimize the ko side, and by eagerly compressing things we trade off 2 gzips for 1 gzip and 1 gunzip (and gunzip is faster). When an uncompressed tarball is passed with this option, it will now be a single gzip.

> Should we recommend against it if we're going to treat the other path as special?

It's not really "special"; we just don't surface an option for the other path. We could if folks are using it, but the consumers I've looked at don't seem to be using that path (except ko, which I will change as soon as this lands).

@imjasonh
Collaborator

Yeah, I guess I just mean: should we document, "you don't have to compress before passing in here; this will take care of compression for you if it's not already compressed"?

@mattmoor
Collaborator Author

👍 I added a detailed comment. In other news, does Windows work at all, ever?

@jonjohnsonjr
Collaborator

> I think @jonjohnsonjr was trying to optimize the ko side, and by eagerly compressing things we trade off 2 gzips for 1 gzip and 1 gunzip (and gunzip is faster).

👍 this was the case when I measured it.

> In other news, does Windows work at all, ever?

Yes, but seems to be flakier than other legs.

opener: opener,
}

if compressed {
Collaborator

I'd really like to drop the above (L123) use of `IsGzipped`, because this came up recently and is really annoying: #112 (comment)

We should be able to call `opener` just once, which would fix this... what about something like this:

Use a `bufio.Reader` to `Peek` at the first 2 bytes. We technically only need a 2-byte buffer for this, but we should probably use 4K so a read pulls in (at least) a whole block. This lets us know whether it's compressed without consuming the reader.

`crane append` uses `tarball.LayerFromFile`, which we should usually be able to call multiple times, but that doesn't seem to work with process substitution without fiddling.
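
A rough sketch of that peek-based detection (the `isGzipped` helper and package name here are illustrative, not the library's code):

```go
package gzipsniff

import (
	"bufio"
	"io"
)

// isGzipped peeks at the first two bytes of r without consuming them and
// reports whether they match the gzip magic number (0x1f, 0x8b). Callers must
// keep reading through the returned *bufio.Reader so the peeked bytes aren't lost.
func isGzipped(r io.Reader) (bool, *bufio.Reader, error) {
	// A 4K buffer so a subsequent read pulls in (at least) a whole block,
	// as suggested above; Peek itself only needs 2 bytes.
	br := bufio.NewReaderSize(r, 4096)
	magic, err := br.Peek(2)
	if err != nil {
		if err == io.EOF {
			// Fewer than two bytes: cannot be a gzip stream.
			return false, br, nil
		}
		return false, br, err
	}
	return magic[0] == 0x1f && magic[1] == 0x8b, br, nil
}
```

Since `Peek` does not advance the reader, the same buffered reader can then be handed to `gzip.NewReader` or consumed directly, without calling the opener a second time.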

Collaborator

Okay, so I went ahead and fixed the thing that I actually care about: #868

I still think it's a good idea to use a `bufio.Reader` here to peek at stuff, but I care about it 15% less than I did before.

@jonjohnsonjr
Collaborator

@imjasonh somehow this change is causing Windows to fail, because we're trying to open the same file twice without closing it (I think?), but scanning through the code I don't see an obvious issue. It looks like we `Close` everything, but maybe not?

@mattmoor
Collaborator Author

I think I tracked down the Windows bug (see the final commit). We may have this problem in other places, but it was dislodged here because this change uses the utility function in a place that previously used `gzip.NewReader` directly.

@mattmoor
Collaborator Author

That fixed things 😅

@jonjohnsonjr merged commit 8b5370a into google:master on Dec 13, 2020
@mattmoor deleted the refactor-tarball-layer branch on December 13, 2020 at 19:19