how do I check if a layout contains all of the blobs? #838

Closed
deitch opened this issue Nov 23, 2020 · 19 comments · Fixed by #1013

Comments

@deitch
Collaborator

deitch commented Nov 23, 2020

Is it possible to somehow try to "resolve", or walk, my local v1 layout from an index that is there and see if all parts are there?

Use case: let's say I have an image as follows. I am using part of docker.io/library/alpine:3.11

9a839e63dad54c3a6d1834e29692c8492d93f90c59c978c1ed79109ea4fb9a54 - index
|
|- 39eda93d15866957feaee28f8fc5adb545276a64147445c64992ef69804dbf01 - manifest linux/amd64
    |- f70734b6a266dcb5f44c383274821207885b549b75c8e119404917a61335981a - config
    |- cbdbe7a5bc2a134ca8ec91be58565ec07d037386d1f1d8385412d224deafca08 - layer0
|- ad295e950e71627e9d0d14cdc533f4031d42edae31ab57a841c5b9588eacc280 - manifest linux/arm64
    |- c20d2a9ab6869161e3ea6d8cb52d00be9adac2cc733d3fbc3955b9268bfd7fc5 - config
    |- 29e5d40040c18c692ed73df24511071725b74956ca1a61fe6056a651d86a13bd - layer0

For the above, I have all of the parts from root index through manifests, configs, and layers for linux/amd64 and linux/arm64.

If my local layout directory has some of those, but not linux/arm64 parts, then it might look like this:

9a839e63dad54c3a6d1834e29692c8492d93f90c59c978c1ed79109ea4fb9a54 - index
|
|- 39eda93d15866957feaee28f8fc5adb545276a64147445c64992ef69804dbf01 - manifest linux/amd64
    |- f70734b6a266dcb5f44c383274821207885b549b75c8e119404917a61335981a - config
    |- cbdbe7a5bc2a134ca8ec91be58565ec07d037386d1f1d8385412d224deafca08 - layer0

How do I check if all of the parts are there without going to docker.io? I can do the following:

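		// name, remote, and layout are go-containerregistry packages; error checks omitted for brevity
		ref, err := name.ParseReference("docker.io/library/alpine:3.11")
		desc, err := remote.Get(ref) // remote.Get takes a name.Reference, not a string
		ii, err := desc.ImageIndex()
		err = p.AppendIndex(ii) // p is a layout.Path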

and it will go to docker.io, get the index, and then download any missing parts.

However, it should be possible to somehow try to "resolve" my local index and see if all parts are there.

As usual, happy to open a PR on it.
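
For illustration, the kind of walk being asked about can be sketched with the standard library alone. This is not go-containerregistry API: it assumes the on-disk structure from the OCI image layout specification (an `index.json` at the root and blobs stored under `blobs/<algorithm>/<hex>`), and `missingBlobs`, `walk`, `blobPath`, and `demoLayout` are hypothetical names. It only checks that files exist and never hashes anything, so a corrupt blob would still count as present.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

type descriptor struct {
	Digest string `json:"digest"`
}

// document covers both shapes: an index lists manifests, while an image
// manifest lists a config and layers.
type document struct {
	Manifests []descriptor `json:"manifests"`
	Config    *descriptor  `json:"config"`
	Layers    []descriptor `json:"layers"`
}

// blobPath maps "sha256:abc" to <root>/blobs/sha256/abc.
func blobPath(root, digest string) string {
	alg, encoded, _ := strings.Cut(digest, ":")
	return filepath.Join(root, "blobs", alg, encoded)
}

// missingBlobs walks from index.json and returns the digest of every
// referenced blob that is absent on disk. Nothing is hashed.
func missingBlobs(root string) ([]string, error) {
	raw, err := os.ReadFile(filepath.Join(root, "index.json"))
	if err != nil {
		return nil, err
	}
	var missing []string
	err = walk(root, raw, &missing)
	return missing, err
}

func walk(root string, raw []byte, missing *[]string) error {
	var doc document
	if err := json.Unmarshal(raw, &doc); err != nil {
		return err
	}
	// Child manifests and indexes must be read to discover their own children.
	for _, d := range doc.Manifests {
		child, err := os.ReadFile(blobPath(root, d.Digest))
		if errors.Is(err, fs.ErrNotExist) {
			*missing = append(*missing, d.Digest)
			continue
		}
		if err != nil {
			return err
		}
		if err := walk(root, child, missing); err != nil {
			return err
		}
	}
	// Configs and layers are leaves, so a stat is enough.
	leaves := doc.Layers
	if doc.Config != nil {
		leaves = append(leaves, *doc.Config)
	}
	for _, d := range leaves {
		if _, err := os.Stat(blobPath(root, d.Digest)); errors.Is(err, fs.ErrNotExist) {
			*missing = append(*missing, d.Digest)
		} else if err != nil {
			return err
		}
	}
	return nil
}

// demoLayout writes index -> manifest -> config + layer into a temp dir.
// The digests are placeholders, which is fine because nothing is hashed.
func demoLayout(withLayer bool) string {
	root, err := os.MkdirTemp("", "layout")
	if err != nil {
		panic(err)
	}
	files := map[string]string{
		filepath.Join(root, "index.json"): `{"manifests":[{"digest":"sha256:m1"}]}`,
		blobPath(root, "sha256:m1"):       `{"config":{"digest":"sha256:c1"},"layers":[{"digest":"sha256:l1"}]}`,
		blobPath(root, "sha256:c1"):       `{}`,
	}
	if withLayer {
		files[blobPath(root, "sha256:l1")] = "layer bytes"
	}
	if err := os.MkdirAll(filepath.Join(root, "blobs", "sha256"), 0o755); err != nil {
		panic(err)
	}
	for path, body := range files {
		if err := os.WriteFile(path, []byte(body), 0o644); err != nil {
			panic(err)
		}
	}
	return root
}

func main() {
	fmt.Println(missingBlobs(demoLayout(true)))  // [] <nil>
	fmt.Println(missingBlobs(demoLayout(false))) // [sha256:l1] <nil>
}
```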

@jonjohnsonjr
Collaborator

I would point you towards validate.Index and validate.Image 😄 though these might do more than you really care about, since they do full validation of digest integrity for the whole merkledag, diffids in the config file, and sizes in the descriptors.

I wish I had seen this before you wrote up the PR, sorry!

@deitch
Collaborator Author

deitch commented Nov 24, 2020

Haha! No worries. It was a fun little exercise, if unnecessary.

So those in the validate package do the same thing? They pull down all manifests and indexes, and check the accessibility of layers and configs without having to actually pull them all? Will they work for an image based on v1/layout and v1/remote, and pretty much anywhere I get an Image or ImageIndex?

@deitch
Collaborator Author

deitch commented Nov 24, 2020

Those can get expensive, as they actually calculate layer hashes, but I can live with it for now.

deitch closed this as completed Nov 24, 2020

@jonjohnsonjr
Collaborator

> They pull down all manifests and indexes, and check the accessibility of layers and configs without having to actually pull them all? Will they work for an image based on v1/layout and v1/remote, and pretty much anywhere I get an Image or ImageIndex?

Yep, that's the idea. I use these to "cheat" at test coverage in a lot of places.

> without having to actually pull them all

Except for this bit. We will hit every byte at least once (sometimes twice), so it's pretty expensive, but if you're doing this on a local disk, it should be pretty fast. For remote, yeah it might be slow.

In general, I don't love the idea of a "does this stuff exist?" method because you generally shouldn't have to care about that. The read and write implementations should do everything as lazily-accessed as possible, so you shouldn't have to care.

@deitch
Collaborator Author

deitch commented Nov 24, 2020

Come back to the original use case. I want to check if everything is there, before pulling remotely. Or because I want to check before going offline.

@jonjohnsonjr
Collaborator

> I want to check if everything is there, before pulling remotely. Or because I want to check before going offline.

In this instance, you could just read the index from your layout, then attempt to WriteIndex it back. If that succeeds, it means everything was accessible from your layout.

From earlier:

> it will go to docker.io, get the index, and then download any missing parts

This would only incur the token handshake and a single manifest GET (assuming everything is already there). For most use cases, I think that's pretty cheap, but I agree there are some scenarios (air-gapped or firewalled environments) where this doesn't work... but in these scenarios, I do think you'd really want to validate.Index the whole thing before going offline :)

@deitch
Collaborator Author

deitch commented Nov 25, 2020

Yup, then I am just going to use validate.Index and validate.Image and be happy.

I am going to close this issue out.

@deitch
Collaborator Author

deitch commented Nov 25, 2020

Oh wait. It is closed! 😂

@deitch
Collaborator Author

deitch commented Nov 27, 2020

Hmm, there may still be an issue. validate.Image works quite well, but it is slow. If I do not want to validate the actual layers, just accept them as there, there is no way to do it, correct? Would we consider an option to validate.{Image,Index}()?

I am not 100% positive that is a good idea, but it is worth discussing.

@jonjohnsonjr
Collaborator

> but it is slow.

How slow? I'm kind of surprised by that if it's just hitting local disk unless your images are enormous. What exactly are you doing? I can try to reproduce it.

> If I do not want to validate the actual layers, just accept them as there, there is no way to do it, correct?

Not currently.

> Would we consider an option

As in something like validate.Image(img, validate.SkipLayers)? I'd be curious what the actual implementation would look like. We could probably skip computing diffids and digests, but we need some way to confirm the layers are there. We could just read the first byte of a layer? Not sure.

> I am not 100% positive that is a good idea, but it is worth discussing.

Yeah... we don't have a generic way to ask an image "does this layer exist?" -- just "give me the bytes for this layer".
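
One possible shape for such an option, sketched with the standard library only. Everything here is hypothetical: `SkipLayers` is the name floated in this comment, not necessarily what the library ships, and `layer`, `validateLayers`, and `fakeLayer` are stand-ins so the example runs on its own. With the option set, each layer is opened and probed for a single byte instead of being read in full.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// layer is a stand-in exposing only the accessor needed here.
type layer interface {
	Compressed() (io.ReadCloser, error)
}

type options struct{ skipLayers bool }

// Option adjusts how validateLayers behaves (functional options pattern).
type Option func(*options)

// SkipLayers is a hypothetical option: probe one byte instead of reading all.
func SkipLayers(o *options) { o.skipLayers = true }

// validateLayers returns how many bytes it had to read across all layers.
func validateLayers(layers []layer, opts ...Option) (int64, error) {
	var o options
	for _, opt := range opts {
		opt(&o)
	}
	var total int64
	for i, l := range layers {
		rc, err := l.Compressed()
		if err != nil {
			return total, fmt.Errorf("layer %d: %w", i, err)
		}
		var n int64
		if o.skipLayers {
			n, err = io.CopyN(io.Discard, rc, 1)
			if errors.Is(err, io.EOF) {
				err = nil // an empty blob still exists
			}
		} else {
			n, err = io.Copy(io.Discard, rc) // a real validator would hash here
		}
		rc.Close()
		total += n
		if err != nil {
			return total, fmt.Errorf("layer %d: %w", i, err)
		}
	}
	return total, nil
}

// fakeLayer serves its own string as the layer content.
type fakeLayer string

func (f fakeLayer) Compressed() (io.ReadCloser, error) {
	return io.NopCloser(strings.NewReader(string(f))), nil
}

func main() {
	layers := []layer{fakeLayer("aaaa"), fakeLayer("bbbbbb")}
	fmt.Println(validateLayers(layers))             // 10 <nil>
	fmt.Println(validateLayers(layers, SkipLayers)) // 2 <nil>
}
```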

@deitch
Collaborator Author

deitch commented Nov 29, 2020

> How slow? I'm kind of surprised by that if it's just hitting local disk unless your images are enormous. What exactly are you doing? I can try to reproduce it.

I was surprised too, but I am certain that it is in the hashing stage. Truth is, the more I think about it, the more it makes sense. If I am validating a local directory, I should be hashing. Validation means "valid", not "I have the files and hope they are valid."

So we live with that. If it comes up again as unusually slow, I will raise a separate issue (and attach pprof and a flamegraph as well, to make it useful).

@deitch
Collaborator Author

deitch commented Apr 28, 2021

I am coming back to this one. I find that if it is an index with several manifests, each of which has several decent-sized layers, hashing can take a while. Is there any way to validate the existence of all of the parts of an image (and index) without calculating the hash?

@deitch
Collaborator Author

deitch commented Apr 28, 2021

I think I tracked it down. I stepped through it, looked at the source, and compared to typical utilities like sha256sum.

I timed a 160MB blob. sha256sum did it consistently in just under 1 second. computeLayer takes several seconds. I tried to figure out why. I think it is because computeLayer calculates not only the hash of the layer as is, but also the diffid, which involves gunzipping the blob, reading the tar, etc.

I get why we do it: we are validating the index, the hashes of manifests, the hashes of layers and configs, and the diffids in the configs. But it does come with a big price.

@jonjohnsonjr
Collaborator

jonjohnsonjr commented Apr 28, 2021

It does everything twice, actually, because we access it through both Compressed() and Uncompressed(), which is awful but necessary because of the interface :/

I think I'd be okay with adding options to these functions to speed things up. I can imagine:

An option to skip calculating diffids would skip all of this and this.

(There's also this assumption in there that everything is a tarball, which doesn't hold up. We should only do that for certain media types.)

Another option would be to add something that does just existence checks by calling Compressed() and immediately closing it.

@deitch
Collaborator Author

deitch commented Apr 29, 2021

OK, which do you prefer?

I looked at it like this: if I am opening an image to run it, I probably validate hashes and diffids; if I am containerd or a registry checking the input of an image when loaded, I probably only check that the blob (and config and manifest) hashes match what is expected.

@jonjohnsonjr
Collaborator

Maybe benchmark something that just skips the diffids, and we can see if that's fast enough?

jonjohnsonjr reopened this May 10, 2021

@jonjohnsonjr
Collaborator

I keep running into this. We need some way to do an existence check on a blob that explicitly isn't lazy but doesn't give you all the bytes. This would translate into a HEAD request or stat of a file.

@jonjohnsonjr
Collaborator

I am not happy with this because of all the type wrapping nonsense :/

#1013

@deitch
Collaborator Author

deitch commented May 12, 2021

You got there ahead of me! I was going to, but I have been swamped this week. Thanks! :-)
