how do I check if a layout contains all of the blobs? #838

deitch · 2020-11-23T08:53:41Z

Is it possible to somehow try to "resolve", or walk, my local v1 layout from an index that is there and see if all parts are there?

Use case: let's say I have an image as follows. I am using part of docker.io/library/alpine:3.11

9a839e63dad54c3a6d1834e29692c8492d93f90c59c978c1ed79109ea4fb9a54 - index
|
|- 39eda93d15866957feaee28f8fc5adb545276a64147445c64992ef69804dbf01 - manifest linux/amd64
    |- f70734b6a266dcb5f44c383274821207885b549b75c8e119404917a61335981a - config
    |- cbdbe7a5bc2a134ca8ec91be58565ec07d037386d1f1d8385412d224deafca08 - layer0
|- ad295e950e71627e9d0d14cdc533f4031d42edae31ab57a841c5b9588eacc280 - manifest linux/arm64
    |- c20d2a9ab6869161e3ea6d8cb52d00be9adac2cc733d3fbc3955b9268bfd7fc5 - config
    |- 29e5d40040c18c692ed73df24511071725b74956ca1a61fe6056a651d86a13bd - layer0

For the above, I have all of the parts from root index through manifests, configs, and layers for linux/amd64 and linux/arm64.

If my local layout directory has some of those, but not linux/arm64 parts, then it might look like this:

9a839e63dad54c3a6d1834e29692c8492d93f90c59c978c1ed79109ea4fb9a54 - index
|
|- 39eda93d15866957feaee28f8fc5adb545276a64147445c64992ef69804dbf01 - manifest linux/amd64
    |- f70734b6a266dcb5f44c383274821207885b549b75c8e119404917a61335981a - config
    |- cbdbe7a5bc2a134ca8ec91be58565ec07d037386d1f1d8385412d224deafca08 - layer0

How do I check if all of the parts are there without going to docker.io? I can do the following:

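		// Error handling elided; p is assumed to be the local layout.Path.
		ref, err := name.ParseReference("docker.io/library/alpine:3.11")
		desc, err := remote.Get(ref)
		ii, err := desc.ImageIndex()
		err = p.AppendIndex(ii)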

and it will go to docker.io, get the index, and then download any missing parts.

However, it should be possible to somehow try to "resolve" my local index and see if all parts are there.

As usual, happy to open a PR on it.


jonjohnsonjr · 2020-11-24T02:55:14Z

I would point you towards validate.Index and validate.Image 😄 though these might do more than you really care about, since they do full validation of digest integrity for the whole merkledag, diffids in the config file, and sizes in the descriptors.

I wish I had seen this before you wrote up the PR, sorry!

deitch · 2020-11-24T04:15:14Z

Haha! No worries. It was a fun little exercise, if unnecessary.

So those in the validate package do the same thing? They pull down all manifests and indexes, and check the accessibility of layers and configs without having to actually pull them all? Will they work for an image based on v1/layout and v1/remote, and pretty much anywhere I get an Image or ImageIndex?

deitch · 2020-11-24T09:13:46Z

Those can get expensive, as they actually calculate layer hashes, but I can live with it for now.

jonjohnsonjr · 2020-11-24T19:22:03Z

> They pull down all manifests and indexes, and check the accessibility of layers and configs without having to actually pull them all? Will they work for an image based on v1/layout and v1/remote, and pretty much anywhere I get an Image or ImageIndex?

Yep, that's the idea. I use these to "cheat" at test coverage in a lot of places.

> without having to actually pull them all

Except for this bit. We will hit every byte at least once (sometimes twice), so it's pretty expensive, but if you're doing this on a local disk, it should be pretty fast. For remote, yeah it might be slow.

In general, I don't love the idea of a "does this stuff exist?" method because you generally shouldn't have to care about that. The read and write implementations should access everything as lazily as possible, so whether something already exists should not matter to the caller.

deitch · 2020-11-24T19:24:30Z

Coming back to the original use case: I want to check if everything is there before pulling remotely, or before going offline.

jonjohnsonjr · 2020-11-24T19:46:03Z

> I want to check if everything is there, before pulling remotely. Or because I want to check before going offline.

In this instance, you could just read the index from your layout, then attempt to WriteIndex it back. If that succeeds, it means everything was accessible from your layout.

From earlier:

> it will go to docker.io, get the index, and then download any missing parts

This would only incur the token handshake and a single manifest GET (assuming everything is already there). For most use cases, I think that's pretty cheap, but I agree there are some scenarios (air-gapped or firewalled environments) where this doesn't work... but in these scenarios, I do think you'd really want to validate.Index the whole thing before going offline :)

deitch · 2020-11-25T09:50:57Z

Yup, then I am just going to use validate.Index and validate.Image and be happy.

I am going to close this issue out.

deitch · 2020-11-25T09:51:07Z

Oh wait. It is closed! 😂

deitch · 2020-11-27T11:23:38Z

Hmm, there may still be an issue. validate.Image works quite well, but it is slow. If I do not want to validate the actual layers, just accept them as there, there is no way to do it, correct? Would we consider an option to validate.{Image,Index}()?

I am not 100% positive that is a good idea, but it is worth discussing.

jonjohnsonjr · 2020-11-27T16:32:41Z

> but it is slow.

How slow? I'm kind of surprised by that if it's just hitting local disk unless your images are enormous. What exactly are you doing? I can try to reproduce it.

> If I do not want to validate the actual layers, just accept them as there, there is no way to do it, correct?

Not currently.

> Would we consider an option

As in something like validate.Image(img, validate.SkipLayers)? I'd be curious what the actual implementation would look like. We could probably skip computing diffids and digests, but we need some way to confirm the layers are there. We could just read the first byte of a layer? Not sure.

> I am not 100% positive that is a good idea, but it is worth discussing.

Yeah... we don't have a generic way to ask an image "does this layer exist?" -- just "give me the bytes for this layer".

deitch · 2020-11-29T14:19:06Z

> How slow? I'm kind of surprised by that if it's just hitting local disk unless your images are enormous. What exactly are you doing? I can try to reproduce it.

I was surprised too, but I am certain that it is in the hashing stage. Truth is, the more I think about it, the more it makes sense. If I am validating a local directory, I should be hashing. Validation means "valid", not "I have the files and hope they are valid."

So we live with that. If it comes up again as unusually slow, I will raise a separate issue (and attach pprof and a flamegraph as well, to make it useful).

deitch · 2021-04-28T05:20:38Z

I am coming back to this one. I find that if it is an index, with several manifests, each of which has several decent sized layers, hashing can take a while. Is there any way to validate existence of all of the parts of an image (and index) without calculating the hash?

deitch · 2021-04-28T07:50:01Z

I think I tracked it down. I stepped through it, looked at the source, and compared to typical utilities like sha256sum.

I timed a 160 MB blob. sha256sum did it consistently in just under 1 second. computeLayer takes several seconds. I tried to figure out why. I think it is because computeLayer calculates not only the hash of the layer as is, but also the diffid, which involves gunzipping the blob, reading the tar, etc.

I get why we do it, we are validating the index, the hashes of manifests, the hashes of layers and configs, and the diffids in the configs. But it does come with a big price.
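To make the cost difference concrete, here is a minimal standard-library sketch of the two hashes being discussed. It is not the library's computeLayer; it only illustrates that the layer digest is a SHA-256 over the blob as stored, while the diffid is a SHA-256 over the decompressed stream and so pays for gunzip on top of hashing more bytes. The helper names (`hashAll`, `layerDigest`, `diffID`, `gzipBytes`) are invented for this example, and the blob is assumed to be gzip-compressed.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
)

// hashAll streams r through SHA-256 and returns the hex digest.
func hashAll(r io.Reader) (string, error) {
	h := sha256.New()
	if _, err := io.Copy(h, r); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// layerDigest hashes the blob exactly as stored (what sha256sum reports).
func layerDigest(blob []byte) (string, error) {
	return hashAll(bytes.NewReader(blob))
}

// diffID hashes the decompressed stream, so it pays for gunzip as well.
func diffID(blob []byte) (string, error) {
	zr, err := gzip.NewReader(bytes.NewReader(blob))
	if err != nil {
		return "", err
	}
	defer zr.Close()
	return hashAll(zr)
}

// gzipBytes compresses data in memory to stand in for a layer blob.
func gzipBytes(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	blob, err := gzipBytes([]byte("pretend this is a tar stream"))
	if err != nil {
		panic(err)
	}
	digest, _ := layerDigest(blob)
	id, _ := diffID(blob)
	fmt.Println("digest:", digest)
	fmt.Println("diffid:", id)
}
```

Timing `layerDigest` against `diffID` on a real layer file would be one way to see how much of the slowdown comes from decompression rather than hashing.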

jonjohnsonjr · 2021-04-28T18:14:04Z

It does everything twice, actually, because we access it through both Compressed() and Uncompressed(), which is awful but necessary because of the interface :/

I think I'd be okay with adding options to these functions to speed things up. I can imagine:

An option to skip calculating diffids would skip all of this and this.

(There's also this assumption in there that everything is a tarball, which doesn't hold up. We should only do that for certain media types.)

Another option would be to add something that does just existence checks by calling Compressed() and immediately closing it.
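The "open it and immediately close it" idea could look roughly like the sketch below. The `compressedOpener` interface is a local stand-in that mirrors only the `Compressed()` method mentioned above, and `fakeLayer` is a hypothetical test double; neither is part of the library. Whether opening the stream actually touches the backing store depends on the layer implementation, so treat this as an illustration of the shape rather than a guaranteed existence check.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// compressedOpener is a local stand-in for the one layer method this
// sketch relies on.
type compressedOpener interface {
	Compressed() (io.ReadCloser, error)
}

// layerExists opens the compressed stream and closes it without reading.
func layerExists(l compressedOpener) error {
	rc, err := l.Compressed()
	if err != nil {
		return err
	}
	return rc.Close()
}

// fakeLayer is a hypothetical test double: it fails to open when err is set.
type fakeLayer struct{ err error }

func (f fakeLayer) Compressed() (io.ReadCloser, error) {
	if f.err != nil {
		return nil, f.err
	}
	return io.NopCloser(strings.NewReader("layer bytes")), nil
}

var errNoBlob = errors.New("blob not found")

func main() {
	fmt.Println(layerExists(fakeLayer{}))               // <nil>
	fmt.Println(layerExists(fakeLayer{err: errNoBlob})) // blob not found
}
```

The appeal of this shape is that it needs no new method on the layer type; the drawback is that it says nothing about whether the bytes behind the stream are complete or correct.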

deitch · 2021-04-29T15:22:36Z

OK, which do you prefer?

I looked at it like this: if I am opening an image to run it, I probably validate hashes and diffids; if I am containerd or a registry checking the input of an image when loaded, I probably only check that the blob (and config and manifest) hashes match what is expected.

jonjohnsonjr · 2021-04-29T17:31:50Z

Maybe benchmark something that just skips the diffids, and we can see if that's fast enough?

jonjohnsonjr · 2021-05-10T22:06:19Z

I keep running into this. We need some way to do an existence check on a blob that explicitly isn't lazy but doesn't give you all the bytes. This would translate into a HEAD request or stat of a file.
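On the remote side, such a check could be a HEAD request against the registry's blob endpoint (`/v2/<name>/blobs/<digest>` in the distribution API); for a local layout the counterpart would be a plain `os.Stat` on the blob path. The sketch below is not the library's implementation and leaves out authentication entirely; `existsRemote` and `demo` are illustrative names, and the demo runs against a throwaway `httptest` server with placeholder digests rather than a real registry.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// existsRemote issues HEAD <base>/v2/<repo>/blobs/<digest> and maps the
// status code to a boolean. No blob bytes are transferred.
func existsRemote(c *http.Client, base, repo, digest string) (bool, error) {
	resp, err := c.Head(base + "/v2/" + repo + "/blobs/" + digest)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	switch resp.StatusCode {
	case http.StatusOK:
		return true, nil
	case http.StatusNotFound:
		return false, nil
	}
	return false, fmt.Errorf("unexpected status %s", resp.Status)
}

// demo checks one known and one unknown blob against a fake registry.
// The digests are placeholders, not real hashes.
func demo() (foundKnown, foundUnknown bool, err error) {
	const known = "sha256:aaa"
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method == http.MethodHead && r.URL.Path == "/v2/library/alpine/blobs/"+known {
			w.WriteHeader(http.StatusOK)
			return
		}
		http.NotFound(w, r)
	}))
	defer srv.Close()

	foundKnown, err = existsRemote(srv.Client(), srv.URL, "library/alpine", known)
	if err != nil {
		return
	}
	foundUnknown, err = existsRemote(srv.Client(), srv.URL, "library/alpine", "sha256:bbb")
	return
}

func main() {
	fmt.Println(demo())
}
```

Against a real registry the request would also need the usual token handshake, which is why this kind of check fits more naturally inside the library's transport than in caller code.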

jonjohnsonjr · 2021-05-10T23:25:11Z

I am not happy with this because of all the type wrapping nonsense :/

#1013

deitch · 2021-05-12T18:33:14Z

You got there ahead of me! I was going to, but I have been swamped this week. Thanks! :-)

deitch mentioned this issue Nov 23, 2020: "functions to check and validate if can retrieve all parts of an image or index" #839 (Closed)

deitch closed this as completed Nov 24, 2020

jonjohnsonjr reopened this May 10, 2021

jonjohnsonjr mentioned this issue May 11, 2021: "Add --fast flag to crane validate" #1013 (Merged)

jonjohnsonjr closed this as completed in #1013 May 12, 2021
