Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCI Images used as cache SHOULD not have empty layers list #4331

Closed

Conversation

TBBle
Copy link
Collaborator

@TBBle TBBle commented Oct 13, 2023

Opening this PR as draft for discussion and visibility. The first three commits are not actually related, and could be pulled out as a separate PR (Edit: Now #4334), all the interesting stuff is in the final commit, 5bb3f9d at the time of writing.


I noticed while researching background for aws/containers-roadmap#876 that the OCI Image Format Specification v1.1.0-rc5 contains the following note in the Guildelines for Artifact Usage:

If the artifact does not need layers, a single layer SHOULD be included with a non-zero size. The suggested content for an unused layers array is the empty descriptor.

And a similar note appears in the Image Manifest layers property.

I threw together this PR in the hope that this might let aws/containers-roadmap#876 finally be resolved. I don't know that it actually helps there; I haven't tested this with anything except the CI test-suite.


Reproducing the WIP notes from the commit message

The empty layer is not handled in the cache importer. Not sure if we need to, tests pass without any such changes.

Do we actually need to ensure the blob exists on the registry? Probably; the tests pass because the blob already exists as the default cache config object, but that's 100% crimes. Edit: The descriptor inlines the data value, so assuming that's handled correctly, this isn't an issue. Not crimes.

Only applied to remote registry cache. Not sure if the other caches should do something similar. Are any of them also outputting in an OCI image format?

Relatedly, should we do something like this for normal OCI images? I don't know how robust the ecosystem is against unexpected layers in an OCI Image.

@jedevc
Copy link
Member

jedevc commented Oct 13, 2023

Just to note - the SHOULD requirement has only been added in release candidates for 1.1.0 of the OCI image spec - no release of those has been done yet. It's not in the latest release of 1.0.4 - the new empty descriptor type only appeared as part of the standard in 1.1.0.

I'm actually not convinced that we should take this, unless it's clear that this would actually fix aws/containers-roadmap#876. I thought the underlying issue with this was putting layers directly under an image index, see: #2251.

That said, maybe @tonistiigi and @neersighted will have more context here.

@neersighted
Copy link
Member

I have been meaning to move BuildKit to use artifacts for provenance and for registry cache images; I would like to lean into that as an alternative to this PR, though I have not started any work myself. I'm happy to provide input/guidance if you want to take that on, @TBBle

@jedevc
Copy link
Member

jedevc commented Oct 13, 2023

I have been meaning to move BuildKit to use artifacts for provenance and for registry cache images

I started work on something like this here: #3610, I'm happy for someone to take over that, though I might also get to it eventually. I think we could probably update that to set the config type to the empty descriptor, which would hopefully work around the Quay/Gitlab issue. We could then set artifactType to a buildkit-specific value, and keep the in-toto attestations in the layers. (But also, I'm a little worried about relying on details of an unreleased spec, given the previous changes that have happened in-between release candidates).

I think if that works on all registries, we could definitely do that change. I'm not sure if you're also suggesting switching to the referrers API instead of doing index-bundling - while I'd be happy to have that as an opt-in control for users who want that, I don't think the registry support is out there (including dockerhub, IIRC) to justify changing that default.

@neersighted
Copy link
Member

neersighted commented Oct 13, 2023

I already talked through this with @tonistiigi pretty extensively, but the gist of it is that I want to use the empty config.mediaType, a custom artifactType, and keep the layers as they are. This will significantly reduce the burden to detect/properly handle attestations in runtimes.

With regard to the spec, I understand the reluctance, but it is final as it is going to get and is only blocked on hammering out some final points of contention in distribution-spec. I've been participating in the OCI weekly meetings for over a year now, and I'm confident that we can start building on top of these new primitives (and indeed, getting away from 'fake images' will make all existing conformant registries/most real implementations accept the content).

With regard to index-bundling, my proposal is to do both, or to keep it as is. Artifacts != referrers, and while it would be nice to eliminate the digest label and do discovery via the referrers API, I consider that second-order work/not as important as getting away from "fake images" in the immediate term.

@tonistiigi
Copy link
Member

Is there a way to generate image manifest with --cache-to that has empty layers? What case is that? If it is possible, I tend to think we should just skip pushing manifest at all as it provides no value on importing if layers are empty.

@sudo-bmitch
Copy link

We added the "empty" blob to the spec because Quay, GCR, ECR, and Artifactory all rejected manifests without any layers field, and GCR and ECR rejected a layers field with an empty array: opencontainers/image-spec#1025

@TBBle
Copy link
Collaborator Author

TBBle commented Oct 14, 2023

Just to note - the SHOULD requirement has only been added in release candidates for 1.1.0 of the OCI image spec - no release of those has been done yet. It's not in the latest release of 1.0.4 - the new empty descriptor type only appeared as part of the standard in 1.1.0.

Indeed, I phrased this PR around the idea that rc5 gets shipped as 1.1.0, or at least no relevant changes happen between the two. I wouldn't want this merged before the 1.1.0 spec release, more-careful testing, and also someone actually verifying that this change actually solves the observed issue with ECR.


I'm actually not convinced that we should take this, unless it's clear that this would actually fix aws/containers-roadmap#876. I thought the underlying issue with this was putting layers directly under an image index, see: #2251.

The remote registry cache works on ECR with image-manifest=true,oci-mediatypes=true (i.e. #2251), except in the case that the Layers array in the resulting image is empty. See aws/containers-roadmap#876 (comment)


I have been meaning to move BuildKit to use artifacts for provenance and for registry cache images; I would like to lean into that as an alternative to this PR, though I have not started any work myself. I'm happy to provide input/guidance if you want to take that on, @TBBle

I probably have a knowledge gap here, because I'm not sure what this means, in terms of changes that would resolve this issue. My understanding is that with image-manifest=true,oci-mediatypes=true, the caches are already "Artifacts" in the current sense (images with configs and layers that aren't image-spec 1.0 pre-defined media types) thanks to #2251/#3724.

#3610, if that is along the lines you were thinking here, appears to be bringing attestation manifests into line with how image-manifest=true,oci-mediatypes=true caches already work. (Separately, could an attestation manifest have an empty Layers array?)


We added the "empty" blob to the spec because Quay, GCR, ECR, and Artifactory all rejected manifests without any layers field, and GCR and ECR rejected a layers field with an empty array: opencontainers/image-spec#1025.

Thank you for that, it looks like this issue affects GCR too. And possibly Quay.io and Artifactory, since if I remember correctly, Go will serialise an empty array to JSON as null, and I'm not sure if those registries will round-trip that back to an empty array, or fail.

As soon as I saw the guideline for "portability", I suspected that this was the reason for that part of the spec.


Is there a way to generate image manifest with --cache-to that has empty layers? What case is that? If it is possible, I tend to think we should just skip pushing manifest at all as it provides no value on importing if layers are empty.

This DockerFile will do it, as seen in the ECR roadmap issue.

FROM ubuntu:22.04 AS xxx

ARG CI_PIPELINE_ID=not_set
ENV BUILD_BY_CI_PIPELINE_ID=$CI_PIPELINE_ID

#
# RUN echo dummy >f
CMD echo $CI_PIPELINE_ID

and enabling the RUN line allows the cache-push to succeed.

(CLI per aws/containers-roadmap#876 (comment))

docker buildx build --push -t $AWS_ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/${REPO}:latest  \
--cache-from type=registry,ref=$AWS_ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/${CACHE_REPO}:latest \
--cache-to mode=max,image-manifest=true,oci-mediatypes=true,type=registry,ref=$AWS_ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/${CACHE_REPO}:latest .

My understanding is that the use-case is that these are dummy images used in their CI pipeline, and so the command-line isn't aware that these images have degenerate caches, so they can't simply skip --cache-to in this case.

The test I added for this case uses

				st := llb.Scratch()
				def, err := st.Marshal(sb.Context())
				require.NoError(t, err)

as seen in testBuildExportScratch.

BuildKit skipping pushing the cache manifest when it has no Layers (i.e. ignoring cache-to) also makes sense to me as a resolution to this issue, and could also be done for image index caches, when their Manifests contain only a cache-config. It'd be easy to adapt the existing test too.

Would that be something that should happen in the other exporters too? i.e. would BuildKit not export degenerate caches, or would it just be the Registry Exporter doing that. Or even just the image-manifest exporter?

I wonder if that would surprise any users who have built tooling around the expectation that after --cache-to type=registry,..., they can rely on the ref existing on registries which aren't rejecting the current no-layer cache.


I also suspect (but have not tested) that we could trigger the same issue with a simple image push (no cache) to ECR or GCR with this Dockerfile:

FROM scratch
ARG CI_PIPELINE_ID=not_set
ENV BUILD_BY_CI_PIPELINE_ID=$CI_PIPELINE_ID
CMD echo $CI_PIPELINE_ID

since the testBuildExportScratch test case already verifies that a trivial image has no layers, and none of those Dockerfile commands produce a layer, AFAIK.

I haven't tried to resolve that here, as (a) that isn't a new issue, so mitigations may already exist; (b) using this fix in that case has impact outside BuildKit. Also, not pushing the image isn't an option for this case. Since the image-spec requirement is only SHOULD, I don't feel too bad about the cache exporter and the image builder behaving differently on this point until and unless this space is better-explored.

@tonistiigi
Copy link
Member

This DockerFile will do it,

Yes, there are no layers in this case and no config as well so I think we can skip it. Metadata commands in Dockerfile that modify image config are not tracked by build cache.

» cat cache/blobs/sha256/042db63b6ca76d7379881ca4145b1c15d05efb2e4d2b35838dc7753c60ad63f8 | jq .                                                                                                                                               
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.buildkit.cacheconfig.v0",
    "digest": "sha256:44136fa355b3678a1146ad16f7e8649e94fb4fc21fe77e8310c060f61caaff8a",
    "size": 2
  },
  "layers": null
}


» cat cache/blobs/sha256/44136fa355b3678a1146ad16f7e8649e94fb4fc21fe77e8310c060f61caaff8a| jq .                                                                                                                                               
{}

Would that be something that should happen in the other exporters too?

All exporters that use the JSON manifests(this code is shared). Eg. the output above is from the local exporter.

I wonder if that would surprise any users who have built tooling around the expectation that after --cache-to type=registry,..., they can rely on the ref existing on registries which aren't rejecting the current no-layer cache.

Cache import from missing ref does not cause a build to fail, so I don't think this is a problem. This also looks like an exceptional case as such a combination of flags does not bring any value to the user. We probably should still print a message in export progress though about "skipping cache export for empty result".

Upstream changed the definition of the action in
docker/docs#17427 to no longer expect the `repo`
parameter.

Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
This brings in new string constants ImageIndexFile and ImagesBlobsDir.

Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
Per image-spec 1.1.0, OCI Images that would otherwise have an empty
layers list SHOULD instead have a single "Empty JSON" known-value layer
listed.

This guideline does not apply to Image Indexes, and also does not appear
in the Docker Manifest/Manifest List documentation, so those cases are
unchanged.

WIP notes:

The empty layer is not handled in the cache importer. Not sure if we
need to, tests pass without any such changes.

Do we actually need to ensure the blob exists on the registry? Probably;
the tests pass because the blob already exists as the default cache
config object, but that's 100% crimes.

Only applied to remote registry cache. Not sure if the other caches
should do something similar. Are any of them also outputting in an OCI
image format?

Relatedly, should we do something like this for normal OCI images? I
don't know how robust the ecosystem is against unexpected layers in an
OCI Image.

Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
@TBBle
Copy link
Collaborator Author

TBBle commented Oct 14, 2023

Okay, I created #4336 to implement skipping of registry/local export if there'd be no layers in the cache. It doesn't rely on any of the small-fix commits from this PR, so those are already separately out for review at #4334.

@kattmang
Copy link
Contributor

kattmang commented Oct 16, 2023

Not that it matters, but my two cents is that had I considered this for #2251 I would have tried to skip the manifest push like @tonistiigi said.
I believe this "empty" image behavior existed before this, it's just that these cache manifests were never pushable before to OCI 1.1+ registries before. Someone correct me if I'm wrong.

Thanks for chasing this up @TBBle . I was more concerned with the fact that customers to our registry doesn't actually see the errors coming from our registry that use the OCI response body schema that would mention that the image being pushed doesn't have layers, containerd/containerd#8969 which you're already aware of :)

@neersighted
Copy link
Member

neersighted commented Oct 16, 2023

I probably have a knowledge gap here, because I'm not sure what this means, in terms of changes that would resolve this issue. My understanding is that with image-manifest=true,oci-mediatypes=true, the caches are already "Artifacts" in the current sense (images with configs and layers that aren't image-spec 1.0 pre-defined media types) thanks to #2251.

The current format isn't really an artifact (manifest), it's an index with blobs and config referenced as top-level objects (often called the cache manifest), e.g. neersighted/bk-cache:im-oci:

{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.index.v1+json",
  "manifests": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:570e078d15e89744dadb98bc4295b431d20a4bde16a711d203327cfcb53d7332",
      "size": 130,
      "annotations": {
        "buildkit/createdat": "2023-10-16T23:37:43.252746472Z",
        "containerd.io/uncompressed": "sha256:87a056ab33ecf0523d1508cdc28f753d433acdedfeecc8e92fbbed2651fc30d1"
      }
    },
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:8a0af25e8c2e5dc07c14df3b857877f58bf10c944685cb717b81c5a90974a5ee",
      "size": 1916632,
      "annotations": {
        "buildkit/createdat": "2023-10-13T19:40:00.582866295Z",
        "containerd.io/uncompressed": "sha256:3694737149b11ec4d2c9f15ad24788e81955cd1c7f2c6f555baf1e4a3615bd26"
      }
    },
    {
      "mediaType": "application/vnd.buildkit.cacheconfig.v0",
      "digest": "sha256:8f97e1164c8e5b58419c4c1a3751bacf7ce6519a69302f5a347bac4dcbe2b13e",
      "size": 571
    }
  ]
}

I'd prefer to think about aligning this (opt-in) to the artifactType work in the image-spec, and consider making it the default/removing the other implementations if it manages to improve registry support.

#3610, if that is along the lines you were thinking here, appears to be bringing attestation manifests into line with how image-manifest=true,oci-mediatypes=true caches already work. (Separately, could an attestation manifest have an empty Layers array?)

Yes, this is along the lines I am thinking, though I have some differences in how I'd like to see the format represented.

@neersighted
Copy link
Member

Oh, shoot, it turns out this entire time I was using BuildKit 0.11 and did not realize it! Here's the correct manifest (neersighted/bk-cache:im-oci2):

{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.buildkit.cacheconfig.v0",
    "digest": "sha256:e72907d9f27625e61d45adad67318189aa55cf62c0589e8b40743789c6259a65",
    "size": 560
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:8a0af25e8c2e5dc07c14df3b857877f58bf10c944685cb717b81c5a90974a5ee",
      "size": 1916632,
      "annotations": {
        "buildkit/createdat": "2023-10-17T00:09:16.613818917Z",
        "containerd.io/uncompressed": "sha256:3694737149b11ec4d2c9f15ad24788e81955cd1c7f2c6f555baf1e4a3615bd26"
      }
    },
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:e35c85b8d12aa0e47d7460b729a1ff2816ce9250f025b88829d46a7866833035",
      "size": 133,
      "annotations": {
        "buildkit/createdat": "2023-10-17T00:09:16.835607043Z",
        "containerd.io/uncompressed": "sha256:a423b702fe33f7dd5b4ea16d2e178597234040fa0e9c45ca9801f1f24015aa75"
      }
    }
  ]
}

Yeah, there's a couple things I'd like to tweak here (e.g. set the artifactType), but this is otherwise what I would have suggested/expected.

@TBBle
Copy link
Collaborator Author

TBBle commented Oct 17, 2023

Okay, #4336 has pretty clear consensus as the way to approach this issue for BuildKit remote registry caches. Clearly no one's hit a similar problem for pushing FROM scratch-only-Dockerfile-generated images to ECR/GCR, as far as I've heard (because that's totally a basis for determining issue severity).

So closing this one out. Thank you all for the feedback.

@TBBle TBBle closed this Oct 17, 2023
@TBBle TBBle deleted the oci_images_should_not_have_empty_layers branch October 17, 2023 08:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants