Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Supplement generated SBOMs with layer information post-build #3258

Merged
merged 3 commits into from
Nov 23, 2022

Conversation

jedevc
Copy link
Member

@jedevc jedevc commented Nov 7, 2022

⬆️ Follow-up to #2983.
📓 Because of interface changes, this PR depends on #3197.

BuildKit can generate SBOMs for images that it builds using scanner images: e.g. jedevc/buildkit-syft-scanner - these SBOMs are attached as image attestations to the final image. Any component in BuildKit that creates and modifies the build result, starting with the frontend, to the solver, or even the final exporter, can attach attestations, allowing flexibility such as #3249 (where we consume Dockerfile args to determine exactly what inputs to pass to the scanner). This flexibility will allow custom SBOMs to be generated by custom frontends in the future, if we go down that route.

However, this flexibility presents a challenge - we cannot provide layer identifiers to the scanner at the time of scanning. There are several reasons for this:

  • Scanning may be performed at any point during the build - it's entirely likely that this occurs before we invoke the exporter, so we don't know what the layer ids are.
  • Layers are not always available - for example, when exporting to the local/tar exporter, we do not export layers. Conceptually, for caching purposes, a change in exporter should not result in needing to re-run the frontend, so the SBOM layering information is inherently part of the exporter.
  • Even if we are exporting to an image, there is no guarantee that the reference that the SBOM has been generated from is even part of the final image - we could scan a previous build stage/context, that isn't even a part of the final image.

Layer identifiers only become known in the final step of image exporting, so we have to wait until then to supplement our SBOM with these identifiers. Modifying the SPDX generated by the SBOM scanner is not ideal - however, because of the above reasons, there's no way to provide this data to the scanner at time-of-scan (at least without a fundamental re-architecture of buildkit's frontend⟹backend⟹exporter which would likely break caching in major ways).

For now, these layer identifiers are added to the comment field of the files map, a la syft, e.g.:

    "files": [
      {
        "SPDXID": "SPDXRef-100432319155cc60",
        "comment": "layerID: sha256:19b3481a712ac0533d119c48ad3655f7ca68aa2a7079da935c8aa013d3072a3b",
        "fileName": "usr/lib/python3/dist-packages/pkg_resources/_vendor/packaging/tags.py",
        "licenseConcluded": "NOASSERTION"
      },

There is one major difference between this and the output produced by syft - we instead use digests from the descriptor, which is the digest of the layer tarball in the registry/content store instead of the chain id, which can be fiddly to use back-reference content when scanning items in the registry.

This transformation of the SBOM should be invisible to users - it should happen automatically. We should provide controls to disable this functionality for the user (and maybe the scanner as well?) in the case that this information isn't desired.

@tonistiigi
Copy link
Member

To save space would it make sense to add this comment to the whole package instead of every file. Is there a use case where knowing location of every file is needed? 99+% of the cases this value would be the same for all files inside a package.

if err != nil {
return err
}
f.FileComment = fmt.Sprintf("layerID: %s", desc.Digest.String())
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should we do if f.FileComment is already commented?

I'm not sure if there's a standard for composing comments together, so maybe this isn't the right sort of place to put this information?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have settled on not overwriting the FileComment and simply skipping over this file if there is already a comment - given that there's no way to safely detect what to do in this scenario. If some sort of standard emerges, we can try and follow that, but as of now, it doesn't seem like there is one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wagoodman, should we use a different key here instead of layerID? Ideally we wouldn't want to interfere with an existing tooling that expects the value here to be a ChainID, so it might make sense to pick something different, like maybe just layer?

Copy link
Member

@tonistiigi tonistiigi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the "draft" only because it is based on local attestations or smth missing?

exporter/attestation/generate.go Outdated Show resolved Hide resolved
exporter/attestation/generate.go Outdated Show resolved Hide resolved
exporter/containerimage/attestations.go Outdated Show resolved Hide resolved
exporter/containerimage/attestations.go Outdated Show resolved Hide resolved
c.pending = c.pending[:len(c.pending)-1]

found := false
for _, f := range files {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't handle whiteouts or opaque directories. Not a blocker for first PR but we need to track the missing pieces.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite sure how we should handle whiteouts. Ideally, every file that we lookup exists in the final image - if it does, we should be safe to detect + ignore all whiteouts, skipping over them.

If we want to handle deleted files, we could return the details of the layer that deleted them, which should be just as easy, we'd just write those into the same cache map.

exporter/containerimage/attestations.go Show resolved Hide resolved
exporter/containerimage/attestations.go Show resolved Hide resolved
exporter/containerimage/writer.go Outdated Show resolved Hide resolved
)

var intotoPlatform ocispecs.Platform = ocispecs.Platform{
Architecture: "unknown",
OS: "unknown",
}

// supplementSBOM modifies SPDX attestations to include the file layers
func supplementSBOM(ctx context.Context, s session.Group, target cache.ImmutableRef, targetRemote *solver.Remote, refs map[string]cache.ImmutableRef, att result.Attestation) (result.Attestation, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would be better with 2 functions in series. supplementSBOM() should just take []byte as an input.

tonistiigi and others added 2 commits November 23, 2022 17:16
Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
Include a new ReadAll primitive, and rename Generate to
MakeInTotoAttestations.

Signed-off-by: Justin Chadwell <me@jedevc.com>
@jedevc jedevc marked this pull request as ready for review November 23, 2022 17:20
Signed-off-by: Justin Chadwell <me@jedevc.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants