Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/rhcos/release: Extract RHCOS build from release image #1286

Closed
wants to merge 2 commits into from

Conversation

wking
Copy link
Member

@wking wking commented Feb 21, 2019

Since e2b31b2 (#1149), we have been using the machine-os-content image to seed the machine-config operator. With this commit, use the RHCOS build ID from that image's annotations to calculate our AMI, etc. as well. This gives one less degree of freedom for breaking things ;). Users who want to test clusters based on a different RHCOS build should bump the value in their update payload, just like users testing operator updates and other changes.

The vendor commit is depressingly large, but I don't think there's much we can do about that from this repository. See the e189de1 commit message for notes.

This is a WIP until I look into signature verification.

I'm not sure where this stands with openshift/origin#21998 deferred, but I was looking into ways to simplify the release process and this was one of them. CC @cgwalters

Fixes #987.

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Feb 21, 2019
@openshift-ci-robot openshift-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Feb 21, 2019
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 21, 2019
@wking
Copy link
Member Author

wking commented Feb 21, 2019

An additional benefit of this approach is that invalid pull-secrets will result in installer-host errors like:

FATAL failed to fetch Bootstrap Ignition Config: failed to fetch dependency of "Bootstrap Ignition Config": failed to fetch dependency of "Master Machines": failed to generate asset "Image": Error reading manifest 4.0.0-0.nightly-2019-02-20-194411 in registry.svc.ci.openshift.org/ocp/release: unauthorized: authentication required 

For previous discussion of that issue, see #901. And this reminds me that I need to wire in the user-supplied pull-secret for these fetches ;).

@cgwalters
Copy link
Member

I only skimmed this but it looks good so far!

One thing I'd add to this...for release payloads let's be prepared to bake in the AMI metadata - don't fetch it over HTTP.

It'd probably be just changing code around here to parse a constant JSON blob embedded in the code or so.

@abhinavdahiya
Copy link
Contributor

/hold

This change is to big to be included after freeze.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 21, 2019
@cgwalters
Copy link
Member

This change is to big to be included after freeze.

Fair enough, but let's at least switch to hardcoding the AMI data for releases to drop the dependency on the "RHCOS build metadata" server?

@wking
Copy link
Member Author

wking commented Feb 21, 2019

Fair enough, but let's at least switch to hardcoding the AMI data for releases to drop the dependency on the "RHCOS build metadata" server?

We'd still rely on that server for libvirt QCOW2 and whatever OpenStack uses, right? Can we put that JSON blob in the machine-os-content?

@cgwalters
Copy link
Member

Can we put that JSON blob in the machine-os-content?

openshift/origin#21998 (comment)

return nil, nil, err
}
defer func() {
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is confusing.. which err are we trying to check...?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is confusing.. which err are we trying to check...?

If FromUnparsedImage (or the coming signature validation) fails, I want to be sure we close imageSource. So this closure looks at the err that will be returned by pull, and, if it's not nil, closes imageSource for us. That way pull callers can just return the error without having to worry about checking imageSource for nil-ness and possibly closing it themselves; they can just return pull's err.

wking added a commit to wking/openshift-installer that referenced this pull request Feb 27, 2019
… image

A temporary workaround while [1] is in flight.

[1]: openshift#1286
wking added a commit to wking/openshift-installer that referenced this pull request Feb 27, 2019
… image

A temporary workaround while [1] is in flight.

[1]: openshift#1286
wking added a commit to wking/openshift-installer that referenced this pull request Feb 27, 2019
… image

A temporary workaround while [1] is in flight.

[1]: openshift#1286
wking added a commit to wking/openshift-installer that referenced this pull request Feb 27, 2019
… image

A temporary workaround while [1] is in flight.

[1]: openshift#1286
wking added a commit to wking/openshift-installer that referenced this pull request Feb 27, 2019
… image

A temporary workaround while [1] is in flight.

[1]: openshift#1286
@wking wking force-pushed the rhcos-build-from-update-payload branch from e189de1 to c1d35cc Compare March 8, 2019 09:06
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 8, 2019
@wking wking force-pushed the rhcos-build-from-update-payload branch 2 times, most recently from 3a99f85 to 327976e Compare March 8, 2019 09:15
Copy link
Contributor

@mtrmac mtrmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Just a few comments on the use of c/image API.)

pkg/rhcos/release/release.go Show resolved Hide resolved
pkg/rhcos/release/release.go Show resolved Hide resolved
defer blob.Close()

// fmt.Println(layer.MediaType) // TODO: blank from quay? Just assume this is a gzipped tar?
gzipReader, err := gzip.NewReader(blob)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider c/image/pkg/compression.DetectCompression. Yes, there is a MIME type in the manifest, but historically it’s not been used to differentiate between compressed/uncompressed in practice.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there is a MIME type in the manifest...

Fix the broken servers! ;) In practice, these pulls are going to be from release images that we cut with oc adm release new ... and the layer is going to be gzipped, but sure, I can reroll to use peek-inside detection.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly, it’s not just the servers; the schema2 spec does not officially specify an “uncompressed tar” MIME type, that’s new in OCI (although there is a de-facto schema2 definition in containerd).


Actually, …/compression.AutoDecompress would probably be even more convenient.

I agree that in practice pretty much every image is compressed, and this may not be worth worrying about. Only noting that this exists because I saw that TODO.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW there is work ongoing to add new compression types like Zstd, so hard-coding gzip here will eventually break.

Since e2b31b2 (bootkube: Supply machine-os-content to MCO,
2019-01-29, openshift#1149), we have been using the machine-os-content image to
seed the machine-config operator.  With this commit, use the RHCOS
build ID from that image's annotations to calculate our AMI, etc. as
well.  This gives one less degree of freedom for breaking things ;).
Users who want to test clusters based on a different RHCOS build
should bump the value in their release image, just like users testing
operator updates and other changes.

The new pkg/asset/release subpackage allows users to continue using
pkg/rhcos without pulling in all of containers/image as a dependency.

The pull-secret handling is a bit of a hack, leaning on the fact that
users are likely providing clean secrets from [1].  Hopefully soon
containers/image will grow an API for injecting in-memory bytes into
their more-robust Docker-auth-config parser, but my attempt at that
[2] is unlikely to land in the next few days, so I've cludged together
a minimal implementation here.

[1]: https://cloud.openshift.com/clusters/install#pull-secret
[2]: containers/image#588
@wking wking mentioned this pull request Mar 11, 2019
And zounds of other dependencies.  Use Gopkg.toml's 'ignored' [1] to
exclude, BoltDB, which comes in via:

  github.com/containers/image/pkg/blobinfocache
    github.com/boltdb/bolt

and github.com/docker/distribution/registry/storage/cache, which comes
in via:

  github.com/containers/image/docker
    github.com/docker/distribution/registry/client
      github.com/docker/distribution/registry/storage/cache

because we don't use either the storage backend or the BoltDB
blob-info cache, but dep isn't checking at that level of granularity.
Ideally the upstream repositories would restructure to split these out
into separate packages (and/or dep will grow support for pruning by
build tag [2]), but until then, we can manually prune via 'ignored'.

Also ignore mtrmac/gpgme, because we only need signature verification
and we don't want to use CGO (because we want to be portable to other
operating systems, and we only need verification support, not signing
support [3]).

There may be more fat we can prune as well, but I got tired of looking
and gave up, even though ~50k lines of new code is pretty embarassing
for what is effectively just a handful of HTTPS requests and an
OpenPGP check.

Generated (after manually editing Gopkg.toml) with:

  $ dep ensure

using:

  $ dep version
  dep:
   version     : v0.5.0-31-g73b3afe
   build date  : 2019-02-08
   git hash    : 73b3afe
   go version  : go1.10.3
   go compiler : gc
   platform    : linux/amd64
   features    : ImportDuringSolve=false

[1]: https://golang.github.io/dep/docs/Gopkg.toml.html#ignored
[2]: golang/dep#291
[3]: containers/image#268
@wking wking force-pushed the rhcos-build-from-update-payload branch from c6be68c to c41e524 Compare March 12, 2019 16:17
@wking
Copy link
Member Author

wking commented Mar 12, 2019

I've pushed c6be68c -> c41e524, rebasing onto master and dropping the runc ignore, because we apparently do need runc's homedir library.

wking added a commit to wking/openshift-installer that referenced this pull request Mar 12, 2019
Clayton pushed release:4.0.0-0.nightly-2019-03-04-234414 to
quay.io/openshift-release-dev/ocp-release:4.0.0-0.7.  Extracting the
associated RHCOS build:

  $ oc adm release info --pullspecs quay.io/openshift-release-dev/ocp-release:4.0.0-0.7 | grep machine-os-content
    machine-os-content                            quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181
  $ oc image info quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181 | grep version
              version=47.330

I'd like to be pulling that version out of the release image at
install-time [1], but that PR isn't green yet, so hard-coding here is
a quick fix.

[1]: openshift#1286
wking added a commit to wking/openshift-installer that referenced this pull request Mar 12, 2019
This is the most recent RHCOS:

  $ curl -s https://releases-rhcos.svc.ci.openshift.org/storage/releases/maipo/builds.json | jq '{"latest": .builds[0], timestamp}'
  {
    "latest": "400.7.20190306.0",
    "timestamp": "2019-03-06T22:24:53Z"
  }

I'd prefer to pull this from the release image:

  $ oc adm release info --pullspecs quay.io/openshift-release-dev/ocp-release:4.0.0-0.7 | grep machine-os-content
    machine-os-content                            quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181
  $ oc image info quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181 | grep version
              version=47.330

But [1] isn't green yet, so hard-coding here is a quick fix.

Supported regions:

  $ curl -s https://releases-rhcos.svc.ci.openshift.org/storage/releases/maipo/400.7.20190306.0/meta.json | jq -r '.amis[] | .name'
  ap-northeast-1
  ap-northeast-2
  ap-south-1
  ap-southeast-1
  ap-southeast-2
  ca-central-1
  eu-central-1
  eu-west-1
  eu-west-2
  eu-west-3
  sa-east-1
  us-east-1
  us-east-2
  us-west-1
  us-west-2

[1]: openshift#1286
@openshift-ci-robot
Copy link
Contributor

@wking: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 15, 2019
@cgwalters
Copy link
Member

I think this PR is a good idea - it seems like we made an architectural decision not to do it right now though.

One thing I would say though is it seems quite likely at some point we'll make a stronger push for "cluster managed bootimages" - https://url.corp.redhat.com/6f27cd4 "Lifecycle of RHCOS image source". If we start to implement that more ambitious path, it may involve something like having the bootstrap node set up the bootimage (i.e. the bootimage for the bootstrap would be different from master/worker) potentially. That may lead us down a different architectural path.

@cgwalters
Copy link
Member

One potential blocker here is that going forward it's not clear to me that we'll ship new bootimages for every machine-os-content. That's currently true, but we are thinking about changing it - it would speed up the builds rather dramatically, particularly since we keep adding new bootimage types.

We also know that we cannot support updating the bootimage in every environment right now; we need to deal with pivoting.

The argument is it's better to always pivot so it's tested.

These two combined say that even if we extract the RHCOS metadata from the release image, we should support a separate "bootimage" annotation or so.

@wking
Copy link
Member Author

wking commented Mar 26, 2019

One potential blocker here is that going forward it's not clear to me that we'll ship new bootimages for every machine-os-content.

That's fine, just don't bump the embedded bootimage labels.

That's currently true, but we are thinking about changing it - it would speed up the builds rather dramatically, particularly since we keep adding new bootimage types.

I think prioritizing RHCOS-build-speed (affects devs) over cluster-install-speed (affects users) is slanting things the wrong way ;).

We also know that we cannot support updating the bootimage in every environment right now; we need to deal with pivoting.

We definitely want this to work. But surely we can get sufficient soak time on pivots in CI without pushing them onto users too?

These two combined say that even if we extract the RHCOS metadata from the release image, we should support a separate "bootimage" annotation or so.

Yeah, and since AMI IDs, etc., are only used for boot images and com.coreos.ostree-commit is only ever used for pivots, we're fine, right? This doesn't seem like something we could possibly mix together.

@ericavonb
Copy link

ericavonb commented Mar 27, 2019 via email

@wking
Copy link
Member Author

wking commented Mar 27, 2019

faster builds -> sorter iteration times -> quicker bug fixes...

Right, and if you just want to iterate on the post-pivot OSTree, you can do that now by skipping AMI, etc. builds and just tweaking com.coreos.ostree-commit in the the machine-os-content image. And if you just want to iterate on the pre-pivot bootimage, you'll be able to do that by building a single AMI (or whatever) to match your test environment and only tweaking that. Or iterate on both at the same time, whatever. But do we really expect to be pushing RHCOS updates out to end users so fast that out build pipeline can't keep them in fresh buildimages to allow them to avoid an initial pivot? But whatever, pulling this information from machine-os-content annotations will allow folks to wire this together for iteration however they like, and those wiring decisions will not be up to me ;).

@cgwalters
Copy link
Member

I think prioritizing RHCOS-build-speed (affects devs) over cluster-install-speed (affects users) is slanting things the wrong way ;).

It's not just about speeding up builds - let's say image uploads for 1 region in a public cloud like AWS/Azure are broken - do we stall everything on that? No, we shouldn't need to.

layers := image.LayerInfos()
for i := len(layers) - 1; i >= 0; i-- {
layer := layers[i]
blob, _, err := imageSource.GetBlob(ctx, layer, blobinfocache.NoCache)
Copy link
Contributor

@mtrmac mtrmac Apr 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is expected to enforce/verify signatures (as the code does by calling IsRunningImageAllowed in pull), this code must enforce that the blob stream matches the digest in layer.Digest; otherwise the registry could substitute a malicious blob not matching the manifest.

(IsRunningImageAllowed can’t do that for you: it would have to download the blobs, verify them, and cache them somewhere, so that later calls to GetBlob can read the cached version to ensure it has not been changed; the layering/responsibilities are allocated the opposite way, which allows the caller to stream the data directly with no storage requirements, as long as it handles the verification.

GetBlob also does not do digest verification; without, again, requiring extra storage space to cache the data, it could in principle return an error at EOF, but that would only help consumers that don’t consume the data as being streamed, just the way this code does. Also, GetBlob is implemented in all the different transports, and having every one of them copy&paste code to verify digests would be hard to maintain and enforce; centralizing this in the consumer, typically c/image/copy.Image is more practical.)

@openshift-ci-robot
Copy link
Contributor

@wking: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/prow/govet c41e524 link /test govet
ci/prow/unit c41e524 link /test unit
ci/prow/images c41e524 link /test images
ci/prow/e2e-aws-upgrade c41e524 link /test e2e-aws-upgrade
ci/prow/e2e-aws c41e524 link /test e2e-aws

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@abhinavdahiya
Copy link
Contributor

closing due to inactivity. Please reopen if needed.

/close

@openshift-ci-robot
Copy link
Contributor

@abhinavdahiya: Closed this PR.

In response to this:

closing due to inactivity. Please reopen if needed.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Embed RHCOS buildID in update payload, fetch it for installer
7 participants