Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Teach MCO to use the new-format image by default #3258

Closed
wants to merge 12 commits into from

Conversation

jkyros
Copy link
Contributor

@jkyros jkyros commented Jul 21, 2022

This is a way to give the MCO "new format" image support by:

  • Plumbing BaseOperatingSystemContainer and BaseOperatingSystemExtensionsContainer all the way through into MachineConfig like OSImageURL

I don't think this is actually how we'll want to do it, this is just the "simplest way" that doesn't "overload" machine-os-content with a new format image.

Caveats:

Can I install a cluster using the new format image? Yes, but right now you probably don't want to, because you have to:

  1. Build an installer, say off this branch https://github.com/jkyros/installer/tree/allow-mco-consume-multiple-images
  2. Create a payload manually
oc adm release new -n ocp --server https://api.ci.l2s4.p1.openshiftapps.com --from-release registry.ci.openshift.org/ocp/releas
e:4.12 --to-image quay.io/jkyros/ocp-release:v4.12  --include=rhel-coreos-8-extensions rhel-coreos-8-extensions=registry.ci.openshift.org/rhcos-dev
el/rhel-coreos-extensions:latest --include=rhel-coreos-8 rhel-coreos-8=quay.io/jkyros/derived-images:rhel-coreos-latest  machine-config-operator=quay.io/jkyros/ma
chine-config-operator:latest 
  1. Install using your custom installer, overriding the payload:
export OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=quay.io/jkyros/ocp-release:v4.12
../openshift-install-dev create cluster ...

This PR is in service to: https://issues.redhat.com/browse/MCO-291

jkyros and others added 12 commits July 20, 2022 22:55
Previously our machine-os-content images were just "wrappers" around
ostree commits. With our new "layering" work, we have a new image format
that is actually a working OCI container, but this type of image needs
to be applied differently by rpm-ostree.

This extends the RpmOstreeClient to support new layered images by:
- capturing more metadata from rpm-ostree status
- adding an additional rebase function specifically for the layered case
- adding a function to  determine whether or not a given image is a
  "layered"/bootable OCI image
We changed the interface for RpmOstreeClient, so we need to make sure
the mocks are also up to date with the new signatures
Part of openshift/enhancements#1032

We'll add the new-format image into the payload alongside the old
one until we can complete the transition.

(There may actually be a separate `rhel-coreos-extensions` image
 e.g. too, so this is just the start)

Note this PR is just laying groundwork; the new format container
will not be used by default.
controllerconfig

The extensions for our layered/bootable `rhel-coreos` images are no
longer going to be contained within the image themselves. They will be a
separate container. See: openshift/os#763

This makes sure that, if present, the new extensions image gets pushed
through to controllerconfig and is available when merging
machineconfigs.
The MCO orchestrates the management/setup of a big pile of different
containers today, from `machine-os-content` to `haproxy` and
configuring cri-o to use the correct `pause`/`pod` container.

These images get set up at both "bootstrap" time and cluster time.
In bootstrap today we manually scrape each image out of the CVO in
shell script, and then pass them as CLI arguments inside `bootkube.sh`.

This means adding new images requires a tedious "ratcheting" process
where the CLI argument is added to the MCO, then we patch the installer
to pass it in.

Instead, add a new CLI argument which accepts a serialized imagestream
object, which is exactly what the CVO carries.

Prep for adding a new `rhel-coreos` image into the payload, so I can
avoid that ratcheting process.
This updates the daemon's bootstrapping path so that the daemon can
support updating to a new format/layered image while it is
bootstrapping.
This modifies the daemon such that if a "new format" image is present
in the MachineConfig as BaseOperatingSystemContainer, the daemon will
use it instead of OSImageURL.
So once upon a time, our image references in our manifests, the ones
that look like:

`registry.svc.ci.openshift.org/openshift:something`

probably pointed at real images. But:
- that registry has long since retired and
- the openshift client (`oc`) rewrites these values as though they were
templated when an openshift release gets packed by `oc adm release new`

This was not immediately apparent to someone who was not familiar with
the process, and an outside observer could be forgiven for thinking that
these were real values.

This tries to alleviate any confusion or belief that these values are
'real' by making them obviously fake, but also descriptive of their
function.
The way things work with the openshift client, once we add an image to
image-references, `oc` expects it to be in the release, and `oc adm
release new` will fail if it's not present.

Also, in order for the placeholder image location in our
`machine-config-osimageurl` configmap to get rewritten with a real value
(rather than the bogus 'template' value it has by default), it has to be
referenced in image-references.

The desired outcome was "if it's there use it, otherwise ignore it", but
that's just not possible with regard to the payload the way things are
set up.

This groups the addition of the new format os container and extensions
container together into sort of an "on switch" commit that will break
the builds if merged before https://issues.redhat.com/browse/ART-3883,
but will make them work with new images if merged afterwards.
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 21, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 21, 2022

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 21, 2022

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jkyros
Once this PR has been reviewed and has the lgtm label, please assign kikisdeliveryservice for approval by writing /assign @kikisdeliveryservice in a comment. For more information see:The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cgwalters
Copy link
Member

OK! Word on the street is that this may be unblocked. Basically I think we can now refer to rhel-coreos-8 in the MCO, and that should cause it to actually appear in the release image.

@cgwalters cgwalters marked this pull request as ready for review August 3, 2022 22:28
@cgwalters
Copy link
Member

Just going to lift draft and see what happens in CI...

@cgwalters
Copy link
Member

/retest
build02 was ill

@cgwalters
Copy link
Member

cgwalters commented Aug 4, 2022

error: failed to push image registry.build02.ci.openshift.org/ci-op-phwdkwcp/release:latest: unable to upload new layer (0): Patch "https://registry.build02.ci.openshift.org/v2/ci-op-phwdkwcp/release/blobs/uploads/b47751e3-2483-4339-a334-69829baef911?_state=Fa3B1HLwCBlMl1-yuXU2stAtMiUcb5TO597R2xNk74x7Ik5hbWUiOiJjaS1vcC1waHdka3djcC9yZWxlYXNlIiwiVVVJRCI6ImI0Nzc1MWUzLTI0ODMtNDMzOS1hMzM0LTY5ODI5YmFlZjkxMSIsIk9mZnNldCI6MCwiU3RhcnRlZEF0IjoiMjAyMi0wOC0wNFQxMjoyOToyMS4zMjQyMDc1ODVaIn0%3D": operator "machine-config-operator" contained an invalid image-references file: no input image tag named "rhel-coreos-8"

OK, so something else still needs to happen.

I think we need to find the place in the CI configuration to stitch in the production ART rhel-coreos-8 builds - the same thing we're doing with machine-os-content today; somewhere?

cgwalters added a commit to cgwalters/release that referenced this pull request Aug 4, 2022
…tent`

The former is going to replace the latter.  See
https://github.com/openshift/enhancements/blob/master/enhancements/ocp-coreos-layering.md

I'm *hoping* this is somehow what's necessary to have the image
that is now built by ART included in the CI payload, so that
we can get this PR to work:

openshift/machine-config-operator#3258 (comment)
@cgwalters
Copy link
Member

randomly trying openshift/release#31083

cgwalters added a commit to cgwalters/release that referenced this pull request Aug 4, 2022
The release payload name will be `rhel-coreos-8`, so let's have the CI config match.

I think this may unblock openshift/machine-config-operator#3258 (comment)

Closes: openshift/os#940
openshift-ci bot pushed a commit to openshift/release that referenced this pull request Aug 5, 2022
The release payload name will be `rhel-coreos-8`, so let's have the CI config match.

I think this may unblock openshift/machine-config-operator#3258 (comment)

Closes: openshift/os#940
@yuqi-zhang
Copy link
Contributor

/retest

2 similar comments
@yuqi-zhang
Copy link
Contributor

/retest

@cgwalters
Copy link
Member

/retest

@cgwalters
Copy link
Member

@jkyros did you do a PR for openshift/installer to have it pass through the new image references? We'll need that in order for bootstrap to work, right?

@cgwalters
Copy link
Member

I believe we need to land c63533d first, then PR to the installer

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 15, 2022

@jkyros: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp-op a83a3f9 link true /test e2e-gcp-op
ci/prow/e2e-aws a83a3f9 link true /test e2e-aws

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@cgwalters
Copy link
Member

@yuqi-zhang and I were talking about this and we think it makes sense to split this PR in two:

  • code changes sufficient to just reference the new image, but not use it
  • a second PR which actually tries to use it

The first hunk of this could land right now with no external dependencies (e.g. on the installer etc.)

@yuqi-zhang
Copy link
Contributor

We discussed this last week and I think we have reached consensus on these steps or so:

  1. get Allow overriding OSImageURL with a layered image  #3272 in
  2. split out this PR to just consume to the controllerconfig level, and then work on the mismatch via an installer PR
  3. merge the rest of this PR.

Does that still sound like the plan @jkyros

@jkyros
Copy link
Contributor Author

jkyros commented Aug 15, 2022

Yep, that's still the plan. I am in the process of splitting this.

@jkyros
Copy link
Contributor Author

jkyros commented Aug 16, 2022

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 16, 2022
@openshift-merge-robot
Copy link
Contributor

@jkyros: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

return false, err
}

if isBootable, ok := imageData.Labels["ostree.bootable"]; ok {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with this temporarily as a way to distinguish old vs new format, but I also think this check is better done in the ostree stack; done in ostreedev/ostree-rs-ext#356

Once we drop the old machine-os-content and hard require new format, I don't think we need to duplicate the error we'd get from rpm-ostree anyways.

@cgwalters
Copy link
Member

OK so this conflicts with #3302 right now - I think I vote we land this first (but disabled by default), then I'll rebase that PR and then rebase #3310 on top of that merged stack.

@cgwalters
Copy link
Member

OK I took a crack at rebasing this myself, but man there are a ton of conflicts. Is doing that on your near term radar? If not I'll probably look at this later today

@jkyros
Copy link
Contributor Author

jkyros commented Sep 1, 2022

Yes, I'll do that this afternoon. Most of this is already in in some fashion, what's left is really small.

If we're going for shortest path to test "new format by default", I'm almost tempted to:

  • close this
  • hard code the extensions container to our CI extensions image for MCO-289: OCPBUGS-1324: Teach the MCO to use new format image #3317
  • sneak in a DNS entry or IP address (or I guess extract the extensions to disk and use it that way).
  • use that to test "new image by default", because with the extensions container that should at least have a chance of passing our CI

@jkyros
Copy link
Contributor Author

jkyros commented Sep 6, 2022

Closing this in favor of #3317

@jkyros jkyros closed this Sep 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants