Design for osImageURL updates - integration with CVO/release payload #183

Open
cgwalters opened this Issue Nov 19, 2018 · 25 comments

@cgwalters
Contributor

cgwalters commented Nov 19, 2018

Design for osImageURL updates - integration with CVO/release payload

Prior internal discussion: https://url.corp.redhat.com/rht-internal-mco-discussion

Today the osImageURL of dummy:// is built into the MCO codebase.

We need to have a flow that provides updated oscontainer images into
the release payload.

Tentative proposal: ConfigMap object in release payload. RHCOS builds
send PRs to update that ConfigMap, which are gated by CI.

And the MCC then reads that ConfigMap when rendering the MachineConfig.
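As a rough illustration of that flow, here is a minimal Go sketch of the MCC reading such a ConfigMap payload and stamping the osImageURL into a rendered config. The type and function names (osMetadata, machineConfigSpec, renderWithOSImage) and the JSON field name are hypothetical stand-ins, not the real MCO types:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// osMetadata mirrors a subset of the proposed ConfigMap payload.
// The field name is an assumption, not a final schema.
type osMetadata struct {
	OSImageURL string `json:"osImageUrl"`
}

// machineConfigSpec is a stand-in for the real MachineConfig spec;
// only the field relevant to this proposal is shown.
type machineConfigSpec struct {
	OSImageURL string `json:"osImageURL"`
}

// renderWithOSImage sketches what the MCC could do: take the
// metadata.json blob from the release-payload ConfigMap and set
// the osImageURL on the rendered MachineConfig.
func renderWithOSImage(metadataJSON string, spec machineConfigSpec) (machineConfigSpec, error) {
	var md osMetadata
	if err := json.Unmarshal([]byte(metadataJSON), &md); err != nil {
		return spec, err
	}
	spec.OSImageURL = md.OSImageURL
	return spec, nil
}

func main() {
	blob := `{"osImageUrl": "registry.example.com/rhcos/os@sha256:abcd"}`
	rendered, err := renderWithOSImage(blob, machineConfigSpec{})
	if err != nil {
		panic(err)
	}
	fmt.Println(rendered.OSImageURL)
}
```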

Implementation details:

@jlebon

Member

jlebon commented Nov 19, 2018

For completeness, one alternative is to make it a MachineConfig object instead, which is then used as the base config containing the osImageURL in which all other (Ignition) configs are merged into. I think this maps slightly more closely to what's currently done today.
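For illustration, that alternative might look something like the sketch below: a base MachineConfig carrying only the osImageURL, with all other configs merged on top. The object name and digest are hypothetical:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  # Hypothetical name for a base config that carries only the OS image.
  name: 00-base-os
spec:
  # All other (Ignition) configs would be merged on top of this one.
  osImageURL: registry.svc.ci.openshift.org/rhcos/os-maipo@sha256:4efb36f4...
```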

That said, using a separate ConfigMap allows us to have richer metadata. E.g. pkg lists & diffs, OSTree commits & versions, etc... The idea being that an "Update available" button can show all that stuff without having to hunt down the osImage.

@ashcrow

Member

ashcrow commented Nov 19, 2018

Remember, this data will need to flow one way or another down into a MachineConfig. If it's easier we could add some more information to MCs themselves.

@jlebon

Member

jlebon commented Nov 19, 2018

Strawman:

apiVersion: v1
kind: ConfigMap
metadata:
  name: rhcos-release-info
data:
  metadata.json: '{"osImageUrl": "registry.svc.ci.openshift.org/rhcos/os-maipo@sha256:4efb36f476405f8e30256733ac900e11b70833dc6a6b54179db60a501fa47124", "ostree-version": "47.115", "ostree-checksum": "b0edcc594e65ccebcc917b0ead76ea6894c0d4d672a63ceee5bdc59976c55bf9", "pkgdiff": [["origin-clients", 2, {"NewPackage": ["origin-clients", "4.0.0-0.alpha.0.610.98ebf23", "x86_64"], "PreviousPackage": ["origin-clients", "4.0.0-0.alpha.0.607.49e9f08", "x86_64"]}], ...], "pkglist": ["GeoIP-1.5.0-13.el7.x86_64", "NetworkManager-1:1.12.0-7.el7_6.x86_64", ...]}'

Basically, we can include a subset of what coreos-assembler already outputs. Or maybe just a link to that meta.json instead?

@jlebon

Member

jlebon commented Nov 19, 2018

Remember, this data will need to flow one way or another down into a MachineConfig. If it's easier we could add some more information to MCs themselves.

I think it shouldn't be too hard to adapt the MCC either way so we shouldn't constrain ourselves too much on what's easier.

@ashcrow

Member

ashcrow commented Nov 19, 2018

I think it shouldn't be too hard to adapt the MCC either way so we shouldn't constrain ourselves too much on what's easier.

Fair. Rephrased: what's doable and consumable within specific constraints 😃

@cgwalters

Contributor

cgwalters commented Nov 19, 2018

Basically, we can include a subset of what coreos-assembler already outputs.

Do we need anything other than the "registry.svc.ci.openshift.org/rhcos/os-maipo@sha256:4efb36f476405f8e30256733ac900e11b70833dc6a6b54179db60a501fa47124" string?

EDIT: Ah sorry I missed this:

The idea being that an "Update available" button can show all that stuff without having to hunt down the osImage.

Yeah, though...a tricky part about this is that the pkgdiff is against the previous version, which may not be the one they're updating to...

Mmm. I'm OK stuffing the whole meta.json in there for now, but I suspect we're going to need to add something smarter later.

@jlebon

Member

jlebon commented Nov 20, 2018

Yeah, though...a tricky part about this is that the pkgdiff is against the previous version, which may not be the one they're updating to...

Yeah, I think to do this correctly, the pkg diff would have to be computed from the pkg lists rather than precomputed. Anyway, those are things that could easily come later. We just have to make sure we leave the door open for it.
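Computing the diff from two pkg lists can be sketched in a few lines of Go; diffPkgLists is a hypothetical helper, not part of the MCO or coreos-assembler:

```go
package main

import "fmt"

// diffPkgLists computes the added/removed entries between two package
// lists (e.g. the "pkglist" arrays from two release metadata blobs),
// illustrating the "compute the diff from the lists" idea rather than
// shipping a precomputed pkgdiff against an arbitrary previous build.
func diffPkgLists(from, to []string) (added, removed []string) {
	fromSet := make(map[string]bool, len(from))
	for _, p := range from {
		fromSet[p] = true
	}
	toSet := make(map[string]bool, len(to))
	for _, p := range to {
		toSet[p] = true
		if !fromSet[p] {
			added = append(added, p)
		}
	}
	for _, p := range from {
		if !toSet[p] {
			removed = append(removed, p)
		}
	}
	return added, removed
}

func main() {
	from := []string{"origin-clients-4.0.0-0.alpha.0.607.x86_64", "GeoIP-1.5.0-13.el7.x86_64"}
	to := []string{"origin-clients-4.0.0-0.alpha.0.610.x86_64", "GeoIP-1.5.0-13.el7.x86_64"}
	added, removed := diffPkgLists(from, to)
	fmt.Println(added, removed)
}
```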

Mmm. I'm OK stuffing the whole meta.json in there for now, but I suspect we're going to need to add something smarter later.

Yeah, that sounds fine to me.

@smarterclayton

Member

smarterclayton commented Nov 20, 2018

Be aware you’re going to eventually need to have a distinction between “the config says this is the latest” and “I’m ready to roll that out to the nodes”. Design with that in mind because new kubelets are going to happen ~ 1/week

@smarterclayton

Member

smarterclayton commented Nov 20, 2018

Pkgdiff is useless. There is no guarantee it has any relevance. Think package manifest instead of diffs. Our errata and higher level tools will calculate the diff

@jlebon

Member

jlebon commented Nov 21, 2018

Be aware you’re going to eventually need to have a distinction between “the config says this is the latest” and “I’m ready to roll that out to the nodes”. Design with that in mind because new kubelets are going to happen ~ 1/week

Can you elaborate on that? Do you mean making sure we actually upgrade to the selected release instead of whatever happens to now be the latest at the time we're actually ready to upgrade? I think that should be covered yeah. The metadata includes the full sha256 of the oscontainer.
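Pinning by digest is what makes that guarantee hold; a small Go sketch of the check (isDigestPinned is a hypothetical helper, not an MCO function):

```go
package main

import (
	"fmt"
	"strings"
)

// isDigestPinned reports whether an image reference is pinned by
// digest (...@sha256:<64 hex chars>) rather than a floating tag.
// Pinning by digest ensures we upgrade to the selected release, not
// whatever "latest" happens to be at rollout time.
func isDigestPinned(ref string) bool {
	i := strings.Index(ref, "@sha256:")
	if i < 0 {
		return false
	}
	digest := ref[i+len("@sha256:"):]
	if len(digest) != 64 {
		return false
	}
	for _, c := range digest {
		if !strings.ContainsRune("0123456789abcdef", c) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isDigestPinned("registry.svc.ci.openshift.org/rhcos/os-maipo@sha256:4efb36f476405f8e30256733ac900e11b70833dc6a6b54179db60a501fa47124"))
	fmt.Println(isDigestPinned("registry.svc.ci.openshift.org/rhcos/os-maipo:latest"))
}
```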

Pkgdiff is useless. There is no guarantee it has any relevance. Think package manifest instead of diffs. Our errata and higher level tools will calculate the diff

Yup, see coreos/coreos-assembler#226.

@smarterclayton

Member

smarterclayton commented Nov 26, 2018

We don't decide on version 4.0.6 until the last moment. So the goal of automation is to continuously have a set of 4.0.6 candidates, from which we then pick one and ship it. The "train" mindset, not the "artisanal release payload".

@smarterclayton

Member

smarterclayton commented Nov 30, 2018

The expectation is that you will build an image and push it to the OpenShift Origin integration image stream, then reference the component "os" from this operator and have the dummy value substituted.

In OCP, we do something similar where the OS content gets built via whatever process and shows up in the prerelease list, and is processed the same way.

You can build anywhere - we just need you to push/be imported to the right place

@smarterclayton

Member

smarterclayton commented Nov 30, 2018

Note the dummy value can be real, but since we need to mirror the content, you have to be referenced in the operator list, which means you need to be sucked into the right place.

Push vs scheduled import is also possible, but if we do scheduled import the source location has to be appropriately gated like a push would (only changes if you test against latest in an install)

@wking

Member

wking commented Nov 30, 2018

... but if we do scheduled import the source location has to be appropriately gated like a push would (only changes if you test against latest in an install)

That breaks the scheduled-import model, doesn't it? How do you know the "latest" test used for the gate hadn't been surpassed by further release-payload work? Or will errors there be caught by post-testing?

@cgwalters

Contributor

cgwalters commented Dec 6, 2018

My notes from playing around with MCD state so far:

I am still a bit confused as to the current flow of the osImageURL; should oc edit machineconfig/00-worker trigger rendering a new one? Currently I edit the generated one.

Next, currently our oscontainers are uploaded to api.ci under the rhcos/ namespace, they require a separate pull secret, and we don't have a defined way to upload that to nodes. Actually, I am a bit confused by pull secrets today: openshift/installer#775 merged, but I don't see /root/.docker being created on my master/worker.

For some reason it's not working to podman login manually, need to trace that.

@ashcrow

Member

ashcrow commented Dec 6, 2018

I am still a bit confused as to the current flow of the osImageURL; should oc edit machineconfig/00-worker trigger rendering a new one? Currently I edit the generated one.

I've never tried to edit the source and always edited the generated content. Only generated versions should ever be available to the MCD from the MCO.

FWIW, the resource version should get updated on edit. The MCO is in charge of updating the annotations, e.g. pkg/controller/node/node_controller.go:syncMachineConfigPool(..). When machineconfiguration.openshift.io/desiredConfig doesn't match machineconfiguration.openshift.io/currentConfig, the MCD will attempt to take action.
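That trigger condition is just an annotation comparison; a minimal Go sketch (needsUpdate is a hypothetical helper, and the annotation keys are taken from the comment above, not the real MCO constants):

```go
package main

import "fmt"

// Annotation keys as quoted above; the real constants live in the
// MCO codebase.
const (
	currentConfig = "machineconfiguration.openshift.io/currentConfig"
	desiredConfig = "machineconfiguration.openshift.io/desiredConfig"
)

// needsUpdate sketches the MCD-side check: take action only when the
// node's desiredConfig annotation no longer matches currentConfig.
func needsUpdate(annotations map[string]string) bool {
	return annotations[currentConfig] != annotations[desiredConfig]
}

func main() {
	node := map[string]string{
		currentConfig: "00-worker",
		desiredConfig: "00-worker-updated",
	}
	fmt.Println(needsUpdate(node))
}
```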

Next, currently our oscontainers are uploaded to api.ci under the rhcos/ namespace and they require a separate pull secret, and we don't have a defined way to upload that to nodes. Actually I am a bit confused by pull secrets today, there is openshift/installer#775 which merged but I don't see /root/.docker being created on my master/worker?

That I'm not sure about ☹️

@cgwalters

Contributor

cgwalters commented Dec 6, 2018

Looked at that PR more carefully and it's only about the bootstrap. The pull secret goes into the main oc get configmap -n kube-system cluster-config-v1 (thanks Kirsten for mentioning that one earlier!).

What I don't understand yet is where that pull secret ends up on the nodes.

@kikisdeliveryservice

Member

kikisdeliveryservice commented Dec 6, 2018

@jlebon

Member

jlebon commented Dec 6, 2018

I am still a bit confused as to the current flow of the osImageURL; should oc edit machineconfig/00-worker trigger rendering a new one? Currently I edit the generated one.

Hmm, I think you're right that that should trigger a regeneration of a new machineconfig. I'm trying that here, but my MCC now is hitting:

E1206 15:14:13.341864       1 reflector.go:322] github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/informers/factory.go:130: Failed to watch *v1.Node: Get https://172.30.0.1:443/api/v1/nodes?resourceVersion=9818&timeoutSeconds=360&watch=true: dial tcp 172.30.0.1:443: connect: connection refused
E1206 15:14:13.341972       1 reflector.go:322] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.ControllerConfig: Get https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/controllerconfigs?resourceVersion=4441&timeoutSeconds=365&watch=true: dial tcp 172.30.0.1:443: connect: connection refused
E1206 15:14:13.342012       1 reflector.go:322] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.MachineConfigPool: Get https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigpools?resourceVersion=7665&timeoutSeconds=503&watch=true: dial tcp 172.30.0.1:443: connect: connection refused
E1206 15:14:13.342044       1 reflector.go:322] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.MachineConfig: Get https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?resourceVersion=7316&timeoutSeconds=387&watch=true: dial tcp 172.30.0.1:443: connect: connection refused

which is a flavour of some of the errors people were hitting in #199. Will try to dig deeper there.

Next, currently our oscontainers are uploaded to api.ci under the rhcos/ namespace and they require a separate pull secret, and we don't have a defined way to upload that to nodes. Actually I am a bit confused by pull secrets today, there is openshift/installer#775 which merged but I don't see /root/.docker being created on my master/worker?

I will admit the hacky way I've been testing upgrades so far includes prepending the machine-config-daemon invocation in the daemonset config with oc login ... &&. Fixing that to use secrets would be awesome!

@cgwalters

Contributor

cgwalters commented Dec 6, 2018

Yep, I missed it somehow; I see it now. It goes to /var/lib/kubelet/config.json. Adding the secret there works.

@kikisdeliveryservice

Member

kikisdeliveryservice commented Dec 6, 2018

@jlebon those timeouts are related to the openshift apiserver: openshift/origin#21612

@aaronlevy

Member

aaronlevy commented Dec 6, 2018

I am still a bit confused as to the current flow of the osImageURL; should oc edit machineconfig/00-worker trigger rendering a new one? Currently I edit the generated one.

My understanding is that the source should be edited (which should result in a new generated config). We should not be in the habit of editing the generated config (and if that happened, the MCC should actually roll out a non-edited config to all nodes -- as that is built from the canonical source(s)).

It might help to add docs along the lines of, "As a user, how do I modify host configuration?" and it points to creating a new (source) machine config (as a layer that will be merged with config we control), and outlining that you should not be editing generated configs (that will ultimately be stomped on by MCC rolled-out source-generated configs anyway).

cc @abhinavdahiya: are these assumptions still correct?

@jlebon

Member

jlebon commented Dec 6, 2018

I am still a bit confused as to the current flow of the osImageURL; should oc edit machineconfig/00-worker trigger rendering a new one? Currently I edit the generated one.

Hmm, I think you're right that that should trigger a regeneration of a new machineconfig.

OK yeah, this does work for me. After sorting out the MCC issue (@kikisdeliveryservice thanks! I tried out the workaround there and it seems like it worked), and doing oc edit machineconfig 00-worker, the daemon shows:

I1206 19:46:08.991607    4369 update.go:486] Updating OS to http://example.com
I1206 19:46:08.991616    4369 run.go:13] Running: /bin/pivot http://example.com
pivot version 0.0.2
...
@jlebon

Member

jlebon commented Dec 6, 2018

My understanding is that the source should be edited (which should result in a new generated config).

Yeah, this is mostly for testing stuff out.

It might help to add docs along the lines of, "As a user, how do I modify host configuration?" and it points to creating a new (source) machine config (as a layer that will be merged with config we control), and outlining that you should not be editing generated configs (that will ultimately be stomped on by MCC rolled-out source-generated configs anyway).

The issue is that when the MCC merges configs, it doesn't replace the base osImageURL:

// It only uses the OSImageURL from first object and ignores it from rest.

(And IIUC, the first config is currently the 00 one generated from the baked-in template, though the osImageURL part of it comes from: .)
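Those merge semantics can be sketched in Go; machineConfig and mergeConfigs below are simplified stand-ins for the real MCC types and merge logic:

```go
package main

import "fmt"

// machineConfig is a simplified stand-in; only the fields relevant
// to the merge behavior quoted above are shown.
type machineConfig struct {
	Name       string
	OSImageURL string
}

// mergeConfigs sketches the quoted MCC behavior: the rendered config
// takes OSImageURL from the first object and ignores it in the rest
// (other fields would be merged normally, which is elided here).
func mergeConfigs(configs []machineConfig) machineConfig {
	merged := machineConfig{Name: "rendered"}
	if len(configs) > 0 {
		merged.OSImageURL = configs[0].OSImageURL
	}
	return merged
}

func main() {
	out := mergeConfigs([]machineConfig{
		{Name: "00-worker", OSImageURL: "registry.example.com/os@sha256:first"},
		{Name: "10-custom", OSImageURL: "registry.example.com/os@sha256:ignored"},
	})
	fmt.Println(out.OSImageURL)
}
```

This is why editing a later (source) config's osImageURL has no effect: only the base config's value survives the merge.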

Eventually, testing an OS update for hacking could be done by changing the configmap directly (or whatever we settle on in this ticket). (Or at an even higher level, pointing at a custom release payload).

@jlebon

Member

jlebon commented Dec 6, 2018

To add some context to the previous comment: this is strictly for testing changes to osImageURL. Other filesystem updates that are part of the Ignition spec are merged in, so can be done indeed by creating a new source machine config that gets squashed into a new generated machine config.
