Add RHCOS oscontainer into payload, render to 00-$role-osimageurl MC #273

Open — wants to merge 8 commits into base: master from cgwalters:images-json-osimageurl
Conversation

7 participants
@cgwalters
Contributor

cgwalters commented Jan 8, 2019

This closes the gap in getting the RHCOS oscontainer into the update
payload, and having the MCO then render that down into the MachineConfigs.

Introduce a new ConfigMap machine-config-osimageurl that points to
the oscontainer and is applied by the CVO. Then, the controller
renders that into machineconfigs/00-$role-osimageurl which then finally
goes into the rendered config, and should be applied by the daemon.

Closes: #183

@openshift-ci-robot


openshift-ci-robot commented Jan 8, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot requested review from ashcrow and jlebon Jan 8, 2019

@cgwalters cgwalters force-pushed the cgwalters:images-json-osimageurl branch from a346594 to 0271d40 Jan 8, 2019

@cgwalters

Contributor

cgwalters commented Jan 8, 2019

The high level idea with this so far (still writing this code) is that we have a ConfigMap that points to the oscontainer and is applied by the CVO. Then, the controller renders that into machineconfigs/00-$role-osimageurl which should make it into the final config.

This supersedes #258 and also takes a small bit of #228, in that we're taking the first non-empty osImageURL.

@cgwalters

Contributor

cgwalters commented Jan 8, 2019

A very important thing here is that we need to make an immediate choice: do we pin the oscontainer in this repo and then update it via PRs? (That would be very noisy and painful, but would give us CI gating.) Or, in the short term, do we have it float :latest or so?

Or, (this would be my preference probably) matching discussion in openshift/installer#987 do we have the release controller "gather" it?

@cgwalters

Contributor

cgwalters commented Jan 8, 2019

(Hm, since AIUI the release controller reacts to ImageStream changes, if we fixed the pipeline to push into the main namespace it should then trigger release payloads?)

@cgwalters cgwalters force-pushed the cgwalters:images-json-osimageurl branch from 0271d40 to 1882c9e Jan 8, 2019

@smarterclayton

Member

smarterclayton commented Jan 8, 2019

I would recommend pushing to the image stream because it is exactly what ART will do, so we can then get the RHCoS image integrated into OCP exactly like origin more quickly. You can always control how often you push.

[Resolved review comment on install/image-references — outdated]
@smarterclayton

Member

smarterclayton commented Jan 8, 2019

Instead of pinning or pushing, you can also just manually tag a new one in, or we can set up a periodic job that grabs the latest maipo, tests it, and then promotes it if the test passes (I could set that up in < 1hr)

@cgwalters

Contributor

cgwalters commented Jan 8, 2019

or we can set up a periodic job that grabs the latest maipo, tests it, and then promotes it if the test passes (I could set that up in < 1hr)

This one sounds good! I'd be interested in seeing the code and understanding how that works...there's a lot of magic in the ci-operator/release stuff that I haven't yet fully grasped. The "test it" part is blocked a bit on this PR landing though right?

@cgwalters

Contributor

cgwalters commented Jan 8, 2019

And it turns out we aren't pushing a :latest today; we could easily do so, but it's probably better to get the "rhcos oscontainer promotion" going.

Although...maybe we really do need to separate out "add rhcos to payload" from "apply updates"?

Here's a possibility: we could add a config option or so to ignore osImageURL changes, and flip that on by default in this PR. That way landing this wouldn't impact development immediately.

@smarterclayton

Member

smarterclayton commented Jan 8, 2019

Yes: once you have an image, with a payload, it's really easy to script overlaying that test. I can go over what would be necessary there. That would block promoting a new image, but we would still need a job for PRs that overrides the image appropriately (which is not difficult).

@cgwalters cgwalters force-pushed the cgwalters:images-json-osimageurl branch from 1882c9e to 79b5001 Jan 9, 2019

@openshift-ci-robot openshift-ci-robot added size/L and removed size/M labels Jan 9, 2019

@cgwalters

Contributor

cgwalters commented Jan 9, 2019

OK, this is updated 🆕 — we now have a CONFIG_IGNORE_OSIMAGEURL environment variable and use it by default. This will allow us to land this PR to act as a mechanism, and then in both local development and CI we can override the env setting to test things out.

cgwalters added a commit to cgwalters/machine-config-operator that referenced this pull request Jan 9, 2019

controller: Take first non-empty osImageURL
Today each MC will contain both an Ignition fragment and an
`osImageURL`.  Define "merging" as using the first
non-empty `osImageURL` so we don't have to be very picky about
ordering.

This is a smaller version of
openshift#228

Prep for: openshift#273
@cgwalters

Contributor

cgwalters commented Jan 9, 2019

Hm:

error: unable to create a release: operator "machine-config-operator" failed to map images: image file "/tmp/release-image-0.0.1-2019-01-09-161203356873649/machine-config-operator/image-references" referenced image "machine-os-content" that is not part of the input images

Hmm but it worked in #258

@cgwalters

Contributor

cgwalters commented Jan 9, 2019

/test images

@cgwalters

Contributor

cgwalters commented Jan 9, 2019

/hold

@cgwalters

Contributor

cgwalters commented Jan 9, 2019

/retest

@cgwalters cgwalters force-pushed the cgwalters:images-json-osimageurl branch from 10a4353 to 527ade1 Jan 14, 2019

@cgwalters

Contributor

cgwalters commented Jan 14, 2019

Basically, we don't set the current config annotation on startup from the file; instead we let the MCD naturally do it itself if it deems that the system state matches the desired config (and otherwise, e.g., pivot if the osImageURLs don't match).

I opted to pass this as an explicit stamp file, see the top commit.

@ashcrow

Member

ashcrow commented Jan 14, 2019

--- FAIL: TestBootstrapServer (0.00s)
	server_test.go:197: could not find file: root/etc/machine-config-daemon/initial-config-required.stamp
	server_test.go:200: file validation failed for: root/etc/machine-config-daemon/initial-config-required.stamp, exp: {{root <nil> <nil> /etc/machine-config-daemon/initial-config-required.stamp <nil>} {false { data:, {<nil>}} 0xc42003c568}}, got: {{ <nil> <nil>  <nil>} {false {  {<nil>}} <nil>}}
--- FAIL: TestClusterServer (0.00s)
	server_test.go:197: could not find file: root/etc/machine-config-daemon/initial-config-required.stamp
	server_test.go:200: file validation failed for: root/etc/machine-config-daemon/initial-config-required.stamp, exp: {{root <nil> <nil> /etc/machine-config-daemon/initial-config-required.stamp <nil>} {false { data:, {<nil>}} 0xc420422080}}, got: {{ <nil> <nil>  <nil>} {false {  {<nil>}} <nil>}}
@ashcrow

Member

ashcrow commented Jan 14, 2019

Looks like the stamp file needs to be provided for unittests.

@jlebon
Member

jlebon left a comment

Yup, that makes sense to me!

[Resolved review comment on pkg/daemon/node.go — outdated]

@cgwalters cgwalters force-pushed the cgwalters:images-json-osimageurl branch from 9594afd to 43010b2 Jan 14, 2019

@ashcrow

Member

ashcrow commented Jan 14, 2019

I think it's failing because getAppenders is called in both _server.go files, and there's no initial file on the file system.

Edit: I see the update. 🤞

@cgwalters

Contributor

cgwalters commented Jan 14, 2019

OK, on this run I'm seeing the same issue: master machines timing out fetching MCs.

$ oc get machineconfigs
NAME                                      GENERATEDBYCONTROLLER        IGNITIONVERSION   OSIMAGEURL                                                                                                                    CREATED
00-master                                 3.11.0-453-gf0070874-dirty   2.2.0                                                                                                                                           3m
00-master-osimageurl                      3.11.0-453-gf0070874-dirty                     registry.svc.ci.openshift.org/ci-op-2zl77531/stable@sha256:61dc83d62cfb5054c4c5532bd2478742a0711075ef5151572e63f94babeacc1a   3m
00-master-ssh                             3.11.0-453-gf0070874-dirty                                                                                                                                                   3m
00-worker                                 3.11.0-453-gf0070874-dirty   2.2.0                                                                                                                                           3m
00-worker-osimageurl                      3.11.0-453-gf0070874-dirty                     registry.svc.ci.openshift.org/ci-op-2zl77531/stable@sha256:61dc83d62cfb5054c4c5532bd2478742a0711075ef5151572e63f94babeacc1a   3m
00-worker-ssh                             3.11.0-453-gf0070874-dirty                                                                                                                                                   3m
01-master-kubelet                         3.11.0-453-gf0070874-dirty   2.2.0                                                                                                                                           3m
01-worker-kubelet                         3.11.0-453-gf0070874-dirty   2.2.0                                                                                                                                           3m
master-383ca913310d861fee0be89e6f1d0127   3.11.0-453-gf0070874-dirty   2.2.0             registry.svc.ci.openshift.org/ci-op-2zl77531/stable@sha256:61dc83d62cfb5054c4c5532bd2478742a0711075ef5151572e63f94babeacc1a   3m
master-46c05bfb9cb3d4e05608277bb2cb0a5d   3.11.0-453-gf0070874-dirty   2.2.0                                                                                                                                           3m
master-7734c782bad1ead0f8ef5b6affcaf35c   3.11.0-453-gf0070874-dirty   2.2.0             registry.svc.ci.openshift.org/ci-op-2zl77531/stable@sha256:61dc83d62cfb5054c4c5532bd2478742a0711075ef5151572e63f94babeacc1a   3m
master-a206c9459a44d859587164a68bb484f2   3.11.0-453-gf0070874-dirty   2.2.0             registry.svc.ci.openshift.org/ci-op-2zl77531/stable@sha256:61dc83d62cfb5054c4c5532bd2478742a0711075ef5151572e63f94babeacc1a   3m
worker-f8ffb3b151cd556f0e52db58a59991e8   3.11.0-453-gf0070874-dirty   2.2.0             registry.svc.ci.openshift.org/ci-op-2zl77531/stable@sha256:61dc83d62cfb5054c4c5532bd2478742a0711075ef5151572e63f94babeacc1a   3m

OK, but the new MCS/MCD code worked fine:

$ oc logs -f pods/machine-config-daemon-g6j45 -p
I0114 22:28:05.900642    4685 start.go:52] Version: 3.11.0-453-gf0070874-dirty
I0114 22:28:05.903124    4685 start.go:88] starting node writer
I0114 22:28:05.911505    4685 run.go:22] Running captured: chroot /rootfs rpm-ostree status --json
I0114 22:28:05.951118    4685 daemon.go:153] Booted osImageURL: registry.svc.ci.openshift.org/rhcos/maipo@sha256:5e04a144af8106440fb2dd0ac494c562dece239f653134212546da4348adacaf (47.264)                        
I0114 22:28:05.954575    4685 daemon.go:222] Managing node: ip-10-0-152-195.ec2.internal
I0114 22:29:05.968259    4685 node.go:54] Setting initial node config: worker-f8ffb3b151cd556f0e52db58a59991e8                                                                                                    
I0114 22:29:05.980567    4685 start.go:139] Calling chroot("/rootfs")
I0114 22:29:05.999715    4685 daemon.go:392] Current+desired config: worker-f8ffb3b151cd556f0e52db58a59991e8                                                                                                      
I0114 22:29:06.000397    4685 daemon.go:572] Validated on-disk state
I0114 22:29:06.000489    4685 daemon.go:609] Booting into initial configuration; pivot required
I0114 22:29:06.000546    4685 update.go:646] Updating OS to registry.svc.ci.openshift.org/ci-op-2zl77531/stable@sha256:61dc83d62cfb5054c4c5532bd2478742a0711075ef5151572e63f94babeacc1a                           
I0114 22:29:06.000597    4685 run.go:13] Running: /bin/pivot registry.svc.ci.openshift.org/ci-op-2zl77531/stable@sha256:61dc83d62cfb5054c4c5532bd2478742a0711075ef5151572e63f94babeacc1a                          
pivot version 0.0.2
I0114 22:29:06.163992    5228 run.go:27] Running: rpm-ostree status --json
I0114 22:29:06.200747    5228 root.go:79] Previous pivot: registry.svc.ci.openshift.org/rhcos/maipo@sha256:5e04a144af8106440fb2dd0ac494c562dece239f653134212546da4348adacaf                                       
I0114 22:29:06.200868    5228 run.go:27] Running: skopeo inspect docker://registry.svc.ci.openshift.org/ci-op-2zl77531/stable@sha256:61dc83d62cfb5054c4c5532bd2478742a0711075ef5151572e63f94babeacc1a             
I0114 22:29:09.294667    5228 root.go:89] Resolved to: registry.svc.ci.openshift.org/ci-op-2zl77531/stable@sha256:61dc83d62cfb5054c4c5532bd2478742a0711075ef5151572e63f94babeacc1a                                
I0114 22:29:09.294780    5228 root.go:106] Pivoting to: 47.231 (b35193d6f2f34c2a38bbbca9e3bbe618cb1266ffa9c38d10050ec8cef2ccba79)                                                         
...
Removed:
  atomic-openshift-clients-4.0.0-0.136.0.git.0.3903800.el7.x86_64
  atomic-openshift-hyperkube-4.0.0-0.136.0.git.0.3903800.el7.x86_64
  atomic-openshift-node-4.0.0-0.136.0.git.0.3903800.el7.x86_64
  pyOpenSSL-0.13.1-4.el7.x86_64
  python-cffi-1.6.0-5.el7.x86_64
  python-enum34-1.0.4-1.el7.noarch
  python-idna-2.4-1.el7.noarch
  python-ply-3.4-11.el7.noarch
  python-pycparser-2.14-1.el7.noarch
  python2-cryptography-1.7.2-2.el7.x86_64
  python2-pyasn1-0.1.9-7.el7.noarch
  python2-pysocks-1.5.7-4.el7.noarch
  python2-urllib3-1.21.1-1.el7.noarch
Added:
  origin-clients-4.0.0-0.alpha.0.802.62de992.x86_64
  origin-hyperkube-4.0.0-0.alpha.0.802.62de992.x86_64
  origin-node-4.0.0-0.alpha.0.802.62de992.x86_64
  python-urllib3-1.10.2-5.el7.noarch
...
I0114 22:30:37.167113    4685 update.go:672] machine-config-daemon initiating reboot: Node will reboot (initial pivot) into config worker-f8ffb3b151cd556f0e52db58a59991e8
$ oc logs -f pods/machine-config-daemon-g6j45
I0114 22:31:33.224130    4570 start.go:52] Version: 3.11.0-453-gf0070874-dirty
I0114 22:31:33.226886    4570 start.go:88] starting node writer
I0114 22:31:33.274734    4570 run.go:22] Running captured: chroot /rootfs rpm-ostree status --json
I0114 22:31:33.415079    4570 daemon.go:153] Booted osImageURL: registry.svc.ci.openshift.org/ci-op-2zl77531/stable@sha256:61dc83d62cfb5054c4c5532bd2478742a0711075ef5151572e63f94babeacc1a (47.231)
I0114 22:31:33.443501    4570 daemon.go:222] Managing node: ip-10-0-152-195.ec2.internal
I0114 22:31:48.244057    4570 start.go:139] Calling chroot("/rootfs")
I0114 22:31:48.262181    4570 daemon.go:392] Current+desired config: worker-f8ffb3b151cd556f0e52db58a59991e8
I0114 22:31:48.264417    4570 daemon.go:572] Validated on-disk state
I0114 22:31:48.278430    4570 daemon.go:599] Completing pending config worker-f8ffb3b151cd556f0e52db58a59991e8
I0114 22:31:48.284692    4570 update.go:672] machine-config-daemon: completed update for config worker-f8ffb3b151cd556f0e52db58a59991e8
I0114 22:31:48.289525    4570 daemon.go:614] Completed initial pivot
I0114 22:31:48.289592    4570 daemon.go:616] Booting into initial configuration; no pivot required
I0114 22:31:48.289674    4570 daemon.go:624] In desired config worker-f8ffb3b151cd556f0e52db58a59991e8
I0114 22:31:48.289753    4570 start.go:158] Starting MachineConfigDaemon
I0114 22:31:48.289804    4570 daemon.go:241] Enabling Kubelet Healthz Monitor

Though one confusing thing about this is that the machineconfigpool is unaware of this state: it flags the workers as being updated while the reboots are happening... and in fact, the reboots won't be coordinated. So we do need to add an annotation or something for this.

@cgwalters

Contributor

cgwalters commented Jan 14, 2019

Alternatively...we could do the pivot early on (boot-time systemd unit), before the node joins the cluster. Positives: less confusion in the MCC, we don't potentially land workloads only to immediately drain them. Downsides: this bootstrap phase isn't visible in cluster logs directly (e.g. oc logs -f pods/machine-config-daemon). Maybe something else?

EDIT: Yeah, unless someone votes otherwise I am going to pursue having a basic redhat-coreos-initial-pivot.service and undo my changes to effectively implement this in the MCD.

That still leaves the MC cycling issue, I stuck that here: #301

@jlebon

Member

jlebon commented Jan 15, 2019

Hmm, would it be possible to bring up the node in drained mode and then once we confirm we're in the desired state (possibly after pivoting), we uncordon? That way, the pivot is still in the cluster logs, but we still don't schedule other things that aren't daemonsets.

cgwalters added a commit to cgwalters/machine-config-operator that referenced this pull request Jan 15, 2019

Add osimageurl to release payload and controllerconfig
Split out of openshift#273

This is the first part of closing the gap in getting the RHCOS oscontainer
into the update payload.

Introduce a new ConfigMap machine-config-osimageurl that points to
the oscontainer and is owned by the CVO.  Then, the operator propagates
that into the `controllerconfig` CRD.

Further patches will have the controller react to the update, but having
this first step in will help us validate the model without actually
having clusters update.

cgwalters added a commit to cgwalters/machine-config-operator that referenced this pull request Jan 15, 2019

Add osimageurl to release payload and controllerconfig
Split out of openshift#273

This is the first part of closing the gap in getting the RHCOS oscontainer
into the update payload.

Introduce a new ConfigMap machine-config-osimageurl that points to
the oscontainer and is owned by the CVO.  Then, the operator propagates
that into the `controllerconfig` CRD.

Further patches will have the controller react to the update, but having
this first step in will help us validate the model without actually
having clusters update.
@cgwalters

Contributor

cgwalters commented Jan 15, 2019

Split out prep into #305

@abhinavdahiya

Member

abhinavdahiya commented Jan 15, 2019

Didn't see any code so that the bootstrap MCO/MCC use the os image URL, to make sure the configuration is the same during the special-case bootstrapping and the long-running phase.

@cgwalters

Contributor

cgwalters commented Jan 15, 2019

Didn't see any code so that the bootstrap MCO/MCC use the os image URL, to make sure the configuration is the same during the special-case bootstrapping and the long-running phase.

Yes, I'm working on that - see #273 (comment)

cgwalters added some commits Jan 14, 2019

hack/cluster-push.sh: Add -server
First time I needed to hack on the MCS.
docs: Move MC creation bits to README.md
Since they're very much user facing.
controller: Add a 5s delay before rendering MCs
To reduce churn if MCs are being created rapidly - both on general
principle, and also to reduce our exposure to the current bug
that a booting node may fail to find a GC'd MachineConfig:
#301
Add osimageurl to release payload and controllerconfig
Split out of #273

This is the first part of closing the gap in getting the RHCOS oscontainer
into the update payload.

Introduce a new ConfigMap machine-config-osimageurl that points to
the oscontainer and is owned by the CVO.  Then, the operator propagates
that into the `controllerconfig` CRD.

Further patches will have the controller react to the update, but having
this first step in will help us validate the model without actually
having clusters update.
controller: Render 00-{master,worker}-osimageurl from controllerconfig
Part of #183

This ensures that the `osImageURL` as provided by the cluster update/release payload
makes it down into the rendered `MachineConfig` objects so that the MCD
will update the nodes.
Add rhcos-initial-pivot.service
Change the MCC to write the target osImageURL from the MC
to `/etc/rhcos-initial-pivot-target`.  This will then be
handled by the `rhcos-initial-pivot.service` systemd unit.

@cgwalters cgwalters force-pushed the cgwalters:images-json-osimageurl branch from 43010b2 to 5b346bc Jan 15, 2019

@cgwalters

Contributor

cgwalters commented Jan 15, 2019

OK still testing this, but here's a new approach. Includes #305 and other patches like #303

Implements a new systemd service rhcos-initial-pivot.service.

However, one thing I'm realizing is that while this new code should handle "secondary" machines, we may still need further special handling for the initial master (I think this is what this comment was getting at).

Handling that correctly seems like it's going to require extracting the osimageurl during the bootstrap process? I need to learn more about that code.

cgwalters added a commit to cgwalters/machine-config-operator that referenced this pull request Jan 15, 2019

Add osimageurl to release payload and controllerconfig
@ashcrow

Member

ashcrow commented Jan 15, 2019

/test e2e-aws

cgwalters added a commit to cgwalters/machine-config-operator that referenced this pull request Jan 15, 2019

Add osimageurl to release payload and controllerconfig