bootkube: Inject bootstrap MachineConfigs into cluster #2936

cgwalters · 2020-01-16T14:42:43Z

This is an ugly fix for
openshift/machine-config-operator#367

This is an ugly fix for openshift/machine-config-operator#367

openshift-ci-robot · 2020-01-16T14:43:08Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign smarterclayton
You can assign the PR to them by writing /assign @smarterclayton in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

cgwalters · 2020-01-16T14:48:18Z

data/data/bootstrap/files/usr/local/bin/bootkube.sh.template

+    bn=$(basename $x)
+    (echo 'apiVersion: machineconfiguration.openshift.io/v1'
+     echo 'kind: MachineConfig'
+     python -c 'import sys,yaml; d=yaml.load(open(sys.argv[1])); del d["metadata"]["ownerReferences"]; yaml.dump(d, sys.stdout)' $x) > /opt/openshift/openshift/9a-${bn}


I forget now why I was doing this deletion of the owner references.
Clearly what we should be doing is have the bootstrap process write out the API objects directly.

And we shouldn't depend on python.

deads2k · 2020-01-16T14:48:49Z

We have hit this in the field caused by a cluster-admin not getting all the options quite right. As I understand it, this can alleviate the "your cluster isn't installing" problem automatically in the unfortunate case and carries no negative side-effects in the common case. We should get this into a state where it (or something like it) can merge.

We've also hit it during development. Basically, we aren't perfect and since we have the ability to make sure that the clusters start up, we should use it.

This doesn't necessarily obviate the need or benefit gotten from openshift/machine-config-operator#1376 if we use it to provide a clear indication during installation status that the machine-config-operator isn't using the same content that was initially desired. As I understand the current implications, right now the cluster is very difficult to recover cleanly because the actual content is missing. Having the content available and choosing to fail with a clear "do this simple update to opt-in/fix/update" is a big win comparatively.

cgwalters · 2020-01-16T14:51:21Z

This overlaps with openshift/machine-config-operator#1376 - which makes it easier to debug. This PR would cause the system to automatically reconcile, and we're still debating whether that's a good idea or not.

For sure, if we merge this PR we need to add an e2e test that verifies each node in the cluster had exactly one reboot.

deads2k · 2020-01-16T14:55:03Z

This overlaps with openshift/machine-config-operator#1376 - which makes it easier to debug. This PR would cause the system to automatically reconcile, and we're still debating whether that's a good idea or not.

A re-reconcile doesn't necessarily mean that the operator can't go degraded on an unexpected rollout. As I understand it, you can know that you are creating the "first" machineconfig because masters have known labels/names and you could annotate the machineconfig created during rendering. If you did that, the MCC can see that only one master machineconfig exists and knows that it should get exactly the same result. If it does not, it can still produce the expected output, but annotate in a way that say, "I've done this because I failed". That signal could be picked up by the operator to report a failing status with a reconciled cluster and the cluster-admin gets a signal that something didn't go right, but he has a functional cluster to start recovering from instead of one reporting NotFound

abhinavdahiya · 2020-01-17T03:12:09Z

mc* is not allowed to reconcile multiple times during installation... It is a requirement because if during installation it is reconciling to multiple versions.. each would require a reboot. This is a special requirement for out mc* today.

So imo we shouldn't think about solving this by pushing old rendered configuration... rather making sure we don't push ignition to masters unless all work from machineconfigs, kubeletconfig etc has completed

smarterclayton · 2020-01-21T20:17:07Z

It is a requirement because if during installation it is reconciling to multiple versions.. each would require a reboot.

Can you describe what you mean here?

I thought the impact of this change was trying to capture

a) installer desired state as rendered
b) attempted calculated desired state from MCO once it boots

so that when we have bugs that show that delta we recognize it.

I agree that we should have 2 e2e tests - one that on install we verify we only reboot every node exactly once, and one that on upgrade every node is rebooted exactly once, and that violating those is a bug. On an install however, it sounded like the argument was that in UPI when customers induce drift we need the extra info (installer gen vs MCO gen) to identify and highlight for the UPI user what their bug was. I am definitely supportive of relaxing the hard rule in order to better identify the problem, and the e2e tests are a prerequisite to ensure we are actually following that (e2e tests are better at validating than we are)

deads2k · 2020-01-21T20:38:46Z

I agree that we should have 2 e2e tests - one that on install we verify we only reboot every node exactly once, and one that on upgrade every node is rebooted exactly once, and that violating those is a bug.

@cgwalters @runcom I think these tests can be written against master.

openshift-ci-robot · 2020-01-31T01:12:43Z

@cgwalters: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

jstuever · 2020-03-13T16:36:35Z

/uncc @jstuever
This PR appears to have become stale.

openshift-ci-robot · 2020-04-22T00:46:47Z

@cgwalters: The following tests failed, say /retest to rerun all failed tests:

Test name	Commit	Details	Rerun command
ci/prow/e2e-libvirt	`efa4cba`	link	`/test e2e-libvirt`
ci/prow/e2e-openstack	`efa4cba`	link	`/test e2e-openstack`
ci/prow/e2e-aws	`efa4cba`	link	`/test e2e-aws`
ci/prow/e2e-ovirt	`efa4cba`	link	`/test e2e-ovirt`
ci/prow/e2e-aws-fips	`efa4cba`	link	`/test e2e-aws-fips`
ci/prow/e2e-aws-scaleup-rhel7	`efa4cba`	link	`/test e2e-aws-scaleup-rhel7`
ci/prow/shellcheck	`efa4cba`	link	`/test shellcheck`
ci/prow/tf-lint	`efa4cba`	link	`/test tf-lint`
ci/prow/yaml-lint	`efa4cba`	link	`/test yaml-lint`
ci/prow/e2e-aws-upgrade	`efa4cba`	link	`/test e2e-aws-upgrade`
ci/prow/verify-vendor	`efa4cba`	link	`/test verify-vendor`
ci/prow/images	`efa4cba`	link	`/test images`
ci/prow/govet	`efa4cba`	link	`/test govet`
ci/prow/gofmt	`efa4cba`	link	`/test gofmt`
ci/prow/unit	`efa4cba`	link	`/test unit`
ci/prow/golint	`efa4cba`	link	`/test golint`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

abhinavdahiya · 2020-05-08T23:11:39Z

/close

closing due to inactivity.

openshift-ci-robot · 2020-05-08T23:11:54Z

@abhinavdahiya: Closed this PR.

In response to this:

/close

closing due to inactivity.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

bootkube: Inject bootstrap MachineConfigs into cluster

efa4cba

This is an ugly fix for openshift/machine-config-operator#367

cgwalters mentioned this pull request Jan 16, 2020

bootkube: Inject bootstrap MachineConfigs into cluster #1189

Closed

openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jan 16, 2020

openshift-ci-robot requested review from jcpowermac and jstuever January 16, 2020 14:43

cgwalters commented Jan 16, 2020

View reviewed changes

sdodson mentioned this pull request Jan 20, 2020

Bug 1796627: cmd/openshift-install/create: add env var to preserve bootstrap #2935

Merged

openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 31, 2020

openshift-ci-robot removed the request for review from jstuever March 13, 2020 16:36

openshift-ci-robot closed this May 8, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bootkube: Inject bootstrap MachineConfigs into cluster #2936

bootkube: Inject bootstrap MachineConfigs into cluster #2936

cgwalters commented Jan 16, 2020

openshift-ci-robot commented Jan 16, 2020

cgwalters Jan 16, 2020

deads2k commented Jan 16, 2020

cgwalters commented Jan 16, 2020

deads2k commented Jan 16, 2020

abhinavdahiya commented Jan 17, 2020

smarterclayton commented Jan 21, 2020

deads2k commented Jan 21, 2020

openshift-ci-robot commented Jan 31, 2020

jstuever commented Mar 13, 2020

openshift-ci-robot commented Apr 22, 2020

abhinavdahiya commented May 8, 2020

openshift-ci-robot commented May 8, 2020

bootkube: Inject bootstrap MachineConfigs into cluster #2936

bootkube: Inject bootstrap MachineConfigs into cluster #2936

Conversation

cgwalters commented Jan 16, 2020

openshift-ci-robot commented Jan 16, 2020

cgwalters Jan 16, 2020

Choose a reason for hiding this comment

deads2k commented Jan 16, 2020

cgwalters commented Jan 16, 2020

deads2k commented Jan 16, 2020

abhinavdahiya commented Jan 17, 2020

smarterclayton commented Jan 21, 2020

deads2k commented Jan 21, 2020

openshift-ci-robot commented Jan 31, 2020

jstuever commented Mar 13, 2020

openshift-ci-robot commented Apr 22, 2020

abhinavdahiya commented May 8, 2020

openshift-ci-robot commented May 8, 2020