Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bootkube: Inject bootstrap MachineConfigs into cluster #2936

Closed
wants to merge 1 commit into from

Conversation

cgwalters
Copy link
Member

This is an ugly fix for
openshift/machine-config-operator#367

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jan 16, 2020
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign smarterclayton
You can assign the PR to them by writing /assign @smarterclayton in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

bn=$(basename $x)
(echo 'apiVersion: machineconfiguration.openshift.io/v1'
echo 'kind: MachineConfig'
python -c 'import sys,yaml; d=yaml.load(open(sys.argv[1])); del d["metadata"]["ownerReferences"]; yaml.dump(d, sys.stdout)' $x) > /opt/openshift/openshift/9a-${bn}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forget now why I was doing this deletion of the owner references.
Clearly what we should be doing is have the bootstrap process write out the API objects directly.

And we shouldn't depend on python.

@deads2k
Copy link
Contributor

deads2k commented Jan 16, 2020

We have hit this in the field caused by a cluster-admin not getting all the options quite right. As I understand it, this can alleviate the "your cluster isn't installing" problem automatically in the unfortunate case and carries no negative side-effects in the common case. We should get this into a state where it (or something like it) can merge.

We've also hit it during development. Basically, we aren't perfect and since we have the ability to make sure that the clusters start up, we should use it.

This doesn't necessarily obviate the need or benefit gotten from openshift/machine-config-operator#1376 if we use it to provide a clear indication during installation status that the machine-config-operator isn't using the same content that was initially desired. As I understand the current implications, right now the cluster is very difficult to recover cleanly because the actual content is missing. Having the content available and choosing to fail with a clear "do this simple update to opt-in/fix/update" is a big win comparatively.

@cgwalters
Copy link
Member Author

This overlaps with openshift/machine-config-operator#1376 - which makes it easier to debug. This PR would cause the system to automatically reconcile, and we're still debating whether that's a good idea or not.

For sure, if we merge this PR we need to add an e2e test that verifies each node in the cluster had exactly one reboot.

@deads2k
Copy link
Contributor

deads2k commented Jan 16, 2020

This overlaps with openshift/machine-config-operator#1376 - which makes it easier to debug. This PR would cause the system to automatically reconcile, and we're still debating whether that's a good idea or not.

A re-reconcile doesn't necessarily mean that the operator can't go degraded on an unexpected rollout. As I understand it, you can know that you are creating the "first" machineconfig because masters have known labels/names and you could annotate the machineconfig created during rendering. If you did that, the MCC can see that only one master machineconfig exists and knows that it should get exactly the same result. If it does not, it can still produce the expected output, but annotate in a way that say, "I've done this because I failed". That signal could be picked up by the operator to report a failing status with a reconciled cluster and the cluster-admin gets a signal that something didn't go right, but he has a functional cluster to start recovering from instead of one reporting NotFound

@abhinavdahiya
Copy link
Contributor

mc* is not allowed to reconcile multiple times during installation... It is a requirement because if during installation it is reconciling to multiple versions.. each would require a reboot. This is a special requirement for out mc* today.

So imo we shouldn't think about solving this by pushing old rendered configuration... rather making sure we don't push ignition to masters unless all work from machineconfigs, kubeletconfig etc has completed

@smarterclayton
Copy link
Contributor

It is a requirement because if during installation it is reconciling to multiple versions.. each would require a reboot.

Can you describe what you mean here?

I thought the impact of this change was trying to capture

a) installer desired state as rendered
b) attempted calculated desired state from MCO once it boots

so that when we have bugs that show that delta we recognize it.

I agree that we should have 2 e2e tests - one that on install we verify we only reboot every node exactly once, and one that on upgrade every node is rebooted exactly once, and that violating those is a bug. On an install however, it sounded like the argument was that in UPI when customers induce drift we need the extra info (installer gen vs MCO gen) to identify and highlight for the UPI user what their bug was. I am definitely supportive of relaxing the hard rule in order to better identify the problem, and the e2e tests are a prerequisite to ensure we are actually following that (e2e tests are better at validating than we are)

@deads2k
Copy link
Contributor

deads2k commented Jan 21, 2020

I agree that we should have 2 e2e tests - one that on install we verify we only reboot every node exactly once, and one that on upgrade every node is rebooted exactly once, and that violating those is a bug.

@cgwalters @runcom I think these tests can be written against master.

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 31, 2020
@openshift-ci-robot
Copy link
Contributor

@cgwalters: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jstuever
Copy link
Contributor

/uncc @jstuever
This PR appears to have become stale.

@openshift-ci-robot openshift-ci-robot removed the request for review from jstuever March 13, 2020 16:36
@openshift-ci-robot
Copy link
Contributor

@cgwalters: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-libvirt efa4cba link /test e2e-libvirt
ci/prow/e2e-openstack efa4cba link /test e2e-openstack
ci/prow/e2e-aws efa4cba link /test e2e-aws
ci/prow/e2e-ovirt efa4cba link /test e2e-ovirt
ci/prow/e2e-aws-fips efa4cba link /test e2e-aws-fips
ci/prow/e2e-aws-scaleup-rhel7 efa4cba link /test e2e-aws-scaleup-rhel7
ci/prow/shellcheck efa4cba link /test shellcheck
ci/prow/tf-lint efa4cba link /test tf-lint
ci/prow/yaml-lint efa4cba link /test yaml-lint
ci/prow/e2e-aws-upgrade efa4cba link /test e2e-aws-upgrade
ci/prow/verify-vendor efa4cba link /test verify-vendor
ci/prow/images efa4cba link /test images
ci/prow/govet efa4cba link /test govet
ci/prow/gofmt efa4cba link /test gofmt
ci/prow/unit efa4cba link /test unit
ci/prow/golint efa4cba link /test golint

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@abhinavdahiya
Copy link
Contributor

/close

closing due to inactivity.

@openshift-ci-robot
Copy link
Contributor

@abhinavdahiya: Closed this PR.

In response to this:

/close

closing due to inactivity.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants