bootkube: Inject bootstrap MachineConfigs into cluster #2936
Conversation
This is an ugly fix for openshift/machine-config-operator#367
[APPROVALNOTIFIER] This PR is NOT APPROVED. It needs approval from an approver for each of the relevant files. The full list of commands accepted by this bot can be found here.
```shell
bn=$(basename $x)
(echo 'apiVersion: machineconfiguration.openshift.io/v1'
 echo 'kind: MachineConfig'
 python -c 'import sys,yaml; d=yaml.load(open(sys.argv[1])); del d["metadata"]["ownerReferences"]; yaml.dump(d, sys.stdout)' $x) > /opt/openshift/openshift/9a-${bn}
```
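To illustrate what the inline python does, here is a hedged standalone sketch of the same `ownerReferences` deletion. It uses the stdlib `json` module in place of PyYAML so it runs without extra dependencies, and the manifest content is invented for illustration:

```python
import json

# A made-up rendered MachineConfig manifest (JSON stands in for YAML here).
manifest = {
    "apiVersion": "machineconfiguration.openshift.io/v1",
    "kind": "MachineConfig",
    "metadata": {
        "name": "rendered-master-example",
        "ownerReferences": [
            {"kind": "ControllerConfig", "name": "machine-config-controller"}
        ],
    },
}

# Drop the ownerReferences so the object can be re-created standalone,
# mirroring the `del d["metadata"]["ownerReferences"]` in the snippet above.
# .pop(..., None) also tolerates manifests that never had the field.
manifest["metadata"].pop("ownerReferences", None)

print(json.dumps(manifest, indent=2))
```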
I forget now why I was doing this deletion of the owner references.
Clearly what we should be doing is to have the bootstrap process write out the API objects directly. And we shouldn't depend on python.
We have hit this in the field, caused by a cluster-admin not getting all the options quite right. As I understand it, this can automatically alleviate the "your cluster isn't installing" problem in the unfortunate case, and it carries no negative side effects in the common case. We should get this into a state where it (or something like it) can merge. We've also hit it during development. Basically, we aren't perfect, and since we have the ability to make sure that clusters start up, we should use it. This doesn't necessarily obviate the need for, or the benefit gained from, openshift/machine-config-operator#1376 if we use it to provide a clear indication in the installation status that the machine-config-operator isn't using the same content that was initially desired. As I understand the current implications, right now the cluster is very difficult to recover cleanly because the actual content is missing. Having the content available and choosing to fail with a clear "do this simple update to opt-in/fix/update" is a big win comparatively.
This overlaps with openshift/machine-config-operator#1376, which makes it easier to debug. This PR would cause the system to automatically reconcile, and we're still debating whether that's a good idea or not. For sure, if we merge this PR we need to add an e2e test that verifies each node in the cluster had exactly one reboot.
A re-reconcile doesn't necessarily mean that the operator can't go degraded on an unexpected rollout. As I understand it, you can know that you are creating the "first" machineconfig because masters have known labels/names, and you could annotate the machineconfig created during rendering. If you did that, the MCC can see that only one master machineconfig exists and knows that it should get exactly the same result. If it does not, it can still produce the expected output, but annotate it in a way that says, "I've done this because I failed". That signal could be picked up by the operator to report a failing status on a reconciled cluster; the cluster-admin gets a signal that something didn't go right, but has a functional cluster to start recovering from instead of one reporting NotFound.
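The annotation idea above could look something like this. This is only a sketch: the annotation key is invented here for illustration, not an existing MCO API:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: rendered-master-example
  annotations:
    # Hypothetical marker set on the machineconfig rendered at install time,
    # letting the MCC detect whether its own rendering matches what the
    # installer produced.
    machineconfiguration.openshift.io/installer-rendered: "true"
```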
mc* is not allowed to reconcile multiple times during installation. It is a requirement because, if during installation it reconciles to multiple versions, each would require a reboot. This is a special requirement for our mc* today. So imo we shouldn't think about solving this by pushing old rendered configuration, but rather by making sure we don't push ignition to masters unless all work from machineconfigs, kubeletconfig etc. has completed.
Can you describe what you mean here? I thought the impact of this change was trying to capture the installer-desired state as rendered, so that when we have bugs that show that delta, we recognize it. I agree that we should have 2 e2e tests: one that verifies on install that we reboot every node exactly once, and one that verifies on upgrade that every node is rebooted exactly once, and that violating those is a bug. On an install, however, it sounded like the argument was that in UPI, when customers induce drift, we need the extra info (installer gen vs MCO gen) to identify and highlight for the UPI user what their bug was. I am definitely supportive of relaxing the hard rule in order to better identify the problem, and the e2e tests are a prerequisite to ensure we are actually following that (e2e tests are better at validating than we are).
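The "exactly one reboot per node" check discussed above could be sketched roughly as follows. This is a hypothetical helper, assuming reboot counts per node have already been collected somehow (e.g. from `journalctl --list-boots` on each host):

```python
def assert_exactly_one_reboot(reboot_counts):
    """Fail if any node's observed reboot count differs from 1.

    reboot_counts maps node name -> number of reboots observed
    since the install (or upgrade) began.
    """
    offenders = {node: n for node, n in reboot_counts.items() if n != 1}
    if offenders:
        raise AssertionError(f"nodes with reboot count != 1: {offenders}")

# Example: all masters rebooted exactly once, so this passes silently.
assert_exactly_one_reboot({"master-0": 1, "master-1": 1, "master-2": 1})
```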
@cgwalters @runcom I think these tests can be written against master. |
@cgwalters: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/uncc @jstuever |
@cgwalters: The following tests failed.
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
/close
Closing due to inactivity.
@abhinavdahiya: Closed this PR. In response to this:
> This is an ugly fix for openshift/machine-config-operator#367