Bug 1994820: Degrade operator on cluster bootstrap if not all Machines are Running #1019
Conversation
@JoelSpeed: This pull request references Bugzilla bug 1994820, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
An example of the output from the clusteroperator conditions when bootstrapping with misconfigured machines:
When the install failed it failed with:
So this was a bright red and very obvious failure, though the rest of the cluster was up and running. Need to double-check our notes on what we thought about failing the cluster here.
/retest |
/lgtm
it seems weird, but should we have some e2e test to exercise the failure state here? or perhaps a unit test with a mocked client?
I wanted to get consensus on the approach before working on testing. A unit test with envtest is probably the best idea I have so far; having an E2E is a bit trickier.
+1, envtest sounds perfect to me |
@JoelSpeed: This pull request references Bugzilla bug 1994820, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
/retest |
we should discuss this further on standup, i think there are a few questions about the common approach and what will happen with the installer. cc @lobziik |
/retest |
1 similar comment
/retest |
My main concern is that we're basing the "isInitializing" logic on a bunch of assumptions and basically have no guarantee that the surrounding behaviour will not change. If we could reference some org-wide guidelines I would be way more calm about this. At the moment it just looks a bit fragile and obscure to me. If folks agree, I'm OK to go with this. Code-wise: lgtm.
/lgtm |
i think this is a good concern, i'm just wondering if there is any documentation that we could link to from the code comments? |
It is up to each operator to handle what happens when they start, and for now, to define when they vary the status of their ClusterOperator status. Even if this mechanism does fail somehow, it's more likely to fail into the more open state (ie the check doesn't happen) so it's unlikely that we will end up degrading some customer cluster if something does get broken during some code changes. This is basically an artifact of the cluster not having an initialising state defined, and that's ok, it's not meant to, it's a reconciling system. We just have to add this optimisation where we can based on what we have (IMO), if it works and helps some users, great, if it doesn't work, well, it's no worse than it is right now. If anyone has some suggestion for how to make this more bulletproof I'm happy to investigate |
i appreciate the deep research @JoelSpeed , i think for now we should use this as is. if we get an improvement in the future we can revisit, or if we find a way to be more definitive about the initializing state. |
bot is acting funny
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: elmiko The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/lgtm
/retest-required |
1 similar comment
/retest-required |
/retest-required |
/lgtm
/retest-required |
@JoelSpeed: The following tests failed, say
@JoelSpeed: All pull requests linked via external trackers have merged: Bugzilla bug 1994820 has been moved to the MODIFIED state. In response to this:
This was discussed on a recent cluster lifecycle arch call.
At the moment, we do not return any errors through the installation log if Machines fail to come up, even though these are often the cause of issues that we see frequently in CI (e.g. route not ready because there aren't enough worker nodes).
To make things a little more obvious, this change introduces a check for Machines during the initialization of the operator.
We require, in general, 2 worker machines to be running to host operators such as ingress.
If MachineSets are present in the cluster (this prevents errors on UPI or SNO), and there are fewer than 2 Machines in the running phase, this moves the operator into a degraded state until at least 2 Machines are running.
Once all Machines have moved to running, this will move the operator to available and therefore prevent the machine check from happening again in the future.
This means that we will now fail installations if fewer than 2 Machines come up.
It should help when there are generic Machine/cloud errors to help users diagnose the underlying issue, by identifying Machine API as a failed component rather than just the route and auth operators.
I have tested this by simulating a failed Machine on a running cluster and then resetting the ClusterOperator so that the operator thinks it is initializing for a second time, as well as by bootstrapping a cluster with only one MachineSet Machine working and one failing, and then eventually adding an extra working Machine.
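The check described above could be sketched roughly as follows. This is a minimal, self-contained Go sketch, not the actual operator code: the `machine` struct, `shouldDegrade`, and `phaseRunning` are hypothetical stand-ins for the real `machine.openshift.io` API types and the controller-runtime client the operator actually uses.

```go
package main

import "fmt"

// machine is a minimal stand-in for the Machine API object; the real
// operator lists machine.openshift.io Machines via a client.
type machine struct {
	name  string
	phase string
}

const phaseRunning = "Running"

// shouldDegrade reports whether the operator should mark itself
// Degraded during initialization: only when MachineSets exist in the
// cluster (this skips UPI/SNO clusters, which have none) and fewer
// than two Machines are in the Running phase.
func shouldDegrade(machineSetCount int, machines []machine) bool {
	if machineSetCount == 0 {
		// No MachineSets: nothing to check (UPI / single-node).
		return false
	}
	running := 0
	for _, m := range machines {
		if m.phase == phaseRunning {
			running++
		}
	}
	// Two worker Machines are generally required to host operators
	// such as ingress.
	return running < 2
}

func main() {
	machines := []machine{
		{name: "worker-a", phase: "Running"},
		{name: "worker-b", phase: "Failed"},
	}
	fmt.Println(shouldDegrade(1, machines)) // true: only one Machine Running
	machines[1].phase = "Running"
	fmt.Println(shouldDegrade(1, machines)) // false: two Machines Running
	fmt.Println(shouldDegrade(0, machines)) // false: no MachineSets present
}
```

Note the fail-open shape discussed above: if the MachineSet count is zero (or the listing simply doesn't happen), the function returns `false`, so a broken check skips degrading rather than wrongly degrading a healthy cluster.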