
Bug 1994820: Degrade operator on cluster bootstrap if not all Machines are Running #1019

Merged
merged 2 commits into openshift:master
Jul 2, 2022

Conversation

JoelSpeed
Contributor

@JoelSpeed JoelSpeed commented May 23, 2022

This was discussed on a recent cluster lifecycle arch call.

At the moment, we do not return any errors through the installation log if Machines fail to come up, even though these are often the cause of several issues that we see frequently in CI (eg route not ready because there aren't enough worker nodes).

To make things a little more obvious, this change introduces a check for Machines during the initialization of the operator.
We require, in general, 2 worker Machines to be running to host operators such as ingress.
If MachineSets are present in the cluster (this prevents errors on UPI or SNO), and there are fewer than 2 Machines in the Running phase, this moves the operator into a degraded state until at least 2 Machines are running.
Once all Machines have moved to Running, this moves the operator to Available and therefore prevents the machine check from happening again in the future.

This means that we will now fail installations if fewer than 2 Machines come up.
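
The check described above can be sketched roughly as follows. This is a hypothetical, simplified illustration, not the actual operator code: `Machine` here is a stand-in struct for the Machine API type, and the real implementation reads MachineSets and Machines via the cluster client.

```go
package main

import "fmt"

// Machine is a simplified stand-in for the Machine API type (hypothetical,
// for illustration only).
type Machine struct {
	Name  string
	Phase string // e.g. "Provisioning", "Running", "Failed"
}

const minRunningMachines = 2

// checkBootstrapMachines mirrors the idea in this PR: only when MachineSets
// exist (so UPI/SNO clusters skip the check) and fewer than 2 Machines are in
// the Running phase do we report a degraded condition naming the non-running
// Machines.
func checkBootstrapMachines(machineSetCount int, machines []Machine) (degraded bool, msg string) {
	if machineSetCount == 0 {
		return false, "" // UPI or SNO: no MachineSets, skip the check
	}
	running := 0
	var notRunning []string
	for _, m := range machines {
		if m.Phase == "Running" {
			running++
		} else {
			notRunning = append(notRunning, m.Name)
		}
	}
	if running < minRunningMachines {
		return true, fmt.Sprintf("found %d non running machine(s): %v", len(notRunning), notRunning)
	}
	return false, ""
}

func main() {
	degraded, msg := checkBootstrapMachines(1, []Machine{
		{Name: "worker-a", Phase: "Running"},
		{Name: "worker-b", Phase: "Failed"},
	})
	fmt.Println(degraded, msg) // prints: true found 1 non running machine(s): [worker-b]
}
```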

When there are generic Machine/cloud errors, it should help users diagnose the underlying issue by identifying Machine API as the failed component, rather than just the route and auth operators.

I have tested this by simulating a failed Machine on a running cluster and then resetting the ClusterOperator so that the operator thinks it is initializing for a second time, as well as bootstrapping a cluster with only 1 MachineSet machine working, and 1 failing, and then eventually adding an extra working Machine.

@openshift-ci openshift-ci bot added bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels May 23, 2022
@openshift-ci
Contributor

openshift-ci bot commented May 23, 2022

@JoelSpeed: This pull request references Bugzilla bug 1994820, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.0) matches configured target release for branch (4.11.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @sunzhaohua2

In response to this:

Bug 1994820: Degrade operator on cluster bootstrap if not all Machines are Running

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@JoelSpeed
Contributor Author

An example of the output from the clusteroperator conditions when bootstrapping with a misconfigured machine:

  conditions:
  - lastTransitionTime: "2022-05-23T15:58:38Z"
    message: 'Progressing towards operator: 4.11.0-0.ci.test-2022-05-23-152641-ci-ln-ghys7m2-latest'
    reason: SyncingResources
    status: "True"
    type: Progressing
  - lastTransitionTime: "2022-05-23T16:01:58Z"
    message: 'Failed when progressing towards operator: 4.11.0-0.ci.test-2022-05-23-152641-ci-ln-ghys7m2-latest
      because found 1 non running machine(s): jspeed-test-4l4vz-worker-us-east-2a-tb8zz'
    reason: SyncingFailed
    status: "True"
    type: Degraded
  - lastTransitionTime: "2022-05-23T15:58:38Z"
    message: Operator is initializing
    reason: Initializing
    status: "False"
    type: Available
  - lastTransitionTime: "2022-05-23T15:58:38Z"
    status: "True"
    type: Upgradeable

@JoelSpeed
Contributor Author

When the install failed it failed with:

ERROR Cluster operator machine-api Degraded is True with SyncingFailed: Failed when progressing towards operator: 4.11.0-0.ci.test-2022-05-23-152641-ci-ln-ghys7m2-latest because found 1 non running machine(s): jspeed-test-4l4vz-worker-us-east-2a-tb8zz

So this was a bright red and very obvious failure, though the rest of the cluster was up and running. I need to double-check our notes on what we thought about failing the cluster here.

@JoelSpeed
Contributor Author

/retest

Contributor

@elmiko elmiko left a comment


/lgtm

it seems weird, but should we have some e2e test to exercise the failure state here? or perhaps a unit test with a mocked client?

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 26, 2022
@JoelSpeed
Contributor Author

it seems weird, but should we have some e2e test to exercise the failure state here? or perhaps a unit test with a mocked client?

I wanted to get consensus on the approach before working on testing. A unit test with envtest is probably the best idea I have so far, having an E2E is a bit trickier

@elmiko
Contributor

elmiko commented May 26, 2022

I wanted to get consensus on the approach before working on testing. A unit test with envtest is probably the best idea I have so far, having an E2E is a bit trickier

+1, envtest sounds perfect to me

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 1, 2022
@openshift-ci
Contributor

openshift-ci bot commented Jun 1, 2022

@JoelSpeed: This pull request references Bugzilla bug 1994820, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.0) matches configured target release for branch (4.11.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @sunzhaohua2

In response to this:

Bug 1994820: Degrade operator on cluster bootstrap if not all Machines are Running


@JoelSpeed
Contributor Author

/retest

@elmiko
Contributor

elmiko commented Jun 8, 2022

we should discuss this further on standup, i think there are a few questions about the common approach and what will happen with the installer. cc @lobziik

@JoelSpeed
Contributor Author

/retest

1 similar comment
@JoelSpeed
Contributor Author

/retest

@lobziik
Contributor

lobziik commented Jun 21, 2022

My main concern here is that we are basing the "isInitializing" logic on a bunch of assumptions and basically have no guarantees that the behaviour around it will not change. If we could reference some org-wide guidelines I would be much calmer about this. At the moment it just looks a bit fragile and obscure to me.

If folks agree, I'm OK to go with this.

Code wise - lgtm.

@lobziik
Contributor

lobziik commented Jun 21, 2022

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 21, 2022
@elmiko
Contributor

elmiko commented Jun 21, 2022

My main concern here is that we are basing the "isInitializing" logic on a bunch of assumptions and basically have no guarantees that the behaviour around it will not change. If we could reference some org-wide guidelines I would be much calmer about this. At the moment it just looks a bit fragile and obscure to me.

i think this is a good concern, i'm just wondering if there is any documentation that we could link to from the code comments?

@JoelSpeed
Contributor Author

My main concern here is that we are basing the "isInitializing" logic on a bunch of assumptions and basically have no guarantees that the behaviour around it will not change. If we could reference some org-wide guidelines I would be much calmer about this. At the moment it just looks a bit fragile and obscure to me.

It is up to each operator to handle what happens when it starts, and, for now, to define when it varies its ClusterOperator status.
In MAO, we have always set the status to Initializing when we see an empty operator status, keeping it that way until we have, for the first time, set it to Available, and we should continue to do so.
This logic is encoded into our operator and is therefore under our control. No one else should be interacting with our ClusterOperator status (the CVO certainly doesn't), so there should be no way, in a running cluster, for the Initializing state to come back, unless someone manually manages to patch the status, which we assume isn't supported.

Even if this mechanism does fail somehow, it's more likely to fail into the more open state (i.e. the check doesn't happen), so it's unlikely that we will end up degrading a customer cluster if something gets broken during future code changes.

This is basically an artifact of the cluster not having an initialising state defined, and that's OK; it's not meant to, it's a reconciling system. We just have to add this optimisation where we can, based on what we have (IMO). If it works and helps some users, great; if it doesn't, it's no worse than it is right now.

If anyone has a suggestion for how to make this more bulletproof, I'm happy to investigate.
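
The convention described here can be sketched as follows. This is a minimal hypothetical illustration with a simplified stand-in condition type, assuming the behaviour described in the comment rather than quoting the MAO source:

```go
package main

import "fmt"

// Condition is a simplified stand-in for a ClusterOperator status condition
// (the real type lives in the OpenShift config API; this is illustrative only).
type Condition struct {
	Type   string // e.g. "Available", "Degraded", "Progressing"
	Status string // "True" or "False"
	Reason string // e.g. "Initializing"
}

// isInitializing captures the convention described above: an empty status means
// the operator has never reported and is initializing, and it stays that way
// until Available has been set to True for the first time. Once Available has
// gone True, later syncs (where Available carries some other reason) never
// re-enter the Initializing state, so the bootstrap Machine check is skipped.
func isInitializing(conditions []Condition) bool {
	for _, c := range conditions {
		if c.Type == "Available" {
			// Still initializing only while Available has not yet gone
			// True and still carries the Initializing reason.
			return c.Status != "True" && c.Reason == "Initializing"
		}
	}
	return true // no Available condition at all: a fresh ClusterOperator
}

func main() {
	fresh := isInitializing(nil)
	booted := isInitializing([]Condition{{Type: "Available", Status: "True"}})
	fmt.Println(fresh, booted) // prints: true false
}
```

On this sketch, a cluster that degrades after having been Available (Available=False with a non-Initializing reason) is not treated as initializing, which matches the "fails into the more open state" argument above.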

@elmiko
Contributor

elmiko commented Jun 27, 2022

i appreciate the deep research @JoelSpeed , i think for now we should use this as is. if we get an improvement in the future we can revisit, or if we find a way to be more definitive about the initializing state.
/approve

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 27, 2022
Contributor

@elmiko elmiko left a comment


bot is acting funny

/approve

@openshift-ci
Contributor

openshift-ci bot commented Jun 27, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 2 against base HEAD ad05451 and 8 for PR HEAD c9e5f41 in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 1 against base HEAD ad05451 and 7 for PR HEAD c9e5f41 in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD ad05451 and 6 for PR HEAD c9e5f41 in total

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 29, 2022
Member

@damdo damdo left a comment


/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 29, 2022
@damdo
Member

damdo commented Jun 30, 2022

/retest-required

1 similar comment
@JoelSpeed
Contributor Author

/retest-required

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 2 against base HEAD ad05451 and 8 for PR HEAD 2355e91 in total

@damdo
Member

damdo commented Jul 1, 2022

/retest-required

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 1, 2022
@openshift-ci openshift-ci bot removed lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 1, 2022
Member

@damdo damdo left a comment


/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 1, 2022
@JoelSpeed
Contributor Author

/retest-required

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 2 against base HEAD 7a747d8 and 8 for PR HEAD 2012b4c in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 1 against base HEAD 7a747d8 and 7 for PR HEAD 2012b4c in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 7a747d8 and 6 for PR HEAD 2012b4c in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 2 against base HEAD a55209e and 5 for PR HEAD 2012b4c in total

@openshift-ci
Contributor

openshift-ci bot commented Jul 2, 2022

@JoelSpeed: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-libvirt 2012b4c link false /test e2e-libvirt
ci/prow/e2e-aws-disruptive 2012b4c link false /test e2e-aws-disruptive
ci/prow/e2e-metal-ipi-ovn-dualstack 2012b4c link false /test e2e-metal-ipi-ovn-dualstack
ci/prow/e2e-gcp-operator 2012b4c link false /test e2e-gcp-operator
ci/prow/e2e-vsphere-operator 2012b4c link false /test e2e-vsphere-operator
ci/prow/e2e-vsphere-upgrade 2012b4c link false /test e2e-vsphere-upgrade
ci/prow/e2e-metal-ipi-ovn-ipv6 2012b4c link false /test e2e-metal-ipi-ovn-ipv6
ci/prow/e2e-openstack 2012b4c link false /test e2e-openstack

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-ci openshift-ci bot merged commit 058da9b into openshift:master Jul 2, 2022
@openshift-ci
Contributor

openshift-ci bot commented Jul 2, 2022

@JoelSpeed: All pull requests linked via external trackers have merged:

Bugzilla bug 1994820 has been moved to the MODIFIED state.

In response to this:

Bug 1994820: Degrade operator on cluster bootstrap if not all Machines are Running

