
Bug 1994820: Degrade operator on cluster bootstrap if not all Machines are Running #1019

Merged
merged 2 commits into openshift:master
Jul 2, 2022

Conversation

JoelSpeed
Contributor

@JoelSpeed JoelSpeed commented May 23, 2022

This was discussed on a recent cluster lifecycle arch call.

At the moment, we do not return any errors through the installation log if Machines fail to come up, even though these are often the cause of several issues that we see frequently in CI (eg route not ready because there aren't enough worker nodes).

To make things a little more obvious, this change introduces a check for Machines during the initialization of the operator.
We require, in general, 2 worker Machines to be running to host operators such as ingress.
If MachineSets are present in the cluster (this prevents errors on UPI or SNO), and there are fewer than 2 Machines in the Running phase, this moves the operator into a degraded state until at least 2 Machines are running.
Once all Machines have moved to Running, this moves the operator to Available and therefore prevents the machine check from happening again in the future.

This means that we will now fail installations if fewer than 2 Machines come up.
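
The check described above can be sketched roughly as follows. This is a hypothetical, simplified illustration, not the actual operator code: `Machine` here is a stand-in struct for the Machine API type, and the real implementation reads MachineSets and Machines via the cluster client.

```go
package main

import "fmt"

// Machine is a simplified stand-in for the Machine API type (hypothetical,
// for illustration only).
type Machine struct {
	Name  string
	Phase string // e.g. "Provisioning", "Running", "Failed"
}

const minRunningMachines = 2

// checkBootstrapMachines mirrors the idea in this PR: only when MachineSets
// exist (so UPI/SNO clusters skip the check) and fewer than 2 Machines are in
// the Running phase do we report a degraded condition naming the non-running
// Machines.
func checkBootstrapMachines(machineSetCount int, machines []Machine) (degraded bool, msg string) {
	if machineSetCount == 0 {
		return false, "" // UPI or SNO: no MachineSets, skip the check
	}
	running := 0
	var notRunning []string
	for _, m := range machines {
		if m.Phase == "Running" {
			running++
		} else {
			notRunning = append(notRunning, m.Name)
		}
	}
	if running < minRunningMachines {
		return true, fmt.Sprintf("found %d non running machine(s): %v", len(notRunning), notRunning)
	}
	return false, ""
}

func main() {
	degraded, msg := checkBootstrapMachines(1, []Machine{
		{Name: "worker-a", Phase: "Running"},
		{Name: "worker-b", Phase: "Failed"},
	})
	fmt.Println(degraded, msg) // prints: true found 1 non running machine(s): [worker-b]
}
```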

When there are generic Machine/cloud errors, it should help users diagnose the underlying issue by identifying Machine API as the failed component, rather than just the route and auth operators.

I have tested this by simulating a failed Machine on a running cluster and then resetting the ClusterOperator so that the operator thinks it is initializing for a second time, as well as bootstrapping a cluster with only 1 MachineSet machine working, and 1 failing, and then eventually adding an extra working Machine.

@openshift-ci openshift-ci bot added bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels May 23, 2022
@openshift-ci
Contributor

openshift-ci bot commented May 23, 2022

@JoelSpeed: This pull request references Bugzilla bug 1994820, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.0) matches configured target release for branch (4.11.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @sunzhaohua2

In response to this:

Bug 1994820: Degrade operator on cluster bootstrap if not all Machines are Running

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@JoelSpeed
Contributor Author

An example of the output from the clusteroperator conditions when bootstrapping with a misconfigured machine:

  conditions:
  - lastTransitionTime: "2022-05-23T15:58:38Z"
    message: 'Progressing towards operator: 4.11.0-0.ci.test-2022-05-23-152641-ci-ln-ghys7m2-latest'
    reason: SyncingResources
    status: "True"
    type: Progressing
  - lastTransitionTime: "2022-05-23T16:01:58Z"
    message: 'Failed when progressing towards operator: 4.11.0-0.ci.test-2022-05-23-152641-ci-ln-ghys7m2-latest
      because found 1 non running machine(s): jspeed-test-4l4vz-worker-us-east-2a-tb8zz'
    reason: SyncingFailed
    status: "True"
    type: Degraded
  - lastTransitionTime: "2022-05-23T15:58:38Z"
    message: Operator is initializing
    reason: Initializing
    status: "False"
    type: Available
  - lastTransitionTime: "2022-05-23T15:58:38Z"
    status: "True"
    type: Upgradeable

@JoelSpeed
Contributor Author

When the install failed it failed with:

ERROR Cluster operator machine-api Degraded is True with SyncingFailed: Failed when progressing towards operator: 4.11.0-0.ci.test-2022-05-23-152641-ci-ln-ghys7m2-latest because found 1 non running machine(s): jspeed-test-4l4vz-worker-us-east-2a-tb8zz

So this was a bright red and very obvious failure, though the rest of the cluster was up and running. I need to double-check our notes on what we thought about failing the cluster here.

@JoelSpeed
Contributor Author

/retest

Contributor

@elmiko elmiko left a comment


/lgtm

it seems weird, but should we have some e2e test to exercise the failure state here? or perhaps a unit test with a mocked client?

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 26, 2022
@JoelSpeed
Contributor Author

it seems weird, but should we have some e2e test to exercise the failure state here? or perhaps a unit test with a mocked client?

I wanted to get consensus on the approach before working on testing. A unit test with envtest is probably the best idea I have so far, having an E2E is a bit trickier

@elmiko
Contributor

elmiko commented May 26, 2022

I wanted to get consensus on the approach before working on testing. A unit test with envtest is probably the best idea I have so far, having an E2E is a bit trickier

+1, envtest sounds perfect to me

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 1, 2022
@openshift-ci
Contributor

openshift-ci bot commented Jun 1, 2022

@JoelSpeed: This pull request references Bugzilla bug 1994820, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.0) matches configured target release for branch (4.11.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @sunzhaohua2

In response to this:

Bug 1994820: Degrade operator on cluster bootstrap if not all Machines are Running


@JoelSpeed
Contributor Author

/retest

@elmiko
Contributor

elmiko commented Jun 8, 2022

we should discuss this further on standup, i think there are a few questions about the common approach and what will happen with the installer. cc @lobziik

@JoelSpeed
Contributor Author

/retest

1 similar comment
@JoelSpeed
Contributor Author

/retest

@lobziik
Contributor

lobziik commented Jun 21, 2022

My main concern here is that we are basing the "isInitializing" logic on a bunch of assumptions and basically have no guarantees that the behaviour around it will not change. If we could reference some org-wide guidelines I would be much calmer about this. At the moment it just looks a bit fragile and obscure to me.

If folks agree, I'm OK to go with this.

Code wise - lgtm.

@lobziik
Contributor

lobziik commented Jun 21, 2022

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 21, 2022
@elmiko
Contributor

elmiko commented Jun 21, 2022

My main concern here is that we are basing the "isInitializing" logic on a bunch of assumptions and basically have no guarantees that the behaviour around it will not change. If we could reference some org-wide guidelines I would be much calmer about this. At the moment it just looks a bit fragile and obscure to me.

i think this is a good concern, i'm just wondering if there is any documentation that we could link to from the code comments?

@JoelSpeed
Contributor Author

My main concern here is that we are basing the "isInitializing" logic on a bunch of assumptions and basically have no guarantees that the behaviour around it will not change. If we could reference some org-wide guidelines I would be much calmer about this. At the moment it just looks a bit fragile and obscure to me.

It is up to each operator to handle what happens when it starts, and, for now, to define when it varies its ClusterOperator status.
In MAO, we have always set the status to Initializing when we see an empty operator status, keeping it that way until we have, for the first time, set it to Available, and we should continue to do so.
This logic is encoded into our operator and is therefore under our control. No one else should be interacting with our ClusterOperator status (the CVO certainly doesn't), so there should be no way, in a running cluster, for the Initializing state to come back, unless someone manually manages to patch the status, which we assume isn't supported.

Even if this mechanism does fail somehow, it's more likely to fail into the more open state (i.e. the check doesn't happen), so it's unlikely that we will end up degrading a customer cluster if something gets broken during future code changes.

This is basically an artifact of the cluster not having an initialising state defined, and that's OK; it's not meant to, it's a reconciling system. We just have to add this optimisation where we can, based on what we have (IMO). If it works and helps some users, great; if it doesn't, it's no worse than it is right now.

If anyone has a suggestion for how to make this more bulletproof, I'm happy to investigate.
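
The convention described here can be sketched as follows. This is a minimal hypothetical illustration with a simplified stand-in condition type, assuming the behaviour described in the comment rather than quoting the MAO source:

```go
package main

import "fmt"

// Condition is a simplified stand-in for a ClusterOperator status condition
// (the real type lives in the OpenShift config API; this is illustrative only).
type Condition struct {
	Type   string // e.g. "Available", "Degraded", "Progressing"
	Status string // "True" or "False"
	Reason string // e.g. "Initializing"
}

// isInitializing captures the convention described above: an empty status means
// the operator has never reported and is initializing, and it stays that way
// until Available has been set to True for the first time. Once Available has
// gone True, later syncs (where Available carries some other reason) never
// re-enter the Initializing state, so the bootstrap Machine check is skipped.
func isInitializing(conditions []Condition) bool {
	for _, c := range conditions {
		if c.Type == "Available" {
			// Still initializing only while Available has not yet gone
			// True and still carries the Initializing reason.
			return c.Status != "True" && c.Reason == "Initializing"
		}
	}
	return true // no Available condition at all: a fresh ClusterOperator
}

func main() {
	fresh := isInitializing(nil)
	booted := isInitializing([]Condition{{Type: "Available", Status: "True"}})
	fmt.Println(fresh, booted) // prints: true false
}
```

On this sketch, a cluster that degrades after having been Available (Available=False with a non-Initializing reason) is not treated as initializing, which matches the "fails into the more open state" argument above.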

@elmiko
Contributor

elmiko commented Jun 27, 2022

i appreciate the deep research @JoelSpeed , i think for now we should use this as is. if we get an improvement in the future we can revisit, or if we find a way to be more definitive about the initializing state.
/approve

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 27, 2022
Contributor

@elmiko elmiko left a comment


bot is acting funny

/approve

@openshift-ci
Contributor

openshift-ci bot commented Jun 27, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 2 against base HEAD ad05451 and 8 for PR HEAD c9e5f41 in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 1 against base HEAD ad05451 and 7 for PR HEAD c9e5f41 in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD ad05451 and 6 for PR HEAD c9e5f41 in total

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 29, 2022
Member

@damdo damdo left a comment


/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 29, 2022
@damdo
Member

damdo commented Jun 30, 2022

/retest-required

1 similar comment
@JoelSpeed
Contributor Author

/retest-required

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 2 against base HEAD ad05451 and 8 for PR HEAD 2355e91 in total

@damdo
Member

damdo commented Jul 1, 2022

/retest-required

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 1, 2022
@openshift-ci openshift-ci bot removed lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 1, 2022
Member

@damdo damdo left a comment


/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 1, 2022
@JoelSpeed
Contributor Author

/retest-required

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 2 against base HEAD 7a747d8 and 8 for PR HEAD 2012b4c in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 1 against base HEAD 7a747d8 and 7 for PR HEAD 2012b4c in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 7a747d8 and 6 for PR HEAD 2012b4c in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 2 against base HEAD a55209e and 5 for PR HEAD 2012b4c in total

@openshift-ci
Contributor

openshift-ci bot commented Jul 2, 2022

@JoelSpeed: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-libvirt 2012b4c link false /test e2e-libvirt
ci/prow/e2e-aws-disruptive 2012b4c link false /test e2e-aws-disruptive
ci/prow/e2e-metal-ipi-ovn-dualstack 2012b4c link false /test e2e-metal-ipi-ovn-dualstack
ci/prow/e2e-gcp-operator 2012b4c link false /test e2e-gcp-operator
ci/prow/e2e-vsphere-operator 2012b4c link false /test e2e-vsphere-operator
ci/prow/e2e-vsphere-upgrade 2012b4c link false /test e2e-vsphere-upgrade
ci/prow/e2e-metal-ipi-ovn-ipv6 2012b4c link false /test e2e-metal-ipi-ovn-ipv6
ci/prow/e2e-openstack 2012b4c link false /test e2e-openstack

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-ci openshift-ci bot merged commit 058da9b into openshift:master Jul 2, 2022
@openshift-ci
Contributor

openshift-ci bot commented Jul 2, 2022

@JoelSpeed: All pull requests linked via external trackers have merged:

Bugzilla bug 1994820 has been moved to the MODIFIED state.

In response to this:

Bug 1994820: Degrade operator on cluster bootstrap if not all Machines are Running

