
Bug 1943719: Don't degrade cluster on connection error #39

Merged

merged 3 commits into openshift:master from dont-degrade on May 10, 2021
Conversation

jsafrane
Contributor

Clusters that do not have correct credentials for vCenter should not get degraded when vsphere-problem-detector cannot connect to it.

Instead, keep the cluster Available=true and only report a new metric + alert on it (in a separate PR).

Add the error message to the Available=true condition, so it can be found without digging through logs:

      type: VSphereProblemDetectorControllerAvailable
      status: "True"
      reason: SyncFailed
      message: 'failed to connect to vcenter.sddc-44-236-21-251.vmwarevmc.com: ServerFaultCode:
        Cannot complete login due to an incorrect user name or password.'

cc @openshift/storage

@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. label Apr 30, 2021
@openshift-ci-robot
Contributor

@jsafrane: This pull request references Bugzilla bug 1943719, which is invalid:

  • expected the bug to target the "4.8.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1943719: Don't degrade cluster on connection error

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Apr 30, 2021
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jsafrane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 30, 2021
@jsafrane
Contributor Author

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Apr 30, 2021
@openshift-ci-robot
Contributor

@jsafrane: This pull request references Bugzilla bug 1943719, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

No GitHub users were found matching the public email listed for the QA contact in Bugzilla (piqin@redhat.com), skipping review request.

In response to this:

/bugzilla refresh


@jsafrane
Contributor Author

jsafrane commented Apr 30, 2021

And the condition is propagated to ClusterOperator storage:

$ oc get clusteroperator storage -o yaml
...
  - lastTransitionTime: "2021-04-30T09:32:47Z"
    message: 'VSphereProblemDetectorControllerAvailable: failed to connect to vcenter.sddc-44-236-21-251.vmwarevmc.com:
      ServerFaultCode: Cannot complete login due to an incorrect user name or password.'
    reason: AsExpected
    status: "True"
    type: Available

Help: "Indicates failing vSphere problem detector sync error. Value 1 means that the last sync failed.",
StabilityLevel: metrics.ALPHA,
},
)
Member

how about the name vsphere_sync_errors?

Contributor Author

Renamed

Member

it is still using the old name.

Contributor Author

Sorry, now it should really be fixed.

// TODO: Run in a separate goroutine? We may not want to run time-consuming checks here.
if platformSupported && time.Now().After(c.nextCheck) {
	delay, err := c.runChecks(ctx)
	if err != nil {
		// Do not return the error here: returning it would set the
		// VSphereProblemDetectorControllerDegraded condition and
		// degrade the whole cluster.
Member

Can we report errors from talking to OpenShift's API server differently from errors from talking to the vCenter API server?

Contributor Author

How would we use the information? Both API server and vCenter errors will result in an alert, and IMO a generic one is enough.

Member

We would use that information by putting a specific error in the alert. The approach here seems backward and does not help us with https://bugzilla.redhat.com/show_bug.cgi?id=1929780

Contributor Author

The information we can put in an alert will be very generic in both cases, because we don't have the detailed error message in a metric. In addition, runChecks() can fail for a variety of reasons on the OCP side, for example when a secret is missing or has an invalid format. I don't want separate alerts for each of them, that would be too many, and at the same time you can't unify them into a generic "Failed to connect to API server", because it's not true.

I want to have an alert something like: "Failed to initialize vsphere-problem-detector periodic checks for X minutes. Please see the ClusterOperator "storage" conditions for the detailed error message ("oc get clusteroperator storage -o yaml") and look for the message in the Available condition."

The bug you linked is about having a specific alert for every check, not an alert for every error of every check.
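A hypothetical Prometheus alerting rule along those lines (the alert name, metric name, and threshold are assumptions for illustration; the actual rule was added in a separate PR):

```yaml
- alert: VSphereProblemDetectorSyncFailing
  expr: vsphere_sync_errors == 1
  for: 60m
  labels:
    severity: warning
  annotations:
    message: >-
      Failed to initialize vsphere-problem-detector periodic checks.
      See "oc get clusteroperator storage -o yaml" and look for the
      message in the Available condition for details.
```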

Member

My main concern is that since we are using kubeClient directly for fetching the list of nodes, we could have a sporadic failure talking to the API server, which would result in a recorded metric and hence an alert (eventually).

Contributor Author

Moved away from direct API calls; now using informers for everything. So now either there is something wrong with a secret (it can't be parsed, does not exist, ...) or vCenter returns a bad response.

@jsafrane
Contributor Author

/retest

@vrutkovs
Member

vrutkovs commented May 3, 2021

I'd prefer to have it degraded. It's not trivial to figure out whether a cluster is really using the provided features (i.e. in-tree storage), so the operator can't reliably tell whether this Degraded condition would protect clusters from errors in the future or not.

So, with this change, we might see a broken upgrade with image-registry being degraded with an unrelated message, plus an "incorrect user/password" alert, and it's complicated to link those together. The operator should be defensive and stop before breaking the image registry.
Users who don't use in-tree features shouldn't be using platform: vsphere in the first place.

@vrutkovs
Member

vrutkovs commented May 3, 2021

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 3, 2021
@vrutkovs
Member

vrutkovs commented May 4, 2021

/hold cancel

Let's roll with this: clusters can't change the infra provider, and the OTA team's concerns would be solved by introducing a CCX rule.

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 4, 2021
@jsafrane jsafrane force-pushed the dont-degrade branch 2 times, most recently from a34aeb8 to ae7ee54 Compare May 5, 2021 14:47
@jsafrane
Contributor Author

jsafrane commented May 5, 2021

/retest

Clusters that do not have correct credentials for vCenter should not get
degraded when vsphere-problem-detector cannot connect to it.

Instead, keep the cluster Available=true and only report a new metric +
alert on it.

Add the error message to the Available=true condition, so it can be found
without digging through logs.
@jsafrane
Contributor Author

jsafrane commented May 5, 2021

/hold
testing manually

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 5, 2021
Use informers for everything that vsphere-problem-detector needs from the
API server. This will result in fewer errors reported as failed checks on
random hiccups of the API server, network, etcd, etc.

As a result, the types needed to be changed to pointers.
Exponential backoff should be used on all errors, incl. failed connections to
vCenter. Reorganize the calculation of the next check to accommodate that.
@jsafrane
Contributor Author

jsafrane commented May 5, 2021

/hold cancel
I needed to rework the exponential backoff a bit to apply it to connection errors as well.

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 5, 2021
@gnufied
Member

gnufied commented May 10, 2021

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 10, 2021
@openshift-merge-robot openshift-merge-robot merged commit 50a2cab into openshift:master May 10, 2021
@openshift-ci
Contributor

openshift-ci bot commented May 10, 2021

@jsafrane: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

These pull requests must merge or be unlinked from the Bugzilla bug in order for it to move to the next state. Once unlinked, request a bug refresh with /bugzilla refresh.

Bugzilla bug 1943719 has not been moved to the MODIFIED state.

In response to this:

Bug 1943719: Don't degrade cluster on connection error

