
Bug 1943719: Don't degrade cluster on connection error #39

Merged

merged 3 commits into openshift:master from dont-degrade on May 10, 2021
Conversation

jsafrane
Contributor

Clusters that do not have correct credentials for vCenter should not get degraded when vsphere-problem-detector cannot connect to it.

Instead, keep the cluster Available=true and only report a new metric + alert on it (in a separate PR).

Add the error message to the Available=true condition, so it can be found without digging through logs:

      type: VSphereProblemDetectorControllerAvailable
      status: "True"
      reason: SyncFailed
      message: 'failed to connect to vcenter.sddc-44-236-21-251.vmwarevmc.com: ServerFaultCode:
        Cannot complete login due to an incorrect user name or password.'

cc @openshift/storage

@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. label Apr 30, 2021
@openshift-ci-robot
Contributor

@jsafrane: This pull request references Bugzilla bug 1943719, which is invalid:

  • expected the bug to target the "4.8.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1943719: Don't degrade cluster on connection error

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Apr 30, 2021
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jsafrane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 30, 2021
@jsafrane
Contributor Author

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Apr 30, 2021
@openshift-ci-robot
Contributor

@jsafrane: This pull request references Bugzilla bug 1943719, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

No GitHub users were found matching the public email listed for the QA contact in Bugzilla (piqin@redhat.com), skipping review request.

In response to this:

/bugzilla refresh


@jsafrane
Contributor Author

jsafrane commented Apr 30, 2021

And the condition is propagated to ClusterOperator storage:

$ oc get clusteroperator storage -o yaml
...
  - lastTransitionTime: "2021-04-30T09:32:47Z"
    message: 'VSphereProblemDetectorControllerAvailable: failed to connect to vcenter.sddc-44-236-21-251.vmwarevmc.com:
      ServerFaultCode: Cannot complete login due to an incorrect user name or password.'
    reason: AsExpected
    status: "True"
    type: Available

Help: "Indicates failing vSphere problem detector sync error. Value 1 means that the last sync failed.",
StabilityLevel: metrics.ALPHA,
},
)
Member

how about the name vsphere_sync_errors?

Contributor Author

Renamed

Member

it is still using the old name.

Contributor Author

Sorry, now it should really be fixed.

// TODO: Run in a separate goroutine? We may not want to run time-consuming checks here.
if platformSupported && time.Now().After(c.nextCheck) {
	delay, err := c.runChecks(ctx)
	if err != nil {
		// Do not return the error here: returning it would set the
		// VSphereProblemDetectorControllerDegraded condition and
		// degrade the whole cluster.
Member

Can we report errors from talking to OpenShift's API server differently from errors from talking to the vCenter API server?

Contributor Author

How would we use the information? Both API server and vCenter errors will result in an alert, and IMO a generic one is enough.

Member

We would use that information by putting a specific error in the alert. The approach here seems backward and does not help us with https://bugzilla.redhat.com/show_bug.cgi?id=1929780

Contributor Author

The information we can put in an alert will be very generic in both cases, because we don't have the detailed error message in a metric. In addition, runChecks() can fail for a variety of reasons on the OCP side, for example when a secret is missing or has an invalid format. I don't want separate alerts for each of them, that would be too many, and at the same time you can't unify them into a generic "Failed to connect to API server", because it's not true.

I want to have an alert something like: "Failed to initialize vsphere-problem-detector periodic checks for X minutes. Please see the ClusterOperator "storage" conditions for the detailed error message ("oc get clusteroperator storage -o yaml") and look for the message in the Available condition."

The bug you linked is about having a specific alert for every check, not an alert for every error of every check.
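A hypothetical Prometheus alerting rule along those lines (the alert name, metric name, and threshold are assumptions for illustration; the actual rule was added in a separate PR):

```yaml
- alert: VSphereProblemDetectorSyncFailing
  expr: vsphere_sync_errors == 1
  for: 60m
  labels:
    severity: warning
  annotations:
    message: >-
      Failed to initialize vsphere-problem-detector periodic checks.
      See "oc get clusteroperator storage -o yaml" and look for the
      message in the Available condition for details.
```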

Member

My main concern is that since we are using kubeClient directly for fetching the list of nodes, we could have a sporadic failure talking to the API server, which would result in a recorded metric and hence an alert (eventually).

Contributor Author

Moved away from direct API calls; now using informers for everything. So now either there is something wrong with a secret (it can't be parsed, does not exist, ...) or vCenter returns a bad response.

@jsafrane
Contributor Author

/retest

@vrutkovs
Member

vrutkovs commented May 3, 2021

I'd prefer to have it degraded. It's not trivial to figure out whether a cluster is really using the provided features (i.e. in-tree storage), so the operator can't reliably tell whether this Degraded condition would protect clusters from errors in the future or not.

So, with this change, we might see a broken upgrade with image-registry being degraded with an unrelated message, plus an "incorrect user/password" alert, and it's complicated to link those together. The operator should be defensive and stop before breaking the image registry.
Users who don't use in-tree features shouldn't be using platform: vsphere in the first place.

@vrutkovs
Member

vrutkovs commented May 3, 2021

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 3, 2021
@vrutkovs
Member

vrutkovs commented May 4, 2021

/hold cancel

Let's roll with this: clusters can't change the infra provider, and the OTA team's concerns would be solved by introducing a CCX rule.

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 4, 2021
@jsafrane jsafrane force-pushed the dont-degrade branch 2 times, most recently from a34aeb8 to ae7ee54 Compare May 5, 2021 14:47
@jsafrane
Contributor Author

jsafrane commented May 5, 2021

/retest

Clusters that do not have correct credentials for vCenter should not get
degraded when vsphere-problem-detector cannot connect to it.

Instead, keep the cluster Available=true and only report a new metric +
alert on it.

Add the error message to the Available=true condition, so it can be found
without digging through logs.
@jsafrane
Contributor Author

jsafrane commented May 5, 2021

/hold
testing manually

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 5, 2021
Use informers for everything that vsphere-problem-detector needs from the
API server. This will result in fewer errors reported as failed checks on
random hiccups of the API server, network, etcd, etc.

As a result, the types needed to be changed to pointers.
Exponential backoff should be used on all errors, incl. failed connections to
vCenter. Reorganize the calculation of the next check to accommodate that.
@jsafrane
Contributor Author

jsafrane commented May 5, 2021

/hold cancel
I needed to rework the exponential backoff a bit to apply it to connection errors as well.

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 5, 2021
@gnufied
Member

gnufied commented May 10, 2021

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 10, 2021
@openshift-merge-robot openshift-merge-robot merged commit 50a2cab into openshift:master May 10, 2021
@openshift-ci
Contributor

openshift-ci bot commented May 10, 2021

@jsafrane: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

These pull requests must merge or be unlinked from the Bugzilla bug in order for it to move to the next state. Once unlinked, request a bug refresh with /bugzilla refresh.

Bugzilla bug 1943719 has not been moved to the MODIFIED state.

In response to this:

Bug 1943719: Don't degrade cluster on connection error

