Bug 1943719: Don't degrade cluster on connection error #39
Conversation
@jsafrane: This pull request references Bugzilla bug 1943719, which is invalid:
Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: jsafrane

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/bugzilla refresh
@jsafrane: This pull request references Bugzilla bug 1943719, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Bugzilla (piqin@redhat.com), skipping review request. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
And the condition is propagated to ClusterOperator.
Help: "Indicates failing vSphere problem detector sync error. Value 1 means that the last sync failed.", | ||
StabilityLevel: metrics.ALPHA, | ||
}, | ||
) |
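For context, a minimal self-contained sketch of what the full gauge definition around this snippet could look like, assuming the k8s.io/component-base/metrics package that the `StabilityLevel: metrics.ALPHA` field comes from. The variable name and registration wiring are illustrative, not the PR's actual code, and the metric name reflects the rename requested in the comment below:

```go
package vspheremetrics

import (
	k8smetrics "k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

// syncErrorMetric is set to 1 when the last problem-detector sync failed
// and back to 0 on success (variable name is hypothetical).
var syncErrorMetric = k8smetrics.NewGauge(
	&k8smetrics.GaugeOpts{
		Name:           "vsphere_sync_errors",
		Help:           "Indicates failing vSphere problem detector sync error. Value 1 means that the last sync failed.",
		StabilityLevel: k8smetrics.ALPHA,
	},
)

func init() {
	// Register with the global registry so the metric is scraped
	// alongside the operator's other metrics.
	legacyregistry.MustRegister(syncErrorMetric)
}
```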
How about the name vsphere_sync_errors?
Renamed
It is still using the old name.
Sorry, now it should really be fixed.
```diff
 // TODO: Run in a separate goroutine? We may not want to run time-consuming checks here.
 if platformSupported && time.Now().After(c.nextCheck) {
 	delay, err := c.runChecks(ctx)
 	if err != nil {
-		// This sets VSphereProblemDetectorControllerDegraded condition
-		return err
+		// Do not return the error, it would degrade the whole cluster.
```
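As a hedged sketch of the resulting pattern (all type, field, and function names below are hypothetical, not the PR's actual code): the error is swallowed so the Degraded condition is never set, and the failure is surfaced through the metric and the condition message instead, with the next check rescheduled via backoff:

```go
package vspherecontroller

import (
	"context"
	"errors"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

var platformSupported = true

type controller struct {
	nextCheck       time.Time
	backoff         wait.Backoff
	lastCheckResult string // surfaced in the ClusterOperator condition message
}

// runChecks stands in for the real vCenter checks.
func (c *controller) runChecks(ctx context.Context) (time.Duration, error) {
	return time.Hour, errors.New("cannot connect to vCenter")
}

func (c *controller) sync(ctx context.Context) error {
	if platformSupported && time.Now().After(c.nextCheck) {
		delay, err := c.runChecks(ctx)
		if err != nil {
			// Swallow the error: returning it would set the Degraded
			// condition and degrade the whole cluster. Record it for
			// the metric/condition message and retry with backoff.
			c.lastCheckResult = err.Error()
			c.nextCheck = time.Now().Add(c.backoff.Step())
			return nil
		}
		c.lastCheckResult = ""
		c.nextCheck = time.Now().Add(delay)
	}
	return nil
}
```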
Can we report errors from talking to OpenShift's API server differently from errors from talking to the vCenter API server?
How would we use the information? Both API server and vCenter errors will result in an alert, and IMO a generic one is enough.
We would use that information by putting a specific error in the alert. The approach here seems backward and does not help us with https://bugzilla.redhat.com/show_bug.cgi?id=1929780.
The information we can put in the alert will be very generic in both cases, because we don't have a detailed error message in a metric. In addition, runChecks() can fail for a variety of reasons on the OCP side, for example a secret is missing or has an invalid format. I don't want separate alerts for them, it would be too many, and at the same time you can't unify them into a generic "Failed to connect to API server", because it's not true.

I want an alert along the lines of "Failed to initialize vsphere-problem-detector periodic checks for X minutes. Please see the ClusterOperator "storage" conditions for the detailed error message ("oc get clusteroperator storage -o yaml") and look for the message in the Available condition".

The bug you linked is about having a specific alert for every check, not an alert for every error of every check.
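Purely as an illustration of that kind of generic alert (the real alert landed in a separate PR per the description below; every name, threshold, and duration here is hypothetical), it could be expressed as a PrometheusRule:

```yaml
# Hypothetical sketch only; not taken from this PR.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vsphere-problem-detector
spec:
  groups:
  - name: vsphere-problem-detector
    rules:
    - alert: VSphereProblemDetectorSyncFailing
      expr: vsphere_sync_errors == 1
      for: 60m
      labels:
        severity: warning
      annotations:
        message: >-
          vsphere-problem-detector could not run its periodic checks.
          Run "oc get clusteroperator storage -o yaml" and look for the
          detailed error message in the Available condition.
```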
My main concern is that since we are using kubeClient directly for fetching the list of nodes, we could have a sporadic failure in talking to the API server, which would result in a recorded metric and hence (eventually) an alert.
Moved away from direct API calls and switched to informers for everything. Now either there is something wrong with a secret (it can't be parsed, does not exist, ...) or vCenter returns a bad response.
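A rough sketch of that informer-based approach (wiring and names are illustrative, not the PR's actual code): node lists are served from the informer's local cache, so a transient API server hiccup no longer surfaces as a failed check. Listers return slices of pointers, which matches the type change mentioned in the commit message below:

```go
package vspherelisters

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	corelisters "k8s.io/client-go/listers/core/v1"
)

// newNodeLister wires up a shared informer factory and waits for the
// initial cache sync; illustrative only.
func newNodeLister(client kubernetes.Interface, stopCh <-chan struct{}) corelisters.NodeLister {
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	lister := factory.Core().V1().Nodes().Lister()
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
	return lister
}

// listNodes reads from the local cache instead of calling the API server
// directly; note the []*corev1.Node result, hence the move to pointers.
func listNodes(lister corelisters.NodeLister) ([]*corev1.Node, error) {
	return lister.List(labels.Everything())
}
```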
/retest
I'd prefer to have it Degraded. It's not trivial to figure out if the cluster is really using the provided features (i.e. in-tree storage), so the operator can't reliably tell whether this Degraded condition protects clusters from future errors or not. So, with this change, we might see a broken upgrade with image-registry being degraded with an unrelated message plus an "incorrect user/password" alert, and it's complicated to link those together. The operator should be defensive and stop before breaking the image registry.
/hold
/hold cancel

Let's roll with this: clusters can't change their infra provider, and the OTA team's concerns would be solved by introducing a CCX rule.
Force-pushed from a34aeb8 to ae7ee54.
/retest
Clusters that do not have correct credentials to vCenter should not get degraded when vsphere-problem-detector cannot connect to it. Instead, keep the cluster Available=true and only report a new metric + alert on it. Add the error message to the Available=true condition, so it can be found without digging through logs.
/hold
Use informers for everything that vsphere-problem-detector needs from the API server. This will result in fewer errors reported as failed checks on random hiccups of the API server, network, etcd, etc. As a result, types needed to be changed to pointers.
Exponential backoff should be used on all errors, including failed connections to vCenter. Reorganize the calculation of the next check to accommodate that.
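To make the backoff behavior concrete, a small illustration using wait.Backoff from k8s.io/apimachinery, which is one plausible way to implement this (values are illustrative; the PR may compute the delay differently):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	// Each Step() call returns the current delay and multiplies it by
	// Factor for the next call, up to Cap. Values are illustrative.
	b := wait.Backoff{Duration: time.Minute, Factor: 2.0, Steps: 8, Cap: time.Hour}
	for i := 0; i < 4; i++ {
		fmt.Println(b.Step()) // 1m0s, 2m0s, 4m0s, 8m0s
	}
}
```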
/hold cancel
/lgtm
@jsafrane: Some pull requests linked via external trackers have merged: The following pull requests linked via external trackers have not merged: These pull requests must merge or be unlinked from the Bugzilla bug in order for it to move to the next state. Once unlinked, request a bug refresh with /bugzilla refresh.

Bugzilla bug 1943719 has not been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Clusters that do not have correct credentials to vCenter should not get degraded when vsphere-problem-detector cannot connect to it. Instead, keep the cluster Available=true and only report a new metric + alert on it (in a separate PR). Add the error message to the Available=true condition, so it can be found without digging through logs.

cc @openshift/storage