pkg/operator/controller/status: 5 minutes of inertia before propagating Degraded=True #377
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Force-pushed from e348b07 to 2e3e2b8.
…nges

From the API docs [1]: "lastTransitionTime is the time of the last update to the current status property."

And the library-go SetStatusCondition implementation includes [2]:

    if existingCondition.Status != newCondition.Status {
        existingCondition.Status = newCondition.Status
        existingCondition.LastTransitionTime = metav1.NewTime(time.Now())
    }

    existingCondition.Reason = newCondition.Reason
    existingCondition.Message = newCondition.Message

The motivation for that behavior is that it's often more useful to know "how long has this resource been Degraded=True?" and similar than it is to know how long it has had exactly the same status/reason/message. Messages in particular can be fairly mutable ("# of # pods available", etc.), and changes there do not necessarily represent fundamental shifts in the underlying issue.

[1]: https://github.com/openshift/api/blob/81f778f3b3ec31c1dd344e795620a8fbaf2d9a51/config/v1/types_cluster_operator.go#L129
[2]: https://github.com/openshift/library-go/blob/c515269de16e5e239bd6e93e1f9821a976bb460b/pkg/config/clusteroperator/v1helpers/status.go#L29C1-L35C50
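To make the quoted behavior concrete, here is a minimal, self-contained Go sketch of a condition setter with the same semantics as the library-go helper quoted above: only a change in Status moves LastTransitionTime, while Reason and Message are always overwritten. The Condition type and setCondition name are simplified stand-ins for illustration, not the actual library-go API.

```go
package main

import (
	"fmt"
	"time"
)

// Condition is a simplified stand-in for a ClusterOperator-style status condition.
type Condition struct {
	Type               string
	Status             string
	Reason             string
	Message            string
	LastTransitionTime time.Time
}

// setCondition mirrors the quoted library-go behavior: LastTransitionTime only
// moves when Status changes, while Reason and Message are always refreshed.
func setCondition(conditions []Condition, newCond Condition) []Condition {
	for i := range conditions {
		existing := &conditions[i]
		if existing.Type != newCond.Type {
			continue
		}
		if existing.Status != newCond.Status {
			existing.Status = newCond.Status
			existing.LastTransitionTime = time.Now()
		}
		existing.Reason = newCond.Reason
		existing.Message = newCond.Message
		return conditions
	}
	newCond.LastTransitionTime = time.Now()
	return append(conditions, newCond)
}

func main() {
	conds := setCondition(nil, Condition{Type: "Degraded", Status: "True", Message: "1 of 3 pods available"})
	before := conds[0].LastTransitionTime
	// A message-only update: LastTransitionTime should not move.
	conds = setCondition(conds, Condition{Type: "Degraded", Status: "True", Message: "2 of 3 pods available"})
	fmt.Println(conds[0].LastTransitionTime.Equal(before)) // true
}
```

Running this prints true, because the message-only update leaves LastTransitionTime untouched.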
…ng Degraded=True

The default DNS resource can go Degraded=True when it has insufficient available pods. But it's possible to have insufficient available pods momentarily and subsequently recover. For example, while the DaemonSet is rolling out an update, we expect maxSurge unavailable pods for the duration of the rollout, which could involve up to 10 (for 10% maxSurge) rounds of pods being spun up on the cluster's nodes:

1. DaemonSet bumped.
2. Pod a2 launched on node a, pod b2 launched on node b, etc.
3. Pod a2 goes ready, and the old pod a1 is deleted.
4. Pod c2 launched on node c.
5. Pod b2 goes ready, and the old pod b1 is deleted.
...
n. Eventually all the new pods are ready.

During that rollout, there are always some unready pods. But we're still making quick progress and things are happy. However, we could also have:

1. DaemonSet bumped.
2. Pod a2 launched on node a, pod b2 launched on node b, etc.
3. Pod b2 goes ready, and the old pod b1 is deleted.
4. Pod c2 launched on node c.
...
n. Pod a2 still stuck.

That will have a similar number of unready pods as the happy case, but a2 being stuck for so long is a bad sign, and might deserve admin intervention. Ideally the DaemonSet controller would be watching individual pods and reporting conditions to let us know if it was concerned about progress or recovery from external disruption. But DaemonSet status has no conditions today [1]. We could look over the DaemonSet controller's shoulder and watch the pods directly, but that would be a lot of work. So instead I'm adding a few minutes of inertia here, assuming that if the unready pods quickly resolve, it's unlikely to impact quality-of-service or require admin intervention. And if the unready pods (or other issue) does not quickly resolve, it is likely to deserve admin intervention (now with some hopefully acceptable additional latency before summoning the admin).

[1]: https://github.com/kubernetes/kubernetes/blob/98358b8ce11b0c1878ae7aa1482668cb7a0b0e23/staging/src/k8s.io/api/apps/v1/types.go#L722
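For illustration, a minimal Go sketch of the inertia described above: the operand's Degraded=True only counts against the operator once the condition has persisted past a grace period. The Degraded condition type and five-minute threshold follow this PR's diff; the operandCondition type and degradedLongEnough helper are assumptions made for the example, not the operator's real code.

```go
package main

import (
	"fmt"
	"time"
)

// operandCondition is a simplified stand-in for the DNS operand's status condition.
type operandCondition struct {
	Type               string
	Status             string
	LastTransitionTime time.Time
}

// degradedLongEnough reports whether a Degraded=True condition has persisted past
// the grace period, so short-lived unready-pod blips during rollouts are ignored.
func degradedLongEnough(conds []operandCondition, grace time.Duration, now time.Time) bool {
	for _, cond := range conds {
		if cond.Type == "Degraded" && cond.Status == "True" && now.Sub(cond.LastTransitionTime) > grace {
			return true
		}
	}
	return false
}

func main() {
	conds := []operandCondition{{
		Type:               "Degraded",
		Status:             "True",
		LastTransitionTime: time.Now().Add(-2 * time.Minute),
	}}
	// Two minutes into a rollout blip: still inside the grace period, so don't propagate yet.
	fmt.Println(degradedLongEnough(conds, 5*time.Minute, time.Now())) // false
}
```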
Force-pushed from 2e3e2b8 to 269b9b8.
@wking: The following test failed, say /retest to rerun all failed tests:

Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/assign
Unrelated to change:
/test e2e-aws-ovn-serial
@@ -417,7 +418,7 @@ func computeOperatorDegradedCondition(haveDNS bool, dns *operatorv1.DNS) configv
 	var degraded bool
 	for _, cond := range dns.Status.Conditions {
-		if cond.Type == operatorv1.OperatorStatusTypeDegraded && cond.Status == operatorv1.ConditionTrue {
+		if cond.Type == operatorv1.OperatorStatusTypeDegraded && cond.Status == operatorv1.ConditionTrue && time.Now().Sub(cond.LastTransitionTime.Time) > 5*time.Minute {
We set the degraded condition based on transition time in https://github.com/openshift/cluster-dns-operator/blob/master/pkg/operator/controller/dns_status.go#L130, but I don't see a need to check lastTransitionTime again here.
@wking I've approved #375 separately, and I don't think we need the remaining change in here as explained in #377 (comment) and because we won't use 5 minutes. Is it okay with you to close this?
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close. /lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /reopen. /close
@openshift-bot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Layering on top of both #375 and #376.

The default DNS resource can go Degraded=True when it has insufficient available pods. But it's possible to have insufficient available pods momentarily and subsequently recover. For example, while the DaemonSet is rolling out an update, we expect maxSurge unavailable pods for the duration of the rollout, which could involve up to 10 (for 10% maxSurge) rounds of pods being spun up on the cluster's nodes:

1. DaemonSet bumped.
2. Pod a2 launched on node a, pod b2 launched on node b, etc.
3. Pod a2 goes ready, and the old pod a1 is deleted.
4. Pod c2 launched on node c.
5. Pod b2 goes ready, and the old pod b1 is deleted.
...
n. Eventually all the new pods are ready.

During that rollout, there are always some unready pods. But we're still making quick progress and things are happy.

However, we could also have:

1. DaemonSet bumped.
2. Pod a2 launched on node a, pod b2 launched on node b, etc.
3. Pod b2 goes ready, and the old pod b1 is deleted.
4. Pod c2 launched on node c.
...
n. Pod a2 still stuck.

That will have a similar number of unready pods as the happy case, but a2 being stuck for so long is a bad sign, and might deserve admin intervention. Ideally the DaemonSet controller would be watching individual pods and reporting conditions to let us know if it was concerned about progress or recovery from external disruption. But DaemonSet status has no conditions today. We could look over the DaemonSet controller's shoulder and watch the pods directly, but that would be a lot of work. So instead I'm adding a few minutes of inertia here, assuming that if the unready pods quickly resolve, it's unlikely to impact quality-of-service or require admin intervention. And if the unready pods (or other issue) does not quickly resolve, it is likely to deserve admin intervention (now with some hopefully acceptable additional latency before summoning the admin).
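As a companion to the change proposed here, a hedged sketch of a table-driven test for the grace-period behavior. It exercises the simplified degradedLongEnough stand-in from the earlier sketch rather than the operator's actual computeOperatorDegradedCondition, so the names and the five-minute threshold are illustrative assumptions.

```go
package main

import (
	"testing"
	"time"
)

// Simplified stand-ins repeated here so the test file is self-contained;
// they are not the operator's real types or helpers.
type operandCondition struct {
	Type               string
	Status             string
	LastTransitionTime time.Time
}

func degradedLongEnough(conds []operandCondition, grace time.Duration, now time.Time) bool {
	for _, cond := range conds {
		if cond.Type == "Degraded" && cond.Status == "True" && now.Sub(cond.LastTransitionTime) > grace {
			return true
		}
	}
	return false
}

func TestDegradedInertia(t *testing.T) {
	now := time.Now()
	cases := []struct {
		name string
		held time.Duration // how long Degraded=True has already been held
		want bool
	}{
		{"brief blip during a rollout", 2 * time.Minute, false},
		{"stuck well past the grace period", 10 * time.Minute, true},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			conds := []operandCondition{{
				Type:               "Degraded",
				Status:             "True",
				LastTransitionTime: now.Add(-tc.held),
			}}
			if got := degradedLongEnough(conds, 5*time.Minute, now); got != tc.want {
				t.Errorf("degradedLongEnough() = %v, want %v", got, tc.want)
			}
		})
	}
}
```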