
pkg/operator/controller/status: 5 minutes of inertia before propagating Degraded=True #377

Closed
wants to merge 2 commits

Conversation

@wking (Member) commented Jul 26, 2023

Layering on top of both #375 and #376.

The default DNS resource can go Degraded=True when it has insufficient available pods. But it's possible to have insufficient available pods momentarily and subsequently recover. For example, while the DaemonSet is rolling out an update, we expect maxSurge unavailable pods for the duration of the rollout, which could involve up to 10 (for 10% maxSurge) rounds of pods being spun up on the cluster's nodes:

  1. DaemonSet bumped.
  2. Pod a2 launched on node a, pod b2 launched on node b, etc.
  3. Pod a2 goes ready, and the old pod a1 is deleted.
  4. Pod c2 launched on node c.
  5. Pod b2 goes ready, and the old pod b1 is deleted.
    ...
    n. Eventually all the new pods are ready.

During that rollout, there are always some unready pods. But we're still making quick progress and things are happy.
However, we could also have:

  1. DaemonSet bumped.
  2. Pod a2 launched on node a, pod b2 launched on node b, etc.
  3. Pod b2 goes ready, and the old pod b1 is deleted.
  4. Pod c2 launched on node c.
    ...
    n. Pod a2 still stuck.

That will have a similar number of unready pods as the happy case, but a2 being stuck for so long is a bad sign, and might deserve admin intervention. Ideally the DaemonSet controller would be watching individual pods and reporting conditions to let us know if it was concerned about progress or recovery from external disruption. But DaemonSet status has no conditions today. We could look over the DaemonSet controller's shoulder and watch the pods directly, but that would be a lot of work. So instead I'm adding a few minutes of inertia here, assuming that if the unready pods quickly resolve, it's unlikely to impact quality-of-service or require admin intervention. And if the unready pods (or some other issue) do not quickly resolve, the situation likely deserves admin intervention (now with some hopefully acceptable additional latency before summoning the admin).
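
For concreteness, here is a minimal sketch of that inertia check as a standalone Go program, assuming operatorv1-style conditions. The degradedLongEnough helper name and the example values are illustrative; the actual change applies the same LastTransitionTime comparison inside computeOperatorDegradedCondition, as shown in the diff quoted later in this thread:

  package main

  import (
    "fmt"
    "time"

    operatorv1 "github.com/openshift/api/operator/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
  )

  // degradedLongEnough reports whether any Degraded=True condition has been in
  // that state for longer than the grace period, using LastTransitionTime as
  // "when this condition last changed status".
  func degradedLongEnough(conds []operatorv1.OperatorCondition, grace time.Duration) bool {
    for _, cond := range conds {
      if cond.Type == operatorv1.OperatorStatusTypeDegraded &&
        cond.Status == operatorv1.ConditionTrue &&
        time.Since(cond.LastTransitionTime.Time) > grace {
        return true
      }
    }
    return false
  }

  func main() {
    conds := []operatorv1.OperatorCondition{{
      Type:               operatorv1.OperatorStatusTypeDegraded,
      Status:             operatorv1.ConditionTrue,
      LastTransitionTime: metav1.NewTime(time.Now().Add(-2 * time.Minute)),
    }}
    // Only two minutes into Degraded=True, so a 5-minute grace period still
    // suppresses the operator-level Degraded condition.
    fmt.Println(degradedLongEnough(conds, 5*time.Minute)) // false
  }

If a2 from the unhappy rollout above stays stuck past the grace period, the same check flips to true and Degraded=True propagates as before.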

@openshift-ci openshift-ci bot requested review from frobware and knobunc July 26, 2023 20:54
openshift-ci bot (Contributor) commented Jul 26, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign knobunc for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wking wking force-pushed the dns-condition-inertia branch 5 times, most recently from e348b07 to 2e3e2b8 on July 26, 2023 23:51
@wking (Member, Author) commented Aug 16, 2023

I've pushed 2e3e2b8 -> 269b9b8, dropping #376's maxUnavailable backstopping now that #379 has returned the DaemonSet to using maxUnavailable explicitly.

…nges

From the API docs [1]:

  lastTransitionTime is the time of the last update to the current status property.

And the library-go SetStatusCondition implementation includes [2]:

  if existingCondition.Status != newCondition.Status {
    existingCondition.Status = newCondition.Status
    existingCondition.LastTransitionTime = metav1.NewTime(time.Now())
  }

  existingCondition.Reason = newCondition.Reason
  existingCondition.Message = newCondition.Message

The motivation for that behavior is that it's often more useful to
know "how long has this resource been Degraded=True?" and similar than
it is to know how long it has had exactly the same
status/reason/message.  Messages in particular can be fairly mutable
("# of # pods available", etc.), and changes there do not necessarily
represent fundamental shifts in the underlying issue.

[1]: https://github.com/openshift/api/blob/81f778f3b3ec31c1dd344e795620a8fbaf2d9a51/config/v1/types_cluster_operator.go#L129
[2]: https://github.com/openshift/library-go/blob/c515269de16e5e239bd6e93e1f9821a976bb460b/pkg/config/clusteroperator/v1helpers/status.go#L29C1-L35C50
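
To make the consequence concrete, here is a small self-contained sketch (not the real library-go helper; the reason and message values are made up) showing that a message-only update leaves LastTransitionTime untouched, so "time spent Degraded=True" remains measurable:

  package main

  import (
    "fmt"
    "time"

    configv1 "github.com/openshift/api/config/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
  )

  // setCondition mirrors the quoted logic: LastTransitionTime only moves when
  // Status flips; Reason and Message can churn without touching it.
  func setCondition(existing *configv1.ClusterOperatorStatusCondition, newCond configv1.ClusterOperatorStatusCondition) {
    if existing.Status != newCond.Status {
      existing.Status = newCond.Status
      existing.LastTransitionTime = metav1.NewTime(time.Now())
    }
    existing.Reason = newCond.Reason
    existing.Message = newCond.Message
  }

  func main() {
    degraded := configv1.ClusterOperatorStatusCondition{
      Type:               configv1.OperatorDegraded,
      Status:             configv1.ConditionTrue,
      Reason:             "PodsUnavailable",       // illustrative value
      Message:            "1 of 6 pods available", // illustrative value
      LastTransitionTime: metav1.NewTime(time.Now().Add(-10 * time.Minute)),
    }
    // Message-only churn: Status is unchanged, so LastTransitionTime stays put
    // and still answers "how long has this been Degraded=True?".
    setCondition(&degraded, configv1.ClusterOperatorStatusCondition{
      Type:    configv1.OperatorDegraded,
      Status:  configv1.ConditionTrue,
      Reason:  "PodsUnavailable",
      Message: "2 of 6 pods available",
    })
    fmt.Println(time.Since(degraded.LastTransitionTime.Time).Round(time.Minute)) // 10m0s
  }
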
…ng Degraded=True

The default DNS resource can go Degraded=True when it has insufficient
available pods.  But it's possible to have insufficient available pods
momentarily and subsequently recover.  For example, while the
DaemonSet is rolling out an update, we expect maxSurge unavailable
pods for the duration of the rollout, which could involve up to 10
(for 10% maxSurge) rounds of pods being spun up on the cluster's
nodes:

1. DaemonSet bumped.
2. Pod a2 launched on node a, pod b2 launched on node b, etc.
3. Pod a2 goes ready, and the old pod a1 is deleted.
4. Pod c2 launched on node c.
5. Pod b2 goes ready, and the old pod b1 is deleted.
...
n. Eventually all the new pods are ready.

During that rollout, there are always some unready pods.  But we're
still making quick progress and things are happy.  However, we could
also have:

1. DaemonSet bumped.
2. Pod a2 launched on node a, pod b2 launched on node b, etc.
3. Pod b2 goes ready, and the old pod b1 is deleted.
4. Pod c2 launched on node c.
...
n. Pod a2 still stuck.

That will have a similar number of unready pods as the happy case, but
a2 being stuck for so long is a bad sign, and might deserve admin
intervention.  Ideally the DaemonSet controller would be watching
individual pods and reporting conditions to let us know if it was
concerned about progress or recovery from external disruption.  But
DaemonSet status has no conditions today [1].  We could look over the
DaemonSet controller's shoulder and watch the pods directly, but that
would be a lot of work.  So instead I'm adding a few minutes of
inertia here, assuming that if the unready pods quickly resolve, it's
unlikely to impact quality-of-service or require admin intervention.
And if the unready pods (or some other issue) do not quickly resolve,
the situation likely deserves admin intervention (now with some
hopefully acceptable additional latency before summoning the admin).

[1]: https://github.com/kubernetes/kubernetes/blob/98358b8ce11b0c1878ae7aa1482668cb7a0b0e23/staging/src/k8s.io/api/apps/v1/types.go#L722
openshift-ci bot (Contributor) commented Aug 16, 2023

@wking: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/e2e-aws-ovn-single-node
Commit: 269b9b8
Details: link
Required: false
Rerun command: /test e2e-aws-ovn-single-node

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@candita (Contributor) commented Aug 23, 2023

/assign

@candita (Contributor) commented Aug 23, 2023

Unrelated to change:

{ fail [github.com/openshift/origin/test/extended/oauth/requestheaders.go:218]: full response header: HTTP/1.1 403 Forbidden
...
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"forbidden: User "system:anonymous" cannot get path "/metrics"","reason":"Forbidden","details":{},"code":403}
...
Expected
: 403 Forbidden
to contain substring
: 401 Unauthorized

/test e2e-aws-ovn-serial

@@ -417,7 +418,7 @@ func computeOperatorDegradedCondition(haveDNS bool, dns *operatorv1.DNS) configv
 
 	var degraded bool
 	for _, cond := range dns.Status.Conditions {
-		if cond.Type == operatorv1.OperatorStatusTypeDegraded && cond.Status == operatorv1.ConditionTrue {
+		if cond.Type == operatorv1.OperatorStatusTypeDegraded && cond.Status == operatorv1.ConditionTrue && time.Now().Sub(cond.LastTransitionTime.Time) > 5*time.Minute {
@candita (Contributor) commented Aug 23, 2023

We set the degraded condition based on transition time in https://github.com/openshift/cluster-dns-operator/blob/master/pkg/operator/controller/dns_status.go#L130, but I don't see a need to check lastTransitionTime again here.

@candita (Contributor) commented Aug 23, 2023

@wking I've approved #375 separately, and I don't think we need the remaining change in here as explained in #377 (comment) and because we won't use 5 minutes. Is it okay with you to close this?

@openshift-bot (Contributor)

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 22, 2023
@openshift-bot (Contributor)

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 22, 2023
@openshift-bot (Contributor)

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Jan 22, 2024
openshift-ci bot (Contributor) commented Jan 22, 2024

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
