
pkg/operator/controller/status: 5 minutes of inertia before propagating Degraded=True #377

Closed
wants to merge 2 commits

Conversation

@wking (Member) commented Jul 26, 2023

Layering on top of both #375 and #376.

The default DNS resource can go Degraded=True when it has insufficient available pods. But it's possible to have insufficient available pods momentarily and subsequently recover. For example, while the DaemonSet is rolling out an update, we expect maxSurge unavailable pods for the duration of the rollout, which could involve up to 10 (for 10% maxSurge) rounds of pods being spun up on the cluster's nodes:

  1. DaemonSet bumped.
  2. Pod a2 launched on node a, pod b2 launched on node b, etc.
  3. Pod a2 goes ready, and the old pod a1 is deleted.
  4. Pod c2 launched on node c.
  5. Pod b2 goes ready, and the old pod b1 is deleted.
    ...
    n. Eventually all the new pods are ready.

During that rollout, there are always some unready pods. But we're still making quick progress and things are happy.
However, we could also have:

  1. DaemonSet bumped.
  2. Pod a2 launched on node a, pod b2 launched on node b, etc.
  3. Pod b2 goes ready, and the old pod b1 is deleted.
  4. Pod c2 launched on node c.
    ...
    n. Pod a2 still stuck.

That will have a similar number of unready pods as the happy case, but a2 being stuck for so long is a bad sign, and might deserve admin intervention. Ideally the DaemonSet controller would be watching individual pods and reporting conditions to let us know if it was concerned about progress or recovery from external disruption. But DaemonSet status has no conditions today. We could look over the DaemonSet controller's shoulder and watch the pods directly, but that would be a lot of work. So instead I'm adding a few minutes of inertia here, assuming that if the unready pods quickly resolve, it's unlikely to impact quality-of-service or require admin intervention. And if the unready pods (or some other issue) do not quickly resolve, the situation likely deserves admin intervention (now with some hopefully acceptable additional latency before summoning the admin).
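
For concreteness, here is a minimal sketch of that inertia check as a standalone Go program, assuming operatorv1-style conditions. The degradedLongEnough helper name and the example values are illustrative; the actual change applies the same LastTransitionTime comparison inside computeOperatorDegradedCondition, as shown in the diff quoted later in this thread:

  package main

  import (
    "fmt"
    "time"

    operatorv1 "github.com/openshift/api/operator/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
  )

  // degradedLongEnough reports whether any Degraded=True condition has been in
  // that state for longer than the grace period, using LastTransitionTime as
  // "when this condition last changed status".
  func degradedLongEnough(conds []operatorv1.OperatorCondition, grace time.Duration) bool {
    for _, cond := range conds {
      if cond.Type == operatorv1.OperatorStatusTypeDegraded &&
        cond.Status == operatorv1.ConditionTrue &&
        time.Since(cond.LastTransitionTime.Time) > grace {
        return true
      }
    }
    return false
  }

  func main() {
    conds := []operatorv1.OperatorCondition{{
      Type:               operatorv1.OperatorStatusTypeDegraded,
      Status:             operatorv1.ConditionTrue,
      LastTransitionTime: metav1.NewTime(time.Now().Add(-2 * time.Minute)),
    }}
    // Only two minutes into Degraded=True, so a 5-minute grace period still
    // suppresses the operator-level Degraded condition.
    fmt.Println(degradedLongEnough(conds, 5*time.Minute)) // false
  }

If a2 from the unhappy rollout above stays stuck past the grace period, the same check flips to true and Degraded=True propagates as before.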

@openshift-ci openshift-ci bot requested review from frobware and knobunc July 26, 2023 20:54
openshift-ci bot (Contributor) commented Jul 26, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign knobunc for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wking wking force-pushed the dns-condition-inertia branch 5 times, most recently from e348b07 to 2e3e2b8 on July 26, 2023 23:51
@wking (Member, Author) commented Aug 16, 2023

I've pushed 2e3e2b8 -> 269b9b8, dropping #376's maxUnavailable backstopping now that #379 has returned the DaemonSet to using maxUnavailable explicitly.

…nges

From the API docs [1]:

  lastTransitionTime is the time of the last update to the current status property.

And the library-go SetStatusCondition implementation includes [2]:

  if existingCondition.Status != newCondition.Status {
    existingCondition.Status = newCondition.Status
    existingCondition.LastTransitionTime = metav1.NewTime(time.Now())
  }

  existingCondition.Reason = newCondition.Reason
  existingCondition.Message = newCondition.Message

The motivation for that behavior is that it's often more useful to
know "how long has this resource been Degraded=True?" and similar than
it is to know how long it has had exactly the same
status/reason/message.  Messages in particular can be fairly mutable
("# of # pods available", etc.), and changes there do not necessarily
represent fundamental shifts in the underlying issue.

[1]: https://github.com/openshift/api/blob/81f778f3b3ec31c1dd344e795620a8fbaf2d9a51/config/v1/types_cluster_operator.go#L129
[2]: https://github.com/openshift/library-go/blob/c515269de16e5e239bd6e93e1f9821a976bb460b/pkg/config/clusteroperator/v1helpers/status.go#L29C1-L35C50
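
To make the consequence concrete, here is a small self-contained sketch (not the real library-go helper; the reason and message values are made up) showing that a message-only update leaves LastTransitionTime untouched, so "time spent Degraded=True" remains measurable:

  package main

  import (
    "fmt"
    "time"

    configv1 "github.com/openshift/api/config/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
  )

  // setCondition mirrors the quoted logic: LastTransitionTime only moves when
  // Status flips; Reason and Message can churn without touching it.
  func setCondition(existing *configv1.ClusterOperatorStatusCondition, newCond configv1.ClusterOperatorStatusCondition) {
    if existing.Status != newCond.Status {
      existing.Status = newCond.Status
      existing.LastTransitionTime = metav1.NewTime(time.Now())
    }
    existing.Reason = newCond.Reason
    existing.Message = newCond.Message
  }

  func main() {
    degraded := configv1.ClusterOperatorStatusCondition{
      Type:               configv1.OperatorDegraded,
      Status:             configv1.ConditionTrue,
      Reason:             "PodsUnavailable",       // illustrative value
      Message:            "1 of 6 pods available", // illustrative value
      LastTransitionTime: metav1.NewTime(time.Now().Add(-10 * time.Minute)),
    }
    // Message-only churn: Status is unchanged, so LastTransitionTime stays put
    // and still answers "how long has this been Degraded=True?".
    setCondition(&degraded, configv1.ClusterOperatorStatusCondition{
      Type:    configv1.OperatorDegraded,
      Status:  configv1.ConditionTrue,
      Reason:  "PodsUnavailable",
      Message: "2 of 6 pods available",
    })
    fmt.Println(time.Since(degraded.LastTransitionTime.Time).Round(time.Minute)) // 10m0s
  }
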
…ng Degraded=True

The default DNS resource can go Degraded=True when it has insufficient
available pods.  But it's possible to have insufficient available pods
momentarily and subsequently recover.  For example, while the
DaemonSet is rolling out an update, we expect maxSurge unavailable
pods for the duration of the rollout, which could involve up to 10
(for 10% maxSurge) rounds of pods being spun up on the cluster's
nodes:

1. DaemonSet bumped.
2. Pod a2 launched on node a, pod b2 launched on node b, etc.
3. Pod a2 goes ready, and the old pod a1 is deleted.
4. Pod c2 launched on node c.
5. Pod b2 goes ready, and the old pod b1 is deleted.
...
n. Eventually all the new pods are ready.

During that rollout, there are always some unready pods.  But we're
still making quick progress and things are happy.  However, we could
also have:

1. DaemonSet bumped.
2. Pod a2 launched on node a, pod b2 launched on node b, etc.
3. Pod b2 goes ready, and the old pod b1 is deleted.
4. Pod c2 launched on node c.
...
n. Pod a2 still stuck.

That will have a similar number of unready pods as the happy case, but
a2 being stuck for so long is a bad sign, and might deserve admin
intervention.  Ideally the DaemonSet controller would be watching
individual pods and reporting conditions to let us know if it was
concerned about progress or recovery from external disruption.  But
DaemonSet status has no conditions today [1].  We could look over the
DaemonSet controller's shoulder and watch the pods directly, but that
would be a lot of work.  So instead I'm adding a few minutes of
inertia here, assuming that if the unready pods quickly resolve, it's
unlikely to impact quality-of-service or require admin intervention.
And if the unready pods (or some other issue) do not quickly resolve,
the situation likely deserves admin intervention (now with some
hopefully acceptable additional latency before summoning the admin).

[1]: https://github.com/kubernetes/kubernetes/blob/98358b8ce11b0c1878ae7aa1482668cb7a0b0e23/staging/src/k8s.io/api/apps/v1/types.go#L722
openshift-ci bot (Contributor) commented Aug 16, 2023

@wking: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/e2e-aws-ovn-single-node
Commit: 269b9b8
Details: link
Required: false
Rerun command: /test e2e-aws-ovn-single-node

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@candita (Contributor) commented Aug 23, 2023

/assign

@candita (Contributor) commented Aug 23, 2023

Unrelated to change:

{ fail [github.com/openshift/origin/test/extended/oauth/requestheaders.go:218]: full response header: HTTP/1.1 403 Forbidden
...
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"forbidden: User "system:anonymous" cannot get path "/metrics"","reason":"Forbidden","details":{},"code":403}
...
Expected
: 403 Forbidden
to contain substring
: 401 Unauthorized

/test e2e-aws-ovn-serial

@@ -417,7 +418,7 @@ func computeOperatorDegradedCondition(haveDNS bool, dns *operatorv1.DNS) configv
 
 	var degraded bool
 	for _, cond := range dns.Status.Conditions {
-		if cond.Type == operatorv1.OperatorStatusTypeDegraded && cond.Status == operatorv1.ConditionTrue {
+		if cond.Type == operatorv1.OperatorStatusTypeDegraded && cond.Status == operatorv1.ConditionTrue && time.Now().Sub(cond.LastTransitionTime.Time) > 5*time.Minute {
@candita (Contributor) commented Aug 23, 2023

We set the degraded condition based on transition time in https://github.com/openshift/cluster-dns-operator/blob/master/pkg/operator/controller/dns_status.go#L130, but I don't see a need to check lastTransitionTime again here.

@candita (Contributor) commented Aug 23, 2023

@wking I've approved #375 separately, and I don't think we need the remaining change in here as explained in #377 (comment) and because we won't use 5 minutes. Is it okay with you to close this?

@openshift-bot (Contributor)

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 22, 2023
@openshift-bot (Contributor)

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 22, 2023
@openshift-bot (Contributor)

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Jan 22, 2024
openshift-ci bot (Contributor) commented Jan 22, 2024

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
