Bug 1761506: status: prevent degraded status flapping on rollout #134
Conversation
/cc @wking
@@ -70,14 +70,14 @@ func computeDNSDegradedCondition(oldCondition *operatorv1.OperatorCondition, clu
 		degradedCondition.Status = operatorv1.ConditionTrue
 		degradedCondition.Reason = "NoClusterIP"
 		degradedCondition.Message = "No ClusterIP assigned to DNS Service"
-	case ds.Status.NumberAvailable == 0:
+	case ds.Status.DesiredNumberScheduled == 0:
This should stay NumberAvailable == 0, right?
i.e. wanting more than one pod is not enough, you actually have to have more than one pod to get out of NoPodsScheduled
NoPodsScheduled is actually the misnomer, the check is intentional... if none are even desired to be scheduled, then available seems irrelevant. I'm not even sure how we can get here in the wild...
NoPodsDesired?
Maybe DesiredNumberScheduled=0 if none of the nodes match the scheduling criteria due to taints or something.
-	case ds.Status.NumberAvailable != ds.Status.DesiredNumberScheduled:
 		degradedCondition.Reason = "NoPodsScheduled"
 		degradedCondition.Message = "No CoreDNS pods are desired to be scheduled"
+	case ds.Status.DesiredNumberScheduled > 0 && ds.Status.NumberAvailable <= 1:
@Miciah suggested ds.Status.DesiredNumberScheduled - ds.Status.NumberAvailable > ds.Spec.UpdateStrategy.RollingUpdate.MaxUnavailable for this case, and I like that better than NumberAvailable <= 1, because it means that Kubernetes is violating our DaemonSet rollout condition, and presumably you could grow a cluster to be large enough that a single CoreDNS could not serve the whole cluster.
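A hypothetical sketch (not the operator's actual code) of the condition suggested above, modeling the relevant DaemonSet status fields as plain ints and assuming maxUnavailable has already been resolved to an absolute pod count (the real field is an IntOrString that may also hold a percentage):

```go
package main

import "fmt"

// dsStatus is a stand-in for the ds.Status.* fields referenced in the
// review thread.
type dsStatus struct {
	DesiredNumberScheduled int32
	NumberAvailable        int32
}

// degraded reports DNS degraded only when the number of unavailable
// pods exceeds the rolling-update maxUnavailable budget, i.e. when
// Kubernetes is violating the DaemonSet rollout condition.
func degraded(s dsStatus, maxUnavailable int32) bool {
	if s.DesiredNumberScheduled == 0 {
		// No pods are desired at all; available count is irrelevant.
		return true
	}
	return s.DesiredNumberScheduled-s.NumberAvailable > maxUnavailable
}

func main() {
	// Mid-rollout: 1 of 6 pods unavailable, within a budget of 1.
	fmt.Println(degraded(dsStatus{DesiredNumberScheduled: 6, NumberAvailable: 5}, 1)) // false
	// 3 of 6 unavailable exceeds the budget of 1.
	fmt.Println(degraded(dsStatus{DesiredNumberScheduled: 6, NumberAvailable: 3}, 1)) // true
}
```

Unlike NumberAvailable <= 1, this scales with cluster size: a large cluster with many desired pods is not considered healthy merely because two replicas happen to be up.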
Force-pushed from 47143c1 to 0a173c0.
Force-pushed from 0a173c0 to f8d3aab.
/lgtm
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: ironcladlou, Miciah, wking
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@ironcladlou: All pull requests linked via external trackers have merged. Bugzilla bug 1761506 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cherrypick release-4.2
@ironcladlou: new pull request created: #135 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@ironcladlou trying to understand @wking's comment here openshift/machine-api-operator#417 (comment). Does this mean this change introduces a regression, or did it just uncover another thing that should be cleaned up?
Not sure that I understand the question; can you try another way?
status: prevent degraded status flapping on rollout
The current DNS degraded status calculation doesn't incorporate time. If at any
moment of observation the daemonset's number of available replicas diverges from
the desired scheduled count, the CoreDNS daemonset is instantly considered
degraded. This results in degraded true/false flapping when the operator
aggregates the DNS status during a daemonset rollout.
This seems like a poor measure of degraded, because most of the time this
condition arises during a rollout when QoS is not seriously affected. A better
way might be to incorporate progressing time into the calculation, and that is
something we should probably do, but the solution is a little more complex.
This patch should provide a good improvement in the meantime by only considering
DNS degraded if the number of unavailable replicas exceeds the max unavailable
count on the daemonset's rolling parameters, which seems like an objectively
poor state to be in.
And, as mentioned, we can further improve the situation later by trying to
incorporate progressing time. That later refinement could replace the logic
introduced in this commit.
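One detail the commit message glosses over: the daemonset's maxUnavailable rolling parameter may be either an absolute count or a percentage. A minimal self-contained sketch of resolving it to a pod count; rounding percentages up is an assumption of this sketch, and the real controller uses apimachinery's intstr utilities rather than this hypothetical helper:

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// resolveMaxUnavailable converts a maxUnavailable setting, given as
// either an absolute count ("1") or a percentage ("10%"), into an
// absolute number of pods for the given desired replica count.
func resolveMaxUnavailable(v string, desired int) (int, error) {
	if strings.HasSuffix(v, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(v, "%"))
		if err != nil {
			return 0, err
		}
		// Round up so a nonzero percentage always permits progress.
		return int(math.Ceil(float64(desired) * float64(pct) / 100)), nil
	}
	return strconv.Atoi(v)
}

func main() {
	n, _ := resolveMaxUnavailable("10%", 25) // ceil(2.5) = 3
	fmt.Println(n)
	n, _ = resolveMaxUnavailable("1", 25)
	fmt.Println(n)
}
```

With the resolved count in hand, the degraded check reduces to comparing it against DesiredNumberScheduled minus NumberAvailable, as described above.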