
Bug 1761506: status: prevent degraded status flapping on rollout #134

Merged

Conversation


@ironcladlou ironcladlou commented Oct 11, 2019

status: prevent degraded status flapping on rollout

The current DNS degraded status calculation doesn't incorporate time. If, at any
moment of observation, the daemonset's number of available replicas diverges
from its desired scheduled replicas, the CoreDNS daemonset is instantly
considered degraded. This results in degraded true/false flapping when the
operator aggregates the DNS status during a daemonset rollout.

This seems like a poor measure of degraded, because most of the time this
condition arises during a rollout, when QoS is not seriously affected. A better
way might be to incorporate progressing time into the calculation, and that is
something we should probably do, but that solution is a little more complex.

This patch should provide a good improvement in the meantime by only considering
DNS degraded if the number of unavailable replicas exceeds the max unavailable
count on the daemonset's rolling parameters, which seems like an objectively
poor state to be in.

And, as mentioned, we can further improve the situation later by trying to
incorporate progressing time. That later refinement could replace the logic
introduced in this commit.
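The rule described above can be sketched in Go. This is a minimal illustration with plain ints; `dnsDegraded` and its parameter names are hypothetical, not the operator's actual code, which reads these values from the daemonset's status and rolling update parameters:

```go
package main

import "fmt"

// dnsDegraded reports whether the DNS daemonset should be considered
// degraded under the rule in the commit message: only when the number of
// unavailable pods exceeds the rollout's maxUnavailable budget. Sketch
// only; the real check works on the daemonset status and update strategy.
func dnsDegraded(desiredScheduled, available, maxUnavailable int) bool {
	unavailable := desiredScheduled - available
	return unavailable > maxUnavailable
}

func main() {
	// Mid-rollout: 5 pods desired, 4 available, maxUnavailable 1.
	// One pod down is within budget, so not degraded.
	fmt.Println(dnsDegraded(5, 4, 1)) // false

	// Two pods down exceeds the budget: degraded.
	fmt.Println(dnsDegraded(5, 3, 1)) // true
}
```

Under the old rule, the first case (any divergence at all) would already have flipped Degraded to true, which is exactly the flapping this patch avoids.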

@openshift-ci-robot openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Oct 11, 2019
@ironcladlou
Contributor Author

/cc @wking

@@ -70,14 +70,14 @@ func computeDNSDegradedCondition(oldCondition *operatorv1.OperatorCondition, clu
degradedCondition.Status = operatorv1.ConditionTrue
degradedCondition.Reason = "NoClusterIP"
degradedCondition.Message = "No ClusterIP assigned to DNS Service"
-	case ds.Status.NumberAvailable == 0:
+	case ds.Status.DesiredNumberScheduled == 0:
Member

This should stay NumberAvailable == 0, right?

Member

i.e. wanting more than one pod is not enough, you actually have to have more than one pod to get out of NoPodsScheduled

Contributor Author

NoPodsScheduled is actually a misnomer; the check is intentional... if none are even desired to be scheduled, then availability seems irrelevant. I'm not even sure how we can get here in the wild...

Contributor Author

NoPodsDesired?

Contributor Author

Maybe DesiredNumberScheduled=0 if none of the nodes match the scheduling criteria due to taints or something.

-	case ds.Status.NumberAvailable != ds.Status.DesiredNumberScheduled:
 	degradedCondition.Reason = "NoPodsScheduled"
 	degradedCondition.Message = "No CoreDNS pods are desired to be scheduled"
+	case ds.Status.DesiredNumberScheduled > 0 && ds.Status.NumberAvailable <= 1:
Member

@Miciah suggested ds.Status.DesiredNumberScheduled - ds.Status.NumberAvailable > ds.Spec.UpdateStrategy.RollingUpdate.MaxUnavailable for this case, and I like that better than NumberAvailable <= 1, because it means that Kubernetes is violating our DaemonSet rollout condition, and presumably you could grow a cluster to be large enough that a single CoreDNS could not serve the whole cluster.
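One wrinkle with the suggested comparison: `MaxUnavailable` is an IntOrString, so it may be an absolute count or a percentage of the desired pod count. Below is a dependency-free sketch of resolving it; `resolveMaxUnavailable` is a hypothetical helper, and the real controller would use the apimachinery intstr utilities (where, as I understand it, daemonset maxUnavailable percentages round up):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// resolveMaxUnavailable turns a daemonset maxUnavailable value, which may
// be an absolute count ("2") or a percentage ("10%"), into a pod count
// relative to the number of desired pods. Percentages are rounded up here
// so a nonzero percentage on a small daemonset still permits at least one
// unavailable pod. Illustrative only; not the apimachinery helper itself.
func resolveMaxUnavailable(value string, desired int) (int, error) {
	if strings.HasSuffix(value, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(value, "%"))
		if err != nil {
			return 0, err
		}
		return (desired*pct + 99) / 100, nil // integer ceiling of desired*pct/100
	}
	return strconv.Atoi(value)
}

func main() {
	n, _ := resolveMaxUnavailable("10%", 25) // ceil(2.5) = 3
	fmt.Println(n)
	n, _ = resolveMaxUnavailable("1", 25) // absolute count passes through
	fmt.Println(n)
}
```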

@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 11, 2019
@wking
Member

wking commented Oct 11, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 11, 2019
@Miciah
Contributor

Miciah commented Oct 11, 2019

/lgtm

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ironcladlou, Miciah, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 729b96f into openshift:master Oct 11, 2019
@ironcladlou ironcladlou changed the title status: prevent degraded status flapping on rollout Bug 1761506: status: prevent degraded status flapping on rollout Oct 14, 2019
@openshift-ci-robot
Contributor

@ironcladlou: All pull requests linked via external trackers have merged. Bugzilla bug 1761506 has been moved to the MODIFIED state.

In response to this:

Bug 1761506: status: prevent degraded status flapping on rollout

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ironcladlou
Contributor Author

/cherrypick release-4.2

@openshift-cherrypick-robot

@ironcladlou: new pull request created: #135

In response to this:

/cherrypick release-4.2

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jwforres
Member

@ironcladlou trying to understand @wking 's comment here openshift/machine-api-operator#417 (comment)

Does this mean this change introduces a regression, or did it just uncover another thing that should be cleaned up?

@ironcladlou
Copy link
Contributor Author

Not sure that I understand the question, can you try another way?

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

7 participants