
Bug 1698562: status: introduce ingresscontroller degraded condition #283

Merged

Conversation

ironcladlou
Contributor

Introduce degraded condition computation for ingresscontroller. For now,
degraded only considers failed deployments, which seems like a conservative bare
minimum indicator of a degraded state. Ensure that ingresscontroller deployments
have a useful progressDeadlineSeconds so that degraded deployments are actually
detected in a useful timeframe.

Refactor clusteroperator degraded status to account for ingresscontroller
degraded conditions.

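The degraded computation described above can be sketched roughly as follows. This is a simplified stand-in using local types, not the operator's real code, which works on appsv1.Deployment and operatorv1.OperatorCondition; the key observation is that the Deployment controller reports a failed rollout as Progressing=False with reason ProgressDeadlineExceeded once progressDeadlineSeconds elapses:

```go
package main

import "fmt"

// Simplified stand-ins for the Kubernetes/OpenShift API types.
type DeploymentCondition struct {
	Type   string // e.g. "Progressing", "ReplicaFailure"
	Status string // "True" or "False"
	Reason string
}

type OperatorCondition struct {
	Type   string
	Status string
	Reason string
}

// computeDegradedCondition reports Degraded=True when the deployment has
// failed to make progress within progressDeadlineSeconds, which the
// Deployment controller surfaces as Progressing=False with reason
// ProgressDeadlineExceeded.
func computeDegradedCondition(conds []DeploymentCondition) OperatorCondition {
	for _, c := range conds {
		if c.Type == "Progressing" && c.Status == "False" && c.Reason == "ProgressDeadlineExceeded" {
			return OperatorCondition{Type: "Degraded", Status: "True", Reason: "DeploymentFailed"}
		}
	}
	return OperatorCondition{Type: "Degraded", Status: "False"}
}

func main() {
	failed := []DeploymentCondition{{Type: "Progressing", Status: "False", Reason: "ProgressDeadlineExceeded"}}
	fmt.Println(computeDegradedCondition(failed).Status) // True
	fmt.Println(computeDegradedCondition(nil).Status)    // False
}
```

A lower progressDeadlineSeconds therefore directly controls how quickly a stalled rollout flips this condition to True.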
@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Aug 7, 2019
@openshift-ci-robot
Contributor

@ironcladlou: This pull request references a valid Bugzilla bug. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Bug 1698562: status: introduce ingresscontroller degraded condition

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 7, 2019
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 7, 2019
@@ -4,6 +4,7 @@ kind: Deployment
apiVersion: apps/v1
# name and namespace are set at runtime.
spec:
progressDeadlineSeconds: 120
Contributor Author

This might be too long, but 120 seems conservative and we can adjust if necessary.

@ironcladlou
Contributor Author

Here's an example of the degraded condition on the ingresscontroller:

  - lastTransitionTime: "2019-08-07T15:05:03Z"
    message: 'The deployment failed (reason: ProgressDeadlineExceeded) with message:
      ReplicaSet "router-default-858c66c5c9" has timed out progressing.'
    reason: DeploymentFailed
    status: "True"
    type: Degraded

And here's the corresponding degraded condition on the clusteroperator:

  - lastTransitionTime: "2019-08-07T15:05:03Z"
    message: 'Some ingresscontrollers are degraded: default'
    reason: IngressControllersDegraded
    status: "True"
    type: Degraded

The ingresscontroller and operator are still available in this case because minimum deployment availability is maintained.
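The roll-up from per-ingresscontroller conditions to the clusteroperator condition shown above can be sketched like this. It is a hypothetical simplification; the real operator works on operatorv1.IngressController objects and configv1.ClusterOperatorStatusCondition values:

```go
package main

import (
	"fmt"
	"strings"
)

// ingressController is an illustrative stand-in for operatorv1.IngressController.
type ingressController struct {
	Name     string
	Degraded bool
}

// computeOperatorDegradedCondition rolls individual ingresscontroller
// Degraded conditions up into a single clusteroperator condition,
// listing the names of any degraded controllers in the message.
func computeOperatorDegradedCondition(ics []ingressController) (status, reason, message string) {
	var degraded []string
	for _, ic := range ics {
		if ic.Degraded {
			degraded = append(degraded, ic.Name)
		}
	}
	if len(degraded) > 0 {
		return "True", "IngressControllersDegraded",
			fmt.Sprintf("Some ingresscontrollers are degraded: %s", strings.Join(degraded, ", "))
	}
	return "False", "", ""
}

func main() {
	status, reason, msg := computeOperatorDegradedCondition([]ingressController{{Name: "default", Degraded: true}})
	fmt.Println(status, reason, msg)
}
```

This mirrors the example status above: a single degraded default controller yields Degraded=True with reason IngressControllersDegraded on the clusteroperator.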

return operatorv1.OperatorCondition{
	Type:   operatorv1.OperatorStatusTypeDegraded,
	Status: operatorv1.ConditionFalse,
	Reason: "DeploymentAvailable",
Contributor

DeploymentAvailable could be misleading if the deployment is still progressing and not available. Do we need an explicit reason when the degraded condition is false?

Contributor Author

Good point, I removed the reason entirely.

}

conditions := r.computeOperatorStatusConditions([]configv1.ClusterOperatorStatusCondition{},
	namespace, tc.allIngressesAvailable, oldVersions, reportedVersions)
actual := computeOperatorProgressingCondition(tc.allIngressesAvailable, oldVersions, reportedVersions, tc.curVersions.operator, tc.curVersions.operand)
conditionsCmpOpts := []cmp.Option{
	cmpopts.IgnoreFields(configv1.ClusterOperatorStatusCondition{}, "LastTransitionTime", "Reason", "Message"),
	cmpopts.EquateEmpty(),
	cmpopts.SortSlices(func(a, b configv1.ClusterOperatorStatusCondition) bool { return a.Type < b.Type }),
Contributor

No longer need cmpopts.SortSlices or cmpopts.EquateEmpty.

Contributor Author

Fixed

@@ -4,6 +4,7 @@ kind: Deployment
apiVersion: apps/v1
# name and namespace are set at runtime.
spec:
progressDeadlineSeconds: 120
Contributor

This reminds me that we need to set the readiness probe to use /healthz/ready. I'm nervous about setting the progressing deadline significantly lower than the default, especially if we start using /healthz/ready. Are we confident that the initial sync will finish in time on large clusters?

Contributor Author

The default is 600 seconds, which seems too long. Do you agree? Would some e2e test warn us if we chose a value that's too short on average?

Contributor

The default is 600 seconds, which seems too long. Do you agree?

That's what I'm wondering.

Would some e2e test warn us if we chose a value that's too short on average?

Not if the E2E tests are not representative of production clusters. Moreover, while we may be fine now, I intend to fix the readiness check to use /healthz/ready, which will cause the deployment not to be ready until the router has synched routes, which I could see taking more than 120 seconds on burdened clusters with many routes.

Contributor

Progress deadline is usually measured in 10-20m.

Contributor

This is way too low.

Contributor Author

We're also not setting readiness endpoints correctly, fixing that and changing back to 10m in a followup.

Contributor Author

Never mind, fixed here instead.

type conditions struct {
	degraded, progressing, available bool
}

func TestComputeOperatorProgressingCondition(t *testing.T) {
Contributor

So we're losing unit-test coverage of computeOperatorAvailableCondition and computeOperatorDegradedCondition, but I suppose they are sufficiently covered by E2E tests.

Contributor Author

They currently seem simple enough that unit test coverage would be more code than it's worth given e2e, IMO.

If we start introducing additional inputs to the formulas, unit tests may become useful...
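If more inputs do get added, a table-driven test in the usual Go style could look roughly like this. The helper computeProgressing and the case names here are illustrative, not the operator's actual code:

```go
package main

import "fmt"

// testCase is a hypothetical table entry in the style of the operator's
// status tests.
type testCase struct {
	name                  string
	allIngressesAvailable bool
	expectProgressing     string
}

// computeProgressing is an illustrative stand-in: the operator is
// Progressing=True while any ingresscontroller is still rolling out.
func computeProgressing(allAvailable bool) string {
	if !allAvailable {
		return "True"
	}
	return "False"
}

func main() {
	cases := []testCase{
		{"all available", true, "False"},
		{"rollout in progress", false, "True"},
	}
	for _, tc := range cases {
		if got := computeProgressing(tc.allIngressesAvailable); got != tc.expectProgressing {
			fmt.Printf("%s: expected %s, got %s\n", tc.name, tc.expectProgressing, got)
		} else {
			fmt.Printf("%s: ok\n", tc.name)
		}
	}
}
```

The table format keeps each new input a one-line addition per case, which is the point at which unit tests start paying for themselves over e2e coverage.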

@Miciah
Contributor

Miciah commented Aug 7, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 7, 2019
@ironcladlou
Contributor Author

Tests are still going, might as well roll the followups into this one.

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 7, 2019
Set a more conservative deadline and use the correct readiness endpoint.
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Aug 7, 2019
@ironcladlou
Contributor Author

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 7, 2019
@Miciah
Contributor

Miciah commented Aug 7, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 7, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ironcladlou, Miciah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit df1eba3 into openshift:master Aug 7, 2019
@openshift-ci-robot
Contributor

@ironcladlou: All pull requests linked via external trackers have merged. The Bugzilla bug has been moved to the MODIFIED state.

In response to this:

Bug 1698562: status: introduce ingresscontroller degraded condition

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
