
[release-1.12] Don't drop traffic when upgrading a deployment fails #14840

Merged

Conversation

@dprotaso (Member) commented Jan 28, 2024

Part of #14660

@knative-prow knative-prow bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. area/test-and-release It flags unit/e2e/conformance/perf test issues for product features labels Jan 28, 2024
@knative-prow knative-prow bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 28, 2024
codecov bot commented Jan 28, 2024

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (0d199d9) 86.02% compared to head (f50114f) 86.02%.
Report is 1 commit behind head on release-1.12.

Files                                       Patch %   Lines
pkg/apis/serving/v1/revision_lifecycle.go   70.00%    2 Missing and 1 partial ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##           release-1.12   #14840   +/-   ##
=============================================
  Coverage         86.02%   86.02%           
=============================================
  Files               197      197           
  Lines             14922    14931    +9     
=============================================
+ Hits              12837    12845    +8     
- Misses             1775     1776    +1     
  Partials            310      310           


@dprotaso (Member, Author)

/test upgrade-tests_serving_release-1.12

3 similar comments
@dprotaso (Member, Author)

/test upgrade-tests_serving_release-1.12

@dprotaso (Member, Author)

/test upgrade-tests_serving_release-1.12

@dprotaso (Member, Author)

/test upgrade-tests_serving_release-1.12

@dprotaso dprotaso force-pushed the test-upgrade-failure-test branch 2 times, most recently from dde064a to a6f145f Compare January 30, 2024 17:28
@knative-prow knative-prow bot added area/API API objects and controllers area/autoscale labels Jan 30, 2024
knative-prow bot commented Jan 30, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dprotaso dprotaso changed the title [wip] [release-1.12] test deployment failures don't drop traffic on upgrade [release-1.12] Don't drop traffic when upgrading a deployment fails Jan 30, 2024
@knative-prow knative-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 30, 2024
When transforming the deployment status to the revision
we want to bubble up the more severe condition to Ready.

Since replica failures include a more actionable error
message, that condition is preferred.

This isn't accurate when the Revision has failed to roll out
an update to its deployment, so:

1. PA reachability now depends on the status of the Deployment.

   If we have available replicas, we don't mark the revision as
   unreachable. This allows ongoing requests to be handled (see
   the sketch after this list).

2. Always propagate the K8s Deployment status to the Revision.

   We don't need to gate this depending on whether the Revision
   required activation, since the only two conditions we propagate
   from the Deployment are Progressing and ReplicaSetFailure=False.

3. Mark the Revision as Deploying if the PA's service name isn't set.
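
A minimal sketch of the reachability rule in point 1, assuming the knative.dev/serving API types; resolveReachability and its placement are illustrative, not the exact patch:

import (
	appsv1 "k8s.io/api/apps/v1"
	autoscalingv1alpha1 "knative.dev/serving/pkg/apis/autoscaling/v1alpha1"
	v1 "knative.dev/serving/pkg/apis/serving/v1"
)

// resolveReachability decides the PA reachability for a revision.
func resolveReachability(rev *v1.Revision, d *appsv1.Deployment) autoscalingv1alpha1.ReachabilityType {
	// A revision that no route points at stays unreachable regardless
	// of deployment health.
	if !rev.IsReachable() {
		return autoscalingv1alpha1.ReachabilityUnreachable
	}
	// Even when the revision reports a failure, keep it reachable while
	// ready replicas exist so ongoing requests are still handled.
	if d != nil && d.Status.ReadyReplicas > 0 {
		return autoscalingv1alpha1.ReachabilityReachable
	}
	return autoscalingv1alpha1.ReachabilityUnknown
}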
@dprotaso (Member, Author)

rebased

@knative knative deleted a comment from knative-prow bot Jan 30, 2024
@dprotaso (Member, Author)

ambient is flaky - removing from 1.12 branch here - #14848

/override "test (v1.26.x, istio-ambient, runtime)"
/override "test (v1.26.x, istio-ambient, api)"
/override "test (v1.26.x, istio-ambient, e2e)"

@@ -144,9 +143,3 @@ func (rs *RevisionStatus) IsActivationRequired() bool {
c := revisionCondSet.Manage(rs).GetCondition(RevisionConditionActive)
return c != nil && c.Status != corev1.ConditionTrue
}

// IsReplicaSetFailure returns true if the deployment replicaset failed to create
func (rs *RevisionStatus) IsReplicaSetFailure(deploymentStatus *appsv1.DeploymentStatus) bool {
Contributor

Where do we cover this part?

Member Author

Here https://github.com/knative/serving/pull/14840/files#diff-f23ba125b15b4c6e491938d04e4d412d0b550197c51078c7de359ba7abc0da17R71

We always propagate the status now - and this is surfaced as a deployment condition.
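
For reference, a minimal sketch of detecting that condition against the upstream Kubernetes API; hasReplicaFailure is an illustrative helper name, not the function in this diff:

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// hasReplicaFailure reports whether the Deployment carries a
// ReplicaFailure condition with Status=True, i.e. its ReplicaSet
// could not create pods.
func hasReplicaFailure(ds *appsv1.DeploymentStatus) bool {
	for _, c := range ds.Conditions {
		if c.Type == appsv1.DeploymentReplicaFailure && c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}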

if c := rev.Status.GetCondition(cond); c != nil && c.IsFalse() {
	if infraFailure && deployment != nil && deployment.Spec.Replicas != nil {
		// If we have an infra failure and no ready replicas - then this revision is unreachable
		if *deployment.Spec.Replicas > 0 && deployment.Status.ReadyReplicas == 0 {
@skonto (Contributor) commented Jan 31, 2024

Sorry for being verbose, just trying to summarize.
So, in the past we moved from checking whether the revision's routing state is active for PA reachability:

if !rev.IsReachable() { return autoscalingv1alpha1.ReachabilityUnreachable }

func (r *Revision) IsReachable() bool {
	return RoutingState(r.Labels[serving.RoutingStateLabelKey]) == RoutingStateActive
}

to checking some revision conditions before checking the routing state, in order to avoid old revision pods being created until the new revision is up.
However, that proved a bit too aggressive in the case of a broken webhook during a deployment upgrade, thus cutting traffic.
Now with this patch we only set the PA to "unreachable" due to the revision being unhealthy or inactive if there are no ready replicas (when >0 are required), so traffic is still allowed.
Btw, what is the effect on the revision when we mark it unreachable? Does it propagate to the revision? I am a bit confused by all these states.
Could we also do a follow-up PR to document this state machine of resources, so we feel more confident about changing stuff?

Member Author

> Now with this patch we only set the PA to "unreachable" due to the revision being unhealthy or inactive if there are no ready replicas (when >0 are required).

The other condition is if the revision is not being pointed to by a route; then it's unreachable as well.

> Btw, what is the effect on the revision when we mark it unreachable? Does it propagate to the revision?

If the revision marks the PA unreachable, then the autoscaler will scale the deployment down to zero.

> Could we also do a follow-up PR to document this state machine of resources, so we feel more confident about changing stuff?

Sure - I also included the necessary tests to cover this case.
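
As a rough sketch of that effect, assuming the autoscaler consumes the PA spec (desiredScale is an illustrative variable, not the actual reconciler code):

// If the revision has marked its PodAutoscaler unreachable, the
// autoscaler drives the scale target to zero instead of honoring
// the computed scale.
if pa.Spec.Reachability == autoscalingv1alpha1.ReachabilityUnreachable {
	desiredScale = 0
}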

if ps.IsScaleTargetInitialized() && !resUnavailable {
	// Precondition for PA being initialized is SKS being active and
	// that implies that |service.endpoints| > 0.
	rs.MarkResourcesAvailableTrue()
	rs.MarkContainerHealthyTrue()
}

// Mark resource unavailable if we don't have a Service Name and the deployment is ready
Member

Can we somehow combine this with the statements above (from https://github.com/knative/serving/pull/14840/files#diff-831a9383e7db7880978acf31f7dfec777beb08b900b1d0e1c55a5aed42e602cbR173 down)?
It feels like both parts work on RevisionConditionResourcesAvailable and PodAutoscalerConditionReady and set rs.MarkResourcesAvailableUnknown.

Member

Or in other words, the full function is a bit hard to grasp.

Member Author

How do you want to combine it? My hope here is to keep the conditionals straightforward. Keeping them separate helps with that.

Member

Hm, I'd need more time to fiddle around with the current code. But maybe it's better to keep it here and do it on main afterwards (if at all).
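
For context, a minimal sketch of the check under discussion, assuming the RevisionStatus and PodAutoscaler helpers; the exact guard in the PR may differ:

// If the deployment has ready replicas but the PA hasn't recorded a
// service name yet, the revision is still being wired up, so report
// "Deploying" rather than marking resources available.
if pa.Status.ServiceName == "" && deployment.Status.ReadyReplicas > 0 {
	rs.MarkResourcesAvailableUnknown("Deploying", "")
}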

@ReToCode (Member) left a comment

This is a tough one. I agree with Stavros that it's quite hard to understand the current state machine.

As far as I can tell (also reading your explanations and comments in the previous PR), I think this looks good.

}},
},
}, {
name: "replica failure has priority over progressing",
Contributor

Could you elaborate on where priority is defined? I see that DeploymentConditionProgressing is the same before and after, so no change there.

Member Author

'Priority' here means that the replica-failure message is the last one applied, so it is what gets surfaced to the deployment's Ready condition.
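
A minimal sketch of that ordering, assuming the knative.dev/pkg condition-set semantics; the reasons and messages are placeholders. The last dependent condition marked False supplies the reason and message that land on the Ready condition:

// Progressing fails first; the replica failure is applied last,
// so its reason/message is what Ready surfaces.
depCondSet.Manage(s).MarkFalse(DeploymentConditionProgressing, "ProgressDeadlineExceeded", "...")
depCondSet.Manage(s).MarkFalse(DeploymentConditionReplicaSetReady, "FailedCreate", "...")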

},
want: &duckv1.Status{
Conditions: []apis.Condition{{
Type: DeploymentConditionProgressing,
Contributor

Reading this I was kind of confused, seeing that:

DeploymentConditionProgressing apis.ConditionType = "Progressing"
DeploymentProgressing DeploymentConditionType = "Progressing"

The condition types defined in knative.dev/pkg are just two:

	// ConditionReady specifies that the resource is ready.
	// For long-running resources.
	ConditionReady ConditionType = "Ready"
	// ConditionSucceeded specifies that the resource has finished.
	// For resource which run to completion.
	ConditionSucceeded ConditionType = "Succeeded"

Also, going from deployment conditions to duckv1 conditions and back seems a bit complex; eventually we have:

func TransformDeploymentStatus(ds *appsv1.DeploymentStatus) *duckv1.Status {
	s := &duckv1.Status{}

	depCondSet.Manage(s).InitializeConditions()
	// The absence of this condition means no failure has occurred. If we find it
	// below, we'll overwrite this.
	depCondSet.Manage(s).MarkTrue(DeploymentConditionReplicaSetReady)
	depCondSet.Manage(s).MarkUnknown(DeploymentConditionProgressing, "Deploying", "")
....
func (rs *RevisionStatus) PropagateDeploymentStatus(original *appsv1.DeploymentStatus) {
	ds := serving.TransformDeploymentStatus(original)
	cond := ds.GetCondition(serving.DeploymentConditionReady)
...

I am wondering if mapping deployment conditions directly to revision conditions would be more readable.

Member Author

> I am wondering if mapping deployment conditions directly to revision conditions would be more readable.

I'm open to folks cleaning this up in a follow-up PR.

The thing with the deployment conditions is that their polarity is weird - ReplicaCreateFailure=False is actually good.

Member

> The thing with the deployment conditions is that their polarity is weird - ReplicaCreateFailure=False is actually good.

Yes, I had to read that three times to get it :D
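
To make the polarity flip concrete, a sketch under the assumption that the transform iterates the raw Deployment conditions (variable names are illustrative):

// Kubernetes reports ReplicaFailure=True when pod creation fails,
// while the transformed DeploymentConditionReplicaSetReady is
// positive-polarity (True means healthy), so the status is inverted
// when propagating.
for _, c := range original.Conditions {
	if c.Type == appsv1.DeploymentReplicaFailure && c.Status == corev1.ConditionTrue {
		depCondSet.Manage(s).MarkFalse(DeploymentConditionReplicaSetReady, c.Reason, c.Message)
	}
}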

@@ -1303,7 +1303,7 @@ func TestGlobalResyncOnUpdateAutoscalerConfigMap(t *testing.T) {
rev := newTestRevision(testNamespace, testRevision)
newDeployment(ctx, t, fakedynamicclient.Get(ctx), testRevision+"-deployment", 3)

-	kpa := revisionresources.MakePA(rev)
+	kpa := revisionresources.MakePA(rev, nil)
Contributor

Shouldn't we pass the deployment above instead of nil?

Member Author

These are really fixtures, and passing in a deployment doesn't change the fixture, so I didn't think it was necessary.

resources, err := v1test.CreateServiceReady(c.T, clients, names, func(s *v1.Service) {
	s.Spec.Template.Annotations = map[string]string{
		autoscaling.MinScaleAnnotation.Key(): "1",
		autoscaling.MaxScaleAnnotation.Key(): "1",
Contributor

What will happen if maxScale = 10 and we deploy the failing webhook before all replicas are up? Would the new revision be reachable since some replicas are up?

Member Author

The replica set has technically progressed, so there would be no failure surfaced on the deployment; it's a scaling issue for the ReplicaSet.

@skonto (Contributor) commented Feb 1, 2024

/lgtm
/hold for @ReToCode
Are ambient tests broken?

@knative-prow knative-prow bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged. labels Feb 1, 2024
@ReToCode (Member) commented Feb 2, 2024

/lgtm
/unhold

> Are ambient tests broken?

They are super flaky --> #14637 (I think we also have other tests that are pretty flaky, in kourier for example). At some point we need to take some time to look into it.

@knative-prow knative-prow bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 2, 2024
@dprotaso (Member, Author) commented Feb 2, 2024

Yeah I cherry-picked disabling ambient back to the 1.12 branch here - #14848

@dprotaso (Member, Author) commented Feb 2, 2024

/override "test (v1.27.x, istio-ambient, api)"

knative-prow bot commented Feb 2, 2024

@dprotaso: Overrode contexts on behalf of dprotaso: test (v1.27.x, istio-ambient, api)

In response to this:

/override "test (v1.27.x, istio-ambient, api)"

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dprotaso (Member, Author) commented Feb 2, 2024

/override "test (v1.28.x, istio-ambient, e2e)"

knative-prow bot commented Feb 2, 2024

@dprotaso: Overrode contexts on behalf of dprotaso: test (v1.28.x, istio-ambient, e2e)

In response to this:

/override "test (v1.28.x, istio-ambient, e2e)"

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dprotaso (Member, Author) commented Feb 2, 2024

/override "test (v1.28.x, istio-ambient, runtime)"
/override "test (v1.28.x, istio-ambient, api)"

/override "test (v1.27.x, istio-ambient, e2e)"
/override "test (v1.27.x, istio-ambient, runtime)"

knative-prow bot commented Feb 2, 2024

@dprotaso: /override requires failed status contexts, check run or a prowjob name to operate on.
The following unknown contexts/checkruns were given:

  • test (v1.27.x, istio-ambient, e2e)
  • test (v1.27.x, istio-ambient, runtime)
  • test (v1.28.x, istio-ambient, api)

Only the following failed contexts/checkruns were expected:

  • EasyCLA
  • build-tests_serving_release-1.12
  • istio-latest-no-mesh-tls_serving_release-1.12
  • istio-latest-no-mesh_serving_release-1.12
  • style / suggester / github_actions
  • style / suggester / shell
  • style / suggester / yaml
  • test (v1.28.x, istio-ambient, runtime)
  • tide
  • unit-tests_serving_release-1.12
  • upgrade-tests_serving_release-1.12

If you are trying to override a checkrun that has a space in it, you must put a double quote on the context.

In response to this:

/override "test (v1.28.x, istio-ambient, runtime)"
/override "test (v1.28.x, istio-ambient, api)"

/override "test (v1.27.x, istio-ambient, e2e)"
/override "test (v1.27.x, istio-ambient, runtime)"

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dprotaso (Member, Author) commented Feb 2, 2024

/override "test (v1.28.x, istio-ambient, runtime)"

knative-prow bot commented Feb 2, 2024

@dprotaso: Overrode contexts on behalf of dprotaso: test (v1.28.x, istio-ambient, runtime)

In response to this:

/override "test (v1.28.x, istio-ambient, runtime)"

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knative-prow knative-prow bot merged commit 9d21588 into knative:release-1.12 Feb 2, 2024
79 of 85 checks passed
@dprotaso dprotaso deleted the test-upgrade-failure-test branch February 2, 2024 16:14
@dprotaso (Member, Author) commented Feb 2, 2024

/cherry-pick release-1.13

@knative-prow-robot (Contributor)

@dprotaso: new pull request created: #14864

In response to this:

/cherry-pick release-1.13

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
