Fix deployment status propagation when scaling from zero #15550

Open
wants to merge 4 commits into
base: main

Contributor

@skonto skonto commented Oct 4, 2024

Fixes #14157

Proposed Changes

  • Introduces a new PA condition (PodAutoscalerConditionScaleTargetScaled) that detects failures during scaling to zero, covering the K8s gaps where the deployment status is not updated correctly. The condition is set to False just before we scale down to zero (before the deployment update happens) and if pods are crashing. It is set back to True when we scale from zero and have enough ready pods.

  • Previously, when the deployment was scaled down to zero, the revision's Ready status would be True (and stay that way). With this patch, the pod error is detected and propagated:

Ksvc status:

{
    "lastTransitionTime": "2024-10-04T13:57:35Z",
    "message": "Revision \"revision-failure-00001\" failed with message: Back-off pulling image \"index.docker.io/skonto/revisionfailure@sha256:c7dd34a5919877b89617c3a0df7382e7de0f98318f2c12bf4374bb293f104977\".",
    "reason": "RevisionFailed",
    "status": "False",
    "type": "ConfigurationsReady"
},

Revision:

kubectl get revision
NAME                     CONFIG NAME        GENERATION   READY   REASON             ACTUAL REPLICAS   DESIRED REPLICAS
revision-failure-00001   revision-failure   1            False   ImagePullBackOff   0                 0

PA status:
{
    "lastTransitionTime": "2024-10-04T13:57:35Z",
    "message": "Back-off pulling image \"index.docker.io/skonto/revisionfailure@sha256:c7dd34a5919877b89617c3a0df7382e7de0f98318f2c12bf4374bb293f104977\"",
    "reason": "ImagePullBackOff",
    "status": "False",
    "type": "ScaleTargetScaled"
}
],
  • Updates the PA status propagation logic in the revision reconciler.
  • Slightly extends the resource quota e2e test to show that the error is still reported when the deployment is scaled to zero. This is not strictly related to this patch, but it shows that more scenarios are covered. Adding more e2e tests would probably be a good idea anyway.
  • To test: start a ksvc, let it scale to zero, then remove the image from your registry and block any access to it (e.g. kill the internet connection), and then issue a request.
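
The lifecycle of the new condition described in the first bullet can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: the `condition` and `observation` types and the `nextCondition` function are hypothetical, and it assumes that "enough ready pods" means the ready count has reached the desired scale and that the condition only flips to False when a pod failure is observed at the moment of scaling to zero.

```go
package main

import "fmt"

// conditionStatus mirrors the string status used by Kubernetes-style conditions.
type conditionStatus string

const (
	statusTrue  conditionStatus = "True"
	statusFalse conditionStatus = "False"
)

// condition is a hypothetical stand-in for the ScaleTargetScaled PA condition.
type condition struct {
	Status  conditionStatus
	Reason  string
	Message string
}

// observation is a hypothetical, simplified view of what the autoscaler sees
// when it is about to apply a scale decision.
type observation struct {
	DesiredScale      int    // scale about to be applied to the deployment
	ReadyPods         int    // pods currently ready
	PodFailureReason  string // e.g. "ImagePullBackOff"; empty if no pod is failing
	PodFailureMessage string
}

// nextCondition returns the condition after one observation.
// Assumption: the condition turns False only when scaling to zero while a pod
// failure is visible, and turns True again once scaling from zero has produced
// at least DesiredScale ready pods. Otherwise it is left unchanged.
func nextCondition(cur condition, obs observation) condition {
	switch {
	case obs.DesiredScale == 0 && obs.PodFailureReason != "":
		// Record the pod error before the deployment is updated to zero replicas.
		return condition{
			Status:  statusFalse,
			Reason:  obs.PodFailureReason,
			Message: obs.PodFailureMessage,
		}
	case obs.DesiredScale > 0 && obs.ReadyPods >= obs.DesiredScale:
		// Scaled from zero and enough pods became ready: clear the failure.
		return condition{Status: statusTrue}
	default:
		return cur
	}
}

func main() {
	c := condition{Status: statusTrue}
	steps := []observation{
		{DesiredScale: 0, PodFailureReason: "ImagePullBackOff", PodFailureMessage: "Back-off pulling image"},
		{DesiredScale: 1, ReadyPods: 0},
		{DesiredScale: 1, ReadyPods: 1},
	}
	for _, obs := range steps {
		c = nextCondition(c, obs)
		fmt.Printf("desired=%d ready=%d -> status=%s reason=%q\n",
			obs.DesiredScale, obs.ReadyPods, c.Status, c.Reason)
	}
}
```

Keeping the previous condition in the default branch is what lets the error survive while the deployment sits at zero replicas, which is the gap this PR targets.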

@knative-prow knative-prow bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 4, 2024
@skonto skonto requested a review from dsimansk October 4, 2024 14:32

codecov bot commented Oct 4, 2024

Codecov Report

Attention: Patch coverage is 36.84211% with 36 lines in your changes missing coverage. Please review.

Project coverage is 84.32%. Comparing base (c8e131b) to head (5ba5209).

Files with missing lines                         Patch %   Lines
pkg/reconciler/autoscaling/kpa/scaler.go         30.76%    15 Missing and 3 partials ⚠️
pkg/resources/pods.go                            0.00%     7 Missing ⚠️
pkg/apis/autoscaling/v1alpha1/pa_lifecycle.go    33.33%    4 Missing ⚠️
pkg/apis/serving/v1/revision_lifecycle.go        0.00%     2 Missing and 1 partial ⚠️
pkg/reconciler/autoscaling/kpa/kpa.go            60.00%    1 Missing and 1 partial ⚠️
pkg/testing/functional.go                        71.42%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15550      +/-   ##
==========================================
- Coverage   84.49%   84.32%   -0.18%     
==========================================
  Files         219      219              
  Lines       13608    13662      +54     
==========================================
+ Hits        11498    11520      +22     
- Misses       1740     1769      +29     
- Partials      370      373       +3     



knative-prow bot commented Oct 4, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: skonto
Once this PR has been reviewed and has the lgtm label, please ask for approval from dprotaso. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@skonto skonto added this to the v1.16.0 milestone Oct 4, 2024
// Mark resource unavailable if we are scaling back to zero, but we never achieved the required scale
// and deployment status was not updated properly by K8s. For example due to an image pull error.
if ps.ScaleTargetNotScaled() {
condScaled := ps.GetCondition(autoscalingv1alpha1.PodAutoscalerConditionScaleTargetScaled)
Contributor Author

We could set ContainerHealthyFalse here too but we need #15503
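
For readers unfamiliar with the propagation step in the excerpt above, here is a rough sketch of the idea: when the PA's ScaleTargetScaled condition is False, its reason and message are copied onto the revision so the error stays visible at zero replicas. The `condition` and `revisionStatus` types and the `getCondition` and `propagate` helpers are hypothetical simplifications for illustration and are not the Knative types or the code in this PR.

```go
package main

import "fmt"

// condition is a hypothetical, simplified Kubernetes-style condition.
type condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// revisionStatus is a hypothetical, pared-down view of a revision's readiness.
type revisionStatus struct {
	Ready   string
	Reason  string
	Message string
}

// getCondition returns a pointer to the condition of the given type,
// or nil if no such condition is present.
func getCondition(conds []condition, condType string) *condition {
	for i := range conds {
		if conds[i].Type == condType {
			return &conds[i]
		}
	}
	return nil
}

// propagate marks the revision as not ready when the PA reports that the
// scale target never reached its required scale. It returns true if the
// revision status was changed.
func propagate(paConds []condition, rev *revisionStatus) bool {
	c := getCondition(paConds, "ScaleTargetScaled")
	if c == nil || c.Status != "False" {
		return false
	}
	rev.Ready = "False"
	rev.Reason = c.Reason
	rev.Message = c.Message
	return true
}

func main() {
	paConds := []condition{{
		Type:    "ScaleTargetScaled",
		Status:  "False",
		Reason:  "ImagePullBackOff",
		Message: "Back-off pulling image",
	}}
	rev := revisionStatus{Ready: "True"}
	changed := propagate(paConds, &rev)
	fmt.Printf("changed=%v ready=%s reason=%s\n", changed, rev.Ready, rev.Reason)
}
```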

@skonto skonto changed the title from "[wip] Fix deployment status propagation when scaling from zero" to "Fix deployment status propagation when scaling from zero" Oct 7, 2024
@knative-prow knative-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 7, 2024
Successfully merging this pull request may close these issues.

Error for failed revision is not reported due to scaling to zero
2 participants