Fix crash-looping pods take a long time to terminate/clean #14607

Closed
wants to merge 2 commits into from

Conversation

@andrew-delph andrew-delph commented Nov 9, 2023

Pods that crash instantly do not scale to 0 until the deployment's progress deadline is reached.

Proposed Changes

If the PA is not ready and is unreachable, it is marked inactive instead of queued.
This scales the deployment to 0 even if metrics cannot be retrieved.
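
A minimal sketch of the proposed decision, using simplified stand-in types rather than the actual Knative PodAutoscaler API. The names `Reachability` and `queuedOrInactive` are hypothetical and exist only for illustration. The sketch models only the branch that previously yielded "Queued":

```go
package main

import "fmt"

// Reachability is a hypothetical stand-in for the PA's reachability field.
type Reachability string

const (
	Reachable   Reachability = "Reachable"
	Unreachable Reachability = "Unreachable"
)

// queuedOrInactive models only the branch that used to return "Queued".
// Under the proposal, a PA that is not ready and is unreachable is
// reported as "Inactive" so that it can be scaled to zero.
func queuedOrInactive(ready bool, r Reachability) string {
	if !ready && r == Unreachable {
		return "Inactive"
	}
	return "Queued"
}

func main() {
	fmt.Println(queuedOrInactive(false, Unreachable))
	fmt.Println(queuedOrInactive(false, Reachable))
}
```

The point of the change is that the inactive state does not depend on metrics, so a crash-looping revision that nothing routes to no longer waits in the queued state.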

@knative-prow knative-prow bot added area/API API objects and controllers area/autoscale labels Nov 9, 2023
@knative-prow knative-prow bot requested review from KauzClay and nak3 November 9, 2023 18:30
@knative-prow knative-prow bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 9, 2023
knative-prow bot commented Nov 9, 2023

Hi @andrew-delph. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

knative-prow bot commented Nov 9, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: andrew-delph
Once this PR has been reviewed and has the lgtm label, please assign dprotaso for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@andrew-delph andrew-delph changed the title from "Issue 12691" to "crash-looping pods take a long time to terminate/clean" Nov 9, 2023
@nak3 nak3 added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 10, 2023
@knative-prow knative-prow bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 13, 2023
@andrew-delph
Author

/test istio-latest-no-mesh

@ReToCode
Member

8:41:40PM: ^ Pending: ErrImagePull (message: rpc error: code = Unknown desc = failed to pull and unpack image "quay.io/jetstack/cert-manager-webhook:v1.13.0": failed to copy: httpReadSeeker: failed open: unexpected status code https://quay.io/v2/jetstack/cert-manager-webhook/manifests/sha256:9f9bda751112262bbe0c0d55e8d06f0fc558870535e063f0c065d632199467f2: 504 Gateway Time-out)

This is definitely not related to the changes.

/test istio-latest-no-mesh

codecov bot commented Nov 15, 2023

Codecov Report

Attention: Patch coverage is 87.50000% with 4 lines in your changes are missing coverage. Please review.

Project coverage is 86.06%. Comparing base (72f91e5) to head (1f2944d).
Report is 25 commits behind head on main.

❗ Current head 1f2944d differs from pull request most recent head fa32dee. Consider uploading reports for the commit fa32dee to get more accurate results

| Files | Patch % | Lines |
| --- | --- | --- |
| pkg/apis/autoscaling/v1alpha1/pa_lifecycle.go | 0.00% | 4 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14607      +/-   ##
==========================================
+ Coverage   84.20%   86.06%   +1.85%     
==========================================
  Files         213      197      -16     
  Lines       16633    14936    -1697     
==========================================
- Hits        14006    12854    -1152     
+ Misses       2280     1774     -506     
+ Partials      347      308      -39     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@andrew-delph
Author

/retest

@andrew-delph andrew-delph changed the title from "crash-looping pods take a long time to terminate/clean" to "[WIP] crash-looping pods take a long time to terminate/clean" Nov 20, 2023
@knative-prow knative-prow bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 20, 2023
@andrew-delph andrew-delph marked this pull request as draft November 20, 2023 00:30
@andrew-delph
Author

/retest

pkg/apis/autoscaling/v1alpha1/pa_lifecycle.go (outdated, resolved)
pkg/apis/autoscaling/v1alpha1/pa_lifecycle.go (outdated, resolved)
Comment on lines 60 to 62
routingState := rev.GetRoutingState()
if routingState == v1.RoutingStateActive {
	return autoscalingv1alpha1.ReachabilityReachable
Member

If we want to make the RoutingState => Reachability essentially passthrough I think we should pull that out in the separate PR to see what implications it has - since it's not clear to me if this breaks anything.
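
For illustration only, a "passthrough" mapping of the kind discussed here might look like the sketch below. The types are local stand-ins for `v1.RoutingState` and the autoscaling reachability values, not the real API, and the handling of the non-active states is an assumption rather than something taken from this PR:

```go
package main

import "fmt"

// RoutingState and Reachability are hypothetical local stand-ins for the
// Knative types referenced in the review comment.
type RoutingState string
type Reachability string

const (
	RoutingActive  RoutingState = "active"
	RoutingPending RoutingState = "pending"
	RoutingReserve RoutingState = "reserve"

	Reachable           Reachability = "Reachable"
	Unreachable         Reachability = "Unreachable"
	UnknownReachability Reachability = "Unknown"
)

// passthroughReachability derives reachability from the routing state alone,
// ignoring any other signal such as pod health.
func passthroughReachability(rs RoutingState) Reachability {
	switch rs {
	case RoutingActive:
		return Reachable
	case RoutingReserve:
		// Assumption: a revision held in reserve receives no traffic.
		return Unreachable
	default:
		// Assumption: any other state is not yet decided.
		return UnknownReachability
	}
}

func main() {
	for _, rs := range []RoutingState{RoutingActive, RoutingPending, RoutingReserve} {
		fmt.Printf("%s -> %s\n", rs, passthroughReachability(rs))
	}
}
```

Because such a mapping drops every other input to the reachability decision, it is the kind of change whose side effects are easier to judge in isolation.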

Member

Another thing to note: with the pod informer, if you remove the changes in this file, I think CrashLooping Pods will be marked unhealthy faster, so we then toggle reachability to false. In that case I think the autoscaler changes are unnecessary.

Author

> If we want to make the RoutingState => Reachability essentially passthrough I think we should pull that out in the separate PR to see what implications it has - since it's not clear to me if this breaks anything.

Should I still do this?

pkg/reconciler/revision/controller.go (outdated, resolved)
pkg/reconciler/autoscaling/kpa/kpa.go (outdated, resolved)
pkg/reconciler/autoscaling/kpa/kpa.go (outdated, resolved)
pkg/reconciler/autoscaling/kpa/kpa.go (resolved)
pkg/reconciler/autoscaling/kpa/kpa.go (resolved)
@dprotaso
Member

/ok-to-test
/retest

@knative-prow knative-prow bot added area/networking area/test-and-release It flags unit/e2e/conformance/perf test issues for product features labels Nov 24, 2023
@knative-prow-robot knative-prow-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 24, 2023
@knative-prow knative-prow bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 24, 2023
Member

@dprotaso dprotaso left a comment

Take a look at my comment here (#14656 (comment)). I think it's worth revisiting some of the assumptions that led to the creation of this PR.

Secondly, I notice this PR introduces a regression. If a service has all its traffic pinned to a specific revision, then when we roll out a new revision it is immediately scaled to zero and considered 'failed'. It should scale to 1 (or initial scale) and then scale down afterwards, since it's unreachable.

Comment on lines 276 to 277
cond := pa.Status.GetCondition("Active")
pa.Status.MarkInactive(cond.Reason, cond.Message)
Member

Probably simpler to have an if block that prevents setting a PA as inactive if it already is inactive.

Curious: do you know offhand which reasons/messages get overwritten? I'm wondering if it makes sense to pull this into a separate PR if it helps with surfacing error messages.
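
A rough sketch of the guard suggested above, written against simplified stand-in types. `condition`, `status`, `markInactive`, and `markInactiveOnce` are hypothetical names for illustration and are not the Knative PA status API:

```go
package main

import "fmt"

// condition is a hypothetical, simplified status condition.
type condition struct {
	Status  string // "True", "False", or "Unknown"
	Reason  string
	Message string
}

// status is a hypothetical stand-in for the PA status.
type status struct {
	Active condition
}

// markInactive unconditionally overwrites the Active condition.
func (s *status) markInactive(reason, message string) {
	s.Active = condition{Status: "False", Reason: reason, Message: message}
}

// markInactiveOnce skips the update when the PA is already inactive,
// so an earlier reason and message are not overwritten.
func (s *status) markInactiveOnce(reason, message string) {
	if s.Active.Status == "False" {
		return
	}
	s.markInactive(reason, message)
}

func main() {
	s := &status{}
	s.markInactiveOnce("FirstReason", "first message")
	s.markInactiveOnce("SecondReason", "second message")
	fmt.Println(s.Active.Reason, "/", s.Active.Message)
}
```

With the guard in place, there is no need to read the existing condition back and pass its reason and message through again.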

Author

Thanks for getting back to me!
I think it was being overwritten by changes made in this PR, though I'll have to test that again. For starters, I will create the PR as you suggest.

Author

I have created this PR for computeActiveCondition: #14940. I think it covers all the same cases.

@andrew-delph andrew-delph force-pushed the issue-12691 branch 4 times, most recently from c60c5cd to 0d10f98 Compare March 2, 2024 17:00
@knative-prow knative-prow bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 3, 2024
@knative-prow-robot knative-prow-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 14, 2024
@knative-prow knative-prow bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 14, 2024
Making changes to the PA active status has become difficult. This
change breaks down some of the logic so that it is more readable.
No changes to test cases were made, as it doesn't actually change the
logical cases.
@knative-prow-robot knative-prow-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 14, 2024
@knative-prow knative-prow bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Mar 14, 2024
@andrew-delph andrew-delph force-pushed the issue-12691 branch 5 times, most recently from 47347c4 to 7033fe1 Compare March 15, 2024 03:24
knative-prow bot commented Mar 15, 2024

@andrew-delph: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| contour-latest_serving_main | be3ec90 | link | true | /test contour-latest |
| gateway-api-latest_serving_main | be3ec90 | link | true | /test gateway-api-latest |
| kourier-stable_serving_main | be3ec90 | link | true | /test kourier-stable |
| kourier-stable-tls_serving_main | be3ec90 | link | true | /test kourier-stable-tls |
| contour-tls_serving_main | be3ec90 | link | true | /test contour-tls |
| https_serving_main | be3ec90 | link | false | /test https |
| gateway-api-latest-and-contour_serving_main | be3ec90 | link | false | /test gateway-api-latest-and-contour |

Your PR dashboard.

@@ -0,0 +1,4 @@
# Deadstart test image
Member

I added an image here to help with this

#14875

Author

Going to convert the test to use this image.

// This test case creates a service which can never reach a ready state.
// The service is then updated with a healthy image, and it is verified that
// the healthy revision is ready and the unhealthy revision is scaled to zero.
func TestDeadStartToHealthy(t *testing.T) {
Member

@dprotaso dprotaso Mar 15, 2024

I added a similar test here - #14909

I was expecting it to fail without any fixes, but it actually passed. So then I realized that we already do scale down crashing revisions when they are unreachable.

So we might not need any changes beyond what's already in main, unless you've uncovered a scenario where we don't.

Author

The case here is still happening for me: #14656 (comment)
rev-2 will scale down, but the first revision stays stuck restarting.

1. Unreachable pa becomes inactive

PA computeActiveCondition() will MarkInactive
in the case of "Queued" when the pa is
Unreachable.

2. Adding DeadStart e2e tests

TestDeadStartToHealthy: creates a service which is never able to
transition to a ready state because it exits immediately.
The failed revision will scale to zero once the service
is updated with a healthy revision.

TestDeadStartFromHealthy: updates a healthy service with an image
that can never reach a ready state.
The healthy revision remains Ready and the DeadStart revision does
not scale down until ProgressDeadline is reached.
@andrew-delph
Author

I'm currently hitting an issue when the progress deadline is reached, but I'm looking into it.

Labels
area/API API objects and controllers area/autoscale area/networking area/test-and-release It flags unit/e2e/conformance/perf test issues for product features do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Crashing revision does not scale to 0 until ProgressDeadline is reached
5 participants