
Defective revision can lead to pods never being removed #13677

Closed
DavidR91 opened this issue Feb 6, 2023 · 8 comments
Labels
area/autoscale, kind/bug, lifecycle/stale, triage/needs-user-input

Comments


DavidR91 commented Feb 6, 2023

In what area(s)?

/area autoscale

What version of Knative?

Reproduced in:
1.3
1.5
1.9

(This reproduces with Istio 1.12.9 as the networking + ingress layer. Apart from using the operator for installation, the configuration is very vanilla, but I can provide more details if useful.)

Expected Behavior

Deployments that pass their initial progress deadline but contain pods that start to crashloop should eventually be scaled down and removed.

(Note: Scale to zero assumed)

Actual Behavior

If there is buffered traffic for a revision of a service, and that revision has already passed its initial deployment progress deadline, Knative keeps the revision's Deployment alive forever with no obvious way to scale it down or remove it, leaving the pods in a crashlooping state.

Example use case encountered: a revision's pod is configured with, for example, the address of an external resource such as a database. The service works with this revision for some time, and then the external resource's address changes, so the pod starts up but the container fails to serve requests and eventually enters a restart loop. A new revision is created to fix this, but if there is any outstanding traffic for the old revision, the old defective pods are kept around and never scaled down.
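For concreteness, a hypothetical Service along those lines might look like the sketch below (the service name, image, and DB_HOST variable are illustrative, not taken from this issue). Editing the env value produces a new revision, while the old revision can still have buffered traffic pointed at it.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: example-app                 # hypothetical name
spec:
  template:
    spec:
      containers:
        - image: registry.example.com/example-app:latest   # hypothetical image
          env:
            - name: DB_HOST         # hypothetical external-resource address
              value: db.internal.example.com
```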

The state of the PodAutoscaler in this situation becomes Ready=Unknown, Reason=Queued, with status messages to the effect of "Requests to the target are being buffered as resources are provisioned".

Removing the service is not a solution because the newest revision is correctly serving traffic.


Steps to Reproduce the Problem

  • Create a container that serves HTTP traffic correctly but, depending on some external criterion, stops starting its listener
    • A simple example is to sleep for 5 seconds and exit before the listener starts whenever the minute of the current hour is >30 (see the sketch after this list)
  • Create a service + revision for the container
  • Send traffic to the service while the external criteria allows the container to operate
    • Make sure the service passes its initial deployment deadline (~10 mins)
  • Wait for it to scale back down to zero
  • Send traffic to the service now that the external criteria prevents it starting
  • The deployment will scale up and all the created pods will crashloop
  • Create a new revision that corrects the issue, and drive traffic to the service again
  • The service's new revision will start and serve traffic but the deployment and pods of the old defective revision will stick around with no clear way to remove them
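As a concrete illustration of the first two steps, here is a minimal sketch (not taken from the issue) of such a container: it serves HTTP normally during the first half of each hour and otherwise sleeps briefly and exits before ever listening, which produces a crash loop under the default restart policy. The port handling assumes Knative's injected PORT variable.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	// Simulate the "external criteria" from the repro steps: refuse to start
	// the listener during the second half of each hour.
	if time.Now().Minute() > 30 {
		fmt.Println("external criteria not met; exiting before listening")
		time.Sleep(5 * time.Second)
		os.Exit(1)
	}

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})

	// Knative injects the PORT environment variable into the user container.
	port := os.Getenv("PORT")
	if port == "" {
		port = "8080"
	}
	fmt.Println("listening on :" + port)
	if err := http.ListenAndServe(":"+port, nil); err != nil {
		fmt.Println("server error:", err)
		os.Exit(1)
	}
}
```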

NOTE: You can obviously delete the revision, but this is not a solution for services that have only a single revision (do we have to delete the entire service to kill these pods?). This bug is partly a question of whether Knative is actually designed to clean up this scenario itself, or whether that is left to a human operator or an additional orchestrator.

DavidR91 added the kind/bug label on Feb 6, 2023
dprotaso (Member) commented Feb 9, 2023

/triage accepted

knative-prow bot added the triage/accepted label on Feb 9, 2023
dprotaso added this to the v1.10.0 milestone on Feb 10, 2023
@jsanin-vmw
/assign

@jsanin-vmw
/unassign

@jsanin-vmw
/assign


jsanin commented Feb 1, 2024

PR 14573 aims to fix this issue.

The proposed fix is based on the TimeoutSeconds field in the Revision. Once this timeoutSeconds has elapsed, there should not be any pending requests in the activator, so the Unreachable revision can scale down with no risk of requests going unprocessed.

The default value for timeoutSeconds is 300, so the pods of the failing revision will only scale down after that much time has passed. TimeoutSeconds can of course be changed.
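For reference, timeoutSeconds is set per revision on the Service template; a minimal sketch follows (service and image names are illustrative, and the 300-second default can also be changed cluster-wide via the revision-timeout-seconds setting in the config-defaults ConfigMap).

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: example-app                  # hypothetical name
spec:
  template:
    spec:
      timeoutSeconds: 60             # lower than the 300s default, so a stuck revision scales down sooner
      containers:
        - image: registry.example.com/example-app:latest   # hypothetical image
```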

dprotaso (Member) commented:

@DavidR91 I've been trying to reproduce this issue to verify the proposed fix, but in my testing I'm seeing the revision pod scale down once the activator times out the request.

Do you have a consistent way to trigger this issue? Can you confirm requests are being timed out by the activator?

dprotaso modified the milestones: v1.13.0, v1.14.0 on Feb 17, 2024
dprotaso removed the triage/accepted label on Feb 17, 2024
dprotaso added the triage/needs-user-input label on Feb 17, 2024
dprotaso removed this from the v1.14.0 milestone on Feb 20, 2024
github-actions bot commented:
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions bot added the lifecycle/stale label on May 21, 2024
dprotaso (Member) commented:

Closing this out due to lack of user input.
