Defective revision can lead to pods never being removed #13677
Comments
/triage accepted
/assign
/unassign
/assign
PR 14573 aims to fix this issue. The proposed fix is based on the TimeoutSeconds field in the Revision. After this, the default value for …
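For context, a minimal sketch of where that field is set today; the service name, image, and 120s value below are illustrative only, not taken from this issue:

```sh
# Hypothetical example: set the revision-level request timeout via the Service template.
kubectl apply -f - <<'EOF'
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: example-service
spec:
  template:
    spec:
      timeoutSeconds: 120   # per-revision request timeout (value is illustrative)
      containers:
        - image: gcr.io/knative-samples/helloworld-go
EOF
```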
@DavidR91 I've been trying to reproduce this issue to verify the proposed fix, but in my testing I'm seeing the revision pod scale down once the activator times out the request. Do you have a consistent way to trigger this issue? Can you confirm requests are being timed out by the activator?
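For anyone checking the same thing, one way to look for activator timeouts, assuming a default Knative Serving install (the namespace and deployment names below are the upstream defaults):

```sh
# Inspect recent activator logs for timed-out, buffered requests.
kubectl logs -n knative-serving deployment/activator --since=10m | grep -i timeout
```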
This issue is stale because it has been open for 90 days with no activity.
Closing this out due to lack of user input.
In what area(s)?
/area autoscale
What version of Knative?
Repro'ed in
1.3
1.5
1.9
(This reproduces with Istio as networking + ingress 1.12.9. Aside from using the operator for the install, the configuration is very vanilla, but I can provide more details if useful.)
Expected Behavior
Deployments that pass their initial progress deadline but contain pods that start to crashloop should eventually be scaled down and removed.
(Note: Scale to zero assumed)
Actual Behavior
If there is buffered traffic for a revision of a service, and the service has passed its initial deployment progress deadline, Knative will keep the revision's deployments alive forever, with no obvious way to scale them down or remove them (keeping the pods around in a crashlooping state).
Example use case encountered: a revision's pod is configured with, e.g., the address of an external resource such as a database. The service works with this revision for some time, and then the external resource's address changes (causing the pod to start up but the container to fail to serve requests and eventually enter a restart loop). A new revision is created to amend this, but if there is any outstanding traffic for the old revision, the old defective pods are kept around and never scaled down.
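A rough reproduction sketch of that scenario; the image, service/revision names, and DB_HOST values below are placeholders, not taken from the actual setup:

```sh
# Revision 1 points at an external address that later stops resolving, so the
# container crashloops while requests for this revision queue at the activator.
kubectl apply -f - <<'EOF'
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: db-client
spec:
  template:
    metadata:
      name: db-client-00001
    spec:
      containers:
        - image: example.com/db-client:latest   # placeholder image
          env:
            - name: DB_HOST
              value: db.old-address.internal    # address that has since changed
EOF
# After the address change, apply the same Service again with a new template
# name and a corrected DB_HOST to create revision 2; the old revision's
# crashlooping pods linger if any traffic is still buffered for it.
```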
The state of the PodAutoscaler in this instance becomes
Ready=Unknown Reason=Queued
with status messages to the effect of "Requests to the target are being buffered as resources are provisioned".
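A quick way to inspect that state, as a sketch assuming the default namespace and the illustrative revision name from above (the PodAutoscaler shares the revision's name):

```sh
# List PodAutoscalers and pull out the Ready condition of the stuck one.
kubectl get podautoscaler -n default
kubectl get podautoscaler db-client-00001 -n default \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'
```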
Removing the service is not a solution because the newest revision is correctly serving traffic.
Steps to Reproduce the Problem
NOTE: You can obviously delete the revision, but this is not a solution for services that have only a single revision (do we have to delete the entire service to kill these pods?). This bug is partly a question of whether Knative is actually designed to be able to clean up this scenario, or whether it would rest on a human operator or an additional orchestrator to resolve.
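For completeness, a minimal sketch of that manual workaround, assuming a stale revision named db-client-00001 in the default namespace (both names are illustrative):

```sh
# Find and delete the stale revision directly; its deployment and pods are
# garbage-collected with it.
kubectl get revisions -n default
kubectl delete revision db-client-00001 -n default
```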