Scale down may cause downtime #40304
@caarlos0 this is how the initial pass on proportional scaling was implemented. Also, when scaling down we always try to remove from old replica sets first. It's definitely desirable to enhance the deployment controller to scale down broken pods first.
Auto-rollback (#23211) is yet to be implemented, but in 1.5 you can use progressDeadlineSeconds to identify stuck deployments: https://kubernetes.io/docs/user-guide/deployments/#deployment-status
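For reference, a minimal sketch of where `progressDeadlineSeconds` sits in a Deployment spec (the names, image version, and 600-second deadline are placeholders; only the fields shown are relevant here):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ops
spec:
  replicas: 10
  # If the rollout makes no progress for this many seconds, the Deployment
  # gets a Progressing=False condition with reason ProgressDeadlineExceeded,
  # which tooling (e.g. `kubectl rollout status`) can use to detect a stuck rollout.
  progressDeadlineSeconds: 600
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
```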
Got it, thanks @Kargakis 👍
@kubernetes/sig-apps-misc
@caarlos0 one suggestion for now: since it's hard to act on perma-failed errors (e.g. somebody may not care about ImagePullBackOff and expects the image to land at some point in the future), if you are going to scale down manually, first make sure that your Deployment is healthy. If it isn't, you should roll back first.
Anybody from @kubernetes/sig-apps-misc have time to take a stab at this one? Basically, we should clean up unhealthy pods before estimating proportions when we scale down in `scale`.
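The idea above can be sketched as follows. This is a toy model of "clean up unhealthy pods before estimating proportions", not the real deployment controller code; the `ReplicaSet` struct and all numbers are hypothetical:

```go
package main

import "fmt"

// ReplicaSet is a simplified view of a replica set: how many replicas
// it owns and how many of those are healthy (ready).
type ReplicaSet struct {
	Name     string
	Replicas int
	Healthy  int
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// scaleDownUnhealthyFirst removes unhealthy replicas first when scaling
// down, and only then removes healthy ones, oldest replica set first
// (slice order stands in for age here).
func scaleDownUnhealthyFirst(sets []ReplicaSet, target int) []ReplicaSet {
	total := 0
	for _, rs := range sets {
		total += rs.Replicas
	}
	// Pass 1: drop unhealthy replicas while we are above the target.
	for i := range sets {
		if total <= target {
			break
		}
		drop := minInt(sets[i].Replicas-sets[i].Healthy, total-target)
		sets[i].Replicas -= drop
		total -= drop
	}
	// Pass 2: if still above the target, drop healthy replicas,
	// oldest replica set first.
	for i := range sets {
		if total <= target {
			break
		}
		drop := minInt(sets[i].Healthy, total-target)
		sets[i].Replicas -= drop
		sets[i].Healthy -= drop
		total -= drop
	}
	return sets
}

func main() {
	// Hypothetical state from this issue: old RS with 9 healthy pods,
	// new RS with 2 pods stuck in ImagePullBackOff. Scale down to 1.
	sets := []ReplicaSet{
		{Name: "ops-old", Replicas: 9, Healthy: 9},
		{Name: "ops-new", Replicas: 2, Healthy: 0},
	}
	for _, rs := range scaleDownUnhealthyFirst(sets, 1) {
		fmt.Printf("%s: %d replicas (%d healthy)\n", rs.Name, rs.Replicas, rs.Healthy)
	}
	// The single surviving replica is a healthy one from the old set,
	// so the scale-down would not cause downtime.
}
```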
If someone points me in the right direction, I can try to tackle this...
Ok, I just realized that trying to clean up the new replica set will do no good. The system always tries to deploy the latest replica set, so having one part of the controller scale down the new replica set (cleanup) and another part scale it back up (the strategy) would drive the controller into a hot loop. That being said, I think this is not an issue; we provide you with ways/tools to diagnose failures (d.spec.progressDeadlineSeconds) and rollback (
@Kargakis maybe just fail then? Show some error message saying that it's not possible to scale down because there are no healthy instances in the new version, or something like that? Of course, the user can check before scaling down... this would just be some kind of safeguard...
We cannot special-case the operation, otherwise we may drive autoscalers into hot loops. There is no reason not to roll back in this case, other than when you expect the new image to be imported at some point in the future.
OK, makes sense. Thanks @Kargakis =D |
I created a service `ops` with 10 replicas, and a `strategy.rollingUpdate.maxUnavailable` = 1:

If I deploy something wrong, let's suppose a bad docker image that won't come up, or a tag that doesn't exist, some of the replicas will still be running:
Now, if I scale down (for some reason) to 2 replicas, for example, what happens is:
Which is OK, service is still up.
Now, if I scale down to 1 replica, things get ugly:
Downtime!
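The behavior described above can be modeled with a few lines of arithmetic. This is a toy sketch of proportional scale-down (each replica set keeps roughly its share of the target, rounding down), not the actual controller code; in particular, sending the leftover replica to the newest set is a simplification:

```go
package main

import "fmt"

// proportionalScaleDown models scaling a Deployment down to `target`
// replicas: each replica set keeps floor(target * share), and the
// leftover replica(s) go to the newest replica set (a simplification
// of the real allocation).
func proportionalScaleDown(oldRS, newRS, target int) (int, int) {
	total := oldRS + newRS
	o := target * oldRS / total
	n := target * newRS / total
	n += target - o - n // hand the rounding leftover to the newest set
	return o, n
}

func main() {
	// 9 healthy pods in the old RS, 2 broken pods in the new RS.
	fmt.Println(proportionalScaleDown(9, 2, 2)) // 1 1: one healthy pod survives, service stays up
	fmt.Println(proportionalScaleDown(9, 2, 1)) // 0 1: only the broken pod survives, downtime
}
```

Under this model, scaling to 2 keeps one healthy old pod, but scaling to 1 hands the single remaining replica to the new, broken replica set, which matches the downtime reported here.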
Why doesn't Kubernetes leave the container that was working running, instead of killing it and trying to launch a new one?
Is there a way to auto-roll back when this kind of thing happens (or, even better, to prevent it)?