ReplicationController + crashlooping or invalid docker image = bad times #2529
ReplicationController has no idea when the pods it is making will deterministically fail. This can rapidly fill your cluster with pod objects.

Mitigation: throttle the rate at which replication controllers can make new pods (see the sketch below).

Real fix? Make the replication controller watch events and stop after N failed tries without any successes?
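A minimal sketch of the throttling mitigation, using a token bucket from golang.org/x/time/rate; the one-per-second rate and burst of ten are made-up numbers, and this is not the actual controller code:

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// Allow on average one replacement pod per second, with bursts of
	// up to ten, so a crash-looping workload cannot flood the cluster.
	limiter := rate.NewLimiter(rate.Limit(1), 10)

	for i := 0; i < 15; i++ {
		if limiter.Allow() {
			fmt.Println("creating replacement pod", i)
		} else {
			fmt.Println("throttled: too many recent creations")
			time.Sleep(500 * time.Millisecond)
		}
	}
}
```

A real controller would probably queue and retry rather than drop, but the effect is the same: pod creation slows down instead of running away.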
Comments

To fix your cluster (not that anyone has ever done this), delete the runaway replication controller and then clean up the pods it created.
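For illustration, that cleanup could look roughly like the following client-go sketch; the RC name `bad-rc`, namespace `default`, label `app=bad`, and kubeconfig path are all hypothetical stand-ins:

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	ctx := context.Background()

	// Delete the replication controller first so it stops replacing pods...
	if err := cs.CoreV1().ReplicationControllers("default").
		Delete(ctx, "bad-rc", metav1.DeleteOptions{}); err != nil {
		log.Fatal(err)
	}
	// ...then remove the pods it already created, matched by label.
	if err := cs.CoreV1().Pods("default").DeleteCollection(ctx, metav1.DeleteOptions{},
		metav1.ListOptions{LabelSelector: "app=bad"}); err != nil {
		log.Fatal(err)
	}
}
```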
I've been caught out by this recently too. Having a configurable maximum number of retries on a replication controller would be great.
I am confused about this issue. IIUC, replication controller only works with pods that specify RestartPolicy = Always. So wouldn't a fix for this (limiting or rate-limiting retries) need to go into the Kubelet (or whatever is restarting the container -- is it Docker or the Kubelet)? I don't see how replication controller can help. It seems replication controller would only get involved if the pod needed to move off of that machine, but that doesn't seem like the case you're talking about here.
@davidopp The problem is in how we report the pod's status; we report it as "failed" when the kubelet is actually in the process of restarting it. This causes the replication controller to make more pods instead of waiting. So the bug here is in the assignment of status. Er, condition. Whatever we're calling it these days.
ReplicationController shouldn't stop permanently; it should back off. It can be very hard to distinguish temporary failures from permanent ones, and we don't want users to need to create replication controller controllers. The same applies to the Kubelet. We should find a way to make it easy for users to implement more sophisticated policies.
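A minimal sketch of the kind of capped exponential backoff being suggested; the one-second base and five-minute ceiling are illustrative, not Kubernetes' actual constants:

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay doubles the wait after each consecutive failure and caps
// it at five minutes, so failures are retried forever but ever more
// slowly instead of stopping permanently.
func backoffDelay(failures int) time.Duration {
	delay := time.Second
	for i := 0; i < failures; i++ {
		delay *= 2
		if delay >= 5*time.Minute {
			return 5 * time.Minute
		}
	}
	return delay
}

func main() {
	for f := 0; f <= 10; f++ {
		fmt.Printf("failure %2d -> wait %v\n", f, backoffDelay(f))
	}
}
```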
To recap: while the kubelet is restarting a pod, the apiserver will list its status as "Failed", which causes the replication controller to make additional pods. The action item here is to make the apiserver give the kubelet a reasonable amount of time (forever?) to perform as many restarts as it wants; during that time the pod should count as Running (if it ever ran) or Pending (if it never successfully started).
(Eventually we want to push pod status generation down into kubelet but I think it's worthwhile making this intermediate fix because this really hoses your cluster when it happens to you.) |
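In code form, the status rule from the recap above might look like the following sketch; the types and the `podPhase` function are simplified stand-ins, not real kubelet or apiserver code:

```go
package main

import "fmt"

// Simplified stand-ins for container state, not the real API types.
type containerState int

const (
	waiting containerState = iota
	running
	terminated
)

// podPhase treats a restarting container under RestartPolicy=Always as
// Running (if the pod ever ran) or Pending (if it never started), so a
// replication controller never sees a transient "Failed" and over-creates.
func podPhase(states []containerState, restartPolicyAlways, everRan bool) string {
	anyTerminated := false
	for _, s := range states {
		switch s {
		case running:
			return "Running"
		case terminated:
			anyTerminated = true
		}
	}
	if restartPolicyAlways {
		// The kubelet will keep restarting the containers, so a crash
		// loop is not a terminal state.
		if everRan {
			return "Running"
		}
		return "Pending"
	}
	if anyTerminated {
		return "Failed"
	}
	return "Pending"
}

func main() {
	fmt.Println(podPhase([]containerState{terminated}, true, true))  // Running (being restarted)
	fmt.Println(podPhase([]containerState{waiting}, true, false))    // Pending (never started)
	fmt.Println(podPhase([]containerState{terminated}, false, true)) // Failed (won't restart)
}
```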
I don't think it is about a "reasonable amount of time" but having enough […]
Dawn is right; this bug seems to be fixed and should probably be closed.
There's still the broader issue of making replication controller do something useful (e.g., raising events, including number of pending pods in status) in this scenario. |
xref #76370 |