Pod Phase - Vacating. #6951
Comments
+1 this affects namespace deletion. I noticed that the Kubelet delay caused events to sometimes be created in the namespace after it should have been removed.
Won't graceful termination resolve this? Pods will stay present until the grace period expires or the node terminates the process (or observes process exit)?
@smarterclayton - eventually the system cleans up; it's just a matter of accuracy in reporting. Unless you're referring to another issue?
When pods are deleted, they will not disappear immediately: delete will set deletionTimestamp, which is the time after which any processes may be hard killed (SIGKILL), and the time at which the pod will be automatically deleted (releasing the name for reuse). The kubelet can then SIGTERM the pod's containers. Once the processes are observed dead, the kubelet can either update the pod status or simply delete the pod outright. So the observed delay should be the grace period minus early completion, and then the pod will disappear. If the processes have not exited by the deadline, the kubelet may hard-kill them.
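For illustration, a minimal sketch of the escalation just described, with hypothetical helper names (`sigterm`, `sigkill`, `exited`) rather than actual kubelet code:

```go
package podlifecycle

import (
	"fmt"
	"time"
)

// killPod sends SIGTERM right away, then waits for either the containers
// to exit or the hard-kill deadline (deletionTimestamp) to pass before
// escalating to SIGKILL.
func killPod(deletionTimestamp time.Time, sigterm func(), sigkill func(), exited <-chan struct{}) {
	sigterm() // graceful notice (SIGTERM / preStop hook)

	select {
	case <-exited:
		// Early completion: the pod can be deleted (or its status updated)
		// as soon as the processes are observed dead.
		fmt.Println("containers exited before the deadline")
	case <-time.After(time.Until(deletionTimestamp)):
		// Grace period expired without the processes finalizing.
		fmt.Println("grace period expired; hard killing")
		sigkill()
	}
}
```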
I think it is covered. The issue I am encountering is that because we run with NamespaceAutoProvision in Kubernetes by default, you get into a state where finalization thinks all resources are deleted (because they were) and then moments later the Kubelet creates a bunch of events noting the true death of the pod. This in turn causes the namespace to be provisioned again (with just the death rattle of events from a past namespace's life). This is not a problem in OpenShift, where we run with the NamespaceExists plugin. If, with graceful termination, we can get to a state where the death events are dispatched before the pod is purged completely, that would be good.
Agree w/ @smarterclayton. I am hesitant to add a new phase. What would it be used for? Some applications receive the termination notification (SIGTERM/prestop -- #6804), start to fail readiness checks, but continue responding to requests until they receive SIGKILL. Others immediately stop responding to new requests and try to drain any in-flight storage/db writes. The pod may or may not be replaced by a controller. Some applications may want to be replaced as soon as it's decided that they should be killed, while others require the existing instance to be fully terminated prior to replacement. I'm not sure much could be inferred from a Vacating phase in general. Looking at whether deletionTimestamp is set would be generically applicable across all object kinds. One thing it could perhaps be used for is letting the Kubelet determine that it had already notified the containers in the pod. There would still be instances of duplicate notifications, though.
@bgrant0607 In the case of graceful shutdown of applications, a large grace period may be applied in a worst-case scenario to allow cleanup of those applications that have chosen to use signal escalation. A vacating|STOPPING phase (#6804) gives the operator the ability to determine where the application is in its lifecycle, as well as giving developers insight into system behavior. Right now it's all about accurately reporting the true state of the cluster to the user, because as of today kubectl reports an empty cluster when in fact there are delays in the system that go beyond just signal escalation.
The presence of "deletionTimestamp" is an observable metric that implies a phase. Are you saying we should impose both?
Yes, deletionTimestamp is intended to provide the indication, in addition to the container's response reflected in the Ready condition. Adding a Vacating phase would be a breaking API change. It would mean that every client/component that checked Running would need to also then consider Vacating to be Running:
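For illustration, the kind of client code at risk might look roughly like the sketch below (simplified stand-in types, not the real API); every such switch would need a new case to keep treating a Vacating pod as running:

```go
package main

import "fmt"

// PodPhase is a simplified stand-in for the real API type.
type PodPhase string

const (
	PodPending   PodPhase = "Pending"
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
	// A hypothetical PodVacating added here would silently change the
	// result of every existing switch like the one below.
)

// isActive shows the kind of property clients infer from phases today.
func isActive(phase PodPhase) bool {
	switch phase {
	case PodPending, PodRunning:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(isActive(PodRunning))           // true
	fmt.Println(isActive(PodPhase("Vacating"))) // false: still running, but misclassified
}
```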
This is also discussed in #1899 (comment) and elsewhere. Enums aren't extensible; every addition is a breaking API change. People will try to infer properties from the phases, using switch statements like the above. I relented on Succeeded/Failed, but probably shouldn't have. At some point it will just become too painful to add new phases, and then we'll be left with an unprincipled distinction between phases and non-phases. I'm trying to keep the set of phases to the bare minimum and then explicitly specify other common properties for clients as needed. I can elaborate on this more in the api conventions doc.

I'm happy to add additional properties if that helps. The conditions array is designed to be extensible in that way. PodStatus also contains a Message field, which could be used; we should really also add a Reason field. ContainerStateTerminated already contains both Reason and Message. We could add an additional field to ContainerStateRunning. Comment on #6979 if you have feedback about the structure of ContainerStatuses.
I don't have a philosophical objection to adding more phases, but I do think we should try to avoid redundant representations of the same information within the server. IIUC, deletionTimestamp != 0 would be redundant with podPhase==vacating, so I don't think we need the latter. I do think the client will want to synthesize some nice human-readable set of states larger than the set we are providing on the server, and perhaps we want to put that logic in a client library, but I don't think we need to make these server-side states.

[For reference, in Borg tasks have a state that indicates they are "vacating" and also a timestamp for when the task should be kill -9'd (this timestamp is only nonzero when it has been requested to die and is in the grace period phase). It's confusing. Machines have a similar situation (a state that indicates they're shutting down, and a timestamp for when they will be removed, that is only nonzero when they are shutting down). It's also confusing.]
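A minimal sketch of the client-side synthesis mentioned above, assuming simplified stand-in types rather than the real client library:

```go
package display

import "time"

// PodView is a simplified stand-in for the fields a client would read.
type PodView struct {
	Phase             string
	DeletionTimestamp *time.Time
}

// DisplayState derives a human-readable state on the client side; no new
// server-side phase is needed to show that a pod is being torn down.
func DisplayState(p PodView) string {
	if p.DeletionTimestamp != nil {
		return "Terminating"
	}
	return p.Phase
}
```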
Honestly, since Phase is such a honeypot, I'm tempted to delete it from the API and just create more conditions. |
I agree w/ @davidopp re. redundancy. We definitely should also enhance the presentation in kubectl and any UIs. |
Proposal:
Success could be recorded in the deletion reason. Any other reason is failure. We could add a Vacating (aka Terminating, ShuttingDown, Lame, Evicted) PodConditionType, with the specific meaning that notice has been given. NotifiedOfTermination seems too long. I considered updating the Reason and Message of the Instantiated condition, but that doesn't seem right if there's no change in the condition status.
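As a rough sketch of what the proposed condition might look like (simplified stand-in types; the Vacating condition is the proposal here, not an existing API field):

```go
package conditions

// PodConditionType is a simplified stand-in for the real API type.
type PodConditionType string

// PodVacating is the proposed condition type: notice of termination has
// been given to the pod's containers.
const PodVacating PodConditionType = "Vacating"

// PodCondition is a simplified stand-in for the real API struct.
type PodCondition struct {
	Type    PodConditionType
	Status  string // "True", "False", or "Unknown"
	Reason  string
	Message string
}

// IsVacating reports whether the pod has been notified of termination.
func IsVacating(conds []PodCondition) bool {
	for _, c := range conds {
		if c.Type == PodVacating && c.Status == "True" {
			return true
		}
	}
	return false
}
```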
+1 for Vacating PodConditionType. Indicators like this would be helpful when Kubernetes has a Job Controller to manage workloads like batch processing and cron jobs.
To avoid confusion, Conditions should be orthogonal. The ones you suggested -- Instantiated, Initialized, and Terminated -- don't seem to be orthogonal (in fact I would expect a pod to transition from FFF to TFF to TTF to TTT). What you've described sounds more like a state machine than a set of orthogonal conditions, and I think Conditions should only be used for the latter. Just adding a Vacating PodConditionType might be OK but still has the issue I mentioned earlier, where IIUC it is redundant with having nonzero DeletionTimestamp.
@davidopp I intended the conditions to be independent properties. For instance, termination wouldn't cause a pod to become uninitialized. The intent is to eliminate the need for switch statements to infer these properties. |
@bgrant0607 would it be possible to create a state diagram for pod states, with edges, for the proposed changes? I'm certain this would clear up the confusion.
I'm waiting patiently for the Timestamp stuff to land, and will check then.
Created #7856 re. replacement of phase with condition. |
I'm late to the party, but let me +1 the idea that we should have a general graceful deletion semantic and NOT use pod phase to represent this information. |
As an implementation note, we may want to update the containers' grace period during the first delete (when we set the TTL) to the effective grace period. The timestamp is a valid marker, but it does not make it easy to calculate the desired grace period for termination (which should be calculated on the kubelet from the time it first observes the deletion, not as an absolute clock value).
When the Kubelet sees the pod deleted from etcd, it sets a local timer equal to the grace period, so it should always give the right grace period.
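Sketched out, the point is just that the deadline is computed from the locally observed time rather than the absolute deletionTimestamp, so clock skew cannot shorten the grace period (illustrative only, not the actual kubelet code):

```go
package kubelet

import "time"

// hardKillDeadline computes the SIGKILL deadline from the moment the
// kubelet first observes the deletion, not from the absolute
// deletionTimestamp, so clock skew cannot shorten the grace period.
func hardKillDeadline(observedAt time.Time, gracePeriod time.Duration) time.Time {
	return observedAt.Add(gracePeriod)
}
```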
I'm going to close this one and join the herd on #7856
Currently, if you have a large replication controller running and stop said controller, Kubernetes will report that no pods are running, when in fact there are still pods on the cluster.
This occurs because the entries are deleted from etcd and the kubelets only eventually catch up. Perhaps it makes more sense to update the phase to "vacating" and only delete once the final event has been sent.
@dchen1107, @wojtek-t, @fgrzadkowski