KEP for critical container feature #912
CsatariGergely wants to merge 4 commits into kubernetes:master
Conversation
Signed-off-by: Gergely Csatari <gergely.csatari@nokia.com>
|
Hi @CsatariGergely. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: CsatariGergely. If they are not already assigned, you can assign the PR to them by writing `/assign` in a comment. The full list of commands accepted by this bot can be found here. The pull request process is described here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment. |
derekwaynecarr left a comment
Thank you for proposing this idea.
Can we discuss this in more detail at a future SIG node meeting so we understand the problems and goals?
|
@derekwaynecarr You can find some use-cases here: kubernetes/kubernetes#40908 |
Some clarification on the definition of Pod replacement, based on the PR discussion. Signed-off-by: Gergely Csatari <gergely.csatari@nokia.com>
|
@derekwaynecarr Thanks for the comments. Joining the SIG Node meeting this week is a bit difficult from the CET timezone. Next week I will be closer timezone-wise and will try to join. |
Just checked the other KEPs. I believe those cover different use-cases around lifecycle management, while this one aims to cover failure/recovery scenarios. I would consider this one independent from a specification/implementation PoV. |
Signed-off-by: Gergely Csatari <gergely.csatari@nokia.com>
|
/ok-to-test |
|
This is a great KEP and I'd love to see it get approved. @CsatariGergely @derekwaynecarr what specifically is being waited for prior to approval? |
|
Thanks @InAnimaTe :) I would be also interested to figure out what is needed to approve this. I'm available in Barcelona to discuss if that helps. |
However, although it is not clear from the KEP process, the initial state of a KEP should be provisional.
thockin left a comment
I just want to say that I have heard this request many times, and while I continue to assert that "you're doing it wrong" it seems common enough to consider. I urge you to look for the simplest design that satisfies the preponderance of users.
|
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
|
/remove-lifecycle stale We've recently run across a potential bug in dotnet core containers; an enhancement like this would help us work around downtime in our service (restarting containers is not enough; the pod needs to be restarted. I suspect the issue lies more with Docker and dotnet). |
|
This would be an amazingly useful feature for use-cases where the pod health is dependent on the health of all component containers: a failure in one container invalidates the health of all others in the pod (yes, I know that isn't ideal SOA design, but legacy is going to legacy). I have tooling that currently emulates this behavior external to k8s (via a k8s-aware init replacement), but it is kludgy and would be much nicer if this support was mainlined :) |
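A rough in-cluster approximation of this coupled-health behavior (a sketch, not the tooling referred to above; all names and images are illustrative) ties every container's liveness to a shared marker file, so one container's failure fails the liveness of the others. Note that this still only restarts the individual containers in place; it does not replace the pod, which is the gap this KEP targets:

```yaml
# Illustrative only: containers share an emptyDir; when "app" exits, its
# wrapper drops a marker file, and "sidecar"'s liveness probe starts
# failing, so kubelet restarts it too. This does NOT replace the pod.
apiVersion: v1
kind: Pod
metadata:
  name: coupled-health-demo          # hypothetical name
spec:
  volumes:
    - name: health
      emptyDir: {}
  containers:
    - name: app
      image: example/app:latest      # placeholder image
      command: ["sh", "-c", "./run-app || touch /health/failed; sleep 5"]
      volumeMounts:
        - name: health
          mountPath: /health
    - name: sidecar
      image: example/sidecar:latest  # placeholder image
      livenessProbe:
        exec:
          command: ["sh", "-c", "! test -f /health/failed"]
        periodSeconds: 5
      volumeMounts:
        - name: health
          mountPath: /health
```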
|
/remove-lifecycle stale Still love to see this :) |
|
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
|
/remove-lifecycle rotten |
|
I think I and others would still really like to see this get merged. Is there anything else needed here to get the ball rolling? Can I help in any way, e.g. with documentation or design? I'd love to chat more about what this looks like and why it should exist. |
|
This overlaps a lot with other pod-lifecycle issues. I think we need a holistic review of the topic. kubernetes/kubernetes#88886 @derekwaynecarr has expressed concerns here, and the more I look, the more I agree. |
This seems to be a feature which will appear as a kubernetes feature. The issue is discussed at length in [this issue](kubernetes/kubernetes#25908) and there is an [open PR](kubernetes/enhancements#912) which should help close it. Signed-off-by: Ian Allison <iana@pims.math.ca>
|
/remove-lifecycle stale By default, just restart the container as it does today. But an option to replace the pod would be super. |
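If such an option existed, it might surface as a per-container marker. The following is purely a hypothetical sketch of the shape such an API could take; the `critical` field does not exist in Kubernetes and is not necessarily what the KEP text proposed:

```yaml
# Hypothetical API shape only -- not implemented in Kubernetes today.
# The idea: marking a container "critical" would cause the whole pod to
# be replaced when that container fails, instead of restarting the
# container in place.
apiVersion: v1
kind: Pod
metadata:
  name: critical-container-demo   # hypothetical name
spec:
  restartPolicy: Always
  containers:
    - name: main
      image: example/app:latest   # placeholder image
      critical: true              # hypothetical field: failure replaces the pod
```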
|
Hi, I know this KEP has languished. IMO this needs to be part of a larger pod-lifecycle effort. I think it is likely to be highly entangled with a bunch of other proposals and ideas, and we should probably not be patching each one individually. e.g. kubernetes/community#2342 kubernetes/kubernetes#25908 and others that I can't find links for right now. I do support the idea expressed here, but I don't think we should do it piecemeal. |
|
@thockin I saw your comment about this needing a larger review of the pod-lifecycle. Do you have any suggestions for what the next steps for such an effort would be? My use-case for this is that we have a service which measures the time it takes an ingress to discover the pod the service is running in and reconfigure itself to serve traffic (e.g. nginx reload). After the pod detects the ingress is up to date, it reports the metric and kills itself, so it can be rescheduled with a new pod IP. Unfortunately there is not a clean way to do this currently: if a container exits abnormally, kubelet restarts the container in place and the pod retains its IP/network state. As a terrible workaround, we have the service exit by exceeding its ephemeral disk quota, which causes kubelet to evict and delete the pod, then reschedule it. This KEP would allow us to exit cleanly and let kubelet restart the entire pod. I'm interested in this enhancement and want to contribute, but it is unclear to me what the next steps are. |
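The eviction workaround described above can be sketched roughly as follows: give the container a small ephemeral-storage limit and have it deliberately overrun that limit when its work is done, so kubelet evicts the pod and the owning controller recreates it with a fresh pod IP. Image names, paths, and sizes here are illustrative:

```yaml
# Illustrative sketch of the eviction workaround: the container exceeds
# its ephemeral-storage limit on purpose when it finishes, so kubelet
# evicts the pod and the owning controller reschedules it (with a new
# pod IP), instead of restarting the container in place.
apiVersion: v1
kind: Pod
metadata:
  name: self-evicting-demo                 # hypothetical name
spec:
  containers:
    - name: prober
      image: example/ingress-prober:latest # placeholder image
      resources:
        limits:
          ephemeral-storage: "10Mi"
      # run the real work, then write past the quota to trigger eviction
      command:
        - sh
        - -c
        - ./measure-ingress && dd if=/dev/zero of=/tmp/overrun bs=1M count=64
```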
|
It feels like being able to set 'restartPolicy: Never' on a ReplicaSet container (and have it be honoured) is a cleaner solution to the use case presented in this issue, and as a bonus it requires less code change, and requires no spec change. This is already logged here if there's any interest (the issue badly needs reopening, imo): |
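For context, the API today rejects that combination: a ReplicaSet's pod template must use `restartPolicy: Always`, so a spec like the sketch below fails validation. Names and images are illustrative:

```yaml
# Illustrative: this manifest is rejected by current validation, because
# a ReplicaSet pod template only accepts restartPolicy: Always. Honouring
# Never here (container exit -> pod replacement by the controller) is the
# cleaner alternative the comment above suggests.
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: restart-never-demo        # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels: {app: demo}
  template:
    metadata:
      labels: {app: demo}
    spec:
      restartPolicy: Never        # currently invalid for ReplicaSet templates
      containers:
        - name: main
          image: example/app:latest   # placeholder image
```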
|
Hi all, This PR has not seen any updates in nearly two years, and I see there are some concerns about the approach. The KEP templates have changed pretty significantly since this PR was opened, so it is not in a state where it is mergeable. Hence, I am going to close this, and if someone has time to pick up this idea and ferry it through the KEP process and propose a new PR, that would be much appreciated. |
Signed-off-by: Gergely Csatari gergely.csatari@nokia.com