
KEP for critical container feature #912

Closed
wants to merge 4 commits

Conversation

@CsatariGergely (Author) commented Mar 22, 2019

Signed-off-by: Gergely Csatari <gergely.csatari@nokia.com>

Signed-off-by: Gergely Csatari <gergely.csatari@nokia.com>
@k8s-ci-robot (Contributor) commented Mar 22, 2019

Hi @CsatariGergely. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot (Contributor) commented Mar 22, 2019

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: CsatariGergely
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: derekwaynecarr

If they are not already assigned, you can assign the PR to them by writing /assign @derekwaynecarr in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@CsatariGergely (Author) commented Mar 22, 2019

@derekwaynecarr (Member) left a comment

Thank you for proposing this idea.

Can we discuss this in more detail at a future SIG node meeting so we understand the problems and goals?

Review thread on keps/sig-node/20190333-critical-container.md (outdated, resolved)
Review thread on keps/sig-node/20190333-critical-container.md (resolved)
@libesz commented Mar 22, 2019

@derekwaynecarr You can find some use-cases here: kubernetes/kubernetes#40908
Pod replacement, in my mind, is the process in which all of the containers, including the infra (pause) container, are terminated and re-created from the PodSpec.
The idea was that we don't want to introduce a new procedure to "restart" a pod in place (one that would, for example, preserve the emptyDir volumes); instead we reuse the existing terminate and create procedures for the Pod, as performed by the kubelet. Hope our assumptions are correct 🤞.
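
To make those semantics concrete, here is a minimal Go sketch (purely illustrative, added for this write-up) of the decision a kubelet-like sync loop could take. The annotation key, the `action` type, and the helper names are assumptions for illustration only, not the API proposed in this KEP.

```go
// Illustrative sketch only; not kubelet code and not necessarily the API this KEP proposes.
package criticalsketch

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// action stands in for what a kubelet-like sync loop decides to do.
type action int

const (
	restartContainerOnly action = iota // today's behaviour: restart the failed container in place
	replacePod                         // terminate everything, incl. the pause container, and re-create from the PodSpec
)

// isCritical checks a hypothetical pod annotation listing critical container names,
// e.g. "example.com/critical-containers: app,db-agent". The key is an assumption.
func isCritical(pod *corev1.Pod, containerName string) bool {
	for _, name := range strings.Split(pod.Annotations["example.com/critical-containers"], ",") {
		if strings.TrimSpace(name) == containerName {
			return true
		}
	}
	return false
}

// decide applies the replacement semantics described above: if any critical container
// has terminated, the whole pod is replaced (emptyDir volumes and the pod IP are not
// preserved); otherwise only the failed container is restarted, as today.
func decide(pod *corev1.Pod) action {
	for _, st := range pod.Status.ContainerStatuses {
		if st.State.Terminated != nil && isCritical(pod, st.Name) {
			return replacePod
		}
	}
	return restartContainerOnly
}
```

The point of the sketch is only that "replacement" reuses the existing terminate-and-create paths rather than introducing an in-place pod restart.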

Some clarification on the definition of Pod replacement based on the PR discussion.

Signed-off-by: Gergely Csatari <gergely.csatari@nokia.com>
@CsatariGergely (Author) commented Mar 25, 2019

@derekwaynecarr Thanks for the comments. Joining the SIG Node meeting this week is a bit difficult from the CET timezone. Next week I will be in a closer timezone and will try to join.

@CsatariGergely (Author) commented Apr 2, 2019

Notes from the SIG Node meeting on 02.04.2019:

  • The sidecar containers KEP1 and KEP2 do something similar, in the sense that some containers are handled differently from the pod's lifecycle point of view
  • We should check whether this provides a solution to our problem

@libesz commented Apr 25, 2019

Notes from the SIG Node meeting on 02.04.2019:

* Sidecar containers [KEP1](https://github.com/kubernetes/enhancements/pull/919) and [KEP2](https://github.com/kubernetes/enhancements/blob/master/keps/sig-apps/sidecarcontainers.md) do something similar, in the sense that some containers are handled differently from the pod's lifecycle point of view

* We should check whether this provides a solution to our problem

Just checked the other KEPs. I believe those cover different use cases around lifecycle management, while this one aims to cover failure/recovery scenarios. I would consider this one independent from a specification/implementation point of view.

Signed-off-by: Gergely Csatari <gergely.csatari@nokia.com>
@justaugustus (Member) commented Apr 28, 2019

/ok-to-test

@InAnimaTe commented May 18, 2019

This is a great KEP and I'd love to see it get approved. @CsatariGergely @derekwaynecarr what specifically is this waiting on before it can be approved?

@CsatariGergely (Author) commented May 21, 2019

Thanks @InAnimaTe :) I would also be interested in figuring out what is needed to approve this. I'm available in Barcelona to discuss, if that helps.

However it is not clear from the KEP process; the initial state of a KEP should be provisional.
@thockin (Member) left a comment

I just want to say that I have heard this request many times, and while I continue to assert that "you're doing it wrong" it seems common enough to consider. I urge you to look for the simplest design that satisfies the preponderance of users.

@fejta-bot commented Sep 8, 2019

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@bitva77 commented Sep 16, 2019

/remove-lifecycle stale

We've recently run across a potential bug in .NET Core containers where an enhancement like this would help us work around downtime in our service (restarting the containers is not enough; the pod needs to be restarted. I suspect the issue lies more with Docker and .NET).

@byxorna commented Oct 8, 2019

This would be an amazingly useful feature for use cases where pod health depends on the health of all component containers - a failure in one container invalidates the health of all the others in the pod (yes, I know that isn't ideal SOA design, but legacy is going to legacy). I currently have tooling that emulates this behavior external to k8s (via a k8s-aware init replacement), but it is kludgy, and it would be much nicer if this support were mainlined :)
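
As a purely illustrative aside, one common way to emulate this outside the kubelet today (a different shape from the in-pod init-replacement tooling described above) is a small external controller built on client-go that watches opted-in pods and deletes a pod whenever one of its containers terminates, so the owning ReplicaSet/Deployment re-creates it. The namespace, label selector, and RBAC for deleting pods in the sketch below are assumptions.

```go
// Purely illustrative sketch of an external workaround; not this project's code.
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	ns := "default"                           // assumption: namespace to watch
	selector := "critical-containers=enabled" // assumption: label opting pods in
	w, err := client.CoreV1().Pods(ns).Watch(context.TODO(), metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		log.Fatal(err)
	}

	for ev := range w.ResultChan() {
		pod, ok := ev.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		for _, st := range pod.Status.ContainerStatuses {
			// If any container in an opted-in pod is reported as terminated, delete the
			// whole pod so its ReplicaSet/Deployment re-creates it with a fresh sandbox.
			if st.State.Terminated != nil {
				log.Printf("container %q in pod %s/%s terminated; deleting pod", st.Name, ns, pod.Name)
				if err := client.CoreV1().Pods(ns).Delete(context.TODO(), pod.Name, metav1.DeleteOptions{}); err != nil {
					log.Printf("delete failed: %v", err)
				}
				break
			}
		}
	}
}
```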

@fejta-bot commented Jan 6, 2020

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@bitva77 commented Jan 29, 2020

/remove-lifecycle

still love to see this :)

@fejta-bot commented Feb 28, 2020

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@bitva77 commented Mar 3, 2020

/remove-lifecycle rotten

@InAnimaTe commented Mar 19, 2020

I think I and others would still really like to see this get merged. Is there anything else needed here to get the ball rolling? Can I help in any way, such as with documentation or design? I'd love to chat more about what this looks like and why it should exist.

@thockin (Member) commented Mar 19, 2020

This overlaps a lot with other pod-lifecycle issues. I think we need holistic review of the topic.

kubernetes/kubernetes#88886
kubernetes/kubernetes#87801
kubernetes/kubernetes#87016
kubernetes/kubernetes#88257
kubernetes/kubernetes#65502
kubernetes/kubernetes#80744
kubernetes/kubernetes#40908
kubernetes/kubernetes#25908
The pod-restart one which I can't find right now
and so forth

@derekwaynecarr has expressed concerns here and the more I look, the more I agree.

ianabc added a commit to pimsmath/hubtraf that referenced this issue Mar 21, 2020
This seems to be a feature which will eventually appear in Kubernetes. The issue is discussed at length in [this issue](kubernetes/kubernetes#25908) and there is an [open PR](kubernetes/enhancements#912) which should help close it.

Signed-off-by: Ian Allison <iana@pims.math.ca>
@fejta-bot commented Jun 17, 2020

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@bitva77 commented Jun 18, 2020

/remove-lifecycle stale

By default, just restart the container as it does today. But an option to replace the pod would be super.

@thockin (Member) commented Jul 1, 2020

Hi,

I know this KEP has languished. IMO this needs to be part of a larger pod-lifecycle effort. I think it is likely to be highly entangled with a bunch of other proposals and ideas, and we should probably not be patching each one individually.

e.g. kubernetes/community#2342 kubernetes/kubernetes#25908 and others that I can't find links for right now.

I do support the idea expressed here, but I don't think we should do it piecemeal.

@cnmcavoy commented Sep 16, 2020

@thockin I saw your comment about this needing a larger review of the pod-lifecycle. Do you have any suggestions for what the next steps for such an effort would be?

My use case for this is that we have a service which measures the time it takes an ingress to discover the pod the service is running in and to reconfigure itself to serve traffic (e.g. an nginx reload). After the pod detects that the ingress is up to date, it reports the metric and kills itself, so it can be rescheduled with a new pod IP. Unfortunately there is no clean way to get this behavior currently: if a container exits abnormally, the kubelet restarts the container in place and the pod retains its IP/network state. As a terrible workaround, we have the service exit by exceeding its ephemeral disk quota, which causes the kubelet to evict and delete the pod, then reschedule it. This KEP would allow us to exit cleanly and let the kubelet restart the entire pod.
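
As a purely illustrative aside, a somewhat cleaner variant of that workaround is for the container to delete its own Pod object through the API instead of exhausting its ephemeral-storage quota; the owning controller then schedules a replacement with a new pod IP. The sketch below assumes POD_NAME and POD_NAMESPACE are injected via the downward API and that the pod's service account is permitted to delete pods.

```go
// Purely illustrative sketch; assumes downward-API env vars and RBAC to delete pods.
package main

import (
	"context"
	"log"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// deleteSelf asks the API server to delete the pod this process runs in, so the
// owning controller (e.g. a Deployment) schedules a replacement with a new pod IP.
func deleteSelf() error {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}
	name := os.Getenv("POD_NAME")           // assumed to be injected via fieldRef: metadata.name
	namespace := os.Getenv("POD_NAMESPACE") // assumed to be injected via fieldRef: metadata.namespace
	return client.CoreV1().Pods(namespace).Delete(context.TODO(), name, metav1.DeleteOptions{})
}

func main() {
	// ... measure and report the ingress-propagation metric here, then replace the
	// pod cleanly instead of exhausting the ephemeral-storage quota to force eviction.
	if err := deleteSelf(); err != nil {
		log.Fatal(err)
	}
}
```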

I'm interested in this enhancement and want to contribute, but it is unclear to me what the next steps are.

@dbsanfte commented Nov 23, 2020

It feels like being able to set 'restartPolicy: Never' on a ReplicaSet container (and have it be honoured) is a cleaner solution to the use case presented in this issue, and as a bonus it requires less code change and no spec change.

This is already logged here if there's any interest (the issue badly needs reopening, imo):

kubernetes/kubernetes#24725

@ehashman (Member) commented Jan 9, 2021

Hi all,

This PR has not seen any updates in nearly two years, and I see there are some concerns about the approach. The KEP templates have changed pretty significantly since this PR was opened, so it is not in a state where it is mergeable. Hence, I am going to close this, and if someone has time to pick up this idea and ferry it through the KEP process and propose a new PR, that would be much appreciated.

@ehashman closed this Jan 9, 2021