Graceful Termination for External Traffic Policy Local #1607
Conversation
```go
// how long existing connections to an endpoint should be kept open after the endpoint
// is deleted.
// +optional
TerminationGracePeriodSeconds *int64 `json:"terminationGracePeriodSeconds,omitempty" protobuf:"bytes,6,opt,name=terminationGracePeriodSeconds"`
```
@robscott mentioned that an alternative would be to add a `Terminating` field in `EndpointCondition` (kubernetes/kubernetes#88590 (comment)). That would require some larger changes in the endpointslice controller, but it might be worthwhile. Thoughts?
Procedurally this should be a new KEP.
I am not convinced this can really be used for what you want to use it for. I am also worried about the expansion of per-endpoint metadata. As we try to map endpoint slice onto other use-cases, the size of them starts to matter. If that propagates into a per-group value, this could become another thing that chops-up otherwise contiguous blocks. I expect in practice it will mostly be the same across all pods, so IF we decide to do something, it should probably be scoped to the whole slice. This does make the slice controller more complicated again.
> An optional field `TerminationGracePeriodSeconds` can be added to each Endpoint indicating the longest duration an endpoint is given to gracefully shut down. This is analogous to `TerminationGracePeriodSeconds` in `PodSpec`. In most cases, the endpointslice controller will set `TerminationGracePeriodSeconds` on an `Endpoint` based on the termination grace period in `PodSpec`. External implementations of the `EndpointSlice` API can set this field arbitrarily based on specific implementation requirements.
>
> This field allows consumers of the API to know the termination grace period of endpoints (pods in this case) without having to watch Pods. A specific use-case for this field is the [graceful termination handler in the IPVS proxy](https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/ipvs/graceful_termination.go). Because deleting an IPVS backend (a.k.a. IPVS real server) immediately deletes all its corresponding conntrack entries, deleting a real server is disruptive during a pod's graceful termination period. Today, the IPVS proxy will only delete a real server if it sees that it has no connections or it has exceeded a hardcoded duration limit (15m). The `TerminationGracePeriodSeconds` field would remove the need for this hardcoded limit, improving overall performance of the proxy as it no longer needs to track stale backends.
I'm not sure I buy that this is useful for this. First, it's the longest timeout. The pod may choose to exit immediately or the node may go down, causing reduced grace. In that case, you will still be routing traffic to it as a realserver, when you could have stopped.
This also doesn't help if kube-proxy gets restarted, right? At that point the endpoint has been removed from the slice and you've lost any "memory" of it, so you will (should) nuke it. I guess that's strictly better than today, but see previous paragraph.
How does IPVS expect a backend to be removed? I don't see a "drain" command...
You have two options:
- set the backend weight to 0, which prevents new connections but allows existing ones to gracefully terminate (this is what we do today on endpoint deletion)
- delete the backend, which removes it and terminates the connections by clearing the conntrack entries and sending a RST on the next packet (which we only do today when there are no active connections to the backend)
> The pod may choose to exit immediately or the node may go down, causing reduced grace. In that case, you will still be routing traffic to it as a realserver, when you could have stopped.
As @lbernail mentioned, for the IPVS proxier we reduce the weight of the backend to 0 so no new connections are created; the grace period only applies to existing connections. The IPVS proxier also tracks connections by state, so we would remove the terminating endpoint early if we saw that all existing connections closed cleanly. In the few edge cases where a connection isn't closed properly, the real server would stay around. So at worst we would be leaving an existing connection open longer than it should be, up to `terminationGracePeriodSeconds`, but this is still better than the 15m duration today.
I see, yeah that makes more sense. :) Thanks for explaining.
You still have the kube-proxy restart problem, right? Specifically
- Pod X is deleted
- 10 minute grace period starts
- X is removed from endpoints
- kube-proxy sets IPVS weight to 0
- kube-proxy restarts
- WATCH returns endpoints without X
- kube-proxy observes realserver X with weight 0 but has no idea if that is seconds old or days old
So it seems to lose information, and at some point needs to make a dubious assumption about how long to wait.
yeah that's true -- for what it's worth the restart problem exists today since graceful termination handling is all done in local memory.
The solution I proposed #1607 (comment) would fix this, but requires changing the behavior so that terminating pods are kept in the slice.
Why were "not ready" addresses not used?
This is a good question. I don't think I know the answer. @robscott you have more recent context... ?
If NotReady endpoints were always in the slice, we could easily tell the difference between a NotReady endpoint (weight=0) and a deleted endpoint (absent). However, I think NotReady endpoints are only present when PublishNotReadyAddresses is enabled.
If we always had NotReady endpoints in the slice, that would definitely be enough, but it would require some changes in the logic of kube-proxy (and maybe other tools relying on endpoints) to look at PublishNotReadyAddresses to decide on the behavior for a specific service.
lol oops, this was not supposed to be in here, but yeah, this was also raised in this thread: kubernetes/kubernetes#88590 (comment)
Let's try to drag the conversation from PR back to here.
Ignore "tolerate unready", which just merges the two states.
Endpoints (not slice) puts pods in NotReadyAddresses on startup, but on teardown (when DeletionTimestamp is set) it removes them entirely.
If we moved them BACK to NotReady on teardown it would be a little bit of a behavioral change, but it's semantically correct and seems unlikely to break users. In the case of tolerate-unready, we'd need to decide what the "right" behavior is. Today pods being deleted are removed, regardless of tolerate-unready. We could a) put them in NotReadyAddresses (which retains existing behavior -- they are not in Addresses) or b) put them in Addresses (which is a behavioral change, but consistent).
The bigger issue is whether this would cause a scalability problem. It would cause another round of writes to the Endpoints object (ready->unready->gone vs today's ready->gone), which is already problematic at scale. It seems that we don't actively batch updates to a single Endpoints.
For EndpointSlice we have the same question, though less pathologically. For a single pod it's much less of a big deal. In practice, we're likely to see this when a deployment is being updated, so it will hit all pods anyway. This is why we held off adding more Conditions - each one is a cost scalar. This is roughly equivalent to adding a new Condition (and indeed we should discuss that choice).
That said, I prefer making this sort of change to adding a new field. I really want to keep EPSlice general-purpose and grace period feels weird. Also, the above "memory loss" problem exists, whereas "unready" is more complete.
Do you feel this way about adding to
I think the Conditions is a better design, but we should think about the meaning of them and their orthogonality. What we don't really want is to encode a full state-machine of the pod lifecycle into a set of bit fields. For example (ready: false, terminating: false) -> (ready: true, terminating: false) -> (???) Is the terminal state (ready: false, terminating: true) or (ready: true, terminating: true)? In either case, what does the opposite mean? They aren't really orthogonal.

For better or worse, we did not call "ready" as "started", in part because that implies keeping history rather than simply observing state. This is important because history can get lost, but state is always re-observable.

Upon experimenting, Pods (curiously) DO NOT get set to unready (status.Conditions "Ready"="False") when they are being deleted (deletionTimestamp != nil). I don't know if this is an oversight or intentional, but if it is intentional (or can't be changed) the compatible thing to do would be to add a new condition (and a new TerminatingAddresses field in classic Endpoints? Euchh).

@derekwaynecarr since we were JUST talking about pod lifecycle: do you consider the paragraph above a "bug"? Specifically "pods do not get set to unready when they are being deleted". This suggests (ready: true, terminating: true) is "normal end of life". So what does (ready: false, terminating: true) mean?
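One way to see the orthogonality question is to enumerate the four (ready, terminating) pairs and ask what each must mean to a consumer. The mapping below is one plausible reading for discussion, not a settled API contract:

```go
package main

import "fmt"

// describe assigns a meaning to each (ready, terminating) combination.
// Because the two conditions are not orthogonal, every consumer has to
// make this mapping explicit; these readings are illustrative only.
func describe(ready, terminating bool) string {
	switch {
	case ready && !terminating:
		return "serving: normal endpoint"
	case ready && terminating:
		return "draining: still passes readiness, DeletionTimestamp set"
	case !ready && terminating:
		return "shutting down: no longer passes readiness"
	default:
		return "starting: not yet ready"
	}
}

func main() {
	states := []struct{ r, t bool }{{true, false}, {true, true}, {false, true}, {false, false}}
	for _, s := range states {
		fmt.Printf("ready=%v terminating=%v -> %s\n", s.r, s.t, describe(s.r, s.t))
	}
}
```

Under this reading, (ready: true, terminating: true) is the "normal end of life" state the comment above asks about, and (ready: false, terminating: true) is the tail end of shutdown after readiness is lost.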
@andrewsykim PTAL at kubernetes/kubernetes#85643 (comment) I think this KEP and that issue are pretty closely related. I would sign up to shepherd the work, but I don't have the time to do all the implementation legwork right now. Is this something you'd be willing/able to carry? It's a slight expansion of scope over what you proposed here (which is just the API, no consumers thereof).
I've just read that comment and I think it may have significant performance/scalability implications. So if we decide to put it into this KEP, I would like to be a reviewer (BTW, it will be a great thing to validate the production readiness questionnaire too, which you will have to fill in anyway).
Yes, I can try to carry this one forward. Will dig into the issue for more context.
Noted, I'll ping you once I've updated the KEP.
@thockin @wojtek-t I took a stab at implementing kubernetes/kubernetes#85643 (comment) to get a better idea of the work involved and to help facilitate discussion in the KEP (see kubernetes/kubernetes#89780). I think the semantics get a little weird if we fall back to local not-ready endpoints, because a not-ready endpoint can be terminating or actually failing readiness. In that case should we still fall back to not-ready endpoints? If not, I think a
So I missed this when it was open but I consider the current behavior of services while terminating under external traffic policy local to be a bug. In general a terminating pod (deletionTimestamp set) wants:
I think as a platform we are not in a great spot to offer correct zero-downtime drains and rebalances for workers (I have DaemonSet surge #1590 open to fix it so that network pods can correctly hand off responsibilities; we have fixed maybe 5-10 issues in drain over the last few months). I would urge us to ensure services can be used safely for zero-disruption handoff of new connections before we add more features to services.
> Pod that has a valid `DeletionTimestamp`. Controllers that manage EndpointSlice objects may want to opt out of this feature for optimization reasons (e.g. avoiding per-endpoint metadata expansion), which can be accomplished by setting the field to nil.
>
> This field allows consumers of the API to know if endpoints are currently terminating without having to initiate a watch on Pods. A specific use-case for this field is the [graceful termination handler in the IPVS proxy](https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/ipvs/graceful_termination.go). Because deleting an IPVS backend (a.k.a. IPVS real server) immediately deletes all its corresponding conntrack entries, it can be disruptive to delete a real server during its graceful termination period. The solution today is to only delete a real server if it sees that it has no connections or it has exceeded a hardcoded duration limit (15m). This approach is not foolproof as the state of "terminating endpoints" is not preserved when kube-proxy is restarted. The `Terminating` field would remove the need for storing terminating state in memory, improving the overall design and performance of the proxy as it no longer needs to track stale backends in memory.
I think we should explicitly call out, in either this KEP or another, the use case around graceful shutdown without disruption for pods using services via nodeport, local traffic policy, and service proxy at minimum as the guiding goal. This actually loses some of the context, which is that we generally want to ensure that pods continue to receive existing-connection traffic for all services while new connections are rapidly (or less rapidly, with readiness gates) diverted away during termination.
I'm trying to add e2e tests to the kube suite that will ensure this is possible for a range of services (kubernetes/kubernetes#89547 is one of the first changes we need so that we have a test shim).
Noted, I'll put consumption of the API in a separate KEP.
FYI: updated the KEP so it's split into two separate KEPs and proposes to only fall back to ready terminating endpoints (instead of all terminating endpoints).
Based on discussions in #1607 (comment), I've updated the KEP to walk through all terminating states for external node-local traffic: {R, !T} -> {R, T} -> {!R, T} -> {}. I've also updated kubernetes/kubernetes#89780 to reflect the changes in this KEP in case some of the details were not clear in writing.
I'm approving. I'd like Rob to confirm this will not be a shock for EPSlice. Only one comment of note in my feedback.
/approve
> * local ready & terminating endpoints
> * local not ready & terminating endpoints
> * blackhole traffic
> * for all other traffic, preserve existing behavior where traffic is only sent to ready && !terminating endpoints.
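The fallback order above can be sketched as a filter over the node-local endpoints: use ready & !terminating endpoints when any exist, otherwise fall back to ready & terminating, then to not-ready & terminating, and finally blackhole. This is a sketch only; the `endpoint` type and `pickLocal` function are illustrative, not kube-proxy's actual data structures.

```go
package main

import "fmt"

// endpoint is a minimal illustrative stand-in for an EndpointSlice endpoint.
type endpoint struct {
	name               string
	ready, terminating bool
}

// pickLocal applies the proposed fallback for externalTrafficPolicy=Local.
// An empty result means traffic is blackholed.
func pickLocal(eps []endpoint) []endpoint {
	var serving, readyTerm, notReadyTerm []endpoint
	for _, ep := range eps {
		switch {
		case ep.ready && !ep.terminating:
			serving = append(serving, ep)
		case ep.ready && ep.terminating:
			readyTerm = append(readyTerm, ep)
		case ep.terminating:
			notReadyTerm = append(notReadyTerm, ep)
		}
	}
	if len(serving) > 0 {
		return serving // normal case: only non-terminating ready endpoints
	}
	if len(readyTerm) > 0 {
		return readyTerm // all local endpoints terminating: prefer the still-ready ones
	}
	return notReadyTerm // last resort; empty slice means blackhole
}

func main() {
	eps := []endpoint{
		{"a", true, true},  // draining but still passing readiness
		{"b", false, true}, // draining, no longer ready
	}
	// With no serving endpoints left on the node, traffic falls back to "a".
	fmt.Println(pickLocal(eps))
}
```

This matches the intent discussed earlier in the thread: new connections keep landing somewhere sane during a rolling update of all local pods, instead of being dropped the moment the last endpoint starts terminating.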
Why this distinction? I forget if we decided "start here and expand" or if there's a reason to stop at this level. Why NOT do this for all?
This distinction is actually referring to the externalTrafficPolicy=Cluster cases.
Updated to make this clearer
```yaml
- "@smarterclayton"
approvers:
  - "@thockin"
editor: TBD
```
editor has been removed from the template :)
@@ -0,0 +1,20 @@
title: Tracking Terminating Endpoints in EndpointSlice |
@robscott Are you good with this?
Added a couple small nits, but LGTM overall. This specific KEP feels like it might make sense as part of the existing EndpointSlice KEP. That has traditionally been where the EndpointSlice API and any changes to it have been defined. With that said, I know that could result in it becoming a massive KEP. Not sure what the best practice is here.
Tim had previously suggested that this should be a new KEP (see here) and I agree.
I think it would make sense for this KEP to summarize the motivation for the change but we should still backport the API changes into the main EPSlice KEP with maybe a reference to this one?
(I'll open a follow-up PR for this)
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: andrewsykim, thockin
Thanks for the work on this!
```go
// terminating indicates if this endpoint is terminating. Consumers should assume a
// nil value indicates the endpoint is not terminating.
// +optional
Terminating *bool `json:"terminating,omitempty" protobuf:"bytes,2,name=terminating"`
```
Why a pointer here? Would it be safe to get rid of the nil case since it's indistinguishable from false here?
I'm assuming that the nil state could be useful in the future. Some consumers of EPSlice might have a special case for when the terminating state is unknown vs. when it's explicitly false. Maybe I'm overthinking this?
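A consumer handling the pointer would typically collapse nil to false for routing decisions while keeping the ability to detect "unset". A minimal sketch of that tri-state handling, assuming the `Terminating *bool` field as proposed above:

```go
package main

import "fmt"

// isTerminating follows the proposed field comment: a nil Terminating
// pointer is treated the same as an explicit false for routing purposes.
func isTerminating(t *bool) bool {
	return t != nil && *t
}

func main() {
	var unset *bool // field omitted by the producer
	f, tr := false, true

	fmt.Println(isTerminating(unset)) // false: nil treated as not terminating
	fmt.Println(isTerminating(&f))    // false: explicitly not terminating
	fmt.Println(isTerminating(&tr))   // true: explicitly terminating

	// The pointer still lets consumers that care distinguish "unknown"
	// from an explicit false:
	fmt.Println(unset == nil) // true
}
```

This is the usual Kubernetes API pattern for optional booleans: `*bool` keeps round-trip fidelity for producers that never set the field, at the cost of a nil check in every consumer.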
> ### Upgrade / Downgrade Strategy
>
> Since this is an addition to the EndpointSlice API, the upgrade/downgrade strategy will follow that of the [EndpointSlice API work](/keps/sig-network/20190603-EndpointSlice-API.md).
This and other links to the EndpointSlice KEP are out of date: https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/20190603-endpointslices/README.md.
fixed
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
/lgtm
This PR adds two related, but separate KEPs:

Tracking Terminating Endpoints in EndpointSlice API (KEP-1672)

Adds a field `terminating` to the `endpointCondition` field in the EndpointSlice API. This allows consumers of the API to know when a pod is terminating in case extra steps are required during a pod's graceful termination period. One use-case (in addition to the one below) is to simplify the graceful termination logic in the IPVS proxier; more details in the PR.

Graceful Termination for External Traffic Policy Local (KEP-1669)

Uses the proposed changes to the EndpointSlice API above to enable zero-downtime deployments for Services with Type=LoadBalancer/NodePort and ExternalTrafficPolicy=Local. This is mainly accomplished by redirecting traffic to ready terminating endpoints if all endpoints on a given node are terminating.
Continuing discussion from kubernetes/kubernetes#85643 (comment)
note: first commit just fixes some trailing white spaces.