Graceful Termination for External Traffic Policy Local #1607

Merged

Conversation

andrewsykim
Member

@andrewsykim andrewsykim commented Mar 10, 2020

This PR adds two related, but separate KEPs:

Tracking Terminating Endpoints in EndpointSlice API (KEP-1672)

Adds a `terminating` field to the endpoint conditions in the EndpointSlice API. This allows consumers of the API to know when a pod is terminating, in case extra steps are required during the pod's graceful termination period. One use-case (in addition to the one below) is simplifying the graceful termination logic in the IPVS proxier; more details are in the PR.

Graceful Termination for External Traffic Policy Local (KEP-1669)

Uses the proposed changes to the EndpointSlice API above to enable zero-downtime deployments for Services with Type=LoadBalancer/NodePort and ExternalTrafficPolicy=Local. This is mainly accomplished by redirecting traffic to ready terminating endpoints if all endpoints on a given node are terminating.

Continuing discussion from kubernetes/kubernetes#85643 (comment)

Note: the first commit just fixes some trailing whitespace.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes, size/M, kind/kep, and sig/network labels Mar 10, 2020
@andrewsykim
Member Author

// how long existing connections to an endpoint should be kept open after the endpoint
// is deleted.
// +optional
TerminationGracePeriodSeconds *int64 `json:"terminationGracePeriodSeconds,omitempty" protobuf:"bytes,6,opt,name=terminationGracePeriodSeconds"`
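
For illustration, here is a minimal sketch of how the endpointslice controller might populate the proposed field from a pod's spec. The helper name and package are hypothetical; only the field above is from the KEP.

```go
package endpointutil

import (
	v1 "k8s.io/api/core/v1"
)

// terminationGracePeriod returns the value the controller could copy into the
// proposed Endpoint.TerminationGracePeriodSeconds field: nil while the pod is
// running, and a copy of the pod's grace period once graceful termination has
// started (i.e. DeletionTimestamp is set).
func terminationGracePeriod(pod *v1.Pod) *int64 {
	if pod.DeletionTimestamp == nil || pod.Spec.TerminationGracePeriodSeconds == nil {
		return nil
	}
	seconds := *pod.Spec.TerminationGracePeriodSeconds
	return &seconds
}
```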
Member Author

@andrewsykim andrewsykim Mar 10, 2020

@robscott mentioned that an alternative would be to add a Terminating field in EndpointCondition (kubernetes/kubernetes#88590 (comment)). That would require some larger changes in the endpointslice controller but it might be worthwhile, thoughts?

Member

@thockin thockin left a comment

Procedurally this should be a new KEP.

I am not convinced this can really be used for what you want to use it for. I am also worried about the expansion of per-endpoint metadata. As we try to map endpoint slice onto other use-cases, the size of them starts to matter. If that propagates into a per-group value, this could become another thing that chops-up otherwise contiguous blocks. I expect in practice it will mostly be the same across all pods, so IF we decide to do something, it should probably be scoped to the whole slice. This does make the slice controller more complicated again.


An optional field `TerminationGracePeriodSeconds` can be added to each Endpoint indicating the longest duration an endpoint is given to gracefully shut down. This is analogous to `TerminationGracePeriodSeconds` in `PodSpec`. In most cases, the endpointslice controller will set `TerminationGracePeriodSeconds` on an `Endpoint` based on the termination grace period on `PodSpec`. External implementations of the `EndpointSlice` API can set this field arbitrarily based on specific implementation requirements.

This field allows consumers of the API to know the termination grace period of endpoints (pods in this case) without having to watch Pods. A specific use-case for this field is the [graceful termination handler in the IPVS proxy](https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/ipvs/graceful_termination.go). Because deleting an IPVS backend (a.k.a IPVS real server) immediately deletes all its corresponding conntrack entries, deleting a real server is disruptive during a pod's graceful termination period. Today, the IPVS proxy will only delete a real server if it sees that it has no connections or it has exceeded a hardcoded duration limit (15m). The `TerminationGracePeriodSeconds` field would remove the need for this hardcoded limit, improving overall performance of the proxy as it no longer needs to track stale backends.
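
To illustrate that last point (this is a sketch, not the actual kube-proxy code; all names below are made up), a proxy that learns the grace period from the slice can bound how long it keeps a drained real server around instead of relying on the hardcoded 15-minute limit:

```go
package ipvsgrace

import "time"

// staleRealServer is a hypothetical record kube-proxy could keep for a real
// server whose endpoint has been removed but may still have open connections.
type staleRealServer struct {
	deletionObserved time.Time     // when the endpoint disappeared
	gracePeriod      time.Duration // from the proposed TerminationGracePeriodSeconds
}

// shouldDelete reports whether the drained real server can be removed: either
// all of its connections have closed, or the endpoint's own grace period has
// elapsed (replacing the fixed 15m cutoff).
func shouldDelete(rs staleRealServer, activeConnections int, now time.Time) bool {
	if activeConnections == 0 {
		return true
	}
	return now.Sub(rs.deletionObserved) >= rs.gracePeriod
}
```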
Member

I'm not sure I buy that this is useful for this. First, it's the longest timeout. The pod may choose to exit immediately or the node may go down, causing reduced grace. In that case, you will still be routing traffic to it as a realserver, when you could have stopped.

This also doesn't help if kube-proxy gets restarted, right? At that point the endpoint has been removed from the slice and you've lost any "memory" of it, so you will (should) nuke it. I guess that's strictly better than today, but see previous paragraph.

How does IPVS expect a backend to be removed? I don't see a "drain" command...

You have two options:

  • set the backend weight to 0, which prevents new connections but allows existing ones to gracefully terminate (this is what we do today on endpoint deletion)
  • delete the backend, which removes it and terminates the connections by clearing the conntrack entries and sending a RST on the next packet (which we only do today when there are no active connections to the backend); both options are sketched below
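
In rough Go terms, the two options look like this. This is a sketch against a hypothetical IPVS client interface, not the real kube-proxy or netlink API:

```go
package ipvsdrain

// Destination is a simplified stand-in for an IPVS real server entry.
type Destination struct {
	Address string
	Port    uint16
	Weight  int
}

// Client is a hypothetical interface over the kernel IPVS tables.
type Client interface {
	UpdateRealServer(serviceKey string, dst Destination) error
	DeleteRealServer(serviceKey string, dst Destination) error
}

// drain is option 1: keep the backend but stop new connections by setting
// its weight to 0; established connections keep flowing.
func drain(c Client, svc string, dst Destination) error {
	dst.Weight = 0
	return c.UpdateRealServer(svc, dst)
}

// remove is option 2: delete the backend outright, which also clears its
// conntrack entries, so any remaining connection sees a RST on its next packet.
func remove(c Client, svc string, dst Destination) error {
	return c.DeleteRealServer(svc, dst)
}
```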

Member Author

The pod may choose to exit immediately or the node may go down, causing reduced grace. In that case, you will still be routing traffic to it as a realserver, when you could have stopped.

As @lbernail mentioned, for the IPVS proxier we reduce the weight of the backend to 0 so no new connections are created, so the grace period only applies to existing connections. The IPVS proxier also tracks connections by state, so we would remove the terminating endpoint early if we saw that all existing connections closed cleanly. In the few edge cases where a connection isn't closed properly, the real server would stay around. So at worst we would be leaving an existing connection open longer than it should be, up to terminationGracePeriodSeconds, but this is still better than the 15m duration today.

Member

I see, yeah that makes more sense. :) Thanks for explaining.

You still have the kube-proxy restart problem, right? Specifically

  • Pod X is deleted
  • 10 minute grace period starts
  • X is removed from endpoints
  • kube-proxy sets IPVS weight to 0
  • kube-proxy restarts
  • WATCH returns endpoints without X
  • kube-proxy observes realserver X with weight 0 but has no idea if that is seconds old or days old

So it seems to lose information, and at some point needs to make a dubious assumption about how long to wait.

Member Author

yeah that's true -- for what it's worth the restart problem exists today since graceful termination handling is all done in local memory.

The solution I proposed in #1607 (comment) would fix this, but requires changing the behavior so that terminating pods are kept in the slice.


This field allows consumers of the API to know the termination grace period of endpoints (pods in this case) without having to watch Pods. A specific use-case for this field is the [graceful termination handler in the IPVS proxy](https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/ipvs/graceful_termination.go). Because deleting an IPVS backend (a.k.a IPVS real server) immediately deletes all its corresponding conntrack entries, deleting a real server is disruptive during a pod's graceful termination period. Today, the IPVS proxy will only delete a real server if it sees that it has no connections or it has exceeded a hardcoded duration limit (15m). The `TerminationGracePeriodSeconds` field would remove the need for this hardcoded limit, improving overall performance of the proxy as it no longer needs to track stale backends.

Why were "not ready" addresses not used?
Member

This is a good question. I don't think I know the answer. @robscott you have more recent context... ?

If NotReady endpoints were always in the slice, we could easily tell the difference between a NotReady endpoint (weight=0) and a deleted endpoint (absent). However, I think NotReady endpoints are only present when PublishNotReadyAddresses is enabled.
If we always had NotReady endpoints in the slice, it would definitely be enough, but it would require some changes in the logic of kube-proxy (and maybe other tools relying on endpoints) to look at PublishNotReadyAddresses to decide on the behavior for a specific service.

Member Author

lol oops, this was not supposed to be in here, but yeah this was also raised in this thread: kubernetes/kubernetes#88590 (comment)

Member

Let's try to drag the conversation from the PR back to here.

Ignore "tolerate unready", which just merges the two states.

Endpoints (not slice) puts pods in NotReadyAddresses on startup, but on teardown (when DeletionTimestamp is set) it removes them entirely.

If we moved them BACK to NotReady on teardown it would be a little bit of a behavioral change, but it's semantically correct and seems unlikely to break users. In case of tolerate-unready, we'd need to decide what the "right" behavior is. Today pods being deleted are removed, regardless of tolerate-unready. We could a) put them in NotReadyAddresses (which retains existing behavior -- they are not in Addresses) or b) put them in Addresses (which is a behavioral change, but consistent).

The bigger issue is whether this would cause a scalability problem. It would cause another round of writes to the Endpoints object (ready->unready->gone vs today's ready->gone), which is already problematic at scale. It seems that we don't actively batch updates to a single Endpoints object.

For EndpointSlice we have the same question, though less pathologically. For a single pod it's much less of a big deal. In practice, we're likely to see this when a deployment is being updated, so it will hit all pods anyway. This is why we held off adding more Conditions - each one is a cost scalar. This is roughly equivalent to adding a new Condition (and indeed we should discuss that choice).

That said, I prefer making this sort of change to adding a new field. I really want to keep EPSlice general-purpose and grace period feels weird. Also, the above "memory loss" problem exists, whereas "unready" is more complete.

@robscott @wojtek-t re scalability.

@andrewsykim andrewsykim force-pushed the endpointslice-graceful-termination branch 2 times, most recently from eea2797 to 70b8672 on March 11, 2020 13:16
@andrewsykim
Member Author

andrewsykim commented Mar 11, 2020

I am not convinced this can really be used for what you want to use it for. I am also worried about the expansion of per-endpoint metadata. As we try to map endpoint slice onto other use-cases, the size of them starts to matter. If that propagates into a per-group value, this could become another thing that chops-up otherwise contiguous blocks. I expect in practice it will mostly be the same across all pods, so IF we decide to do something, it should probably be scoped to the whole slice. This does make the slice controller more complicated again.

Do you feel this way about adding to EndpointCondition as well? The alternative would be to add a Terminating field to EndpointCondition, which only expands metadata for terminating endpoints (using nil default?) -- which presumably is a smaller set of the endpoints for short durations during rolling updates. This would also address concerns for endpoints that terminate earlier than the grace period and handling terminating endpoints during kube-proxy restarts.

@thockin
Member

thockin commented Mar 11, 2020

I am also worried about the expansion of per-endpoint metadata. As we try to map endpoint slice onto other use-cases, the size of them starts to matter. If that propagates into a per-group value, this could become another thing that chops-up otherwise contiguous blocks. I expect in practice it will mostly be the same across all pods, so IF we decide to do something, it should probably be scoped to the whole slice. This does make the slice controller more complicated again.

Do you feel this way about adding to EndpointCondition as well? The alternative would be to add a Terminating field to EndpointCondition, which only expands metadata for terminating endpoints (using nil default?) -- which presumably is a smaller set of the endpoints for short durations during rolling updates. This would also address concerns for endpoints that terminate earlier than the grace period and handling terminating endpoints during kube-proxy restarts.

I think the Conditions approach is a better design, but we should think about the meaning of them and their orthogonality.

What we don't really want is to encode a full state-machine of the pod lifecycle into a set of bit fields. For example (ready: false, terminating: false) -> (ready: true, terminating: false) -> (???) Is the terminal state (ready: false, terminating: true) or (ready: true, terminating: true) ? In either case, what does the opposite mean? They aren't really orthogonal.

For better or worse, we did not call "ready" "started", in part because that implies keeping history rather than simply observing state. This is important because history can get lost, but state is always re-observable. Upon experimenting, Pods (curiously) DO NOT get set to unready (status.Conditions "Ready"="False") when they are being deleted (deletionTimestamp != nil). I don't know if this is an oversight or intentional, but if it is intentional (or can't be changed) the compatible thing to do would be to add a new condition (and a new TerminatingAddresses field in classic Endpoints? Euchh).

@derekwaynecarr since we were JUST talking about pod lifecycle. Do you consider this paragraph above a "bug"? Specifically "pods do not get set to unready when they are being deleted".

This suggests (ready:true, terminating: true) is "normal end of life". So what does (ready: false, terminating: true) mean?

@andrewsykim andrewsykim force-pushed the endpointslice-graceful-termination branch from 70b8672 to 9ea1c61 on March 17, 2020 19:47
@andrewsykim andrewsykim changed the title EndpointSlice: add terminationGracePeriodSeconds to Endpoint EndpointSlice: add Terminating field to EndpointCondition Mar 17, 2020
@thockin
Member

thockin commented Mar 26, 2020

@andrewsykim PTAL at kubernetes/kubernetes#85643 (comment)

I think this KEP and that issue are pretty closely related. I would sign up to shepherd the work, but I don't have the time to do all the implementation legwork right now. Is this something you'd be willing/able to carry? It's a slight expansion of scope over what you proposed here (which is just the API, no consumers thereof).

@wojtek-t
Member

I've just read that comment and I think it may have significant performance/scalability implications. So if we decide to put it into this KEP, I would like to be a reviewer (BTW - it will be a great chance to validate the production readiness questionnaire too, which you will have to fill out anyway).

@andrewsykim
Member Author

I think this KEP and that issue are pretty closely related. I would sign up to shepherd the work, but I don't have the time to do all the implementation legwork right now. Is this something you'd be willing/able to carry? It's a slight expansion of scope over what you proposed here (which is just the API, no consumers thereof).

Yes, I can try to carry this one forward. Will dig into the issue for more context.

I've just read that comment and I think it may have significant performance/scalability implications. So if we decide to put it into this KEP, I would like to be a reviewer

Noted, I'll ping you once I've updated the KEP.

@andrewsykim
Member Author

@thockin @wojtek-t I took a stab at implementing kubernetes/kubernetes#85643 (comment) to get a better idea of the work involved and to help facilitate discussion in the KEP (see kubernetes/kubernetes#89780).

I think the semantics get a little weird if we fall back to local not ready endpoints, because a not ready endpoint can be terminating or actually failing readiness. In that case should we still fall back to not ready endpoints? If not, I think a terminating bool field would still be required? Thoughts?

@smarterclayton
Contributor

So I missed this when it was open but I consider the current behavior of services while terminating under external traffic policy local to be a bug.

In general a terminating pod (deletionTimestamp set) wants:

  1. load balancers to redirect new traffic
  2. existing connections to continue
  3. to stop accepting connections after it is sure the frontends are pointed away and no new connections are sent
  4. to finish its existing connections
  5. to exit
  6. and for 1-5 to control the pace of drain on a node

I think we are, as a platform, not in a great spot to offer correct zero-downtime drains and rebalances for workers (I have DaemonSet surge #1590 open to fix it so that network pods can correctly hand off responsibilities, and we have fixed maybe 5-10 issues in drain over the last few months).

I would urge us to ensure services can be used safely for zero-disruption handoff of new connections before we add more features to services.

Pod that has a valid `DeletionTimestamp`. Controllers that manage EndpointSlice objects may want to opt out of this feature for optimization reasons (e.g. avoiding per-endpoint metadata expansion)
which can be accomplished by setting the field to nil.

This field allows consumers of the API to know if endpoints are currently terminating without having to initiate a watch on Pods. A specific use-case for this field is the [graceful termination handler in the IPVS proxy](https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/ipvs/graceful_termination.go). Because deleting an IPVS backend (a.k.a IPVS real server) immediately deletes all its corresponding conntrack entries, it can be disruptive to delete a real server during its graceful termination period. The solution today is to only delete a real server if it sees that it has no connections or it has exceeded a hardcoded duration limit (15m). This approach is not foolproof as the state of "terminating endpoints" is not preserved when kube-proxy is restarted. The `Terminating` field would remove the need for storing terminating state in memory, improving the overall design and performance of the proxy as it no longer needs to track stale backends in memory.
Contributor

I think we should explicitly call out, in either this KEP or another, the use case around graceful shutdown without disruption for pods using services (NodePort, local traffic policy, and the service proxy at minimum) as the guiding goal. This actually loses some of the context - which is that we generally want to ensure that pods continue to receive existing-connection traffic for all services while new connections are rapidly (or less rapidly, with readiness gates) diverted away during termination.

Contributor

I'm trying to add e2e tests to the kube suite that will ensure this is possible for a range of services (kubernetes/kubernetes#89547 is one of the first changes we need so that we have a test shim).

Member Author

@andrewsykim andrewsykim Apr 6, 2020

Noted, I'll put consumption of the API in a separate KEP.

@k8s-ci-robot k8s-ci-robot removed the size/M label Apr 6, 2020
@andrewsykim andrewsykim force-pushed the endpointslice-graceful-termination branch 2 times, most recently from ccc0cda to e1bf06f on April 7, 2020 21:47
@andrewsykim
Member Author

FYI: updated the KEP so it's split into two separate KEPs and proposes to only fall back to ready terminating endpoints (instead of all terminating endpoints).

@andrewsykim andrewsykim force-pushed the endpointslice-graceful-termination branch from e1bf06f to b2b3174 on April 23, 2020 15:41
@k8s-ci-robot k8s-ci-robot added the sig/architecture label Apr 23, 2020
@andrewsykim andrewsykim force-pushed the endpointslice-graceful-termination branch 4 times, most recently from 71e2356 to b5d8f28 on April 23, 2020 16:08
@andrewsykim
Member Author

Based on discussions in #1607 (comment), I've updated the KEP to walk through all terminating states for external node-local traffic: {R, !T} -> {R, T} -> {!R, T} -> {} (where R = ready and T = terminating).

I've also updated kubernetes/kubernetes#89780 to reflect the changes in this KEP in case some of the details were not clear in writing.

Member

@thockin thockin left a comment

I'm approving. I'd like Rob to confirm this will not be a shock for EPSlice. Only one comment of note in my feedback.

/approve

* local ready & terminating endpoints
* local not ready & terminating endpoints
* blackhole traffic
* for all other traffic, preserve existing behavior where traffic is only sent to ready && !terminating endpoints.
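
Read as selection logic, the priority in the excerpt above could look roughly like this. This is a sketch for the node-local path only, not the actual kube-proxy implementation, and the types are simplified:

```go
package proxyfallback

// localEndpoint is a simplified view of an endpoint backing a Service on this node.
type localEndpoint struct {
	IP          string
	Ready       bool
	Terminating bool
}

// pickLocalEndpoints applies the proposed fallback for
// externalTrafficPolicy=Local: prefer ready & !terminating endpoints, then
// ready & terminating, then not-ready & terminating; an empty result means
// the traffic is blackholed, as it is today.
func pickLocalEndpoints(eps []localEndpoint) []localEndpoint {
	var readyOnly, readyTerminating, notReadyTerminating []localEndpoint
	for _, ep := range eps {
		switch {
		case ep.Ready && !ep.Terminating:
			readyOnly = append(readyOnly, ep)
		case ep.Ready && ep.Terminating:
			readyTerminating = append(readyTerminating, ep)
		case ep.Terminating:
			notReadyTerminating = append(notReadyTerminating, ep)
		}
	}
	if len(readyOnly) > 0 {
		return readyOnly
	}
	if len(readyTerminating) > 0 {
		return readyTerminating
	}
	return notReadyTerminating // may be empty: blackhole
}
```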
Member

Why this distinction? I forget if we decided "start here and expand" or if there's a reason to stop at this level. Why NOT do this for all?

Member Author

This distinction is actually referring to the externalTrafficPolicy=Cluster case.

Member Author

Updated to make this clearer

- "@smarterclayton"
approvers:
- "@thockin"
editor: TBD
Member

editor has been removed from the template :)

@@ -0,0 +1,20 @@
title: Tracking Terminating Endpoints in EndpointSlice
Member

@robscott Are you good with this?

Member

Added a couple small nits, but LGTM overall. This specific KEP feels like it might make sense as part of the existing EndpointSlice KEP. That has traditionally been where the EndpointSlice API and any changes to it have been defined. With that said, I know that could result in it becoming a massive KEP. Not sure what the best practice is here.

Member Author

Tim had previously suggested that this should be a new KEP (see here) and I agree.

I think it would make sense for this KEP to summarize the motivation for the change but we should still backport the API changes into the main EPSlice KEP with maybe a reference to this one?

Member Author

(I'll open a follow-up PR for this)

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewsykim, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved label Jun 16, 2020
Member

@robscott robscott left a comment

Thanks for the work on this!

// terminating indicates if this endpoint is terminating. Consumers should assume a
// nil value indicates the endpoint is not terminating.
// +optional
Terminating *bool `json:"terminating,omitempty" protobuf:"bytes,2,name=terminating"`
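
For consumers of the API, a minimal sketch of the nil-defaulting the comment above requires. The types here only mirror the proposed conditions for illustration; they are not the real discovery API structs:

```go
package epsliceutil

// endpointConditions mirrors the shape of the proposed EndpointConditions
// (simplified for illustration).
type endpointConditions struct {
	Ready       *bool
	Terminating *bool
}

// isTerminating applies the documented default: a nil Terminating pointer is
// treated the same as an explicit false.
func isTerminating(c endpointConditions) bool {
	return c.Terminating != nil && *c.Terminating
}
```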
Member

Why a pointer here? Would it be safe to get rid of the nil case since it's indistinguishable from false here?

Member Author

I'm assuming that the nil state could be useful in the future. Some consumers of EPSlice might have a special case for when the terminating state is unknown vs when it's explicitly false. Maybe I'm overthinking this?

### Upgrade / Downgrade Strategy

Since this is an addition to the EndpointSlice API, the upgrade/downgrade strategy will follow that
of the [EndpointSlice API work](/keps/sig-network/20190603-EndpointSlice-API.md).
Member Author

fixed

@andrewsykim andrewsykim force-pushed the endpointslice-graceful-termination branch from b5d8f28 to 559d4aa on June 17, 2020 00:20
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
@andrewsykim andrewsykim force-pushed the endpointslice-graceful-termination branch from 559d4aa to d1e0d0e on June 17, 2020 00:26
@robscott
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Jun 17, 2020
@k8s-ci-robot k8s-ci-robot merged commit c005089 into kubernetes:master Jun 17, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone Jun 17, 2020
Labels
approved, cncf-cla: yes, kind/kep, lgtm, sig/architecture, sig/network, size/L