
KEP-3015: PreferLocal traffic policy #3016

Closed

Conversation

danwinship
Contributor

  • One-line PR description:
    Addition of PreferLocal (i.e., Local with fallback) to externalTrafficPolicy and internalTrafficPolicy

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 21, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: danwinship
To complete the pull request process, please assign dcbw, wojtek-t after the PR has been reviewed.
You can assign the PR to them by writing /assign @dcbw @wojtek-t in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Oct 21, 2021
@robscott
Member

/cc

@thockin left a comment

It's worth exploring how this intersects with topology, I think. Can we guarantee that all the endpoints in zone A are hinted "for zone A" ? If not, could we end up with a node that holds local endpoints which are dedicated to a different zone (because of imbalance) ?

@robscott @swetharepakula

load balancers do not balance between nodes themselves. (eg, perhaps
you are using a `NodePort` service, but only advertising a single
node IP for clients to connect to). Does not preserve source IP, but
guarantees the traffic is always delivered.
Member

s/guarantees.../will use any available endpoint to avoid delivery failure/

"guarantee" is such a strong word

- `PreferLocal`: Prioritizes efficiency and reliability. When
possible, traffic will be delivered to a local endpoint, but when no
local endpoints exist, it will be delivered to a remote endpoint
instead.
Member

"If the upstream routing over-uses a node, endpoints on that node can become overloaded."

@danwinship
Contributor Author

It's worth exploring how this intersects with topology, I think.

Well, I think PreferLocal should intersect in exactly the same way as Local does...

Can we guarantee that all the endpoints in zone A are hinted "for zone A" ? If not, could we end up with a node that holds local endpoints which are dedicated to a different zone (because of imbalance) ?

The topology KEP says "Both ExternalTrafficPolicy and InternalTrafficPolicy will be given precedence over topology aware routing."

I guess "precedence" is kind of vague; the current implementation is that traffic policy overrides topology, so if Node 1 in Zone A has two endpoints, and one is topology-hinted for Zone A and the other is hinted for Zone B, then Local traffic can go to either local endpoint, ignoring the topology hints.

The other interpretation would be "endpoints are filtered for traffic policy first, and then for topology second" (although there isn't actually "precedence" in that case since you'd get the same answer regardless of which order you filtered them in). But I don't think we want that, because the rule for assigning some endpoints to other zones is going to thwart the administrator's attempts to ensure that every node has a local endpoint. (e.g., if Node 2 in Zone A has one endpoint which is hinted as being for Zone B, then applying both Local traffic policy and topology means that pods on Node 2 would not be able to reach the service, even though the administrator had made sure that there was an endpoint of the service on that node.)

@robscott
Member

There's a lot of overlap in kubernetes/kubernetes#100313 with the discussion here. I think fundamentally we should consider traffic policy as having the highest precedence for internal or external traffic. When set to "Local", there is no fallback and therefore no ambiguity on interaction with other config. On the other hand, "Cluster" is really falling back to the next level of configuration. That means either topology filtered endpoints or all endpoints.

I think "PreferLocal" is really just trying to merge "Local" with a "Cluster" fallback. In that case, I think the behavior should line up with that. That means the following approximate logic:

  1. If trafficPolicy == "PreferLocal" && len(localEndpoints) > 0, use local endpoints exclusively
  2. If topology == "Auto" && len(hintedEndpoints) > 0, use hinted endpoints exclusively
  3. Use all endpoints

As mentioned here and in kubernetes/kubernetes#100313, this does mean that we either need to maintain separate filtered sets of internal and external endpoints in the shared proxy code, or we need to send the full list of endpoints to proxiers and then use shared code to filter internal and external endpoints separately. I think @danwinship is working on that logic.
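
For illustration only, here is a minimal Go sketch of the precedence described above; the `Endpoint` type, field names, and policy strings are hypothetical stand-ins, not kube-proxy's actual types:

```go
package proxy

// Endpoint is a hypothetical stand-in for a proxy endpoint record.
type Endpoint struct {
	IsLocal  bool   // endpoint backs a pod on this node
	ZoneHint string // zone this endpoint is hinted for ("" means no hint)
}

// filterEndpoints sketches the precedence described above: traffic policy
// first (PreferLocal), then topology hints, then all endpoints.
func filterEndpoints(eps []Endpoint, trafficPolicy, topologyMode, myZone string) []Endpoint {
	// 1. If trafficPolicy == "PreferLocal" and local endpoints exist,
	//    use local endpoints exclusively.
	if trafficPolicy == "PreferLocal" {
		var local []Endpoint
		for _, ep := range eps {
			if ep.IsLocal {
				local = append(local, ep)
			}
		}
		if len(local) > 0 {
			return local
		}
	}
	// 2. If topology == "Auto" and hinted endpoints exist,
	//    use endpoints hinted for this zone exclusively.
	if topologyMode == "Auto" {
		var hinted []Endpoint
		for _, ep := range eps {
			if ep.ZoneHint == myZone {
				hinted = append(hinted, ep)
			}
		}
		if len(hinted) > 0 {
			return hinted
		}
	}
	// 3. Otherwise, use all endpoints.
	return eps
}
```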

@danwinship
Contributor Author

danwinship commented Jan 15, 2022

Updated:

  • don't say "guarantee"
  • totally rewrote the bullet points at the end of the "External Traffic Policy" section to better describe the benefits and tradeoffs (also without saying "guarantee")
  • added a section on how it interacts with topology
  • [EDIT] updated versions in kep.yaml to be alpha in 1.24

Comment on lines 259 to 261
that happens, there's a good chance that the topology code will _also_
hit its fallback case (due to there being at least one zone which is
missing an endpoint), and so topology won't even get used anyway. At
Member

I'm not sure this is true. For example, there may be a Service with 3 endpoints spread over 6 nodes. That would mean "Local" would fall back to "Cluster" ~1/2 of the time, but topology would consistently be helpful. Having a "PreferLocal" mode may result in a larger set of users than the traditional "Local" use cases, including some cases where users are not intending to deploy workloads on every possible node.

Contributor Author

there may be a Service with 3 endpoints spread over 6 nodes. That would mean "Local" would fall back to "Cluster" ~1/2 of the time

(I assume you meant PreferLocal would fall back to Cluster.)

including some cases where users are not intending to deploy workloads on every possible node.

Hm... I was not considering that as a use case, although it does make sense. It's the same behavior, you're just thinking of it as "Cluster with optimization" rather than "Local with fallback". I should add a User Story for that.

At any rate, the comment here about PreferLocal and Topology failing at the same time was purely "informational" and doesn't affect the actual behavior; the proposed rule ("use a local endpoint if you can, otherwise do exactly what you would have done if the policy was Cluster") will give the right answer whether or not Topology is in effect.

@astoycos
Contributor

/cc

@danwinship
Contributor Author

Updated:

  • added new User Story "PreferLocal as More-Efficient Cluster"
  • removed incorrect parenthetical comment about PreferLocal vs Topology
  • updated the "SLO" table at the end for the new use case

ignore topology, just like `Local` policy does.

When delivering packets remotely, it makes the most sense to have
`PreferLocal` policy make use of topology. (This is explicitly wanted
Member

One of the reasons we scrapped PreferLocal in the other KEP is because the assumption was you can achieve the same behavior by setting internalTrafficPolicy: Cluster and topology-aware routing that strongly prefers node-local endpoints. What is the semantic difference between that and using PreferLocal?

(sorry if we already discussed this somewhere else, catching up on some context I missed)

Contributor Author

If we had node-level topology, then we probably would not need PreferLocal... but do we even have a plan for node-level topology yet?

Contributor Author

Arguably if we had node-level topology, then we would not need internalTrafficPolicy: Local either, since (unlike externalTrafficPolicy: Local) it provides no advantages other than locality.

Member

If we had node-level topology, then we probably would not need PreferLocal... but do we even have a plan for node-level topology yet?

I guess this is a question for @robscott -- is "requires node" a thing? My current understanding so far is that topology-aware routing is more designed to handle traffic routing automatically so it should always fall back to the next preferred topology instead of dropping. This was one of the reasons we added internalTrafficPolicy: Local because there are cases where you actually want "local or drop", which fell outside the scope of topology-aware routing.

Fwiw I do think there's still a lot of value in "PreferLocal" -- it seems particularly more useful than something like "PreferZone". Routing locally has more specific use cases, as opposed to "route to closest zone", which is almost always about cost savings and reducing latency.

Member

I'll attach my main feedback here.

Let's consider: Why do people use [EI]TP ?

ETP: I know of 2 reasons. 1) "I need to preserve source IP". 2) "Efficiency".

ITP: I know of 2 reasons. 1) "I need to stay on this node (semantic correctness)". 2) "Efficiency".

For both classes of traffic, "Local" is either a hard requirement (i.e. "prefer" is a non-starter) or just an optimization over "Cluster".

I don't like that we put the responsibility onto our users to tell us which optimizations we should be using. That seems like a total failure on our part. OF COURSE they want us to optimize if we can. The specific optimization that PreferLocal dictates is pretty black or white, which means there's not much room for being "smart". This then FORCES other implementations, which might be capable of more smartness, to play dumb.

For example, it's really not hard to imagine an implementation that sends the number of active flows per-endpoint to an aggregator, and then have that aggregator give the node-agents hints about which endpoints are overloaded. It doesn't even need to be super timely to have some impact, I bet. But the semantics of "PreferLocal" disallow that.

If we think this is the optimization with the biggest bang for the buck, let's talk about how we can do something like `if num_endpoints > (num_nodes/2) && have_local_eps() { use_local_eps() }`. I just made that up - obviously we can tune the heuristic.
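
As a rough illustration only, that heuristic might look something like this in Go (the threshold and names are made up, as the comment itself says):

```go
package proxy

// preferLocalHeuristic sketches the idea above: only restrict traffic to
// node-local endpoints when the Service has "enough" endpoints relative to
// the number of nodes, so that a lone local endpoint is unlikely to be
// overloaded. The numEndpoints > numNodes/2 threshold is arbitrary.
func preferLocalHeuristic(numEndpoints, numNodes int, localEndpoints []string) ([]string, bool) {
	if numEndpoints > numNodes/2 && len(localEndpoints) > 0 {
		return localEndpoints, true // enough endpoints: stay on-node
	}
	return nil, false // otherwise fall back to normal Cluster/topology routing
}
```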

Member

PreferLocal is a loadbalancing/topology algorithm problem and not a traffic policy problem, hence this is not the right place to fix it

@aojea That's a reasonable distillation :)

@terinjokes thanks for the additional input. "Local" gets better with the terminating endpoints work that is in flight, but still there's a moment when it has to die for a successor to start. If we really want to make node-level "sidecars" a thing that is reliable, we need a way to "hand off".

But such a mechanism does (kind of) exist - the updateStrategy of a DaemonSet allows maxSurge to be positive, so we can start a new replica before killing an old one. This is subject to resource limitations on heavily packed nodes. If you think that is you, you can choose to run 2 DaemonSets (e.g. "dns-1" and "dns-a") so you pre-allocate space. Maybe DaemonSet actually needs "replicas" field to run N copies per target node...
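
For context, a sketch of that maxSurge approach using the apps/v1 Go types (this assumes a Kubernetes release where DaemonSet maxSurge is available; whether the surge pod fits depends on node resources, as noted above):

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// surgeStrategy returns a DaemonSet update strategy that starts the new
// replica on a node before terminating the old one, so a node-local
// service keeps at least one ready endpoint during a rollout.
func surgeStrategy() appsv1.DaemonSetUpdateStrategy {
	maxSurge := intstr.FromInt(1)
	maxUnavailable := intstr.FromInt(0)
	return appsv1.DaemonSetUpdateStrategy{
		Type: appsv1.RollingUpdateDaemonSetStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDaemonSet{
			MaxSurge:       &maxSurge,
			MaxUnavailable: &maxUnavailable,
		},
	}
}

func main() {
	fmt.Printf("%+v\n", surgeStrategy())
}
```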

Topology, as implemented today, is not really considering node. It could, and yes my proposed heuristic is far from perfect. But not considering node was sort of intentional. In my experience, such optimizations quietly become functional requirements. Apps (or SLOs) start to depend on the optimization, and when the optimization isn't available for some reason, apps break.

So my intuitive lean is to either NOT optimize that way, or else to make it really clear that it is an optimization, not a policy. If you depend on the optimization, then "prefer" is wrong, and we should focus on making "Local" work for you. If you don't depend on it, then you should let the system try to be smart, and not tell it how to do its job.

That said I really do understand how "PreferLocal" is attractive, especially while the system is NOT as smart as you want it to be. On this topic, I have cited this before: https://en.wikipedia.org/wiki/Attractive_nuisance_doctrine . As you said, trying to be "smart" can potentially hotspot specific instances. It's an overly-sharp tool that SEEMS like a good idea, but probably isn't.

In some sense what you are asking for is not an abstract policy, but an implementation-centric toggle that says "I know what I am doing and I know how the implementation works". If we put that into the API we get to support it FOREVER. We reduce the degrees of freedom implementations have for doing good things, and we lock them into sub-optimal semantics.

Contributor Author

ITP: I know of 2 reasons. 1) "I need to stay on this node (semantic correctness)". 2) "Efficiency".

internalTrafficPolicy: Local doesn't solve the "efficiency" use case though, because it implies "I would rather lose packets than let this connection go to another node", which is never what you want for the "efficiency" use case.

I don't like that we put the responsibility onto our users to tell us which optimizations we should be using. That seems like a total failure on our part. OF COURSE they want us to optimize if we can.

But optimize for what?

"Cluster" traffic policy optimizes under the assumption that your service will work best if kube-proxy distributes connections evenly across all the endpoints. This might be the best choice if each connection creates a lot of load on the endpoint, so overloading a local endpoint would slow response time down by more than forwarding requests to another node would.

"PreferLocal" traffic policy optimizes under the assumption that the service will work best if kube-proxy routes connections to the closest endpoint. This is the right choice for something like DNS where connections are short lived, and processing them doesn't require much CPU, so it doesn't matter if lots of connections get sent to the same endpoint all at once.

The specific optimization that PreferLocal dictates is pretty black or white, which means there's not much room for being "smart".

Yeah, because actually I lied above: "PreferLocal" doesn't say "route connections to the closest endpoint", it says "route connections to a local endpoint, or if you can't, then route it anywhere", which is better than "Cluster"'s behavior, but not really exactly what you want.

This then FORCES other implementations, which might be capable of more smartness, to play dumb.

In theory an implementation might be super smart and actively monitor the cluster to figure out the best way to route connections, but I think it's more likely that you'll get good behavior if we let the user give the proxy implementation hints. We just need to figure out the right set of hints.

I guess having traffic policy encode both behavioral changes ("preserve source IP", "must stay on node for semantic correctness") and efficiency hints is probably a bad approach. But in that case we should clarify that traffic policy is only describing desired behavior, not desired efficiency.

Member

@aojea Jan 28, 2022

internalTrafficPolicy: Local doesn't solve the "efficiency" use case though, because it implies "I would rather lose packets than let this connection go to another node", which is never what you want for the "efficiency" use case.

I think that this is the key ^^^. internalTrafficPolicy: Local doesn't have any value if it doesn't fall back, or at least I can't understand when and how I should use it, but we want to keep the same semantics that it has for externalTrafficPolicy and we are running in circles because of that... I would bet that when people say Local for internalTrafficPolicy they really mean LocalWithFallback / PreferLocal.

Member

@thockin Jan 31, 2022

I know of cases where falling back to a non-local endpoint is semantically wrong, and isn't acceptable. Setting a policy to "Local" ought to be a rule, not a hint. IF we support "PreferLocal", it has to be distinct from "Local". It still seems wrong to conflate policy and efficiency hints.

E.g. imagine a near-ideal implementation. It would gather stats about how many connections and bytes were sent per unit time. Maybe it would consider latency (cracking flows to recognize DNS or maybe just based on ports). Maybe it would consider CPU load on the server. It would use that to make a "good" choice. Hopefully "pods on this node" would be the choice whenever possible. We don't need a user to tell us that.

Now, we don't have a near-ideal implementation, so should we:

a) give a policy knob that a closer-to-ideal solution might wrestle with

b) give a policy knob that other implementations can choose to ignore (e.g. annotation)

c) let the user offer a hint that is explicitly a hint (which implementations can choose to respect or not)

d) start building a more ideal implementation

e) ignore this problem and hope it goes away

Member

I know of cases where falling back to a non-local endpoint is semantically wrong, and isn't acceptable. Setting a policy to "Local" ought to be a rule

Fair enough

b) seems the simplest solution

```
<<[UNRESOLVED renaming ]>>

Should we add a new alias for `Local`, such as `PreserveSourceIP`, and

Comment on lines +229 to +231
As a platform owner, I want to create a Service that always
directs traffic to a logging daemon on the same node. Traffic
should never bounce to a daemon on another node.
Member

Singleton pattern? I honestly tried to find an example and I couldn't

Contributor Author

I'm not sure what you're saying.

Member

This can be useful if you want to have a service with special characteristics per node... I can't find an example either; I was thinking maybe someone wants to have a unique service in the cluster that only works on one node...

Member

It's not that you want a unique service that only works on one node, it's that some daemons are designed as "sidecars for a node". Some examples I've seen in the past:

  • A CoreDNS pod per node to reduce extra hop and reduce DNS latency
  • Ingress controller on every node to reduce extra hop, but you still want to connect using Service domain or clusterIP (as opposed to local node IP for example). This is typically the case where a user thinks service-mesh is too expensive for their use case but they want similar functionality.
  • Logging/telemetry/monitoring agent might receive events/data from pods on that node but it shouldn't receive from another node since that would skew the telemetry data.

Some of these use cases could be solved today, though, by using the downward API to get your local host IP and routing there, but it would be much easier to use a consistent Service cluster domain and have that route local all the time. This was the primary driver for internalTrafficPolicy: Local|PreferLocal in the original KEP.
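
As a concrete illustration of that downward-API workaround (not part of the KEP): assume the pod spec injects the node's IP into a HOST_IP environment variable via a fieldRef to status.hostIP; a client can then dial the node-local agent directly. The variable name and port below are made up for the example.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
)

func main() {
	// HOST_IP is assumed to be injected via the downward API
	// (env.valueFrom.fieldRef.fieldPath: status.hostIP in the pod spec).
	hostIP := os.Getenv("HOST_IP")
	if hostIP == "" {
		fmt.Fprintln(os.Stderr, "HOST_IP not set; is the downward API env var configured?")
		os.Exit(1)
	}

	// Talk to the node-local agent directly by node IP and hostPort,
	// bypassing the Service VIP entirely.
	resp, err := http.Get(fmt.Sprintf("http://%s:9153/metrics", hostIP))
	if err != nil {
		fmt.Fprintln(os.Stderr, "request failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	fmt.Println("node-local agent responded with", resp.Status)
}
```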

@thockin thockin self-assigned this Jan 21, 2022
@thockin left a comment

Overall I am -1 on this proposal. Rationale in a line-comment above.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 2, 2022
@danwinship
Contributor Author

I'm going to close this PR and push a new, more topology-focused version.

Labels
  • cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
  • kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory
  • lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.
  • sig/network Categorizes an issue or PR as relevant to SIG Network.
  • size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
9 participants