
KEP-2433: add new heuristic to topology routing #4003

Merged 3 commits into kubernetes:master on Jun 12, 2023

Conversation

@aojea (Member) commented May 15, 2023


This simplifies the existing heuristic, which is based on CPU cores, because it is difficult to get it working on deployments where the law of large numbers cannot help with the statistics.

Add a new PreferZone heuristic: traffic will be directed to endpoints in the same zone if any exist, falling back to cluster-wide routing if there are no endpoints in the zone.

As a side benefit of this new heuristic, we can abstract the topology logic behind an interface, which will help with KEP #3685 to stage the endpointslice controller.
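As an illustration of that side benefit, here is a minimal sketch of what such an interface could look like against the discovery/v1 types; the interface and method names are hypothetical, not ones prescribed by the KEP:

```go
// Hypothetical sketch: each heuristic (the existing proportional one,
// the new PreferZone one) sits behind a common interface so the
// endpointslice controller can treat them uniformly.
package topology

import discoveryv1 "k8s.io/api/discovery/v1"

// Heuristic decides how zone hints are populated on endpoints.
type Heuristic interface {
	// PopulateHints returns the endpoints with their Hints field set,
	// or left nil when the heuristic declines to apply hints.
	PopulateHints(endpoints []discoveryv1.Endpoint) []discoveryv1.Endpoint
}
```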

@k8s-ci-robot added the cncf-cla: yes label May 15, 2023
@k8s-ci-robot requested a review from dcbw May 15, 2023 13:12
@k8s-ci-robot added the kind/kep label May 15, 2023
@k8s-ci-robot requested a review from thockin May 15, 2023 13:12
@k8s-ci-robot added sig/network and size/M labels May 15, 2023
@aojea mentioned this pull request May 15, 2023
@aojea (Member, Author) commented May 15, 2023

/assign @robscott @thockin
/cc @ialidzhikov

@thockin (Member) commented May 16, 2023

This seems reasonable but I don't think it solves the problem as stated by several users, which I think distills into "same node if possible, otherwise same zone if possible, otherwise same region if possible, otherwise random". I'm not against adding the heuristic proposed here, but only if we think it is solving for someone's use-case. Is it?
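For illustration, a minimal sketch of the tiered fallback distilled above; the types and field names here are illustrative stand-ins, not from any Kubernetes API:

```go
// Sketch of "same node, else same zone, else same region, else random":
// endpoints are taken from the narrowest tier that is non-empty.
package main

import "fmt"

// Endpoint is an illustrative stand-in for a routable endpoint.
type Endpoint struct {
	IP, Node, Zone, Region string
}

// pick returns the endpoints from the first (narrowest) matching tier.
func pick(eps []Endpoint, node, zone, region string) []Endpoint {
	tiers := []func(Endpoint) bool{
		func(e Endpoint) bool { return e.Node == node },
		func(e Endpoint) bool { return e.Zone == zone },
		func(e Endpoint) bool { return e.Region == region },
		func(e Endpoint) bool { return true }, // final fallback: anything
	}
	for _, match := range tiers {
		var out []Endpoint
		for _, e := range eps {
			if match(e) {
				out = append(out, e)
			}
		}
		if len(out) > 0 {
			return out
		}
	}
	return nil
}

func main() {
	eps := []Endpoint{
		{IP: "10.0.1.1", Node: "n2", Zone: "zone-a", Region: "r1"},
		{IP: "10.0.2.1", Node: "n3", Zone: "zone-b", Region: "r1"},
	}
	// No endpoint on node n1, so the zone tier wins: only zone-a is picked.
	fmt.Println(pick(eps, "n1", "zone-a", "r1"))
}
```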

@robscott (Member) commented

> This seems reasonable but I don't think it solves the problem as stated by several users, which I think distills into "same node if possible, otherwise same zone if possible, otherwise same region if possible, otherwise random". I'm not against adding the heuristic proposed here, but only if we think it is solving for someone's use-case. Is it?

Agree with @thockin here. I chatted with @aojea about this PR this afternoon, and will add a high-level summary here:

  1. The proposed heuristic is probably better for many users than the current CPU-based approach. I was a little curious whether we could make this a migration to a new heuristic rather than needing to support both indefinitely, but I don't think we can justify removing the old one: it is legitimately useful in some cases, and the one proposed here doesn't cover all of those use cases.
  2. Any proportional algorithm (both new and old) does not work well with HPAs when load originates unevenly from one zone. We really want to enable users to specify per-zone Deployments and HPAs that can accurately respond to load coming from that zone.
  3. The only way I can think of to support HPAs well is a simple PreferZone-type algorithm that routes to ready endpoint(s) in a zone if there are any, and falls back to cluster-wide routing if there aren't. Maybe node-local is also helpful here, but I'm not as convinced about that, since it's harder to combine with autoscaling.
  4. I'd personally rather implement PreferZone at the proxy level, without hints. At least in kube-proxy this code would be quite straightforward, and would fit in somewhere around here (see the sketch after this list). Implementing with hints would simplify logic for dataplanes, but at the cost of seemingly needless extra bytes (this is not a blocker for me).
  5. Although topology hints technically allow us to support as many heuristics as we want, the existing docs for topology are already pretty complicated, and any additional heuristic will only complicate them further. Unless we're very sure we need additional heuristics, I'd lean towards not adding them. (I do feel like something resembling PreferZone is required.)
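A minimal sketch of the proxy-level filtering floated in point 4, using simplified stand-in types rather than kube-proxy's real ones:

```go
// Sketch of PreferZone at the proxy: use ready endpoints in the local
// zone if there are any; otherwise fall back to all ready endpoints.
package main

import "fmt"

// Endpoint is a simplified stand-in for a proxied endpoint.
type Endpoint struct {
	IP    string
	Zone  string
	Ready bool
}

// filterPreferZone returns the ready endpoints in localZone if any
// exist; otherwise it falls back to cluster-wide ready endpoints.
func filterPreferZone(endpoints []Endpoint, localZone string) []Endpoint {
	var local, all []Endpoint
	for _, ep := range endpoints {
		if !ep.Ready {
			continue
		}
		all = append(all, ep)
		if ep.Zone == localZone {
			local = append(local, ep)
		}
	}
	if len(local) > 0 {
		return local
	}
	return all
}

func main() {
	eps := []Endpoint{
		{IP: "10.0.1.1", Zone: "zone-a", Ready: true},
		{IP: "10.0.2.1", Zone: "zone-b", Ready: true},
	}
	fmt.Println(filterPreferZone(eps, "zone-a")) // only the zone-a endpoint
	fmt.Println(filterPreferZone(eps, "zone-c")) // no zone-c endpoints: all
}
```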

@aojea (Member, Author) commented May 16, 2023

Fair enough. For hints, I still think we need to replace the existing complex heuristic, or add a more naive one that is easier to reason about.

For PreferZone, we then need a new KEP to reserve the annotation value and add graduation criteria (at least two proxies implementing it, or something like that).

WDYT?

@thockin (Member) commented May 16, 2023 via email

@ialidzhikov commented on lines 412 to 415:

> zone-a: 2 endpoints
> zone-b: 1 endpoint
> zone-c: 3 endpoints

What happens if I have 4 endpoints distributed as follows?

zone-a: 1 endpoint
zone-b: 1 endpoint
zone-c: 2 endpoints

Will the hints be removed again as it is not possible to achieve a balanced distribution of endpoints across the zones?

We are looking into a deterministic way to have the hints set. We already use topologySpreadConstraints and nodeAffinity to spread the endpoints evenly across the zones. The endpoints will be spread evenly when the number of endpoints % the number of zones == 0, but when the number of endpoints % the number of zones != 0, some zones will have more replicas than others. During a Deployment/StatefulSet rollout, the even distribution can also be violated.
Currently we have a mutating webhook that mutates the EndpointSlice's hints so that each endpoint's hint is always set to the endpoint's own zone.
Shouldn't we consider adding this strategy as well? Or can you elaborate on whether the newly added heuristic is close to it? I am confused and cannot judge.
Assumptions for this strategy:

  • You already have a guaranteed, approximately balanced distribution of endpoints across zones via scheduling means (topologySpreadConstraints, nodeAffinity).
  • Each endpoint has equivalent capacity.

What the strategy does is maintain each endpoint's own zone as its hint.
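For concreteness, a minimal sketch of that zone-copy strategy against the discovery/v1 types (the function name is illustrative):

```go
// Sketch of the zone-copy strategy: every endpoint's hint is set to
// its own zone, matching what the mutating webhook described above does.
package main

import (
	"fmt"

	discoveryv1 "k8s.io/api/discovery/v1"
)

// copyZoneToHint sets each endpoint's topology hint to its own zone,
// leaving endpoints without a zone untouched.
func copyZoneToHint(endpoints []discoveryv1.Endpoint) {
	for i := range endpoints {
		zone := endpoints[i].Zone
		if zone == nil {
			continue
		}
		endpoints[i].Hints = &discoveryv1.EndpointHints{
			ForZones: []discoveryv1.ForZone{{Name: *zone}},
		}
	}
}

func main() {
	zoneA := "zone-a"
	eps := []discoveryv1.Endpoint{
		{Addresses: []string{"10.0.1.1"}, Zone: &zoneA},
	}
	copyZoneToHint(eps)
	fmt.Println(eps[0].Hints.ForZones[0].Name) // zone-a
}
```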

@aojea (Member, Author) replied:

> Currently we have a mutating webhook that mutates the EndpointSlice's hints so that each endpoint's hint is always set to the endpoint's own zone.

Sorry for the late response. I was supposed to close this PR and send a new one with the simple heuristic you mention, copying the zone to the hint, for the reasons Rob mentions in #4003 (comment) (especially point 2), but then we need to find a good solution to be completely sure we can handle Tim's comments in #4003 (comment).

@ialidzhikov out of curiosity, how is copying the zone directly over to the hints working for you? Is that completely solving your problems?
cc: @robscott

@ialidzhikov replied:

> @ialidzhikov out of curiosity, how is copying the zone directly over to the hints working for you? Is that completely solving your problems?

Some details of our setup are given in kubernetes/kubernetes#110714 (comment). For example, one of our use cases is topology-aware routing for the communication between the kube-apiserver and etcd. We run etcd in 3 zones, 1 replica per zone. The kube-apiserver replicas are spread in a similar way, but may vary between 3 and 4 replicas. The idea is that each kube-apiserver talks to the etcd Pod in its zone.
We also use topology-aware routing for webhook communication, when the kube-apiserver needs to talk to a webhook deployed in the same cluster the kube-apiserver runs in. So, again, the kube-apiserver talks to the webhook endpoint located in its zone (if there is one, of course).

@ialidzhikov commented:

Just to make sure I understand the PreferZone heuristic: it will always maintain the endpoint's zone as its hint, right?

@k8s-ci-robot added the size/L label and removed the size/M label Jun 12, 2023
@aojea (Member, Author) commented Jun 12, 2023

@thockin @robscott @ialidzhikov updated the KEP to include PreferZone and keep it in beta during 1.28, so we can go GA in 1.29 if the feedback is positive. PTAL
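For readers following along, a hedged sketch of how a Service would opt in, assuming the PreferZone value this KEP update reserves for the existing topology-mode annotation (the value is a proposal here, not something already shipped):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// The annotation key already exists for topology-aware routing;
	// "PreferZone" is the new value proposed by this KEP update.
	svc := corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name: "my-service",
			Annotations: map[string]string{
				"service.kubernetes.io/topology-mode": "PreferZone",
			},
		},
	}
	fmt.Println(svc.Annotations["service.kubernetes.io/topology-mode"])
}
```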

Add a new simple heuristic to minimize traffic cost
at the expense of a higher risk of traffic imbalance and
endpoint overload

Signed-off-by: Antonio Ojea <aojea@google.com>
@wojtek-t (Member) commented

LGTM

@thockin (Member) left a comment

Thanks!

/lgtm
/approve

@k8s-ci-robot added the lgtm label Jun 12, 2023
@k8s-ci-robot (Contributor) commented
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label Jun 12, 2023
@k8s-ci-robot merged commit e1bdbca into kubernetes:master Jun 12, 2023
@k8s-ci-robot added this to the v1.28 milestone Jun 12, 2023