
KEP-2433: add new heuristic to topology routing #4003

Merged 3 commits into kubernetes:master on Jun 12, 2023

Conversation

@aojea (Member) commented May 15, 2023


This simplifies the existing heuristic, which is based on CPU cores, because it is difficult to get it working on deployments where the law of large numbers cannot help with the statistics.

Add a new PreferZone heuristic: traffic will be directed to endpoints in the same zone if any exist, falling back to cluster-wide routing if there are no endpoints in the zone.

As a side benefit of this new heuristic, we can abstract the topology logic behind an interface, which will help with KEP #3685 to stage the endpointslice controller.
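As an illustration of that side benefit, here is a minimal sketch of what such an interface could look like against the discovery/v1 types; the interface and method names are hypothetical, not ones prescribed by the KEP:

```go
// Hypothetical sketch: each heuristic (the existing proportional one,
// the new PreferZone one) sits behind a common interface so the
// endpointslice controller can treat them uniformly.
package topology

import discoveryv1 "k8s.io/api/discovery/v1"

// Heuristic decides how zone hints are populated on endpoints.
type Heuristic interface {
	// PopulateHints returns the endpoints with their Hints field set,
	// or left nil when the heuristic declines to apply hints.
	PopulateHints(endpoints []discoveryv1.Endpoint) []discoveryv1.Endpoint
}
```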

@k8s-ci-robot added the cncf-cla: yes label May 15, 2023
@k8s-ci-robot requested a review from dcbw May 15, 2023 13:12
@k8s-ci-robot added the kind/kep label May 15, 2023
@k8s-ci-robot requested a review from thockin May 15, 2023 13:12
@k8s-ci-robot added sig/network and size/M labels May 15, 2023
@aojea mentioned this pull request May 15, 2023
@aojea (Member, Author) commented May 15, 2023

/assign @robscott @thockin
/cc @ialidzhikov

@thockin (Member) commented May 16, 2023

This seems reasonable but I don't think it solves the problem as stated by several users, which I think distills into "same node if possible, otherwise same zone if possible, otherwise same region if possible, otherwise random". I'm not against adding the heuristic proposed here, but only if we think it is solving for someone's use-case. Is it?
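For illustration, a minimal sketch of the tiered fallback distilled above; the types and field names here are illustrative stand-ins, not from any Kubernetes API:

```go
// Sketch of "same node, else same zone, else same region, else random":
// endpoints are taken from the narrowest tier that is non-empty.
package main

import "fmt"

// Endpoint is an illustrative stand-in for a routable endpoint.
type Endpoint struct {
	IP, Node, Zone, Region string
}

// pick returns the endpoints from the first (narrowest) matching tier.
func pick(eps []Endpoint, node, zone, region string) []Endpoint {
	tiers := []func(Endpoint) bool{
		func(e Endpoint) bool { return e.Node == node },
		func(e Endpoint) bool { return e.Zone == zone },
		func(e Endpoint) bool { return e.Region == region },
		func(e Endpoint) bool { return true }, // final fallback: anything
	}
	for _, match := range tiers {
		var out []Endpoint
		for _, e := range eps {
			if match(e) {
				out = append(out, e)
			}
		}
		if len(out) > 0 {
			return out
		}
	}
	return nil
}

func main() {
	eps := []Endpoint{
		{IP: "10.0.1.1", Node: "n2", Zone: "zone-a", Region: "r1"},
		{IP: "10.0.2.1", Node: "n3", Zone: "zone-b", Region: "r1"},
	}
	// No endpoint on node n1, so the zone tier wins: only zone-a is picked.
	fmt.Println(pick(eps, "n1", "zone-a", "r1"))
}
```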

@robscott (Member) commented

> This seems reasonable but I don't think it solves the problem as stated by several users, which I think distills into "same node if possible, otherwise same zone if possible, otherwise same region if possible, otherwise random". I'm not against adding the heuristic proposed here, but only if we think it is solving for someone's use-case. Is it?

Agree with @thockin here. I chatted with @aojea about this PR this afternoon, and will add a high-level summary here:

  1. The proposed heuristic is probably better for many users than the current CPU-based approach. I was a little curious whether we could make this a migration to a new heuristic rather than needing to support both indefinitely, but I don't think we can justify removing the old one: it is legitimately useful in some cases, and the one proposed here doesn't cover all of those use cases.
  2. Any proportional algorithm (both new and old) does not work well with HPAs when load originates unevenly from one zone. We really want to enable users to specify per-zone Deployments and HPAs that can accurately respond to load coming from that zone.
  3. The only way I can think of to support HPAs well is a simple PreferZone-type algorithm that routes to ready endpoint(s) in a zone if there are any, and falls back to cluster-wide routing if there aren't. Maybe node-local is also helpful here, but I'm not as convinced about that, since it's harder to combine with autoscaling.
  4. I'd personally rather implement PreferZone at the proxy level, without hints. At least in kube-proxy this code would be quite straightforward, and would fit in somewhere around here (see the sketch after this list). Implementing with hints would simplify logic for dataplanes, but at the cost of seemingly needless extra bytes (this is not a blocker for me).
  5. Although topology hints technically allow us to support as many heuristics as we want, the existing docs for topology are already pretty complicated, and any additional heuristic will only complicate them further. Unless we're very sure we need additional heuristics, I'd lean towards not adding them. (I do feel like something resembling PreferZone is required.)
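A minimal sketch of the proxy-level filtering floated in point 4, using simplified stand-in types rather than kube-proxy's real ones:

```go
// Sketch of PreferZone at the proxy: use ready endpoints in the local
// zone if there are any; otherwise fall back to all ready endpoints.
package main

import "fmt"

// Endpoint is a simplified stand-in for a proxied endpoint.
type Endpoint struct {
	IP    string
	Zone  string
	Ready bool
}

// filterPreferZone returns the ready endpoints in localZone if any
// exist; otherwise it falls back to cluster-wide ready endpoints.
func filterPreferZone(endpoints []Endpoint, localZone string) []Endpoint {
	var local, all []Endpoint
	for _, ep := range endpoints {
		if !ep.Ready {
			continue
		}
		all = append(all, ep)
		if ep.Zone == localZone {
			local = append(local, ep)
		}
	}
	if len(local) > 0 {
		return local
	}
	return all
}

func main() {
	eps := []Endpoint{
		{IP: "10.0.1.1", Zone: "zone-a", Ready: true},
		{IP: "10.0.2.1", Zone: "zone-b", Ready: true},
	}
	fmt.Println(filterPreferZone(eps, "zone-a")) // only the zone-a endpoint
	fmt.Println(filterPreferZone(eps, "zone-c")) // no zone-c endpoints: all
}
```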

@aojea (Member, Author) commented May 16, 2023

Fair enough. For hints, I still think we need to replace the existing complex heuristic, or add a more naive one that is easier to reason about.

For PreferZone, we then need a new KEP to reserve the annotation value and add graduation criteria (at least two proxies implementing it, or something like that).

WDYT?

@thockin (Member) commented May 16, 2023 via email

@ialidzhikov commented on lines 412 to 415:

> zone-a: 2 endpoints
> zone-b: 1 endpoint
> zone-c: 3 endpoints

What happens if I have 4 endpoints distributed as follows?

zone-a: 1 endpoint
zone-b: 1 endpoint
zone-c: 2 endpoints

Will the hints be removed again as it is not possible to achieve a balanced distribution of endpoints across the zones?

We are looking into a deterministic way to have the hints set. We already use topologySpreadConstraints and nodeAffinity to spread the endpoints evenly across the zones. The endpoints will be spread evenly when the number of endpoints % the number of zones == 0, but when the number of endpoints % the number of zones != 0, some zones will have more replicas than others. During a Deployment/StatefulSet rollout, the even distribution can also be violated.
Currently we have a mutating webhook that mutates the EndpointSlice's hints so that each endpoint's hint is always set to the endpoint's own zone.
Shouldn't we consider adding this strategy as well? Or can you elaborate on whether the newly added heuristic is close to it? I am confused and cannot judge.
Assumptions for this strategy:

  • You already have a guaranteed, approximately balanced distribution of endpoints across zones via scheduling means (topologySpreadConstraints, nodeAffinity).
  • Each endpoint has equivalent capacity.

What the strategy does is maintain each endpoint's own zone as its hint.
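For concreteness, a minimal sketch of that zone-copy strategy against the discovery/v1 types (the function name is illustrative):

```go
// Sketch of the zone-copy strategy: every endpoint's hint is set to
// its own zone, matching what the mutating webhook described above does.
package main

import (
	"fmt"

	discoveryv1 "k8s.io/api/discovery/v1"
)

// copyZoneToHint sets each endpoint's topology hint to its own zone,
// leaving endpoints without a zone untouched.
func copyZoneToHint(endpoints []discoveryv1.Endpoint) {
	for i := range endpoints {
		zone := endpoints[i].Zone
		if zone == nil {
			continue
		}
		endpoints[i].Hints = &discoveryv1.EndpointHints{
			ForZones: []discoveryv1.ForZone{{Name: *zone}},
		}
	}
}

func main() {
	zoneA := "zone-a"
	eps := []discoveryv1.Endpoint{
		{Addresses: []string{"10.0.1.1"}, Zone: &zoneA},
	}
	copyZoneToHint(eps)
	fmt.Println(eps[0].Hints.ForZones[0].Name) // zone-a
}
```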

@aojea (Member, Author) replied:

> Currently we have a mutating webhook that mutates the EndpointSlice's hints so that each endpoint's hint is always set to the endpoint's own zone.

Sorry for the late response. I was supposed to close this PR and send a new one with the simple heuristic you mention, copying the zone to the hint, for the reasons Rob mentions in #4003 (comment) (especially point 2), but then we need to find a good solution to be completely sure we can handle Tim's comments in #4003 (comment).

@ialidzhikov out of curiosity, how is copying the zone directly over to the hints working for you? Is that completely solving your problems?
cc: @robscott

@ialidzhikov replied:

> @ialidzhikov out of curiosity, how is copying the zone directly over to the hints working for you? Is that completely solving your problems?

Some details of our setup are given in kubernetes/kubernetes#110714 (comment). For example, one of our use cases is topology-aware routing for the communication between the kube-apiserver and etcd. We run etcd in 3 zones, 1 replica per zone. The kube-apiserver replicas are spread in a similar way, but may vary between 3 and 4 replicas. The idea is that each kube-apiserver talks to the etcd Pod in its zone.
We also use topology-aware routing for webhook communication, when the kube-apiserver needs to talk to a webhook deployed in the same cluster the kube-apiserver runs in. So, again, the kube-apiserver talks to the webhook endpoint located in its zone (if there is one, of course).

@ialidzhikov commented:

Just to make sure I understand the PreferZone heuristic: it will always maintain the endpoint's zone as its hint, right?

@k8s-ci-robot added the size/L label and removed the size/M label Jun 12, 2023
@aojea (Member, Author) commented Jun 12, 2023

@thockin @robscott @ialidzhikov updated the KEP to include PreferZone and keep it in beta during 1.28, so we can go GA in 1.29 if the feedback is positive. PTAL
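For readers following along, a hedged sketch of how a Service would opt in, assuming the PreferZone value this KEP update reserves for the existing topology-mode annotation (the value is a proposal here, not something already shipped):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// The annotation key already exists for topology-aware routing;
	// "PreferZone" is the new value proposed by this KEP update.
	svc := corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name: "my-service",
			Annotations: map[string]string{
				"service.kubernetes.io/topology-mode": "PreferZone",
			},
		},
	}
	fmt.Println(svc.Annotations["service.kubernetes.io/topology-mode"])
}
```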

Add a new simple heuristic to minimize traffic cost
at the expense of a higher risk of traffic imbalance and
endpoint overload

Signed-off-by: Antonio Ojea <aojea@google.com>
@wojtek-t (Member) commented

LGTM

@thockin (Member) left a comment

Thanks!

/lgtm
/approve

@k8s-ci-robot added the lgtm label Jun 12, 2023
@k8s-ci-robot (Contributor) commented
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label Jun 12, 2023
@k8s-ci-robot merged commit e1bdbca into kubernetes:master Jun 12, 2023
@k8s-ci-robot added this to the v1.28 milestone Jun 12, 2023