L2: use externalips for speaker allocation and remove the need for a local backend in case of traffic policy cluster #976
Conversation
uhm, tests failed 👀
Force-pushed from cfea141 to aa53e58
You’ve hit an area that has needed to be fixed for a long time. We have many open issues and PRs that are related to this. I had to spend some time going back through all the related issues to help decide what to do with this, so I’m going to share some of those notes here. I’d love to sort all of this out in this PR and get a bunch of stuff closed out. You’ve made two key changes here:
1. remove the need for a local endpoint when the external traffic policy is Cluster
2. use the external IP(s) instead of the service name + namespace to pick the announcing node
On 1 -- related PRs include #781, #665, and #613. Related open issues include #646 and #302. One thing I liked about other changes compared to yours is to prefer the old algorithm if it produces at least one candidate node. That way we’ll still put the IP on a node with a local endpoint if possible. Otherwise, we can fall back to what you’ve proposed here if we’re allowed to.

On 2, you called out issue #968. There’s another class of issues related to change #2, which is to deal with shared IPs for services that use the Local traffic policy. See issues #558, #801 and PRs #562, #887, and #922. I do see that your change would resolve #968, but I don’t think we would fully resolve the issues related to sharing an IP between multiple services using the Local traffic policy. Sharing an IP across multiple Local policy services is not always going to work, and we need to get better at detecting that. Perhaps we can defer that to a future PR, because I don’t think what you’re proposing here makes it any worse.

Making a change here will potentially cause many/most/all IPs to get reassigned to a new node. It would be nice if we could sort out these local traffic policy issues at the same time so we impose the disruption in node selection just once. We should also document this in the release notes.

If you can, take a look at these other PRs and issues. I’d like to try to minimize the number of times we release a change to node selection. If you’d like, we could split out change #1 (with my suggested updates) into a separate PR to merge first.
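To make the suggested fallback concrete, here is a minimal sketch under simplified, assumed types (not MetalLB's actual code): prefer nodes that host a local endpoint, and only fall back to all speaker nodes when the external traffic policy is Cluster.

```go
// usableNodes prefers nodes that host a ready endpoint for the service; only
// when there are none, and the external traffic policy is Cluster, does it
// fall back to every node running a speaker.
func usableNodes(policy string, speakerNodes []string, hasLocalEndpoint map[string]bool) []string {
	var local []string
	for _, n := range speakerNodes {
		if hasLocalEndpoint[n] {
			local = append(local, n)
		}
	}
	if len(local) > 0 {
		return local // old behaviour: a node with a local endpoint wins
	}
	if policy == "Cluster" {
		return speakerNodes // proposed fallback: any speaker node may announce
	}
	return nil // Local policy with no local endpoint: nothing to announce
}
```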
Oh, sounds cool. But please keep in mind that if there is more than one backend pod, there's no guarantee at all that the traffic will be directed to the local pod (unless the traffic policy is Local).
Well, if the policy is local AND we have two pods on two different nodes assigned to two different services that must share the same ip, that's mechanically impossible, so yes the only thing that we can do is to throw an error in that case.
Do you mean after an upgrade?
I will!
Right. I was thinking that at least some percentage of the traffic would stay local, which is a little better than none of it.
Yes
I took a look at the others:
That said (and having looked at the code again):
Long story short, if we want to share the same IP between two services with different backend selectors, the second service won't get its IP. If the set of backends is the same, the winner node in each speaker selection will be the same, assuming we don't use the service's name but the service's IP(s) / the sharing key as the ordering criteria. #801 (and its related PR) handles a corner case. I think the conclusion is that the current PR would address all the issues / PRs you mentioned except
Regarding this, I re-checked the issues plus the old PRs, and it can't be any different.
We could make the logic more complex, collecting all the endpoints belonging to all the services sharing the same IP as the one currently being handled, but I am not sure it's worth the effort. If users care about the traffic being local, they will use the Local traffic policy anyhow. TL;DR: I think it makes sense to keep this as it is, and both changes are needed.
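To illustrate the ordering change being discussed, here is a minimal, hypothetical sketch (not the PR's actual code) of electing the announcing node from the node name plus the load balancer IP / sharing key, so two services sharing an IP land on the same node:

```go
import (
	"crypto/sha256"
	"encoding/hex"
)

// electNode picks the announcing node for a given load balancer IP by hashing
// "node name + IP" and choosing the lowest digest. Because the key no longer
// contains the service name, two services that share the same IP (and the
// same candidate nodes) deterministically elect the same winner.
func electNode(lbIP string, nodes []string) string {
	winner, lowest := "", ""
	for _, n := range nodes {
		h := sha256.Sum256([]byte(n + "#" + lbIP))
		digest := hex.EncodeToString(h[:])
		if winner == "" || digest < lowest {
			winner, lowest = n, digest
		}
	}
	return winner
}
```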
Thanks a lot for putting the time into looking over all of those issues and PRs. I'm OK with proceeding with this PR once you update the related tests. Can you also update the PR / commit message to reflect all of the things it will close?
How about mentioning the things we close in the PR only? I always thought of the commit message as explaining the rationale, but I can also mention the issues there if you think it's worth it.
PR is fine
Force-pushed from 2e6bb50 to 6ada084
speaker/layer2_controller.go
for _, i := range svc.Status.LoadBalancer.Ingress {
	ss = append(ss, i.IP)
}
return nodeName + "#" + strings.Join(ss, "#")
you are adding nodeName twice
Oh, thanks! Hope it doesn't shuffle the tests again 😅
You might want to share an IPv4 address but have 2 different IPv6 addresses between 2 services. The sort key should be the
Hi, I just saw this and thought I would chime in with the reason #801 was written. Gardener is a hyperscaler provider for Kubernetes. It uses what are called "Cloud Providers", a software abstraction to provide worker nodes. The provider I am using creates Kubevirt VMs for the nodes. Kubevirt can schedule child ("shoot") cluster worker nodes to be created anywhere in the root ("seed") cluster, which likely spans several hardware nodes. As these child clusters are created, the root cluster scheduler will typically distribute these worker nodes all across the root cluster to distribute load, regardless of the child cluster grouping.

Closer to the problem, let's say a child cluster wants to pull an address from a MetalLB pool. MetalLB is only running on the root cluster, not the child clusters. To expose services in child clusters, they must be exposed through the root cluster. This happens by creating a service proxy in the root that selects the services on the child by selector. In order to get a public address, the typical MLB annotations are propagated up to the root cluster proxy in kubevirt/cloud-provider-kubevirt#16. MLB sees the annotation on a service and does its normal thing.

The actual problem #801 addresses is in kubevirt/cloud-provider-kubevirt#18:
Does this patch solve the problem? If not, I'd like to help upgrade the patch to do so. I can't see how it does though.
This is an interesting point (and currently masked by the fact that we don't handle dual stack yet). I assumed that two services with the same IPv4 would also have had the same IPv6, but you are right, that's not always the case. I'd rather try to change the interface and the logic to focus on the IP (because at the end of the day the shouldAnnounce logic is applied per load balancer IP, see line 272 in 31f4de0).

Does it make sense?
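To make the per-IP idea concrete, here is a rough sketch that reuses the hypothetical electNode helper from the earlier comment; the function and parameter names are made up, not MetalLB's real API, and only the loop over Status.LoadBalancer.Ingress mirrors the snippet under review.

```go
import v1 "k8s.io/api/core/v1"

// announceIPs returns the ingress IPs of svc that this node should announce.
// Deciding per IP instead of per service means a shared IPv4 and two distinct
// IPv6 addresses can each get their own winner node.
func announceIPs(svc *v1.Service, myNode string, candidates []string) []string {
	var toAnnounce []string
	for _, ingress := range svc.Status.LoadBalancer.Ingress {
		if ingress.IP == "" {
			continue
		}
		if electNode(ingress.IP, candidates) == myNode {
			toAnnounce = append(toAnnounce, ingress.IP)
		}
	}
	return toAnnounce
}
```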
I saw your PR when wrapping my head around the open work, and I think it's orthogonal to what's being addressed here, which is the other way around: remove the constraint of having a local pod in case of traffic policy cluster. I did a recap in #976 (comment), but I'd keep the two things separate instead of collapsing that other logic here, which requires a separate discussion.
Force-pushed from efcc28b to a0775d9
Thanks @fedepaol, that's fine for me if your proposed changes don't block a solution to #801. It sounds like you are saying that's not the case. Merging them at the same time would help ensure that doesn't happen. I've been waiting six months for some actionable feedback on that issue and I'm not sure why it never happened.
This is a half-baked commit adding support for Podman-based kind environments. It continues assuming the container runtime is docker, but checks behind the scenes whether docker is a wrapper for podman, i.e. a symlink to podman, for cases where commands and/or outputs differ. Users on Red Hat and Debian family distributions can either install the 'podman-docker' package or create a symbolic link docker -> podman. For more information on how user:group ownership is mapped, please refer to https://www.redhat.com/sysadmin/rootless-podman-makes-sense Signed-off-by: Carlos Goncalves <cgoncalves@redhat.com>
Merge-with-parent does not seem to work when the main branch is not well aligned with the branch the PR is coming from. Here we roll it back. Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
Updating the release notes to reflect the new L2 announcement logic based on the load balancer IP, and the change of behaviour in case of traffic policy cluster. Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
Having a way to skip tests is required when we define ipv4 / ipv6 only tests. Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
Now that the tests are able to configure dynamically, having two separate lanes for the two protocols is not needed anymore. On top of that, we add skips for IPv6 tests on the IPv4 lane and vice versa. Finally, BGP is skipped in the IPv6 lane. Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
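For reference, the per-family skip can be as small as this sketch (the helper and parameter names are made up; only ginkgo.Skip is the real API):

```go
import "github.com/onsi/ginkgo"

// skipIfNotFamily skips the current spec when the lane under test does not
// provide the requested IP family, e.g. an IPv6-only test on the IPv4 lane.
func skipIfNotFamily(laneFamily, wanted string) {
	if laneFamily != wanted {
		ginkgo.Skip("test requires IP family " + wanted + ", lane is " + laneFamily)
	}
}
```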
Instead of failing after the first attempt, we validate the behaviour inside an eventually loop in order to avoid flakes due to slowness in CI. Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
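Roughly, the assertion gets wrapped like this (a sketch using Gomega's Eventually; the timeout and interval values are made up):

```go
import (
	"time"

	"github.com/onsi/gomega"
)

// eventuallyPasses retries the validation instead of failing on the first
// attempt, so a slow CI node does not turn the test into a flake.
func eventuallyPasses(check func() error) {
	gomega.Eventually(check, 2*time.Minute, 5*time.Second).ShouldNot(gomega.HaveOccurred())
}
```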
lgtm ... just waiting for CI to finish
Addresses the doc impact for SDN-2480. The metallb/metallb/pull/976 PR seems related, in the sense that limiting the `speaker` pods and an ETP of `cluster` seem to jive well together. Add link to "Placing pods on specific nodes using node selectors" to address Mohamed's feedback. Address comments from Federico. Address comments from Arti.
Fixes:
Obsoletes the following PRs:
This started with adding a test to reproduce the issue above; after investigating it, the fix ended up being to remove the need for endpoint locality in case of traffic policy of type cluster, plus changing the way we determine whether a speaker should advertise a service, using the external IP(s) instead of the service name + namespace for sorting.
Note: I haven't aligned the unit tests (yet); will do if these changes make sense to the reviewers.