
L2: use externalips for speaker allocation and remove the need for a local backend in case of traffic policy cluster #976

Merged — 11 commits merged into metallb:main from testl2issue on Oct 27, 2021

Conversation

@fedepaol (Member) commented Sep 30, 2021

Fixes:

Obsoletes the following PRs:

This started with adding a test to reproduce the issue above. After investigating it, the fix ended up being twofold: removing the need for local endpoints in case of traffic policy of type cluster, and changing the way we determine whether a speaker should advertise a service, using the external IP(s) instead of the service name + namespace as the sort key.

Note: I haven't aligned the unit tests (yet); I will do that if these changes make sense to the reviewers.

@fedepaol fedepaol changed the title Testl2issue L2: use externalips for speaker allocation and remove the need for a local backend in case of traffic policy cluster Sep 30, 2021
@fedepaol (Member Author)

uhm, tests failed 👀

@russellb (Collaborator) commented Oct 8, 2021

You’ve hit an area that has needed to be fixed for a long time. We have many open issues and PRs that are related to this. I had to spend some time going back through all the related issues to help decide what to do with this, so I’m going to share some of those notes here.

I’d love to sort all of this out in this PR and get a bunch of stuff closed out.

You’ve made two key changes here:

  1. You changed the way we calculate the set of candidate nodes: when the traffic policy is cluster, all nodes become candidates, instead of using only the nodes where a speaker is running alongside an Endpoint backing that Service.
  2. You changed the sort key of the candidate node list to be the node name + load balancer IP addresses instead of node name + service name.

On 1 -- related PRs include #781, #665, and #613. Related open issues include #646 and #302.

One thing I liked about other changes compared to yours is to prefer the old algorithm if it produces at least one candidate node. That way we’ll still put the IP on a node with a local endpoint if possible. Otherwise, we can fall back to what you’ve proposed here if we’re allowed to (with the cluster traffic policy). If you agree, updating to incorporate this behavior is my first request.
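A minimal sketch of the fallback described above (the function name, signature, and behavior are my illustration of the suggestion, not MetalLB's actual code):

```go
package main

import "fmt"

// candidateNodes prefers the nodes hosting a ready endpoint; only when
// that set is empty and the traffic policy is Cluster does it fall back
// to all nodes. Local policy must never announce from a node without a
// local backend, so it never falls back.
func candidateNodes(allNodes, nodesWithEndpoints []string, policyIsLocal bool) []string {
	if len(nodesWithEndpoints) > 0 || policyIsLocal {
		return nodesWithEndpoints
	}
	return allNodes
}

func main() {
	// A node with a local endpoint exists: prefer it.
	fmt.Println(candidateNodes([]string{"a", "b", "c"}, []string{"b"}, false))
	// No local endpoints, Cluster policy: fall back to every node.
	fmt.Println(candidateNodes([]string{"a", "b", "c"}, nil, false))
}
```

With this shape, the IP still lands on a node with a local endpoint whenever one exists, and only Cluster-policy services take the wider fallback.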

On 2, you called out issue #968. There’s another class of issues related to change #2, which is to deal with shared IPs for services that use the Local traffic policy. See issues #558, #801 and PRs #562, #887, and #922.

I do see that your change would resolve #968, but I don’t think we would fully resolve the issues related to sharing an IP between multiple services using the Local traffic policy. Sharing an IP across multiple Local policy services is not always going to work, and we need to get better at detecting that. Perhaps we can defer that to a future PR, because I don’t think what you’re proposing here makes it any worse.

Making a change here will potentially cause many/most/all IPs to get reassigned to a new node. It would be nice if we could sort out these local traffic policy issues at the same time so we impose the disruption in node selection just once. We should also document this in the release notes.

If you can, take a look at these other PRs and issues. I’d like to try to minimize the number of times we release a change to node selection. If you’d like, we could split out change #1 (with my suggested updates) into a separate PR to merge first.

@fedepaol (Member Author)

> You’ve hit an area that has needed to be fixed for a long time. We have many open issues and PRs that are related to this. I had to spend some time going back through all the related issues to help decide what to do with this, so I’m going to share some of those notes here.
>
> I’d love to sort all of this out in this PR and get a bunch of stuff closed out.
>
> You’ve made two key changes here:
>
> 1. You changed the way we calculate the set of candidate nodes to be all nodes when the traffic policy is `cluster` instead of using only the nodes where a speaker is running alongside an Endpoint backing that Service.
>
> 2. You changed the sort key of the candidate node list to be the node name + load balancer IP addresses instead of node name + service name.
>
> On 1 -- related PRs include #781, #665, and #613. Related open issues include #646 and #302.
>
> One thing I liked about other changes compared to yours is to prefer the old algorithm if it produces at least one candidate node. That way we’ll still put the IP on a node with a local endpoint if possible. Otherwise, we can fall back to what you’ve proposed here if we’re allowed to (with the cluster traffic policy). If you agree, updating to incorporate this behavior is my first request.

Oh, sounds cool. But please keep in mind that if there is more than one backend pod, there's no guarantee at all that the traffic will be directed to the local pod (unless the traffic policy is local).

> On 2, you called out issue #968. There’s another class of issues related to change #2, which is to deal with shared IPs for services that use the Local traffic policy. See issues #558, #801 and PRs #562, #887, and #922.
>
> I do see that your change would resolve #968, but I don’t think we would fully resolve the issues related to sharing an IP between multiple services using the Local traffic policy. Sharing an IP across multiple Local policy services is not always going to work, and we need to get better at detecting that. Perhaps we can defer that to a future PR, because I don’t think what you’re proposing here makes it any worse.

Well, if the policy is local AND we have two pods on two different nodes assigned to two different services that must share the same IP, that's mechanically impossible, so yes, the only thing we can do is throw an error in that case.

> Making a change here will potentially cause many/most/all IPs to get reassigned to a new node. It would be nice if we could sort out these local traffic policy issues at the same time so we impose the disruption in node selection just once. We should also document this in the release notes.

Do you mean after an upgrade?

> If you can, take a look at these other PRs and issues. I’d like to try to minimize the number of times we release a change to node selection. If you’d like, we could split out change #1 (with my suggested updates) into a separate PR to merge first.

I will!

@russellb (Collaborator)

> One thing I liked about other changes compared to yours is to prefer the old algorithm if it produces at least one candidate node. That way we’ll still put the IP on a node with a local endpoint if possible. Otherwise, we can fall back to what you’ve proposed here if we’re allowed to (with the cluster traffic policy). If you agree, updating to incorporate this behavior is my first request.
>
> Oh, sounds cool. But please keep in mind that if there is more than one backend pod, there's no guarantee at all that the traffic will be directed to the local pod (unless the traffic policy is local).

Right. I was thinking that at least some percentage of the traffic would stay local, which is a little better than none of it.

> Making a change here will potentially cause many/most/all IPs to get reassigned to a new node. It would be nice if we could sort out these local traffic policy issues at the same time so we impose the disruption in node selection just once. We should also document this in the release notes.
>
> Do you mean after an upgrade?

Yes

@fedepaol (Member Author)

> On 2, you called out issue #968. There’s another class of issues related to change #2, which is to deal with shared IPs for services that use the Local traffic policy. See issues #558, #801 and PRs #562, #887, and #922.

I took a look at the others.

That said (and having looked at the code again):

  • having a shared IP between services with local policy means that they have the same endpoints (see `func BackendKey(svc *v1.Service) string` and `if existingSK := a.sharingKeyForIP[ip.String()]; existingSK != nil`)
  • this makes it impossible to have two services, with one of them being local, with a different set of endpoints (and the risk of announcing one from one node and the other from another node)
  • this means that, assuming a stable ordering function of the nodes based on active speakers / endpoints in case of a local traffic policy, all the speakers will agree on which node should advertise a given service (provided that we use the IP as the "service identifier")
  • if such a conflict exists, the second service won't get its IP and will remain in pending

Long story short: if we want to share the same IP between two services with different backend selectors, the second service won't get its IP. If the set of backends is the same, the winning node in each speaker's selection will be the same, assuming we use the service's IP(s) / the sharing key as the ordering criterion instead of the service's name.

#801 (and its related PR) handles a corner case.
MetalLB runs with the simplification that, if two services have local policy and want to share the same IP address, they MUST have the same set of pods as backend.
My understanding is that #887 tries to check whether the set of backend pods is the same, regardless of whether they are chosen using the same selector. This would not change the considerations above; it would just make the logic a bit smarter.

I think the conclusion is: the current PR would address all the issues / PRs you mentioned except #801, which would not affect the node selection behaviour.
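The invariant discussed above — two Local-policy services sharing an IP must resolve to the same backends — can be sketched like this (an assumed helper comparing endpoint address multisets, not the project's actual code):

```go
package main

import "fmt"

// sameBackends reports whether two services resolve to the same multiset
// of endpoint addresses. If they don't, sharing an IP with Local policy
// is mechanically impossible and the allocator should refuse (leaving
// the second service pending).
func sameBackends(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	count := map[string]int{}
	for _, ep := range a {
		count[ep]++
	}
	for _, ep := range b {
		count[ep]--
	}
	for _, c := range count {
		if c != 0 {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(sameBackends([]string{"10.1.1.1", "10.1.1.2"}, []string{"10.1.1.2", "10.1.1.1"}))
	fmt.Println(sameBackends([]string{"10.1.1.1"}, []string{"10.1.1.3"}))
}
```

Using counts rather than a plain set keeps the check order-independent while still catching duplicated addresses.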

@fedepaol (Member Author) commented Oct 12, 2021

> On 1 -- related PRs include #781, #665, and #613. Related open issues include #646 and #302.
>
> One thing I liked about other changes compared to yours is to prefer the old algorithm if it produces at least one candidate node. That way we’ll still put the IP on a node with a local endpoint if possible. Otherwise, we can fall back to what you’ve proposed here if we’re allowed to (with the cluster traffic policy). If you agree, updating to incorporate this behavior is my first request.

Regarding this, I re-checked the issues plus the old PRs.
All the old PRs behave like mine: they take either the list of all nodes OR the list of nodes with backends, depending on the traffic policy.
See https://github.com/metallb/metallb/pull/562/files#diff-2c4fb432cbb842626bb58d73a6d40bc04718aeaca625b44d1711e7e174548ec3R52 and https://github.com/metallb/metallb/pull/922/files#diff-2c4fb432cbb842626bb58d73a6d40bc04718aeaca625b44d1711e7e174548ec3R47

And it can't be any different.
The scenario we want to address here is two different services with traffic policy cluster and different sets of backends.
So, the worst case scenario is the one with the test I added:

  • same IP

  • one service with one backend on one node, the other service with its backend on another node

  • when a speaker produces the list of possible nodes, it only has the list of backends of the current service available, which will be different for the two services (without the current change)

  • we need to change the sorting key, so that the two services sharing the same IP end up sorted in the same way

We could make the logic more complex, collecting all the endpoints belonging to all the services sharing the same IP as the one currently being handled, but I am not sure it's worth the effort. If users care about the traffic being local, they will use the Local traffic policy anyhow.
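The sort-key change described above can be sketched as follows (illustrative Go: function names, the `#` separator, and the use of a hash are my assumptions about the approach, not MetalLB's exact code):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
	"strings"
)

// sortKey ranks nodes by a hash of node name + the service's load
// balancer IPs, instead of node name + service name. Two services
// sharing an IP therefore produce identical rankings, so every speaker
// independently elects the same announcing node.
func sortKey(nodeName string, lbIPs []string) string {
	h := sha256.Sum256([]byte(nodeName + "#" + strings.Join(lbIPs, "#")))
	return fmt.Sprintf("%x", h)
}

func rankNodes(nodes, lbIPs []string) []string {
	out := append([]string(nil), nodes...)
	sort.Slice(out, func(i, j int) bool {
		return sortKey(out[i], lbIPs) < sortKey(out[j], lbIPs)
	})
	return out
}

func main() {
	// Two services sharing 192.168.10.0: their names never enter the
	// key, so both get the same ranking, and out[0] is the winner.
	fmt.Println(rankNodes([]string{"node-a", "node-b", "node-c"}, []string{"192.168.10.0"}))
}
```

Because the service name is no longer part of the key, the two services in the worst-case scenario above agree on the winner even though their endpoint sets differ.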

TL;DR: I think it makes sense to keep this as it is, and both changes are needed.

@russellb (Collaborator)

Thanks a lot for putting the time into looking over all of those issues and PRs. I'm OK with proceeding with this PR once you update the related tests. Can you also update the PR / commit message to reflect all of the things it will close?

@fedepaol (Member Author)

> Thanks a lot for putting the time into looking over all of those issues and PRs. I'm OK with proceeding with this PR once you update the related tests. Can you also update the PR / commit message to reflect all of the things it will close?

How about mentioning the things we close in the PR description only? I always thought of the commit message as explaining the rationale, but I can also mention the issues if you think it's worth it.

@russellb (Collaborator)

> Thanks a lot for putting the time into looking over all of those issues and PRs. I'm OK with proceeding with this PR once you update the related tests. Can you also update the PR / commit message to reflect all of the things it will close?
>
> How about mentioning the things we close in the PR description only? I always thought of the commit message as explaining the rationale, but I can also mention the issues if you think it's worth it.

PR is fine

@fedepaol fedepaol force-pushed the testl2issue branch 2 times, most recently from 2e6bb50 to 6ada084 Compare October 13, 2021 17:16
```go
for _, i := range svc.Status.LoadBalancer.Ingress {
	ss = append(ss, i.IP)
}
return nodeName + "#" + strings.Join(ss, "#")
```
A Contributor commented on this diff:

you are adding nodeName twice

The Member Author replied:

Oh, thanks! Hope it doesn't shuffle the tests again 😅

@champtar (Contributor) commented Oct 13, 2021

You might want to share an IPv4 address but have 2 different IPv6 addresses between 2 services. The sort key should be the metallb.universe.tf/allow-shared-ip annotation if not empty, else the svc name (like right now).

@briantopping (Contributor) commented Oct 14, 2021

Hi, I just saw this and thought I would chime in with the reason #801 was written. Gardener is a hyperscaler provider for Kubernetes. It uses what are called "Cloud Providers", a software abstraction to provide worker nodes. The provider I am using creates Kubevirt VMs for the nodes. Kubevirt can schedule child ("shoot") cluster worker nodes to be created anywhere in the root ("seed") cluster, which likely spans several hardware nodes. As these child clusters are created, the root cluster scheduler will typically distribute these worker nodes all across the root cluster to distribute load, regardless of the child cluster grouping.

Closer to the problem, let's say a child cluster wants to pull an address from a MetalLB pool. MetalLB is only running on the root cluster, not the child clusters. To expose services in child clusters, they must be exposed through the root cluster. This happens by creating a service proxy in the root that selects the services on the child by selector.

In order to get a public address, the typical MLB annotations are propagated up to the root cluster proxy in kubevirt/cloud-provider-kubevirt#16. MLB sees the annotation on a service and does its normal thing. The actual problem #801 addresses is in kubevirt/cloud-provider-kubevirt#18:

This behavior in MetalLB in Local mode makes sense: a single IP address cannot route to multiple service nodes. At the same time, it doesn't seem that fixed service selectors referencing dynamic pod labels should be constrained, so this seems to be a misfeature in MetalLB.

Does this patch solve the problem? If not, I'd like to help upgrade the patch to do so. I can't see how it does though.

@fedepaol (Member Author)

> You might want to share an IPv4 address but have 2 different IPv6 addresses between 2 services. The sort key should be the metallb.universe.tf/allow-shared-ip annotation if not empty, else the svc name (like right now).

This is an interesting point (and currently hidden by the fact that we don't handle dual stack yet). I assumed that two services with the same IPv4 address would also have the same IPv6 address, but you are right, that's not always the case.

I'd rather try to change the interface and the logic to focus on the IP, because at the end of the day the shouldAnnounce logic is applied per load balancer IP (`if deleteReason := handler.ShouldAnnounce(l, name, svc, eps); deleteReason != "" {`), so we avoid having crowded announcers in those cases where the sharing key is the same BUT the IPs are different.
Does it make sense?
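A per-IP decision could look roughly like this (the signature and hashing scheme are assumptions for illustration, not MetalLB's actual `ShouldAnnounce`):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// shouldAnnounce decides for one load balancer IP, not once per service:
// each speaker ranks the candidate nodes for that IP and announces only
// if its own node wins. Services sharing a key but not every IP thus
// don't all crowd onto the same node.
func shouldAnnounce(myNode string, candidates []string, lbIP string) bool {
	winner, winnerHash := "", ""
	for _, n := range candidates {
		h := fmt.Sprintf("%x", sha256.Sum256([]byte(n+"#"+lbIP)))
		if winner == "" || h < winnerHash {
			winner, winnerHash = n, h
		}
	}
	return winner == myNode
}

func main() {
	nodes := []string{"node-a", "node-b", "node-c"}
	// Each speaker evaluates independently; exactly one says true.
	for _, n := range nodes {
		fmt.Println(n, shouldAnnounce(n, nodes, "10.0.0.5"))
	}
}
```

Since the key depends only on node name and IP, all speakers agree on the winner without any coordination.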

@fedepaol (Member Author)

> Does this patch solve the problem? If not, I'd like to help upgrade the patch to do so. I can't see how it does though.

I saw your PR while wrapping my head around the open work, and I think it's orthogonal to what's being addressed here, which is the other way around: removing the constraint of having a local pod in case of traffic policy cluster.

I did a recap in #976 (comment), but I'd keep the two things separate instead of collapsing that other logic here; it requires a separate discussion.

@fedepaol fedepaol force-pushed the testl2issue branch 3 times, most recently from efcc28b to a0775d9 Compare October 14, 2021 13:40
@briantopping (Contributor) commented Oct 14, 2021

Thanks @fedepaol, that's fine for me if your proposed changes don't block a solution to #801. It sounds like you are saying that's not the case. Merging them at the same time would help ensure that doesn't happen. I've been waiting six months for some actionable feedback on that issue and I'm not sure why it never happened.

cgoncalves and others added 6 commits October 27, 2021 17:58
This is a half-baked commit adding support for Podman-based kind
environments. It continues assuming the container runtime is docker, but
checks behind the scenes whether docker is a wrapper for podman, i.e. a
symlink to podman, for cases where commands and/or outputs differ.

Users on Red Hat and Debian family distributions can either install
package 'podman-docker' or create a symbolic link docker -> podman.

For more information of how user:group ownership is mapped, please refer
to https://www.redhat.com/sysadmin/rootless-podman-makes-sense

Signed-off-by: Carlos Goncalves <cgoncalves@redhat.com>
Merge-with-parent seems not to work when the main branch is not
well aligned with the branch the PR is coming from. Here we roll it
back.

Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
Updating the release notes to reflect the new L2 announcement logic
based on the load balancer IP, and the change of behaviour in case of
traffic policy cluster.

Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
Having a way to skip tests is required when we define ipv4 / ipv6 only
tests.

Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
Now that the tests are able to configure themselves dynamically, having
two separate lanes for the two protocols is not needed anymore.

On top of that, we add skips for IPv6 tests on the IPv4 lane and
vice versa.

Finally, BGP is skipped in the IPv6 lane.

Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
Instead of failing after the first attempt, we validate the behaviour
inside an `Eventually` loop in order to avoid flakes due to slowness in
CI.

Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
@russellb (Collaborator) left a review:

lgtm ... just waiting for CI to finish

@russellb russellb merged commit ccc9122 into metallb:main Oct 27, 2021
@russellb russellb added this to the v0.11.0 milestone Oct 27, 2021
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/openshift-docs that referenced this pull request Dec 17, 2021
Addresses the doc impact for SDN-2480.

The metallb/metallb/pull/976 PR seems related,
in the sense that limiting the `speaker` pods
and an ETP of `cluster` seem to work well together.

Add a link to "Placing pods on specific nodes using
node selectors" to address Mohamed's feedback.

Address comments from Federico.

Address comments from Arti.